J. Chem. Theory Comput. 2023 Jun 8;19(12):3509–3525. doi: 10.1021/acs.jctc.3c00279

Lifelong Machine Learning Potentials

Marco Eckhoff 1,*, Markus Reiher 1,*
PMCID: PMC10308836  PMID: 37288932

Abstract


Machine learning potentials (MLPs) trained on accurate quantum chemical data can retain high accuracy while incurring only a small computational cost. On the downside, they need to be trained for each individual system. In recent years, a vast number of MLPs have been trained from scratch because learning additional data typically requires retraining on all data to avoid forgetting previously acquired knowledge. Additionally, most common structural descriptors of MLPs cannot efficiently represent a large number of different chemical elements. In this work, we tackle these problems by introducing element-embracing atom-centered symmetry functions (eeACSFs), which combine structural properties and element information from the periodic table. These eeACSFs are key for our development of a lifelong machine learning potential (lMLP). Uncertainty quantification can be exploited to move beyond a fixed, pretrained MLP toward a continuously adapting lMLP because a predefined level of accuracy can be ensured. To extend the applicability of an lMLP to new systems, we apply continual learning strategies to enable autonomous and on-the-fly training on a continuous stream of new data. For the training of deep neural networks, we propose the continual resilient (CoRe) optimizer and incremental learning strategies relying on the rehearsal of data, the regularization of parameters, and the architecture of the model.

1. Introduction

For the prediction and understanding of the properties and reactivity of atomistic systems, knowledge of the potential energy surface is indispensable. The potential energy surface is given by the electronic energy as a function of the nuclear positions and can be obtained from electronic structure methods such as density functional theory (DFT) or approximate wave function theories.1,2 These methods are generally applicable and achieve accurate results for many systems, but even state-of-the-art approaches still lead to high computational demands in extended simulations.3−5 Instead of explicitly calculating the electronic energy for each nuclear conformation in the Born–Oppenheimer approximation, empirical force fields rely on approximate (simple) analytical expressions of the potential energy surface. Hence, they avoid the explicit integration of quantum mechanical equations and thereby enable efficient atomistic simulations. However, force field expressions are of limited accuracy, and they are typically not universal for all chemical systems.6−9

By contrast, machine learning potentials (MLPs)10−15 can preserve the high accuracy of electronic structure methods at a low computational cost comparable to that of force fields. MLPs are not based on physical approximations but rely on very flexible mathematical expressions to obtain an analytical potential energy surface. The parameters of these expressions are trained on electronic structure reference data including chemical structures and their respective energies and atomic forces. While first-generation MLPs were only applicable to low-dimensional systems,16 the introduction of second-generation MLPs in the form of high-dimensional neural network potentials (HDNNPs)17−19 has facilitated atomistic simulations of systems with tens of thousands of atoms on nanosecond time scales with an accuracy similar to that of first-principles methods. In recent years, various other types of MLPs have been proposed, such as neural network potentials,20−23 Gaussian approximation potentials,24 moment tensor potentials,25 and many more. Second-generation MLPs rely on the locality of the major part of atomic interactions.19 However, third-generation MLPs add a description of long-range interactions,24,26 and fourth-generation MLPs even include nonlocal interactions27,28 beyond the applied cutoff sphere for atomic interactions.

Reference atomistic structures need to represent sufficiently well the conformation space to be explored in subsequent simulations. The reason is that MLPs are designed for interpolation of learned atomic interactions, while their extrapolation capability far beyond the trained atomic environment space is limited.12,18 For example, an MLP trained on water will fail for gaseous dihydrogen and dioxygen. A systematic construction approach is not feasible for most systems because of the enormous conformation space that is typically accessible for an atomistic system, especially when one considers various large-amplitude motions or chemical reactions. To obtain reasonable reference conformations, iterative active learning protocols have been devised.29−33 In such protocols, preliminary MLPs are validated in atomistic simulations to identify important yet unknown conformations. These conformations are recalculated with the reference method and then added to the reference data to train an improved MLP. However, it is difficult to ensure that a reference data set is complete even for a specific application because some conformations may occur only very rarely. Also, different scientific targets and purposes may, for the same system, highlight different regions of its conformation space and, therefore, require different reference conformations.

Current MLPs learn in isolation, i.e., a model is trained on predefined data and is subsequently applied in simulations without exploiting the knowledge already learned for future MLPs. Despite their general functional form, MLPs can therefore only be applied reliably to a specific conformation space—which can span various chemical formulas and configurations—and their transferability significantly beyond the learned atomic environment space is limited.8,9 As a consequence, a vast number of single-purpose MLPs have been constructed.34−42 Recent work has attempted to overcome this limitation by leveraging very large and broad training data sets,43,44 but future applications may still require different reference data.

To achieve a general-purpose solution, MLPs need to be adaptable to learn new chemical systems based on already acquired knowledge. Obviously, the training data sets of present-day MLPs can simply be extended to include additional systems in active learning procedures. However, the training process on new or extended data sets is often started from scratch with randomly initialized MLP parameters. Even if previously obtained parameters are used, the training typically does not differentiate between new and old data points. Hence, all data points need to be trained simultaneously from a stationary batch of data every time new data are added. Consequently, learning larger data sets becomes increasingly inefficient as the amount of repeated training of old data grows. This trend is opposite to how the human brain operates, which efficiently learns additional knowledge based on already acquired expertise.45−47

This ability of incremental learning is referred to as lifelong learning or continual learning in the field of machine learning,46,47 and it is also desirable for MLPs. Lifelong machine learning means that the model continually acquires and fine-tunes knowledge. For MLPs it will be a single-incremental-task scenario because new chemical structures are sequentially added, but the energy needs to be distinguished among all encountered structures.48 Lifelong learning is a challenge for machine learning because incremental learning from a continuous stream of information typically results in catastrophic forgetting and interference.49,50 The MLP parameters are adapted to the new data, while the memory about the old data fades away.

In principle, building on prior knowledge could even reduce the necessary amount of training data and more complex tasks could be learned more efficiently. According to complementary learning systems theory,47 learning is based on episodic memory and generalization. While the episodic memory learns arbitrary information fast, the generalization is a slow learning process to structure the knowledge. Therefore, the episodic memory is similar to the training data set, while the MLP parameters are trained to generalize the data. Consequently, the parameters need to be plastic to integrate new knowledge, but, at the same time, they need to be stable to avoid forgetting.45−47 In recent years, several algorithms have been developed to reach this stability–plasticity balance and to mitigate catastrophic forgetting; examples are ExStream,51 gradient episodic memory,52 synaptic intelligence,53 AR1,48 progressive neural networks,54 and growing dual-memory.55 These approaches rely on rehearsal of selected training data, regularization of machine learning parameters, and/or the architecture of the machine learning model.

The situation becomes more complex for a general-purpose or universal MLP, as it would be further hindered by the unfavorable scaling of the computational cost with the number of chemical elements. This feature plagues most common MLP descriptors9,56,57 such as atom-centered symmetry functions (ACSFs),58 smooth overlap of atomic positions,59 and many more60,61—with the exception of graph representations.62−64 The reason is a rapid increase in the number of structural descriptors since these are constructed independently of one another for each element combination. Some studies have attempted to overcome this problem by combining element and structure information in the structural descriptor through the inclusion of a single additional element-dependent weight.65−67 Despite some successful applications, employing, for example, the atomic number as in weighted ACSFs67 is restricted to elements of similar atomic numbers. Otherwise, the contributions of light atoms, such as H, are obscured by contributions of heavy atoms, such as iodine (e.g., a factor of 53 · 53 in angular contributions).

In this work, we exploit additional weight terms based on different properties for separate ACSFs that can reduce bias toward specific elements and increase the discriminability of the descriptor representation for different conformations. Chemical intuition is based on regularities in molecular structure and on trends in the periodic table. To exploit these trends, we use information about the period and the group,68 or more precisely, the position in the s-, p-, d-, and f-block instead of the atomic number. For instance, all halogens lead to a similar type of bonding, while the bond length increases in higher periods.

Furthermore, we propose that uncertainty quantification can enable the application of adaptable MLP parameters. Adaptable parameters require that the reliability of an MLP is probed on the fly to avoid full validation of each MLP training extension. As long as the error of the MLP is below a tolerance in production calculations, small variations in the learning process will not significantly affect the results. Moreover, MLPs with uncertainty quantification are able to report warnings in case of unknown conformations, even if these are within the general training space. Therefore, uncertainty quantification can be employed to estimate the transferability of the MLP, with low uncertainty meaning high transferability. To obtain uncertainty quantification, for example, for HDNNPs, some studies have applied an ensemble approach.69−73 A small ensemble of HDNNPs is then trained differently and independently on the same reference data. The deviations between the predictions of the individual HDNNPs can be employed as a proxy for the prediction error due to the high flexibility of the neural networks. Moreover, taking the average of the ensemble improves the accuracy of the representation. We note that such uncertainty measures should accompany any sort of modeling approach in atomistic simulations,74 although their implementation has only started recently.

Consequently, this work introduces element-embracing atom-centered symmetry functions (eeACSFs) to overcome the unfavorable scaling with the number of elements up to an arbitrary number. In combination with uncertainty quantification and continual learning strategies, the concept of a lifelong machine learning potential (lMLP) is proposed, which can be trained in a rolling fashion by a continuous stream of new data. To perform this challenging training process, we present the new continual resilient (CoRe) optimizer which is an adaptive method for stochastic first-order iterative optimization. CoRe is applied in combination with our lifelong training strategies, which provide an adaptive selection of employed training data, including a reduction of the data set size and the removal of doubtful data. We demonstrate the performance of our lMLP concept for a data set including 42 different SN2 reactions and thereby ten different chemical elements. Although our work on the lMLP concept rests on a second-generation HDNNP representation, we emphasize that the concept can also be applied for a different base model. Furthermore, the CoRe optimizer and the lifelong adaptive data selection can be employed in training of machine learning models beyond lMLPs and contribute to the development of lifelong machine learning to mitigate catastrophic forgetting.

This work is organized as follows: In Section 2 we summarize the HDNNP method and introduce eeACSFs, the CoRe optimizer, lifelong adaptive data selection, uncertainty quantification, and lMLPs. After presenting the computational details in Section 3, we move to Section 4, which starts with a description of the reference data. Section 4 continues with a performance assessment of the eeACSF representation, a comparison between the results of CoRe and those of other optimizers, performance tests of the lifelong training strategies, and a validation of the uncertainty quantification in potential energy surface predictions. This work ends with a conclusion in Section 5.

2. Methods

2.1. High-Dimensional Neural Network Potential with Standardization

For a system containing Nelem elements and Natomm atoms of element m, the second-generation HDNNP energy17−19 is given by a sum of atomic energy contributions Eatom,nm of every atom n:

$$E = \sum_{m=1}^{N_{\mathrm{elem}}} \sum_{n=1}^{N_{\mathrm{atom}}^{m}} E_{\mathrm{atom},n}^{m} \tag{1}$$

Each atomic energy contribution is obtained from a feed-forward neural network,

$$E_{\mathrm{atom},n}^{m} = b_{1}^{m,3} + \sum_{\lambda=1}^{n_2} a_{\lambda 1}^{m,23} \cdot f\!\left( b_{\lambda}^{m,2} + \sum_{\kappa=1}^{n_1} a_{\kappa\lambda}^{m,12} \cdot f\!\left( b_{\kappa}^{m,1} + \sum_{i=1}^{n_G} a_{i\kappa}^{m,01} \cdot \alpha_{i}^{m} \left( G_{n,i}^{m} - \beta_{i}^{m} \right) \right) \right) \tag{2}$$

The weights a and b are trained individually for each element m according to energies and atomic forces of a training data set containing multiple chemical structures. We note that the number of hidden layers can differ from the case shown, which is for two hidden layers with n1 and n2 neurons, respectively. A vector of local structural descriptors Gnm of dimension nG, which is explained in the next section, is given as input to this atomic neural network. We propose the activation function f(x) = 1.59223 · tanh(x), which is discussed in Section S1.1 of the Supporting Information along with a tailored weight initialization scheme.40,75 The convergence advantage and accuracy increase of this activation function and weight initialization compared to a hyperbolic tangent and a nontailored weight initialization are shown in Table S3 and Figures S1 (a), S1 (b), and S2 in the Supporting Information.

In contrast to the original HDNNP method,17 the trainable weights αim and βim are introduced for standardization of the structural descriptor input Gn,im. For this purpose, the structural descriptor is shifted by βim, which is initialized by the mean of the descriptor values Gn,im in the initial training data set for all atoms n of the respective element m. The result is scaled by αim, which is initialized by the inverse standard deviation of the respective descriptor values in the initial training data. In this way, the weights am,01 of the input layer are multiplied with values that are centered around zero and show a standard deviation of one for the initial training data. This standardization can improve the training performance and is adjustable in case of additional training data. We note that the weight initialization can be restricted to values inside a certain interval to avoid numerical issues. The weight pair am,01 and αim may be combined, but we treat them separately for technical reasons during the optimization.
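To illustrate Equation 2, the following minimal PyTorch sketch implements one atomic neural network with the trainable standardization weights α and β and the proposed scaled hyperbolic tangent activation. It is a simplified illustration, not the published lMLP code; the class structure, argument names, and default layer sizes (taken from the computational details below) are chosen freely here.

```python
import torch
import torch.nn as nn

class AtomicNN(nn.Module):
    """Sketch of one atomic neural network (cf. Equation 2) for a single element.

    alpha and beta shift and scale the descriptor input; they are initialized
    from the mean and standard deviation of the initial training data and
    remain trainable afterward.
    """

    def __init__(self, n_G: int, hidden=(102, 61, 44), G_mean=None, G_std=None):
        super().__init__()
        G_mean = torch.zeros(n_G) if G_mean is None else G_mean
        G_std = torch.ones(n_G) if G_std is None else G_std
        self.beta = nn.Parameter(G_mean.clone())                 # shift
        self.alpha = nn.Parameter(1.0 / G_std.clamp_min(1e-8))   # scale
        layers, n_in = [], n_G
        for n_out in hidden:
            layers.append(nn.Linear(n_in, n_out))
            n_in = n_out
        layers.append(nn.Linear(n_in, 1))                        # atomic energy output
        self.layers = nn.ModuleList(layers)

    @staticmethod
    def activation(x):
        # proposed activation function f(x) = 1.59223 * tanh(x)
        return 1.59223 * torch.tanh(x)

    def forward(self, G):                                        # G: (n_atoms, n_G)
        x = self.alpha * (G - self.beta)                         # standardized descriptors
        for layer in self.layers[:-1]:
            x = self.activation(layer(x))
        return self.layers[-1](x).squeeze(-1)                    # atomic energies E_atom,n

# The total HDNNP energy of a structure (Equation 1) is the sum of the atomic
# energies of all atoms, each evaluated with the network of its element.
```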

2.2. Element-Embracing Atom-Centered Symmetry Functions

HDNNPs usually employ vectors of many-body atom-centered symmetry functions58 to represent the local atomic environments. These descriptors fulfill the translational, rotational, and permutational invariances of the potential energy surface. They depend on the distances and angles of all neighboring atoms which are inside a cutoff sphere of radius Rc. This radius needs to be sufficiently large to account for all relevant interactions. Since no connectivities are used, the HDNNP is able to describe chemical reactions. The dimensionality of the input vector does not depend on the individual atomic environment—a requirement to obtain generally applicable atomic neural networks. Conventional ACSF vectors are constructed in such a way that each ACSF represents all interactions of only a specific chemical element pair or triplet. This construction, however, leads to an unfavorable scaling of the number of descriptors with respect to the number of elements.

For this reason, we introduce element-embracing atom-centered symmetry functions (eeACSFs) that explicitly depend on the element information H from the periodic table (see next paragraph). Similar to ACSFs, there are two types of eeACSFs. The radial eeACSFs,

$$G_{n,i}^{\mathrm{rad}} = \left[ \sum_{j \neq n} \frac{H_{i,j}^{\mathrm{rad}}}{H_{\mathrm{max},i}} \cdot \exp\!\left( -\eta_i^{\mathrm{rad}} R_{nj}^{2} \right) \cdot f_{\mathrm{c}}\!\left( R_{nj} \right) \right]^{1/2} \tag{3}$$

are a function of the distances Rnj between the central atom n and the neighboring atoms j. The angular eeACSFs,

$$G_{n,i}^{\mathrm{ang}} = \left[ 2^{1-\zeta_i} \sum_{j \neq n} \sum_{\substack{k \neq n \\ k > j}} H_{i,jk}^{\mathrm{ang}} \cdot \left( 1 + \lambda_i \cos\theta_{njk} \right)^{\zeta_i} \cdot \exp\!\left( -\eta_i^{\mathrm{ang}} \left( R_{nj}^{2} + R_{nk}^{2} \right) \right) \cdot f_{\mathrm{c}}\!\left( R_{nj} \right) \cdot f_{\mathrm{c}}\!\left( R_{nk} \right) \right]^{1/2} \tag{4}$$

depend in addition on the angle θnjk centered at atom n between the two neighbors j and k. Different values for the parameters ηirad ⩾ 0, ηiang ⩾ 0, λi = ±1, and ζi ⩾ 1 of each eeACSF i eventually produce a structural fingerprint of the atomic environment of atom n. The cutoff function,

$$f_{\mathrm{c}}\!\left( R_{nj} \right) = \begin{cases} \exp\!\left( 1 - \dfrac{1}{1 - \left( R_{nj}/R_{\mathrm{c}} \right)^{2}} \right) & \text{for } R_{nj} < R_{\mathrm{c}} \\[1ex] 0 & \text{for } R_{nj} \geq R_{\mathrm{c}} \end{cases} \tag{5}$$

damps the contributions smoothly to zero at the cutoff radius Rc. The advantage of this cutoff function is that the derivatives of all orders are also zero at Rc, which is beneficial for the calculation of forces, normal modes, and so forth. In contrast to ACSFs, a square root is applied to eeACSFs to mitigate an overly strong effect of the number of neighbors on the eeACSF value. Such a strong effect can be observed for training data including very different molecule sizes and particle densities, and it can decrease the parametrization performance due to an overly broad range of input values.

As element descriptors we propose n for the period number of the element in the periodic table, m for the group number in the s- and p-block (main group 1 to 8), and d for the group number in the d-block. Main group elements obey d = 0 and d-block elements m = 0. A special case is helium with m = 8. The values of these descriptors for the neighboring atom j are used in the element-dependent term Hi,jrad of the radial eeACSFs,

$$H_{i,j}^{\mathrm{rad}} \in \left\{ 1,\ n_j,\ m_j,\ \ldots \right\} \tag{6}$$

For the construction of radial eeACSFs without element dependence, Hi,jrad = 1 can be applied. In this work, the maximum period is set to X – 1 = 5, i.e., elements up to xenon are considered. The descriptors n, m, and d are employed to balance contributions of light and heavy elements as well as of elements with few and many valence electrons. For main group elements d = 0 is used and for d-block elements m = 0. For elements of higher periods, the group in the f-block can be implemented in the same way as for the d-block and X – 1 can be set to 7. To keep the contribution of each interaction to a value between 0 and 1, the radial eeACSF is divided by Hmax,i, which is the maximum possible value of Hi,jrad.
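As a small illustration of how the element descriptors can be obtained, the following sketch maps element symbols to the period number n, the main-group number m, and the d-block group number d for the ten elements occurring in this work (all main group, hence d = 0), with helium as the special case m = 8. The tuple layout and function name are assumptions made here for illustration only.

```python
# hypothetical lookup of element descriptors (n, m, d); not part of the
# published lMLP code
ELEMENT_DESCRIPTORS = {
    "H":  (1, 1, 0),
    "C":  (2, 4, 0), "N": (2, 5, 0), "O": (2, 6, 0), "F": (2, 7, 0),
    "S":  (3, 6, 0), "Cl": (3, 7, 0),
    "Se": (4, 6, 0), "Br": (4, 7, 0),
    "I":  (5, 7, 0),
    "He": (1, 8, 0),   # special case mentioned in the text
}

def descriptors(symbol: str):
    """Return (n, m, d) for a chemical element; main-group elements have d = 0,
    and d-block elements would have m = 0."""
    return ELEMENT_DESCRIPTORS[symbol]
```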

The element-dependent terms Hi,jkang of the angular eeACSFs are calculated as linear combinations,

(7)

with γi = ± 1 and

(8)

As a consequence, the element-dependent prefactor of the angular eeACSF is defined as

(9)

In conclusion, for systems without/with d-block elements, five/seven different Hi,jrad and nine/eleven Hi,jkang terms are required. In combination with typically around five ηirad, two ηiang, two λi, and three ζi parameter values, the descriptor vector will consist of 25/35 radial and 108/132 angular eeACSFs independent of the number of elements. By contrast, the number of radial ACSFs is proportional to the number of elements, and the number of angular ACSFs scales with $\sum_{m=1}^{N_{\mathrm{elem}}} m = \tfrac{1}{2} N_{\mathrm{elem}} \left( N_{\mathrm{elem}} + 1 \right)$. For example, in the case of four elements, 20 radial and 120 angular ACSFs are obtained for the same number of parameter values. Hence, the break-even point of the computational cost lies at around four elements, since the additional effort for determining Hi,jrad and Hi,jkang from the listed element-dependent values is small. We note that the number of different parameter values can affect the resolution of the representation. The computational bottleneck in the descriptor calculation during MLP applications is the determination of the derivatives as a function of the atomic positions to obtain the atomic forces. Since no additional position-dependent properties are included in eeACSFs compared to ACSFs, the computational cost per eeACSF is similar to that of ACSFs.
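As an illustration of Equations 3 and 5, the following NumPy sketch evaluates one radial eeACSF for a single central atom. The cutoff function follows the form reconstructed above, and the element weights of the neighbors are passed in generically; the concrete element-dependent terms (Equation 6 and Table 1) are not reproduced here, so the example weights are assumptions.

```python
import numpy as np

def cutoff(R, Rc=12.0):
    """Smooth cutoff (cf. Equation 5 as reconstructed above): all derivatives
    vanish at Rc; returns 0 at and beyond Rc."""
    fc = np.zeros_like(R)
    inside = R < Rc
    x = (R[inside] / Rc) ** 2
    fc[inside] = np.exp(1.0 - 1.0 / (1.0 - x))
    return fc

def radial_eeACSF(R_nj, H_j, H_max, eta):
    """Radial eeACSF of one central atom n (cf. Equation 3).

    R_nj  : distances to all neighbors j inside the cutoff sphere
    H_j   : element-dependent weights of the neighbors (H_j = 1 recovers a
            purely structural descriptor); the concrete choices are
            hypothetical here
    H_max : maximum possible value of H_j for normalization
    eta   : Gaussian width parameter eta_i^rad
    """
    contrib = (H_j / H_max) * np.exp(-eta * R_nj**2) * cutoff(R_nj)
    return np.sqrt(contrib.sum())

# example: three neighbors weighted by their period numbers (illustrative)
R = np.array([1.1, 1.8, 3.2])
H = np.array([1.0, 2.0, 5.0])
print(radial_eeACSF(R, H, H_max=5.0, eta=0.010702))
```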

2.3. Training of High-Dimensional Neural Network Potentials

To optimize the weights of the atomic neural networks with respect to potential energies Eref,r and Cartesian atomic force components Fα,nref,r of reference conformations r, a loss function is defined:

$$L^{t} = \frac{1}{N_{\mathrm{conf}}^{t}} \sum_{r=1}^{N_{\mathrm{conf}}^{t}} \left[ \left( \frac{E^{r} - E^{\mathrm{ref},r}}{N_{\mathrm{atom}}^{r}} \right)^{2} + \frac{q}{3 N_{\mathrm{atom}}^{r}} \sum_{n=1}^{N_{\mathrm{atom}}^{r}} \sum_{\alpha = x,y,z} \left( F_{\alpha,n}^{r} - F_{\alpha,n}^{\mathrm{ref},r} \right)^{2} \right] \tag{10}$$

The atomic force component is the negative gradient of the energy with respect to the Cartesian coordinate αnr ∈ {xnr, ynr, znr} of atom n of conformation r,

$$F_{\alpha,n}^{r} = -\frac{\partial E^{r}}{\partial \alpha_{n}^{r}} \tag{11}$$

A set of Nconf conformations is trained simultaneously in each training epoch t to accelerate the optimization and reduce overfitting of single data points. Since the HDNNP prediction of a conformation can be dependent on the atomic neural networks of different chemical elements, these networks are also trained simultaneously. To balance the contributions of energies and forces to the loss function, the hyperparameter q is used. This hyperparameter needs to be chosen with care as it can significantly affect the training performance. To optimize the weights, in each training epoch the gradient of the loss function with respect to the weights,

$$\frac{\partial L^{t}}{\partial w_{\xi}} \tag{12}$$

is calculated, where ξ is a unique index of each weight.
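A compact PyTorch sketch of this training objective is given below. The exact normalization of the energy and force terms used in the article may differ from this form; the balance factor q = 10.9 is taken from the computational details, and the commented usage assumes a hypothetical `model` that maps atomic positions to a total energy.

```python
import torch

def loss_function(E_pred, E_ref, F_pred, F_ref, n_atoms, q=10.9):
    """Sketch of a combined energy/force loss (cf. Equation 10); the precise
    per-atom normalization is an assumption of this illustration."""
    e_term = torch.mean(((E_pred - E_ref) / n_atoms) ** 2)
    f_term = torch.mean((F_pred - F_ref) ** 2)
    return e_term + q * f_term

# Forces as negative energy gradients (Equation 11) and the loss gradient with
# respect to the weights (Equation 12), assuming positions.requires_grad_(True):
# E = model(positions)
# F = -torch.autograd.grad(E.sum(), positions, create_graph=True)[0]
# loss = loss_function(E, E_ref, F, F_ref, n_atoms)
# loss.backward()          # yields dL/dw for every weight w
```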

2.4. Continual Resilient (CoRe) Optimizer

To improve the training process, we developed the continual resilient (CoRe) optimizer that aims to combine the robustness of resilient backpropagation (RPROP)76,77 with the performance of the Adam optimizer.78 Moreover, we introduce adaptive decay rates of moving averages of the loss function gradients, plasticity factors of the weights obtained from an importance score, and weight decays bounding the weight values to increase the convergence speed and final accuracy beyond state-of-the-art optimizers.

CoRe is a first-order gradient-based optimizer. It employs individual adaptive learning rates for each weight wξ, which depend on the optimization history. The algorithm is intended for stochastic iterative optimizations, i.e., subsamples of the batch of training data are used in each training epoch. Thus, the computational efficiency benefits of stochastic gradient descent (SGD)79 can be exploited.

In the spirit of the Adam optimizer, exponential moving averages of the gradient,

$$g_{\xi}^{\tau} = \beta_1^{\tau} \cdot g_{\xi}^{\tau-1} + \left( 1 - \beta_1^{\tau} \right) \cdot \frac{\partial L^{t}}{\partial w_{\xi}} \tag{13}$$

and the squared gradient,

$$h_{\xi}^{\tau} = \beta_2 \cdot h_{\xi}^{\tau-1} + \left( 1 - \beta_2 \right) \cdot \left( \frac{\partial L^{t}}{\partial w_{\xi}} \right)^{2} \tag{14}$$

with decay rates β1τ, β2 ∈ [0, 1) are employed in the computation of the individual adaptive learning rates of each weight wξ. τ is the optimization step counter for each weight, which can differ from the training epoch t. For example, for HDNNPs a training data subsample of a training epoch may not contain every chemical element of the entire training data set, leading to weight updates only for some of the atomic neural networks in the respective training epoch. In this way, the HDNNP model already provides an architectural strategy to mitigate catastrophic forgetting. We note that Equations 13 and 14 represent the case of minimization, while for maximization the sign of the loss function derivative with respect to the weight must be inverted.

In contrast to the Adam optimizer, β1 is a function of τ,

$$\beta_1^{\tau} = \beta_1^{\mathrm{b}} + \left( \beta_1^{\mathrm{a}} - \beta_1^{\mathrm{b}} \right) \cdot \exp\!\left[ -\left( \frac{\tau - 1}{\beta_1^{\mathrm{c}}} \right)^{2} \right] \tag{15}$$

The hyperparameters β1a, β1b ∈ [0, 1) define the initial and final values of β1τ. These values are interconverted by a Gaussian with hyperparameter β1c > 0. A larger β1τ value increases the dependence on previous gradient information, that is, on the optimization history of all recent training data subsamples. Thereby, the optimization performance with respect to the entire training data set can be improved. A smaller β1τ value yields a higher dependence on the current gradient of the subsample, effectively decreasing the inertia of the optimization process. The latter is beneficial for the rapidly and strongly changing gradients occurring during fast convergence at the beginning of an optimization, which is started from randomly initialized weights. In conclusion, Equation 15 can be employed to increase β1τ in the course of the optimization to improve the training performance.

One contribution of the weight update magnitudes of CoRe is obtained, in analogy to the Adam optimizer, according to

$$u_{\xi}^{\tau} = \frac{ g_{\xi}^{\tau} \, / \left[ 1 - \left( \beta_1^{\tau} \right)^{\tau} \right] }{ \sqrt{ h_{\xi}^{\tau} \, / \left[ 1 - \left( \beta_2 \right)^{\tau} \right] } + \epsilon } \tag{16}$$

The moving averages gξτ and hξτ are bias corrected by 1 − (β1τ)τ and 1 − (β2)τ, respectively, to counteract their initialization bias toward zero (gξ0 = hξ0 = 0). The division of the two bias-corrected moving averages makes the weight update magnitudes invariant to gradient rescaling. A form of step size annealing is obtained because a decreasing gradient value leads to a decrease of the update uξτ. Therefore, for well-behaving optimizations the absolute value of uξτ typically decreases from about 1 in the first optimization step τ = 1 toward zero. Higher values of the hyperparameter β2 promote this annealing. The hyperparameter ϵ ≳ 0 is added for numerical stability.

As a second contribution to the weight update, the plasticity factor,

$$P_{\xi}^{\tau} = \begin{cases} 0 & \text{for } \tau > t_{\mathrm{hist}} \text{ and } S_{\xi}^{\tau-1} \text{ among the } n_{\mathrm{frozen},\chi}^{m,\mu\nu} \text{ highest scores of its group} \\ 1 & \text{otherwise} \end{cases} \tag{17}$$

is introduced to mitigate forgetting of old information. In the initial training phase, this factor equals one. When τ > thist, with hyperparameter thist > 0, the plasticity factor can freeze the values of some weights by setting Pξτ to zero. The selection of these weights depends on a score value Sξτ−1, which ranks the importance of the weights with regard to the previously predicted loss function decrease, as will be explained in Equation 20. The score value Sξτ−1 is compared to all other score values Sχτ−1 of the same group. For HDNNPs, a group is composed of the weights within the atomic neural network of the same element m, with the same weight type χ, and the same layer assignment μν. The weights belonging to the nfrozen,χm,μν highest score values in their respective group are frozen for the optimization step τ. The hyperparameter nfrozen,χm,μν ⩾ 0 can be set individually for each group of weights. In this way, a regularization is established for the weights with the highest estimated importance.

A further contribution of the step size adjustment is adapted from RPROP,

$$s_{\xi}^{\tau} = \begin{cases} \min\!\left( \eta_{+} \cdot s_{\xi}^{\tau-1},\ s_{\max} \right) & \text{for } P_{\xi}^{\tau} \cdot g_{\xi}^{\tau-1} \cdot g_{\xi}^{\tau} > 0 \\ \max\!\left( \eta_{-} \cdot s_{\xi}^{\tau-1},\ s_{\min} \right) & \text{for } P_{\xi}^{\tau} \cdot g_{\xi}^{\tau-1} \cdot g_{\xi}^{\tau} < 0 \\ s_{\xi}^{\tau-1} & \text{otherwise} \end{cases} \tag{18}$$

In general, the step size sξτ does not depend on the magnitude of the gradient but only on its sign, yielding a robust optimization. Every time gξτ changes its sign compared to gξτ−1, the previous optimization step probably jumped over a local minimum. Then, the previous step size sξτ−1 was too large and needs to be decreased. If the signs of gξτ and gξτ−1 are the same, sξτ−1 can be increased to speed up convergence. The decrease factor η− and increase factor η+, with 0 < η− ⩽ 1 ⩽ η+, are applied to obtain the current step size sξτ, whereby sξτ is bounded by minimal and maximal step sizes smin, smax > 0. If Pξτ, gξτ, and/or gξτ−1 are zero, the step size update is omitted. The step size sξ0 needs to be initialized, determining the first optimization step size sξ1. The value can be chosen in reasonable proportion to the initial weight values, and experience has shown that the precise choice of this parameter is rather noncritical due to the fast adaptation. We note that the gradient is not reset to gξτ = 0 for gξτ−1 · gξτ < 0, in contrast to some RPROP variants that include a backtracking weight step.76,80 As a consequence, the history of gξτ can be retained.

To decrease the risk of overfitting, the weight update,

$$w_{\xi}^{\tau} = \left( 1 - d_{\chi}^{m,\mu\nu} \cdot \left| u_{\xi}^{\tau} \right| \cdot P_{\xi}^{\tau} \cdot s_{\xi}^{\tau} \right) \cdot w_{\xi}^{\tau-1} - u_{\xi}^{\tau} \cdot P_{\xi}^{\tau} \cdot s_{\xi}^{\tau} \tag{19}$$

includes a weight decay with weight decay hyperparameter dχm,μν ∈ [0, (smax)−1). The weight wξτ−1 is reduced by the fraction obtained from the product of dχm,μν and the absolute current weight update |uξτ| · Pξτ · sξτ. We note that the sign of the weight update—equal to the sign of gξτ—is only encoded in uξτ. The inverse weight decay hyperparameter is an upper bound of the absolute weight value in well-behaving optimizations, i.e., for |uξτ| ⩽ 1, preventing strong increases or decreases of the weights. To obtain the new weight wξτ, the current weight update uξτ · Pξτ · sξτ is subtracted from the previous weight wξτ−1. Hence, the previous weight is changed in the direction opposite to the sign of gξτ with an individually adapted learning rate.

The score value,

$$S_{\xi}^{\tau} = \begin{cases} S_{\xi}^{\tau-1} + \left( t_{\mathrm{hist}} \right)^{-1} \cdot g_{\xi}^{\tau} \cdot u_{\xi}^{\tau} \cdot P_{\xi}^{\tau} \cdot s_{\xi}^{\tau} & \text{for } \tau \leq t_{\mathrm{hist}} \\ \left[ 1 - \left( t_{\mathrm{hist}} \right)^{-1} \right] \cdot S_{\xi}^{\tau-1} + \left( t_{\mathrm{hist}} \right)^{-1} \cdot g_{\xi}^{\tau} \cdot u_{\xi}^{\tau} \cdot P_{\xi}^{\tau} \cdot s_{\xi}^{\tau} & \text{for } \tau > t_{\mathrm{hist}} \end{cases} \tag{20}$$

accounts for weight-specific contributions to previous loss function decreases. It is inspired by the synaptic intelligence method.53 The loss function decrease is estimated by the product of the moving average of the gradient gξτ and the weight update uξτ · Pξτ · sξτ for each step τ. For infinitesimally small changes in the opposite direction of the gradient, the product of gradient and change equals the respective loss function decrease. For larger updates in optimizations, as in the present case, the loss function decrease is typically overestimated by this product, but it remains a reasonable estimate. In addition, the gradient will be noisy if it is calculated for a subsample of the entire training data set. The sign inversion of the update is omitted in the calculation of the score value. Consequently, a higher positive score value represents a larger loss function decrease. The score value is initialized as Sξ0 = 0. For each τ ⩽ thist, gξτ · uξτ · Pξτ · sξτ is summed with equal contribution (thist)−1 to build an initial history. Afterward, Sξτ is calculated as an exponential moving average with decay parameter 1 − (thist)−1. The score value identifies the most important weights for the accurate prediction of previous training data. These weights can then be restricted by the plasticity factor (Equation 17) to balance the stability–plasticity ratio.

As a variant, CoRe can also use the sign of the moving average of the gradient gξτ as the update factor uξτ, i.e., uξτ = sgn(gξτ). Furthermore, CoRe can be employed without plasticity factors, i.e., nfrozen,χm,μν = 0. The Adam optimizer is a special case of CoRe for the hyperparameter settings β1a = β1b, nfrozen,χm,μν = 0, η− = η+ = 1, and dχm,μν = 0. RPROP without the backtracking weight step is another special case for the hyperparameter settings β1a = β1b = β2 = 0, nfrozen,χm,μν = 0, and dχm,μν = 0.

General recommendations for the hyperparameter values are β2 = 0.999, ϵ = 10−8, η− = 0.5, η+ = 1.2, smin = 10−6, and smax = 1, which are mostly in agreement with those of the Adam optimizer and RPROP. Typically, sξ0 = 10−3 is a good choice for the initial learning rate. The weight decay dχm,μν depends on the desired range of weight values. For example, weights associated with the output neuron in atomic neural networks should not be restricted to allow for an arbitrary atomic energy value, while weights connecting input and hidden layers or hidden and hidden layers applying hyperbolic tangents as activation functions can be restricted by using dχm,μν = 0.1. β1c and thist depend on the convergence speed and the total number of epochs. For example, β1c = 500 and thist = 500 were employed in this work, while the total number of epochs was 1500, 2000, or 2500. In addition, nfrozen,χm,μν was set to 1% of the number of weights aκλ and bλm in their respective group, except for those weights associated with the output neuron. β1a and β1b need to be smaller for rapidly and strongly changing gradients. For example, β1a = 0.45 and β1b = 0.7 or 0.725 were used in this work. We note that for highly uncorrelated training data an increase of the subsample size can be beneficial to reduce gradient fluctuations.
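The following NumPy sketch condenses one CoRe update step for a single group of weights according to Equations 13–20 as reconstructed above, using the recommended hyperparameter values as defaults. It is a minimal illustration under these assumptions, not the reference implementation, and it omits details such as the per-group bookkeeping of the step counter τ.

```python
import numpy as np

def core_step(w, grad, state, tau, *, beta1a=0.45, beta1b=0.7, beta1c=500.0,
              beta2=0.999, eps=1e-8, eta_minus=0.5, eta_plus=1.2,
              s_min=1e-6, s_max=1.0, d=0.1, t_hist=500, n_frozen=0):
    """One CoRe update for one group of weights w (1D NumPy array)."""
    g_old = state["g"]
    # Eq 15: schedule of beta1 from beta1a toward beta1b
    beta1 = beta1b + (beta1a - beta1b) * np.exp(-((tau - 1) / beta1c) ** 2)
    # Eqs 13, 14: moving averages of gradient and squared gradient
    state["g"] = beta1 * g_old + (1.0 - beta1) * grad
    state["h"] = beta2 * state["h"] + (1.0 - beta2) * grad**2
    # Eq 16: bias-corrected, rescaling-invariant update factor
    g_hat = state["g"] / (1.0 - beta1**tau)
    h_hat = state["h"] / (1.0 - beta2**tau)
    u = g_hat / (np.sqrt(h_hat) + eps)
    # Eq 17: plasticity factor freezes the n_frozen highest-score weights
    P = np.ones_like(w)
    if tau > t_hist and n_frozen > 0:
        P[np.argsort(state["S"])[-n_frozen:]] = 0.0
    # Eq 18: RPROP-like step-size adaption based on gradient sign changes
    sign = P * g_old * state["g"]
    s = state["s"]
    s = np.where(sign > 0, np.minimum(s * eta_plus, s_max), s)
    s = np.where(sign < 0, np.maximum(s * eta_minus, s_min), s)
    state["s"] = s
    # Eq 20: importance score as running average of the estimated loss decrease
    decrease = state["g"] * u * P * s
    if tau <= t_hist:
        state["S"] = state["S"] + decrease / t_hist
    else:
        state["S"] = (1.0 - 1.0 / t_hist) * state["S"] + decrease / t_hist
    # Eq 19: weight decay and weight update
    return (1.0 - d * np.abs(u) * P * s) * w - u * P * s

# usage: the state holds g, h, s, and S per weight, e.g.
# state = {"g": np.zeros_like(w), "h": np.zeros_like(w),
#          "s": np.full_like(w, 1e-3), "S": np.zeros_like(w)}
```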

2.5. Lifelong Adaptive Data Selection

Stochastic optimization, that is, gradient calculation on a subsample of the training data in each epoch, can keep the computational demand at a manageable level for large data sets. However, random subsamples may lead to an inefficient approach because various data are typically represented at different levels of accuracy. This issue will be even more severe if additional training data are added during the training process in lifelong machine learning. Moreover, redundant and incorrect data need to be removed during the training process to be able to learn autonomously from a continuous stream of new data. Still, rehearsal of representative old data is very important to avoid catastrophic forgetting.

To address these issues, we developed an algorithm for lifelong adaptive data selection. The first key ingredient of the algorithm is an adaptive selection factor Shistr for each training conformation r depending on the conformation's loss function contribution and the training history. The second key ingredient is a balancing scheme for the stochastic choice of well and poorly represented training data. The algorithm intends to exclude redundant and incorrect data and to improve the training performance by balancing adjustment to new or poorly represented data and rehearsal of well represented data.

The lifelong adaptive data selection algorithm can be subdivided into a part carried out before the loss function calculation and a part carried out after this calculation. In the first part (Algorithm 1), the training data subsample is chosen for which the loss function and the respective gradients are calculated. This subsample is indicated by the index "fit" as these data are used for fitting the weights. The number of conformations to be fitted per epoch, Nfit, is a training hyperparameter and can be chosen based on the data set size and the correlation between the data points. Since conformations can be excluded by setting the adaptive selection factor Shistr to zero (Algorithm 2), Nfit cannot be larger than the number of conformations with Shistr > 0, i.e., len(Shist > 0). The training data subsample is split into a set of Ngood already well represented conformations and Nbad new or insufficiently represented conformations. The fraction of good data is defined by pgood (Algorithm 2), while Ngood has to be an integer. pgood is initialized as 0.

Algorithm 1. Selection of the training data subsample before the loss function calculation.

To determine the probabilities Pbad and Pgood for conformations to be chosen as bad or good data, respectively, the contributions of the conformations to the loss function are employed. The last calculated contributions Lold (or Not a Number (NaN) if no contribution has been calculated for a conformation so far) are divided by the maximal non-NaN contribution Loldmax. These quotients are multiplied by the conformation-specific adaptive selection factors Shist to yield the probabilities Pbad. Shist is initialized as a unity vector. All NaN components of Pbad are set to the maximum of Shist. For normalization, Pbad is divided by the sum of its components. A random choice of Nbad conformations with probabilities Pbad is selected from the training data set D to be part of the training data subsample.

To obtain Ngood good data from the remaining data set D\fit, the probability Pgood is determined. Pgood is proportional to the difference between a unity vector 1 and the quotient of Lold\fit and Loldmax. The minimum nonzero probability of Pgood is determined and multiplied by ϵ′ ≳ 0 to obtain a very low probability Pgoodmin for nonexcluded but unfavored conformations. Pgood entries are set to Pgoodmin if the loss function contribution of the corresponding conformation is equal to the maximal contribution or if the Pgood entry is NaN. Subsequently, Pgood is divided by the sum of its components. The good data subsample is a random choice of Ngood data from D\fit with probabilities Pgood. The union of the selected bad and good data is then employed as the training data subsample to calculate the loss function.
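A simplified NumPy sketch of this subsample selection is given below. It follows the textual description of Algorithm 1 and is not the published implementation; the value ϵ′ = 10−6 is taken from the computational details, and edge cases (e.g., the very first epoch, in which all loss contributions are NaN) are handled only crudely.

```python
import numpy as np

def select_subsample(L_old, S_hist, N_fit, p_good, rng=None):
    """Choose the indices of "bad" and "good" conformations (cf. Algorithm 1).

    L_old  : last calculated loss contributions per conformation (np.nan if
             not yet evaluated)
    S_hist : adaptive selection factors (S_hist == 0 means excluded)
    """
    rng = np.random.default_rng() if rng is None else rng
    active = S_hist > 0
    N_fit = int(min(N_fit, active.sum()))
    N_good = int(p_good * N_fit)
    N_bad = N_fit - N_good

    # probabilities for new or poorly represented ("bad") data
    L_max = np.nanmax(L_old[active])
    P_bad = np.where(active, (L_old / L_max) * S_hist, 0.0)
    P_bad = np.where(np.isnan(P_bad) & active, S_hist.max(), P_bad)
    P_bad /= P_bad.sum()
    bad = rng.choice(len(L_old), size=N_bad, replace=False, p=P_bad)

    # probabilities for already well represented ("good") data from the rest
    rest = np.setdiff1d(np.flatnonzero(active), bad)
    P_good = 1.0 - L_old[rest] / L_max
    p_min = 1e-6 * P_good[P_good > 0].min() if (P_good > 0).any() else 1e-12
    P_good = np.where(np.isnan(P_good) | (P_good <= 0), p_min, P_good)
    P_good /= P_good.sum()
    good = rng.choice(rest, size=min(N_good, len(rest)), replace=False, p=P_good)

    return bad, good
```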

Algorithm 2. Update of the adaptive selection factors, the exclusion strike counters, and the fraction of good data after the loss function calculation.

To update the adaptive selection factors Shist, the new loss function contributions Lnewfit of the conformations in the training data subsample are used (Algorithm 2). Therefore, a weighted sum of the energy loss LEfit and the force loss LFfit is employed, which is divided by the number of atoms of the respective conformations Natom. Lrel, which is the quotient of Lnewfit and the total loss Lnew of the training epoch as defined in Equation 10, is compared to different thresholds TF1 < 1 < TF2 < TF3 < TX to update Shist and the exclusion strike counter X. The latter is introduced to exclude prediction outliers. Under the assumption that the model is reasonable, outliers are presumably incorrect data points. Therefore, X—initialized as a zero vector—counts how many consecutive evaluations of Lrelr are greater than the threshold TX. If Xr reaches the hyperparameter NX, the adaptive selection factor Shistr is set to zero and the respective conformation is thus excluded from further training.

The adaptive selection factors Shist are further modified depending on the following conditions: If Lrelr ⩾ TF1 and Shistr < 1, Shistr is set to one. If Lrelr ⩽ TF2 and Shistr > 1, Shistr is also changed to one. In this way, Shistr of well represented data (Lrelr < TF1) can be lower than one, lowering the probability to be chosen for the training data subsample. On the other side, Shistr of poorly represented data (Lrelr > TF2) can be greater than one, increasing the selection probability. For conformations with TF1 ⩽ Lrelr ⩽ TF2, Shistr is reset to one because Lrelr is around the average loss contribution. In addition, Shistr can be modified by the decrease factors F−− and F− and the increase factors F+ and F++. In this way, Shistr tracks the training history to assess the training importance of the associated conformation r. The decrease and increase factors are initialized as $F_{-(-)} = (S_{\min})^{1/N_{F-(-)}}$ and $F_{+(+)} = (S_{\max})^{1/N_{F+(+)}}$, respectively. Consequently, the hyperparameters NF−−, NF−, NF+, and NF++ define how many consecutive applications of the associated decrease and increase factors lead to an adaptive selection factor below and above the hyperparameters Smin and Smax, respectively. For well represented data, Shistr is decreased by F−− if the loss function contribution stayed constant or decreased compared to the previously calculated one. Otherwise, Shistr of well represented data is less strongly decreased by F−. For TF2 < Lrelr ⩽ TF3 and Lnewr > Loldr, Shistr is increased by F+, while it is increased by the larger factor F++ for Lrelr > TF3 and Lnewr > Loldr.

Afterward, all Shistr values below Smin are set to zero for data set reduction because the associated conformations are steadily well represented by the model. Thus, these data are presumably redundant and can be excluded since rehearsal of other data and/or other learning strategies are sufficient to not forget these conformations. All Shistr values above Smax are also set to zero because the model was not able to represent these conformations accurately for many optimization steps. Therefore, under the assumption that the model is reasonable, these conformations are not consistent with the major part of the other data and can be removed from the training data. In this way, the training performance improves for the other data.

For adaptive balancing of learning poorly represented data and not forgetting well represented data, the fraction of good data pgood is modified according to the change of the total loss Lnew compared to Lold. pgood is increased by p± if the loss change is positive, while it is decreased by p± for a negative loss change. Lold is initialized as infinity, and p± is defined as pgoodmax · (Np)−1, with the maximal pgood value pgoodmax and the total number of possible pgood values Np. Lastly, the new loss function contributions Lnewfit overwrite the old loss function contributions Loldfit for the conformations of the training data subsample, and the new total loss value Lnew replaces the old total loss value Lold.
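The per-conformation part of this update can be summarized by the following sketch. It follows the textual description of Algorithm 2 above under the stated reconstruction, with all thresholds and factors passed in as arguments (their concrete values are hyperparameters not reproduced here); the published implementation may differ in details such as the order of the checks.

```python
def update_selection_factor(S_hist, X, L_rel, L_new, L_old,
                            T_F1, T_F2, T_F3, T_X, N_X,
                            F_mm, F_m, F_p, F_pp, S_min, S_max):
    """Sketch of the update of one adaptive selection factor (cf. Algorithm 2).
    Returns the new S_hist and exclusion strike counter X; S_hist = 0 means
    the conformation is excluded from further training."""
    # exclusion strike counter for presumable outliers
    X = X + 1 if L_rel > T_X else 0
    if X >= N_X:
        return 0.0, X

    # reset toward one when the loss contribution is near the average
    if (L_rel >= T_F1 and S_hist < 1.0) or (L_rel <= T_F2 and S_hist > 1.0):
        S_hist = 1.0

    # history-dependent decrease (well represented) or increase (poorly represented)
    if L_rel < T_F1:
        S_hist *= F_mm if L_new <= L_old else F_m
    elif L_rel > T_F2 and L_new > L_old:
        S_hist *= F_p if L_rel <= T_F3 else F_pp

    # data set reduction: steadily well represented or never well represented data
    if S_hist < S_min or S_hist > S_max:
        S_hist = 0.0
    return S_hist, X
```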

2.6. Uncertainty Quantification

An lMLP requires the prediction of the uncertainties of the energy ΔE and the atomic force components ΔFα,n in addition to their values for a chemical structure. The reason is that a prerequisite of an lMLP is that it can produce results at every training stage. Since the predictions at different training stages can vary, the lMLP needs to provide a measure of the prediction accuracy. The predictions at different stages can then be compared within their uncertainties, within which they should agree. In this way, a reliable method is obtained that can still adapt to new information over time. We note that uncertainty quantification also significantly advances other computational chemistry methods because overinterpretation of features beyond the method's accuracy can be prevented.

The uncertainty of machine learning models can be separated into contributions from noise in the reference data, bias of the model, and model variance.81 The handling of noisy training data is addressed by the lifelong adaptive data selection, while noise of the test data does not influence the model performance but only the performance evaluation. For correctly converged electronic structure calculations, the noise is determined by the convergence thresholds and is thus small for MLP reference data. Model bias is caused by limitations in the model architecture, the representation by the descriptors, and the reference data set size. Therefore, the model architecture needs to be optimized for different applications and elaborate descriptors need to be tested to keep this error contribution small. Lifelong learning addresses the remaining part of incomplete reference data. The model variance error can be reliably reduced by ensemble (or committee) models, and its size can be predicted by statistics. The respective uncertainty estimation is based on the huge flexibility of the mathematical expressions underlying machine learning models like artificial neural networks. For well trained conformation space the ensemble member predictions presumably agree, while the predictions are arbitrary for unknown conformation space. Therefore, they can spread widely for an ensemble of independently and differently trained MLPs, i.e., using, for example, different initial weights.

Consequently, ensembles of MLPs have been used for a straightforward uncertainty quantification of model variance contributions.69−73,82 The final energy prediction E is obtained as the mean of the ensemble of NMLP MLP energies Ep,

$$E = \frac{1}{N_{\mathrm{MLP}}} \sum_{p=1}^{N_{\mathrm{MLP}}} E^{p} \tag{21}$$

Analogously, the atomic force components Fα,n are calculated as mean of the negative energy gradient with respect to the Cartesian coordinates αn of an MLP ensemble,

$$F_{\alpha,n} = -\frac{1}{N_{\mathrm{MLP}}} \sum_{p=1}^{N_{\mathrm{MLP}}} \frac{\partial E^{p}}{\partial \alpha_{n}} \tag{22}$$

The energy uncertainties ΔE can then be obtained from the sample standard deviation of the ensemble member energy predictions,

$$\Delta E = \max\!\left( c \cdot \sqrt{ \frac{1}{N_{\mathrm{MLP}} - 1} \sum_{p=1}^{N_{\mathrm{MLP}}} \left( E^{p} - E \right)^{2} },\ \mathrm{RMSE}(E) \right) \tag{23}$$

while the minimal uncertainty is defined by the root-mean-square error RMSE(E) of the reference data. The atomic force component uncertainties ΔFα,n can be analogously calculated.

The scaling factor c is introduced to adjust the sample standard deviation to provide a certain confidence interval that the uncertainty quantification is equal to or larger than the actual error with respect to the reference data. However, this uncertainty quantification misses model bias contributions and hence underestimates the error in cases where the lack of training data is the primary error source. Despite that, as the model variance also increases in these cases, this uncertainty quantification can still signal whether the error will be high, even if the predicted uncertainty value is not accurate. For applications of MLPs this information is sufficient because small uncertainties can be predicted well and the occurrence of large errors in any case requires an improvement of the calculation, which can be achieved by lifelong machine learning.
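The ensemble prediction and its uncertainty are straightforward to compute, as the following NumPy sketch for the energy illustrates (cf. Equations 21 and 23 as reconstructed above); the atomic force components are treated analogously. The default c = 2 is taken from the computational details, and the example ensemble values are invented for illustration.

```python
import numpy as np

def ensemble_prediction(energies, c=2.0, rmse_ref=0.0):
    """Ensemble mean and uncertainty of the energy.

    energies : predictions of the N_MLP ensemble members for one conformation
    c        : scaling factor of the sample standard deviation
    rmse_ref : RMSE of the reference data, used as the minimal uncertainty
    """
    energies = np.asarray(energies, dtype=float)
    E = energies.mean()
    dE = c * energies.std(ddof=1)      # sample standard deviation
    return E, max(dE, rmse_ref)

# example with an assumed ensemble of N_MLP = 10 members (values in eV)
E, dE = ensemble_prediction([-1.02, -1.01, -1.03, -1.00, -1.02,
                             -1.01, -1.02, -1.04, -1.01, -1.02], rmse_ref=0.001)
```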

2.7. Perspective of Lifelong Machine Learning Potentials

In the long term, the grand goal is an lMLP that learns more and more systems over time, with the training benefiting from previously learned knowledge of chemical interactions. The lMLP then learns in the same way chemistry evolves, i.e., by systematically expanding prior knowledge. As this work is a proof of concept, the presented lifelong training strategies are naturally not yet advanced enough to reach this grand goal. In the short term, with the presented algorithms, lMLPs will presumably be trained on data of related systems because in this case the benefit is largest and the difficulty is lowest. When the lifelong training strategies as well as the MLP models are developed to a more elaborate state, more and more information can be combined in one lMLP.

This work presents a comprehensive basis of ingredients for lMLPs. Future work needs, for example, to develop algorithms for the continual expansion of the model architectures to avoid information capacity limitations of the underlying deep learning representations. For instance, neural network growth can be established by using differently sized architectures of the lMLP ensemble members. By tracking their relative performance, the requirement to grow can be identified and individual ensemble members can be adapted. Moreover, coarse-grained structural identifiers like the bin and hash method83 can be applied to augment the selection and exclusion criteria of training data used for rehearsal. In addition, they can be employed for fast checks of whether a certain conformation is represented by similar ones in the training data. From this information, unknown and poorly represented conformations can be distinguished to avoid redundant reference calculations.

Current MLP models are often limited to the representation of a single electronic state. Thus, for example, only a specific total charge and spin multiplicity can often be represented by a single MLP. However, recent works target these limitations for more general MLP models,28,33,84 which can then be used as base models for lMLPs. A current restriction of MLPs is the requirement of consistent training data, i.e., all training data have to be calculated with the same method, basis set, etc. Therefore, MLP models need to be developed that, for example, treat different reference methods in a similar fashion as different electronic states to establish lMLPs learning from data of different reference methods.

Further, open, free, and accessible reference data of published work are obviously highly important. To continue the training of previous work, not only the final weight values but also the associated CoRe optimizer properties of the weights τ, gξτ, hξτ, sξτ, and Sξτ and the lifelong adaptive data selection properties Loldr, Shistr, and Xr of every still required training conformation, as well as the values Lold and pgood, need to be saved and made available. Moreover, autonomous workflows need to be set up in future work for a user-friendly development of an lMLP with direct interfaces to electronic structure and atomistic simulation software.
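In practice, such a restart capability amounts to storing a checkpoint that bundles the weights, the optimizer state, and the data selection state, as in the following sketch. The key names and the file layout are illustrative assumptions, not those of the published lMLP software.

```python
import torch

def save_lmlp_checkpoint(path: str, model: torch.nn.Module,
                         core_state: dict, selection_state: dict) -> None:
    """Sketch of a restart file for continuing lMLP training.

    core_state      : CoRe optimizer properties per weight (tau, g, h, s, S)
    selection_state : lifelong data selection properties (L_old per
                      conformation, S_hist, X, total loss, p_good)
    """
    torch.save({"weights": model.state_dict(),
                "core_state": core_state,
                "data_selection": selection_state}, path)
```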

3. Computational Details

In the HDNNP models of this work, each atomic neural network consisted of an input layer with nG = 153 (eeACSFs) or 156 (ACSFs) neurons, three hidden layers with n1 = 102, n2 = 61, and n3 = 44 neurons, and a single output neuron. The weight initialization is described in Section S1.1 of the Supporting Information. In predictions, the ensemble size NMLP = 10 was applied with the error scaling factor c = 2. We note that the ensemble size was not optimized for individual cases, but it can be optimized in future applications by weighing the improvement of the value and uncertainty prediction against the additional computational demand.

For the training of each HDNNP, we split the reference data set randomly into 90% training conformations and 10% test conformations. In each training epoch, 10% of all training conformations were used for fitting reference data sets A and B (Table 2), and 4.07% were employed for fitting reference data set C. The latter yields the same number of trained conformations per epoch for reference data set C compared to B. This counting includes conformations with an adaptive selection factor of Shistr = 0 and conformations which were only added at a later training epoch. To represent the molecular structures of the reference conformations, eeACSF vectors were equipped with the parameters given in Table 1. A larger than usual cutoff radius of Rc = 12 Å was applied to avoid issues with electrostatic interactions, which are not the focus of this work. However, future studies can easily adopt more elaborate schemes for electrostatics like fourth-generation HDNNPs.28

Table 2. Number of SN2 Reactions NSN2, Conformations Nconf, Atoms in Total Natomtotal, and Elements Nelem for the Different Reference Data Sets A, B, and C.a

Reference data set             A          B          C
NSN2                          10         42         42
Nconf                       2026       8600      21151
Natomtotal                 22983     100117     190065
Nelem                          4         10         10
Erangeref / meV atom−1      1715.3     1855.0     2018.0
Estdref / meV atom−1         382.6      373.8      391.5
Fα,n,rangeref / meV Å−1    29883      29984      29998
Fα,n,stdref / meV Å−1       1098       1066       1706

a Additionally, the energy range Erangeref and standard deviation Estdref and the atomic force component range Fα,n,rangeref and standard deviation Fα,n,stdref are provided.

Table 1. All Combinations of the Listed Parameters Were Applied for Radial and Angular eeACSFs, Respectively.a

Radial eeACSFs
Hrad          1, n, m, n, m
ηrad / Å−2    0, 0.010702, 0.023348, 0.044203, 0.066118, 0.104168, 0.180285, 0.370959, 1.115414

Angular eeACSFs
Hang          1, n, m, n, m
ηang / Å−2    0.011238, 0.090144
λ             −1, 1
ζ             1, 2.409421, 9.996864

a The cutoff radius was set to Rc = 12 Å for all eeACSFs. γ = ±1 was used except for Hang = 1, where only γ = 1 was applied.

The supervised training was performed on the total DFT energy minus the sum of the atomic DFT energies of the neutral free atoms in their lowest spin state (see Table S1 in the Supporting Information). All DFT data of SN2 reactions were obtained for a total charge of −1 e, with e being the elementary charge, and a spin multiplicity of 1. The DFT calculations were performed with the quantum chemistry program ORCA (version 5.0.3).85,86 The PBE exchange-correlation functional87 was chosen with the def2-TZVP basis set88 and use of the RI-J approximation. In addition, D3 semiclassical dispersion corrections89 with Becke-Johnson damping90,91 were applied. To balance the training on the reference DFT energies and forces, the loss function hyperparameter was set to q = 10.9.

For SGD optimizations the best found learning rate of 0.00075 was applied. RPROP optimizations were performed with the hyperparameters recommended in Reference (76). An exception was the initial learning rate, for which 0.001 was found to be the optimal choice in this work. For the Adam optimizer the same hyperparameters were used as in Reference (78) since no general improvement by hyperparameter variations was found.

The hyperparameters of the CoRe optimizer were β1a = 0.45, β1b = 0.7 or 0.725, β1c = 500, β2 = 0.999, ϵ = 10−8, η− = 0.5, η+ = 1.2, smin = 10−6, smax = 1, sξ0 = 10−3, and thist = 500. nfrozen,χm,μν was set to 1% of the number of weights aκλ and bλm in their respective group. Exceptions were those weights associated with the output neuron, for which nfrozen,χm,μν was 0. For the weights α and β, nfrozen,χm,μν = 0 was applied. The weight decay hyperparameter dχm,μν was 0.01 for χ = α, β, 0 for ν = νmax, and 0.1 otherwise. Training was performed for 1500, 2000, and 2500 epochs.

The hyperparameters of the lifelong adaptive data selection were Smin = 0.1, Smax = 100, NF−− = 30, NF− = 100, NF+ = 500, NF++ = 150, NX = 5, Np = 20, and ϵ′ = 10−6.

The lMLP software was written in Python and exploits the scientific computing package NumPy (version 1.21.6)92 and the machine learning framework PyTorch (version 1.12.1).93 It is available on Zenodo (DOI: 10.5281/zenodo.7912832) alongside the reference data sets, generated output of this work, and scripts to analyze and plot this output.

For a fair and reliable comparison of the different MLP trainings, we performed 20 trainings for each setting with different random numbers used in the splitting of training and test data, the weight initialization, and the training data selection process. For each setting, the best ten MLPs were employed in the analysis to reduce the dependence on the explicit random numbers. In this way, the effect of outliers, which can originate from an unfavorable initial parameter choice and which can affect all optimizers, was minimized. Thus, the provided means and standard deviations represent the performance for well working cases.

As we employed different splittings of training and test data in the training of ensemble members in this work, the ensemble prediction of the reference data mixes training and test data predictions. While this approach efficiently uses all reference data for training, unbiased validation of the ensemble needs to be performed on additional data.

4. Results and Discussion

4.1. Reference Data

For the comparison of the performance of ACSFs and eeACSFs, a reference data set A was constructed containing ten different gas-phase SN2 reactions. These reactions were represented by 2026 reference conformations and their respective energies and atomic force components obtained from the reference method. The reference conformations were different chemical structures which sampled the conformation space of the SN2 reactions including the leaving groups X = Cl, I and nucleophiles Y = Cl, HCC, I for central methyl carbon atoms and tertiary tert-butyl carbon atoms (Figure 1). We note that steric hindrance leads to high energy barriers in SN2 reactions of tertiary carbon atoms. This set of SN2 reactions included only four different elements so that the application of ACSFs was still feasible in terms of computational demand. The combinatorial growth of the ACSF vector with the number of elements hampers the application to reference data including more elements.

Figure 1. SN2 reaction at a methyl carbon atom with leaving group X and nucleophile Y.

By contrast, the size of the eeACSF vector stays constant with the number of elements enabling training of reference data containing an arbitrary number of elements. To test eeACSFs, a reference data set B was compiled consisting of 8600 conformations including 42 different SN2 reactions with leaving groups X = Cl, I, nucleophiles Y = Br, Cl, F, H2N, H3CO, HCC, HO, HS, HSe, I, NC, and central methyl and tert-butyl carbon atoms.

Both reference data sets A and B contained conformations obtained in constrained DFT optimizations. The carbon–leaving group and nucleophile–carbon distances were set in the range from 1.05 to 5.25 Å by an irregular sampling approach. A preliminary grid of distance values, which was denser for smaller distances, was defined with individual minimal distances for each SN2 reaction system. To obtain the final distance values, random changes of the grid values were applied with the constraint that the values cannot become smaller or larger than the adjacent grid values.

To add the representation of structural distortions, 12551 conformations of the SN2 reaction systems with central methyl carbon atoms were generated by random atomic displacements. For each conformation obtained in the constrained DFT optimizations, three new conformations were generated by randomly displacing all atomic positions inside atom-centered spheres with radii of 0.05, 0.1, and 0.15 Å, respectively. If some interatomic distances turned out to be too small, the process was restarted with another set of random displacements. These conformations together with those of reference data set B formed reference data set C. We emphasize that these reference data were only employed in performance evaluations of lifelong learning and can be insufficient to carry out atomistic simulations because relevant reference conformations can be missing.

The only preprocessing of the reference data sets was a restriction to maximal absolute atomic force components of 15 eV Å–1 to exclude those conformations which only occur under extreme conditions. The energy and atomic force component ranges and standard deviations of the different data sets are compiled in Table 2 and are referred to in performance comparisons in the following sections. The mean energy ranges and standard deviations for the individual SN2 reaction systems are provided in Table S4 in the Supporting Information.

4.2. Element-Embracing Atom-Centered Symmetry Functions

For four elements, the sizes of the optimized ACSF vector (156) and eeACSF vector (153) are similar (see Table S2 in the Supporting Information for parameters of the ACSFs). Therefore, this number of elements is the turning point at which the eeACSF representation becomes computationally advantageous. Table 3 shows that the representation of reference data set A by ACSFs and eeACSFs yields HDNNPs with similar RMSEs for the test data, justifying the structural representation by eeACSFs (see Figures S3 (a) and (b) in the Supporting Information for the prediction error distribution of the ensemble). The accuracy of HDNNPs using ACSFs is on average slightly better. This trend is expected because ACSFs fully separate the contributions from different element combinations, while these contributions are mixed in eeACSFs. However, the eeACSF representation is less prone to overfitting according to the given training and test RMSEs. The reason may be that every eeACSF value depends on all neighbor atoms inside the cutoff radius, while an ACSF value depends only on certain neighbor atoms. The latter may adjust better to very specific environments but worsens generalization and transferability. Figures S4 (a) and (b) and S5 in the Supporting Information show that the convergence and training process are similar for the ACSF and eeACSF representations.

Table 3. RMSE Values of Individual HDNNPs, i.e., before Ensembling, and the Ensemble Trained on Reference Data Set A Using ACSFs or eeACSFs.a

Individual HDNNPs ACSF eeACSF
RMSE(Etrain)/meV atom–1 2.0 ± 0.2 2.5 ± 0.3
RMSE(Etest)/meV atom–1 2.3 ± 0.2 2.8 ± 0.3
RMSE(Fα,ntrain)/meV Å–1 59 ± 3 73 ± 3
RMSE(Fα,ntest)/meV Å–1 87 ± 9 92 ± 8
Ensemble    
RMSE(E)/meV atom–1 0.9 1.3
RMSE(Fα,n)/meV Å–1 34 44
a The CoRe optimizer and lifelong adaptive data selection were applied for 2000 epochs.

To train reference data set B, which contains ten elements, the size of the ACSF vector would have to be 750 to obtain the same resolution with the parameters ηrad, ηang, λ, and ξ. The resulting high computational demand can be avoided by applying eeACSFs with a constant vector size of 153. The accuracy of individual HDNNPs is RMSE(Etrain) = (3.9 ± 0.4) meV atom–1, RMSE(Etest) = (4.5 ± 0.6) meV atom–1, RMSE(Fα,ntrain) = (99 ± 7) meV Å–1, and RMSE(Fα,ntest) = (116 ± 4) meV Å–1. The higher accuracy of the results in Table 3 is a consequence of the smaller and less complex reference data set to be trained, while the model architecture and training hyperparameters remained unchanged. However, especially the HDNNP ensemble accuracy of RMSE(E) = 2.6 meV atom–1 and RMSE(Fα,n) = 64 meV Å–1 (see Figures S6 (a) and (b) in the Supporting Information for the prediction error distribution) is comparable to that of other state-of-the-art MLPs trained on fewer elements19 and evidences that eeACSFs are able to represent the different local atomic environments including various neighbor elements. Further, the significantly improved accuracy of the HDNNP ensemble compared to that of individual HDNNPs supports the use of an ensemble beyond providing access to uncertainty quantification. For performance comparisons of HDNNPs with other MLPs we refer to References (42) and (64).
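
The scaling of the two descriptor types can be illustrated with a small sketch that assumes the standard counting of element-resolved ACSFs, i.e., one radial set per neighbor element and one angular set per unordered pair of neighbor elements; the per-set counts of 9 radial and 12 angular functions are inferred from the reported sizes of 156 for four and 750 for ten elements, and the fixed eeACSF length of 153 is taken from the text.

```python
def acsf_size(n_elements, n_radial=9, n_angular=12):
    """Element-resolved ACSFs: one radial set per neighbor element and one
    angular set per unordered pair of neighbor elements (with repetition)."""
    return n_radial * n_elements + n_angular * n_elements * (n_elements + 1) // 2

def eeacsf_size(n_elements, length=153):
    """eeACSFs encode the element information in the function values, so the
    vector length does not grow with the number of elements."""
    return length

for n in (1, 4, 10, 20):
    print(f"{n:2d} elements: ACSF {acsf_size(n):5d}   eeACSF {eeacsf_size(n)}")
# 4 elements reproduce 156 ACSFs and 10 elements 750 ACSFs, as reported above.
```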

The energy RMSE is slightly larger than usual due to the relatively broad energy range to be trained (Table 2). Moreover, the mean energy range for the individual SN2 reaction systems is also broad with 747 meV atom–1 for those with central methyl carbon atoms and 438 meV atom–1 for those with central tert-butyl carbon atoms. The atomic force component RMSE is somewhat lower than usual despite the broad range because a significant fraction of forces is close to zero.

4.3. Continual Resilient (CoRe) Optimizer

Accurate MLPs can only be obtained if the optimizer can efficiently and reliably find apposite weight values in the high-dimensional parameter space. Training with a fixed learning rate as in SGD yields HDNNPs of poor accuracy. Most RMSE values of the SGD results for reference data set B listed in Table 4 are almost an order of magnitude larger than those of the CoRe results, highlighting the importance of the optimizer. We found RPROP and the Adam optimizer to be the best performing optimizers available in PyTorch 1.12.1 for the given machine learning model and reference data. We note that RPROP is intended for batch learning on all data at once, while we perform stochastic optimization with RPROP. RPROP converges quickly and smoothly but plateaus at unsatisfactory accuracy (Figures 2 (a) and (b)). By contrast, the Adam optimizer requires more steps to reach the accuracy of RPROP, but it is able to reach lower RMSE values in the end. However, its convergence is noisy and therefore hampers continual applications.

Table 4. RMSE Values of Individual HDNNPs and the Ensemble Trained on Reference Data Set B Using the Optimizers SGD, RPROP, Adam, and CoRe.a

Individual HDNNPs SGD RPROP Adam CoRe
RMSE(Etrain)/meV atom–1 37 ± 7 9.5 ± 0.7 6.1 ± 1.5 3.9 ± 0.4
RMSE(Etest)/meV atom–1 37 ± 7 10.0 ± 0.7 6.4 ± 1.5 4.5 ± 0.6
RMSE(Fα,ntrain)/meV Å–1 564 ± 17 191 ± 5 119 ± 8 99 ± 7
RMSE(Fα,ntest)/meV Å–1 557 ± 11 205 ± 8 127 ± 8 116 ± 4
Ensemble        
RMSE(E)/meV atom–1 33 6.8 3.9 2.6
RMSE(Fα,n)/meV Å–1 529 131 95 64
a The optimizers and lifelong adaptive data selection were applied for 1500 (RPROP) and 2000 (SGD, Adam, CoRe) epochs.
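
For orientation, the following minimal PyTorch sketch shows how such an optimizer comparison can be set up with the baselines available in torch.optim; the toy model, surrogate data, full-batch training, and hyperparameters are illustrative placeholders for the HDNNP training, and the CoRe optimizer itself is not shown because it is not part of PyTorch.

```python
import torch

# Toy stand-ins for an atomic neural network and its reference data.
torch.manual_seed(0)
x = torch.randn(1000, 153)               # descriptor vectors (e.g., eeACSF length)
y = x.sin().sum(dim=1, keepdim=True)     # surrogate "energies"

def train(optimizer_cls, n_epochs=200, **opt_kwargs):
    """Train the same small feed-forward model with a given torch.optim class."""
    model = torch.nn.Sequential(torch.nn.Linear(153, 64), torch.nn.Tanh(),
                                torch.nn.Linear(64, 1))
    optimizer = optimizer_cls(model.parameters(), **opt_kwargs)
    loss_fn = torch.nn.MSELoss()
    for _ in range(n_epochs):            # full-batch training for simplicity
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()

# Compare final losses of the three PyTorch baselines used in this work.
print("SGD  :", train(torch.optim.SGD, lr=0.001))
print("Rprop:", train(torch.optim.Rprop, lr=0.01))
print("Adam :", train(torch.optim.Adam, lr=0.001))
```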

Figure 2. Convergence of the optimizers Adam, RPROP, and CoRe for training reference data set B. The test set RMSE values of (a) energies Etest and (b) atomic force components Fα,ntest are shown as a function of the training epoch nepoch. RMSE values of individual HDNNPs are represented by dots, while their mean, which is unequal to the ensemble RMSE value, is shown by a solid line. Lifelong adaptive data selection was applied in the optimizations.

Our CoRe optimizer converges even faster than RPROP and reaches a better final accuracy than the Adam optimizer (Figures 2 (a) and (b)). The convergence is still almost as smooth as that of RPROP. Hence, CoRe combines and improves on the benefits of both RPROP and Adam. We note that these trends also hold for a random training data selection (see Figures S7 (a) and (b) in the Supporting Information) instead of the lifelong adaptive data selection underlying the results in Figures 2 (a) and (b).

4.4. Lifelong Adaptive Data Selection

Continual data set reduction is essential for lifelong machine learning to keep the amount of training data for rehearsal at a manageable level during incremental learning of new data. Figure 3 reveals how the training data set is narrowed during fitting. The training data reduction sets in after about 600 epochs because, for data exclusion, the adaptive selection factor Shistr needs to be decreased below Shist or increased above Shistmax by a specified number of consecutive applications of the decrease or increase factors, respectively (Algorithm 2). The optimization with RPROP already plateaus after about 600 epochs. Therefore, the data exclusion is too fast in the subsequent epochs, since the loss contribution of most conformations does not change much, which biases the importance evaluation by the adaptive selection factors. By contrast, the fluctuations in the convergence behavior of the optimizations with the Adam optimizer lead to a slow reduction of the training data. The reason is that the data importance measures undergo the same fluctuations, which hampers overcoming the exclusion thresholds. The training data reduction of optimizations using CoRe lies between those of RPROP and Adam, yielding a more balanced process. The number of conformations excluded per epoch also decreases in later, advanced training stages, making the training more stable (see also Figure S5 in the Supporting Information). By contrast, the optimizations with RPROP become unstable after about 1500 epochs due to the overly rapid and strong data reduction.

Figure 3. Training data set reduction of the optimizers Adam, RPROP, and CoRe for training reference data set B. The number of considered training conformations Ntrain is shown as a function of the training epoch nepoch. The values of Ntrain of individual HDNNPs are represented by dots, while their mean is shown by a solid line. The black dashed line represents the number of training conformations which was used for fitting in each epoch.
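
As a rough illustration of threshold-based data exclusion of this kind, the following deliberately simplified NumPy sketch drops a conformation only after its relative loss contribution has stayed below a threshold for several consecutive epochs; the criterion, threshold, and patience counter are stand-ins and do not reproduce the adaptive selection factors and decrease/increase factors of Algorithm 2.

```python
import numpy as np

def update_selection(loss_contrib, counters, threshold=0.1, patience=5):
    """Simplified importance-based exclusion: a conformation is dropped only
    after its relative loss contribution has stayed below `threshold` for
    `patience` consecutive epochs (illustrative stand-in for Algorithm 2)."""
    rel = loss_contrib / loss_contrib.mean()
    counters = np.where(rel < threshold, counters + 1, 0)
    keep = counters < patience
    return keep, counters

# Example epoch loop with static per-conformation loss contributions.
rng = np.random.default_rng(0)
loss_contrib = rng.exponential(size=8600)
counters = np.zeros(loss_contrib.size, dtype=int)
keep = np.ones(loss_contrib.size, dtype=bool)
for epoch in range(10):
    still_kept, counters[keep] = update_selection(loss_contrib[keep], counters[keep])
    keep[np.flatnonzero(keep)[~still_kept]] = False
print(f"{keep.sum()} of {keep.size} conformations remain after 10 epochs")
```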

In the training of reference data set B using the CoRe optimizer, the lifelong adaptive data selection assigned on average (5.5 ± 0.3)·10^3 training conformations to be redundant after 2000 epochs. Therefore, the training data was reduced to 29% of the initial amount. Figures 2 (a) and (b) show that this data reduction does not lead to a decline in accuracy, which would be expected if training were performed on a nonrepresentative subset of the training conformations. Despite the strong reduction of the amount of data, the lifelong adaptive data selection significantly improves the training accuracy compared to random data selection based on all training data (see Table S5 and Figures S7 (a) and (b) in the Supporting Information). This trend is observed for all optimizers. For the CoRe optimizer, lifelong adaptive data selection yields an improvement of the ensemble RMSE values by 31% for the energies and 48% for the atomic force components compared to random data selection.

On average, 38 ± 11 conformations were excluded from training because the model was not able to represent these conformations with high accuracy. In this way, hindrance of the training process by these conformations can be avoided. Since the same 24 conformations were excluded in more than half of the training processes, these conformations are likely to be doubtful. Conformations which are excluded only by a few individual HDNNPs of the ensemble can still be predicted by the ensemble average (see Figures S6 (a) and (b) in the Supporting Information). In this way, the individual training processes can be improved, while the generalization of the ensemble prediction is still provided. Uncertainty arising for certain conformations due to their exclusion in some HDNNP trainings is covered by the uncertainty quantification. Therefore, this approach does not affect the reliability of the method.

4.5. Lifelong Machine Learning Potentials

Three frequently occurring example cases are explored in which lifelong learning can be beneficial in comparison to iterative cycles of data set expansion and construction of new MLPs trained on all data. These cases represent training data completion of a sparsely sampled conformation space, expansion of the represented conformation space for the same chemical systems, and learning additional chemical systems. Lifelong learning can add an arbitrary number of new data points in each training epoch and does not have to be applied in a block-wise scheme as used in conventional active learning for MLPs. In this work, however, we added new data only in a single training epoch for a clear characterization of the resulting effects.
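
To make this setting concrete, the following minimal PyTorch sketch continues training while a block of new reference data is added at a single, predefined epoch; the model, optimizer, random tensors, and the function lifelong_training are illustrative placeholders and do not include the rehearsal, regularization, and data-selection machinery of the actual lMLP training.

```python
import torch

def lifelong_training(model, optimizer, initial_data, late_data, add_at_epoch,
                      n_epochs=1500):
    """Continue training while a block of new (descriptor, target) data is added
    at a single epoch; from then on, old and new data are trained jointly."""
    x, y = initial_data
    loss_fn = torch.nn.MSELoss()
    for epoch in range(n_epochs):
        if epoch == add_at_epoch:
            x = torch.cat([x, late_data[0]])
            y = torch.cat([y, late_data[1]])
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
    return model

# Example: half of the data becomes available only after 500 of 1500 epochs.
model = torch.nn.Sequential(torch.nn.Linear(153, 64), torch.nn.Tanh(),
                            torch.nn.Linear(64, 1))
x, y = torch.randn(8600, 153), torch.randn(8600, 1)
lifelong_training(model, torch.optim.Adam(model.parameters()),
                  (x[:4300], y[:4300]), (x[4300:], y[4300:]), add_at_epoch=500)
```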

A sparsely sampled conformation space was obtained in the initial training epochs when a large random fraction of the training data was first made available at a late epoch. Figures 4 (a) and (b) show that the proposed lifelong learning strategies can handle this case very well, yielding an almost constant final lMLP accuracy with respect to the late data fraction p_late. The number of epochs after which the late data fraction was added was lower for high p_late because otherwise the small fraction of initial data would be overfitted. Figures S8 (b) and S9 in the Supporting Information show that the additional data were added at those epochs where the accuracy of the test atomic force components plateaus or even increases. The mean ensemble accuracy for p_late ∈ [0.5, 0.8] is RMSE(E) = 2.6 meV atom–1 and RMSE(Fα,n) = 68 meV Å–1 and hence very similar to that for p_late = 0 (Table 4). In addition to the flexibility gained in the training process, the lifelong learning approach requires less training data to be handled in the initial epochs compared to training on all data. Figure S9 in the Supporting Information reveals that the number of conformations excluded by the lifelong adaptive data selection is similar after 1500 epochs for different values of p_late.

Figure 4. Final accuracy of lMLPs for which a fraction p_late of the training data of reference data set B was first available at a late training epoch. These data were either chosen randomly or a certain block was used. More detailed information about the procedure is provided in the main text. The test RMSE values of (a) energies Etest and (b) atomic force components Fα,ntest are shown as a function of the late data fraction p_late. RMSE values of individual HDNNPs are represented by dots, their mean by a solid line, and their range by a lighter colored band. The CoRe optimizer (β1 = 0.7 for “Block” with p_late > 0 and β1b = 0.725 otherwise) and lifelong adaptive data selection were applied for 1500 epochs.

To examine the performance for the expansion of the represented conformation space of the same chemical systems, an lMLP was first trained on reference data set B for 1250 epochs. Subsequently, the additional structurally distorted conformations of reference data set C were added and training was continued for another 1250 epochs. Table 5 reveals that lifelong learning yields RMSE values for individual HDNNPs that are about 13% higher than those of learning on a stationary batch of all training data of reference data set C (see Figures S10 (a) and (b) and S11 in the Supporting Information for the training process). However, most of this lost accuracy is regained by the ensemble model, which efficiently reduces the increased model variance (see Table 5 and Figures S12 (a) and (b) in the Supporting Information for the prediction error distribution). Hence, the lMLP concept is able to extend the represented conformation space while retaining the accuracy.

Table 5. RMSE Values of Individual HDNNPs and the Ensemble Trained on Reference Data Set C Using Learning on a Stationary Batch of All Training Data and Lifelong Learning.a

Individual HDNNPs Stationary data Lifelong learning
RMSE(Etrain)/meV atom–1 6.1 ± 0.2 6.9 ± 0.6
RMSE(Etest)/meV atom–1 6.8 ± 0.2 7.8 ± 0.5
RMSE(Fα,ntrain)/meV Å–1 168 ± 3 185 ± 12
RMSE(Fα,ntest)/meV Å–1 182 ± 4 205 ± 10
Ensemble    
RMSE(E)/meV atom–1 4.3 4.5
RMSE(Fα,n)/meV Å–1 121 122
a In lifelong learning, only the conformations of reference data set B were trained for the initial 1250 epochs, and then the additional conformations of reference data set C were added. The CoRe optimizer with β1b = 0.7 and lifelong adaptive data selection were applied for 2500 epochs.

The higher RMSE values obtained in the training of reference data set C compared to B result from the even larger energy range and broader atomic force component distribution (see Table 2 and Table S4 in the Supporting Information), while the model architecture and training hyperparameters remained unchanged. Still, especially the ensemble atomic force component RMSE is similar to that of other state-of-the-art HDNNPs trained on fewer elements.19

To investigate the efficiency of learning additional chemical systems, the system-sorted reference data set B was split into two blocks, whereby the second block comprises the fraction p_late of the data. The conformations were sorted alphabetically in the order central carbon atom, leaving group, and nucleophile. Learning additional SN2 reactions for central tert-butyl carbon atoms, i.e., adding at a late epoch nucleophiles and leaving groups which are only known for central methyl carbon atoms, yields a similar accuracy as learning on all data from the beginning (p_late < 0.5 in Figures 4 (a) and (b)). For p_late ⩾ 0.5, only reactions with central methyl carbon atoms are contained in the initial training data, leading to an increase of the final test RMSE values. We emphasize that for p_late ⩾ 0.8 some elements are also missing in the initial training data. Still, the accuracy is better than that obtained using RPROP, and for the energies it is similar to the Adam results (Table 4).

Similar to the aforementioned case, ensembling can efficiently reduce the model variance introduced by incremental learning and hence is an important tool for lifelong machine learning. The ensemble accuracy is RMSE(E) = 3.1 meV atom–1 and RMSE(Fα,n) = 78 meV Å–1 for p_late ∈ [0.5, 0.8] and therefore about 21% larger than for p_late = 0. However, in these cases of learning additional chemical systems the lifelong learning strategies require further development to be on par with the accuracy of training on all data.

As this work is a proof of concept for lMLPs, more and different chemical systems need to be explored in future work to fine-tune and improve the lifelong learning strategies. Still, we showed that lifelong learning can reach the same accuracy as training on all data, while the flexibility of the training process was significantly increased.

4.6. Ensemble Prediction and Uncertainty Quantification

To validate the ensemble prediction and examine the uncertainty quantification, the lMLP trained by the CoRe optimizer on reference data set B was applied to conformations obtained from constrained DFT optimizations on a dense grid of rCl–C and rBr–C distances for the SN2 reaction Br– + CH3Cl ⇌ BrCH3 + Cl–. We emphasize that reference data set B uses a sparser and irregular sampling of this reaction, so that the validation conformations are unlikely to be contained in reference data set B. Hence, the smoothness of the lMLP potential energy surface can be validated. The validation set contained 1062 conformations with maximal atomic force components of 16 eV Å–1.

Figure 5 reveals that most sections of the represented potential energy surface are predicted within chemical accuracy, i.e., 1 kcal mol–1 = 4.184 kJ mol–1, with respect to the DFT reference energies. For this six-atom system, an error of less than 7.2 meV atom–1 is therefore required (1 kcal mol–1 ≈ 43.4 meV divided by six atoms). Larger errors are only observed for high-energy conformations, which we expected because the maximal atomic force component of the validation data can be 1 eV Å–1 higher than that of the training data. The small errors with smooth distributions in the trained conformation space confirm the smoothness of the lMLP.

Figure 5. Potential energy surface of the SN2 reaction Br– + CH3Cl ⇌ BrCH3 + Cl–. The lMLP ensemble prediction energy E is referenced to the minimum DFT reference energy Eminref of the given conformation space spanned by DFT optimized structures with constrained distances rCl–C and rBr–C. The color represents the error of E with respect to the DFT reference energy Eref. Black dots show the explicit evaluations of the lMLP. The colors between black dots are interpolated.

The reliability of the uncertainty quantification is addressed in Figure 6 (a) and (b). The uncertainty quantification is equal to or larger than the absolute error with respect to the DFT reference energies for 98.6% of the validation data with uncertainties ΔE ⩽ 10 meV atom–1. For the atomic force components the fraction is 99.7% for ΔFα,n ⩽ 250 meV Å–1. Hence, for most conformations the magnitude of the error is reliably predicted. For large errors the uncertainty quantification is expected to underestimate the errors (see Section 2.6) so that the above-mentioned fractions decrease to 66% for ΔE > 10 meV atom–1 and 93% for ΔFα,n > 250 meV Å–1. Still, a reliable identification of large errors is provided.

Figure 6. Absolute values of the errors with respect to the DFT reference and the uncertainty quantification for the ensemble prediction of (a) energies |ΔE| and (b) atomic force components |ΔFα,n| of the validation data. The order x sorts the validation conformations according to their uncertainty quantification.
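
The following minimal NumPy sketch shows how such coverage fractions can be computed from an ensemble; the ensemble predictions are synthetic placeholders, and the uncertainty is taken as the plain ensemble standard deviation, which may differ from the calibrated uncertainty measure used in this work.

```python
import numpy as np

def coverage_fractions(member_predictions, reference, threshold):
    """Ensemble mean as prediction, ensemble spread as uncertainty, and the
    fraction of points whose uncertainty is at least as large as the actual
    error, evaluated below and above an uncertainty threshold (cf. Figure 6)."""
    prediction = member_predictions.mean(axis=0)
    # Plain ensemble standard deviation as uncertainty measure; the uncertainty
    # definition of this work may additionally include a calibration factor.
    uncertainty = member_predictions.std(axis=0)
    covered = uncertainty >= np.abs(prediction - reference)
    low, high = uncertainty <= threshold, uncertainty > threshold
    return covered[low].mean(), covered[high].mean()

# Placeholder ensemble of 10 members predicting 1062 validation energies.
rng = np.random.default_rng(0)
reference = rng.normal(size=1062)
members = reference + rng.normal(scale=0.01, size=(10, 1062))
frac_low, frac_high = coverage_fractions(members, reference, threshold=0.01)
print(frac_low, frac_high)
```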

5. Conclusions

This work introduces the concept of lMLPs that can fine-tune and extend their representation in a rolling fashion. Hence, the lMLP concept unites MLP model efficiency and accuracy with flexibility. For an lMLP, a universal and computationally efficient atomic structure representation, an MLP model, uncertainty quantification, and lifelong learning strategies need to be combined.

Therefore, we introduced eeACSF vectors for the structural representation, which are size-independent with respect to the number of chemical elements, in contrast to many other common MLP descriptors. Their performance in representing an SN2 reference data set is similar to that of ACSF vectors at the break-even point of computational cost, which lies at about four different elements. For several more elements, eeACSF vectors become the only computationally reasonable option due to the combinatorial growth of ACSF vectors. An ensemble HDNNP model using eeACSFs can predict an SN2 reference data set with ten different elements with an accuracy matching that of previous state-of-the-art HDNNPs trained on fewer elements. Further, an ensemble of HDNNPs is a reliable way to quantify the uncertainty due to model variance and to identify conformations with high prediction uncertainty. Additionally, ensembling increases the accuracy, yielding potential energy surfaces with chemical accuracy for the SN2 reactions.

As a basis of our lifelong learning strategies, we introduced the CoRe optimizer, which combines and improves on the fast convergence of RPROP and the high final accuracy of the Adam optimizer. In the training of this work, the CoRe optimizer significantly improves the HDNNP ensemble accuracy by about 33% for energies and forces compared to the Adam optimizer. Applying lifelong adaptive data selection further improves the accuracy and enables narrowing the training data set and excluding doubtful data during the training process. The CoRe optimizer and lifelong adaptive data selection can also improve the training of machine learning models beyond lMLPs.

Finally, an lMLP can adapt to additional data that can be continuously added at any point in the training process. In this way, improvements of lMLPs are possible without learning again on all previous data, while uncertainty quantification still ensures a reliable method. In learning cases such as those arising during active learning or in the extension of the conformation space of the same reaction systems, the training accuracy is similar to that of learning on a stationary batch of all data. Even adding new reaction systems can be handled by the presented algorithms with only moderate accuracy loss. The benefit of lifelong learning is the enhanced flexibility of the training process, enabling rolling explorations of chemical reactivity and training continuation of previous lMLPs. Moreover, the adaptability of lMLPs is especially advantageous for large reference data sets where training on all data at once is computationally very demanding. We emphasize that the lMLP concept can also be applied to other MLP approaches beyond HDNNPs in future work.

Acknowledgments

This work was supported by an ETH Zurich Postdoctoral Fellowship.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jctc.3c00279.

  • Activation function, weight initialization, DFT free atom energies, ACSF parameters, performance evaluation of activation function and weight initialization, energy distribution of reference data, prediction error distributions, training process comparison of ACSFs and eeACSFs, performance evaluation of random data selection, and lMLP training processes (PDF)

The authors declare no competing financial interest.

Supplementary Material

ct3c00279_si_001.pdf (9.2MB, pdf)

References

  1. Cramer C. J.Essentials of Computational Chemistry: Theories and Models; Wiley, 2013. [Google Scholar]
  2. Jensen F.Introduction to Computational Chemistry; Wiley, 2017. [Google Scholar]
  3. Burke K. Perspective on density functional theory. J. Chem. Phys. 2012, 136, 150901. 10.1063/1.4704546. [DOI] [PubMed] [Google Scholar]
  4. Riplinger C.; Pinski P.; Becker U.; Valeev E. F.; Neese F. Sparse maps–A systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory. J. Chem. Phys. 2016, 144, 024109. 10.1063/1.4939030. [DOI] [PubMed] [Google Scholar]
  5. Das S.; Motamarri P.; Gavini V.; Turcksin B.; Li Y. W.; Leback B.. Fast, scalable and accurate finite-element based ab initio calculations using mixed precision computing: 46 PFLOPS simulation of a metallic dislocation system. International Conference for High Performance Computing, Networking, Storage and Analysis (SC19). Denver, CO, USA, 2019.
  6. van Duin A. C. T.; Dasgupta S.; Lorant F.; Goddard W. A. ReaxFF: A Reactive Force Field for Hydrocarbons. J. Phys. Chem. A 2001, 105, 9396–9409. 10.1021/jp004368u. [DOI] [Google Scholar]
  7. Piana S.; Lindorff-Larsen K.; Shaw D. E. How Robust Are Protein Folding Simulations with Respect to Force Field Parameterization?. Biophys. J. 2011, 100, L47–L49. 10.1016/j.bpj.2011.03.051. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Friederich P.; Häse F.; Proppe J.; Aspuru-Guzik A. Machine-learned potentials for next-generation matter simulations. Nat. Mater. 2021, 20, 750–761. 10.1038/s41563-020-0777-6. [DOI] [PubMed] [Google Scholar]
  9. Behler J.; Csányi G. Machine Learning Potentials for Extended Systems-A Perspective. Eur. Phys. J. B 2021, 94, 142. 10.1140/epjb/s10051-021-00156-1. [DOI] [Google Scholar]
  10. Behler J. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys. 2016, 145, 170901. 10.1063/1.4966192. [DOI] [PubMed] [Google Scholar]
  11. Bartók A. P.; De S.; Poelking C.; Bernstein N.; Kermode J. R.; Csányi G.; Ceriotti M. Machine learning unifies the modeling of materials and molecules. Sci. Adv. 2017, 3, e1701816. 10.1126/sciadv.1701816. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Deringer V. L.; Caro M. A.; Csányi G. Machine Learning Interatomic Potentials as Emerging Tools for Materials Science. Adv. Mater. 2019, 31, 1902765. 10.1002/adma.201902765. [DOI] [PubMed] [Google Scholar]
  13. Noé F.; Tkatchenko A.; Müller K.-R.; Clementi C. Machine Learning for Molecular Simulation. Annu. Rev. Phys. Chem. 2020, 71, 361–390. 10.1146/annurev-physchem-042018-052331. [DOI] [PubMed] [Google Scholar]
  14. Westermayr J.; Gastegger M.; Schütt K. T.; Maurer R. J. Perspective on integrating machine learning into computational chemistry and materials science. J. Chem. Phys. 2021, 154, 230903. 10.1063/5.0047760. [DOI] [PubMed] [Google Scholar]
  15. Käser S.; Vazquez-Salazar L. I.; Meuwly M.; Töpfer K. Neural network potentials for chemistry: concepts, applications and prospects. Digital Discovery 2023, 2, 28–58. 10.1039/D2DD00102K. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Blank T. B.; Brown S. D.; Calhoun A. W.; Doren D. J. Neural network models of potential energy surfaces. J. Chem. Phys. 1995, 103, 4129–4137. 10.1063/1.469597. [DOI] [Google Scholar]
  17. Behler J.; Parrinello M. Generalized Neural-Network Representation of High-Dimensional Potential-Energy Surfaces. Phys. Rev. Lett. 2007, 98, 146401. 10.1103/PhysRevLett.98.146401. [DOI] [PubMed] [Google Scholar]
  18. Behler J. First Principles Neural Network Potentials for Reactive Simulations of Large Molecular and Condensed Systems. Angew. Chem., Int. Ed. 2017, 56, 12828–12840. 10.1002/anie.201703114. [DOI] [PubMed] [Google Scholar]
  19. Behler J. Four Generations of High-Dimensional Neural Network Potentials. Chem. Rev. 2021, 121, 10037–10072. 10.1021/acs.chemrev.0c00868. [DOI] [PubMed] [Google Scholar]
  20. Schütt K. T.; Arbabzadah F.; Chmiela S.; Müller K. R.; Tkatchenko A. Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 2017, 8, 13890. 10.1038/ncomms13890. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Unke O. T.; Meuwly M. PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments, and Partial Charges. J. Chem. Theory Comput. 2019, 15, 3678–3693. 10.1021/acs.jctc.9b00181. [DOI] [PubMed] [Google Scholar]
  22. Unke O. T.; Chmiela S.; Gastegger M.; Schütt K. T.; Sauceda H. E.; Müller K.-R. SpookyNet: Learning force fields with electronic degrees of freedom and nonlocal effects. Nat. Commun. 2021, 12, 7273. 10.1038/s41467-021-27504-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Batatia I.; Kovács D. P.; Simm G. N. C.; Ortner C.; Csányi G.. MACE: Higher Order Equivariant Message Passing Neural Networks for Fast and Accurate Force Fields. arXiv:2206.07697 [stat.ML] 2022. 10.48550/arXiv.2206.07697 [DOI] [Google Scholar]
  24. Bartók A. P.; Payne M. C.; Kondor R.; Csányi G. Gaussian Approximation Potentials: The Accuracy of Quantum Mechanics, without the Electrons. Phys. Rev. Lett. 2010, 104, 136403. 10.1103/PhysRevLett.104.136403. [DOI] [PubMed] [Google Scholar]
  25. Shapeev A. V. Moment Tensor Potentials: a class of systematically improvable interatomic potentials. Multiscale Model. Simul. 2016, 14, 1153–1173. 10.1137/15M1054183. [DOI] [Google Scholar]
  26. Artrith N.; Morawietz T.; Behler J. High-dimensional neural-network potentials for multicomponent systems: Applications to zinc oxide. Phys. Rev. B 2011, 83, 153101. 10.1103/PhysRevB.83.153101. [DOI] [Google Scholar]
  27. Ghasemi S. A.; Hofstetter A.; Saha S.; Goedecker S. Interatomic potentials for ionic systems with density functional accuracy based on charge densities obtained by a neural network. Phys. Rev. B 2015, 92, 045131. 10.1103/PhysRevB.92.045131. [DOI] [Google Scholar]
  28. Ko T. W.; Finkler J. A.; Goedecker S.; Behler J. A fourth-generation high-dimensional neural network potential with accurate electrostatics including non-local charge transfer. Nat. Commun. 2021, 12, 398. 10.1038/s41467-020-20427-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Artrith N.; Behler J. High-dimensional neural network potentials for metal surfaces: A prototype study for copper. Phys. Rev. B 2012, 85, 045439. 10.1103/PhysRevB.85.045439. [DOI] [Google Scholar]
  30. Behler J. Constructing High-Dimensional Neural Network Potentials: A Tutorial Review. Int. J. Quantum Chem. 2015, 115, 1032–1050. 10.1002/qua.24890. [DOI] [Google Scholar]
  31. Podryabinkin E. V.; Shapeev A. V. Active learning of linearly parametrized interatomic potential. Comput. Mater. Sci. 2017, 140, 171–180. 10.1016/j.commatsci.2017.08.031. [DOI] [Google Scholar]
  32. Bernstein N.; Csányi G.; Deringer V. L. De Novo Exploration and Self-Guided Learning of Potential-Energy Surfaces. npj Comput. Mater. 2019, 5, 99. 10.1038/s41524-019-0236-6. [DOI] [Google Scholar]
  33. Eckhoff M.; Behler J. High-dimensional neural network potentials for magnetic systems using spin-dependent atom-centered symmetry functions. npj Comput. Mater. 2021, 7, 170. 10.1038/s41524-021-00636-z. [DOI] [Google Scholar]
  34. Morawietz T.; Singraber A.; Dellago C.; Behler J. How van der Waals interactions determine the unique properties of water. Proc. Natl. Acad. Sci. U. S. A. 2016, 113, 8368–8373. 10.1073/pnas.1602375113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  35. Hellström M.; Behler J. Concentration-Dependent Proton Transfer Mechanisms in Aqueous NaOH Solutions: From Acceptor-Driven to Donor-Driven and Back. J. Phys. Chem. Lett. 2016, 7, 3302–3306. 10.1021/acs.jpclett.6b01448. [DOI] [PubMed] [Google Scholar]
  36. Cheng B.; Engel E. A.; Behler J.; Dellago C.; Ceriotti M. Ab initio thermodynamics of liquid and solid water. Proc. Natl. Acad. Sci. U. S. A. 2019, 116, 1110–1115. 10.1073/pnas.1815117116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Westermayr J.; Gastegger M.; Menger M. F. S. J.; Mai S.; González L.; Marquetand P. Machine learning enables long time scale molecular photodynamics simulations. Chem. Sci. 2019, 10, 8100–8107. 10.1039/C9SC01742A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Amabilino S.; Bratholm L. A.; Bennie S. J.; Vaucher A. C.; Reiher M.; Glowacki D. R. Training Neural Nets To Learn Reactive Potential Energy Surfaces Using Interactive Quantum Chemistry in Virtual Reality. J. Phys. Chem. A 2019, 123, 4486–4499. 10.1021/acs.jpca.9b01006. [DOI] [PubMed] [Google Scholar]
  39. Eckhoff M.; Behler J. From Molecular Fragments to the Bulk: Development of a Neural Network Potential for MOF-5. J. Chem. Theory Comput. 2019, 15, 3793–3809. 10.1021/acs.jctc.8b01288. [DOI] [PubMed] [Google Scholar]
  40. Eckhoff M.; Schönewald F.; Risch M.; Volkert C. A.; Blöchl P. E.; Behler J. Closing the gap between theory and experiment for lithium manganese oxide spinels using a high-dimensional neural network potential. Phys. Rev. B 2020, 102, 174102. 10.1103/PhysRevB.102.174102. [DOI] [Google Scholar]
  41. Eckhoff M.; Behler J. Insights into lithium manganese oxide-water interfaces using machine learning potentials. J. Chem. Phys. 2021, 155, 244703. 10.1063/5.0073449. [DOI] [PubMed] [Google Scholar]
  42. Zuo Y.; Chen C.; Li X.; Deng Z.; Chen Y.; Behler J.; Csányi G.; Shapeev A. V.; Thompson A. P.; Wood M. A.; Ong S. P. Performance and Cost Assessment of Machine Learning Interatomic Potentials. J. Phys. Chem. A 2020, 124, 731–745. 10.1021/acs.jpca.9b08723. [DOI] [PubMed] [Google Scholar]
  43. Smith J. S.; Isayev O.; Roitberg A. E. ANI-1: An extensible neural network potential with DFT accuracy at force field computational cost. Chem. Sci. 2017, 8, 3192–3203. 10.1039/C6SC05720A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Zhang S.; Makoś M. Z.; Jadrich R. B.; Kraka E.; Barros K. M.; Nebgen B. T.; Tretiak S.; Isayev O.; Lubbers N.; Messerly R. A.; Smith J. S.. Exploring the frontiers of chemistry with a general reactive machine learning potential. ChemRxiv 2022. 10.26434/chemrxiv-2022-15ct6-v2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Hassabis D.; Kumaran D.; Summerfield C.; Botvinick M. Neuroscience-Inspired Artificial Intelligence. Neuron 2017, 95, 245–258. 10.1016/j.neuron.2017.06.011. [DOI] [PubMed] [Google Scholar]
  46. Chen Z.; Liu B.. Lifelong Machine Learning; Morgan & Claypool Publishers, 2018. [Google Scholar]
  47. Parisi G. I.; Kemker R.; Part J. L.; Kanan C.; Wermter S. Continual lifelong learning with neural networks: A review. Neural Netw. 2019, 113, 54–71. 10.1016/j.neunet.2019.01.012. [DOI] [PubMed] [Google Scholar]
  48. Maltoni D.; Lomonaco V. Continuous learning in single-incremental-task scenarios. Neural Netw. 2019, 116, 56–73. 10.1016/j.neunet.2019.03.010. [DOI] [PubMed] [Google Scholar]
  49. Grossberg S. Adaptive Resonance Theory: How a brain learns to consciously attend, learn, and recognize a changing world. Neural Netw. 2013, 37, 1–47. 10.1016/j.neunet.2012.09.017. [DOI] [PubMed] [Google Scholar]
  50. Goodfellow I. J.; Mirza M.; Xiao D.; Courville A.; Bengio Y.. An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv:1312.6211v3 [stat.ML] 2015. 10.48550/arXiv.1312.6211 [DOI] [Google Scholar]
  51. Hayes T. L.; Cahill N. D.; Kanan C.. Memory Efficient Experience Replay for Streaming Learning. International Conference on Robotics and Automation (ICRA), Montreal, Canada, 2019; pp 9769–9776.
  52. Lopez-Paz D.; Ranzato M. A.. Gradient Episodic Memory for Continual Learning. 31st Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 2017; pp 6470–6479.
  53. Zenke F.; Poole B.; Ganguli S.. Continual Learning Through Synaptic Intelligence. 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017; pp 3987–3995. [PMC free article] [PubMed]
  54. Rusu A. A.; Rabinowitz N. C.; Desjardins G.; Soyer H.; Kirkpatrick J.; Kavukcuoglu K.; Pascanu R.; Hadsell R.. Progressive Neural Networks. arXiv:1606.04671v4 [cs.LG] 2022. 10.48550/arXiv.1606.04671 [DOI] [Google Scholar]
  55. Parisi G. I.; Tani J.; Weber C.; Wermter S. Lifelong Learning of Spatiotemporal Representations with Dual-Memory Recurrent Self-Organization. Front. Neurorobot. 2018, 12, 78. 10.3389/fnbot.2018.00078. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. Unke O. T.; Chmiela S.; Sauceda H. E.; Gastegger M.; Poltavsky I.; Schütt K. T.; Tkatchenko A.; Müller K.-R. Machine Learning Force Fields. Chem. Rev. 2021, 121, 10142–10186. 10.1021/acs.chemrev.0c01111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Kulik H. J.; Hammerschmidt T.; Schmidt J.; Botti S.; Marques M. A. L.; Boley M.; Scheffler M.; Todorović M.; Rinke P.; Oses C.; Smolyanyuk A.; Curtarolo S.; Tkatchenko A.; Bartók A. P.; Manzhos S.; Ihara M.; Carrington T.; Behler J.; Isayev O.; Veit M.; Grisafi A.; Nigam J.; Ceriotti M.; Schütt K. T.; Westermayr J.; Gastegger M.; Maurer R. J.; Kalita B.; Burke K.; Nagai R.; Akashi R.; Sugino O.; Hermann J.; Noé F.; Pilati S.; Draxl C.; Kuban M.; Rigamonti S.; Scheidgen M.; Esters M.; Hicks D.; Toher C.; Balachandran P. V.; Tamblyn I.; Whitelam S.; Bellinger C.; Ghiringhelli L. M. Roadmap on machine learning in electronic structure. Electron. Struct. 2022, 4, 023004. 10.1088/2516-1075/ac572f. [DOI] [Google Scholar]
  58. Behler J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys. 2011, 134, 074106. 10.1063/1.3553717. [DOI] [PubMed] [Google Scholar]
  59. Bartók A. P.; Kondor R.; Csányi G. On representing chemical environments. Phys. Rev. B 2013, 87, 184115. 10.1103/PhysRevB.87.184115. [DOI] [Google Scholar]
  60. Musil F.; Grisafi A.; Bartók A. P.; Ortner C.; Csányi G.; Ceriotti M. Physics-Inspired Structural Representations for Molecules and Materials. Chem. Rev. 2021, 121, 9759–9815. 10.1021/acs.chemrev.1c00021. [DOI] [PubMed] [Google Scholar]
  61. Gugler S.; Reiher M. Quantum Chemical Roots of Machine-Learning Molecular Similarity Descriptors. J. Chem. Theory Comput. 2022, 18, 6670–6689. 10.1021/acs.jctc.2c00718. [DOI] [PubMed] [Google Scholar]
  62. Scarselli F.; Gori M.; Tsoi A. C.; Hagenbuchner M.; Monfardini G. The Graph Neural Network Model. IEEE Trans. Neural Netw. 2009, 20, 61–80. 10.1109/TNN.2008.2005605. [DOI] [PubMed] [Google Scholar]
  63. Gilmer J.; Schoenholz S. S.; Riley P. F.; Vinyals O.; Dahl G. E.. Neural message passing for Quantum chemistry. 34th International Conference on Machine Learning (ICML), Sydney, Australia, 2017; pp 1263–1272.
  64. Chen C.; Ong S. P. A universal graph deep learning interatomic potential for the periodic table. Nat. Comput. Sci. 2022, 2, 718–728. 10.1038/s43588-022-00349-3. [DOI] [PubMed] [Google Scholar]
  65. Artrith N.; Urban A.; Ceder G. Efficient and accurate machine-learning interpolation of atomic energies in compositions with many species. Phys. Rev. B 2017, 96, 014112. 10.1103/PhysRevB.96.014112. [DOI] [Google Scholar]
  66. Rostami S.; Amsler M.; Ghasemi S. A. Optimized symmetry functions for machine-learning interatomic potentials of multicomponent systems. J. Chem. Phys. 2018, 149, 124106. 10.1063/1.5040005. [DOI] [PubMed] [Google Scholar]
  67. Gastegger M.; Schwiedrzik L.; Bittermann M.; Berzsenyi F.; Marquetand P. wACSF-Weighted atom-centered symmetry functions as descriptors in machine learning potentials. J. Chem. Phys. 2018, 148, 241709. 10.1063/1.5019667. [DOI] [PubMed] [Google Scholar]
  68. Faber F. A.; Christensen A. S.; Huang B.; von Lilienfeld O. A. Alchemical and structural distribution based representation for universal quantum machine learning. J. Chem. Phys. 2018, 148, 241717. 10.1063/1.5020710. [DOI] [PubMed] [Google Scholar]
  69. Peterson A. A.; Christensen R.; Khorshidi A. Addressing uncertainty in atomistic machine learning. Phys. Chem. Chem. Phys. 2017, 19, 10978–10985. 10.1039/C7CP00375G. [DOI] [PubMed] [Google Scholar]
  70. Smith J. S.; Nebgen B.; Lubbers N.; Isayev O.; Roitberg A. E. Less is more: Sampling chemical space with active learning. J. Chem. Phys. 2018, 148, 241733. 10.1063/1.5023802. [DOI] [PubMed] [Google Scholar]
  71. Smith J. S.; Nebgen B. T.; Zubatyuk R.; Lubbers N.; Devereux C.; Barros K.; Tretiak S.; Isayev O.; Roitberg A. E. Approaching coupled cluster accuracy with a general-purpose neural network potential through transfer learning. Nat. Commun. 2019, 10, 2903. 10.1038/s41467-019-10827-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Devereux C.; Smith J. S.; Huddleston K. K.; Barros K.; Zubatyuk R.; Isayev O.; Roitberg A. E. Extending the Applicability of the ANI Deep Learning Molecular Potential to Sulfur and Halogens. J. Chem. Theory Comput. 2020, 16, 4192–4202. 10.1021/acs.jctc.0c00121. [DOI] [PubMed] [Google Scholar]
  73. Imbalzano G.; Zhuang Y.; Kapil V.; Rossi K.; Engel E. A.; Grasselli F.; Ceriotti M. Uncertainty estimation for molecular dynamics and sampling. J. Chem. Phys. 2021, 154, 074102. 10.1063/5.0036522. [DOI] [PubMed] [Google Scholar]
  74. Reiher M. Molecule-Specific Uncertainty Quantification in Quantum Chemical Studies. Isr. J. Chem. 2022, 62, e202100101. 10.1002/ijch.202100101. [DOI] [Google Scholar]
  75. Eckhoff M.; Lausch K. N.; Blöchl P. E.; Behler J. Predicting oxidation and spin states by high-dimensional neural networks: Applications to lithium manganese oxide spinels. J. Chem. Phys. 2020, 153, 164107. 10.1063/5.0021452. [DOI] [PubMed] [Google Scholar]
  76. Riedmiller M.; Braun H.. A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm. International Conference on Neural Networks (ICNN), San Francisco, CA, USA, 1993; pp 586–591.
  77. Riedmiller M. Advanced supervised learning in multi-layer perceptrons–From backpropagation to adaptive learning algorithms. Comput. Stand. Interfaces 1994, 16, 265–278. 10.1016/0920-5489(94)90017-5. [DOI] [Google Scholar]
  78. Kingma D. P.; Ba J.. Adam: A Method for Stochastic Optimization. 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  79. Robbins H.; Monro S. A Stochastic Approximation Method. Ann. Math. Stat. 1951, 22, 400–407. 10.1214/aoms/1177729586. [DOI] [Google Scholar]
  80. Igel C.; Hüsken M.. Improving the Rprop Learning Algorithm. Second International ICSC Symposium on Neural Computation, 2000; pp 115–121.
  81. Heid E.; McGill C. J.; Vermeire F. H.; Green W. H.. Characterizing Uncertainty in Machine Learning for Chemistry. ChemRxiv 2023. 10.26434/chemrxiv-2023-00vcg-v2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  82. Musil F.; Willatt M. J.; Langovoy M. A.; Ceriotti M. Fast and Accurate Uncertainty Estimation in Chemical Machine Learning. J. Chem. Theory Comput. 2019, 15, 906–915. 10.1021/acs.jctc.8b00959. [DOI] [PubMed] [Google Scholar]
  83. Paleico M. L.; Behler J. A bin and hash method for analyzing reference data and descriptors in machine learning potentials. Mach. Learn.: Sci. Technol. 2021, 2, 037001. 10.1088/2632-2153/abe663. [DOI] [Google Scholar]
  84. Westermayr J.; Marquetand P. Machine Learning for Electronically Excited States of Molecules. Chem. Rev. 2021, 121, 9873–9926. 10.1021/acs.chemrev.0c00749. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. Neese F. The ORCA program system. WIREs Comput. Mol. Sci. 2012, 2, 73–78. 10.1002/wcms.81. [DOI] [Google Scholar]
  86. Neese F. Software update: The ORCA program system–Version 5.0. WIREs Comput. Mol. Sci. 2022, 12, e1606. 10.1002/wcms.1606. [DOI] [Google Scholar]
  87. Perdew J. P.; Burke K.; Ernzerhof M. Generalized Gradient Approximation Made Simple. Phys. Rev. Lett. 1996, 77, 3865–3868. 10.1103/PhysRevLett.77.3865. [DOI] [PubMed] [Google Scholar]
  88. Weigend F.; Ahlrichs R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys. 2005, 7, 3297–3305. 10.1039/b508541a. [DOI] [PubMed] [Google Scholar]
  89. Grimme S.; Antony J.; Ehrlich S.; Krieg H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J. Chem. Phys. 2010, 132, 154104. 10.1063/1.3382344. [DOI] [PubMed] [Google Scholar]
  90. Johnson E. R.; Becke A. D. A post-Hartree-Fock model of intermolecular interactions. J. Chem. Phys. 2005, 123, 024101. 10.1063/1.1949201. [DOI] [PubMed] [Google Scholar]
  91. Grimme S.; Ehrlich S.; Goerigk L. Effect of the Damping Function in Dispersion Corrected Density Functional Theory. J. Comput. Chem. 2011, 32, 1456–1465. 10.1002/jcc.21759. [DOI] [PubMed] [Google Scholar]
  92. Harris C. R.; Millman K. J.; van der Walt S. J.; Gommers R.; Virtanen P.; Cournapeau D.; Wieser E.; Taylor J.; Berg S.; Smith N. J.; Kern R.; Picus M.; Hoyer S.; van Kerkwijk M. H.; Brett M.; Haldane A.; Fernández del Río J.; Wiebe M.; Peterson P.; Gérard-Marchant P.; Sheppard K.; Reddy T.; Weckesser W.; Abbasi H.; Gohlke C.; Oliphant T. E. Array programming with NumPy. Nature 2020, 585, 357–362. 10.1038/s41586-020-2649-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Paszke A.; Gross S.; Massa F.; Lerer A.; Bradbury J.; Chanan G.; Killeen T.; Lin Z.; Gimelshein N.; Antiga L.; Desmaison A.; Köpf A.; Yang E.; DeVito Z.; Raison M.; Tejani A.; Chilamkurthy S.; Steiner B.; Fang L.; Bai J.; Chintala S.. PyTorch: An Imperative Style, High-Performance Deep Learning Library. 33rd International Conference on Neural Information Processing Systems (NIPS), Vancouver, Canada, 2019; pp 8026–8037.
