Reformulating Reactivity Design for Data-Efficient Machine Learning

Toby Lewis-Atwell; Daniel Beechey; Özgür Şimşek; Matthew N Grayson

doi:10.1021/acscatal.3c02513

. 2023 Oct 6;13(20):13506–13515. doi: 10.1021/acscatal.3c02513

Reformulating Reactivity Design for Data-Efficient Machine Learning

Toby Lewis-Atwell ^†,^‡, Daniel Beechey ^‡, Özgür Şimşek ^‡, Matthew N Grayson ^†,^*

PMCID: PMC10594582 PMID: 37881791

Abstract

graphic file with name cs3c02513_0006.jpg

Machine learning (ML) can deliver rapid and accurate reaction barrier predictions for use in rational reactivity design. However, model training requires large data sets of typically thousands or tens of thousands of barriers that are very expensive to obtain computationally or experimentally. Furthermore, bespoke data sets are required for each region of interest in reaction space as models typically struggle to generalize. We have therefore reformulated the ML barrier prediction problem toward a much more data-efficient process: finding a reaction from a prespecified set with a desired target value. Our reformulation enables the rapid selection of reactions with purpose-specific activation barriers, for example, in the design of reactivity and selectivity in synthesis, catalyst design, toxicology, and covalent drug discovery, requiring just tens of accurately measured barriers. Importantly, our reformulation does not require generalization beyond the domain of the data set at hand, and we show excellent results for the highly toxicologically and synthetically relevant data sets of aza-Michael addition and transition-metal-catalyzed dihydrogen activation, typically requiring less than 20 accurately measured density functional theory (DFT) barriers. Even for incomplete data sets of E2 and S_N2 reactions, with high numbers of missing barriers (74% and 56% respectively), our chosen ML search method still requires significantly fewer data points than the hundreds or thousands needed for more conventional uses of ML to predict activation barriers. Finally, we include a case study in which we use our process to guide the optimization of the dihydrogen activation catalyst. Our approach was able to identify a reaction within 1 kcal mol^–1 of the target barrier by only having to run 12 DFT reaction barrier calculations, which illustrates the usage and real-world applicability of this reformulation for systems of high synthetic importance.

Keywords: machine learning, activation barriers, catalyst design, organic synthesis, data efficiency

1. Introduction

The ability to determine the activation barriers of chemical reactions is highly desirable, allowing us to address interesting and important chemical problems. For example, understanding reaction kinetics^1,2 or elucidating reaction mechanisms.³⁻⁵ Exposing this knowledge can greatly aid in the design of useful materials or new pharmaceutical compounds.⁶⁻¹¹ Unfortunately, traditional methods of evaluating reaction activation barriers, quantum mechanical (QM) calculations or experiments, are very expensive. Therefore, demand for cheaper yet equally effective alternatives is high.¹²⁻¹⁴ Of particular recent interest is the use of machine learning (ML) models to predict activation barriers.¹⁵ The general ML approach to predict barriers is foundational: (1) obtain a data set of accurately measured activation barriers (either from experiment,¹⁶ or more commonly, quantum mechanical calculations¹⁷⁻²¹), (2) choose a suitable chemical representation, (3) train the ML model to minimize the regression loss between its outputs and the activation barriers of the training set. Typically, these data sets contain between thousands^19,20 and tens of thousands^17,21 of reaction barriers. As one illustration of the expense of creating such data sets, the quantum mechanical (QM) calculations for a set of E2 and S_N2 reactions, crafted by von Rudorff et al., took approximately 2.8 million core hours to complete,²² and this data set provides the barriers for only 2720 symmetrically unique reactions.

Two recent studies have shown good results for predicting activation barriers in low data regimes. Friederich et al.²⁰ used Gaussian process regression to predict the barriers of the dihydrogen activation reaction on derivatives of Vaska’s complex, calculated at the PBE-D3/def2-SVP level of theory. They were able to achieve a mean absolute error (MAE) in the barrier predictions below the chemical accuracy threshold of 1 kcal mol^–1, using only 20% of the total data set size of 1947 reactions (approximately 390 reactions). Jorner et al.¹⁶ were also able to obtain an MAE below 1 kcal mol^–1 when training a Gaussian process regression model on approximately 110 of a total of 443 experimentally derived nucleophilic aromatic substitution barriers. However, this did require one of the model’s input features to be the reaction barrier itself, calculated with density functional theory (DFT).

Despite some good results on small data sets, current ML approaches still suffer from another key limitation: They tend to show poor predictive performance on inputs lying outside the domain of their training data.²³⁻²⁵ For these methods to be widely adopted, they must show accurate predictions for all regions of interest in the reaction space. However, given the incomprehensible vastness of chemical space and the even larger number of possible reactions, the prospect of gathering enough accurate reference data to train ML models to predict arbitrary reaction barriers is infeasible.

Recently, some attention had been paid to addressing this infeasibility with transfer learning: Training a base model on a large initial data set containing a wide variety of different reactions and then fine-tuning on a small amount of data specific to a new reaction type. This approach has been reported to be useful for predicting the products of organic reactions,²⁶ the geometries of the transition states of organic reactions,²⁷ material properties,²⁸ and activation barriers of Diels–Alder reactions.²⁹ However, the true effects of transfer learning are still unclear. For instance, the base models for this type of transfer learning will still only be specific to a limited range of reaction classes, and transferring from them to new, very different reactions may not be effective or improve data efficiency.³⁰⁻³²

In this work, we contribute an alternative formulation for predicting reaction activation barriers in a highly data-efficient manner: identifying a reaction with a barrier close to a desired “target” value. Given the reaction class and desired activation barrier, we define a core structure for the reaction and a set of functional groups. These functional groups are varied around the reaction core, where changing the functional groups will change the reaction’s activation barrier. Under this formulation, an algorithm can then search through the space of reactions to identify one with the desired activation barrier. There are many chemical problems in which an efficient search procedure would be highly desirable and could reduce waste: controlling reactivity or selectivity during a chemical synthesis,^33,34 ensuring that a covalent drug has an optimal reactivity,^35,36 or avoiding a potentially toxic side effect in a drug molecule if its reactivity were too high.³⁷

Under our new formulation, we tested several algorithms to search spaces of reactions. Our most successful and thus recommended approach is a simple yet effective ML-based algorithm, which iteratively selects reactions that are predicted to have barriers closer to the target value. When using rough estimates of the reaction barriers, calculated at a lower level of theory compared with the target values, as an input feature, we found the performance of this algorithm to improve. The use of these approximate activation barriers appears to guide the ML model’s predictions in slightly better directions, allowing the algorithm to sample from more refined regions of the reaction space and thus speeding up its location of a target barrier. We also compare this approach with a simple baseline random search, two techniques based on local search, a genetic algorithm and Bayesian optimization, and find our technique to be the most successful.

Our proposed reformulation is also found to be effective over a range of activation barrier data sets, with a diverse set of reaction mechanisms. We present results from a new data set of aza-Michael additions (a reaction we select for its importance to toxicology^37,38 and the creation of target covalent inhibitors^35,36,39,40), E2 and S_N2 reaction barriers,²² as well as a set of dihydrogen activations on derivatives of Vaska’s complex from Friederich et al.²⁰ The latter data set was chosen as a validation experiment to ensure that our reformulated approach was capable of handling complex and highly relevant catalytic reactions.

Our experiments demonstrate that our approaches for identifying reactions with desired target barriers within chemical accuracy are very successful. In particular, when the target barrier lies within the densest regions of the distribution of barriers for a given data set, most approaches are consistently able to quickly identify a reaction with a barrier within 1 kcal mol^–1 of the target values. However, our recommended ML approach was also found to be effective for finding the more challenging barrier targets at the tail ends of the barrier distribution, using merely double-digit numbers of data points during the search process. Compare this with previous studies of ML to predict activation barriers, which typically use hundreds,^16,18,20 thousands,^19,41,42 or tens of thousands^17,21 of barriers calculated at a high level of theory.

2. Computational Methods

2.1. Barrier Data Sets

2.1.1. Aza-Michael Addition

Figure 1 shows the scheme that was used to generate the structures of 1000 Michael acceptors and the 1000 corresponding transition states for their reactions with methylamine. The “R-Group Creator” and “Custom R-Group Enumeration” tools from Schrödinger’s Maestro v12.5⁴³ were used for this structure generation, and following this, the reactants and constrained transition states were conformationally searched with the MMFF force field⁴⁴ using the “Conformational Search” tool from Schrödinger’s MacroModel v12.9.^45,46 Our choice of force field was due to previous findings that the MMFF force fields are among the most successful for predicting the conformations of organic molecules.^47,48 The lowest-energy conformers of each of the reactants and transition states predicted by MMFF were retained for further quantum chemical calculations.

R-group placement scheme for the data set of Michael acceptors reacting with methylamine. The scheme on the right side shows all of the possible functional groups that may occupy each of the positions shown on the left side of the vertical black bar.

For this data set of aza-Michael addition reactions, we select the PM6 semiempirical method to calculate the approximate, low-level barriers that are used as an input feature to the ML model in our main search algorithm. Single-point energies using the PM6 method and geometry optimizations at the ωB97X-D/def2-TZVP level of theory were performed on all of the MMFF structures using Gaussian 16, Revision A.03,⁴⁹ and Revision C.01.⁵⁰ All DFT and semiempirical calculations were performed in the IEF-PCM (water)⁵¹ implicit solvent model due to its previous use in reaction modeling of systems similar to those of this data set.^3,37,38 All 1000 transition states and reactants were successfully optimized with the chosen DFT method. Further information on the aza-Michael addition data set, including plots of the DFT barrier distribution and the correlation between the DFT and PM6 barriers, is found in the Supporting Information, Section S1.1.

2.1.2. Dihydrogen Activation on Vaska’s Complex

The data set of reaction barriers for dihydrogen activation on derivatives of Vaska’s complex by Friederich et al.,²⁰ available from ref (52), was also utilized in this work. The enumeration scheme used by Friederich et al. for the R-groups attached to the iridium core of Vaska’s complex is shown in Figure 2. This scheme allows for the generation of a total of 2574 catalytic reactions once the symmetry of the catalyst is accounted for (R3 and R4 may be interchanged, and the reaction remains the same). 1947 dihydrogen activation barriers calculated at the PBE/def2-SVP level of theory including Grimme’s D3 dispersion correction⁵³ were available from this data set, which gave a convergence rate of approximately 75.6%. Full details of the data set generation may be found in the original work by Friederich and co-workers.²⁰

R-group placement scheme for the set of derivatives of Vaska’s complex by Friederich et al.²⁰ The scheme on the right-hand side shows all of the possible functional groups that may occupy each of the positions shown on the left-hand side of the vertical black bar.

In this work, we also calculated single-point energies of the 1947 reactant and transition state structures that were available from this data set at the SVWN/def2-SVP level of theory (this LDA functional is at a lower rung and has cheaper computational cost compared with the PBE functional used to calculate the “accurate” barriers for this reaction) using Gaussian 16, Revision C.01.⁵⁰ The approximate LDA activation barriers were calculated as the differences between the electronic energies of the reactants and transition states and are later used as the low-level barrier input feature to the ML model in our main search algorithm. For the 24.4% of the possible structures that were unavailable from this data set, their low-level LDA barriers were imputed using our selected Ridge regression model (see also Section S3 in the Supporting Information), trained on the set of LDA barriers calculated from the 75.6% of the possible reactions with coordinates available. Plots of the distribution of the accurate barriers and the correlation with the low-level LDA barriers may be found in the Supporting Information, Section S1.2.

2.1.3. E2 and S_N2

The data sets of E2 and S_N2 reaction activation barriers that were created by von Rudorff et al.^22,54 were also generated via an enumeration scheme, and are therefore suitable for this work. The R-group, leaving group (X), and nucleophile (Y) placement schemes for the E2 and S_N2 reactions are shown in Figure 3. These generation schemes allow for a total of 3900 distinct reactions to be generated after accounting for the symmetry of the systems (when R1 and R2 are interchanged simultaneously with R3 and R4, the reaction remains the same).

R-group, leaving group, and nucleophile placement scheme for the data sets of E2 (top) and S_N2 (bottom) reactions by von Rudorff et al.²² The schemes on the right side show all of the possible functional groups that may occupy each of the positions shown on the left side of the vertical black bars.

In this work, we treat the MP2/6-311G(d) E2 and S_N2 barriers as our expensive, target level of theory and the HF/6-311G(d) barriers that are also available from this data set as the low-level barriers. Once symmetry duplicates are removed, this data set provides 1000 E2 and 1720 S_N2 reaction barriers, all at both levels of theory, giving convergence rates of 25.6 and 44.1%, respectively. Full details of the E2 and S_N2 data set generation may be found in the original work by von Rudorff and co-workers.²² As for the dihydrogen activation data set, we impute the missing lower-level barriers with our selected Ridge regression model, trained on the full set of HF/6-311G(d) barriers for each reaction. Visualizations of the MP2 barrier distributions and their correlation plots with the HF barriers are found in the Supporting Information, Section S1.3.

2.2. Machine Learning Search Algorithm

Our ML search strategy is based on the iterative training of an ML model and using that model to select the most promising candidates from the pool of reactions without measured barriers. The procedure begins by randomly selecting five reactions from the defined search space, measuring their activation barriers, and training an ML model to predict the barriers of the remaining unmeasured set. Before the model is trained on the full set of measured reaction barriers, the hyperparameters for the model are selected by a small grid search according to those that minimize the mean absolute error from 5-fold cross-validation. The ML model is then made to predict the barriers of all of the unsampled reactions, and the reaction predicted to have a barrier closest to the target value has its activation barrier measured and is added to the “training” set. In the case that the predicted target reaction is unavailable due to calculation or experimental failure, that reaction is discarded from the pool of unsampled reactions, and the next most closely predicted reaction is measured and added to the training set. The model is then retrained with the augmented training set and the process repeats until a satisficing reaction with a barrier within 1 kcal mol^–1 of the target value is obtained. Figure 4 also shows an illustration of this procedure in full.

Overall scheme for the ML barrier search procedure in this work. The “generate model features for entire data set” box shows the representation vector used in this work for an example Michael addition reaction: a one-hot encoding of its R-groups concatenated with an approximate, low-level calculated barrier. Each column in this representation vector corresponds to one input feature for the ML model, and these features are generated for the whole data set of reactions before the search process can begin. Blue, straight arrows represent the initialization stages that are carried out before the iterative ML search (green, curved arrows).

The representation we use for the input to the ML models in this work is primarily composed of a one-hot encoding of its interchangeable functional groups. That is, each position in the reacting structures that may be substituted with a range of functional groups is assigned a bit vector, where each bit corresponds to one of the possible groups that may occupy that position (a value of zero indicates absence, one indicates presence). The bit vectors for each position are concatenated together to produce the representation of each reaction. In previous works predicting activation barriers,^16,18 it was observed that approximate activation barriers calculated at a lower level of theory are extremely useful to the predictions of higher-level barriers. Therefore, in addition to the one-hot encoding, we tested as another input feature the approximate activation barriers of each of the reactions in the search space, as calculated with various lower levels of theory (see Section 2.1) or imputed with the same ML model used to search the reaction space where the structures are unavailable. The combination of one-hot encoding and low-level calculated barriers provides a powerful, yet compact reaction representation, which reduces the effects of the curse of dimensionality and therefore means that the ML model requires less data to train effectively and is less likely to overfit.

In order to make a principled selection of the particular ML model that would be used to search the reaction spaces in this work, we performed a small test of their predictive performances when trained on data sets of the tiny sizes they are expected to handle under our reformulation. Six ML models from the Python library scikit-learn⁵⁵ (linear regression (LR), ridge regression (RR), random forest regression (RFR), gradient boosting regression (GBR), kernel ridge regression (KRR) and Gaussian process regression (GPR)) were selected and trained on 30 randomly selected data points from each of our data sets and the mean absolute errors of their predictions on the training set and the remaining “test” set were recorded. These tests were repeated five times, and we report the mean errors. The results of this test (in which we find ridge regression to be the overall best model) are reported in the Supporting Information, Section S3, along with the results of a similar experiment in which we use more chemically inspired input features instead of the one-hot encoding. However, we do not find these different features to generally improve the performance of our models in this context. The hyperparameters of the models were tuned before final training using a grid search, with selection based on the minimal mean absolute error from 5-fold cross-validation. The hyperparameter ranges used for each of the models are reported in the Supporting Information, Section S4.

To address the timings of each of the computational processes in this work, we emphasize that the time required for the entire search process (including model training and hyperparameter tuning in the supervised learning case) is completely eclipsed by the time for the high-level activation barrier calculation (in particular, the transition state optimization for our quantum mechanically calculated barriers). The time required for the optimization of a single transition state varies but is typically on the order of hours for the reacting species considered in this work. The searching code for a given reaction, on the other hand, takes a few minutes to complete, at most. Therefore, in practice, the major bottleneck when applying the methods presented in this work will be the determination of the small number of reaction barriers.

For each data set, we test our algorithms on a range of target values from across the barrier distributions. Starting from the minimum barrier reaction as a target value, we select the next target reaction as the one that is closest to having a barrier 2 kcal mol^–1 higher than the previous target until the maximum barrier is reached. This provides a set of target values covering the entire range of the barrier distribution, each separated from the last by approximately 2 kcal mol^–1. For each target barrier, we repeat the search experiments 25 times with a different initial random sample in each repeat. The metrics we report in this main paper are based on the mean numbers of sampled reactions over the 25 repeats.

In addition to this ML-based search algorithm, we test five other algorithms: a random search, two methods based on local search, Bayesian optimization, and a genetic algorithm (as well as a variant of our main algorithm that does not use the low-level barriers as an input feature), but these do not perform as well as our main ML algorithm discussed here. We report the workings of these additional five algorithms in the Supporting Information, Section S2.

3. Results and Discussion

3.1. Search Algorithm Performances

In the Supporting Information, Section S3, we report the results from our ML model assessment, in which we find that the ridge regression model showed the best balance between performance on held-out test data and lack of overfitting to the extremely small data set of 30 samples, and therefore we selected it as the ML model for the use in our main algorithm.

We assess our search algorithms by the number of reactions each required to sample before obtaining a reaction with a barrier within 1 kcal mol^–1 of a target value. In the Supporting Information, Section S5, we show the means of these numbers averaged over the 25 repeats for all of the target barriers for each data set. In this section of this manuscript, we report these mean sample numbers averaged over different regions of the barrier distributions, to show how the performances of the different algorithms change when the target value moves to more challenging regions of the distributions.

Table 1 shows the mean numbers of reactions that are sampled by each of our algorithms, averaged over all of the target values for each data set. Our ML search procedure outperforms all of the other algorithms on each data set. It finds reactions within 1 kcal mol^–1 of the target value with fewer samples than any of the other algorithms. This approach is followed closely in performance by Bayesian optimization, which requires very similar numbers of sampled reactions, although it seems that the additional leveraging of the predicted uncertainty does not offer a great advantage in this particular context.

Table 1. Mean Numbers of Sampled Reactions Required to Obtain Barriers within 1 kcal mol^–1 of a Target Value, Averaged over all of the Target Values for Each Data Set.

data set (% complete)	random search	local search	guided local	ML (no barr.)	ML search	bayes. opt.	genetic alg.
MA (100%)	167.11	37.96	26.03	16.78	13.39	13.50	63.16
H₂ (75.6%)	318.94	132.85	259.24	39.09	9.69	10.30	189.22
S_N2 (44.1%)	923.27	470.05	449.67	134.58	47.75	49.48	373.64
E2 (25.6%)	912.72	503.81	653.97	221.04	100.24	116.33	577.08

Open in a new tab

Rerunning the same ML search procedure with the exclusion of the low-level calculated barriers as input features (column titled “ML (no barr.)”) shows that the algorithm still performs well when the completion rate of the accurate barrier data set is high (i.e., in the aza-Michael addition and dihydrogen activation data sets). This suggests that one may implement a strategy in which the sampling procedure begins without using the low-level barriers as an input feature, and after a small number of samples, if the convergence rate of the accurate barriers is determined to be too low, the low-level barriers may then be calculated. This would allow for the avoidance of the calculations of the low-level barriers, should the convergence rate be high enough that the algorithm will perform similarly with or without that feature.

Also seen in Table 1 is that the local search and genetic algorithms perform notably worse than the model-based algorithms, and they are therefore not our recommended approaches for this problem reformulation. However, all of our tested algorithms outperform a totally random search, reassuring us that a principled approach to searching reaction spaces is more effective for our reformulation than the blind sampling of reactions.

We also observe that the smallest typical number of samples required to obtain a barrier close to the target value is found for the dihydrogen activation data set, which is most likely due to the very strong correlation between the set of PBE reaction barriers we treat as our target level of theory and the lower-level LDA barriers (see also Figure S2 in the Supporting Information). However, it is also seen that the S_N2 and (in particular) the E2 reactions require significantly more samples to obtain the target barriers. This is most likely due to the very low convergence rates of the barriers in the two data sets (25.6% and 44.1% for E2 and S_N2, respectively), meaning that there is a very high chance in each that any selected data point will not have a barrier available and thus the model will not be trained on this data and will be unable to leverage its information.

However, the sample numbers averaged over the entire set of barrier targets do not quite tell the whole story in terms of how the performances of the various algorithms vary as the target value moves to different regions of the barrier distributions. Table 2 shows the mean numbers of samples averaged over all of the target values except the highest five and lowest five (or highest three and lowest three for the dihydrogen activation data set since the range of its barrier distribution is much smaller than the other data sets). The algorithms now all perform more similarly, showing much more practically achievable numbers of sampled reactions. This better performance in the more central, denser regions of the barrier distribution is explained by the higher numbers of reactions with barriers close to these target values. For target barriers in regions of the barrier distribution with greater sample densities, there is a much higher chance that a randomly chosen reaction will have a barrier closer to the target value. This is most easily seen from the dramatic reduction in the averaged sample numbers for the random search seen in Table 2 compared with Table 1.

Table 2. Mean Numbers of Sampled Reactions Required to Obtain Barriers within 1 kcal mol^–1 of a Target Value, Averaged over All except the 5 Highest and 5 Lowest Target Values (or 3 Highest and 3 Lowest in the Case of the Dihydrogen Activation Data Set, due to Its Shorter Barrier Range).

data set (% complete)	random search	local search	guided local	ML (no barr.)	ML search	bayes. opt.	genetic alg.
MA (100%)	18.95	14.71	14.75	9.57	9.46	9.00	17.75
H₂ (75.6%)	21.08	24.45	23.79	11.14	9.14	9.75	13.50
S_N2 (44.1%)	257.59	106.72	87.97	52.65	33.41	35.42	128.55
E2 (25.6%)	211.47	113.91	147.59	92.98	70.37	86.73	160.57

Open in a new tab

Therefore, the most important test of our algorithms is how many samples they require to find the reactions with the minimum and maximum barriers in each of the data sets. Table 3 shows the mean numbers of samples averaged over the five highest and five lowest target barriers for each data set (except for the dihydrogen activation data set, for which we average over the three highest and three lowest target barriers due to the smaller range of its barrier distribution). Again it is seen that our ML-based search approach using the low-level calculated barriers as an input feature requires the smallest numbers of sampled reactions, with Bayesian optimization using the same features following very closely behind. Our ML algorithm when not using the low-level barriers as a feature also performs quite closely for the aza-Michael addition data set (for which the completion rate is 100%). This again suggests that the low-level barriers do not necessarily have to be calculated when the convergence rate of the accurate barriers is high enough. We observe that the performances of these two algorithms do not deteriorate to as great an extent as our alternative search algorithms. For the aza-Michael addition and dihydrogen activation data sets, our ML algorithm is still able to find reactions with barriers within 1 kcal mol^–1 of these most challenging target values within very reasonable numbers of samples. In the E2 and S_N2 data sets, it is more difficult for our algorithm to locate reactions with target barriers close to the tail ends of the distributions, which is again very likely due to the low convergence rates of the barrier calculations in these data sets. However, our proposed ML method still shows the best performance out of all our algorithms for these challenging data sets.

Table 3. Mean Numbers of Sampled Reactions Required to Obtain Barriers within 1 kcal mol^–1 of a Target Value, Averaged over the 5 Highest and 5 Lowest Target Values (or 3 Highest and 3 Lowest in the Case of the Dihydrogen Activation Data Set, due to Its Shorter Barrier Range).

data set (% complete)	random search	local search	guided local	ML (no barr.)	ML search	bayes. opt.	genetic alg.
MA (100%)	315.27	61.21	37.31	23.98	17.32	17.99	108.56
H₂ (75.6%)	666.45	259.33	533.93	71.69	10.34	10.93	394.23
S_N2 (44.1%)	3053.46	1632.69	1607.13	396.78	93.64	94.46	1157.91
E2 (25.6%)	2525.61	1400.59	1818.65	515.59	168.92	184.41	1535.06

Open in a new tab

At this point, some discussion of the generalities and limitations of our proposed approach is warranted. This reformulation has been tested on a diverse range of data sets, from toxicologically relevant aza-Michael addition to transition-metal-catalyzed dihydrogen activation, and is therefore suitable for a wide range of chemistries. Specifically, our approach is oriented toward the situation in which one is interested in the selection of a single reaction from a large set, which may be described by a core structure with variable surrounding R-groups (for example, Figures 1–3). The restriction to a search within a predefined set of reactions means that one no longer needs to be concerned whether the ML model is able to generalize beyond the specific domain of the data set at hand, thus avoiding this pitfall that is common to nearly all types of ML.

On the other hand, our most performant approach requires the calculation of low-level, approximate activation barriers for all of the possible reactions in the search space. However, the level of theory used may be quite low, only single-point energies on approximate transition state and reactant geometries are required (for example, for the aza-Michael addition data set, PM6 semiempirical calculations were performed on constrained MMFF geometries) and any missing low-level barriers may be imputed with an ML model once a great enough proportion of the data set has been covered (for example, as in the dihydrogen activation, E2 and S_N2 data sets). Therefore, the calculations of these low-level barriers should not lead to an extremely problematic computational overhead and will be much cheaper than the determinations of the accurate reaction barriers. In addition, when the convergence rate of the accurate barriers for a given data set is high enough, the ML algorithm performs quite similarly when not using the low-level barrier as a feature (as for the aza-Michael addition and dihydrogen activation data sets). Therefore, the search algorithm could be implemented to switch to using the low-level barrier feature should the convergence rate of the accurate barriers be found to be too low after a small number of samples, and otherwise, the low-level barriers would not even need to be calculated.

The reformulation of the problem toward finding a particular reaction from a finite set of possibilities limits the eventual reaction that will be found to one that is within the initially specified set of reactions. This is in contrast to the idea of using ML models to predict the activation barriers of arbitrary chemical reactions with high accuracy and speed. However, training an ML model to make highly accurate predictions of barriers for a wide and useful variety of reactions requires an impractically large amount of expensive data. Therefore, we argue that approaches similar to those proposed here are worthy of further investigation. Specifically, a more effective use of ML and other computational techniques in the context of predicting activation barriers could create tools for efficiently identifying reactions with barriers that suit a particular purpose. Such methods could guide established experimental and computational methods rather than replacing them.

The decreases in the performance from our approach come when the completion rates of the data sets decrease. The E2 and S_N2 data sets are only 25.6% and 44.1% complete, respectively, and give the lowest and second lowest performances of all of the data sets. This result is because the majority of the sampled data points during the ML search cannot be added to the training set, and therefore, the model is unable to leverage the information from that sample and cannot improve its predictions until a reaction with a new barrier value is obtained. However, we observe that the data set for which the best results were produced was for the dihydrogen activation reaction, which had a very strong correlation between its low-level barrier feature and the “accurate” barrier values. Therefore, in cases where such high failure rates are observed or expected, it may be prudent to obtain more expensive but more accurate lower-level barriers for the input feature, since they are more likely to be more strongly correlated with the accurate barriers.

Finally, in the Supporting Information, Section S6, we present additional analyses of the ML model used in the main search algorithm as the sampling iterations proceed. Analysis of feature importances shows that the low-level barrier features are most important and the model leverages the one-hot encoding features for small corrections to these approximate barriers. Visualizing the predictive performance of the model is as expected: the model’s performance on held-out test data improves as more data is sampled. Lastly, scrambling the input features severely deteriorates the performance of the search procedure at the challenging tail ends of the barrier distribution, indicating that the model was learning a meaningful relationship between the features and the activation barrier rather than simply overfitting and randomly guessing the correct reaction.

3.2. Case Study in Catalyst Optimization

Our results presented above give an empirical analysis of the efficiency and effectiveness of our method. However, an equally important consideration is its practical application. Therefore, as a final experiment, we wished to demonstrate how our reformulation may be put to use in a “lab-based” scenario. Suppose that one was performing a dihydrogen activation reaction, but the rate of reaction when using the current transition metal catalyst was either zero or much too slow and one wished to optimize the catalyst structure to reduce the reaction barrier and speed up the reaction as much as possible. To this end, we selected a “starting” reaction from the dihydrogen activation data set with a barrier in the upper region of the distribution at 20 kcal mol^–1. We chose the target barrier of the reaction to be optimized to be the very minimum of the data set of dihydrogen activation reactions (1.6 kcal mol^–1), and the process of identifying a reaction with this barrier was the same as our previous experiments.

The starting reaction was added to a data set of four randomly selected reactions (for which barrier measurements would first be performed either computationally or experimentally, in addition to the measurement of the starting reaction barrier). After adding an additional reaction to replace one from the initial sample that did not have a barrier available, this initial sample was used to train the ML model, which was then used to iteratively select the reactions that updated the training set. Table 4 shows the structures that were present in the initial sample and those that were selected until a reaction within 1 kcal mol^–1 of the target value was obtained. Our algorithm sampled six additional reactions before finding a reaction with a catalyst that gave a barrier within 0.7 kcal mol^–1 of the minimum barrier. It is a testament to the efficiency of our approach that this reaction turned out to have the second lowest barrier in the entire data set of dihydrogen activation reactions and that three of the iteratively selected reactions did not have accurate barriers available.

Table 4. Left: The Catalyst Structures That Were Part of the Initial Sample of Reactions Used for the Initial Training of the Model in the ML Search Procedure Include the Starting 20 kcal mol^–1 Catalyst and One of the Initially Selected Reactions That Did Not Have an Accurate Barrier Available. Right: The Catalyst Structures That Were Selected by the ML Search Algorithm, Where Descending the Column Gives the Order of Selection^a.

Open in a new tab

N/A refers to reactions which were sampled but did not have accurate barriers available from the data set.

4. Conclusions

We have reformulated the problem of accurately predicting activation barriers for chemical reactions with ML toward a much more data-efficient process: finding a reaction with a desired target barrier from a prespecified set of possible reactions. In our formulation, each reaction differs from the next by modification of the functional groups around a reaction core. The overall aim of this reformulation is to significantly reduce the number of expensive, accurate barrier measurements that are required in order to obtain a reaction with desirable properties for any particular purpose.

We then proposed an ML-based approach for searching for these reaction spaces. In our data sets of aza-Michael addition and dihydrogen activation reactions, our algorithm was typically able to find reactions with barriers within 1 kcal mol^–1 of a target value with less than 20 accurate barrier measurements. The performance of all our tested algorithms was somewhat reduced for the E2 and S_N2 reactions due to the large number of missing accurate barriers from the original data sets. However, target barriers for the S_N2 data set were still typically found within double-digit numbers of samples and even the E2 sampling requirements are much lower than previous uses of ML to predict activation barriers, the data sets for which often contain thousands of barriers.^{17,19,21,41,42} Additionally, in a case study in the use of our approach for the optimization of a transition metal catalyst for dihydrogen activation, within 12 barrier calculations, we were able to identify a reaction with a barrier within 1 kcal mol^–1 of a target value, finding the second lowest barrier in the data set. This further demonstrated the efficiency of this reformulation in a particularly chemically complex scenario.

Acknowledgments

This work was supported by U.K. Research and Innovation (UKRI) [grant number EP/S023437/1]; and the Engineering and Physical Sciences Research Council [grant number EP/W003724/1]. This work made use of the Balena, Anatra and Nimbus high-performance computing services at the University of Bath with support from the University of Bath’s Research Computing Group (doi.org/10.15125/b6cd-s854). This work was supported by the ART-AI CDT and the University of Bath.

Data Availability Statement

Gaussian output files for the aza-Michael addition reactants and transition states and LDA single-point energies for the dihydrogen activation data set are available from the Unversity of Bath Research Data Archive (10.15125/BATH-01240).⁵⁶ Code and other data are available from https://github.com/the-grayson-group/finding_barriers.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acscatal.3c02513.

Additional details of data sets and plots of barrier distributions and correlations with low-level barriers; methods for all alternative search algorithms; results from ML model assessments with and without chemical features; hyperparamter ranges for all models throughout this work; tables of average numbers of samples used by each ML algorithm for each target barrier in each data set; additional results from analysis of machine learning model during search procedure: feature importances, mean absolute error scores, and results from using scambled features (PDF)

The authors declare no competing financial interest.

Supplementary Material

cs3c02513_si_001.pdf^{(5.7MB, pdf)}

References

Truhlar D. G.; Garrett B. C.; Klippenstein S. J. Current Status of Transition-State Theory. J. Phys. Chem. A 1996, 100, 12771–12800. 10.1021/jp953748q. [DOI] [Google Scholar]
Klippenstein S. J. From theoretical reaction dynamics to chemical modeling of combustion. Proc. Combust. Inst. 2017, 36, 77–111. 10.1016/j.proci.2016.07.100. [DOI] [Google Scholar]
Lam Y.-h.; Grayson M. N.; Holland M. C.; Simon A.; Houk K. N. Theory and Modeling of Asymmetric Catalytic Reactions. Acc. Chem. Res. 2016, 49, 750–762. 10.1021/acs.accounts.6b00006. [DOI] [PubMed] [Google Scholar]
Peng Q.; Duarte F.; Paton R. S. Computing organic stereoselectivity – from concepts to quantitative calculations and predictions. Chem. Soc. Rev. 2016, 45, 6093–6107. 10.1039/C6CS00573J. [DOI] [PubMed] [Google Scholar]
Bachrach S. M. Computational organic chemistry. Annu. Rep. Sect. "B" (Org. Chem.) 2008, 104, 394–426. 10.1039/b719311b. [DOI] [Google Scholar]
Seh Z. W.; Kibsgaard J.; Dickens C. F.; Chorkendorff I.; Nørskov J. K.; Jaramillo T. F. Combining theory and experiment in electrocatalysis: Insights into materials design. Science 2017, 355, eaad4998 10.1126/science.aad4998. [DOI] [PubMed] [Google Scholar]
Cavalli F.; De Keer L.; Huber B.; Van Steenberge P. H. M.; D’hooge D. R.; Barner L. A kinetic study on the para-fluoro-thiol reaction in view of its use in materials design. Polym. Chem. 2019, 10, 2781–2791. 10.1039/C9PY00435A. [DOI] [Google Scholar]
Grom M.; Stavber G.; Drnovšek P.; Likozar B. Modelling chemical kinetics of a complex reaction network of active pharmaceutical ingredient (API) synthesis with process optimization for benzazepine heterocyclic compound. Chem. Eng. J. 2019, 10, 2781–2791. [Google Scholar]
Meghani N. M.; Amin H. H.; Lee B.-J. Mechanistic applications of click chemistry for pharmaceutical drug discovery and drug delivery. Drug Discovery Today 2017, 22, 1604–1619. 10.1016/j.drudis.2017.07.007. [DOI] [PubMed] [Google Scholar]
Cores A.; Clerigué J.; Orocio-Rodríguez E.; Menéndez J. C. Multicomponent Reactions for the Synthesis of Active Pharmaceutical Ingredients. Pharmaceuticals 2022, 15, 1009 10.3390/ph15081009. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kar S.; Sanderson H.; Roy K.; Benfenati E.; Leszczynski J. Green Chemistry in the Synthesis of Pharmaceuticals. Chem. Rev. 2022, 122, 3637–3710. 10.1021/acs.chemrev.1c00631. [DOI] [PubMed] [Google Scholar]
Lin X.; Li X.; Lin X. A Review on Applications of Computational Methods in Drug Screening and Design. Molecules 2020, 25, 1375 10.3390/molecules25061375. [DOI] [PMC free article] [PubMed] [Google Scholar]
Butler K. T.; Davies D. W.; Cartwright H.; Isayev O.; Walsh A. Machine learning for molecular and materials science. Nature. Nature 2018, 559, 547–555. 10.1038/s41586-018-0337-2. [DOI] [PubMed] [Google Scholar]
Meuwly M. Machine Learning for Chemical Reactions. Chem. Rev. 2021, 121, 10218–10239. 10.1021/acs.chemrev.1c00033. [DOI] [PubMed] [Google Scholar]
Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Machine learning activation energies of chemical reactions. WIREs Comput. Mol. Sci. 2022, 12, e1593 10.1002/wcms.1593. [DOI] [Google Scholar]
Jorner K.; Brinck T.; Norrby P.-O.; Buttar D. Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies. Chem. Sci. 2021, 12, 1163–1175. 10.1039/D0SC04896H. [DOI] [PMC free article] [PubMed] [Google Scholar]
Spiekermann K. A.; Pattanaik L.; Green W. H. Fast Predictions of Reaction Barrier Heights: Toward Coupled-Cluster Accuracy. J. Phys. Chem. A 2022, 126, 3976–3986. 10.1021/acs.jpca.2c02614. [DOI] [PubMed] [Google Scholar]
Farrar E. H. E.; Grayson M. N. Machine learning and semi-empirical calculations: a synergistic approach to rapid, accurate, and mechanism-based reaction barrier prediction. Chem. Sci. 2022, 13, 7594–7603. 10.1039/D2SC02925A. [DOI] [PMC free article] [PubMed] [Google Scholar]
Heinen S.; von Rudorff G. F.; von Lilienfeld O. A. Toward the design of chemical reactions: Machine learning barriers of competing mechanisms in reactant space. J. Chem. Phys. 2021, 155, 064105 10.1063/5.0059742. [DOI] [PubMed] [Google Scholar]
Friederich P.; dos Passos Gomes G.; De Bin R.; Aspuru-Guzik A.; Balcells D. Machine learning dihydrogen activation in the chemical space surrounding Vaska’s complex. Chem. Sci. 2020, 11, 4584–4601. 10.1039/D0SC00445F. [DOI] [PMC free article] [PubMed] [Google Scholar]
Grambow C. A.; Pattanaik L.; Green W. H. Deep Learning of Activation Energies. J. Phys. Chem. Lett. 2020, 11, 2992–2997. 10.1021/acs.jpclett.0c00500. [DOI] [PMC free article] [PubMed] [Google Scholar]
von Rudorff G. F.; Heinen S. N.; Bragato M.; von Lilienfeld O. A. Thousands of reactants and transition states for competing E2 and S_N2 reactions. Mach. Learn. Sci. Technol. 2020, 1, 045026 10.1088/2632-2153/aba822. [DOI] [Google Scholar]
Liu R.; Wang H.; Glover K. P.; Feasel M. G.; Wallqvist A. Dissecting Machine-Learning Prediction of Molecular Activity: Is an Applicability Domain Needed for Quantitative Structure-Activity Relationship Models Based on Deep Neural Networks?. J. Chem. Inf. Model. 2019, 59, 117–126. 10.1021/acs.jcim.8b00348. [DOI] [PubMed] [Google Scholar]
Fechner N.; Jahn A.; Hinselmann G.; Zell A. Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J. Cheminf. 2010, 2, 2 10.1186/1758-2946-2-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
Artrith N.; Butler K. T.; Coudert F.-X.; Han S.; Isayev O.; Jain A.; Walsh A. Best practices in machine learning for chemistry. Nat. Chem. 2021, 13, 505–508. 10.1038/s41557-021-00716-z. [DOI] [PubMed] [Google Scholar]
Zhang Y.; Wang L.; Wang X.; Zhang C.; Ge J.; Tang J.; Su A.; Duan H. Data augmentation and transfer learning strategies for reaction prediction in low chemical data regimes. Org. Chem. Front. 2021, 8, 1415–1423. 10.1039/D0QO01636E. [DOI] [Google Scholar]
Jackson R.; Zhang W.; Pearson J. TSNet: predicting transition state structures with tensor field networks and transfer learning. Chem. Sci. 2021, 12, 10022–10040. 10.1039/D1SC01206A. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yamada H.; Liu C.; Wu S.; Koyama Y.; Ju S.; Shiomi J.; Morikawa J.; Yoshida R. Predicting Materials Properties with Little Data Using Shotgun Transfer Learning. ACS Cent. Sci. 2019, 5, 1717–1730. 10.1021/acscentsci.9b00804. [DOI] [PMC free article] [PubMed] [Google Scholar]
Espley S. G.; Farrar E. H. E.; Buttar D.; Tomasi S.; Grayson M. N. Machine learning reaction barriers in low data regimes: a horizontal and diagonal transfer learning approach. Digital Discovery 2023, 2, 941–951. 10.1039/D3DD00085K. [DOI] [Google Scholar]
Day O.; Khoshgoftaar T. M. A survey on heterogeneous transfer learning. J. Big Data 2017, 4, 29 10.1186/s40537-017-0089-0. [DOI] [Google Scholar]
Neyshabur B.; Sedghi H.; Zhang C. What is being transferred in transfer learning?. Adv. Neural Inf. Process. Syst. 2020, 512–523. [Google Scholar]
Wang J.; Chen Y.; Hao S.; Feng W.; Shen Z. In Balanced Distribution Adaptation for Transfer Learning, IEEE International Conference on Data Mining (ICDM); IEEE, 2017; pp 1129–1134.
Bhawal B. N.; Reisenbauer J. C.; Ehinger C.; Morandi B. Overcoming Selectivity Issues in Reversible Catalysis: A Transfer Hydrocyanation Exhibiting High Kinetic Control. J. Am. Chem. Soc. 2020, 142, 10914–10920. 10.1021/jacs.0c03184. [DOI] [PubMed] [Google Scholar]
Omann L.; Pudasaini B.; Irran E.; Klare H. F. T.; Baik M.-H.; Oestreich M. Thermodynamic versus kinetic control in substituent redistribution reactions of silylium ions steered by the counteranion. Chem. Sci. 2018, 9, 5600–5607. 10.1039/C8SC01833B. [DOI] [PMC free article] [PubMed] [Google Scholar]
Flanagan M. E.; Abramite J. A.; Anderson D. P.; Aulabaugh A.; Dahal U. P.; Gilbert A. M.; Li C.; Montgomery J.; Oppenheimer S. R.; Ryder T.; Schuff B. P.; Uccello D. P.; Walker G. S.; Wu Y.; Brown M. F.; Chen J. M.; Hayward M. M.; Noe M. C.; Obach R. S.; Philippe L.; Shanmugasundaram V.; Shapiro M. J.; Starr J.; Stroh J.; Che Y. Chemical and Computational Methods for the Characterization of Covalent Reactive Groups for the Prospective Design of Irreversible Inhibitors. J. Med. Chem. 2014, 57, 10072–10079. 10.1021/jm501412a. [DOI] [PubMed] [Google Scholar]
Lonsdale R.; Burgess J.; Colclough N.; Davies N. L.; Lenz E. M.; Orton A. L.; Ward R. A. Expanding the Armory: Predicting and Tuning Covalent Warhead Reactivity. J. Chem. Inf. Model. 2017, 57, 3124–3137. 10.1021/acs.jcim.7b00553. [DOI] [PubMed] [Google Scholar]
Townsend P. A.; Grayson M. N. Density Functional Theory Transition-State Modeling for the Prediction of Ames Mutagenicity in 1,4 Michael Acceptors. J. Chem. Inf. Model. 2019, 59, 5099–5103. 10.1021/acs.jcim.9b00966. [DOI] [PubMed] [Google Scholar]
Townsend P. A.; Grayson M. N. Reactivity prediction in aza-Michael additions without transition state calculations: the Ames test for mutagenicity. Chem. Commun. 2020, 56, 13661–13664. 10.1039/D0CC05681B. [DOI] [PubMed] [Google Scholar]
Shannon D. A.; Weerapana E. Covalent protein modification: the current landscape of residue-specific electrophiles. Curr. Opin. Chem. Biol. 2015, 24, 18–26. 10.1016/j.cbpa.2014.10.021. [DOI] [PubMed] [Google Scholar]
Gehringer M.; Laufer S. A. Emerging and Re-Emerging Warheads for Targeted Covalent Inhibitors: Applications in Medicinal Chemistry and Chemical Biology. J. Med. Chem. 2019, 62, 5673–5724. 10.1021/acs.jmedchem.8b01153. [DOI] [PubMed] [Google Scholar]
Bragato M.; von Rudorff G. F.; von Lilienfeld O. A. Data enhanced Hammett-equation: reaction barriers in chemical space. Chem. Sci. 2020, 11, 11859–11868. 10.1039/D0SC04235H. [DOI] [PMC free article] [PubMed] [Google Scholar]
Choi S.; Kim Y.; Kim J. W.; Kim Z.; Kim W. Y. Feasibility of Activation Energy Prediction of Gas-Phase Reactions by Machine Learning. Chem. - Eur. J. 2018, 24, 12354–12358. 10.1002/chem.201800345. [DOI] [PubMed] [Google Scholar]
Maestro | Schrödinger, 2023. https://www.schrodinger.com/products/maestro. [Accessed August 2023].
Halgren T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490–519. [Google Scholar]
MacroModel | Schrödinger, 2023. https://www.schrodinger.com/products/macromodel [Accessed August 2023].
Mohamadi F.; Richards N. G. J.; Guida W. C.; Liskamp R.; Lipton M.; Caufield C.; Chang G.; Hendrickson T.; Still W. C. Macromodeln integrated software system for modeling organic and bioorganic molecules using molecular mechanics. J. Comput. Chem. 1990, 11, 440–467. 10.1002/jcc.540110405. [DOI] [Google Scholar]
Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Comparisons of different force fields in conformational analysis and searching of organic molecules: A review. Tetrahedron 2021, 79, 131865 10.1016/j.tet.2020.131865. [DOI] [Google Scholar]
Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Comparing the Performances of Force Fields in Conformational Searching of Hydrogen-Bond-Donating Catalysts. J. Org. Chem. 2022, 87, 5703–5712. 10.1021/acs.joc.2c00066. [DOI] [PMC free article] [PubMed] [Google Scholar]
Frisch M. J.; Trucks G. W.; Schlegel H. B.; Scuseria G. E.; Robb M. A.; Cheeseman J. R.; Scalmani G.; Barone V.; Petersson G. A.; Nakatsuji H.; Li X.; Caricato M.; Marenich A. V.; Bloino J.; Janesko B. G.; Gomperts R.; Mennucci B.; Hratchian H. P.; Ortiz J. V.; Izmaylov A. F.; Sonnenberg J. L.; Williams-Young D.; Ding F.; Lipparini F.; Egidi F.; Goings J.; Peng B.; Petrone A.; Henderson T.; Ranasinghe D.; Zakrzewski V. G.; Gao J.; Rega N.; Zheng G.; Liang W.; Hada M.; Ehara M.; Toyota K.; Fukuda R.; Hasegawa J.; Ishida M.; Nakajima T.; Honda Y.; Kitao O.; Nakai H.; Vreven T.; Throssell K.; Montgomery J. A. Jr.; Peralta J. E.; Ogliaro F.; Bearpark M. J.; Heyd J. J.; Brothers E. N.; Kudin K. N.; Staroverov V. N.; Keith T. A.; Kobayashi R.; Normand J.; Raghavachari K.; Rendell A. P.; Burant J. C.; Iyengar S. S.; Tomasi J.; Cossi M.; Millam J. M.; Klene M.; Adamo C.; Cammi R.; Ochterski J. W.; Martin R. L.; Morokuma K.; Farkas O.; Foresman J. B.; Fox D. J.. Gaussian 16, revision A.03; Gaussian Inc.: Wallingford CT, 2016.
Frisch M. J.; Trucks G. W.; Schlegel H. B.; Scuseria G. E.; Robb M. A.; Cheeseman J. R.; Scalmani G.; Barone V.; Petersson G. A.; Nakatsuji H.; Li X.; Caricato M.; Marenich A. V.; Bloino J.; Janesko B. G.; Gomperts R.; Mennucci B.; Hratchian H. P.; Ortiz J. V.; Izmaylov A. F.; Sonnenberg J. L.; Williams-Young D.; Ding F.; Lipparini F.; Egidi F.; Goings J.; Peng B.; Petrone A.; Henderson T.; Ranasinghe D.; Zakrzewski V. G.; Gao J.; Rega N.; Zheng G.; Liang W.; Hada M.; Ehara M.; Toyota K.; Fukuda R.; Hasegawa J.; Ishida M.; Nakajima T.; Honda Y.; Kitao O.; Nakai H.; Vreven T.; Throssell K.; Montgomery J. A. Jr.; Peralta J. E.; Ogliaro F.; Bearpark M. J.; Heyd J. J.; Brothers E. N.; Kudin K. N.; Staroverov V. N.; Keith T. A.; Kobayashi R.; Normand J.; Raghavachari K.; Rendell A. P.; Burant J. C.; Iyengar S. S.; Tomasi J.; Cossi M.; Millam J. M.; Klene M.; Adamo C.; Cammi R.; Ochterski J. W.; Martin R. L.; Morokuma K.; Farkas O.; Foresman J. B.; Fox D. J.. Gaussian 16, revision C.01; Gaussian Inc.: Wallingford CT, 2016.
Cappelli C.; Lipparini F.; Bloino J.; Barone V. Towards an accurate description of anharmonic infrared spectra in solution within the polarizable continuum model: Reaction field, cavity field and nonequilibrium effects. J. Chem. Phys. 2011, 135, 104505 10.1063/1.3630920. [DOI] [PubMed] [Google Scholar]
Friederich P.Vaska’s space, 2019. https://doi.org/10.5683/SP2/CJS7QA. [Accessed August 2023].
Grimme S.; Antony J.; Ehrlich S.; Krieg H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J. Chem. Phys. 2010, 132, 154104 10.1063/1.3382344. [DOI] [PubMed] [Google Scholar]
von Rudorff G. F.; Heinen S. N.; Bragato M.; von Lilienfeld O. A.. QMrxn20: Thousands of reactants and transition states for competing E2 and SN2 reactions, Materials Cloud Archive 2020.55, 2022. https://doi.org/10.24435/materialscloud:sf-tz. [Accessed August 2023].
Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Lewis-Atwell T.; Beechey D.; Şimşek Ö.; Grayson M.. Dataset for ″Reformulating Reactivity Design for Data-Efficient Machine Learning″, 2023. https://doi.org/10.15125/BATH-01240. University of Bath Research Data Archive, [Accessed August 2023]. [DOI] [PMC free article] [PubMed]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

cs3c02513_si_001.pdf^{(5.7MB, pdf)}

Data Availability Statement

[ref1] Truhlar D. G.; Garrett B. C.; Klippenstein S. J. Current Status of Transition-State Theory. J. Phys. Chem. A 1996, 100, 12771–12800. 10.1021/jp953748q. [DOI] [Google Scholar]

[ref2] Klippenstein S. J. From theoretical reaction dynamics to chemical modeling of combustion. Proc. Combust. Inst. 2017, 36, 77–111. 10.1016/j.proci.2016.07.100. [DOI] [Google Scholar]

[ref3] Lam Y.-h.; Grayson M. N.; Holland M. C.; Simon A.; Houk K. N. Theory and Modeling of Asymmetric Catalytic Reactions. Acc. Chem. Res. 2016, 49, 750–762. 10.1021/acs.accounts.6b00006. [DOI] [PubMed] [Google Scholar]

[ref4] Peng Q.; Duarte F.; Paton R. S. Computing organic stereoselectivity – from concepts to quantitative calculations and predictions. Chem. Soc. Rev. 2016, 45, 6093–6107. 10.1039/C6CS00573J. [DOI] [PubMed] [Google Scholar]

[ref5] Bachrach S. M. Computational organic chemistry. Annu. Rep. Sect. "B" (Org. Chem.) 2008, 104, 394–426. 10.1039/b719311b. [DOI] [Google Scholar]

[ref6] Seh Z. W.; Kibsgaard J.; Dickens C. F.; Chorkendorff I.; Nørskov J. K.; Jaramillo T. F. Combining theory and experiment in electrocatalysis: Insights into materials design. Science 2017, 355, eaad4998 10.1126/science.aad4998. [DOI] [PubMed] [Google Scholar]

[ref7] Cavalli F.; De Keer L.; Huber B.; Van Steenberge P. H. M.; D’hooge D. R.; Barner L. A kinetic study on the para-fluoro-thiol reaction in view of its use in materials design. Polym. Chem. 2019, 10, 2781–2791. 10.1039/C9PY00435A. [DOI] [Google Scholar]

[ref8] Grom M.; Stavber G.; Drnovšek P.; Likozar B. Modelling chemical kinetics of a complex reaction network of active pharmaceutical ingredient (API) synthesis with process optimization for benzazepine heterocyclic compound. Chem. Eng. J. 2019, 10, 2781–2791. [Google Scholar]

[ref9] Meghani N. M.; Amin H. H.; Lee B.-J. Mechanistic applications of click chemistry for pharmaceutical drug discovery and drug delivery. Drug Discovery Today 2017, 22, 1604–1619. 10.1016/j.drudis.2017.07.007. [DOI] [PubMed] [Google Scholar]

[ref10] Cores A.; Clerigué J.; Orocio-Rodríguez E.; Menéndez J. C. Multicomponent Reactions for the Synthesis of Active Pharmaceutical Ingredients. Pharmaceuticals 2022, 15, 1009 10.3390/ph15081009. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref11] Kar S.; Sanderson H.; Roy K.; Benfenati E.; Leszczynski J. Green Chemistry in the Synthesis of Pharmaceuticals. Chem. Rev. 2022, 122, 3637–3710. 10.1021/acs.chemrev.1c00631. [DOI] [PubMed] [Google Scholar]

[ref12] Lin X.; Li X.; Lin X. A Review on Applications of Computational Methods in Drug Screening and Design. Molecules 2020, 25, 1375 10.3390/molecules25061375. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref13] Butler K. T.; Davies D. W.; Cartwright H.; Isayev O.; Walsh A. Machine learning for molecular and materials science. Nature. Nature 2018, 559, 547–555. 10.1038/s41586-018-0337-2. [DOI] [PubMed] [Google Scholar]

[ref14] Meuwly M. Machine Learning for Chemical Reactions. Chem. Rev. 2021, 121, 10218–10239. 10.1021/acs.chemrev.1c00033. [DOI] [PubMed] [Google Scholar]

[ref15] Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Machine learning activation energies of chemical reactions. WIREs Comput. Mol. Sci. 2022, 12, e1593 10.1002/wcms.1593. [DOI] [Google Scholar]

[ref16] Jorner K.; Brinck T.; Norrby P.-O.; Buttar D. Machine learning meets mechanistic modelling for accurate prediction of experimental activation energies. Chem. Sci. 2021, 12, 1163–1175. 10.1039/D0SC04896H. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref17] Spiekermann K. A.; Pattanaik L.; Green W. H. Fast Predictions of Reaction Barrier Heights: Toward Coupled-Cluster Accuracy. J. Phys. Chem. A 2022, 126, 3976–3986. 10.1021/acs.jpca.2c02614. [DOI] [PubMed] [Google Scholar]

[ref18] Farrar E. H. E.; Grayson M. N. Machine learning and semi-empirical calculations: a synergistic approach to rapid, accurate, and mechanism-based reaction barrier prediction. Chem. Sci. 2022, 13, 7594–7603. 10.1039/D2SC02925A. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref19] Heinen S.; von Rudorff G. F.; von Lilienfeld O. A. Toward the design of chemical reactions: Machine learning barriers of competing mechanisms in reactant space. J. Chem. Phys. 2021, 155, 064105 10.1063/5.0059742. [DOI] [PubMed] [Google Scholar]

[ref20] Friederich P.; dos Passos Gomes G.; De Bin R.; Aspuru-Guzik A.; Balcells D. Machine learning dihydrogen activation in the chemical space surrounding Vaska’s complex. Chem. Sci. 2020, 11, 4584–4601. 10.1039/D0SC00445F. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref21] Grambow C. A.; Pattanaik L.; Green W. H. Deep Learning of Activation Energies. J. Phys. Chem. Lett. 2020, 11, 2992–2997. 10.1021/acs.jpclett.0c00500. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref22] von Rudorff G. F.; Heinen S. N.; Bragato M.; von Lilienfeld O. A. Thousands of reactants and transition states for competing E2 and S_N2 reactions. Mach. Learn. Sci. Technol. 2020, 1, 045026 10.1088/2632-2153/aba822. [DOI] [Google Scholar]

[ref23] Liu R.; Wang H.; Glover K. P.; Feasel M. G.; Wallqvist A. Dissecting Machine-Learning Prediction of Molecular Activity: Is an Applicability Domain Needed for Quantitative Structure-Activity Relationship Models Based on Deep Neural Networks?. J. Chem. Inf. Model. 2019, 59, 117–126. 10.1021/acs.jcim.8b00348. [DOI] [PubMed] [Google Scholar]

[ref24] Fechner N.; Jahn A.; Hinselmann G.; Zell A. Estimation of the applicability domain of kernel-based machine learning models for virtual screening. J. Cheminf. 2010, 2, 2 10.1186/1758-2946-2-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref25] Artrith N.; Butler K. T.; Coudert F.-X.; Han S.; Isayev O.; Jain A.; Walsh A. Best practices in machine learning for chemistry. Nat. Chem. 2021, 13, 505–508. 10.1038/s41557-021-00716-z. [DOI] [PubMed] [Google Scholar]

[ref26] Zhang Y.; Wang L.; Wang X.; Zhang C.; Ge J.; Tang J.; Su A.; Duan H. Data augmentation and transfer learning strategies for reaction prediction in low chemical data regimes. Org. Chem. Front. 2021, 8, 1415–1423. 10.1039/D0QO01636E. [DOI] [Google Scholar]

[ref27] Jackson R.; Zhang W.; Pearson J. TSNet: predicting transition state structures with tensor field networks and transfer learning. Chem. Sci. 2021, 12, 10022–10040. 10.1039/D1SC01206A. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref28] Yamada H.; Liu C.; Wu S.; Koyama Y.; Ju S.; Shiomi J.; Morikawa J.; Yoshida R. Predicting Materials Properties with Little Data Using Shotgun Transfer Learning. ACS Cent. Sci. 2019, 5, 1717–1730. 10.1021/acscentsci.9b00804. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref29] Espley S. G.; Farrar E. H. E.; Buttar D.; Tomasi S.; Grayson M. N. Machine learning reaction barriers in low data regimes: a horizontal and diagonal transfer learning approach. Digital Discovery 2023, 2, 941–951. 10.1039/D3DD00085K. [DOI] [Google Scholar]

[ref30] Day O.; Khoshgoftaar T. M. A survey on heterogeneous transfer learning. J. Big Data 2017, 4, 29 10.1186/s40537-017-0089-0. [DOI] [Google Scholar]

[ref31] Neyshabur B.; Sedghi H.; Zhang C. What is being transferred in transfer learning?. Adv. Neural Inf. Process. Syst. 2020, 512–523. [Google Scholar]

[ref32] Wang J.; Chen Y.; Hao S.; Feng W.; Shen Z. In Balanced Distribution Adaptation for Transfer Learning, IEEE International Conference on Data Mining (ICDM); IEEE, 2017; pp 1129–1134.

[ref33] Bhawal B. N.; Reisenbauer J. C.; Ehinger C.; Morandi B. Overcoming Selectivity Issues in Reversible Catalysis: A Transfer Hydrocyanation Exhibiting High Kinetic Control. J. Am. Chem. Soc. 2020, 142, 10914–10920. 10.1021/jacs.0c03184. [DOI] [PubMed] [Google Scholar]

[ref34] Omann L.; Pudasaini B.; Irran E.; Klare H. F. T.; Baik M.-H.; Oestreich M. Thermodynamic versus kinetic control in substituent redistribution reactions of silylium ions steered by the counteranion. Chem. Sci. 2018, 9, 5600–5607. 10.1039/C8SC01833B. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref35] Flanagan M. E.; Abramite J. A.; Anderson D. P.; Aulabaugh A.; Dahal U. P.; Gilbert A. M.; Li C.; Montgomery J.; Oppenheimer S. R.; Ryder T.; Schuff B. P.; Uccello D. P.; Walker G. S.; Wu Y.; Brown M. F.; Chen J. M.; Hayward M. M.; Noe M. C.; Obach R. S.; Philippe L.; Shanmugasundaram V.; Shapiro M. J.; Starr J.; Stroh J.; Che Y. Chemical and Computational Methods for the Characterization of Covalent Reactive Groups for the Prospective Design of Irreversible Inhibitors. J. Med. Chem. 2014, 57, 10072–10079. 10.1021/jm501412a. [DOI] [PubMed] [Google Scholar]

[ref36] Lonsdale R.; Burgess J.; Colclough N.; Davies N. L.; Lenz E. M.; Orton A. L.; Ward R. A. Expanding the Armory: Predicting and Tuning Covalent Warhead Reactivity. J. Chem. Inf. Model. 2017, 57, 3124–3137. 10.1021/acs.jcim.7b00553. [DOI] [PubMed] [Google Scholar]

[ref37] Townsend P. A.; Grayson M. N. Density Functional Theory Transition-State Modeling for the Prediction of Ames Mutagenicity in 1,4 Michael Acceptors. J. Chem. Inf. Model. 2019, 59, 5099–5103. 10.1021/acs.jcim.9b00966. [DOI] [PubMed] [Google Scholar]

[ref38] Townsend P. A.; Grayson M. N. Reactivity prediction in aza-Michael additions without transition state calculations: the Ames test for mutagenicity. Chem. Commun. 2020, 56, 13661–13664. 10.1039/D0CC05681B. [DOI] [PubMed] [Google Scholar]

[ref39] Shannon D. A.; Weerapana E. Covalent protein modification: the current landscape of residue-specific electrophiles. Curr. Opin. Chem. Biol. 2015, 24, 18–26. 10.1016/j.cbpa.2014.10.021. [DOI] [PubMed] [Google Scholar]

[ref40] Gehringer M.; Laufer S. A. Emerging and Re-Emerging Warheads for Targeted Covalent Inhibitors: Applications in Medicinal Chemistry and Chemical Biology. J. Med. Chem. 2019, 62, 5673–5724. 10.1021/acs.jmedchem.8b01153. [DOI] [PubMed] [Google Scholar]

[ref41] Bragato M.; von Rudorff G. F.; von Lilienfeld O. A. Data enhanced Hammett-equation: reaction barriers in chemical space. Chem. Sci. 2020, 11, 11859–11868. 10.1039/D0SC04235H. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref42] Choi S.; Kim Y.; Kim J. W.; Kim Z.; Kim W. Y. Feasibility of Activation Energy Prediction of Gas-Phase Reactions by Machine Learning. Chem. - Eur. J. 2018, 24, 12354–12358. 10.1002/chem.201800345. [DOI] [PubMed] [Google Scholar]

[ref43] Maestro | Schrödinger, 2023. https://www.schrodinger.com/products/maestro. [Accessed August 2023].

[ref44] Halgren T. A. Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996, 17, 490–519. [Google Scholar]

[ref45] MacroModel | Schrödinger, 2023. https://www.schrodinger.com/products/macromodel [Accessed August 2023].

[ref46] Mohamadi F.; Richards N. G. J.; Guida W. C.; Liskamp R.; Lipton M.; Caufield C.; Chang G.; Hendrickson T.; Still W. C. Macromodeln integrated software system for modeling organic and bioorganic molecules using molecular mechanics. J. Comput. Chem. 1990, 11, 440–467. 10.1002/jcc.540110405. [DOI] [Google Scholar]

[ref47] Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Comparisons of different force fields in conformational analysis and searching of organic molecules: A review. Tetrahedron 2021, 79, 131865 10.1016/j.tet.2020.131865. [DOI] [Google Scholar]

[ref48] Lewis-Atwell T.; Townsend P. A.; Grayson M. N. Comparing the Performances of Force Fields in Conformational Searching of Hydrogen-Bond-Donating Catalysts. J. Org. Chem. 2022, 87, 5703–5712. 10.1021/acs.joc.2c00066. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref49] Frisch M. J.; Trucks G. W.; Schlegel H. B.; Scuseria G. E.; Robb M. A.; Cheeseman J. R.; Scalmani G.; Barone V.; Petersson G. A.; Nakatsuji H.; Li X.; Caricato M.; Marenich A. V.; Bloino J.; Janesko B. G.; Gomperts R.; Mennucci B.; Hratchian H. P.; Ortiz J. V.; Izmaylov A. F.; Sonnenberg J. L.; Williams-Young D.; Ding F.; Lipparini F.; Egidi F.; Goings J.; Peng B.; Petrone A.; Henderson T.; Ranasinghe D.; Zakrzewski V. G.; Gao J.; Rega N.; Zheng G.; Liang W.; Hada M.; Ehara M.; Toyota K.; Fukuda R.; Hasegawa J.; Ishida M.; Nakajima T.; Honda Y.; Kitao O.; Nakai H.; Vreven T.; Throssell K.; Montgomery J. A. Jr.; Peralta J. E.; Ogliaro F.; Bearpark M. J.; Heyd J. J.; Brothers E. N.; Kudin K. N.; Staroverov V. N.; Keith T. A.; Kobayashi R.; Normand J.; Raghavachari K.; Rendell A. P.; Burant J. C.; Iyengar S. S.; Tomasi J.; Cossi M.; Millam J. M.; Klene M.; Adamo C.; Cammi R.; Ochterski J. W.; Martin R. L.; Morokuma K.; Farkas O.; Foresman J. B.; Fox D. J.. Gaussian 16, revision A.03; Gaussian Inc.: Wallingford CT, 2016.

[ref50] Frisch M. J.; Trucks G. W.; Schlegel H. B.; Scuseria G. E.; Robb M. A.; Cheeseman J. R.; Scalmani G.; Barone V.; Petersson G. A.; Nakatsuji H.; Li X.; Caricato M.; Marenich A. V.; Bloino J.; Janesko B. G.; Gomperts R.; Mennucci B.; Hratchian H. P.; Ortiz J. V.; Izmaylov A. F.; Sonnenberg J. L.; Williams-Young D.; Ding F.; Lipparini F.; Egidi F.; Goings J.; Peng B.; Petrone A.; Henderson T.; Ranasinghe D.; Zakrzewski V. G.; Gao J.; Rega N.; Zheng G.; Liang W.; Hada M.; Ehara M.; Toyota K.; Fukuda R.; Hasegawa J.; Ishida M.; Nakajima T.; Honda Y.; Kitao O.; Nakai H.; Vreven T.; Throssell K.; Montgomery J. A. Jr.; Peralta J. E.; Ogliaro F.; Bearpark M. J.; Heyd J. J.; Brothers E. N.; Kudin K. N.; Staroverov V. N.; Keith T. A.; Kobayashi R.; Normand J.; Raghavachari K.; Rendell A. P.; Burant J. C.; Iyengar S. S.; Tomasi J.; Cossi M.; Millam J. M.; Klene M.; Adamo C.; Cammi R.; Ochterski J. W.; Martin R. L.; Morokuma K.; Farkas O.; Foresman J. B.; Fox D. J.. Gaussian 16, revision C.01; Gaussian Inc.: Wallingford CT, 2016.

[ref51] Cappelli C.; Lipparini F.; Bloino J.; Barone V. Towards an accurate description of anharmonic infrared spectra in solution within the polarizable continuum model: Reaction field, cavity field and nonequilibrium effects. J. Chem. Phys. 2011, 135, 104505 10.1063/1.3630920. [DOI] [PubMed] [Google Scholar]

[ref52] Friederich P.Vaska’s space, 2019. https://doi.org/10.5683/SP2/CJS7QA. [Accessed August 2023].

[ref53] Grimme S.; Antony J.; Ehrlich S.; Krieg H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J. Chem. Phys. 2010, 132, 154104 10.1063/1.3382344. [DOI] [PubMed] [Google Scholar]

[ref54] von Rudorff G. F.; Heinen S. N.; Bragato M.; von Lilienfeld O. A.. QMrxn20: Thousands of reactants and transition states for competing E2 and SN2 reactions, Materials Cloud Archive 2020.55, 2022. https://doi.org/10.24435/materialscloud:sf-tz. [Accessed August 2023].

[ref55] Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]

[ref56] Lewis-Atwell T.; Beechey D.; Şimşek Ö.; Grayson M.. Dataset for ″Reformulating Reactivity Design for Data-Efficient Machine Learning″, 2023. https://doi.org/10.15125/BATH-01240. University of Bath Research Data Archive, [Accessed August 2023]. [DOI] [PMC free article] [PubMed]

PERMALINK

Reformulating Reactivity Design for Data-Efficient Machine Learning

Toby Lewis-Atwell

Daniel Beechey

Özgür Şimşek

Matthew N Grayson

Abstract

1. Introduction