Author manuscript; available in PMC: 2021 Jul 28.
Published in final edited form as: ACS Catal. 2020 Nov 5;10(22):13504–13517. doi: 10.1021/acscatal.0c03939

Iterative Supervised Principal Component Analysis Driven Ligand Design for Regioselective Ti-Catalyzed Pyrrole Synthesis

Xin Yi See 1, Xuelan Wen 2, T Alexander Wheeler 3, Channing K Klein 4, Jason D Goodpaster 5, Benjamin R Reiner 6, Ian A Tonks 7
PMCID: PMC8318334  NIHMSID: NIHMS1662623  PMID: 34327040

Abstract

The rational design of catalysts remains a challenging endeavor within the broader chemical community owing to the myriad variables that can affect key bond-forming events. Designing selective catalysts for any reaction requires an efficient strategy for discovering predictive structure–activity relationships. Herein, we describe the use of iterative supervised principal component analysis (ISPCA) in de novo catalyst design. The regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) via a Ti-catalyzed formal [2 + 2 + 1] cycloaddition of phenylpropyne and azobenzene was targeted as a proof of principle. The initial reaction conditions led to an unselective mixture of all possible pyrrole regioisomers. ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this PCA space and k-means clustering were used to inform the design of new test catalysts. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and optimal candidates were synthesized and tested experimentally. This data-driven predictive-modeling workflow was iterated, and after only three generations the catalytic selectivity was improved from 0.5 (statistical mixture of products) to over 11 (>90% C) by incorporating 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine as a ligand. The origin of catalyst selectivity was probed by examining ISPCA variable loadings in combination with DFT modeling, revealing that ligand lability plays an important role in selectivity. A parallel catalyst search using multivariate linear regression (MLR), a popular approach in catalysis informatics, was also conducted in order to compare these strategies in a hypothetical catalyst scouting campaign.
ISPCA appears to be more robust and predictive than MLR when sparse training sets are used that are representative of the data available during the early search for an optimal catalyst. The successful development of a highly selective catalyst without resorting to long, stochastic screening processes demonstrates the inherent power of ISPCA in de novo catalyst design and should motivate the general use of ISPCA in reaction development.

Keywords: iterative supervised principal component analysis, selectivity, DFT, catalyst prediction, titanium, pyrrole


■ INTRODUCTION

The rational design of selective catalysts remains an outstanding challenge in organometallic chemistry and the broader synthetic community. A detailed mechanistic understanding of key bond-forming events can provide an avenue for improvement, but even for relatively simple systems the number of potential variables that affect catalysis—intermediate and transition state structures, as well as solvent and additive effects—can be staggering. These can often affect other aspects of catalysis beyond selectivity (rate, yield, etc.) as well.1-4 Moreover, the mechanisms of many catalytic systems are difficult to interrogate because of experimental, spectroscopic, or computational limitations. With the need for an improved organic methodology as motivation, a variety of strategies have been developed to simplify the search for improved catalysis. For example, design of experiment (DOE)5,6 and high-throughput experimentation (HTE)7-9 are commonly employed to expedite otherwise long screening processes. Nonetheless, the most substantial research bottleneck in identifying optimal catalysts is often catalyst synthesis and purification, which can take hours or even days. In this context, there is great value in using statistical analysis to (1) facilitate the search for an ideal catalyst and (2) develop a quantitative structure–activity relationship (QSAR) for further intuitive prediction. In many instances, training sets are inherently sparse and the chemical space often undersampled. Thus, there is great incentive to develop statistical models that lead to a rapid prediction of candidate catalysts with as few data as possible. Ideal models will be both simple and highly predictive. 
Univariate analyses have been popular historically owing to their simplicity and interpretability; however, classical descriptors (e.g., Hammett parameters, Tolman cone angles, etc.)10-12 can be inadequate in predicting complex catalytic systems.2,13 Modern approaches have directed a significant effort at the computer-aided design of selective catalysts,14-21 and in recent years, machine learning22-28 and multivariate linear regression4,13,29,30 (MLR) analysis have emerged as powerful tools for detailed modeling and prediction in chemical catalysis.

Building off these prior efforts, we were interested in developing an approach to facilitate the search for an optimal catalyst that would be both highly predictive and also chemically intuitive. In this context, principal component analysis (PCA) is an attractive option in balancing predictive power and sensible interpretation. PCA is a statistical procedure that aims to transform a set of possibly correlated variables into a set of linearly orthogonal variables called components, which describe the variance in the original data set. PCA is commonly used in bioinformatics and chemometrics applications, where analysis of an unwieldy number of variables is facilitated by data reduction.31-36 Despite the success of PCA in other chemical contexts, the use of PCA in catalysis has been underexplored. Notably, reports demonstrating the extrapolation of PCA-borne structure–activity relationships to practical de novo catalyst design are rare. The Fey lab has reported the use of PCA ligand maps, denoted “ligand knowledge bases” (LIKBs)1,37-44 to guide the development of hydroformylation/hydrocyanation catalysts.45 The Rothenberg46,47 and Jensen17 laboratories have utilized a similar statistical technique, partial least-squares regression (PLSR), to inform the design of olefin isomerization and metathesis catalysts, respectively.

Herein, we report the use of iterative supervised PCA (ISPCA) in aiding rational catalyst optimization (Figure 1). ISPCA is a powerful technique because it is simple to implement, is computationally inexpensive, and does not require large data sets to provide meaningful insights into catalyst design. We have chosen the regioselective Ti-catalyzed [2 + 2 + 1] synthesis of substituted pyrroles as a proof of principle to demonstrate how this strategy can be deployed to improve any nascent or established methodology. Furthermore, we have executed a parallel analysis of ISPCA and MLR in a hypothetical catalyst scouting campaign for regioselective [2 + 2 + 1] pyrrole synthesis. ISPCA appears to be more robust and predictive than MLR when sparse training sets are used that are representative of the data available during the search for an optimal catalyst. However, MLR appears to be the superior method for developing a global catalyst structure–activity model. ISPCA is well poised for use where (1) a clear trend using univariate analysis cannot be established and/or (2) multiple steps are affecting a reaction. At its core, the ability to reduce data with PCA allows for a facile identification of key descriptors and serves as a strategy to complement chemists’ intuition in developing new reactions.

Figure 1.


Graphical outline of the iterative supervised principal component analysis (ISPCA) strategy. (1) Generate a descriptor basis set for hypothetical catalysts using metrics that span a wide range of electronic, steric, and geometric parameters. (2) Conduct principal component analysis (PCA) on a training set of catalysts described by the descriptor basis set, determine the top three principal components (PCs), and recast the data onto the PCs. (3) Regress the catalyst scores (coordinates in PCA space) onto catalyst selectivity. (4) If the fit statistics are inadequate, iteratively optimize the descriptor basis set by exhaustively searching all combinations of descriptors. (5) Determine linear regression coefficients from the optimized PCA space. (6) Use regression coefficients to inform a set of new test catalysts, using the distance in PCA space (see text) to cull potential catalysts in silico.

■ APPROACH

In this work, we have developed an iterative supervised PCA (ISPCA) strategy36 that exhaustively searches all catalyst descriptor combinations, in an effort to provide a model wherein the fewest descriptors describe the most variance in catalyst performance data (Figure 1). This leads to an ideal model that is both simple and highly predictive. First, a basis set of potential descriptors of interest is computed/tabulated for a catalyst training set (Figure 1, box 1). The catalyst scores (coordinates) in PCA space are determined for this initial descriptor basis set (Figure 1, box 2). Next, catalyst scores are linearly regressed against the catalyst performance data, and the regression model is evaluated for statistical fit (R2/Q2, Figure 1, box 3). Given that the initial basis set may contain redundant, irrelevant, or correlated descriptors, the regression fit may be poor. As a result, the descriptor basis set is iteratively optimized by exhaustively searching for the smallest combination of descriptors that best correlates to catalyst performance data using an automated script (Figure 1, box 4). From this optimized PCA basis set, a predictive and intuitive model for catalyst design can be developed by examining the summed descriptor loadings. These values are determined by taking the dot product of the PCA regression coefficients and the descriptor loadings for the top three principal components (Figure 1, box 5). Finally, new catalysts can be proposed and modeled against the training set using the optimized basis set (Figure 1, box 6).

Hypothetical new catalysts can be evaluated using a “molecular ruler” (Figure 1, box 6). Changes in catalyst performance are captured as the Euclidean distance between catalyst scores in PCA space. Sufficient distance must be achieved between a training set sample and a “new catalyst” before a substantial change in catalyst performance is expected, for better or worse. This can then be used as a statistical gauge to determine if a test catalyst is sufficiently different from the training set to warrant synthesis. This in silico culling of catalyst test candidates substantially reduces synthesis time, which is often the most substantial research bottleneck in heuristic catalyst optimization. The overall process can be iterated through several cycles of design/prediction/synthesis/analysis as needed. In principle, steps 5 and 6 could be further automated to design and test catalysts in silico using established molecular evolution algorithms if desired.18 The value of this type of data-driven predictive-modeling workflow has been reviewed previously.47 Notably, knowledge about the chemical reaction or catalyst performance is not required a priori, illustrating the generalized value of the ISPCA approach.

■ RESULTS AND DISCUSSION

To demonstrate the utility of ISPCA strategies in de novo catalyst design, we targeted the development of a regioselective Ti imido catalyst for the intermolecular [2 + 2 + 1] synthesis of pyrroles from unsymmetrical alkynes and azobenzene.48 Catalyst-controlled selective intermolecular [2 + 2 + 1] formal cycloadditions are uncommon;49-51 these types of multicomponent reactions inherently have multiple steps where both chemo- and regioselectivity must be considered, making straightforward prediction and design of selective catalysts difficult. Here, the regioselective synthesis of 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C) from phenylpropyne and azobenzene (Figure 2) was targeted. This reaction was chosen for two reasons: (1) the 3,4-diaryl motif is found in several natural product classes including halitulins52-54 and lamellarins55-57 and (2) PhCCMe exhibits a low proclivity toward competitive alkyne trimerization and other side reactions in Ti-catalyzed [2 + 2 + 1] reactions. Further, this reaction has proven resistant to optimization through rational catalyst design. Reaction of PhCCMe with PhNNPh catalyzed by py3TiCl2(NPh) (1) generates a 20:45:35 mixture of all possible pyrrole regioisomers A–C (Figure 2). Since the initial report of Ti-catalyzed pyrrole synthesis and a follow-up detailed mechanistic study,48,58,59 significant effort in our laboratory has been dedicated to the discovery of selective catalysts, including the examination of many regioselective alkyne hydroamination platforms (which share a common [2 + 2] cycloadduct intermediate) reported in the literature (see page S58 in the Supporting Information for additional details). In all cases, Bercaw’s law of initial optimization60,61 appeared in effect: either reaction rates were significantly lowered or there was no significant effect on the reaction regioselectivity (or both).
Furthermore, a computational analysis of reactions catalyzed by 1 indicated that complex catalyst speciation was likely an issue.58

Figure 2.


Initial catalyst screen for selective [2 + 2 + 1] pyrrole formation from PhCCMe. The most selective catalysts are shown in orange. Conditions: 0.5 mmol of PhCCMe, 0.1 mmol of PhNNPh, 10 mol % of [Ti], 0.5 mL of PhCF3, 115 °C, 16 h, average of two to three runs.

Iterative Supervised Principal Component Analysis.

On the basis of these initial attempts, a more systematic approach was warranted, and we undertook a catalyst screen where X- and L-type ligands were systematically varied (Figure 2). The catalyst library was based on simple halide-58 and N-heterocycle-based X-type62-64 ligands and pyridine-based L-type ligands. These frameworks were chosen because they are easy to synthesize and highly modular, and their precursors are commercially available. None of the catalysts in Figure 2 resulted in synthetically useful selectivity for product C. Although both sterically encumbered (4, 5) and electron-deficient (10) catalysts resulted in a moderate improvement in selectivity (maximum of 1.8 for C), clear or quantitative trends to further improve selectivity via univariate modification were not obvious. Nonetheless, the small spread in the data provided an opportunity to perform ISPCA using catalysts 1–14 as a training set, as outlined below. For this study, the product selectivity was represented as a ratio of regioisomers rather than a free energy change because all reactions were carried out under identical conditions; this helped to differentiate small changes in selectivity, given that a 10:1 selectivity ratio corresponds to a ΔΔG‡ value of only 1.4 kcal mol−1. In principle, reactions run under different conditions could also be analyzed using ISPCA via an analysis of ΔΔG‡ (vide infra).
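The relationship between a selectivity ratio and a free energy difference can be sketched as follows. This is only an illustration: it assumes the standard relation ΔΔG‡ = RT ln(ratio), and the default temperature of 298.15 K is an assumption chosen because it reproduces the roughly 1.4 kcal mol−1 quoted above; the function names are hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def ddg_from_ratio(ratio: float, temperature_k: float = 298.15) -> float:
    """Free energy difference (kcal/mol) equivalent to a product ratio.

    Assumes ddG = R * T * ln(ratio); the default temperature is an assumption.
    """
    return R_KCAL * temperature_k * math.log(ratio)


def ratio_from_ddg(ddg_kcal: float, temperature_k: float = 298.15) -> float:
    """Inverse of ddg_from_ratio: product ratio for a given ddG (kcal/mol)."""
    return math.exp(ddg_kcal / (R_KCAL * temperature_k))


if __name__ == "__main__":
    # A 10:1 ratio corresponds to a small energy difference.
    print(f"10:1 ratio -> {ddg_from_ratio(10.0):.2f} kcal/mol")
```

Because the logarithm compresses large ratios into small energy differences, working with the ratio directly spreads out the small selectivity changes seen in the training set.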

Step 1. The initial basis set for the analysis was constructed from 22 descriptors spanning a variety of steric, electronic, and spectroscopic properties for catalysts 1–14, calculated at the M06/6-311G(d,p) level of theory (see pages S68–S70 in the Supporting Information for additional details). Descriptors for catalysts and free ligands were chosen on the basis of recent work from the Sigman laboratory detailing the value of catalyst-specific descriptors.13

Step 2. Principal component analysis was conducted on the catalyst training set using the initial basis set, and the catalyst selectivity was plotted against the scores using the top three principal components. This generated a 3D PCA map1 populated by the catalyst training set where catalyst scores (coordinates in PCA space) are related to their individual performance (Figure 3, top).
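As a rough illustration of Step 2, the sketch below computes PCA scores and loadings for a descriptor matrix with a singular value decomposition. It is not the authors' script: the autoscaling step, the function name `pca_scores`, and the random stand-in data are all assumptions made for the example.

```python
import numpy as np


def pca_scores(X, n_components=3):
    """Return (scores, loadings, explained_variance_ratio) for the top components."""
    X = np.asarray(X, dtype=float)
    # Autoscale each descriptor to zero mean and unit variance (assumed preprocessing).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # catalyst coordinates in PCA space
    loadings = Vt[:n_components].T                     # descriptor contributions per PC
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, loadings, explained[:n_components]


# Stand-in data: 14 hypothetical catalysts described by 6 hypothetical descriptors.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(14, 6))
scores, loadings, explained = pca_scores(X_demo)
print(scores.shape, loadings.shape, round(float(explained.sum()), 2))
```

Each row of `scores` is one catalyst's position on the three-dimensional map, and each column of `loadings` shows how strongly every descriptor contributes to that component.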

Figure 3.


(top) Principal component map constructed from the top three components of ISPCA model 1 and populated by the catalyst training set. Coloring is set by a dynamic k-means clustering algorithm (see the Supporting Information for additional details). (bottom) Plots of unoptimized PCA (black) and ISPCA (blue) model 1 predictions versus experimental reaction selectivity delivered by the catalyst training set.

Example Calculation (ISPCA Model 1)

selectivity = (regression coefficients · catalyst scores) + c
c = arithmetic mean of experimental selectivity (0.9877)
catalyst 6 score [x, y, z] = [0.1747, 0.6143, −1.3467]
PC1, PC2, PC3 regression coefficients = 0.0460, 0.0459, 0.3472
catalyst 6 predicted selectivity = (0.0460 × 0.1747) + (0.0459 × 0.6143) + (0.3472 × −1.3467) + 0.9877 = 0.56 (experimental selectivity = 0.56)

Step 3. Linear least-squares fitting was used to regress the catalyst scores against selectivity (Figure 3, bottom, black). The regression was artificially limited to three components to provide a means of visualizing a chemical space1 as well as avoiding overfitting issues. Fit statistics using this unsupervised approach were unsatisfactory (R2 = 0.5) despite the top three components accounting for 76% of the data variance. This is unsurprising, considering the low likelihood that all descriptors would have a substantial correlation to catalyst performance.
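The regression in Step 3 reduces to a dot product plus an intercept once the scores are known. The short sketch below reproduces the catalyst 6 example from the Figure 3 caption; the numbers are taken from that caption, while the function and variable names are hypothetical.

```python
def predict_selectivity(scores, coefficients, intercept):
    """Predicted selectivity = (regression coefficients . catalyst scores) + intercept."""
    return sum(c * s for c, s in zip(coefficients, scores)) + intercept


# Values quoted in the Figure 3 caption for ISPCA model 1.
COEFFS = (0.0460, 0.0459, 0.3472)          # PC1, PC2, PC3 regression coefficients
MEAN_SELECTIVITY = 0.9877                  # arithmetic mean of experimental selectivity
CATALYST_6_SCORES = (0.1747, 0.6143, -1.3467)

predicted = predict_selectivity(CATALYST_6_SCORES, COEFFS, MEAN_SELECTIVITY)
print(f"{predicted:.2f}")  # prints 0.56
```

Here the intercept is the mean experimental selectivity, consistent with PCA scores being mean-centered.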

Step 4. An exhaustive search for an improved basis set was then performed by conducting PCA on all possible descriptor groupings (2.1 million possibilities), using residuals between experimental and predicted selectivity (via the top three component approach described above) as the fitness parameters. The grouping delivering the lowest residuals contained 12 descriptors (ISPCA model 1), accounted for 82% of the data variance, and provided the basis set for ongoing analysis (Table 1). Fit statistics using this supervised approach were considered acceptable (R2 = 0.95), and the model fit was validated using leave-one-out cross validation (Q2 = 0.93, Figure 3, bottom, blue). Conducting the analysis after converting the selectivity ratio to ΔΔG‡ gives equivalent results.65 The PLSR strategy reported by the Rothenberg laboratory46 was also implemented, resulting in a worse statistical fit (R2 = 0.72, Q2 = 0.22), which supports the merits of the ISPCA method over other factor-analysis-based methods (Figure S94).66 The Rothenberg approach was used as a benchmark because partial least-squares regression (PLSR) is the closest mathematical approach to ISPCA used in the catalysis literature. There have been only a handful of reports17,67 on the successful use of factor analysis to model and predict new catalysts in a quantitative fashion.
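The supervised search in Step 4 can be sketched on a toy scale as below. This is an illustrative reading of the procedure, not the authors' script: the autoscaling, the minimum subset size of three descriptors, the function names, and the synthetic data (14 hypothetical catalysts, 7 hypothetical descriptors) are all assumptions.

```python
from itertools import combinations

import numpy as np


def pc_regression(X_train, y_train, X_test, n_components=3):
    """Fit PCA (top components) plus least squares on training data; predict X_test."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0, ddof=1)
    Z = (X_train - mu) / sd                      # autoscaling is an assumption
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    V = Vt[:n_components].T                      # descriptor loadings
    T = Z @ V                                    # training scores
    coef, *_ = np.linalg.lstsq(T, y_train - y_train.mean(), rcond=None)
    return ((X_test - mu) / sd) @ V @ coef + y_train.mean()


def r2(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)


def q2_loo(X, y, n_components=3):
    """Leave-one-out cross-validated R2 (Q2): refit without each catalyst in turn."""
    preds = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        preds[i] = pc_regression(X[keep], y[keep], X[i:i + 1], n_components)[0]
    return r2(y, preds)


def exhaustive_search(X, y, n_components=3):
    """Try every descriptor subset (size >= n_components); keep the lowest residual."""
    best = (np.inf, None)
    for k in range(n_components, X.shape[1] + 1):
        for subset in combinations(range(X.shape[1]), k):
            cols = list(subset)
            y_hat = pc_regression(X[:, cols], y, X[:, cols], n_components)
            resid = float(np.sum((y - y_hat) ** 2))
            if resid < best[0]:
                best = (resid, subset)
    return best


# Synthetic stand-in data, not the catalyst data from the paper.
rng = np.random.default_rng(1)
X = rng.normal(size=(14, 7))
y = 1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 2] + 0.05 * rng.normal(size=14)

full_resid = float(np.sum((y - pc_regression(X, y, X)) ** 2))
best_resid, best_subset = exhaustive_search(X, y)
print("best subset:", best_subset)
print("R2:", round(float(r2(y, pc_regression(X[:, list(best_subset)], y, X[:, list(best_subset)]))), 2))
print("Q2:", round(float(q2_loo(X[:, list(best_subset)], y)), 2))
```

The number of subsets grows exponentially with the number of descriptors, which is why the toy example uses only seven; comparing R2 with Q2 for the winning subset is the check against overfitting described in the text.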

Table 1.

Descriptor Contributions to PC1, PC2, and PC3 in ISPCA Model 1 and the Relative Summed Weight of These Descriptors

| descriptor | PC1 | PC2 | PC3 | relative weight (a) |
| --- | --- | --- | --- | --- |
| free pyridine o-13C NMR shift | 0.255 | −0.068 | 0.514 | 18.7 |
| imido–Ti–Npy angle | 0.081 | −0.359 | 0.523 | 16.8 |
| catalyst-bound pyridine N atom Mulliken charge | −0.356 | 0.072 | −0.380 | −14.5 |
| Ti–X donor BDE | 0.144 | −0.322 | −0.308 | −11.5 |
| Ti–X donor bond length | −0.028 | 0.530 | 0.204 | 9.4 |
| composite donor atom polarizability | −0.067 | −0.542 | 0.198 | 9.1 |
| catalyst LUMO | 0.348 | −0.144 | −0.218 | −6.6 |
| free pyridine LUMO | −0.324 | −0.139 | 0.230 | 5.9 |
| free pyridine proton affinity | −0.414 | −0.047 | −0.031 | −3.2 |
| free pyridine quadrupole moment | −0.394 | −0.121 | 0.148 | 2.8 |
| % buried volume | −0.221 | −0.349 | 0.126 | 1.8 |
| free pyridine HOMO | 0.419 | 0.072 | −0.052 | 0.5 |

(a) PC regression coefficients · descriptor loadings, scaled by a factor of 100.

Step 5. From the optimal ISPCA basis set, relative descriptor weights were determined by a regression coefficient weighted summation of the coefficients of all descriptors in the top three principal components (Table 1; cf. Approach). The relative magnitudes and signs of each descriptor weight were used to understand how tuning electronic or steric parameters would increase catalyst selectivity. For example, the calculated gauge-independent 13C NMR chemical shift of the o-C atom on the free pyridine ligand has a high positive weight (18.7) in the ISPCA model, suggesting that ligands with downfield-shifted o-13C NMR chemical shifts (associated with substitution of the o-C) will have higher selectivity. Similarly, the high positive weight (16.8) of the imido–Ti–Npy angle indicates that sterically encumbered pyridine ligands will exhibit higher selectivity. Additionally, there is a high negative weight (−11.5) associated with Ti–X donor bond dissociation energy (BDE) and a high positive weight (9.1) for composite donor atom polarizability (number-weighted sum of atomic polarizability for donor atoms at Ti), indicating that more polarizable X-type ligands are advantageous. In sum, these quantitative results corroborate chemically intuitive trends: catalysts bearing both highly polarizable, poorly donating X-type ligands (i.e., iodide) and electron-rich, sterically encumbered L-type ligands (i.e., lutidine) will be selective catalysts.
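The relative weights described in Step 5 can be recomputed for a few rows of Table 1 as a dot product of the PC regression coefficients (quoted in the Figure 3 caption) with the descriptor loadings, scaled by 100. The sketch below is illustrative only; the function name is hypothetical, and because the tabulated loadings are rounded, recomputed weights may differ from the table in the last digit.

```python
# PC1-PC3 regression coefficients for ISPCA model 1 (Figure 3 caption).
COEFFS = (0.0460, 0.0459, 0.3472)

# Loadings (PC1, PC2, PC3) copied from four rows of Table 1.
LOADINGS = {
    "free pyridine o-13C NMR shift": (0.255, -0.068, 0.514),
    "catalyst-bound pyridine N atom Mulliken charge": (-0.356, 0.072, -0.380),
    "Ti-X donor BDE": (0.144, -0.322, -0.308),
    "Ti-X donor bond length": (-0.028, 0.530, 0.204),
}


def relative_weight(loadings, coefficients, scale=100.0):
    """Regression-coefficient-weighted sum of a descriptor's loadings, scaled by 100."""
    return scale * sum(l * c for l, c in zip(loadings, coefficients))


weights = {name: relative_weight(row, COEFFS) for name, row in LOADINGS.items()}
for name, value in weights.items():
    print(f"{name}: {value:+.1f}")
```

A positive weight means that increasing the descriptor is expected to raise the predicted selectivity for C, and a negative weight means the opposite.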

Step 6. Using the foregoing ISPCA model 1, “second generation” catalysts 15–22 were predicted (Figure 4), with a focus on changing characteristics of the catalyst in accordance with the relative weights of the descriptors in the model. Prospective test catalysts were first evaluated using the “molecular ruler” approach wherein, on the basis of ISPCA model 1, a candidate catalyst must lie at least 1.7 units in PCA space (in any direction) from the nearest training set neighbor in order to expect a substantial change in selectivity (Δselectivity = 0.46 or 1σ). Page S83 in the Supporting Information contains additional proposed test catalysts that were culled in silico using the “molecular ruler” approach.
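A minimal sketch of the “molecular ruler” filter is given below, assuming it amounts to a nearest-neighbor Euclidean distance check in the three-dimensional PCA space. The 1.7 threshold comes from the text; the function names and all coordinates are hypothetical placeholders.

```python
import math


def nearest_training_distance(candidate, training_scores):
    """Euclidean distance from a candidate to its closest training-set catalyst."""
    return min(math.dist(candidate, t) for t in training_scores)


def worth_synthesizing(candidate, training_scores, threshold=1.7):
    """True if the candidate is far enough from every training catalyst in PCA space."""
    return nearest_training_distance(candidate, training_scores) >= threshold


# Hypothetical PCA-space coordinates (not taken from the paper).
TRAINING = [(1.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
NEAR_CANDIDATE = (0.0, 0.0, 0.0)   # 1.0 unit from the first training point
FAR_CANDIDATE = (3.0, 3.0, 3.0)

print(worth_synthesizing(NEAR_CANDIDATE, TRAINING))
print(worth_synthesizing(FAR_CANDIDATE, TRAINING))
```

Passing the filter only indicates that a change in selectivity is plausible; the direction of the change still comes from the regression model.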

Figure 4.


Second-generation catalysts predicted and synthesized using ISPCA model 1. The most selective catalyst is drawn in orange. Conditions: 0.5 mmol of PhCCMe, 0.1 mmol of PhNNPh, 10 mol % of [Ti], 0.5 mL of PhCF3, 115 °C, 16 h, average of two to three runs.

Catalysts 15 and 16 were computed as representative candidates that contained sterically encumbered L-type ligands. On the basis of ISPCA model 1 these two catalysts were predicted (15 at 1.4 and 16 at 2.8) and experimentally corroborated (15 at 2.0 and 16 at 1.9) to be among the more selective catalysts in comparison to the training set. Excitingly, 17, which contains both a sterically encumbered pyridine ligand and a poorly donating iodide ligand, was predicted (3.65) and experimentally confirmed (2.41) to be more selective than any catalyst in the training set. 17 is a good example of potential additive effects on selectivity: 16, which has stronger Ti–Cl bonds (and is thus less Lewis acidic than 17), and 10, which is similarly Lewis acidic but lacks steric encumbrance on pyridine, each have worse selectivity.

Predictive success with 17, along with the strong model correlation to pyridine N atom Mulliken charge, led us to hypothesize that the strength of the Lewis pair interaction between Ti and the pyridine ligand was important and that 18, with a sterically encumbered, strongly donating pyridine ligand, would also result in high selectivity. 18 was predicted to be selective (3.0) and remarkably exhibited a significantly higher experimental selectivity (10.1) in comparison to any other catalyst tested, a 20-fold increase from 1. In comparison to catalysts with only one of these properties, the strength of the multivariate analysis becomes apparent: catalysts 5 (sterically encumbered) and 19/20 (strongly donating) each have pedestrian selectivity for C, and in a conventional screening endeavor there would be no obvious reason to spend time on the tedious synthesis of 2,6-Me2-DMAP, which is not commercially available.

The high relative negative weight of catalyst LUMO in the ISPCA model motivated us to consider catalysts bearing pentafluorophenoxide X-type ligands, since orbital calculations reveal that the catalyst LUMO is almost exclusively localized on the X and L donors. ISPCA model 1 predicted these catalysts (21, 22) would occupy disadvantageous positions in PCA space and should not be highly selective for product C. Nonetheless, we tested these catalysts to further explore the predictive power of ISPCA. Indeed, while catalyst 22 is not a competent catalyst for the pyrrole reaction, catalyst 21 provides product B in unusually high selectivity. Thus, the unique coordinates of catalysts 21 and 22 in PCA space demonstrate that ISPCA should be an effective method for predicting changes in catalyst performance in a more general context, in this case ultimately allowing for the selective synthesis of any potential pyrrole regioisomer.

While ISPCA model 1 led to the synthesis of a highly selective catalyst, the relatively poor accuracy for 18 (predicted vs observed: 3.0 vs 10.1) prompted us to rerun ISPCA using the full catalyst data set. Linear regression using ISPCA and catalysts 1–22 as a training set results in relatively poor fit statistics (R2 = 0.72, Q2 = 0.13; Figure S95). Excluding 18 from the training set produced good fit statistics (R2 = 0.90, Q2 = 0.84; Figure S96). Notably, attempting to fit the PCA subspace surrounding 18 using k-means clustering (Figure S97) to determine the local nearest neighbors (catalysts 4, 5, 15, 16, 18, 19, 20) resulted in an excellent coefficient of determination (R2 = 0.99) but a very poor cross-validation value (Q2 = −0.07). In combination, these modeling attempts suggest that catalyst 18 may represent an activity cliff/threshold or that the chemical space surrounding 18 is undersampled. Activity cliffs are commonly found in predictive catalysis and medicinal chemistry,68 where a small modification in a substrate or catalyst results in a dramatic change in performance. Identifying and predicting activity cliffs are inherent challenges in chemical catalysis. The dramatic increase in performance of catalyst 18 is a salient example of how any predictive model must be carefully informed by rigorous experimentation.

In order to effectively model a training set containing catalysts 1–22, a weighted ISPCA scheme was implemented (ISPCA model 2, Table 2). Weights were applied to each catalyst based on the pairwise distance from catalyst 18 and the formula weight = e^(−distance/c), where c = 25 was determined empirically.69 This modeling strategy resulted in satisfactory fit statistics (R2 = 0.95, Q2 = 0.77, Figure S98), and an analysis of the relative descriptor weights from ISPCA model 2 was informative for the development of third-generation catalysts. In ISPCA model 2, the electron-donating ability of the ancillary pyridine ligand (correlated to the catalyst-bound pyridine p-13C NMR shift and free pyridine HOMO) is the most critical factor in determining selectivity. Additionally, maintaining steric bulk around the Ti center (correlated to imido–Ti–Npy angle and o-13C NMR shifts) remains important. Therefore, we were prompted to investigate two catalysts: one bearing a more electron-rich 2,6-substituted pyridine, 2,6-dimethyl-4-(pyrrolidin-1-yl)pyridine (Table 2, catalyst 23, ISPCA model 2 predicted selectivity 11.9), and one with 2,6-Me2-DMAP in combination with poorly donating iodide ligands (Table 2, catalyst 24, ISPCA model 2 predicted selectivity 12.9). Experimental selectivity for C with catalyst 23 was 11.5, nearly 25 times as selective as the initial catalyst, 1. Unfortunately, 24 could not be isolated in pure form, so catalysis was not conducted.
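The distance-based weighting can be sketched as follows. Only the formula weight = exp(−distance/c) with c = 25 comes from the text; the space in which the distance is measured is not specified in this passage, and the use of the weights in a weighted least-squares fit (rows scaled by the square root of the weight) is an assumption of this example, as are the function names and data.

```python
import numpy as np


def distance_weights(coords, reference, c=25.0):
    """weight_i = exp(-distance_i / c), distance measured from a reference catalyst."""
    d = np.linalg.norm(np.asarray(coords, dtype=float) - np.asarray(reference, dtype=float), axis=1)
    return np.exp(-d / c)


def weighted_least_squares(T, y, w):
    """Minimize sum_i w_i * (y_i - b0 - T_i . b)^2 by scaling rows with sqrt(w_i)."""
    A = np.column_stack([np.ones(len(y)), T])
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return beta  # [intercept, coefficients...]


# Hypothetical coordinates; the first row plays the role of the reference catalyst.
coords = np.array([[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [30.0, 40.0, 0.0]])
w = distance_weights(coords, coords[0])
print(np.round(w, 3))
```

With c = 25, a catalyst 5 units from the reference keeps about 82% of its weight, while one 50 units away keeps about 14%, so the fit is pulled toward the neighborhood of the reference catalyst.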

Table 2.

ISPCA Model 2 Relative Summed Descriptor Weights from the Top Three Components and Prediction of Highly Selective Third-Generation Catalysts

| descriptor | PC1 | PC2 | PC3 | relative weight (a) |
| --- | --- | --- | --- | --- |
| catalyst-bound pyridine p-13C NMR shift | 0.408 | 0.318 | 0.840 | −25.1 |
| imido–Ti–Npy angle | 0.408 | 0.275 | −0.420 | 17.6 |
| free pyridine HOMO | −0.407 | 0.877 | −0.112 | 13.1 |
| catalyst-bound pyridine o-13C NMR shift | 0.409 | 0.155 | −0.262 | 10.9 |
| free pyridine o-13C NMR shift | 0.409 | 0.170 | −0.118 | 6.1 |
| Ti–Npy bond length | 0.409 | −0.043 | −0.152 | 5.0 |

(a) PC regression coefficients × descriptor loadings.

Comparison of Statistical Approaches.

Although the ISPCA strategy was successful in delivering a selective catalyst, we were motivated to explore whether alternative statistical approaches would better facilitate the discovery of lead catalyst 23 in a hypothetical catalyst scouting campaign. Multivariate linear regression (MLR) is a popular and powerful approach for reaction optimization and was chosen as a benchmark to ISPCA. Within this parallel analysis, we aimed to interrogate the merits of each method in (1) developing an accurate global catalyst structure–activity model and (2) predicting the optimal catalyst (23). In particular, we wanted to investigate these metrics as functions of both training set size and composition—two factors that would both vary during a “real” catalyst scouting campaign.

First, MLR analysis using a training set comprised of catalysts 1–14 was carried out using the same 22 descriptors used for the earlier ISPCA studies. This analysis leads to a fit with R2 = 0.995 and Q2 = −0.18. The unusually low cross-validation value is indicative of overfitting issues, which is highlighted graphically in Figure 5. The highest weighted coefficient in the MLR model is an interaction term composed of free pyridine N Mulliken charge and catalyst-bound X donor atom. The remaining terms are constructed from free pyridine o-13C NMR shift, Ti–X donor BDE, and catalyst LUMO. The relative weight of descriptors in the MLR model is consistent with the qualitative trend described by the analogous ISPCA model 1, which similarly indicated that selective catalysts bear strongly electron donating (free pyridine N Mulliken charge) and sterically congested (free pyridine o-13C NMR shift) pyridine ligands and easily polarized anionic (Ti–X donor BDE, catalyst LUMO) ligands (Figure 5, bottom). Notably, the training set statistical fit for the MLR model is nearly perfect in contrast to the ISPCA model (R2 of 0.995 vs 0.954), suggesting that MLR may be a more efficient strategy for determining catalyst structure–activity relationships of the data set.

Figure 5.


(top) Plots of ISPCA model 1 (black) and MLR (blue) model predictions versus experimental reaction selectivity delivered by the catalyst training set 1–14. Open circles refer to values predicted by leave-one-out cross validation (LOOCV). (bottom) Descriptors and relative weights for the MLR model.

Despite identification of qualitatively similar trends in the training set of 1–14, quantitative prediction of the lead catalyst 23 with the ISPCA and MLR models shows a stark divergence: for catalyst 23, ISPCA model 1 predicts a selectivity value of 3.0, while the MLR model predicts a selectivity value of −1.7. Although neither initial model accurately predicts the experimental selectivity of 23 (11.3), ISPCA successfully highlights 23 as a candidate warranting synthesis, considering that the highest selectivity observed in the training set was only 1.8. In contrast, the MLR model predicts a physically impossible value (a selectivity ratio cannot be negative), which would likely be of very limited use to researchers embarking on a new catalyst search. Thus, while MLR may ultimately yield a model with better descriptive power within a given data set, ISPCA provides complementary predictive power beyond the initial data set. This is valuable because training sets are often constructed heuristically by chemical intuition with limited quantitative chemical information and, by pragmatic necessity, undersample the desired feature space.

A “single-point” (single training set) comparison between ISPCA and MLR does not fully reflect the strengths of each statistical technique. Thus, we were further motivated to carry out a more detailed stepwise parallel analysis of the two techniques to better understand how the training set size and composition may lead to different modeling behaviors (Figure 6). For the parallel analysis, the full set of catalysts (1–23) was split into a “training set” and a “test set”. The test set was defined as the catalysts excluded from the training set. The initial training set was constructed from catalysts 1–10, resulting in an initial test set of catalysts 11–23. The training set was incremented sequentially by one catalyst in numerical order: i.e., training set #2 is catalysts 1–11, training set #3 is catalysts 1–12, etc. (cf. Figure 6a). An automated script (analysisparalysis, see the Supporting Information) was used to fit the expanding training set using either MLR or ISPCA. The resulting model was then used to predict the remaining test catalysts, and the statistical fit was determined using three measures: (1) residuals between experimental and predicted selectivity for catalyst 23, (2) R2 for the global prediction of catalysts 1–23 based on a particular training set, and (3) RS (Spearman correlation), which measures the rank correlation of catalysts as a function of predicted vs experimental selectivity (Figure 6b).
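The bookkeeping behind this stepwise analysis can be sketched as below: a generator for the expanding training/test splits and a plain-Python Spearman rank correlation. This is not the authors' analysisparalysis script; the function names are hypothetical, and the model-fitting step (MLR or ISPCA) is deliberately left out.

```python
def expanding_training_sets(n_total=23, n_initial=10):
    """Yield (training_ids, test_ids); the training set grows by one catalyst per step."""
    for n in range(n_initial, n_total + 1):
        yield list(range(1, n + 1)), list(range(n + 1, n_total + 1))


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


splits = list(expanding_training_sets())
print(len(splits), splits[0][0][-1], splits[0][1][0])
# A monotonic but nonlinear relationship still ranks perfectly.
print(spearman([1, 2, 3, 4], [1, 4, 9, 100]))
```

The Spearman measure rewards a model for ordering catalysts correctly even when the predicted magnitudes are off, which is often what matters when deciding which candidate to synthesize next.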

Figure 6.

(a) Graphical outline of the single-path analysis workflow showing how a training set (green dots) is iteratively expanded and analyzed to model a global data set that includes the training set and a test set (green plus orange dots). (b) Single-path parallel analysis of MLR and ISPCA. The top subplot shows the total number of terms in the accompanying regression model. Each following subplot shows a different measure of statistical fit (from top to bottom): absolute residuals between experimental and predicted selectivity for catalyst 23, global model R2, and global model Spearman correlation. Each dot refers to a different model employing either MLR or ISPCA using catalysts 1–N in the training set, where N is denoted on the x axis. In each subplot the blue dots/lines refer to analysis by ISPCA and the black dots/lines refer to analysis by MLR.

In this stepwise test set analysis, it is notable that neither MLR nor ISPCA is the clearly superior method for either predicting the optimal catalyst (23 residuals) or developing a global catalyst structure–activity relationship (R2/RS). Nonetheless, MLR appears to be ultimately superior for fitting a global catalyst structure–activity relationship for the complete data set, as evidenced by higher R2/RS values when all 23 catalysts are included (Figure 6b). However, ISPCA appears superior in modeling and predicting the optimal catalyst in sparser training sets, as evidenced by the improved statistical fits in the intermediate training sets composed of ~14–22 catalysts. ISPCA also shows less model volatility than MLR analysis, which shows large deviations in prediction (e.g., residuals on 23 prediction when 14 is initially included in training or R2 on overall prediction when 18 is initially included). The volatility in the MLR analysis when including catalysts such as 14 is likely a result of these complexes occupying a different chemical space (carbazolide rather than a simple halide ligand) than the rest of the training set. This structural diversity may not be captured completely in the MLR analysis.

Thus, the training set has a substantial bias on the ensuing model but seems to influence MLR more strongly than ISPCA. In order to further explore the statistical robustness/volatility of MLR and ISPCA as a function of training set bias, we next sought to compare “path” effects on model building (Figure 7). Here, the initial training set of catalysts 1–14 was incrementally expanded to include catalysts 15, 17, 18, and 19 (a set of catalysts that “chemical intuition” might lead one to investigate on the basis of initial modeling/data sets). However, these catalysts were not just added in sequence (15, then 17, then 18, then 19) as described in the foregoing experiment but also in every possible permutation (24 possible permutations, giving 96 models in total for each analysis method). The order in which the catalysts were added to the expanding training set is denoted as a “path” (Figure 7a). Paths were limited to four catalysts owing to computational limitations (ref 70). The models built from the permutated training sets were used to predict the selectivity of leading catalyst 23, and the resulting residuals between the experimental and predicted selectivities were plotted against the training set “paths” in Figure 7b. Here, we expect statistically robust approaches to show a relatively small Δ between the different model residuals; that is, the order of addition should not significantly affect the model. The area bounded by the paths constructed from ISPCA (Δ = 6.1) is smaller than the equivalent area constructed by MLR analysis (Δ = 9.3). A model that predicts a catalyst as highly selective in some training sets and poorly selective in others would likely be disregarded as unreliable in a “real” catalyst scouting campaign. While all ISPCA models have residuals above 2.5, several MLR models have residuals well below 1: this may appear as though MLR-borne models are more accurate on average, but in realistic catalyst scouting campaigns the “best catalyst” is not known a priori. Thus, researchers must judge evolving models solely on their relative volatility and precision. The difference in Δ residuals between ISPCA and MLR further corroborates the superior statistical robustness of ISPCA in this catalytic system.
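The path enumeration described above (four added catalysts, 24 orderings, 96 models) can be illustrated with a short sketch. The data below are synthetic stand-ins, the model is a plain least-squares fit rather than ISPCA or the MLR models of this work, and the spread statistic (maximum minus minimum residual over all models) is only one plausible, assumed way to summarize path dependence; it is not claimed to be how Δ was computed.

```python
from itertools import permutations

import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (NOT the paper's): rows 0-22 play the role of catalysts 1-23.
X = rng.normal(size=(23, 2))
y = X @ np.array([1.0, -0.5]) + rng.normal(scale=0.3, size=23)

base = list(range(14))         # catalysts 1-14 (zero-based rows 0-13)
candidates = [14, 16, 17, 18]  # catalysts 15, 17, 18, and 19
target = 22                    # catalyst 23


def abs_residual(train_idx):
    """Fit least squares on the training rows; return |experimental - predicted| for the target."""
    A = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A, y[train_idx], rcond=None)
    pred = np.concatenate([[1.0], X[target]]) @ coef
    return abs(y[target] - pred)


# One "path" per ordering; each path holds one residual per added catalyst.
paths = {}
for order in permutations(candidates):
    paths[order] = [abs_residual(base + list(order[:k])) for k in range(1, len(order) + 1)]

n_models = sum(len(residuals) for residuals in paths.values())
all_residuals = [r for residuals in paths.values() for r in residuals]
spread = max(all_residuals) - min(all_residuals)
print(f"{len(paths)} paths, {n_models} models, residual spread = {spread:.2f}")
```

For a fitting method whose result depends only on which catalysts are in the training set, and not on the order in which they were added, the 96 models in this sketch draw on just 15 distinct training sets (4 + 6 + 4 + 1), and every path ends at the same final model.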

Figure 7.

(a) Graphical outline of the training set composition analysis workflow, showing how a chosen training set (green dots) can be exhaustively permutated (adding orange dots to the training set in different orders, denoted by addition of green dots with black circles on subsequent lines) and analyzed. (b) Multipath parallel analysis of MLR and ISPCA: (left) permutated training set dependence of ISPCA; (right) MLR analysis. Each line is a different training set path. The y axis depicts the absolute residual between the experimental and predicted selectivity for catalyst 23.

In summary, the results from the parallel analysis of MLR and ISPCA suggest that MLR may be better poised for developing a global catalyst structure–activity relationship (large training sets), while ISPCA may be superior in finding an optimal catalyst (small training sets). However, it is important to note that this data set can provide only limited conclusions for translation to other systems owing to its small relative size. Further comparisons between the two strategies that incorporate larger and more diverse data sets are warranted. Furthermore, these informatics experiments suggest models developed in “real time” during catalyst optimization should include training set permutations to best understand the training set biases and true predictive power.

Computational Modeling.

We were next interested in interrogating the origins of the remarkable selectivity of 23 and 18 relative to 1. DFT calculations were performed on catalysts 16, 18, and 23, which all showed improved selectivity in comparison to 1 (Table 3). These structurally similar catalysts differ only in the para substitution on pyridine, yet the experimental selectivity ranges from 1.9 to 11.5, thus providing a good benchmark for computational comparison. The free energies of TS1, IM3, and TS2 for catalysts 1, 16, 18, and 23 are shown in Table 3. In the case of TS2, there are three possible isomers, named cis-1, cis-2, and trans (Figure S100), corresponding to the relative position of the pyridine ligand to the incoming alkyne. In all cases, the cis-2 structures (Figure 8, top right) are lowest in free energy, and these structures will be used for discussion below. Free energies of the other isomers are reported in the Supporting Information. From our calculations, [2 + 2] cycloaddition (TS1) is significantly lower in energy than rate-determining second alkyne insertion (TS2). Thus, cycloaddition (TS1) is reversible and insertion (TS2) irreversible, indicating a Curtin–Hammett scenario where the product distribution is mainly determined by the population of IM3 and the reaction barrier of TS2. Unfortunately, the calculated TS2 reaction barriers for 16, 18, and 23 are higher than experimental rates allow; this could arise from at least two effects: (1) partial/full associative ligand displacement during insertion and (2) discrepancies arising from using an implicit solvent model, as we previously investigated (ref 58). Further, the choice of different DFT functionals can significantly affect the absolute value of the reaction barriers (see Table S11), such that this discrepancy between experiment and computation is likely within the uncertainty of the DFT. Nonetheless, a comparison of ΔΔGǂ values can still provide significant insight into the mechanism.

Table 3.

DFT-Calculated Free Energies (M06/def2-SVP, SMD PhCF3, 115 °C, kcal mol−1) for Selected Catalysts (a)

catalyst                    1      16     18     23
TS1A                        25.4   25.1   26.9   26.9
TS1B                        23.2   22.7   24.9   25.0
ΔΔGǂ(TS1A-TS1B)             2.2    2.4    2.0    1.9
IM3A                        8.6    6.7    7.8    9.1
IM3B                        1.7    3.5    5.7    4.8
ΔΔG(IM3A-IM3B)              5.9    3.2    2.1    4.3
TS2AC                       32.7   43.0   44.5   44.2
TS2AD                       32.7   40.8   43.4   43.4
TS2BE                       26.5   40.8   43.0   43.3
TS2BF                       25.3   37.0   39.6   39.8
ΔΔGǂ(TS2BE-TS2BF)           1.2    3.8    3.4    3.5
selectivity (computed) (b)  4.9    49     13.3   99
selectivity (observed)      0.46   1.9    10.1   11.5

(a) All TS2 free energies are reported for the cis-2 isomer. Schematics of the full selectivity manifolds are shown in Figures S96-S100.

(b) Computed selectivities are determined from the Boltzmann equation and Eyring equation. Details are given in the Supporting Information.

Figure 8.

(left) Selectivity manifold for [2 + 2 + 1] pyrrole formation from PhCCMe. (top right) Sterically encumbered pyridine ligands disfavor TS2BE, leading to selective formation of C through TS2BF. (bottom right) Electron-rich, bulky pyridines favor coordinated selective species, while solely sterically encumbered pyridines decoordinate, eroding selectivity.

When 1 is compared with 16, 18, and 23, the values for ΔGǂ(TS1) are similar among the four catalysts, ranging from 22.7 to 26.9 kcal mol−1. By changing pyridine (1) to 2,6-dimethyl-substituted pyridine (16, 18, 23) the values for ΔΔGǂ(TS1A-TS1B) all remain ~2 kcal mol−1, suggesting that in all cases TS1 is still under electronic control for the thermodynamically favored TS1B. Thus, there is no major effect of the pyridine ligand on [2 + 2] cycloaddition. In contrast, ΔΔGǂ(TS2BE-TS2BF) is 3.4–3.8 kcal mol−1 for 16, 18, and 23, while it is only 1.2 kcal mol−1 for 1. Since 16, 18, and 23 are electronically different but sterically the same, this result indicates that the steric hindrance of the 2,6-Me2-substituted pyridines is the key factor that leads to the selectivity for C in TS2 (Figure 8, top right). It is worth mentioning that ΔΔGǂ(TS2BE-TS2BF) may be consistently overestimated for all four catalysts: π–π stacking of adjacent phenyl groups in TS2BF artificially stabilizes the structure. However, this is only one geometry of the ensemble of transition states, and most geometries will not have the additional arene π-stacking interaction.
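To give a sense of what these barrier differences mean kinetically, the sketch below converts a ΔΔGǂ value into a relative rate using the Eyring relation, k(fast)/k(slow) = exp(ΔΔGǂ/RT), at 115 °C. It considers only the TS2BE versus TS2BF branching and is not intended to reproduce the computed selectivities in Table 3, which also account for the other pathways and the IM3 populations (see the Supporting Information). The function name is hypothetical.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1
T = 388.15            # 115 °C in kelvin, the temperature used for Table 3


def branching_ratio(ddg_kcal: float, temperature: float = T) -> float:
    """Rate ratio k_fast/k_slow for two competing steps whose barriers differ by ddg_kcal."""
    return math.exp(ddg_kcal / (R_KCAL * temperature))


# ΔΔGǂ(TS2BE-TS2BF) values taken from Table 3 (kcal mol^-1).
ddg_ts2 = {"1": 1.2, "16": 3.8, "18": 3.4, "23": 3.5}
for catalyst, ddg in ddg_ts2.items():
    print(f"catalyst {catalyst}: TS2BF/TS2BE rate ratio = {branching_ratio(ddg):.1f}")
```

Because the ratio is exponential in ΔΔGǂ, the difference between 1.2 and roughly 3.5 kcal mol−1 corresponds to more than an order of magnitude in relative rate at this temperature.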

Given the large positive ΔΔG(IM3A-IM3B) and ΔΔGǂ(TS2BE-TS2BF), catalysts 16, 18, and 23 should all produce almost exclusively C through the cis-2 configuration pathway of TS2BF (Table 3). However, 18 and 23 are significantly more selective than 16. We hypothesize that the difference in observed selectivity may be correlated to relative ligand lability (Figure 8, bottom right). When uncoordinated TiCl2(NPh) is computed as the catalyst, ΔΔGǂ(TS2BE-TS2BF) = −0.7 kcal mol−1, indicating that selectivity favors B over C for the unbound catalyst system (Figure S105). The calculation of dissociation free energies is particularly challenging because they are sensitive to the solvation model; however, calculations suggest that dissociation is possible for 16, 18, and 23 under catalytic conditions. Sterically encumbered ligands with electron-donating groups in the para position are better donors and thus less labile, resulting in a coordination equilibrium favoring the pyridine-bound (and selective) catalyst species over the unbound species. The selectivity of 23 increases over time, consistent with the hypothesis that catalyst speciation may be affecting selectivity (Figures S79 and S80).

In sum, calculations indicate that [2 + 2] cycloaddition is relatively insensitive to ligand effects, as the transition state is not very sterically encumbered and there is already a large intrinsic electronic bias for PhCCMe cycloaddition. However, sterically encumbered catalysts are needed to bias second alkyne insertion to favor product C. These sterically encumbered ligands are more likely to dissociate from Ti, which results in the formation of an unselective naked catalyst. In order to shift the coordination equilibrium to favor bound, selective catalysts, strongly electron donating pyridine groups are needed. Thus, sterically encumbered, electron-rich catalysts such as 18 and 23 are the most selective catalysts: they are sterically encumbered enough to bias second alkyne insertion but are strong enough donors to stay bound to Ti throughout catalysis. These observations are consistent with the models derived from ISPCA, where the free pyridine HOMO and p-13C NMR shifts (electronic descriptors) and the imido–Ti–Npy angle and o-13C NMR shifts (steric descriptors) are predicted to be the critical factors affecting selectivity.

CONCLUSIONS

Iterative supervised principal component analysis (ISPCA) was implemented as a strategy for rational catalyst design. The Ti-catalyzed [2 + 2 + 1] synthesis of pentasubstituted pyrroles from phenylpropyne and azobenzene was chosen as a model system to validate the merits of the ISPCA approach, with the aim of controlling the reaction selectivity for 2,5-dimethyl-1,3,4-triphenyl-1H-pyrrole (C). ISPCA was conducted on a training set of catalysts, and their performance was regressed against the scores from the top three principal components. Component loadings from this optimized PCA space were used to inform the design of new test catalysts. Additionally, a “molecular ruler” was used as a statistical gauge to determine if test catalysts were significantly different from the training set to warrant synthesis. The selectivity of a prospective test set was predicted in silico using the ISPCA model, and only optimal candidates were synthesized and tested experimentally. This judicious in silico winnowing of test catalysts helped shorten the screening process, and after only three generations, the catalytic selectivity was improved from 0.5 to over 11 for C—a factor of nearly 25.

DFT calculations demonstrated that the origin of selectivity for catalyst 23 could be rationalized as a function of the persistence of sterically encumbered pyridine-bound catalyst species. The effective steric bulk of pyridine ligands is normalized by the strength of the accompanying Ti–pyridine Lewis pair. Optimal catalysts are only required to bias the second alkyne insertion, as the initial [2 + 2] cycloaddition is largely insensitive to ligand effects. Catalyst 23 bears a pyridine ligand that is both sterically bulky and highly electron donating. This leads to a highly regioselective reaction, likely because the electron-donating ability of the ancillary pyridine helps keep the ligand bound to Ti, creating a sterically encumbered active site that can influence selectivity-determining transition states. Both ISPCA and MLR models corroborate this observation: the regression coefficients of descriptors associated with steric bulk and electron donor ability are heavily weighted.

In summary, this work serves as a proof of principle that ISPCA is a valuable strategy in de novo catalyst design. This ISPCA approach led to models that use relatively sparse training sets (<20), are reasonably predictive with out-of-sample test cases, have sensible interpretations, and are constructed from low-computational-cost descriptors. ISPCA serves as a way to effectively quantify and complement a researcher’s chemical intuition during the catalyst screening process but maintains outputs that are chemically meaningful. A hypothetical parallel catalyst discovery initiative using both MLR and ISPCA reveals that the two methods appear to have complementary advantages. ISPCA provided more robust intermediate predictive models, indicating that ISPCA may be a more useful tool for initial catalyst discovery, while MLR ultimately yielded a stronger statistical model of the entire data set. Although these observations come from the study of a single reaction, we hope that they encourage researchers to consider using MLR and ISPCA in a complementary fashion. Attempts to accelerate the successful discovery of a new catalyst using advanced statistical models will always be balanced by similar attempts using qualitative chemical information and intuition. Widespread adoption of catalysis informatics in realistic catalyst scouting campaigns will require leveraging multiple analysis techniques that can (1) effectively model structure–activity relationships and (2) aid in prediction from nonideal (sparse and undersampled) training sets, which ISPCA is poised to do. Overall, this strategy should be generalizable to any catalyst system and could become a more widely utilized tool in catalysis informatics.

EXPERIMENTAL SECTION

General Considerations.

All air- and moisture-sensitive reactions were carried out in a nitrogen-filled glovebox. Solvents for air- and moisture-sensitive reactions were prepared in one of three ways: (i) predried on a Pure Process Technology solvent purification system (C6H6, THF, PhMe, PhCF3, hexanes, pentane, ether), (ii) degassed, dried over CaH2, and stored over molecular sieves prior to use (PhBr), or (iii) vacuum-transferred from sodium benzophenone ketyl (C6D6) or CaH2 (CDCl3). C6D5Br was synthesized following the literature procedure (ref 71), degassed, dried over CaH2, freeze–pump–thawed three times, brought into the glovebox, and filtered through basic alumina prior to use. PhNNPh was purchased from TCI America, extracted with hexanes/water to remove residual methanol, and dried in vacuo before use. Commercial PhCCMe was dried over CaH2, freeze–pump–thawed three times, brought into the glovebox, and filtered through basic alumina prior to use.

[Ti(NPh)Cl2]n (ref 72) and [Ti(NTol)Cl2]n (ref 73) were synthesized as previously described, dried in vacuo, and stored in the glovebox freezer until use. [Ti(NPh)Br2]n and [Ti(NPh)I2]n were prepared following a modified procedure (ref 72) using TiBr4 and TiI4 instead. Catalysis results for py3Ti(NPh)Cl2 (1) from previous work (ref 48) were used for comparison.

Catalysis.

PhNNPh (364 mg, 2 mmol), PhCCMe (1.16 g, 10 mmol), and 1,3,5-trimethoxybenzene (336 mg, 2 mmol, as an internal standard) were placed in a 10 mL volumetric flask and diluted to 10 mL with PhCF3 to make a stock solution. For each catalytic run, a precatalyst (10 mol % of Ti, 0.01 mmol) and 0.5 mL of the stock solution were placed in an NMR tube in a N2-filled glovebox. This was then sealed with an NMR cap and electrical tape before heating at 115 °C for 16 h. Reaction yields and selectivities were then determined by 1H NMR spectroscopy.
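The quantities delivered to each NMR tube follow directly from the stock solution composition. The short check below works through that arithmetic; it assumes that the 10 mol % loading is expressed relative to PhNNPh, which is consistent with the numbers given but is stated here as an assumption.

```python
# Stock solution: amounts (mmol) dissolved to a total volume of 10 mL in PhCF3.
stock_volume_ml = 10.0
stock_mmol = {
    "PhNNPh": 2.0,
    "PhCCMe": 10.0,
    "1,3,5-trimethoxybenzene": 2.0,
}

aliquot_ml = 0.5  # volume of stock solution per catalytic run
ti_mmol = 0.01    # precatalyst charge per run (mmol of Ti)

# Amount of each component delivered to one NMR tube.
per_run_mmol = {name: mmol * aliquot_ml / stock_volume_ml for name, mmol in stock_mmol.items()}

# Catalyst loading relative to PhNNPh (assumed reference reagent).
loading_mol_percent = 100 * ti_mmol / per_run_mmol["PhNNPh"]

for name, mmol in per_run_mmol.items():
    print(f"{name}: {mmol:.2f} mmol per run ({mmol / aliquot_ml:.1f} M)")
print(f"Ti loading: {loading_mol_percent:.0f} mol % vs PhNNPh")
```

Each run therefore contains 0.1 mmol of PhNNPh and 0.5 mmol of PhCCMe (5 equiv), so 0.01 mmol of Ti corresponds to 10 mol %.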

Statistical Analysis.

All analysis, data mining, and visualization were performed using PCA4U2, a Matlab script developed in house (see the Supporting Information for additional details). The script was run on a commercial MacBook Air with a 1.8 GHz Intel Core i5 processor and required a run time of roughly 30–500 s depending on the search method. PCA was performed using the singular value decomposition (SVD) method. The generation of correlation tables, PCA maps, and regression statistics was all automated through the script.
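For readers unfamiliar with the SVD route to PCA, a minimal sketch is given below. It is written in Python with NumPy rather than Matlab, uses synthetic data, and covers only the plain PCA step; the supervised, iterative descriptor selection performed by PCA4U2 is not reproduced. Autoscaling of the descriptors (mean centering and scaling to unit variance) is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np


def pca_svd(X: np.ndarray, n_components: int):
    """PCA of a samples-by-descriptors matrix via SVD of the autoscaled data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale each descriptor
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]   # sample coordinates in PC space
    loadings = Vt[:n_components].T                    # descriptor weights per component
    explained = S**2 / np.sum(S**2)                   # fraction of variance per component
    return scores, loadings, explained[:n_components]


rng = np.random.default_rng(0)
X_demo = rng.normal(size=(14, 6))  # synthetic stand-in: 14 catalysts, 6 descriptors
scores, loadings, explained = pca_svd(X_demo, n_components=3)

print("scores shape:", scores.shape)
print("loadings shape:", loadings.shape)
print("explained variance fractions:", np.round(explained, 3))
```

In a regression on principal components, the columns of `scores` would serve as the predictors, while `loadings` indicate which original descriptors contribute most to each component.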

Computational Details.

Geometry optimizations were performed using the Gaussian 16 program, version c01 (ref 74). All geometry optimizations and frequency calculations were performed using the M06 functional (ref 75), the def2-SVP basis set (ref 76), and the SMD solvation model (ref 77) with PhCF3 (ε = 9.18) as the solvent. All geometries were characterized by frequency analysis calculations to be local minima (without any imaginary frequency) or transition states (with only one imaginary frequency). All vibrational frequencies below 50 cm−1 were replaced with values of 50 cm−1 due to the breakdown of the harmonic oscillator model for low-frequency vibrational modes. Zero-point vibrational energies and thermal contributions to electronic energy were calculated at 388.15 K and 1 atm. Full computational details are available in the Supporting Information.
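The effect of the 50 cm−1 frequency floor can be illustrated with the standard harmonic-oscillator expression for vibrational entropy, S = R Σ [x/(e^x − 1) − ln(1 − e^−x)] with x = hcν/(kBT). The sketch below is an illustration only: the example frequencies are invented, the function names are hypothetical, and how the correction was actually applied to the program output is described in the Supporting Information, not here.

```python
import math

H = 6.62607015e-34    # Planck constant, J s
KB = 1.380649e-23     # Boltzmann constant, J K^-1
C_CM = 2.99792458e10  # speed of light, cm s^-1
R = 8.314462618       # gas constant, J mol^-1 K^-1


def raise_low_frequencies(freqs_cm, floor_cm=50.0):
    """Replace every real frequency below the floor with the floor value."""
    return [max(f, floor_cm) for f in freqs_cm]


def vibrational_entropy(freqs_cm, temperature=388.15):
    """Harmonic-oscillator vibrational entropy in J mol^-1 K^-1 (real frequencies only)."""
    total = 0.0
    for nu in freqs_cm:
        x = H * C_CM * nu / (KB * temperature)
        total += R * (x / math.expm1(x) - math.log1p(-math.exp(-x)))
    return total


raw = [12.0, 35.0, 120.0, 1600.0]  # invented example frequencies, cm^-1
floored = raise_low_frequencies(raw)

print("floored frequencies:", floored)
print(f"S_vib (raw)     = {vibrational_entropy(raw):.1f} J/(mol K)")
print(f"S_vib (floored) = {vibrational_entropy(floored):.1f} J/(mol K)")
```

Because the harmonic entropy grows without bound as a frequency approaches zero, very soft modes can dominate the computed free energy; raising them to a floor value limits that contribution.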

ACKNOWLEDGMENTS

This work was inspired in part by the Dow Chemical Statistics Short Course held at the University of Minnesota Department of Chemistry on August 5–9, 2019.

Funding

Financial support was provided by the National Institutes of Health (1R35GM119457) and the Alfred P. Sloan Foundation (I.A.T. is a 2017 Sloan Fellow). Instrumentation for the University of Minnesota Chemistry NMR facility was supported from a grant through the National Institutes of Health (S10OD011952). X-ray diffraction experiments were performed with a diffractometer purchased through a grant from the NSF/MRI (1229400) and the University of Minnesota. We acknowledge the Minnesota Supercomputing Institute (MSI) at the University of Minnesota and the National Energy Research Scientific Computing Center (NERSC), a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231, for providing resources that contributed to the results reported within this paper.

Footnotes

The data sets that support the findings of this study are provided as a Microsoft Excel file (PCAModelingData.xlsx). The authors declare that all other data supporting the findings of this study (full experimental details and computational geometries) are available within the paper and the Supporting Information files.

All ISPCA statistical analysis was carried out using the Matlab script PCA4U2.m. This script is available as a file in the Supporting Information.

The authors declare no competing financial interest.

Supporting Information

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acscatal.0c03939.

ISPCA model I and model II data (XLSX)

All computational geometries (XYZ)

Matlab script analysisparalysis used to carry out training set pathing/sequencing studies (ZIP)

Matlab script PCA4U2 used to generate ISPCA results (ZIP)

Full experimental details, statistical analysis, and computational analysis (PDF)

Contributor Information

Xin Yi See, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

Xuelan Wen, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

T. Alexander Wheeler, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

Channing K. Klein, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

Jason D. Goodpaster, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

Benjamin R. Reiner, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

Ian A. Tonks, Department of Chemistry, University of Minnesota–Twin Cities, Minneapolis, Minnesota 55455, United States.

REFERENCES

  • (1).Fey N Lost in chemical space? Maps to support organometallic catalysis. Chem. Cent. J 2015, 9, 38. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (2).Wu K; Doyle AG Parameterization of phosphine ligands demonstrates enhancement of nickel catalysis via remote steric effects. Nat. Chem 2017, 9, 779–784. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (3).Toste FD; Sigman MS; Miller SJ Pursuit of Noncovalent Interactions for Strategic Site-Selective Catalysis. Acc. Chem. Res 2017, 50, 609–615. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (4).Sigman MS; Harper KC; Bess EN; Milo A The Development of Multidimensional Analysis Tools for Asymmetric Catalysis and Beyond. Acc. Chem. Res 2016, 49, 1292–1301. [DOI] [PubMed] [Google Scholar]
  • (5).Weissman SA; Anderson NG Design of Experiments (DoE) and Process Optimization. A Review of Recent Publications. Org. Process Res. Dev 2015, 19, 1605–1633. [Google Scholar]
  • (6).Murray PM; Bellany F; Benhamou L; Bučar D-K; Tabor AB; Sheppard TD The application of design of experiments (DoE) reaction optimization and solvent selection in the development of new synthetic chemistry. Org. Biomol. Chem 2016, 14, 2373–2384. [DOI] [PubMed] [Google Scholar]
  • (7).Isbrandt ES; Sullivan RJ; Newman SG High Throughput Strategies for the Discovery and Optimization of Catalytic Reactions. Angew. Chem, Int. Ed 2019, 58, 7180–7191. [DOI] [PubMed] [Google Scholar]
  • (8).Robbins DW; Hartwig JF A Simple, Multidimensional Approach to High-Throughput Discovery of Catalytic Reactions. Science 2011, 333, 1423–1427. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (9).McNally A; Prier CK; MacMillan DWC Discovery of an α-Amino C–H Arylation Reaction Using the Strategy of Accelerated Serendipity. Science 2011, 334, 1114–1117. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (10).Hammett LP The Effect of Structure upon the Reactions of Organic Compounds. Benzene Derivatives. J. Am. Chem. Soc 1937, 59, 96–103. [Google Scholar]
  • (11).Tolman CA Steric effects of phosphorus ligands in organometallic chemistry and homogeneous catalysis. Chem. Rev 1977, 77, 313–348. [Google Scholar]
  • (12).Clavier H; Nolan SP Percent buried volume for phosphine and N-heterocyclic carbene ligands: steric properties in organometallic chemistry. Chem. Commun 2010, 46, 841–861. [DOI] [PubMed] [Google Scholar]
  • (13).Harper KC; Bess EN; Sigman MS Multidimensional steric parameters in the analysis of asymmetric catalytic reactions. Nat. Chem 2012, 4, 366–374. [DOI] [PubMed] [Google Scholar]
  • (14).Raugei S; DuBois DL; Rousseau R; Chen S; Ho M-H; Bullock RM; Dupuis M Toward Molecular Catalysts by Computer. Acc. Chem. Res 2015, 48, 248–255. [DOI] [PubMed] [Google Scholar]
  • (15).Wheeler SE; Seguin TJ; Guan Y; Doney AC Noncovalent Interactions in Organocatalysis and the Prospect of Computational Catalyst Design. Acc. Chem. Res 2016, 49, 1061–1069. [DOI] [PubMed] [Google Scholar]
  • (16).Reid JP; Sigman MS Comparing quantitative prediction methods for the discovery of small-molecule chiral catalysts. Nat. Rev. Chem 2018, 2, 290–305. [Google Scholar]
  • (17).Occhipinti G; Bjørsvik H-R; Jensen VR Quantitative Structure-Activity Relationships of Ruthenium Catalysts for Olefin Metathesis. J. Am. Chem. Soc 2006, 128, 6952–6964. [DOI] [PubMed] [Google Scholar]
  • (18).Chu Y; Heyndrickx W; Occhipinti G; Jensen VR; Alsberg BK An Evolutionary Algorithm for de Novo Optimization of Functional Transition Metal Compounds. J. Am. Chem. Soc 2012, 134, 8885–8895. [DOI] [PubMed] [Google Scholar]
  • (19).Lakuntza O; Besora M; Maseras F Searching for Hidden Descriptors in the Metal–Ligand Bond through Statistical Analysis of Density Functional Theory (DFT) Results. Inorg. Chem 2018, 57, 14660–14670. [DOI] [PubMed] [Google Scholar]
  • (20).Burai Patrascu M; Pottel J; Pinus S; Bezanson M; Norrby P-O; Moitessier N From desktop to benchtop with automated computational workflows for computer-aided design in asymmetric catalysis. Nature Catalysis 2020, 3, 574–584. [Google Scholar]
  • (21).Peng Q; Duarte F; Paton RS Computing organic stereoselectivity – from concepts to quantitative calculations and predictions. Chem. Soc. Rev 2016, 45, 6093–6107. [DOI] [PubMed] [Google Scholar]
  • (22).Ahneman DT; Estrada JG; Lin S; Dreher SD; Doyle AG Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 360, 186–190. [DOI] [PubMed] [Google Scholar]
  • (23).Estrada JG; Ahneman DT; Sheridan RP; Dreher SD; Doyle AG Response to Comment on “Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 362, No. eaat8763. [DOI] [PubMed] [Google Scholar]
  • (24).Chuang KV; Keiser MJ Comment on “Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 362, No. eaat8603. [DOI] [PubMed] [Google Scholar]
  • (25).Nielsen MK; Ahneman DT; Riera O; Doyle AG Deoxyfluorination with Sulfonyl Fluorides: Navigating Reaction Space with Machine Learning. J. Am. Chem. Soc 2018, 140, 5004–5008. [DOI] [PubMed] [Google Scholar]
  • (26).Zahrt AF; Henle JJ; Rose BT; Wang Y; Darrow WT; Denmark SE Prediction of higher-selectivity catalysts by computer-driven workflow and machine learning. Science 2019, 363, No. eaau5631. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (27).Foscato M; Jensen VR Automated in Silico Design of Homogeneous Catalysts. ACS Catal. 2020, 10, 2354–2377. [Google Scholar]
  • (28).Toyao T; Maeno Z; Takakusagi S; Kamachi T; Takigawa I; Shimizu K.-i. Machine Learning for Catalysis Informatics: Recent Applications and Prospects. ACS Catal. 2020, 10, 2260–2297. [Google Scholar]
  • (29).Reid JP; Sigman MS Holistic prediction of enantioselectivity in asymmetric catalysis. Nature 2019, 571, 343–348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (30).Reid JP; Proctor RSJ; Sigman MS; Phipps RJ Predictive Multivariate Linear Regression Analysis Guides Successful Catalytic Enantioselective Minisci Reactions of Diazines. J. Am. Chem. Soc 2019, 141, 19178–19185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (31).Kowalski BR Chemometrics: Views and Propositions. J. Chem. Inf. Model 1975, 15, 201–203. [Google Scholar]
  • (32).Lourenço ND; Lopes JA; Almeida CF; Sarraguça MC; Pinheiro HM Bioreactor monitoring with spectroscopy and chemometrics: a review. Anal. Bioanal Chem 2012, 404, 1211–1237. [DOI] [PubMed] [Google Scholar]
  • (33).Macnaughtan D; Rogers LB; Wernimont G Principal-component analysis applied to chromatographic data. Anal. Chem 1972, 44, 1421–1427. [Google Scholar]
  • (34).Bro R; Smilde AK Principal component analysis. Anal. Methods 2014, 6, 2812–2831. [Google Scholar]
  • (35).García-Muelas R; López N Statistical learning goes beyond the d-band model providing the thermochemistry of adsorbates on transition metals. Nat. Commun 2019, 10, 4687. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (36).Piironen J; Vehtari A In Iterative Supervised Principal Components International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research; PMLR, April 9–11, 2018; Storkey A, Perez-Cruz F, Eds.; Playa Blanca, Lanzarote, Canary Islands, 2018. [Google Scholar]
  • (37).Fey N; Orpen AG; Harvey JN Building ligand knowledge bases for organometallic chemistry: Computational description of phosphorus(III)-donor ligands and the metal–phosphorus bond. Coord. Chem. Rev 2009, 253, 704–722. [Google Scholar]
  • (38).Fey N; Harvey JN; Lloyd-Jones GC; Murray P; Orpen AG; Osborne R; Purdie M Computational Descriptors for Chelating P,P- and P,N-Donor Ligands1. Organometallics 2008, 27, 1372–1383. [Google Scholar]
  • (39).Durand DJ; Fey N Computational Ligand Descriptors for Catalyst Design. Chem. Rev 2019, 119, 6561–6594. [DOI] [PubMed] [Google Scholar]
  • (40).Fey N; Tsipis AC; Harris SE; Harvey JN; Orpen AG; Mansson RA Development of a Ligand Knowledge Base, Part 1: Computational Descriptors for Phosphorus Donor Ligands. Chem. - Eur. J 2006, 12, 291–302. [DOI] [PubMed] [Google Scholar]
  • (41).Jover J; Fey N; Harvey JN; Lloyd-Jones GC; Orpen AG; Owen-Smith GJJ; Murray P; Hose DRJ; Osborne R; Purdie M Expansion of the Ligand Knowledge Base for Chelating P,P-Donor Ligands (LKB-PP). Organometallics 2012, 31, 5302–5306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (42).Jover J; Fey N; Harvey JN; Lloyd-Jones GC; Orpen AG; Owen-Smith GJJ; Murray P; Hose DRJ; Osborne R; Purdie M Expansion of the Ligand Knowledge Base for Monodentate P-Donor Ligands (LKB-P). Organometallics 2010, 29, 6245–6258. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (43).Fey N; Haddow MF; Harvey JN; McMullin CL; Orpen AG A ligand knowledge base for carbenes (LKB-C): maps of ligand space. Dalton Trans. 2009, 8183–8196. [DOI] [PubMed] [Google Scholar]
  • (44).Mansson RA; Welsh AH; Fey N; Orpen AG Statistical Modeling of a Ligand Knowledge Base. J. Chem. Inf. Model 2006, 46, 2591–2600. [DOI] [PubMed] [Google Scholar]
  • (45).Fey N; Garland M; Hopewell JP; McMullin CL; Mastroianni S; Orpen AG; Pringle PG Stable Fluorophosphines: Predicted and Realized Ligands for Catalysis. Angew. Chem., Int. Ed 2012, 51, 118–122. [DOI] [PubMed] [Google Scholar]
  • (46).Landman IR; Paulson ER; Rheingold AL; Grotjahn DB; Rothenberg G Designing bifunctional alkene isomerization catalysts using predictive modelling. Catal Sci. Technol 2017, 7, 4842–4851. [Google Scholar]
  • (47).Maldonado AG; Rothenberg G Predictive modeling in homogeneous catalysis: a tutorial. Chem. Soc. Rev 2010, 39, 1891–1902. [DOI] [PubMed] [Google Scholar]
  • (48).Gilbert ZW; Hue RJ; Tonks IA Catalytic formal [2 + 2 + 1] synthesis of pyrroles from alkynes and diazenes via TiII/TiIV redox catalysis. Nat. Chem 2016, 8, 63–68. [DOI] [PubMed] [Google Scholar]
  • (49).Ricker JD; Geary LM Recent Advances in the Pauson–Khand Reaction. Top. Catal 2017, 60, 609–619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (50).Miura H; Takeuchi K; Shishido T Intermolecular [2 + 2 + 1] Carbonylative Cycloaddition of Aldehydes with Alkynes, and Subsequent Oxidation to γ-Hydroxybutenolides by a Supported Ruthenium Catalyst. Angew. Chem. Int. Ed 2016, 55, 278–282. [DOI] [PubMed] [Google Scholar]
  • (51).Jin H; Rudolph M; Rominger F; Hashmi ASK The Carbocation-Catalyzed Intermolecular Formal [2 + 2 + 1] Cycloaddition of Ynamides with Quinoxaline N-Oxides. ACS Catal. 2019, 9, 11663–11668. [Google Scholar]
  • (52).Heinrich MR; Steglich W; Banwell MG; Kashman Y Total synthesis of the marine alkaloid halitulin. Tetrahedron 2003, 59, 9239–9247. [Google Scholar]
  • (53).Kashman Y; Koren-Goldshlager G; Gravalos MDG; Schleyer M Halitulin, a new cytotoxic alkaloid from the marine sponge Haliclona tulearensis. Tetrahedron Lett. 1999, 40, 997–1000. [Google Scholar]
  • (54).Banwell MG; Bray AM; Edwards AJ; Wong DJ Rapid and convergent assembly of the polycyclic framework assigned to the cytotoxic marine alkaloid halitulin. J. Chem. Soc., Perkin Trans. 1 2002, 1340–1343. [Google Scholar]
  • (55).Young IS; Thornton PD; Thompson A Synthesis of natural products containing the pyrrolic ring. Nat. Prod. Rep 2010, 27, 1801–1839. [DOI] [PubMed] [Google Scholar]
  • (56).Imbri D; Tauber J; Opatz T Synthetic approaches to the lamellarins–a comprehensive review. Mar. Drugs 2014, 12, 6142–6177. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (57).Bailly C Anticancer properties of lamellarins. Mar. Drugs 2015, 13, 1105–1123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (58).Davis-Gilbert ZW; Wen X; Goodpaster JD; Tonks IA Mechanism of Ti-Catalyzed Oxidative Nitrene Transfer in [2 + 2 + 1] Pyrrole Synthesis from Alkynes and Azobenzene. J. Am. Chem. Soc 2018, 140, 7267–7281. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (59).Davis-Gilbert ZW; Yao LJ; Tonks IA Ti-Catalyzed Multicomponent Oxidative Carboamination of Alkynes with Alkenes and Diazenes. J. Am. Chem. Soc 2016, 138, 14570–14573. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (60).Behenna DC; Stoltz BM; Gooßen L Natural Products as Inspiration for Reaction Development: Catalytic Enantioselective Decarboxylative Reactions of Prochiral Enolate Equivalents. Top. Organomet. Chem 2012, 44, 281–313. [Google Scholar]
  • (61).Jansen K Chemists share their lab superstitions; https://cen.acs.org/people/Chemists-share-lab-superstitions/96/i44 (accessed April 8, 2020). [Google Scholar]
  • (62).Green MLH A new approach to the formal classification of covalent compounds of the elements. J. Organomet. Chem 1995, 500, 127–148. [Google Scholar]
  • (63).DiFranco SA; Maciulis NA; Staples RJ; Batrice RJ; Odom AL Evaluation of Donor and Steric Properties of Anionic Ligands on High Valent Transition Metals. Inorg. Chem 2012, 51, 1187–1200. [DOI] [PubMed] [Google Scholar]
  • (64).Billow BS; McDaniel TJ; Odom AL Quantifying ligand effects in high-oxidation-state metal catalysis. Nat. Chem 2017, 9, 837–842. [DOI] [PubMed] [Google Scholar]
  • (65).Conducting ISPCA using catalyst selectivity described as ΔΔG‡ produced the following results: The grouping delivering the lowest residuals contained five descriptors and accounted for 94% of the data variance, and the regression statistics were satisfactory (R² = 0.90, Q² = 0.86). The five descriptors had comparable regression coefficients: catalyst LUMO (0.0243), free pyridine o-13C NMR shift (−0.0697), catalyst-bound pyridine p-13C NMR shift (0.0170), free pyridine proton affinity (0.0173), and Ti–Npy bond length (0.0404). This model corroborates the results from analysis using selectivity ratios: namely, that selective catalysts bear ligands that are both electron rich and sterically bulky. Notably, the influence of the X-type ligands is not well captured in the model employing selectivity described as ΔΔG‡, which may be a consequence of scaling selectivity by the logarithmic function.
  • (66).Bair E; Hastie T; Paul D; Tibshirani R Prediction by Supervised Principal Components. J. Am. Stat. Assoc 2006, 101, 119–137. [Google Scholar]
  • (67).Park Y; Niemeyer ZL; Yu J-Q; Sigman MS Quantifying Structural Effects of Amino Acid Ligands in Pd(II)-Catalyzed Enantioselective C–H Functionalization Reactions. Organometallics 2018, 37, 203–210. [Google Scholar]
  • (68).Stumpfe D; Hu H; Bajorath J Evolving Concept of Activity Cliffs. ACS Omega 2019, 4, 14360–14368. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (69).Delchambre L Weighted principal component analysis: a weighted covariance eigendecomposition approach. Mon. Not. R. Astron. Soc 2015, 446, 3545–3555. [Google Scholar]
  • (70).Each additional catalyst adds an exponential increase in computational time (5 catalysts require 720 models). This more exhaustive experiment highlights an important facet of statistical analysis: computation time. MLR analysis required ~30 s on average, while ISPCA required ~300 s, on a commercial laptop. Although MLR analysis produced models about 1 order of magnitude faster than ISPCA, neither method required more than 10 min with any training set. Given that catalyst synthesis and purification often require hours or even days and are usually the research bottleneck in catalyst optimization, both ISPCA and MLR are valuable tools in facilitating the catalyst search process.
  • (71).Whitesides GM; Ehmann WJ The Mechanism of Formation of 1,2,3,4-Tetramethylnaphthalene from 2-Butyne and Triphenyltris(tetrahydrofuran)chromium(III). J. Am. Chem. Soc 1970, 92, 5625–5640. [Google Scholar]
  • (72).Pennington DA; Horton PN; Hursthouse MB; Bochmann M; Lancaster SJ Synthesis and catalytic activity of dinuclear imido titanium complexes: the molecular structure of [Ti(NPh)Cl(μ-Cl)(THF)2]2. Polyhedron 2005, 24, 151–156. [Google Scholar]
  • (73).See XY; Beaumier EP; Davis-Gilbert ZW; Dunn PL; Larsen JA; Pearce AJ; Wheeler TA; Tonks IA Generation of TiII Alkyne Trimerization Catalysts in the Absence of Strong Metal Reductants. Organometallics 2017, 36, 1383–1390. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • (74).Frisch MJ; Trucks GW; Schlegel HB; Scuseria GE; Robb MA; Cheeseman JR; Scalmani G; Barone V; Petersson GA; Nakatsuji H; Li X; Caricato M; Marenich AV; Bloino J; Janesko BG; Gomperts R; Mennucci B; Hratchian HP; Ortiz JV; Izmaylov AF; Sonnenberg JL; Williams-Young D; Ding F; Lipparini F; Egidi F; Goings J; Peng B; Petrone A; Henderson T; Ranasinghe D; Zakrzewski VG; Gao J; Rega N; Zheng G; Liang W; Hada M; Ehara M; Toyota K; Fukuda R; Hasegawa J; Ishida M; Nakajima T; Honda Y; Kitao O; Nakai H; Vreven T; Throssell K; Montgomery JA Jr.; Peralta JE; Ogliaro F; Bearpark MJ; Heyd JJ; Brothers EN; Kudin KN; Staroverov VN; Keith TA; Kobayashi R; Normand J; Raghavachari K; Rendell AP; Burant JC; Iyengar SS; Tomasi J; Cossi M; Millam JM; Klene M; Adamo C; Cammi R; Ochterski JW; Martin RL; Morokuma K; Farkas O; Foresman JB; Fox DJ Gaussian 16, Rev. C.01; Gaussian Inc.: Wallingford, CT, 2016. [Google Scholar]
  • (75).Zhao Y; Truhlar DG The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, non-covalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals. Theor. Chem. Acc 2008, 120, 215–241. [Google Scholar]
  • (76).Weigend F; Ahlrichs R Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Phys. Chem. Chem. Phys 2005, 7, 3297–3305. [DOI] [PubMed] [Google Scholar]
  • (77).Marenich AV; Cramer CJ; Truhlar DG Universal Solvation Model Based on Solute Electron Density and on a Continuum Model of the Solvent Defined by the Bulk Dielectric Constant and Atomic Surface Tensions. J. Phys. Chem. B 2009, 113, 6378–6396. [DOI] [PubMed] [Google Scholar]
