Abstract
Carbon dioxide (CO2) can be transformed into valuable chemical building blocks, including C2-carboxylated 1,3-azoles, which have potential applications in pharmaceuticals, cosmetics, and pesticides. However, only a small fraction of the millions of available 1,3-azoles are carboxylated at the C2 position, highlighting significant opportunities for further research in the synthesis and application of these compounds. In this study, we utilized a supervised machine learning approach to predict reaction yields for a data set of amide-coupled C2-carboxylated 1,3-azoles. To facilitate molecular design, we integrated an interpretable heat-mapping algorithm named PIXIE (Predictive Insights and Xplainability for Informed chemical space Exploration). PIXIE visualizes the influence of molecular substructures on predicted yields by leveraging fingerprint bit importances, providing synthetic chemists with a powerful tool for the rational design of molecules. While heat mapping is an established technique, its integration with a machine-learning model tailored to the chemical space of C2-carboxylated 1,3-azoles represents a significant advancement. This approach not only enables targeted exploration of this underrepresented chemical space, fostering the discovery of new bioactive compounds, but also demonstrates the potential of combining these methods for broader applications in other chemical domains.
1. Introduction
While carbon dioxide (CO2) is a key driver of climate change,1−3 it is also recognized as a versatile carbon source in organic chemistry due to its role as a C1 building block and its abundant availability and production.4−8 Consequently, synthesis protocols that transform CO2 into valuable molecules are of great interest.5−7,9−14 Among these valuable molecules are 1,3-azoles, a prominent subgroup within the broader class of azoles.15 1,3-Azoles and their C2-carboxylated derivatives are already known to be relevant as anticoagulants, herbicides, fungicides, and aroma compounds.16−18 C2-carboxylated 1,3-azoles could also enable new applications as prodrugs and propesticides, as well as in treatments for Bartter’s disease or breast cancer.15
Despite their advantages, only about 20,000 C2-carboxylated azoles are commercially available, compared to almost 2.5 million C2-unsubstituted 1,3-azoles (CAS SciFinder search in 07/2024; Figure 1).15,19 This highlights significant potential for expanding the chemical space of these compounds, since access to a broader variety of molecules facilitates the discovery of new bioactive compounds and, therefore, drugs.
Figure 1.

Number (in millions) of commercially available compounds that feature a 1,3-azole substructure (with X = NR, O, S), excluding compounds with a molecular weight exceeding 900 Da, as well as compounds containing metal ions or isotopes.
Felten et al. developed a mild, functional-group-tolerant synthesis protocol for the carboxylation of 1,3-azoles, followed by a subsequent amide coupling reaction (Figure 2).19 The results of this protocol were reported as liquid chromatography area percent (LCAP) yields. One data set from the study comprises 288 LCAP yields, obtained by combinatorially pairing 24 1,3-azoles (ten thiazoles, nine oxazoles, five imidazoles) with 12 amines, see Figure 2. The LCAP values of this combinatorial set range from 0 to 78%.
Figure 2.
Synthesis protocol developed by Felten et al.19
By utilizing computer-aided synthesis planning (CASP) tools, time and resources can be saved by focusing on the most promising synthesis routes or adapting protocols to achieve the desired outcome.20−28 Building on the data set of Felten et al. (Figure 3) and its associated synthesis protocol, we aim to develop an efficient CASP tool to assist synthetic chemists in expanding the chemical space of C2-carboxylated 1,3-azoles by predicting reaction yields (Figure 2).
Figure 3.
1,3-Azoles (red boxes) and amines (gray box) considered in this work. The numbering was adopted from Felten et al.19 Oxazoles are displayed in light red, thiazoles in medium red, and imidazoles in dark red.
Machine learning offers a suitable framework for developing such a CASP tool, assuming sufficient data is available.20,29−31 To illustrate the current state of CASP tools incorporating machine learning for yield prediction, examples from the literature are summarized in Table 1. However, the models referred to in this table are often not readily interpretable due to the complexity of their underlying architectures.32
Table 1. Overview of Previous Machine-Learning-Based Yield Prediction Studies.
| publication | model architecture | data set | # reactions | RMSE | R2 | ref. |
|---|---|---|---|---|---|---|
| Ahneman et al. | Random Forest | Buchwald–Hartwig (BH) Reactions | 4 608 | 7.8% | 0.92 | (33) |
| Granda et al. | Neural Network | Suzuki–Miyaura Reactions | 5 760 | 11% | | (34, 35) |
| Nielsen et al. | Random Forest | Deoxyfluorination | 640 | 7.4% | 0.93 | (36) |
| Jiang et al. | Neural Network | USPTO Patent | 269 132 | 23.0% | | (37) |
| Yarish et al. | Neural Network | Enamine Frequent Reactions | 29 904 | | 0.18 | (38) |
| Saebi et al. | Random Forest | BH Reactions from AstraZeneca | 781 | | 0.27 | (39) |
| Chen et al. | Multimodal Model | Amide Coupling Reactions | 41 239 | 19.3% | 0.26 | (40) |
| Schleinitz et al. | Random Forest | Nickel-Catalyzed Cross-Couplings | 1 406 | 22.6% | 0.54 (r2) | (41) |
Interpretable models have the advantage that prediction outcomes can be directly linked to the input data,42 enabling human experts to draw conclusions or develop hypotheses that rigid models designed for specific tasks cannot provide. This linkage can be achieved using models whose coefficients quantify the influence of specific chemical features or properties on the prediction outcome in combination with explainable descriptors.32,43 The simplest and most interpretable class of models comprises linear models, which can be further differentiated by the objective function used to determine their optimal coefficients. Examples include least-squares, regularized least-squares (also known as “ridge”) and the least absolute shrinkage and selection operator (LASSO).44
These models contrast with nonlinear “black-box” models where the prediction outcomes cannot as easily be linked to the influence of the input data but often achieve better performance on more complex data sets.32 Examples of such model architectures include neural networks and Gaussian processes.32 The latter, however, offer partial interpretability by providing predictive probability distributions instead of single-valued predictions.32,45 The width of each distribution serves as a measure of the model’s uncertainty about its own predictions.
One way to determine feature importances for such “black-box” models is the SHapley Additive exPlanations (SHAP) method.46 Given a trained model, SHAP assigns each feature a value that quantifies its contribution to a prediction. These values are rooted in the Shapley values of cooperative game theory: the model’s output is compared between feature subsets (“coalitions”) that include or exclude a given feature, and the feature’s marginal contributions are averaged over all such coalitions. Since evaluating every coalition exactly is rarely feasible, efficient approximations are used in practice.
In this work, we develop a yield prediction model based on the data set from Felten et al.19 to support synthetic chemists in expanding the chemical space of C2-carboxylated 1,3-azoles. To ensure interpretability, we integrate a heat-mapping algorithm named PIXIE (Predictive Insights and Xplainability for Informed chemical space Exploration), which visualizes the relationship between structural motifs and prediction outcomes. We apply our CASP tool featuring PIXIE to the ZINC database, demonstrating its practical utility in molecular design and targeted expansion of the chemical space of C2-carboxylated 1,3-azoles and beyond.
2. Methods
2.1. Data
2.1.1. Labeled Data from Merck & Co
In the synthesis optimization study by Felten et al., LCAP yields of the coupling of 24 azoles with 12 amines were reported.19 The synthesis of amine 47 followed a different protocol than the standard one used for the other reagents, leading to the exclusion of its yields from our analysis. This results in 264 (instead of 288) data points for our study. The Cartesian nuclear coordinates and SMILES47 codes of the reactants were created manually based on the data given in the publication. The Cartesian nuclear coordinates were additionally subjected to a force field optimization.
2.1.2. Unlabeled Data from ZINC20
The ZINC20 “In-stock” database was downloaded on 2023/11/02 and preprocessed according to our previous work.15 This database contains molecules listed as “purchasable” from different vendors. Based on this preprocessed database, a substructure search was performed to identify 1,3-azole amides. SMARTS codes (O=C(N)c1nccs1, O=C(n)c1nccs1, O=C(N)c1ncco1, O=C(n)c1ncco1, O=C(N)c1nccn1, and O=C(n)c1nccn1) were used for this purpose. The resulting subset contains 5708 1,3-azole amides on which predictions can be performed.
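Such a substructure filter can be sketched in a few lines with RDKit. The pattern below is one of the six SMARTS codes listed above; the candidate SMILES are hypothetical stand-ins for preprocessed ZINC entries, not actual database records:

```python
from rdkit import Chem

# One of the six SMARTS patterns above (thiazole amide with an N-H amide)
pattern = Chem.MolFromSmarts("O=C(N)c1nccs1")

# Hypothetical candidate SMILES standing in for preprocessed ZINC entries
candidates = ["O=C(NCc2ccccc2)c1nccs1", "CCO", "c1ccsc1"]
hits = [smi for smi in candidates
        if Chem.MolFromSmiles(smi).HasSubstructMatch(pattern)]
print(hits)  # ['O=C(NCc2ccccc2)c1nccs1']
```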
2.2. Descriptors
We employed different types of descriptors (Table 2) to determine the most suitable ones for yield prediction and interpretation. In this way, we cover a wide range of descriptors with different advantages and disadvantages. A full list of the descriptors employed in this work can be found in the Supporting Information (see results-of-grid-search.xlsx, which is available at https://git.rz.tu-bs.de/proppe-group/yield-prediction).
Table 2. Overview of Descriptors Used in This Work.
| descriptor type | # descriptors | cost |
|---|---|---|
| RDKit Descriptors | 209 | + |
| QM Descriptors | 5 | ++++ |
| Fingerprints | 3 | ++ |
| Many-Body Descriptors | 3 | +++ |
2.2.1. Property Descriptors
2.2.1.1. From RDKit
All RDKit descriptors were calculated from the SMILES47 strings of the amide-coupling products using the RDKit Python toolkit (version 2023.03.1b1).48 The rdkit.Chem.Descriptors module from RDKit contains 209 molecular properties, which are widely used as features in chemical machine learning.49 These descriptors capture properties such as molecular weight, counts of different functional groups, partial charges, and relative atomic masses.48 Due to their manageable complexity, we primarily used RDKit molecular property descriptors to represent the chemical space in this project, employing principal component analysis (PCA) for dimensionality reduction.
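The dimensionality-reduction step can be sketched as follows. A random matrix stands in for the actual 264 × 209 RDKit descriptor matrix, which in the real workflow would be computed with rdkit.Chem.Descriptors:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def project_descriptors(X, n_components=2):
    """Standardize a (molecules x descriptors) matrix and project it with PCA."""
    X_scaled = StandardScaler().fit_transform(X)
    pca = PCA(n_components=n_components)
    return pca.fit_transform(X_scaled), pca

# Stand-in for the 209 RDKit property descriptors of the 264 products
rng = np.random.default_rng(0)
X = rng.normal(size=(264, 209))
coords, pca = project_descriptors(X)
print(coords.shape)  # (264, 2)
```

Standardizing before PCA keeps descriptors on very different scales (e.g., molecular weight vs. partial charges) from dominating the principal components.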
2.2.1.2. QM-Derived
Quantum mechanics (QM)-derived descriptors provide detailed information on the electronic properties of molecules and the energetics of chemical reactions, making them valuable for machine learning. While generating QM descriptors is computationally expensive and limits the model’s ability to make fast (ideally “real-time”) predictions, their accuracy and detail can outweigh this drawback.
To include information about the reaction mechanism, the Gibbs free energy (ΔG) of the product formation step was calculated and used as a descriptor. This step was chosen because it directly involves the amine, ensuring that each product in the data set has a unique ΔG value. In contrast, ΔG values for other steps in the mechanism are indistinguishable across products derived from the same azole and different amines. Additional QM-derived descriptors include electronic properties such as HOMO and LUMO energies, dipole moments, electronegativity, and hardness of the reaction products.
QM-optimized structures and properties were obtained from a series of calculations, the first step of which was a conformational search using CREST50 (version 2.12) with the GFN2-xTB method.51 The lowest-energy conformer was used as input for subsequent density functional theory calculations performed in ORCA 5.0.52,53 We employed the B3LYP54 functional incorporating D3(BJ)-type55,56 dispersion corrections, balancing computational efficiency with structural and energetic accuracy.57 Initial optimizations used the def2-SVP basis set,58 followed by refined optimizations with def2-TZVP.58,59 Harmonic frequency calculations confirmed structural minima and provided ΔG values (cf. Equation 7 in ref. (60)). Structures whose imaginary frequencies were below 100 cm–1 in magnitude were considered optimized, as such low-lying imaginary modes are treated as numerical artifacts.59
To account for solvation effects, the CPCM solvation model was employed.61 The solvent mixture used in the product-forming step, as described by Felten et al., consists of 1,2-dimethoxyethane (DME) and ethyl acetate (EtOAc) in a 1:4 ratio.19 The dielectric constant (6.26) and refractive index (1.374) of this mixture were calculated as weighted averages of the pure solvents’ dielectric constants (DME, 7.20; EtOAc, 6.02) and refractive indices (DME, 1.380; EtOAc, 1.372).62−65 The temperature was set to 303.15 K, consistent with the experimental conditions of the amide coupling step.19
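The mixture parameters quoted above follow directly from linear mixing over the 1:4 ratio (i.e., fractions 0.2 and 0.8). A minimal sketch:

```python
def mixture_property(values, fractions):
    """Weighted average of a pure-component property over a solvent mixture."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "fractions must sum to 1"
    return sum(v * f for v, f in zip(values, fractions))

# 1:4 DME/EtOAc -> fractions 0.2 and 0.8 (linear mixing, as in the text)
eps = mixture_property([7.20, 6.02], [0.2, 0.8])    # dielectric constants
n = mixture_property([1.380, 1.372], [0.2, 0.8])    # refractive indices
print(round(eps, 2), round(n, 3))  # 6.26 1.374
```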
2.2.2. Structural Descriptors
2.2.2.1. Fingerprints
Molecular fingerprints are one way to encode molecules as binary vectors that represent their chemical structure. Following the approach by Haywood et al.,66 three different types of fingerprints were used for each amide-coupling product: the MACCS fingerprint,67 the RDK fingerprint,48 and the Morgan fingerprint.68
The fingerprints were calculated from the SMILES codes of the amide-coupling products using the RDKit Python toolkit (version 2023.03.1b1).48 A fingerprint length of 2048 bits was used for the RDK and Morgan fingerprints, with a radius of 2 for the Morgan fingerprint. All other settings were kept at their default values.
The RDK fingerprint is based on bond paths and, in this case, includes substructures containing up to seven bonds.48 These substructures are transformed into “on bits” within the fingerprint using a hash function. While the use of a hash function prevents the direct reconstruction of a molecule from its fingerprint, information about which substructure corresponds to which bit can be retrieved through the bitInfo map during fingerprint generation.
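Generating the RDK fingerprint with the settings used here and retrieving the bitInfo map can be sketched as follows; the molecule is a hypothetical small thiazole-2-carboxamide, not one of the study compounds:

```python
from rdkit import Chem

# Hypothetical small thiazole-2-carboxamide as a test molecule
mol = Chem.MolFromSmiles("O=C(NC)c1nccs1")

bit_info = {}
# Settings from the text: 2048 bits, bond paths of up to seven bonds
fp = Chem.RDKFingerprint(mol, maxPath=7, fpSize=2048, bitInfo=bit_info)

print(fp.GetNumBits())  # 2048
# Each "on" bit maps to one or more bond-index paths that set it
bit, paths = next(iter(bit_info.items()))
print(bit, paths)
```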
2.2.2.2. Many-Body Descriptors
Many-body descriptors, which encode and differentiate between the three-dimensional structural features of a molecule, provide more steric information than property descriptors or fingerprints. However, these descriptors are generally less interpretable.
The many-body descriptors considered in this work were calculated on the QM-optimized structures of the reaction products resulting from the protocol described in Section 2.2.1. While this may increase the accuracy in comparison to learning on unoptimized structures, it also increases the computational cost of the descriptor generation process. In cases where optimized structures are not already available as a byproduct of, e.g., QM-derived property descriptors, it is advisable to examine the suitability of unoptimized structures for generating many-body descriptors.
The Coulomb Matrix (CM) is constructed from atomic energies (diagonal elements) and the internuclear Coulomb potential, which depends on the charges of and distances between atomic nuclei (off-diagonal elements).69 Here, we considered the sorted eigenvalues (by absolute value) of the CM.
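A minimal implementation of this descriptor, using the standard Coulomb-matrix definition (diagonal 0.5 Z^2.4, off-diagonal Z_i Z_j / r_ij) and the absolute-value eigenvalue sorting described above:

```python
import numpy as np

def coulomb_matrix_eigenvalues(Z, R):
    """Eigenvalues of the Coulomb matrix, sorted by absolute value (descending).

    Z: nuclear charges, shape (n,); R: Cartesian coordinates, shape (n, 3),
    in any consistent length unit (only rescales the off-diagonal terms).
    """
    Z = np.asarray(Z, float)
    R = np.asarray(R, float)
    n = len(Z)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4  # atomic self-energy term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    eig = np.linalg.eigvalsh(M)  # the Coulomb matrix is symmetric
    return eig[np.argsort(-np.abs(eig))]

# H2 at ~0.74 angstrom as a two-atom example
vals = coulomb_matrix_eigenvalues([1, 1], [[0, 0, 0], [0.74, 0, 0]])
print(vals)
```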
The “two-body forces” descriptor F2B by Pronobis et al. is also built upon the Coulomb potential but distinguishes between unique pairs of chemical elements.70 For instance, all inverse hydrogen–carbon distances are summed into the same feature of the F2B vector. In the original implementation, each unique element pair was assigned 15 features, corresponding to different orders of the inverse distance (from 1 to 15) to increase the descriptor’s flexibility. Here, we reduced computational cost by introducing only one feature per unique element pair, using an order of 1 as in the Coulomb potential.
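The simplified one-feature-per-element-pair variant described above can be sketched as follows (geometry and element labels are illustrative):

```python
import numpy as np
from itertools import combinations

def f2b(elements, R):
    """One feature per unique element pair: sum of inverse interatomic
    distances (order 1), a simplified variant of the F2B descriptor."""
    R = np.asarray(R, float)
    features = {}
    for i, j in combinations(range(len(elements)), 2):
        pair = tuple(sorted((elements[i], elements[j])))
        d = np.linalg.norm(R[i] - R[j])
        features[pair] = features.get(pair, 0.0) + 1.0 / d
    return features

# Water-like geometry (angstrom); both O-H contacts sum into one feature
feats = f2b(["O", "H", "H"], [[0, 0, 0], [0.96, 0, 0], [-0.24, 0.93, 0]])
print(feats)
```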
The many-body tensor representation (MBTR) descriptor provides a discretized spectrum of one-body (atomic numbers), two-body (distances), and three-body (angles) features.71 Each feature type occupies a separate segment of the spectrum. Atomic numbers, distances, and angles are first smoothed using Gaussian functions and then discretized into a finite vector. In this work, the MBTR descriptor was used with default settings, except for the number of discrete grid points, which was set to 20.
2.3. Model Building
The Python library scikit-learn (Version 1.3.2) was used for all model-building steps described in this work.72 To predict the LCAP yields, various regression models were evaluated: Gaussian process regression (GaussianProcessRegressor()) with the kernel ConstantKernel() * Matern()+WhiteKernel(), five restarts of the optimizer (n_restarts_optimizer=5), and normalized target values (normalize_y=True), gradient boosting regression (GradientBoostingRegressor()), linear least-squares regression (LinearRegression()), Bayesian ridge regression (BayesianRidge()), LASSO regression (Lasso()), multilayer perceptron regression (MLPRegressor()), and random forest regression (RandomForestRegressor()). These methods were selected for their balance between predictive power, computational efficiency, and widespread use in the field.73,74
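The Gaussian process configuration described above can be set up as follows; it is fitted here to synthetic stand-in data (random features and a noisy linear target), not to the actual yield data:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel
from sklearn.linear_model import BayesianRidge

# Gaussian process configured as described in the text
gpr = GaussianProcessRegressor(
    kernel=ConstantKernel() * Matern() + WhiteKernel(),
    n_restarts_optimizer=5,
    normalize_y=True,
)
ridge = BayesianRidge()

# Synthetic stand-in data (40 samples, 8 features)
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=40)

for model in (gpr, ridge):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X[:3]).shape)
```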
To find the best model architecture and descriptor combination, a grid search was performed using all descriptors listed in Section 2.2 and Table 2. The grid search results are provided in the Supporting Information (SI-Grid-Search.xlsx).
2.3.1. Data Set Preparation and Splitting
The data set preparation was standardized to ensure consistency across models. A test set containing 50 samples (20% of the data) was defined, leaving the remaining 214 data points for training and validation. For validation, the leave-one-out (LOO) method was used, resulting in 214 separate models, each trained on all but one data point. This method allowed the identification of the best-performing model and descriptor combination. Additional details on the data splitting strategy can be found in Section S2. Predicted yields below zero were set to 0 in all cases.
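The LOO procedure, including the clipping of negative predicted yields, can be sketched as follows; the data are a small synthetic stand-in, not the 214 training reactions:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import LeaveOneOut

def loo_mae(X, y):
    """Leave-one-out validation: one model per held-out sample.
    Negative predictions are set to 0, mirroring the treatment of
    negatively predicted yields described in the text."""
    errors = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model = BayesianRidge().fit(X[train_idx], y[train_idx])
        pred = np.clip(model.predict(X[test_idx]), 0.0, None)
        errors.append(abs(pred[0] - y[test_idx][0]))
    return float(np.mean(errors))

# Synthetic stand-in data (30 samples for speed)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
y = np.clip(X @ np.array([5.0, 3.0, -2.0, 1.0, 4.0]) + 30.0, 0.0, 100.0)
mae = loo_mae(X, y)
print(mae)
```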
2.3.2. Model Performance Evaluation
Model performance was evaluated using several metrics, including the median absolute error (AE50, eq 1), the mean absolute error (MAE, eq 2), the root-mean-square error (RMSE, eq 3), the maximum absolute error (AEmax, eq 4), and the coefficient of determination (R2, eq 5).
$$\mathrm{AE}_{50} = \underset{i}{\operatorname{median}}\,\left| y_i - \hat{y}_i \right| \tag{1}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{2}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2} \tag{3}$$

$$\mathrm{AE}_{\max} = \max_i \left| y_i - \hat{y}_i \right| \tag{4}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{N} \left( y_i - \bar{y} \right)^2} \tag{5}$$

Here, $y_i$ denotes the experimental and $\hat{y}_i$ the predicted LCAP yield of reaction $i$, $N$ is the number of reactions, and $\bar{y}$ is the mean of the experimental yields.
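These five metrics can be implemented directly; y_true and y_pred below are illustrative arrays, not data from this study:

```python
import numpy as np

def metrics(y_true, y_pred):
    """Evaluation metrics used in this work (eqs 1-5)."""
    err = np.abs(y_true - y_pred)
    ss_res = float(((y_true - y_pred) ** 2).sum())
    ss_tot = float(((y_true - y_true.mean()) ** 2).sum())
    return {
        "AE50": float(np.median(err)),              # eq 1
        "MAE": float(err.mean()),                   # eq 2
        "RMSE": float(np.sqrt((err ** 2).mean())),  # eq 3
        "AEmax": float(err.max()),                  # eq 4
        "R2": 1.0 - ss_res / ss_tot,                # eq 5
    }

# Illustrative arrays
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 40.0])
m = metrics(y_true, y_pred)
print(m)
```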
3. Results and Discussion
3.1. Yield Prediction
Table 3 lists the best-performing descriptor for each regression type, as measured by the MAE resulting from a LOO analysis of the training set. Multilayer perceptron regression with the Morgan fingerprint and Bayesian ridge regression with the RDK fingerprint perform identically according to this statistic, with an MAE of 7.1 yield percent. Due to the linearity of its predictive function, the Bayesian ridge model has the advantage of being more interpretable than the multilayer perceptron, which is a feedforward neural network with nonlinear activation functions. The following analysis pertains to the Bayesian ridge–RDK fingerprint model unless stated otherwise.
Table 3. Performance of the Best Descriptors for Each Regression Model Based on the MAE in 214 LOO Models.
| regression type | best descriptor | AE50 [%] | MAE [%] |
|---|---|---|---|
| Linear Least Squares | MACCS Fingerprint | 8.1 | 9.6 |
| LASSO | RDK Fingerprint | 5.9 | 7.9 |
| Bayesian Ridge | RDK Fingerprint | 5.6 | 7.1 |
| Random Forest | RDK Fingerprint | 5.4 | 7.3 |
| Gaussian Process | MACCS Fingerprint | 7.4 | 8.5 |
| Gradient Boosting | Morgan Fingerprint | 5.5 | 7.2 |
| Multi-Layer Perceptron | Morgan Fingerprint | 5.6 | 7.1 |
The MAE on the test set equals 7.4 yield percent (Table 4). The largest prediction errors occur at the lower end of the yield scale, whereas more accurate predictions are made at the upper end of the scale (Figure 4), which we consider a fortunate trend. The maximum absolute error (AEmax), an estimate of the worst-case scenario, amounts to 24 yield percent. While this error might be considered substantial, the model still appears to reliably predict the general direction of the reaction outcome.
Table 4. Performance of the Test Set (50 Molecules) and Fivefold Cross-Validation on the Training Set (214 molecules) as Described in Section 2.3.
| R2 | MAE | RMSE | AEmax | |
|---|---|---|---|---|
| test set | 0.69 | 7.4% | 9.9% | 23.6% |
| first fold | 0.67 | 7.7% | 10.0% | 26.8% |
| second fold | 0.72 | 5.8% | 8.0% | 23.6% |
| third fold | 0.80 | 5.1% | 6.8% | 21.9% |
| fourth fold | 0.63 | 7.0% | 9.3% | 30.8% |
| fifth fold | 0.66 | 8.1% | 10.7% | 27.8% |
| mean folds | 0.70 | 6.7% | 8.9% | 26.2% |
| std folds | 0.06 | 1.1% | 1.4% | 3.1% |
Figure 4.

Agreement between the predicted and experimental LCAP yields of the test set. The gray line represents ideal agreement, where predictions perfectly match the experimental values.
To ensure that the model is not biased toward the test set, a 5-fold cross-validation was performed on the training set (Table 4). The results, with an average MAE of 6.7% and a standard deviation of 1.1%, suggest that this specific type of bias is negligible.
Unlike the LCAP yields analyzed in this study, many synthetic laboratories commonly report isolated yields as a measure of reaction efficiency. To investigate the relationship between these two quantities, we examined the correlation between LCAP yields and isolated yields. In the study by Felten et al.,19 24 1,3-azoles were coupled with amine 50, and both LCAP and isolated yields were reported for these reactions. Notably, in all but one case, the isolated yields exceeded the LCAP yields (Figure S4). This observation suggests that the yields predicted by our model, based on LCAP data, are likely to reflect a conservative estimate, with isolated yields potentially being equal to or higher than the predicted values.
3.2. Strengths and Limitations of Our Model
Previous data sets used for training yield prediction models (Table 1) are generally more comprehensive than the training set of 214 reactions by Felten et al.19 (Merck & Co.) examined in this work. While the performance metrics of our model surpass those reported for the models developed by Granda et al.,34,35 Jiang et al.,37 Yarish et al.,38 Saebi et al.,39 Chen et al.40 and Schleinitz et al.,41 it is important to note that these models were evaluated on broader data sets covering a more diverse chemical space. The stronger performance observed in this study is likely due to the system-focused nature of the Merck data set (Figure 3), which narrows the scope of predictions. This specificity improves accuracy within the data set but comes at the cost of generalizability, as our model is not expected to perform well on reactions outside this limited chemical space.
Ahneman et al. reported performance metrics for a similarly small training set of 230 Buchwald–Hartwig couplings.33 Their model achieved an R2 of 0.68 and an RMSE of 15.3%, which is comparable to the performance of our model (R2 = 0.69, RMSE = 9.9%, cf. Table 4). The performance gap relative to their full model (R2 = 0.92, RMSE = 7.8%, see first entry in Table 1) highlights the potential for improved accuracy in our model if additional data were available for training.
While efforts were made to ensure that the test set is representative of the training set (Figure S2, right), the combinatorial nature of the Merck & Co. data set poses an additional challenge, as it inevitably results in shared structural motifs between the training and test sets. This overlap limits the model’s ability to predict yields for entirely new combinations of azoles and amines (Figure 3). To quantify this limitation, out-of-sample predictions were conducted in a leave-one-compound-out (LOCO) fashion. In each iteration, the model was trained on all but one azole and subsequently used to predict the yields for the excluded azole. The same procedure was applied to the amines. Table 5 summarizes the five best (0.68 ≤ R2 ≤ 0.91) and five worst (−6.8 ≥ R2 ≥ −51.0) LOCO performances. The complete table can be reproduced using the code provided in the SI.75
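The LOCO loop can be sketched as follows, with synthetic stand-in data in which an integer group label marks which compound each reaction belongs to:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def loco_predictions(X, y, groups):
    """Leave-one-compound-out: for each compound label, train on all other
    compounds' reactions and predict the held-out compound's yields."""
    preds = np.empty(len(y), dtype=float)
    for g in np.unique(groups):
        mask = groups == g
        model = BayesianRidge().fit(X[~mask], y[~mask])
        preds[mask] = np.clip(model.predict(X[mask]), 0.0, None)
    return preds

# Synthetic stand-in: 6 "compounds" x 5 reactions each = 30 reactions
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = np.clip(X @ np.array([8.0, -3.0, 2.0, 5.0]) + 35.0, 0.0, 100.0)
groups = np.repeat(np.arange(6), 5)  # compound label of each reaction
preds = loco_predictions(X, y, groups)
print(preds.shape)
```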
Table 5. Performance of LOCO Models on Unseen Azoles and Amines.
| structure | R2 | MAE | RMSE | AEmax |
|---|---|---|---|---|
| 50 | 0.91 | 3.1% | 4.0% | 9.1% |
| 48 | 0.73 | 7.4% | 9.1% | 23.6% |
| 49 | 0.72 | 5.1% | 7.0% | 18.1% |
| 51 | 0.70 | 7.0% | 9.8% | 22.9% |
| 44 | 0.68 | 7.0% | 10.2% | 26.2% |
| 22 | –6.8 | 14.4% | 15.4% | 19.7% |
| 25 | –6.9 | 12.8% | 12.8% | 15.0% |
| 23 | –15.4 | 14.8% | 16.4% | 26.2% |
| 5 | –45.9 | 22.0% | 22.5% | 33.7% |
| 14 | –51.0 | 25.1% | 26.0% | 37.6% |
Among the top-performing unseen structures, only amines are present. These amines include both aliphatic (44) and aromatic (48–51) compounds. This result suggests that the model is capable of making accurate out-of-sample predictions for unseen amines. In contrast, the predictive performance is notably low for certain azoles (5, 14, 22, 23, 25). In these cases, the model underperforms compared to a simple average of experimental yields.
The poor predictions for 22, 23, and 25 can be attributed to how the molecules are encoded in the model. Since the RDK fingerprint is a substructure-based descriptor, it performs well on molecules composed of substructures already known to the model. However, this approach leads to errors when predicting yields for molecules with structural motifs unknown to the model. For instance, 23 contains a trimethylsilyl (TMS) group, a motif absent from the training data set in the LOCO training cycle. As a result, the model is unable to make accurate predictions for such molecules.
For 5 and 14, similar structures in the training data set exhibit distinct yields. For example, 5 (an oxazole) and 17 (a thiazole) differ only in the heteroatom of the azole motif (Figure 5, top). However, 17 achieves yields ranging from 23% to 58%, whereas 5 produces a maximum yield of only 10%. A similar pattern is observed for 14, which contains a pyridine motif. This motif is also found in amines 51 and 52, yet their reaction products yield up to 69%, while products containing 14 achieve a maximum yield of only 12% (Figure 5, middle and bottom).
Figure 5.
Structures with similar structural motifs but distinctly different yields. The structure motifs of interest are highlighted in red.
3.3. Learning from Our Model
The discrepancies observed for azoles 5 and 14 are reminiscent of the concept of activity cliffs, a phenomenon commonly discussed in medicinal chemistry and drug design. Activity cliffs describe structurally similar compounds that exhibit different biological activities.76 This concept can be generalized to other properties, such as reaction yields, where such differences are referred to as property cliffs.77
Property cliffs can be investigated through an analysis of the model’s bits {xn} and coefficients {ωn},
$$\hat{y} = \omega_0 + \sum_{n} \omega_n x_n \tag{6}$$

where $\hat{y}$ is the predicted yield and $\omega_0$ the intercept of the linear model.
Each bit is linked to a model coefficient. A coefficient quantifies the contribution of its substructure to the predicted yield whenever the corresponding bit is “on” (xn = 1); “off” bits (xn = 0) do not contribute. Positive coefficients signify a positive influence on the yield, negative coefficients a negative one, and the magnitude of a coefficient reflects the strength of the substructure’s impact. The individual coefficients of our model can be printed with the code provided in the project-related Gitlab repository.75
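In code, the per-substructure contributions are simply the elementwise product of coefficients and bits; the values below are hypothetical, not fitted coefficients from our model:

```python
import numpy as np

# Hypothetical coefficients {omega_n} and fingerprint bits {x_n}
omega = np.array([0.0, 4.2, -1.5, 0.8, -3.0])  # assumed model coefficients
x = np.array([0, 1, 1, 0, 1])                  # "on"/"off" bits

# Only "on" bits contribute to the predicted yield (relative to the intercept)
contributions = omega * x
print(contributions)
print(float(contributions.sum()))
```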
To further understand the discrepancies observed for azoles 5 and 14, we compared the coefficients from models trained on LOCO subsets to those from a model trained on the entire data set. The analysis focused on “on” bit coefficients with the largest differences between the two models. For instance, the oxazole motif in 5, which distinguishes it from 17 containing the analogous thiazole motif (Figure 5, top), was associated with the most significant coefficient changes. This finding suggests that such changes can result from the descriptor’s inability to consistently capture the influence of certain motifs, even though similar structures containing the same motifs are present in the training set.
The analysis of coefficient–bit pairs is not only useful for understanding the aforementioned discrepancies. It also serves as a powerful tool for addressing broader interpretability challenges in machine-learning models. By linking specific structural motifs to their contributions, this approach provides valuable insights into the underlying patterns driving predictions.
To make the information provided by the model more accessible and intuitive, we implemented the heat-mapping algorithm PIXIE to visualize the importance of individual features as measured by their associated coefficients (eq 6). Using this information, we assigned weights to each bond in a molecule by summing the coefficients of all substructures associated with that bond. These weights were then visualized as colors on the molecular structure, creating a heat map. In the representation used here, red tones indicate a positive influence of the structural motif on the yield, while blue tones suggest a negative influence. The color scale was normalized to the largest and smallest weights in the data set for consistency.
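The bond-weighting step can be sketched as follows. The `bit_info` dictionary has the shape of RDKit’s bitInfo map for the RDK fingerprint (each bit maps to a list of bond-index paths); the bits, paths, and coefficients below are hypothetical:

```python
from collections import defaultdict

def bond_weights(bit_info, coefficients):
    """Sum, for every bond, the coefficients of all substructures (bits)
    whose bond paths contain that bond -- the weights mapped onto colors.

    bit_info: {bit: [list of bond-index paths]}; coefficients: {bit: coef}.
    """
    weights = defaultdict(float)
    for bit, paths in bit_info.items():
        for path in paths:
            for bond in path:
                weights[bond] += coefficients[bit]
    return dict(weights)

# Hypothetical example: bit 7 (coef +2.0) covers bonds 0 and 1,
# bit 12 (coef -0.5) covers bond 1 only
bi = {7: [[0, 1]], 12: [[1]]}
coefs = {7: 2.0, 12: -0.5}
print(bond_weights(bi, coefs))  # {0: 2.0, 1: 1.5}
```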
For example, as shown in Figure 6, the brominated oxazole substructure in 5-CO-50 and 34-CO-46 is identified as detrimental to the yield. Conversely, the 5-phenyl oxazole substructure in 13-CO-45 is linked to a strong positive contribution. The analysis also highlights the role of specific amines. For instance, amine 46 shows a potential positive effect on the yield of 34-CO-46, while amine 50 appears to contribute negatively to the yield of 1-CO-50.
Figure 6.
Heat maps showing examples from the test set. Blue tones indicate substructures that reduce the predicted yield, while red tones highlight those that increase it.
While the heat maps generated by PIXIE offer valuable insights, inconsistencies may arise due to fingerprint collisions—instances where a single bit corresponds to multiple substructures. Such collisions can hinder the unambiguous assignment of coefficients to specific substructures within a molecule, limiting the interpretability of the heat map.78 Due to the narrow scope of the chemical space investigated in this work, such collisions are expected to have a negligible impact. Nevertheless, we note this limitation to ensure the approach remains adaptable and informative in broader applications beyond the specific system studied here.
Heat-mapping algorithms have already been used in similar contexts to visualize how machine-learning models make predictions.78−83 These methods fall under the broader domain of explainable artificial intelligence and are often applied in drug design.78−82 For instance, Riniker et al. implemented a heat-mapping approach in RDKit, focusing on mapping based on molecular similarity or the predicted probability of a machine-learning model trained on fingerprints.78 Marcou et al. developed an approach named ColorAtom for explainability based on fragment descriptors.84 PIXIE, in contrast, is based on RDKit fingerprints and can therefore be combined with other types of models.
PIXIE has the distinct advantage of being directly applicable to model coefficients, which is not feasible with more complex model architectures. For such architectures, tools like SHAP46 are often used to explain predictions. PIXIE was designed to include SHAP values for models trained on RDKit fingerprints, enabling the generation of comparable heat maps (Figure S5).
3.4. Exploring the 1,3-Azole Amide Space
To explore the applicability of our machine-learning model to “real-world” data, we conducted yield predictions for azole amides from the ZINC20 database (Section 2.1.2). This database includes 5708 1,3-azole amides in its “In-Stock” subset, with predicted yields ranging from 9% to 60%. The Merck & Co. data set (Section 2.1.1) used to train the model provides only partial coverage of the structural diversity found in the ZINC database, as illustrated in Figure 7. As a result, predictions for structures located further from the training set in the chemical space, depending on the chosen descriptor, may be less accurate than those closer to it.
Figure 7. PCA based on seven RDKit descriptors, with the data of Felten et al.19 shown in red and the 1,3-azole amides from the ZINC database in gray. Details on the PCA settings are provided in Section S1.
This trend is also reflected in the prediction uncertainty, which generally increases for structures farther from the training set. However, in this case, prediction uncertainties span a relatively narrow range of 8% to 10%, suggesting consistent model confidence across the data set. Examples of ZINC structures, along with their predicted yields and associated uncertainties, are summarized in Table 6.
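One simple way to quantify such training-set coverage is the distance of a query molecule to its nearest training neighbor in the chosen descriptor space; large distances signal extrapolation and suggest the prediction should be trusted less. The sketch below is illustrative only (toy descriptor vectors, not the paper's seven RDKit descriptors or its Gaussian-process uncertainties):

```python
# Illustrative applicability-domain check via nearest-neighbor distance
# in a descriptor space (toy data; not the published workflow).
import math

def nearest_train_distance(query, train_set):
    """Euclidean distance from a query descriptor vector to the closest
    training descriptor vector (Python 3.8+ for math.dist)."""
    return min(math.dist(query, x) for x in train_set)

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # toy training descriptors
in_domain = (0.1, 0.1)    # close to the training set -> trust more
out_domain = (5.0, 5.0)   # far from the training set -> trust less

print(nearest_train_distance(in_domain, train))
print(nearest_train_distance(out_domain, train))
```

In practice one would compute these distances in the PCA space of Figure 7 and compare them against the spread of the training data itself.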
Table 6. Predicted Yields and Uncertainties for Selected Molecules from the ZINC20 Database and a Newly Designed Molecule (E).
Blue regions on the structures indicate substructures predicted to decrease the yield, while red regions highlight those predicted to increase it. Molecule E, designed by combining motifs from C and D, is not part of the ZINC database. Uncertainty values for each prediction are listed in the corresponding column.
The predictions in Table 6 suggest that structure C, which includes a 5-phenyl-1,3-oxazole group, is associated with a positive impact on the yield, while its benzimidazole group appears to have a neutral effect. In contrast, structure D contains a benzothiazole group that is predicted to enhance the yield. By replacing the benzimidazole group in structure C with the benzothiazole group from structure D, a new molecule, structure E, was designed that combines two motifs predicted to positively influence the yield. It is important to note, however, that this analysis is purely mathematical and does not account for chemical feasibility. In practice, the substitution of these motifs may not be straightforward, and alternative reaction mechanisms could affect the validity of the model’s predictions. Thus, the conclusions drawn here are only valid under the known assumptions of the model. This example nonetheless demonstrates how heat mapping can support the rational design of molecules.
4. Conclusions
The chemical space of C2-carboxylated 1,3-azoles remains largely underexplored. To facilitate systematic exploration, we developed an interpretable machine-learning model for yield prediction using RDKit fingerprints. The model targets C–H carboxylation reactions followed by an amide coupling step and achieves a test set MAE of 7.4 percentage points in yield, corresponding to an R2 value of 0.69.
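For clarity, the two error metrics quoted above can be computed as follows (toy numbers for illustration, not the actual test set):

```python
# Definitions of MAE and R2 as used for regression evaluation
# (equivalent to scikit-learn's mean_absolute_error and r2_score).

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [20, 35, 50, 65, 80]  # observed yields in percent (toy values)
y_pred = [25, 30, 55, 60, 85]  # model predictions (toy values)
print(mae(y_true, y_pred), r2(y_true, y_pred))
```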
Limitations such as property cliffs (analogous to activity cliffs) and underrepresented structural motifs were identified, highlighting areas for further refinement. To provide deeper insights into the model’s predictions, we integrated a heat-mapping algorithm named PIXIE. This algorithm enhances chemical intuition by visualizing the quantitative effect of individual structural motifs on the predicted yields, making the model predictions directly interpretable.
The adaptability of PIXIE extends beyond yield prediction, with potential applications in quantitative structure–property relationships (QSPR) for properties like quantum chemical descriptors and in quantitative structure–activity relationships (QSAR) for activities or toxicities. By demonstrating its utility on real-world data from the ZINC database, we showed how PIXIE can facilitate rational molecular design.
By combining predictive modeling with interpretable visualizations through PIXIE, we aim to support synthetic chemists in systematically expanding the chemical space of C2-carboxylated 1,3-azoles and beyond.
Acknowledgments
JP acknowledges funding by Germany’s joint federal and state program supporting early career researchers (WISNA), established by the Federal Ministry of Education and Research (BMBF). The authors thank the team of Dr. Marion H. Emmert (Merck & Co.) for providing numerical LCAP values and Prof. Christoph R. Jacob (TU Braunschweig) for computational resources. The authors employed ChatGPT/OpenAI to assist with coding during the preparation of this work; they carefully reviewed and revised the AI-generated content and take full responsibility for the final publication.
Glossary
Abbreviations
- AE50
Median of the Absolute Error
- AEmax
Maximum Absolute Error
- BH
Buchwald–Hartwig
- CASP
Computer-Aided Synthesis Planning
- CM
Coulomb Matrix
- DME
1,2-Dimethoxyethane
- EtOAc
Ethyl Acetate
- HOMO
Highest Occupied Molecular Orbital
- LASSO
Least Absolute Shrinkage and Selection Operator
- LCAP
Liquid Chromatography Area Percent
- LOCO
Leave-One-Compound-Out
- LOO
Leave-One-Out
- LUMO
Lowest Unoccupied Molecular Orbital
- MAE
Mean Absolute Error
- MBTR
Many-Body Tensor Representations
- PCA
Principal Component Analysis
- QM
Quantum Mechanics
- R2
Coefficient of Determination
- RMSE
Root Mean Square Error
- SHAP
SHapley Additive exPlanations
- TMS
Trimethylsilane
Data Availability Statement
The data used for creating the machine-learning model can be found at https://pubs.acs.org/doi/abs/10.1021/jacs.2c10557. The ZINC20 “In-Stock” subset can be downloaded via https://zinc.docking.org/tranches/home/. The code as well as the model to reproduce the results shown in this work are available at https://git.rz.tu-bs.de/proppe-group/yield-prediction.
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.4c02336.
Author Contributions
Kerrin Janssen: Conceptualization, Data curation, Formal analysis, Investigation, Experimental design, Methodology, Computational modeling, Data interpretation, Visualization, Writing–original draft. Jonny Proppe: Conceptualization, Methodology, Data interpretation, Project administration, Resources, Supervision, Writing–review and editing.
The authors declare no competing financial interest.
Special Issue
Published as part of the Journal of Chemical Information and Modeling special issue “Chemical Compound Space Exploration by Multiscale High-Throughput Screening and Machine Learning”.
References
- Wheeler T.; Von Braun J. Climate Change Impacts on Global Food Security. Science 2013, 341, 508–513. 10.1126/science.1239402. [DOI] [PubMed] [Google Scholar]
- Ledley T. S.; Sundquist E. T.; Schwartz S. E.; Hall D. K.; Fellows J. D.; Killeen T. L. Climate change and greenhouse gases. Eos, Transactions American Geophysical Union 1999, 80, 453–458. 10.1029/99EO00325. [DOI] [Google Scholar]
- Mac Dowell N.; Fennell P. S.; Shah N.; Maitland G. C. The role of CO2 capture and utilization in mitigating climate change. Nature Climate Change 2017, 7, 243–249. 10.1038/nclimate3231. [DOI] [Google Scholar]
- Leitner W. Carbon Dioxide as a Raw Material: The Synthesis of Formic Acid and Its Derivatives from CO2. Angewandte Chemie International Edition in English 1995, 34, 2207–2221. 10.1002/anie.199522071. [DOI] [Google Scholar]
- Vechorkin O.; Hirt N.; Hu X. Carbon Dioxide as the C1 Source for Direct CH Functionalization of Aromatic Heterocycles. Organic Letters 2010, 12, 3567–3569. 10.1021/ol101450u. [DOI] [PubMed] [Google Scholar]
- Liu Q.; Wu L.; Jackstell R.; Beller M. Using carbon dioxide as a building block in organic synthesis. Nature Communications 2015, 6, 5933. 10.1038/ncomms6933. [DOI] [PubMed] [Google Scholar]
- Artz J.; Müller T. E.; Thenert K.; Kleinekorte J.; Meys R.; Sternberg A.; Bardow A.; Leitner W. Sustainable Conversion of Carbon Dioxide: An Integrated Review of Catalysis and Life Cycle Assessment. Chem. Rev. 2018, 118, 434–504. 10.1021/acs.chemrev.7b00435. [DOI] [PubMed] [Google Scholar]
- Dabral S.; Schaub T. The Use of Carbon Dioxide (CO2) as a Building Block in Organic Synthesis from an Industrial Perspective. Advanced Synthesis & Catalysis 2019, 361, 223–246. 10.1002/adsc.201801215. [DOI] [Google Scholar]
- Luo J.; Larrosa I. CH Carboxylation of Aromatic Compounds through CO2 Fixation. ChemSusChem 2017, 10, 3317–3332. 10.1002/cssc.201701058. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Siegel R. E.; Pattanayak S.; Berben L. A. Reactive Capture of CO2: Opportunities and Challenges. ACS Catalysis 2023, 13, 766–784. 10.1021/acscatal.2c05019. [DOI] [Google Scholar]
- Fenner S.; Ackermann L. C–H carboxylation of heteroarenes with ambient CO2. Green Chemistry 2016, 18, 3804–3807. 10.1039/C6GC00200E. [DOI] [Google Scholar]
- Das Neves Gomes C.; Jacquet O.; Villiers C.; Thuéry P.; Ephritikhine M.; Cantat T. A Diagonal Approach to Chemical Recycling of Carbon Dioxide: Organocatalytic Transformation for the Reductive Functionalization of CO2. Angew. Chem. 2012, 124, 191–194. 10.1002/ange.201105516. [DOI] [PubMed] [Google Scholar]
- Fiorani G.; Guo W.; Kleij A. W. Sustainable conversion of carbon dioxide: the advent of organocatalysis. Green Chemistry 2015, 17, 1375–1389. 10.1039/C4GC01959H. [DOI] [Google Scholar]
- Hong J.; Li M.; Zhang J.; Sun B.; Mo F. CH Bond Carboxylation with Carbon Dioxide. ChemSusChem 2019, 12, 6–39. 10.1002/cssc.201802012. [DOI] [PubMed] [Google Scholar]
- Janssen K.; Kirchmair J.; Proppe J. Relevance and Potential Applications of C2-Carboxylated 1,3-Azoles. ChemMedChem 2024, 19, e202400307. 10.1002/cmdc.202400307. [DOI] [PubMed] [Google Scholar]
- Poulakos M.; Walker J. N.; Baig U.; David T. Edoxaban: A direct oral anticoagulant. American Journal of Health-System Pharmacy 2017, 74, 117–129. 10.2146/ajhp150821. [DOI] [PubMed] [Google Scholar]
- Chen Z.-F.; Ying G.-G. Occurrence, fate and ecological risk of five typical azole fungicides as therapeutic and personal care products in the environment: A review. Environment International 2015, 84, 142–153. 10.1016/j.envint.2015.07.022. [DOI] [PubMed] [Google Scholar]
- Wang X.-Y.; Ma Y.-J.; Guo Y.; Luo X.-L.; Du M.; Dong L.; Yu P.; Xu X.-B. Reinvestigation of 2-acetylthiazole formation pathways in the Maillard reaction. Food Chemistry 2021, 345, 128761. 10.1016/j.foodchem.2020.128761. [DOI] [PubMed] [Google Scholar]
- Felten S.; He C. Q.; Weisel M.; Shevlin M.; Emmert M. H. Accessing Diverse Azole Carboxylic Acid Building Blocks via Mild C–H Carboxylation: Parallel, One-Pot Amide Couplings and Machine-Learning-Guided Substrate Scope Design. J. Am. Chem. Soc. 2022, 144, 23115–23126. 10.1021/jacs.2c10557. [DOI] [PubMed] [Google Scholar]
- Coley C. W.; Green W. H.; Jensen K. F. Machine Learning in Computer-Aided Synthesis Planning. Accounts of Chemical Research 2018, 51, 1281–1289. 10.1021/acs.accounts.8b00087. [DOI] [PubMed] [Google Scholar]
- Segler M. H. S.; Preuss M.; Waller M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018, 555, 604–610. 10.1038/nature25978. [DOI] [PubMed] [Google Scholar]
- Struble T. J.; Alvarez J. C.; Brown S. P.; Chytil M.; Cisar J.; DesJarlais R. L.; Engkvist O.; Frank S. A.; Greve D. R.; Griffin D. J.; Hou X.; Johannes J. W.; Kreatsoulas C.; Lahue B.; Mathea M.; Mogk G.; Nicolaou C. A.; Palmer A. D.; Price D. J.; Robinson R. I.; Salentin S.; Xing L.; Jaakkola T.; Green W. H.; Barzilay R.; Coley C. W.; Jensen K. F. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. Journal of Medicinal Chemistry 2020, 63, 8667–8682. 10.1021/acs.jmedchem.9b02120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Szymkuć S.; Gajewska E. P.; Klucznik T.; Molga K.; Dittwald P.; Startek M.; Bajczyk M.; Grzybowski B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angewandte Chemie International Edition 2016, 55, 5904–5937. 10.1002/anie.201506101. [DOI] [PubMed] [Google Scholar]
- Klucznik T.; Mikulak-Klucznik B.; McCormack M. P.; Lima H.; Szymkuć S.; Bhowmick M.; Molga K.; Zhou Y.; Rickershauser L.; Gajewska E. P.; Toutchkine A.; Dittwald P.; Startek M. P.; Kirkovits G. J.; Roszak R.; Adamski A.; Sieredzińska B.; Mrksich M.; Trice S. L.; Grzybowski B. A. Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory. Chem. 2018, 4, 522–532. 10.1016/j.chempr.2018.02.002. [DOI] [Google Scholar]
- Warr W. A. A Short Review of Chemical Reaction Database Systems, Computer-Aided Synthesis Design, Reaction Prediction and Synthetic Feasibility. Molecular Informatics 2014, 33, 469–476. 10.1002/minf.201400052. [DOI] [PubMed] [Google Scholar]
- Shen Y.; Borowski J. E.; Hardy M. A.; Sarpong R.; Doyle A. G.; Cernak T. Automation and computer-assisted planning for chemical synthesis. Nat. Rev. Methods Primers 2021, 1, 23. 10.1038/s43586-021-00022-5. [DOI] [Google Scholar]
- Molga K.; Szymkuć S.; Grzybowski B. A. Chemist Ex Machina: Advanced Synthesis Planning by Computers. Accounts of Chemical Research 2021, 54, 1094–1106. 10.1021/acs.accounts.0c00714. [DOI] [PubMed] [Google Scholar]
- Tripp A.; Maziarz K.; Lewis S.; Liu G.; Segler M.. Re-Evaluating Chemical Synthesis Planning Algorithms. OpenReview, 2024.
- Schwaller P.; Vaucher A. C.; Laino T.; Reymond J.-L. Prediction of chemical reaction yields using deep learning. Machine Learning: Science and Technology 2021, 2, 015016. 10.1088/2632-2153/abc81d. [DOI] [Google Scholar]
- Sandfort F.; Strieth-Kalthoff F.; Kühnemund M.; Beecks C.; Glorius F. A Structure-Based Platform for Predicting Chemical Reactivity. Chem. 2020, 6, 1379–1390. 10.1016/j.chempr.2020.02.017. [DOI] [Google Scholar]
- Żurański A. M.; Martinez Alvarado J. I.; Shields B. J.; Doyle A. G. Predicting Reaction Yields via Supervised Learning. Accounts of Chemical Research 2021, 54, 1856–1865. 10.1021/acs.accounts.0c00770. [DOI] [PubMed] [Google Scholar]
- Rasmussen C. E.Gaussian Processes in Machine Learning. Advanced Lectures on Machine Learning, Vol. 3176, edited by Bousquet O.; Von Luxburg U.; Rätsch G. Series Title: Lecture Notes in Computer Science (Springer Berlin Heidelberg, Berlin, Heidelberg, 2004), pp 63–71. [Google Scholar]
- Ahneman D. T.; Estrada J. G.; Lin S.; Dreher S. D.; Doyle A. G. Predicting reaction performance in C–N cross-coupling using machine learning. Science 2018, 360, 186–190. 10.1126/science.aar5169. [DOI] [PubMed] [Google Scholar]
- Perera D.; Tucker J. W.; Brahmbhatt S.; Helal C. J.; Chong A.; Farrell W.; Richardson P.; Sach N. W. A platform for automated nanomole-scale reaction screening and micromole-scale synthesis in flow. Science 2018, 359, 429–434. 10.1126/science.aap9112. [DOI] [PubMed] [Google Scholar]
- Granda J. M.; Donina L.; Dragone V.; Long D.-L.; Cronin L. Controlling an organic synthesis robot with machine learning to search for new reactivity. Nature 2018, 559, 377–381. 10.1038/s41586-018-0307-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nielsen M. K.; Ahneman D. T.; Riera O.; Doyle A. G. Deoxyfluorination with Sulfonyl Fluorides: Navigating Reaction Space with Machine Learning. Journal of the American Chemical Society 2018, 140, 5004–5008. 10.1021/jacs.8b01523. [DOI] [PubMed] [Google Scholar]
- Jiang S.; Zhang Z.; Zhao H.; Li J.; Yang Y.; Lu B.-L.; Xia N. When SMILES Smiles, Practicality Judgment and Yield Prediction of Chemical Reaction via Deep Chemical Language Processing. IEEE Access 2021, 9, 85071–85083. 10.1109/ACCESS.2021.3083838. [DOI] [Google Scholar]
- Yarish D.; Garkot S.; Grygorenko O. O.; Radchenko D. S.; Moroz Y. S.; Gurbych O. Advancing molecular graphs with descriptors for the prediction of chemical reaction yields. Journal of Computational Chemistry 2023, 44, 76–92. 10.1002/jcc.27016. [DOI] [PubMed] [Google Scholar]
- Saebi M.; Nan B.; Herr J. E.; Wahlers J.; Guo Z.; Zurański A. M.; Kogej T.; Norrby P.-O.; Doyle A. G.; Chawla N. V.; Wiest O. On the use of real-world datasets for reaction yield prediction. Chemical Science 2023, 14, 4997–5005. 10.1039/D2SC06041H. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen J.; Guo K.; Liu Z.; Isayev O.; Zhang X. Uncertainty-Aware Yield Prediction with Multimodal Molecular Features. Proceedings of the AAAI Conference on Artificial Intelligence 2024, 38, 8274–8282. 10.1609/aaai.v38i8.28668. [DOI] [Google Scholar]
- Schleinitz J.; Langevin M.; Smail Y.; Wehnert B.; Grimaud L.; Vuilleumier R. Machine Learning Yield Prediction from NiCOlit, a Small-Size Literature Data Set of Nickel Catalyzed C–O Couplings. Journal of the American Chemical Society 2022, 144, 14722–14730. 10.1021/jacs.2c05302. [DOI] [PubMed] [Google Scholar]
- Aouichaoui A. R. N.; Fan F.; Mansouri S. S.; Abildskov J.; Sin G. Combining Group-Contribution Concept and Graph Neural Networks Toward Interpretable Molecular Property Models. Journal of Chemical Information and Modeling 2023, 63, 725–744. 10.1021/acs.jcim.2c01091. [DOI] [PubMed] [Google Scholar]
- Santiago C. B.; Guo J.-Y.; Sigman M. S. Predictive and mechanistic multivariate linear regression models for reaction development. Chemical Science 2018, 9, 2398–2412. 10.1039/C7SC04679K. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bishop C. M.Pattern recognition and machine learning. Information science and statistics (Springer, New York (NY), United States, 2006). [Google Scholar]
- James G.; Witten D.; Hastie T.; Tibshirani R.. An Introduction to Statistical Learning: with Applications in R, Springer Texts in Statistics (Springer US, New York, NY, 2021). [Google Scholar]
- Lundberg S. M.; Lee S.-I.. A unified approach to interpreting model predictions. In Advances in neural information processing systems, Vol. 30, edited by Guyon I.; Luxburg U. V.; Bengio S.; Wallach H.; Fergus R.; Vishwanathan S.; Garnett R. (2017). [Google Scholar]
- Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences 1988, 28, 31–36. 10.1021/ci00057a005. [DOI] [Google Scholar]
- Landrum G.; Tosco P.; Kelley B.; Ric; Cosgrove D.; Sriniker; Gedeck; Vianello R.; Schneider N.; Kawashima E.; Dan N.; Jones G.; Dalke A.; Cole B.; Swain M.; Turk S.; AlexanderSavelyev; Vaucher A.; Wójcikowski M.; Take I.; Probst D.; Ujihara K.; Scalfani V. F.; Godin G.; Lehtivarjo J.; Pahl A.; Walker R.; Berenger F.; Jasondbiggs; Strets123. rdkit/rdkit: 2023_03_2 (Q1 2023) Release, Oct. 2023.
- Skoraczyński G.; Dittwald P.; Miasojedow B.; Szymkuć S.; Gajewska E. P.; Grzybowski B. A.; Gambin A. Predicting the outcomes of organic reactions via machine learning: are current descriptors sufficient?. Scientific Reports 2017, 7, 3582. 10.1038/s41598-017-02303-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Pracht P.; Grimme S.; Bannwarth C.; Bohle F.; Ehlert S.; Feldmann G.; Gorges J.; Müller M.; Neudecker T.; Plett C.; Spicher S.; Steinbach P.; Wesołowski P. A.; Zeller F. CREST—A program for the exploration of low-energy molecular chemical space. The Journal of Chemical Physics 2024, 160, 114110. 10.1063/5.0197592. [DOI] [PubMed] [Google Scholar]
- Bannwarth C.; Ehlert S.; Grimme S. GFN2-xTB—An Accurate and Broadly Parametrized Self-Consistent Tight-Binding Quantum Chemical Method with Multipole Electrostatics and Density-Dependent Dispersion Contributions. Journal of Chemical Theory and Computation 2019, 15, 1652–1671. 10.1021/acs.jctc.8b01176. [DOI] [PubMed] [Google Scholar]
- Neese F. The ORCA program system. Wiley Interdisciplinary Reviews: Computational Molecular Science 2012, 2, 73–78. 10.1002/wcms.81. [DOI] [Google Scholar]
- Neese F. Software update: The ORCA program system—Version 5.0. WIREs Computational Molecular Science 2022, 12, e1606 10.1002/wcms.1606. [DOI] [Google Scholar]
- Becke A. D. A new mixing of Hartree–Fock and local density-functional theories. The Journal of Chemical Physics 1993, 98, 1372–1377. 10.1063/1.464304. [DOI] [Google Scholar]
- Grimme S.; Antony J.; Ehrlich S.; Krieg H. A consistent and accurate ab initio parametrization of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. The Journal of Chemical Physics 2010, 132, 154104. 10.1063/1.3382344. [DOI] [PubMed] [Google Scholar]
- Grimme S.; Ehrlich S.; Goerigk L. Effect of the damping function in dispersion corrected density functional theory. Journal of Computational Chemistry 2011, 32, 1456–1465. 10.1002/jcc.21759. [DOI] [PubMed] [Google Scholar]
- Mardirossian N.; Head-Gordon M. Thirty years of density functional theory in computational chemistry: an overview and extensive assessment of 200 density functionals. Mol. Phys. 2017, 115, 2315–2372. 10.1080/00268976.2017.1333644. [DOI] [Google Scholar]
- Weigend F.; Ahlrichs R. Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy. Physical Chemistry Chemical Physics 2005, 7, 3297–3305. 10.1039/b508541a. [DOI] [PubMed] [Google Scholar]
- Bursch M.; Mewes J.-M.; Hansen A.; Grimme S. Best-Practice DFT Protocols for Basic Molecular Computational Chemistry. Angewandte Chemie International Edition 2022, 61, e202205735 10.1002/anie.202205735. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vahl M.; Proppe J. The computational road to reactivity scales. Physical Chemistry Chemical Physics 2023, 25, 2717–2728. 10.1039/D2CP03937K. [DOI] [PubMed] [Google Scholar]
- Garcia-Ratés M.; Neese F. Effect of the Solute Cavity on the Solvation Energy and its Derivatives within the Framework of the Gaussian Charge Scheme. Journal of Computational Chemistry 2020, 41, 922–939. 10.1002/jcc.26139. [DOI] [PubMed] [Google Scholar]
- Lohani U. C.; Fallahi P.; Muthukumarappan K. Comparison of Ethyl Acetate with Hexane for Oil Extraction from Various Oilseeds. Journal of the American Oil Chemists’ Society 2015, 92, 743–754. 10.1007/s11746-015-2644-1. [DOI] [Google Scholar]
- Álvarez V. H.; Serrão D.; Da Silva J. L.; Barbosa M. R.; Aznar M. Vapor–liquid and liquid–liquid equilibrium for binary systems ester + a new protic ionic liquid. Ionics 2013, 19, 1263–1269. 10.1007/s11581-013-0846-9. [DOI] [Google Scholar]
- Yoshio M.; Nakamura H.; Hyakutake M.; Nishikawa S.; Yoshizuka K. Conductivities of 1,2-dimethoxyethane or 1,2-dimethoxyethane-related solutions of lithium salts. Journal of Power Sources 1993, 41, 77–86. 10.1016/0378-7753(93)85006-A. [DOI] [Google Scholar]
- Wang Y.; Luo T.; Li Y.; Wang A.; Wang D.; Bao J. L.; Mohanty U.; Tsung C.-K. Molecular-Level Insights into Selective Transport of Mg2+ in Metal–Organic Frameworks. ACS Applied Materials & Interfaces 2021, 13, 51974–51987. 10.1021/acsami.1c08392. [DOI] [PubMed] [Google Scholar]
- Haywood A. L.; Redshaw J.; Hanson-Heine M. W. D.; Taylor A.; Brown A.; Mason A. M.; Gärtner T.; Hirst J. D. Kernel Methods for Predicting Yields of Chemical Reactions. Journal of Chemical Information and Modeling 2022, 62, 2077–2092. 10.1021/acs.jcim.1c00699. [DOI] [PubMed] [Google Scholar]
- Durant J. L.; Leland B. A.; Henry D. R.; Nourse J. G. Reoptimization of MDL Keys for Use in Drug Discovery. Journal of Chemical Information and Computer Sciences 2002, 42, 1273–1280. 10.1021/ci010132r. [DOI] [PubMed] [Google Scholar]
- Rogers D.; Hahn M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling 2010, 50, 742–754. 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
- Rupp M.; Tkatchenko A.; Müller K.-R.; Von Lilienfeld O. A. Fast and Accurate Modeling of Molecular Atomization Energies with Machine Learning. Physical Review Letters 2012, 108, 058301. 10.1103/PhysRevLett.108.058301. [DOI] [PubMed] [Google Scholar]
- Pronobis W.; Tkatchenko A.; Müller K.-R. Many-Body Descriptors for Predicting Molecular Properties with Machine Learning: Analysis of Pairwise and Three-Body Interactions in Molecules. J. Chem. Theory Comput. 2018, 14, 2991–3003. 10.1021/acs.jctc.8b00110. [DOI] [PubMed] [Google Scholar]
- Huo H.; Rupp M. Unified representation of molecules and crystals for machine learning. Machine Learning: Science and Technology 2022, 3, 045017. 10.1088/2632-2153/aca005. [DOI] [Google Scholar]
- Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay É. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825–2830. [Google Scholar]
- Hoffmann G.; Balcilar M.; Tognetti V.; Héroux P.; Gaüzère B.; Adam S.; Joubert L. Predicting experimental electrophilicities from quantum and topological descriptors: A machine learning approach. Journal of Computational Chemistry 2020, 41, 2124–2136. 10.1002/jcc.26376. [DOI] [Google Scholar]
- Eckhoff M.; Diedrich J. V.; Mücke M.; Proppe J. Quantitative Structure–Reactivity Relationships for Synthesis Planning: The Benzhydrylium Case. The Journal of Physical Chemistry A 2024, 128, 343–354. 10.1021/acs.jpca.3c07289. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Proppe J.Yield Prediction, https://git.rz.tu-bs.de/proppe-group/yield-prediction, Dec. 2024.
- Maggiora G. M. On Outliers and Activity Cliffs – Why QSAR Often Disappoints. Journal of Chemical Information and Modeling 2006, 46, 1535–1535. 10.1021/ci060117s. [DOI] [PubMed] [Google Scholar]
- Cruz-Monteagudo M.; Medina-Franco J. L.; Pérez-Castillo Y.; Nicolotti O.; Cordeiro M. N. D.; Borges F. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde?. Drug Discovery Today 2014, 19, 1069–1080. 10.1016/j.drudis.2014.02.003. [DOI] [PubMed] [Google Scholar]
- Riniker S.; Landrum G. A. Similarity maps - a visualization strategy for molecular fingerprints and machine-learning methods. Journal of Cheminformatics 2013, 5, 43. 10.1186/1758-2946-5-43. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wellawatte G. P.; Gandhi H. A.; Seshadri A.; White A. D. A Perspective on Explanations of Molecular Prediction Models. Journal of Chemical Theory and Computation 2023, 19, 2149–2160. 10.1021/acs.jctc.2c01235. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rasmussen M. H.; Christensen D. S.; Jensen J. H. Do machines dream of atoms? Crippen’s logP as a quantitative molecular benchmark for explainable AI heatmaps. SciPost Chemistry 2023, 2, 002. 10.21468/SciPostChem.2.1.002. [DOI] [Google Scholar]
- Harren T.; Matter H.; Hessler G.; Rarey M.; Grebner C. Interpretation of Structure–Activity Relationships in Real-World Drug Design Data Sets Using Explainable Artificial Intelligence. Journal of Chemical Information and Modeling 2022, 62, 447–462. 10.1021/acs.jcim.1c01263. [DOI] [PubMed] [Google Scholar]
- Rosenbaum L.; Hinselmann G.; Jahn A.; Zell A. Interpreting linear support vector machine models with heat map molecule coloring. Journal of Cheminformatics 2011, 3, 11. 10.1186/1758-2946-3-11. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Chen Y.; Stork C.; Hirte S.; Kirchmair J. NP-Scout: Machine Learning Approach for the Quantification and Visualization of the Natural Product-Likeness of Small Molecules. Biomolecules 2019, 9, 43. 10.3390/biom9020043. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marcou G.; Horvath D.; Solov’ev V.; Arrault A.; Vayer P.; Varnek A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639–642. 10.1002/minf.201100136. [DOI] [PubMed] [Google Scholar]