J. Chem. Inf. Model. 2021 Aug 13;61(9):4342–4356. doi: 10.1021/acs.jcim.1c00375

Materials Precursor Score: Modeling Chemists’ Intuition for the Synthetic Accessibility of Porous Organic Cage Precursors

Steven Bennett , Filip T Szczypiński , Lukas Turcani , Michael E Briggs , Rebecca L Greenaway , Kim E Jelfs †,*
PMCID: PMC8479809  PMID: 34388347

Abstract


Computation is increasingly being used to try to accelerate the discovery of new materials. One specific example of this is porous molecular materials, specifically porous organic cages, where the porosity of the materials predominantly comes from the internal cavities of the molecules themselves. The computational discovery of novel structures with useful properties is currently hindered by the difficulty in transitioning from a computational prediction to synthetic realization. Attempts at experimental validation are often time-consuming, expensive, and frequently, the key bottleneck of material discovery. In this work, we developed a computational screening workflow for porous molecules that includes consideration of the synthetic difficulty of material precursors, aimed at easing the transition between computational prediction and experimental realization. We trained a machine learning model by first collecting data on 12,553 molecules categorized either as “easy-to-synthesize” or “difficult-to-synthesize” by expert chemists with years of experience in organic synthesis. We used an approach to address the class imbalance present in our data set, producing a binary classifier able to categorize easy-to-synthesize molecules with few false positives. We then used our model during computational screening for porous organic molecules to bias toward precursors whose easier synthesis requirements would make them promising candidates for experimental realization and material development. We found that even by limiting precursors to those that are easier-to-synthesize, we are still able to identify cages with favorable, and even some rare, properties.

Introduction

Functional materials underpin the foundations of modern society, but their discovery is a long and challenging process. High-throughput computational screening seeks to guide the process and thus to accelerate novel material discovery.1–3 A key component of computational materials discovery must be a consideration of whether a hypothetical material with promising properties can actually be experimentally obtained.4 There are many elements to that challenge, including the ability to obtain or synthesize the precursors, finding a successful synthetic method to form the material, and being able to control the solid-state form and assembly, for example, to enable incorporation into a device. For materials built from entirely or primarily organic components, the consideration of whether the precursor building blocks of the material can be easily, and ideally cheaply, synthesized is vital. Without expert chemists guiding some element of the computational screening process, it is challenging to foresee which new candidate materials are synthetically viable and accessible prior to experimental testing.

The growth of data-driven tools in chemistry has allowed chemists to ease the transition from computational prediction to experimental realization of targeted molecules.5,6 In the pharmaceutical industry, there has been a growing interest in the computational prediction of synthetic difficulty and the automated prediction of retrosynthetic pathways for organic molecules.6 An assessment of synthetic difficulty allows medicinal chemists to more efficiently allocate their time and resources, prioritizing molecules with greater drug-likeness and lower synthetic difficulty. The approaches used to calculate synthetic difficulty can be broadly categorized as follows: calculating the structural complexity of the molecule;7,8 using retrosynthetic analysis to elucidate viable synthetic pathways to a molecule;9 modeling the intuition of expert chemists;10 and finally, machine learning (ML) models trained on extensive reaction databases or data sets of easy- and difficult-to-synthesize molecules.11,12 Existing approaches to calculate the synthetic difficulty of organic molecules include the synthetic accessibility score (SAScore), the synthetic complexity score (SCScore), and the SYnthetic Bayesian Accessibility (SYBA) score.8,11,12 The SAScore uses a combination of structural complexity and fragment contributions to calculate synthetic difficulty.8 A similar approach is employed by SYBA, which also uses the frequency with which fragments appear in public compound data sets to estimate synthetic difficulty.12 The SCScore, meanwhile, uses a neural network to predict the number of reaction steps required to synthesize a molecule, defining synthetically complex molecules as those that require a greater number of reaction steps to synthesize.11 Each of these approaches is subject to individual limitations, making creation of a useful heuristic for synthetic difficulty challenging. However, these scores are able to provide a continuous score extremely quickly, which can be used to bias toward easier-to-synthesize molecules.13

Structural complexity does not imply that a molecule is difficult to synthesize, as complexity can be introduced into a molecule with simple reaction transformations. Identifying retrosynthetic routes is often computationally intensive;9 however, recent open-source developments in computer-aided synthesis planning have been shown to produce viable synthesis pathways and be good estimators of synthetic difficulty.13,14 Meanwhile, data-driven approaches can struggle to generalize to molecules outside of their training set11 and can contain inherent bias due to the lack of failed reactions. Obtaining scores derived from the intuition of expert chemists is often labor intensive and can be subject to human bias;15 however, this approach can create a useful model if large amounts of unbiased training data are obtained. Moreover, many of these predictive methods are developed with the primary objective of obtaining synthesizable drug candidates, whose synthesis requirements may not perfectly align to those of a materials discovery program.

In this work, we focus on predicting the synthetic accessibility of precursors for porous molecular materials, whose properties depend not only on the solid-state packing but also on the structure of their discrete building blocks. The molecular nature of porous molecular materials means that they are typically soluble in common solvents, conferring advantages particularly in their solution synthesis and processability, for instance allowing relatively easy formation into membranes.16 Porous materials have been investigated for applications such as gas storage,17–19 separations,20,21 catalysis,22,23 and sensing.24,25 Porous organic cages (POCs), an example of which can be seen in Figure 1, are a class of porous molecular materials where the porosity predominantly arises from the internal cavity of the molecule (known as “intrinsic” porosity) and can form a porous material in the solid state by directing the packing to create interconnected pore networks, sometimes combined with the “extrinsic” porosity that arises between the molecules due to inefficient packing.

Figure 1.

An imine condensation of 1,3,5-triformylbenzene (4 equiv) with trans-1,2-diaminocyclohexane (6 equiv) to form covalent cage 3 (CC3), a prototypical porous organic cage (POC) with a permanent internal cavity (colored blue).35 Carbon atoms are shown in gray, nitrogen atoms are shown in blue, and hydrogen atoms are shown in white.

POCs pack together in the solid state through intermolecular interactions and do not have the extended chemical bonding found in porous network materials such as metal–organic frameworks (MOFs) and zeolites. Furthermore, the intrinsic porosity of POCs means they have solution-based applications and can give rise to porous liquids.26,27 Although the POC literature is becoming increasingly diverse,28,29 the number of previously reported POCs is in the low hundreds, far fewer than the number of discovered MOFs, numbering 70,000 in the Cambridge Structural Database.30,31 Recently, we reported an approach to explore the vast chemical space of POCs using an evolutionary algorithm.32,33 Using this approach, we were able to computationally identify new cages with less common properties, including a large internal cavity diameter of 16 Å. However, the vast majority of precursors are unlikely to form a shape-persistent POC with a permanent internal cavity, making screening all potential precursors extremely inefficient.34 Many POCs are thus discovered through serendipity or are designed using the intuition of an expert chemist in the POC field.

POCs are typically synthesized via dynamic covalent chemistry (DCC), where the reversible reaction allows for error-correction and affords the opportunity to reach either the thermodynamic product36 or a desired kinetic product,37 rather than an oligomer or polymeric adduct. The computational prediction of a synthetically accessible POC with a desired set of properties is accompanied by a series of challenges. The majority of POCs reported thus far were synthesized through imine bond formation. The precursor components of the POC, for example an amine and an aldehyde for an imine-based molecule, must themselves be synthesized, and they must subsequently react in a predictable manner to form the desired POC of the targeted molecular mass and topology. For example, imine condensation between a trialdehyde and a diamine (arguably the most common precursor pair in reported POCs) can result in cages of six different topologies, in addition to unpredictable mixtures of polymers.38,39 The challenge of synthesizing a POC can extend beyond precursor synthesis; insoluble products can result in inseparable mixtures,39 decreased reaction yields, or an equilibrium driven away from the desired thermodynamic minimum.40,41

The most common topology observed in POCs, denoted as Tri4Di6 using the terminology of Santolini et al.,38 results from the condensation of four tritopic molecules with six ditopic molecules into a single cage unit, as shown for the formation of CC3 in Figure 1. As POCs formed through imine condensation are often the thermodynamically most stable product, we have previously used formation energies to predict the likely topological outcome of a reaction between a precursor pair38 and the likely sorting outcome within a given topology if a mixture of precursors is used.42 The required POC model construction and energy calculations can be automated, for example by our own open-source supramolecular toolkit software (stk),43 which we have previously exploited to computationally aid the experimental discovery of 33 cleanly formed POCs using a robotic platform.30 We have in multiple cases ourselves used computation to accelerate the discovery of POCs,44,45 but time and again, the synthesis of the POC is the most time-consuming component in their development, taking months to years compared to weeks for the computational screening.46 Worse, the synthesis is often not successful at all.

Here, we investigate the best approach to consider the synthetic difficulty of an organic material’s precursors in a computational screening workflow, with a focus on POCs. The long-term goal is to increase the success rate of experimental materials discovery programs in relation to the synthetic realization of computational targets. We develop our own synthetic difficulty prediction model, the Materials Precursor Score (MPScore), and compare how this performs relative to the previously reported SAScore and SCScore. Our model reformulates synthetic difficulty prediction as a classification problem modeled using a random forest to answer the following question: “can you make 1 g of this compound in under 5 steps?”. Finally, we demonstrate the applicability of our classifier for chemists’ intuition in a context that would normally require significant database reduction to a human-tractable size. We demonstrate the model’s ability to bias against precursors that chemists themselves would avoid in materials synthesis, allowing us to focus our computational resources on POCs with a greater probability of being synthetically realized. We show that even when limiting the precursors to the easiest-to-synthesize, we are able to identify shape-persistent POCs with unconventional properties, such as large pore diameters.

Methods

Overall, our ambition is to include a synthetic difficulty consideration of organic precursors within our workflow for computational screening to identify functional POCs. We test two existing approaches (SCScore and SAScore) and compare the results to a new synthetic difficulty model we develop here. We test the three approaches in a computational workflow aimed at identifying synthesizable functional POCs that are shape-persistent. Shape persistency is a property that POCs must exhibit to achieve porosity, where they remain rigid and have a permanent cavity even in the absence of scaffolding solvent. We calculate shape persistency as part of our workflow here as it is relatively computationally cheap to assess, allowing for easy comparison between different computational screening workflows. We have chosen to target POCs with the Tri4Di6 topology, which is the most common topology in previously reported POCs, formed by imine condensations between diamines and trialdehydes. Our computational workflow, which will be described in full detail below, is depicted in Figure 2. The workflow involves screening a precursor database to remove molecules predicted to have the largest synthetic difficulty, followed by automated cage construction and POC structure prediction, and finally characterization of the POC pores to identify those with permanent internal cavities. In the below subsections, we first discuss the creation of a labeled training database used to train our synthetic difficulty model. We then evaluate our model using cross-validation and finally demonstrate the model’s utility, selectively screening for easy-to-synthesize precursors as part of a POC screening workflow.

Figure 2.

High-throughput computational screening workflow for the discovery of synthetically viable shape-persistent porous organic cages (POCs).

Creating the Synthetic Difficulty Model for Organic Material Precursors

Training Database Construction

First, to train an ML model to classify potential POC precursors as either easy-to-synthesize or difficult-to-synthesize, we needed to generate a diverse training data set with an approximately equal number in each group. Initially, molecules were extracted from a subset of the proprietary Reaxys47 and eMolecules48 databases. This initial subset consisted of molecules with functional groups that are likely to undergo common DCC reactions, frequently used in materials synthesis. This set included di-, tri-, tetra-, penta-, and hexatopic amines and aldehydes, which was extended with molecules that have been used previously in POC synthesis by our experimental collaborators, known to be easy-to-synthesize. Additionally, we included molecules that our experimental collaborators have previously avoided due to their challenging syntheses. To this set of molecules, functional group substitutions were performed to extend the size and diversity of our training set, resulting in a set of molecules with a fairly even distribution of different functional groups. For each molecule in the initial starting set, a SMARTS substitution was performed, exchanging all functional groups in the molecule with those from a predefined list, resulting in a data set consisting of 14,859 molecules. The final database, in addition to the script used to perform functional group substitutions, is available on GitHub.49 A full breakdown of molecules by their respective functional groups can be found in Table S1.
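
To illustrate the functional group substitution step, the sketch below shows how such SMARTS-based swaps can be performed with RDKit; the substitution list and helper name are illustrative assumptions rather than the exact ones used to build the database.

```python
# A minimal sketch of SMARTS-based functional group substitution with RDKit.
from rdkit import Chem

# Hypothetical substitution list: SMARTS pattern to find -> fragment to swap in.
SUBSTITUTIONS = {
    "aldehyde_to_nitrile": ("[CX3H1](=O)", "C#N"),
    "aldehyde_to_amine": ("[CX3H1](=O)", "N"),
}

def substitute_groups(smiles: str) -> list[str]:
    """Return SMILES of molecules with every match of each pattern replaced."""
    mol = Chem.MolFromSmiles(smiles)
    products = []
    for name, (pattern, replacement) in SUBSTITUTIONS.items():
        query = Chem.MolFromSmarts(pattern)
        repl = Chem.MolFromSmiles(replacement)
        if not mol.HasSubstructMatch(query):
            continue
        # replaceAll=True swaps every occurrence of the group in one pass.
        new_mols = Chem.ReplaceSubstructs(mol, query, repl, replaceAll=True)
        for new_mol in new_mols:
            Chem.SanitizeMol(new_mol)
            products.append(Chem.MolToSmiles(new_mol))
    return sorted(set(products))

# 1,3,5-triformylbenzene -> tri-substituted analogues
print(substitute_groups("O=Cc1cc(C=O)cc(C=O)c1"))
```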

The molecules were then assessed by three experimental chemists with at least Ph.D.-level training in synthetic chemistry. Existing measures of synthetic difficulty of organic compounds originate from the drug discovery field and might not capture the scale and simplicity required for materials synthesis. Therefore, we asked experienced synthetic chemists to label molecules relevant to POC synthesis as “easy-to-synthesize” based on the question “can you make 1 g of this compound in under 5 steps?”. Instead of producing a continuous measure of synthetic difficulty, as is the case in both the SAScore and SCScore, we aimed to create a discrete binary classifier, with the goal of identifying easy-to-synthesize precursors for POC synthesis. A binary classification approach was chosen to reduce the challenge of collecting training data, and it is much easier to obtain a consensus on a binary classification. While the use of binary classification does not capture the subtle variations in ease of synthesis between molecules, our goal for the MPScore was not to detect these nuances but to detect and prioritize easy-to-synthesize precursors. To achieve this, we did not require the user to perform an in-depth retrosynthetic search but instead favored a fast judgment on the ease of synthesis.

Figure 3 shows the graphical user interface developed to collect the training labels. Three of our authors labeled the molecules; these are author R.L.G., with 12 years of research experience in organic synthesis and 8 years of experience synthesizing POCs; author F.S., with more than 4 years of research experience in organic chemistry; and author M.E.B., with 20 years of organic chemistry experience and 8 years of experience synthesizing POCs. Those labeling the molecules were presented with a two-dimensional representation of a molecule they had not previously scored and tasked with answering “yes” or “no” to the previously mentioned criterion. Each molecule was presented randomly, with equal probability of selection, to reduce the likelihood of systematic scoring occurring if the chemist’s opinion was influenced by the preceding molecule. An “unsure” option was also provided, which appended nothing to the database and simply skipped to the next molecule.

Figure 3.

Interface for labeling of molecules by experimental chemists.

Random Forest Model

In our synthetic difficulty model, we aimed to replicate the decision-making process that experimental chemists themselves would use when selecting precursors for materials synthesis. For this task, we chose a random forest (RF) classifier to model the data due to its practical utility, such as the fact that RF models are quick to train and produce good performance on small to medium data sets. RF models are frequently used in chemistry problems to develop quantitative structure–property relationships for both classification and regression problems, and the models can offer some interpretability.50,51 We used the RandomForestClassifier Python class, as implemented in scikit-learn version 0.24.1,52 to construct the model. We chose “balanced” as the “class_weight” hyperparameter, which reduces the effect of class imbalance by weighting each data point inversely to the class frequency, increasing the importance of classes that appear fewer times in the training data set. Weighted RFs have been shown to perform equally well, and in some cases better, on imbalanced data sets than sampling techniques such as the Synthetic Minority Oversampling Technique, which artificially generates minority class data.53 Hyperparameter optimization was used to identify the best performing parameters for both the RF and the fingerprinting algorithm used to create vector representations. We used a randomized grid search to sample the parameter space, identifying the best parameters for the following: the number of decision trees, the maximum tree depth, the minimum sample split, the minimum samples per leaf, the maximum number of leaf nodes, the maximum features used to identify the best split, and finally, the maximum samples to draw from each estimator. Further implementation details and best performing parameters can be found in the SI.
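
As a minimal sketch of this setup, the snippet below builds a balanced-class-weight RandomForestClassifier and a randomized search over the listed hyperparameters with scikit-learn; the parameter ranges and scoring choice are illustrative assumptions, not the grid reported in the SI.

```python
# A sketch of the RF model and randomized hyperparameter search.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rf = RandomForestClassifier(class_weight="balanced", random_state=42)

# Illustrative parameter ranges covering the hyperparameters named in the text.
param_distributions = {
    "n_estimators": [100, 250, 500, 1000],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_leaf_nodes": [None, 100, 500],
    "max_features": ["sqrt", "log2", None],
    "max_samples": [None, 0.5, 0.75],
}

search = RandomizedSearchCV(
    rf,
    param_distributions=param_distributions,
    n_iter=50,            # number of random parameter settings sampled
    scoring="precision",  # assumed target metric; several are reported in the paper
    cv=5,                 # 5-fold cross-validation, as in the text
    n_jobs=-1,
    random_state=42,
)
# X is a matrix of count-based ECFPs, y the binary easy/difficult labels.
# search.fit(X, y)
# print(search.best_params_)
```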

The extended-connectivity fingerprint (ECFP) was chosen as the vector representation of each molecule, as implemented in RDKit version 2020.09.4.54,55 We included the bit-size and radius as parameters to be optimized during the hyperparameter optimization procedure, identifying a bit-size of 1024 and a radius of 4 as the best performing parameters, as shown in Table S4. The ECFP was chosen as it has previously been shown to be one of the best performing fingerprints in similarity searches aimed at identifying molecules with similar bioactive properties.56 A count-based fingerprint was implemented to encode the number of times a feature appears in the molecule. This count-based approach is thought to provide greater information within the vector encoding of the molecule and shows improved performance when predicting bioactivity compared to its bit-encoded counterpart.57
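
A minimal sketch of the count-based ECFP featurization (1024 bits, radius 4) using RDKit's hashed Morgan fingerprint is shown below; the helper name and example molecule are illustrative assumptions.

```python
# A sketch of count-based ECFP featurization with RDKit.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def count_ecfp(smiles: str, radius: int = 4, n_bits: int = 1024) -> np.ndarray:
    """Return a count-based hashed Morgan fingerprint as a dense vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetHashedMorganFingerprint(mol, radius, nBits=n_bits)
    vec = np.zeros(n_bits, dtype=np.int32)
    # GetNonzeroElements maps bit index -> number of times the feature occurs.
    for bit, count in fp.GetNonzeroElements().items():
        vec[bit] = count
    return vec

x = count_ecfp("Nc1ccc(N)cc1")  # p-phenylenediamine, an example diamine
print(x.shape, x.sum())
```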

Cross-Validation and Calibration of the Model

We used 5-fold cross-validation to assess the performance and generalizability of each RF model during the hyperparameter optimization procedure and to estimate the performance of the final MPScore model. During cross-validation, an individual RF model was trained and evaluated using each fold. In 5-fold sampling, the data set is split into five different folds randomly, and the fold that is used as the test set is changed each time. This evaluation procedure allows us to assess how well the model can generalize to unseen data, allowing all the labeled samples to be used as test data at least once. We calculated the average accuracy, precision, recall, F1 scores, and Fβ (β = 0.2) across all folds to assess the performance of the models (as discussed later in the Results). These scores are defined as

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$

$$F_\beta = \frac{(1 + \beta^2) \times \text{precision} \times \text{recall}}{\beta^2 \times \text{precision} + \text{recall}} \tag{5}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.
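
The sketch below illustrates how these per-fold metrics could be computed with scikit-learn during 5-fold cross-validation; the random seeds and default RF settings are illustrative assumptions.

```python
# A sketch of 5-fold cross-validation reporting the metrics defined above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, fbeta_score,
                             precision_score, recall_score)
from sklearn.model_selection import StratifiedKFold

def cross_validate_mpscore(X, y, n_splits=5, beta=0.2):
    """Train one RF per fold and average metrics for the easy-to-synthesize class.

    X: numpy array of count-based ECFPs; y: binary labels (1 = easy-to-synthesize).
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    per_fold = []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(class_weight="balanced", random_state=42)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        per_fold.append({
            "accuracy": accuracy_score(y[test_idx], pred),
            "precision": precision_score(y[test_idx], pred),
            "recall": recall_score(y[test_idx], pred),
            "f1": f1_score(y[test_idx], pred),
            "fbeta": fbeta_score(y[test_idx], pred, beta=beta),
        })
    # Average each metric over the folds.
    return {k: np.mean([fold[k] for fold in per_fold]) for k in per_fold[0]}
```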

Following cross-validation, we calibrated the probabilities returned by the RF model using Platt scaling to improve the probability estimates. As we interpret the MPScore as a continuous measure of synthetic difficulty, this procedure ensures that probabilities from the RF are reliable. Ensemble methods, such as RFs, are known to struggle to provide probability predictions close to 0 or 1, as the average over the individual classifiers in the ensemble pushes the probability away from 0 or 1.58 To assess this, we plotted a calibration curve, as shown in Figure S2, which shows how scaling the final probabilities improves the probability estimates.

For the final model, we split the entire data set into a training set (75% of all data) and a calibration set (25% of all data), using the scores achieved during cross-validation as performance estimates. We also calculated the F1 score and the Fβ (β = 0.2) score, each of which combines precision and recall into a single metric of model performance. The Fβ score provides a weighted average of the precision and recall, in which a value of β < 1 favors precision over recall. The value of 0.2 was selected so as to weight precision five times more heavily than recall, which we believed was a good compromise for the MPScore. In the subsequent Results section, we justify why we believed this trade-off was useful for our model, and why we chose to focus primarily on maximizing the precision score.
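
A minimal sketch of the 75/25 split and Platt scaling with scikit-learn's CalibratedClassifierCV is shown below, using stand-in data in place of the fingerprints and chemist labels.

```python
# A sketch of the final training/calibration procedure.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data; in practice X holds the count-based ECFPs and y the chemists' labels.
X, y = make_classification(n_samples=2000, n_features=1024, weights=[0.84], random_state=42)

X_train, X_calib, y_train, y_calib = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

rf = RandomForestClassifier(class_weight="balanced", random_state=42)
rf.fit(X_train, y_train)

# cv="prefit" calibrates the already-trained forest on the held-out 25%;
# method="sigmoid" corresponds to Platt scaling.
calibrated = CalibratedClassifierCV(rf, method="sigmoid", cv="prefit")
calibrated.fit(X_calib, y_calib)

# Calibrated class probabilities for a few held-out molecules.
print(calibrated.predict_proba(X_calib[:5]))
```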

High-Throughput Virtual Screening

Next, we compared how the three automated synthetic difficulty scores (SAScore, SCScore, and our MPScore) perform as filters for easy-to-synthesize precursors in a POC virtual screening workflow. By selecting precursors with the lowest 1% of synthetic difficulty scores using each of the three scoring methods, we hoped to identify shape-persistent POCs that could be readily accessed from the easiest-to-synthesize precursors.

Preparation of the Precursor Database

We wanted a database of potential precursors for POCs in the Tri4Di6 topology, which are formed through a [4 + 6] imine condensation reaction between a tritopic aldehyde and a ditopic primary amine. This single topology was chosen as we previously showed that a significant number of precursors will preferentially form Tri4Di6 over the other common Tri8Di12 topology,32 and because Tri4Di6 molecules are more likely to be shape-persistent (38% of 6,018 cages we investigated in a previous study).34 To create our database of POC precursor molecules, we used the eMolecules48 and Reaxys47 databases, selecting only molecules that contained exactly three aldehydes or two primary aliphatic amines. All functional groups within molecules were identified using an automated detection algorithm, implemented using RDKit.59 The final precursor database comprised 7,190 ditopic amines and 98 tritopic aldehydes, resulting in a possible 704,620 POC combinations that can be formed when restricted to the Tri4Di6 topology.
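
As an illustration of this filtering step, the sketch below counts aldehyde and primary aliphatic amine groups with simple SMARTS patterns; these patterns are assumptions and are simpler than the automated functional group detection algorithm used here.

```python
# A sketch of topicity-based precursor filtering with RDKit substructure counts.
from typing import Optional

from rdkit import Chem

ALDEHYDE = Chem.MolFromSmarts("[CX3H1](=O)[#6]")               # aldehyde on carbon
PRIMARY_ALIPHATIC_AMINE = Chem.MolFromSmarts("[NX3;H2][CX4]")  # -NH2 on sp3 carbon

def classify_precursor(smiles: str) -> Optional[str]:
    """Label a molecule as a tritopic aldehyde, a ditopic amine, or neither."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    n_aldehyde = len(mol.GetSubstructMatches(ALDEHYDE))
    n_amine = len(mol.GetSubstructMatches(PRIMARY_ALIPHATIC_AMINE))
    if n_aldehyde == 3 and n_amine == 0:
        return "trialdehyde"
    if n_amine == 2 and n_aldehyde == 0:
        return "diamine"
    return None

print(classify_precursor("O=Cc1cc(C=O)cc(C=O)c1"))  # 1,3,5-triformylbenzene
print(classify_precursor("N[C@H]1CCCC[C@@H]1N"))    # trans-1,2-diaminocyclohexane
```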

Identification of Synthesizable Precursors

For each combination of diamine and trialdehyde precursors, we calculated the sum of their synthetic difficulty scores with each of the three scoring methods. These sums were then scaled to values between 0 and 1 to allow for comparison between the models. For the MPScore, we interpreted the probability that a molecule belonged to the “difficult-to-synthesize” class as a continuous score, such that a higher value indicates the molecule is more challenging to synthesize. As mentioned in the previous Methods section, we scaled these probabilities to return a more reliable estimate. For each synthetic difficulty model, precursors within the first percentile of synthetic difficulty values (in total 21,140 precursor pairs, assuming no duplicates) were investigated for shape-persistence. Duplicate precursor combinations were identified by concatenating the SMILES strings of the diamine and trialdehyde precursors together and finding the overlap between precursor combinations selected by each score. In addition to easy-to-synthesize precursors filtered using synthetic difficulty scores, we selected a further 1% of precursor combinations randomly as a control sample, to investigate whether the three synthetic difficulty scores also bias toward precursors likely to form a shape-persistent cage.
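
The sketch below illustrates the pair-scoring and selection logic described above (score summation, min–max scaling, first-percentile cutoff, and duplicate detection via concatenated SMILES); variable and function names are illustrative assumptions.

```python
# A sketch of aggregating precursor scores and selecting the first percentile.
import numpy as np

def select_easy_pairs(pairs, scores):
    """pairs: list of (diamine_smiles, trialdehyde_smiles) tuples.
    scores: dict mapping a precursor SMILES to its synthetic difficulty score."""
    totals = np.array([scores[amine] + scores[aldehyde] for amine, aldehyde in pairs])
    # Min-max scale the summed scores to [0, 1] for comparison between models.
    scaled = (totals - totals.min()) / (totals.max() - totals.min())
    cutoff = np.percentile(scaled, 1)  # first-percentile threshold
    selected, seen = [], set()
    for (amine, aldehyde), value in zip(pairs, scaled):
        key = amine + "." + aldehyde   # duplicate detection by concatenated SMILES
        if value <= cutoff and key not in seen:
            seen.add(key)
            selected.append((amine, aldehyde, value))
    return selected
```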

Cage Construction and Conformational Search

The cages were constructed from precursors in an automated approach that utilizes our supramolecular toolkit (stk, version 18.12.2019).43 stk is a Python library for constructing and optimizing complex supramolecular species from precursor molecules and a predefined molecular topology. Structures built in stk then underwent a three-step procedure to identify plausible geometries for the lowest energy conformation of each cage. Each step of the process employed the OPLS3 force field,60 used within Schrödinger’s MacroModel software,61 which has been shown to be able to accurately reproduce geometries of flexible imine cages.38 First, only bonds created during the stk build process were optimized, fixing the geometries of all other atoms. Subsequently, a molecular dynamics (MD) simulation was performed in the NVE ensemble for 2 ns after a 100-ps equilibration, with a time step of 1 fs and a temperature of 700 K. Structures were sampled every 40 ps along the MD trajectory, and each of the resulting 50 sampled structures underwent a further geometry optimization. All geometry optimizations employed the Polak–Ribière Conjugate Gradient algorithm, using a gradient convergence criterion of 0.05 kJ Å–1 mol–1. The resulting lowest-energy conformation was used for further analysis.
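
A minimal sketch of the cage construction step is shown below using a recent stk release; note that the paper used stk version 18.12.2019, whose interface differs, and the subsequent OPLS3/MacroModel conformer search is not reproduced here.

```python
# A sketch of Tri4Di6 (four-plus-six) imine cage construction with stk.
import stk

trialdehyde = stk.BuildingBlock(
    smiles="O=Cc1cc(C=O)cc(C=O)c1",          # 1,3,5-triformylbenzene
    functional_groups=[stk.AldehydeFactory()],
)
diamine = stk.BuildingBlock(
    smiles="N[C@H]1CCCC[C@@H]1N",             # trans-1,2-diaminocyclohexane
    functional_groups=[stk.PrimaryAminoFactory()],
)

# FourPlusSix corresponds to the Tri4Di6 topology: four tritopic aldehydes
# condensed with six ditopic amines through imine bonds.
cage = stk.ConstructedMolecule(
    topology_graph=stk.cage.FourPlusSix(building_blocks=(trialdehyde, diamine)),
)

# Write an initial geometry; conformer searching (MD + minimization) would follow.
stk.MolWriter().write(cage, "cc3_initial.mol")
```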

Identification of Shape Persistent Cages

Following the optimization procedure, organic cages that did not remain shape-persistent were removed. pyWindow was used to detect and analyze all the windows in the cages. pyWindow is a Python package for the analysis of structural properties of molecular pores, shown to be able to accurately reproduce pore sizes of POCs with a Tri4Di6 topology.62 For the POCs in which the expected number of four windows was identified, we calculated a parameter α

$$\alpha = \frac{\text{average window difference}}{\text{average window diameter}} \tag{6}$$

to classify shape-persistent cages. We developed this equation in our previous work, with the aim of maximizing the number of organic cages labeled as shape-persistent using an automated approach.34 The average window difference in eq 6 is the average difference in window diameter for all possible pairs of windows. If α is less than 0.035 and cavity size is greater than 1 Å, the cage was classified as shape-persistent. Otherwise, it was assigned as undetermined and disregarded. For organic cages that were deemed shape-persistent by this analysis, the central cavity diameter was then calculated using pyWindow.
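
The sketch below illustrates this shape-persistence check with pyWindow, assuming α is computed as the average pairwise window-size difference divided by the average window diameter; the exact form of eq 6 follows our previous work (ref 34).

```python
# A sketch of the shape-persistence check using pyWindow.
from itertools import combinations

import numpy as np
import pywindow as pw

def is_shape_persistent(structure_file, alpha_max=0.035, min_cavity=1.0):
    """Return True if the cage has four windows, alpha < 0.035, and cavity > 1 A."""
    molsys = pw.MolecularSystem.load_file(structure_file)
    mol = molsys.system_to_molecule()
    windows = mol.calculate_windows()
    if windows is None or len(windows) != 4:   # expect four windows for Tri4Di6
        return False
    # Average difference in window diameter over all possible window pairs.
    avg_difference = np.mean([abs(a - b) for a, b in combinations(windows, 2)])
    alpha = avg_difference / np.mean(windows)  # assumed normalization (see eq 6)
    cavity = mol.calculate_pore_diameter()
    return alpha < alpha_max and cavity > min_cavity
```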

Results and Discussion

Materials Precursor Score (MPScore)

Evaluating Chemists’ Scores

To train our MPScore, we first constructed a diverse database of 14,859 molecules to provide to experimental chemists for labeling, as was discussed in the previous section. Ideally, we would have a very large number of chemists rank all molecules in our database to achieve an overall consensus. However, the labeling of these molecules is extremely time-consuming, taking approximately 1 h to assess 180 molecules, and thus it was only possible to obtain 12,553 labeled data points, with the largest number labeled by R.L.G. (10,000), followed by F.S. (1,858) and M.E.B. (695). As all three chemists work in fields related to organic synthesis, their classifications are of particular value in a model that considers ease-of-synthesis scoring for POCs (and molecular materials in general). Despite the relatively small amounts of labeled data obtained from M.E.B. and F.S., we believed model performance would benefit from having multiple labeled data points for at least some of the molecules. Averaging the scores assigned by expert chemists has been shown to lead to a better prediction of synthetic difficulty than the opinion of individuals, due to bias originating from personal preference and experiences.63 The training database labeled by the three experienced chemists contained 12,553 data points in total, of which 2,008 were positive easy-to-synthesize labels and 10,545 were negative difficult-to-synthesize labels, including overlapping molecules scored by multiple chemists. Table 1 shows the number of molecules scored by each chemist, in addition to the percentage of molecules each chemist labeled easy- and difficult-to-synthesize, and their years of synthetic chemistry experience. R.L.G. labeled the smallest percentage of molecules as easy-to-synthesize (11% of all labeled molecules), followed by M.E.B. (33%) and F.S. (36%). Despite the widely varying percentages of molecules assigned as easy- and difficult-to-synthesize, the number of molecules scored by multiple chemists was relatively low (11% of all the molecules in the training set), as seen in Table 2. This relatively low overlap in the molecules scored by each chemist suggests that the greater proportion of difficult-to-synthesize labels, especially in the case of R.L.G.’s labels, is not indicative of systematic labeling of one class over the other.

Table 1. Number of Molecules Labeled by Each Chemist, in Addition to the Number of Easy- and Difficult-to-Synthesize, and Their Respective Percentages.a
  R.L.G. F.S. M.E.B.
molecules 10,000 1,858 695
years of experience 12 4 20
easy-to-synthesize 1,109 (11%) 667 (36%) 232 (33%)
difficult-to-synthesize 8,891 (89%) 1,191 (64%) 463 (67%)
a The number of years of synthetic chemistry experience of each chemist is also given.

Table 2. Summary of the Molecules Labeled by Each Synthetic Materials Chemist.a
  labeled by three labeled by two labeled by one
molecules 42 1,625 10,886
in agreement 31 (74%) 1,179 (73%)  
in disagreement 11 (26%) 446 (27%)  
a Percentages of each label compared to the total number of molecules labeled by each chemist are given in brackets.

Although the number of molecules scored by multiple chemists was low (1,667 molecules in total), the chemists agreed with each other 73% of the time on average. As shown in Table 2, for the 42 molecules labeled by three chemists, the chemists agreed 74% of the time, and for the 1,625 molecules labeled by two chemists, the chemists agreed 73% of the time. The 11 molecules that at least one out of the three chemists labeled differently are shown in Figure S1, followed by the labels assigned by the three chemists in Table S2. We found that disagreement between chemists was relatively large (27% of molecules labeled by two or more chemists), which, as expected, shows chemical intuition can be variable and somewhat subject to prior experiences. Discrepancies in chemist labels were counteracted by providing both positive and negative labels for the same molecule as training data for the RF model, which reduces the importance of that training sample on the overall decision made by the model. Indeed, the disagreement between chemists and the relatively short time frame in which molecules were assessed indicates this classifier is an attempt to model the fast intuition of an expert synthetic chemist when selecting precursors rather than an in-depth retrosynthetic analysis of molecules.

Comparison with Existing Methods

Figure 4 compares the labels that the expert chemists assigned to molecules in the training database with the synthetic difficulty scores calculated by the SAScore and SCScore. The correlation between the SAScore and the SCScore for our training database was very weak, with a Pearson correlation coefficient of 0.27.

Figure 4.

Synthetic difficulty scores of molecules in the training data set, calculated using the SCScore and SAScore methods, scaled between 0 and 1. Color coding refers to those labeled as easy-to-synthesize (green) or difficult-to-synthesize (red) by the synthetic chemists. Kernel density estimates of the synthetic difficulty score distributions are shown on the top and side panels.

The overlapping distributions of synthetic difficulty scores for molecules labeled as easy- and difficult-to-synthesize suggest that these models are not able to readily distinguish between precursors for materials synthesis in a way that agrees with experienced chemists working in the field. These results highlight the need to develop an alternative heuristic model that can identify the molecules that synthetic chemists in the field of materials discovery would select themselves.

To assess the correlation of our MPScore with the SAScore and the SCScore, we calculated Spearman’s rank correlation coefficient between each score for the molecules in the training set. Table S5 shows the MPScore exhibits a weak positive correlation with both the SAScore (0.15) and SCScore (0.35). Additionally, we present the molecules from our training set with the largest and smallest differences in synthetic difficulty scores in Figure S4, alongside their synthetic difficulty values in Table S6. From inspection, the three scores tended to agree on more structurally complex molecules. However, there were significant differences in scores for simpler molecules, which could highlight some of the limitations of each approach. These weak correlations and large differences in synthetic difficulty scores may be explained by the different approaches the SAScore, SCScore, and MPScore use to quantify synthetic difficulty.

The SCScore aims to capture synthetic complexity as defined by the predicted number of reaction steps required to make a molecule. Therefore, complex molecules appearing a greater number of times at the start of a reaction pathway would exhibit a lower perceived synthetic difficulty than less complex molecules that appear more frequently near the end of a reaction pathway. By contrast, the SAScore primarily uses a measure of structural complexity to measure synthetic difficulty. Complexity can be easily introduced into a molecule using many robust reaction transformations, meaning apparently complex molecules are not necessarily more synthetically challenging to access. Meanwhile, the MPScore is entirely subject to chemists’ intuition, which may be influenced by familiar structural features or common functional groups. While some obviously simple or complex molecules may have taken only a few seconds to classify, others may have taken significantly longer, resulting in an average rate of 20 s per classification. Although a similar small scoring time frame has been used to develop synthetic difficulty models, human intuition can still be subject to errors.15,64 Indeed, the nature of the binary classification of the MPScore does not capture subtle changes in ease of synthesis which other scores attempt to capture; however, it increases the rate at which training data can be obtained as it is a less cognitively demanding task than assigning a continuous score. The different assumptions included in each scoring method can lead to very different results in quantifying synthetic difficulty, which is what we observe in Figure 4. However, this does not necessarily detract from each score’s ability to bias toward easy-to-synthesize precursors.

To gain insight into some of the frequently occurring structural features of easy- and difficult-to-synthesize molecules that may have influenced the chemists’ decisions, we used the dimensionality reduction technique principal component analysis (PCA), followed by the k-means algorithm, to categorize molecules by their structural features. Figure S3 shows representative molecules from each cluster of easy- and difficult-to-synthesize molecules, which could indicate structural features to avoid when designing precursors for computational screening. Further details of the clustering technique used are available in the Supporting Information.
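
A minimal sketch of this clustering analysis with scikit-learn is shown below; the number of principal components and clusters are illustrative choices rather than those used for Figure S3.

```python
# A sketch of PCA followed by k-means clustering of molecular fingerprints.
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def cluster_molecules(X, n_components=2, n_clusters=8, random_state=42):
    """X: fingerprint matrix for easy- or difficult-to-synthesize molecules.

    Returns the PCA coordinates and the k-means cluster label of each molecule."""
    coords = PCA(n_components=n_components, random_state=random_state).fit_transform(X)
    labels = KMeans(
        n_clusters=n_clusters, random_state=random_state, n_init=10
    ).fit_predict(coords)
    return coords, labels
```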

MPScore: Model Training and Validation

We trained an RF model using the labeled training set discussed above to see if this would give an improved assessment of the synthetic difficulty of material precursors, in particular for POC systems. The developed model was evaluated using 5-fold cross-validation, and the resulting average scores across all folds and their standard errors are shown in Table 3, in addition to the sum of false positives, false negatives, true positives, and true negatives across all five folds, shown in Table 4.

Table 3. Evaluation Metrics for the MPScore Model.a
accuracy: 0.88 ± 0.004
precision (easy-to-synthesize): 0.84 ± 0.013
precision (difficult-to-synthesize): 0.88 ± 0.0037
recall (easy-to-synthesize): 0.32 ± 0.016
recall (difficult-to-synthesize): 0.99 ± 0.00080
F1 score: 0.46 ± 0.017
Fβ (β = 0.2): 0.79 ± 0.014
a Scores are averaged across the five cross-validation folds, and the standard errors are shown.

Table 4. Sum of All Outcomes from Each Fold of the Cross-Validation Procedure Used to Estimate the Performance of the MPScore, in Addition to the Total Number of Predictions Made.a
false negatives false positives true positives true negatives total
1,370 121 638 10,424 12,553
a False positive outcomes refer to molecules labeled as difficult-to-synthesize by the chemist but easy-to-synthesize by our MPScore, whereas false negative outcomes refer to molecules labeled as easy-to-synthesize by the chemist but difficult-to-synthesize by our MPScore.

Ideally, high precision and recall are desirable for any binary classifier, but their impact on the usefulness of a synthetic difficulty score is very different. Low precision (large number of false positives), when a large number of molecules labeled easy-to-synthesize are actually difficult-to-synthesize, wastes the resources of experimental chemists. Low recall (high proportion of false negatives) results in precursor candidates being missed that could have had favorable properties when formulated into the final material but were incorrectly classed as difficult-to-synthesize and disregarded. However, the latter misclassification is less problematic, as there is no experimental cost associated with missing an easy-to-synthesize molecule. This trade-off is quantified in the weighted Fβ score, which values precision five times more than recall when β = 0.2. While prescreening precursors according to their synthetic difficulty with the MPScore may miss candidates with exceptional properties, whose precursors are challenging to synthesize, we instead focus our computational resources on cages that are most likely to be made in a lab, given lab synthesis is the bottleneck we currently face in our computational screening approaches. Therefore, we aimed to develop a model that maximized precision, reducing the number of false positives obtained.

The MPScore model was able to achieve an overall accuracy of 0.88, as seen in Table 3. The mean precision and mean recall for the difficult-to-synthesize label were 0.88 and 0.99, respectively. Lower mean precision and recall scores of 0.84 and 0.32, respectively, were seen for the easy-to-synthesize label. We hypothesize that the lower precision and recall for the easy-to-synthesize label are due to the model’s tendency to classify a molecule as difficult-to-synthesize, as suggested by the recall of 0.99 in the difficult-to-synthesize class. This is exemplified by the 10,424 true negatives assigned by the MPScore, which far outweigh the 638 true positives the model correctly classified, as shown in Table 4. This can be explained by the imbalanced data set used to train the MPScore, in which the overwhelming majority of molecules were labeled as difficult-to-synthesize. This differs from the typical training data set used in ML for chemistry, which consists of only positive examples due to the absence of failed experiments or negative results. In such cases, the generation of negative results has proved advantageous for generating useful ML models.65 Later in this subsection, we show how the threshold the classifier uses to make predictions can be optimized to increase the precision of the model, reducing the negative consequences of class imbalance.

The F1 score is the harmonic mean of the precision and recall scores. The F1 score of 0.46 obtained for the MPScore reflects the poor recall achieved for the easy-to-synthesize class (0.32). Meanwhile, the significantly higher Fβ score of 0.79 is much less affected by the poor recall, instead favoring the higher precision score of 0.84. A minor contributing factor to the low recall score could be that some features of easy-to-synthesize molecules are similar to those of difficult-to-synthesize molecules, resulting in the model having difficulty distinguishing between the two classes. Figure 5a shows that many fingerprint bit values contributed similar amounts to the classification of the RF model. This shows, as would be expected, that the chemists’ opinions were influenced by multiple fragments present within a molecule. Additionally, the confidence intervals in Figure 5a show how the importance of features differs widely across each decision tree in the forest.

Figure 5.

Mean importances for the 20 most important fingerprint bits (a) and the precision-recall curve (b) for our MPScore, shown in red. Feature importances (a) were calculated across all 100 decision trees in the random forest, and the standard deviation across the 100 decision trees in the ensemble is shown by the black bars. Precision and recall scores (b) are calculated using the average precision and recall scores across all five folds. The baseline classifier, shown in gray, assigns a probability equal to the fraction of positive samples in each fold. The precision and recall values for the final threshold used for our MPScore model, 0.21, are labeled by the black circle, and standard errors are represented by the black lines.

One approach to minimize the impact of class imbalance is to adjust the probability threshold of the classifier.66 Increasing the threshold required for a molecule to belong to the minority class (easy-to-synthesize) reduces the false positive rate, increasing the precision of the classifier. The probability of a sample belonging to a particular class in an RF model is determined by the proportion of decision trees in the ensemble assigning that sample to that class. Figure 5b shows the effect of changing the probability cutoff threshold for the difficult-to-synthesize class on the mean precision and recall values achieved by the RF models during the cross-validation process. It can be seen that, by increasing the probability threshold, the model’s precision outperforms the recall. At almost all thresholds, the MPScore outperforms the naive classifier, defined as a model assigning molecules as easy- and difficult-to-synthesize with probabilities equal to the fraction of each class. Choosing a high probability threshold for the MPScore minimized the risk of obtaining false positives (high precision), at the expense of a larger number of false negatives (low recall). Despite the potential to miss easy-to-synthesize molecules, this compromise ensured we can be more certain that the molecules selected by the MPScore are not false positives that would waste synthetic effort. One estimate of the size of the chemical space of synthesizable drug-like molecules alone is 10¹¹ molecules,67 meaning the cost of obtaining a false negative and missing a potential easy-to-synthesize candidate is low.

To reduce the probability of obtaining false positive scores, a final threshold of 0.21 was chosen for our MPScore to select easy-to-synthesize molecules during the precursor screening process; this value was selected so as to assign only 1% of all molecules in the POC precursor database as easy-to-synthesize. Molecules with a probability below 0.21 of belonging to the difficult-to-synthesize class were classified as easy-to-synthesize and carried forward in the workflow. At this probability threshold, the precision and recall scores for the easy-to-synthesize test-set class were 0.89 and 0.25, respectively. This means that, despite missing 75% of easy-to-synthesize molecules, we can be 89% certain a selected molecule is not a false positive, a necessary trade-off for our materials screening workflow. We trained the final MPScore model used in the screening workflow on 75% of the entire data set of labeled molecules, followed by calibration using Platt scaling on the remaining 25%, using the scores obtained from the cross-validation procedure as performance estimates for the final model. The model, in addition to the training scripts, is available on GitHub.49
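
Applying the final threshold is then a simple comparison against the calibrated probabilities, as sketched below; the class index is an assumption about how the labels are encoded.

```python
# A sketch of applying the final 0.21 decision threshold: a molecule is flagged
# easy-to-synthesize when its calibrated probability of belonging to the
# difficult-to-synthesize class falls below the threshold.
THRESHOLD = 0.21

def is_easy_to_synthesize(calibrated_model, X, difficult_class_index=1):
    """Return a boolean mask over the rows of X (fingerprint matrix)."""
    p_difficult = calibrated_model.predict_proba(X)[:, difficult_class_index]
    return p_difficult < THRESHOLD
```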

High-Throughput Computational Discovery of Synthesizable POCs

To test the applicability of the MPScore, we investigated its performance within a high-throughput screen for the discovery of novel shape-persistent POCs. We repeated the screening with the SAScore and SCScore separately to compare the results. The scores were used to eliminate difficult-to-synthesize POC precursors from a database of diamines and trialdehydes. Using the remaining easy-to-synthesize precursors, we constructed and geometry optimized the POCs that could be formed from these precursors, analyzing the pore structure of those that were predicted to be shape-persistent.

Precursor Database and Synthesizable Precursors

We explored the chemical space of Tri4Di6 imine-based cages formed from a [4 + 6] condensation of a tritopic aldehyde with a ditopic amine.38 The precursor library was based on the eMolecules48 and Reaxys47,68 databases, thus containing literature-reported compounds of varying levels of synthetic difficulty. For the MPScore, we interpreted the probability value obtained from the RF model as a continuous score. Molecules with a higher probability of belonging to the easy-to-synthesize class were defined as being less synthetically difficult to make. We assigned an overall synthetic difficulty value to each pair of precursors that could be used to form a POC (diamine and trialdehyde). The aggregated score was defined as the sum of the synthetic difficulty scores of each precursor in the pair, scaled between 0 and 1 to allow for easier comparison between scores. We calculated the aggregated scores for each pair using the SAScore, SCScore, and MPScore, which can be seen in Figure 6. Overall, Figure 6 shows a similar distribution of precursor synthetic difficulties for the SCScore and MPScore, whereas the SAScore consistently assigns a greater synthetic difficulty.

Figure 6.

Distributions of the synthetic difficulty values for each POC precursor combination. Synthetic difficulty values are calculated using the SAScore (blue), SCScore (orange), and our MPScore (red). Vertical lines represent the first percentile for each score, and the respective cutoff thresholds are given by the value above each vertical line, colored according to score. A precursor pair is defined as easy-to-synthesize if its synthetic difficulty score is lower than the threshold value for that score, indicated by the vertical dashed lines. A lower score for a precursor combination indicates both precursors in the combination are easier-to-synthesize, whereas a higher score indicates they are harder-to-synthesize. Precursors whose combined synthetic difficulty scores were below these threshold values (to the left of the vertical dashed lines) were carried forward for POC construction and optimization.

The relatively broad distributions across all three scores imply that, when considering precursors for experimental materials synthesis, even restricting the initial database to the literature-reported Reaxys68 database is insufficient. The synthesis of these molecules could require multiple difficult and low yielding reaction steps, toxic or expensive reagents, or challenging purification procedures; none of these points are considered if we simplify the existence of a molecule in the Reaxys68 database to meaning it is “synthesizable”. Even if those molecules are accessible via many-step syntheses, they are, most likely, unsuitable for materials discovery, unless one was extremely certain that a specific precursor was the only molecule that could confer a desired functionality that was of high value. Certainly, in the case of POC materials discovery programs, we are not anticipating such scenarios, and instead high quantities of readily accessible precursors are required, meaning that cheap reagents and robust reactions are essential to create them.

For our screening workflow, we chose cutoff thresholds for each of the synthetic difficulty scoring methods that would remove 99% of all precursor combinations, leaving only the 1% of precursor combinations with the lowest synthetic difficulty values. By using this approach, we hoped to identify any novel and previously undiscovered POCs with promising properties that could be made from only the easiest-to-synthesize precursors. As precursor synthesis is much more cost- and time-intensive than our POC screening workflow, 1% was chosen as a cutoff to maximize the chance that the precursors could be synthesized or are commercially available. The final cutoff thresholds were as follows: 0.21 for the MPScore, 0.48 for the SAScore, and 0.18 for the SCScore. Precursor combinations with synthetic difficulty scores greater than these values were defined as too difficult-to-synthesize and disregarded. Combinations with scores less than or equal to these values were carried forward for POC construction and further computational analysis.

Model Validation by Expert Chemists

Next, we evaluated the developed MPScore against the SAScore and SCScore methods as precursor filters, comparing molecules selected as “easy-to-synthesize” and “difficult-to-synthesize” by the different methods against the assessments of an experienced materials chemist. Author R.L.G. was blindly presented with the 30 diamines and 30 trialdehydes with the lowest and highest synthetic difficulty values (calculated using the SAScore, the SCScore, and the MPScore) from the POC precursor database and asked the same question as previously: “can you make 1 g of this compound in under 5 steps?”. The molecules and their calculated synthetic difficulties can be found in Figure S5 and Table S7. We present the true positives, true negatives, false negatives, and false positives for each model in Table 5.

Table 5. Performance Metrics for the Blind Validation Task Used to Assess the Performance of the SAScore, SCScore, and MPScore on Predicting Easy-to-Synthesize POC Precursors.a
  trialdehydes
  true positives true negatives false positives false negatives
SAScore 7 9 3 1
SCScore 6 8 4 2
MPScore 9 10 1 0
  diamines
  true positives true negatives false positives false negatives
SAScore 3 10 7 0
SCScore 9 10 1 0
MPScore 10 10 0 0
a True positive outcomes are defined as molecules with the lowest calculated synthetic difficulty scores that were also classified as easy-to-synthesize by R.L.G. For our MPScore model, we aim to minimize false positives, which are molecules classified as difficult-to-synthesize by R.L.G. but assigned a low synthetic difficulty score by a model, implying they are easy-to-synthesize. Values in bold correspond to the score that produced the fewest false positives.

Table 5 shows our MPScore was able to achieve the lowest number of false positives for both the aldehyde and amine ranking task (1 false positive and 0 false positives). Generally, models were better at identifying difficult-to-synthesize molecules in agreement with R.L.G., as shown by the higher numbers of true negatives than true positives. While desirable, this is not a requirement of a synthetic difficulty model in our POC screening workflow, as scores are used to identify easiest-to-synthesize precursors, as opposed to identifying the hardest-to-synthesize. For our requirement of selecting precursors in agreement with expert chemists, this shows that the MPScore is the most effective, given by the lowest cumulative number of false positives (1 false positive for both amines and aldehydes). However, we do note the sample size of 10 amines and 10 aldehydes for each model is too small to draw any concrete conclusions from.

Structural Analysis of Computationally Screened POCs

Following the analysis of the MPScore, we assessed the results of the computational screening workflow, when using the three different scoring methods to remove nonsynthesizable precursors. After only using the 1% of precursors scored as most likely to be synthesizable from each of the methods, we automated cage construction and conformer searching using stk and MacroModel and then analyzed the windows and central cavity of the lowest energy conformation using pyWindow.43,61,62 We then analyzed the property distributions of the cages, with a focus on those that are shape-persistent (as defined by eq 6). In other words, we sought molecules where the lowest energy conformations had a cavity large enough to host, at the very least, a hypothetical spherical probe with a diameter of 1 Å. Finally, we compared the cavity distributions of shape-persistent POCs from precursors filtered for synthetic difficulty with a control sample of randomly selected precursor combinations from the database, also 1% of all combinations.

In total, 28,185 precursor combinations were selected: 7,046 from the MPScore, 7,047 from the SAScore, 7,047 from the SCScore, and 7,045 from a control sample of randomly selected precursor combinations. A control sample was used to investigate whether the synthetic difficulty scores had any influence on the properties of the cages the precursor combinations formed. Duplicate precursor removal, as described in the previous section, resulted in a total of 26,830 required optimizations. In total, there were 1,062 precursor combinations that overlapped between the sets selected by the three synthetic difficulty scores and the control sample, the full breakdown of which can be seen in Table S8. The SCScore and MPScore had the most overlap in the number of selected precursors, at 621. The SAScore had fewer overlaps, with 205 precursors in common with the SCScore and 34 with the MPScore. Table 6 shows the number of POCs optimized for each synthetic difficulty score, in addition to the percentages that remained shape-persistent.

Table 6. Number and Percentage of Shape-Persistent Cages Formed from Precursors Selected by Each Synthetic Difficulty Model in the First Three Columns, in a Control Sample of Randomly Selected Precursors, and in Our Previous Work in the Final Column.34,a
  MPScore SCScore SAScore control previous work34
cages 6,944 6,877 7,015 7,045 6,018
shape persistent 709 (10%) 287 (4%) 76 (1%) 90 (1%) 2,314 (38%)
a The absolute numbers of cages and the percentages are shown.

In our previous work, out of 6,018 Tri4Di6 POCs constructed from trialdehyde and diamine precursors designed by R.L.G. to exhibit rigid precursor cores, 2,314 (38%) were shape-persistent following geometry optimization (Table 6).34 As expected, a far lower proportion remained shape-persistent in the present workflow, which applies a very strict consideration of the synthetic difficulty of the precursors but no manual assessment of their propensity to form a shape-persistent cage. As shown in Table 6, the proportions of cages that remained shape-persistent were 10% for the MPScore, 4% for the SCScore, 1% for the SAScore, and 1% for the control sample. The percentage of shape-persistent cages from precursor combinations selected by the MPScore was therefore 9 percentage points higher than that of the control sample of randomly selected precursors, whereas the SCScore and SAScore were much closer to the control, with differences of 3 and 0 percentage points, respectively. This large difference suggests that easy-to-synthesize precursors selected by the MPScore are more likely to form a shape-persistent POC than precursors selected by the other scoring methods. While our only aim for the MPScore was to filter for easy-to-synthesize POC precursors, this secondary effect of biasing toward precursors that are more likely to form a shape-persistent POC is nonetheless useful for POC chemists during precursor selection.

Figure 7 shows the distribution of cavity diameters for shape-persistent POCs screened using the three different synthetic difficulty models and the control sample. Interestingly, precursors selected by the MPScore formed the POCs with the largest cavities, with a mean diameter of 8.4 Å, followed by the control sample (6.5 Å), the SCScore (5.5 Å), and finally the SAScore (4.7 Å). In our previous work, we found that out of 116 cages reported in the literature, cavity sizes of 0–6 Å are the most prevalent, with a general absence of cages with diameters larger than 16 Å.32 Large cavities are rarely exhibited by POCs because larger cages are prone to collapse upon desolvation or to catenate with other cages, making POCs with large cavity sizes an interesting target for a computational screening workflow. Precursors that form POCs with larger cavities typically contain more atoms and bonds, resulting in more degrees of freedom, more competing pathways during cage formation, and greater flexibility, and thus more facile collapse mechanisms.

Figure 7. Distribution of cavity diameters for POCs predicted to be shape-persistent from the screening. Only POCs with a cavity diameter greater than 1 Å are included. The precursors of the POCs were those with the lowest 1% of synthetic difficulty scores calculated using the SAScore (blue), SCScore (orange), and MPScore (red), together with a control sample of randomly selected precursors (green).

From this analysis, we conclude that it is possible to discover shape-persistent POCs from precursors predicted to be easy-to-synthesize, without designing those precursors with structural features that make a shape-persistent POC more likely. The MPScore identified the largest number of shape-persistent POCs (709 cages) of the synthetic difficulty scores compared. Using this approach, we discovered a total of 1,072 shape-persistent cages that could be formed from easy-to-synthesize precursors, a significant number in contrast to the hundreds of POCs that have been reported to date. Additionally, while the vast majority of the shape-persistent POCs have a pore diameter of around 5 Å, the workflow is also able to identify a smaller number of cages with larger pore diameters. This is a promising result, as it shows that precursors capable of forming larger POCs can be readily accessed.

Analysis of the Identified Promising Large Cavity Diameter POCs

From the computational screening workflow, seven unique shape-persistent cages had a cavity size of 16 Å or greater: five were selected using the MPScore, two with the SCScore, and none with the SAScore. Figure 8 shows the six largest synthetically accessible cages alongside their corresponding precursor pairs, and Table 7 summarizes the properties of these cages. All six cages (labeled 1–6 according to their size, with 1 having the largest cavity) in Figure 8 are predicted to have a cavity size greater than 16 Å, and four came from the screening using our new MPScore method. Using solely the SCScore for prescreening, only cages 2 and 3 would have been included; using only the SAScore, all of these cages would have been missed. We inspected the six precursor pairings to assess whether they are indeed likely to be synthetically accessible. To assess whether a precursor is commercially available, we used the stock supplier feature of the ZINC database, one of the largest catalogues of commercially available compounds.69,70 The diamines in cages 1, 3, 5, and 6 are commercially available, with more than five suppliers listed in the ZINC database, while the diamine in cage 4 can be synthesized in two steps.71 The diamine in 2 is more challenging to access in its enantiopure form, with no commercial suppliers or straightforward synthetic routes.

Figure 8. Six largest shape-persistent POCs identified from the screening workflow alongside their respective precursor pairings. The cages are labeled 1–6 in descending order of cavity diameter. Calculated cavity diameters are labeled below each POC. The corresponding features of the POCs are in Table 7. All but cages 2 and 3 were identified with our MPScore method for measuring synthetic difficulty; cages 2 and 3 were identified by the SCScore. Carbon atoms are shown in gray, nitrogen atoms in blue, and oxygen atoms in red. Hydrogen atoms are omitted for clarity.

Table 7. Calculated Cavity Diameter and Precursor Synthetic Difficulty Scores (from Each of the Three Scores) for the Six Largest Shape-Persistent POCs Identified from Any of the Methods.a
    precursor synthetic difficulty
no. cavity diameter/Å SCScore SAScore MPScore
1 18.8 0.23 0.73 0.17
2 18.0 0.17 0.72 0.48
3 17.7 0.15 0.53 0.29
4 17.3 0.29 0.73 0.19
5 16.3 0.26 0.73 0.20
6 16.2 0.36 0.68 0.10
a A low score indicates a precursor is easy-to-synthesize, whereas a score nearer to one indicates the synthesis is predicted to be challenging or even impossible. The synthetic difficulty scores in bold show that each precursor pair belongs in the lowest 1% of values for each scoring method.

The 1,3,5-tris(phenylenevinylene)benzene precursor used in cages 1–5 was present in five of the seven shape-persistent cages with a cavity diameter greater than 16 Å, indicating that it could be a promising precursor when designing cages with a large cavity. The relative ease with which such trialdehydes can be accessed and their propensity to form cages with large cavities make them desirable precursors for targeting large-cavity cages. Indeed, the 1,072 cages predicted to be shape-persistent far exceed the number that have been experimentally realized to date. Despite the poor recall of the MPScore potentially resulting in a large number of false negative precursor combinations, the score is still able to bias toward easy-to-synthesize precursors that form shape-persistent POCs. Promisingly, this also indicates that cages with a cavity diameter of over 16 Å can be accessed using easy-to-synthesize precursors.

In this workflow, we only considered the ease with which cage precursors can be synthesized; however, there are a number of other challenges one must overcome when designing a POC. While we only screened for Tri4Di6 topology cages, the thermodynamic driving force may steer the DCC reaction toward a diverse range of other topologies, or even toward polymers or oligomers. We also did not account for precursor solubility or the formation of insoluble products, which are also significant challenges in POC synthesis.40,41 Accounting for all of these factors computationally is extremely challenging (and expensive), which is why the chemical knowledge of human experts remains irreplaceable in the field of POC development.

Conclusions

Our primary goal in this work was to develop a computational screening workflow that eases the transition between computational material prediction and experimental realization, using POCs as an exemplar material. To achieve this, we needed an automated way to consider the ease with which the organic molecules that are the precursors to POCs can be synthesized. We compared existing methods of calculating synthetic difficulty computationally, showing that current methods cannot replicate the decisions of materials chemists when selecting easy-to-synthesize precursors for materials synthesis. As existing synthetic difficulty scores were unsuitable for our intended purpose of automated POC discovery, we created our own machine learning model to predict the ease of synthesis, the Materials Precursor Score (MPScore). We collected 12,553 labeled data points from three materials chemists working on POCs and more general organic synthesis, with the goal of developing a model capable of reproducing the decision-making process of chemists when classifying POC precursors as easy-to-synthesize or difficult-to-synthesize. The data were collected based on the answer to the following question: can you make 1 g of this compound in under 5 steps?

We applied our MPScore to the task of identifying easy-to-synthesize precursors for a computational screening workflow aimed at identifying organic cages with permanent, shape-persistent cavities. We found that our MPScore performed slightly better than existing methods at scoring the synthetic accessibility of organic molecules and was more likely to bias toward precursors that form shape-persistent POCs. We showed that even when limiting precursors to only those considered easiest-to-synthesize, we could still discover POCs with unconventional properties, including cavity sizes greater than 16 Å. In total, we predicted 1,072 shape-persistent cages, seven of which have a cavity size greater than 16 Å, a property scarcely found in the literature. For the six largest cages discovered using our autonomous workflow, we confirmed that the precursors are either commercially available or have straightforward synthesis routes reported in the literature.

We provide the database of shape-persistent cages, accessible at doi.org/10.14469/hpc/8395, in the hope that some of the computational predictions will be experimentally validated in the future. Our MPScore, training data, and ranking website are also open-source; they can be applied to the ranking of other potential POC precursors, expanded with additional training data to generalize to other material precursors, or used for other classification tasks.

Code used to train and validate our MPScore is available at http://doi.org/10.5281/zenodo.4647049, and our ranking website is available for open-source use at http://doi.org/10.5281/zenodo.4961161. We hope the MPScore will also be of value for considering the ease of synthesis of organic precursors in the wider field of molecular materials, although the model has not yet been tested beyond POC precursors; such validation is the subject of our ongoing collaboration with experimental chemists. In the future, we hope the MPScore can be used to help guide both the experimental and computational design of new functional materials. Chemists’ intuition remains a crucial part of any materials discovery program, and we believe automating it can accelerate many computational discovery workflows. We would consider using a similar binary classification criterion in future work, as a model that allows for quick prescreening of synthesizable molecules would make a useful filter in many computational screening procedures, and we would also consider using chemist scoring to supplement data-driven techniques. Chemists’ intuition already plays a key role in the computer-aided retrosynthesis planner SYNTHIA,72 and we believe it has the potential to overcome several limitations of data-driven techniques, such as limited data set sizes or a lack of negative training examples. We believe this workflow is a small step toward the more autonomous discovery of new porous molecular materials.

Data and Software Availability. All classification data used to train the random forest model are available on Zenodo at http://doi.org/10.5281/zenodo.4647049. The MPScore was trained using the open-source scikit-learn Python package (version 0.24.1), and the workflow makes extensive use of RDKit (version 2020.09.4). The open-source Python package stk (version 18.12.2019) was used to optimize cages, and pyWindow (version 0.04) was used to calculate cavity diameters. Synthetic difficulty scores were calculated using the SAScore code from https://github.com/rdkit/rdkit and the SCScore code from https://github.com/connorcoley/scscore. A file to replicate the Anaconda environment is available in the code repository so that packages can be installed directly. The commercially available MacroModel software (version 2018-1) by Schrödinger was used to perform molecular dynamics. All optimized cages are available at doi.org/10.14469/hpc/8395.
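To illustrate the general approach described above (not the released MPScore code itself), the sketch below trains a scikit-learn random forest on Morgan fingerprints computed with RDKit and reads a continuous score from the predicted class probability. The SMILES, labels, fingerprint parameters, and the use of class_weight="balanced" to handle class imbalance are illustrative assumptions rather than the settings used in this work.

    import numpy as np
    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem
    from sklearn.ensemble import RandomForestClassifier

    def fingerprint(smiles, radius=2, n_bits=1024):
        """Morgan (ECFP-like) bit-vector fingerprint as a NumPy array; parameters are illustrative."""
        mol = Chem.MolFromSmiles(smiles)
        bits = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.int8)
        DataStructs.ConvertToNumpyArray(bits, arr)
        return arr

    # Hypothetical training data: 0 = easy-to-synthesize, 1 = difficult-to-synthesize.
    smiles = ["NCCN", "O=Cc1cc(C=O)cc(C=O)c1", "CC(C)(C)c1ccc2occ(CC(=O)Nc3ccccc3F)c2c1"]
    labels = [0, 0, 1]

    X = np.array([fingerprint(s) for s in smiles])
    # class_weight="balanced" is one simple way to mitigate class imbalance; the published
    # model's exact imbalance-handling strategy and hyperparameters may differ.
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
    clf.fit(X, labels)

    # A continuous difficulty score can be read from the predicted probability of the "difficult" class.
    score = clf.predict_proba(fingerprint("NCCN").reshape(1, -1))[0, 1]
    print(f"predicted synthetic difficulty score: {score:.2f}")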

Acknowledgments

R.L.G. and K.E.J. thank the Royal Society for Royal Society University Research Fellowships. K.E.J. and F.T.S. thank the Leverhulme Trust for a Leverhulme Trust Research Project Grant. S.B. thanks the Leverhulme Research Centre for Functional Materials Design for a Ph.D. studentship. We acknowledge funding from the European Research Council under FP7 (CoMMaD, ERC Grant No. 758370).

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.1c00375.

  • Training set analysis; analysis of labeled chemist data; data for MPScore validation; and further information from cage screening workflow (PDF)

The authors declare no competing financial interest.

Supplementary Material

ci1c00375_si_001.pdf (583KB, pdf)

References

  1. Pyzer-Knapp E. O.; Suh C.; Gómez-Bombarelli R.; Aguilera-Iparraguirre J.; Aspuru-Guzik A. What Is High-Throughput Virtual Screening? A Perspective from Organic Materials Discovery. Annu. Rev. Mater. Res. 2015, 45, 195–216. 10.1146/annurev-matsci-070214-020823. [DOI] [Google Scholar]
  2. Hachmann J.; Olivares-Amaya R.; Atahan-Evrenk S.; Amador-Bedolla C.; Sánchez-Carrera R. S.; Gold-Parker A.; Vogt L.; Brockway A. M.; Aspuru-Guzik A. The Harvard Clean Energy Project: Large-Scale Computational Screening and Design of Organic Photovoltaics on the World Community Grid. J. Phys. Chem. Lett. 2011, 2, 2241–2251. 10.1021/jz200866s. [DOI] [Google Scholar]
  3. Oganov A. R.; Saleh G.; Kvashnin A. G.. Computational Materials Discovery: Dream or Reality?. Computational Materials Discovery; Royal Society of Chemistry: 2018; p 1, 10.1039/9781788010122-00001. [DOI] [Google Scholar]
  4. Szczypiński F. T.; Bennett S.; Jelfs K. E. Can we predict materials that can be synthesised?. Chem. Sci. 2021, 12, 830–840. 10.1039/D0SC04321D. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bennett S.; Tarzia A.; Zwijnenburg M. A.; Jelfs K. E. In Artificial Intelligence Applied to the Prediction of Organic Materials. Machine Learning in Chemistry: The Impact of Artificial Intelligence; Cartwright H. M., Ed.; Royal Society of Chemistry: 2020; Vol. 17, p 280, 10.1039/9781839160233-00280. [DOI] [Google Scholar]
  6. Struble T. J.; Alvarez J. C.; Brown S. P.; Chytil M.; Cisar J.; DesJarlais R. L.; Engkvist O.; Frank S. A.; Greve D. R.; Griffin D. J.; Hou X.; Johannes J. W.; Kreatsoulas C.; Lahue B.; Mathea M.; Mogk G.; Nicolaou C. A.; Palmer A. D.; Price D. J.; Robinson R. I.; Salentin S.; Xing L.; Jaakkola T.; Green W. H.; Barzilay R.; Coley C. W.; Jensen K. F. Current and Future Roles of Artificial Intelligence in Medicinal Chemistry Synthesis. J. Med. Chem. 2020, 63, 8667–8682. 10.1021/acs.jmedchem.9b02120. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Barone R.; Chanon M. A new and simple approach to chemical complexity. Application to the synthesis of natural products. J. Chem. Inf. Comput. Sci. 2001, 41, 269–272. 10.1021/ci000145p. [DOI] [PubMed] [Google Scholar]
  8. Ertl P.; Schuffenhauer A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Segler M. H. S.; Preuss M.; Waller M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 2018, 555, 604–610. 10.1038/nature25978. [DOI] [PubMed] [Google Scholar]
  10. Takaoka Y.; Endo Y.; Yamanobe S.; Kakinuma H.; Okubo T.; Shimazaki Y.; Ota T.; Sumiya S.; Yoshikawa K. Development of a method for evaluating drug-likeness and ease of synthesis using a data set in which compounds are assigned scores based on chemists’ intuition. J. Chem. Inf. Comput. Sci. 2003, 43, 1269–1275. 10.1021/ci034043l. [DOI] [PubMed] [Google Scholar]
  11. Coley C. W.; Rogers L.; Green W. H.; Jensen K. F. SCScore: Synthetic Complexity Learned from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261. 10.1021/acs.jcim.7b00622. [DOI] [PubMed] [Google Scholar]
  12. Voršilák M.; Kolář M.; Čmelo I.; Svozil D. SYBA: Bayesian estimation of synthetic accessibility of organic compounds. J. Cheminf. 2020, 12, 35. 10.1186/s13321-020-00439-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 60, 5714–5723. 10.1021/acs.jcim.0c00174. [DOI] [PubMed] [Google Scholar]
  14. Genheden S.; Thakkar A.; Chadimová V.; Reymond J.-L.; Engkvist O.; Bjerrum E. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J. Cheminf. 2020, 12, 70. 10.1186/s13321-020-00472-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Lajiness M. S.; Maggiora G. M.; Shanmugasundaram V. Assessment of the consistency of medicinal chemists in reviewing sets of compounds. J. Med. Chem. 2004, 47, 4891–4896. 10.1021/jm049740z. [DOI] [PubMed] [Google Scholar]
  16. Slater A. G.; Cooper A. I. Porous materials. Function-led design of new porous materials. Science 2015, 348, aaa8075. 10.1126/science.aaa8075. [DOI] [PubMed] [Google Scholar]
  17. Morris R. E.; Wheatley P. S. Gas storage in nanoporous materials. Angew. Chem., Int. Ed. 2008, 47, 4966–4981. 10.1002/anie.200703934. [DOI] [PubMed] [Google Scholar]
  18. Li J.-R.; Kuppler R. J.; Zhou H.-C. Selective gas adsorption and separation in metal-organic frameworks. Chem. Soc. Rev. 2009, 38, 1477–1504. 10.1039/b802426j. [DOI] [PubMed] [Google Scholar]
  19. Rowland C. A.; Lorzing G. R.; Gosselin A. J.; Trump B. A.; Yap G. P. A.; Brown C. M.; Bloch E. D. Methane Storage in Paddlewheel-Based Porous Coordination Cages. J. Am. Chem. Soc. 2018, 140, 11153–11157. 10.1021/jacs.8b05780. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Zhang J.; Chen J.; Peng S.; Peng S.; Zhang Z.; Tong Y.; Miller P. W.; Yan X.-P. Emerging porous materials in confined spaces: from chromatographic applications to flow chemistry. Chem. Soc. Rev. 2019, 48, 2566–2595. 10.1039/C8CS00657A. [DOI] [PubMed] [Google Scholar]
  21. Kewley A.; Stephenson A.; Chen L.; Briggs M. E.; Hasell T.; Cooper A. I. Porous Organic Cages for Gas Chromatography Separations. Chem. Mater. 2015, 27, 3207–3210. 10.1021/acs.chemmater.5b01112. [DOI] [Google Scholar]
  22. Ma L.; Abney C.; Lin W. Enantioselective catalysis with homochiral metal-organic frameworks. Chem. Soc. Rev. 2009, 38, 1248–1256. 10.1039/b807083k. [DOI] [PubMed] [Google Scholar]
  23. Sun N.; Wang C.; Wang H.; Yang L.; Jin P.; Zhang W.; Jiang J. Multifunctional tubular organic cage-supported ultrafine palladium nanoparticles for sequential catalysis. Angew. Chem., Int. Ed. 2019, 58, 18011–18016. 10.1002/anie.201908703. [DOI] [PubMed] [Google Scholar]
  24. Brutschy M.; Schneider M. W.; Mastalerz M.; Waldvogel S. R. Porous organic cage compounds as highly potent affinity materials for sensing by quartz crystal microbalances. Adv. Mater. 2012, 24, 6049–6052. 10.1002/adma.201202786. [DOI] [PubMed] [Google Scholar]
  25. Wales D. J.; Grand J.; Ting V. P.; Burke R. D.; Edler K. J.; Bowen C. R.; Mintova S.; Burrows A. D. Gas sensing using porous materials for automotive applications. Chem. Soc. Rev. 2015, 44, 4290–4321. 10.1039/C5CS00040H. [DOI] [PubMed] [Google Scholar]
  26. Giri N.; Del Pópolo M. G.; Melaugh G.; Greenaway R. L.; Rätzke K.; Koschine T.; Pison L.; Gomes M. F. C.; Cooper A. I.; James S. L. Liquids with permanent porosity. Nature 2015, 527, 216–220. 10.1038/nature16072. [DOI] [PubMed] [Google Scholar]
  27. Melaugh G.; Giri N.; Davidson C. E.; James S. L.; Del Pópolo M. G. Designing and understanding permanent microporosity in liquids. Phys. Chem. Chem. Phys. 2014, 16, 9422–9431. 10.1039/C4CP00582A. [DOI] [PubMed] [Google Scholar]
  28. Zhang J.-H.; Xie S.-M.; Zi M.; Yuan L.-M. Recent advances of application of porous molecular cages for enantioselective recognition and separation. J. Sep. Sci. 2020, 43, 134–149. 10.1002/jssc.201900762. [DOI] [PubMed] [Google Scholar]
  29. Little M. A.; Cooper A. I. The chemistry of porous organic molecular materials. Adv. Funct. Mater. 2020, 30, 1909842. 10.1002/adfm.201909842. [DOI] [Google Scholar]
  30. Greenaway R. L.; Santolini V.; Bennison M. J.; Alston B. M.; Pugh C. J.; Little M. A.; Miklitz M.; Eden-Rump E. G. B.; Clowes R.; Shakil A.; Cuthbertson H. J.; Armstrong H.; Briggs M. E.; Jelfs K. E.; Cooper A. I. High-throughput discovery of organic cages and catenanes using computational screening fused with robotic synthesis. Nat. Commun. 2018, 9, 2849. 10.1038/s41467-018-05271-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Moghadam P. Z.; Li A.; Wiggin S. B.; Tao A.; Maloney A. G. P.; Wood P. A.; Ward S. C.; Fairen-Jimenez D. Development of a Cambridge Structural Database Subset: A Collection of Metal–Organic Frameworks for Past, Present, and Future. Chem. Mater. 2017, 29, 2618–2625. 10.1021/acs.chemmater.7b00441. [DOI] [Google Scholar]
  32. Berardo E.; Turcani L.; Miklitz M.; Jelfs K. E. An evolutionary algorithm for the discovery of porous organic cages. Chem. Sci. 2018, 9, 8513–8527. 10.1039/C8SC03560A. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Miklitz M.; Turcani L.; Greenaway R. L.; Jelfs K. E. Computational discovery of molecular C60 encapsulants with an evolutionary algorithm. Communications Chemistry 2020, 3, 10. 10.1038/s42004-020-0255-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  34. Turcani L.; Greenaway R. L.; Jelfs K. E. Machine Learning for Organic Cage Property Prediction. Chem. Mater. 2019, 31, 714–727. 10.1021/acs.chemmater.8b03572. [DOI] [Google Scholar]
  35. Slater A. G.; Little M. A.; Briggs M. E.; Jelfs K. E.; Cooper A. I. A solution-processable dissymmetric porous organic cage. Mol. Syst. Des. Eng. 2018, 3, 223–227. 10.1039/C7ME00090A. [DOI] [Google Scholar]
  36. Lei Y.; Chen Q.; Liu P.; Wang L.; Wang H.; Li B.; Lu X.; Chen Z.; Pan Y.; Huang F.; Li H. Molecular cages self-assembled by imine condensation in water. Angew. Chem., Int. Ed. 2021, 60, 4705–4711. 10.1002/anie.202013045. [DOI] [PubMed] [Google Scholar]
  37. Kulchat S.; Chaur M. N.; Lehn J.-M. Kinetic Selectivity and Thermodynamic Features of Competitive Imine Formation in Dynamic Covalent Chemistry. Chem. - Eur. J. 2017, 23, 11108–11118. 10.1002/chem.201702088. [DOI] [PubMed] [Google Scholar]
  38. Santolini V.; Miklitz M.; Berardo E.; Jelfs K. E. Topological landscapes of porous organic cages. Nanoscale 2017, 9, 5280–5298. 10.1039/C7NR00703E. [DOI] [PubMed] [Google Scholar]
  39. Acharyya K.; Mukherjee P. S. Organic Imine Cages: Molecular Marriage and Applications. Angew. Chem., Int. Ed. 2019, 58, 8640–8653. 10.1002/anie.201900163. [DOI] [PubMed] [Google Scholar]
  40. Briggs M. E.; Cooper A. I. A Perspective on the Synthesis, Purification, and Characterization of Porous Organic Cages. Chem. Mater. 2017, 29, 149–157. 10.1021/acs.chemmater.6b02903. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Lauer J. C.; Zhang W.-S.; Rominger F.; Schröder R. R.; Mastalerz M. Shape-Persistent [4.4] Imine Cages with a Truncated Tetrahedral Geometry. Chem. - Eur. J. 2018, 24, 1816–1820. 10.1002/chem.201705713. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Abet V.; Szczypiński F. T.; Little M. A.; Santolini V.; Jones C. D.; Evans R.; Wilson C.; Wu X.; Thorne M. F.; Bennison M. J.; Cui P.; Cooper A. I.; Jelfs K. E.; Slater A. G. Inducing Social Self-Sorting in Organic Cages To Tune The Shape of The Internal Cavity. Angew. Chem., Int. Ed. 2020, 59, 16755–16763. 10.1002/anie.202007571. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Turcani L.; Berardo E.; Jelfs K. E. stk: A python toolkit for supramolecular assembly. J. Comput. Chem. 2018, 39, 1931–1942. 10.1002/jcc.25377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Berardo E.; Greenaway R. L.; Turcani L.; Alston B. M.; Bennison M. J.; Miklitz M.; Clowes R.; Briggs M. E.; Cooper A. I.; Jelfs K. E. Computationally-inspired discovery of an unsymmetrical porous organic cage. Nanoscale 2018, 10, 22381–22388. 10.1039/C8NR06868B. [DOI] [PubMed] [Google Scholar]
  45. Greenaway R. L.; Santolini V.; Pulido A.; Little M. A.; Alston B. M.; Briggs M. E.; Day G. M.; Cooper A. I.; Jelfs K. E. From Concept to Crystals via Prediction: Multi-Component Organic Cage Pots by Social Self-Sorting. Angew. Chem., Int. Ed. 2019, 58, 16275–16281. 10.1002/anie.201909237. [DOI] [PubMed] [Google Scholar]
  46. Greenaway R. L.; Jelfs K. E. Integrating Computational and Experimental Workflows for Accelerated Organic Materials Discovery. Adv. Mater. 2021, 33, 2004831. 10.1002/adma.202004831. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Reaxys database. http://reaxys.com (accessed 2019-02-01).
  48. eMolecules. https://www.emolecules.com/ (accessed 2020-01-08).
  49. Bennett S.Materials Precursor Score. 2021. http://doi.org/10.5281/zenodo.4647049 (accessed 2021-07-30).
  50. Svetnik V.; Liaw A.; Tong C.; Culberson J. C.; Sheridan R. P.; Feuston B. P. Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43, 1947–1958. 10.1021/ci034160g. [DOI] [PubMed] [Google Scholar]
  51. Sheridan R. P. Using random forest to model the domain applicability of another random forest model. J. Chem. Inf. Model. 2013, 53, 2837–2850. 10.1021/ci400482e. [DOI] [PubMed] [Google Scholar]
  52. Pedregosa F.; Varoquaux G.; Gramfort A.; Michel V.; Thirion B.; Grisel O.; Blondel M.; Prettenhofer P.; Weiss R.; Dubourg V.; Vanderplas J.; Passos A.; Cournapeau D.; Brucher M.; Perrot M.; Duchesnay E. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  53. Chen C.; Liaw A.; Breiman L.. et al. Using random forest to learn imbalanced data; University of California: Berkeley, 2004; Vol. 110, p 24.
  54. Rogers D.; Hahn M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
  55. Rogers D. RDKit. https://www.rdkit.org/ (accessed 2021-02-08).
  56. Riniker S.; Landrum G. A. Open-source platform to benchmark fingerprints for ligand-based virtual screening. J. Cheminf. 2013, 5, 26. 10.1186/1758-2946-5-26. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. O’Boyle N. M.; Sayle R. A. Comparing structural fingerprints using a literature-based similarity benchmark. J. Cheminf. 2016, 8, 36. 10.1186/s13321-016-0148-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Niculescu-Mizil A.; Caruana R.. Predicting good probabilities with supervised learning. Proceedings of the 22nd international conference on Machine learning; New York, NY, USA, 2005; pp 625–632, 10.1145/1102351.1102430. [DOI]
  59. Ertl P. An algorithm to identify functional groups in organic molecules. J. Cheminf. 2017, 9, 36. 10.1186/s13321-017-0225-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Harder E.; Damm W.; Maple J.; Wu C.; Reboul M.; Xiang J. Y.; Wang L.; Lupyan D.; Dahlgren M. K.; Knight J. L.; Kaus J. W.; Cerutti D. S.; Krilov G.; Jorgensen W. L.; Abel R.; Friesner R. A. OPLS3: A Force Field Providing Broad Coverage of Drug-like Small Molecules and Proteins. J. Chem. Theory Comput. 2016, 12, 281–296. 10.1021/acs.jctc.5b00864. [DOI] [PubMed] [Google Scholar]
  61. Schrödinger Release 2018-4: MacroModel; Schrödinger, LLC: New York, NY, 2020.
  62. Miklitz M.; Jelfs K. E. pywindow: Automated Structural Analysis of Molecular Pores. J. Chem. Inf. Model. 2018, 58, 2387–2391. 10.1021/acs.jcim.8b00490. [DOI] [PMC free article] [PubMed] [Google Scholar]
  63. Bonnet P. Is chemical synthetic accessibility computationally predictable for drug and lead-like molecules? A comparative assessment between medicinal and computational chemists. Eur. J. Med. Chem. 2012, 54, 679–689. 10.1016/j.ejmech.2012.06.024. [DOI] [PubMed] [Google Scholar]
  64. Baba Y.; Isomura T.; Kashima H. Wisdom of crowds for synthetic accessibility evaluation. J. Mol. Graphics Modell. 2018, 80, 217–223. 10.1016/j.jmgm.2018.01.011. [DOI] [PubMed] [Google Scholar]
  65. Raccuglia P.; Elbert K. C.; Adler P. D. F.; Falk C.; Wenny M. B.; Mollo A.; Zeller M.; Friedler S. A.; Schrier J.; Norquist A. J. Machine-learning-assisted materials discovery using failed experiments. Nature 2016, 533, 73–76. 10.1038/nature17439. [DOI] [PubMed] [Google Scholar]
  66. Saito T.; Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One 2015, 10, e0118432. 10.1371/journal.pone.0118432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Nicolaou C. A.; Watson I. A.; Hu H.; Wang J. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. J. Chem. Inf. Model. 2016, 56, 1253–1266. 10.1021/acs.jcim.6b00173. [DOI] [PubMed] [Google Scholar]
  68. Lawson A. J.; Swienty-Busch J.; Géoui T.; Evans D.. The Making of Reaxys–Towards Unobstructed Access to Relevant Chemistry Information. In The Future of the History of Chemical Information; McEwen L. R., Buntrock R. E., Eds.; ACS Symposium Series; American Chemical Society: Washington, DC, 2014; Vol. 1164, pp 127–148, 10.1021/bk-2014-1164.ch008. [DOI] [Google Scholar]
  69. Sterling T.; Irwin J. J. ZINC 15-Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. 10.1021/acs.jcim.5b00559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. ZINC. zinc.docking.org (accessed 2021-01-20).
  71. Wu C.; Decker E. R.; Blok N.; Bui H.; Chen Q.; Raju B.; Bourgoyne A. R.; Knowles V.; Biediger R. J.; Market R. V.; Lin S.; Dupré B.; Kogan T. P.; Holland G. W.; Brock T. A.; Dixon R. A. Endothelin antagonists: substituted mesitylcarboxamides with high potency and selectivity for ET(A) receptors. J. Med. Chem. 1999, 42, 4485–4499. 10.1021/jm9900063. [DOI] [PubMed] [Google Scholar]
  72. Szymkuć S.; Gajewska E. P.; Klucznik T.; Molga K.; Dittwald P.; Startek M.; Bajczyk M.; Grzybowski B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem., Int. Ed. 2016, 55, 5904–5937. 10.1002/anie.201506101. [DOI] [PubMed] [Google Scholar]
