Derivatization Design of Synthetically Accessible Space for Optimization: In Silico Synthesis vs Deep Generative Design

Gergely M Makara; László Kovács; István Szabó; Gábor Pőcze

doi:10.1021/acsmedchemlett.0c00540

2021 Jan 7;12(2):185–194. doi: 10.1021/acsmedchemlett.0c00540


PMCID: PMC7883369 PMID: 33603964

Abstract


Molecular design is of utmost importance in lead optimization programs, ultimately determining the fate of the project and the speed to reach preclinical stage. Newly designed lead analogues or new chemotypes must successfully address the challenges in the multidimensional optimization process throughout several optimization cycles. The speed, quality, and creativity of the designs can have a major impact on the cycle time, the number of required cycles, and the number of compounds that need to be synthesized and evaluated, which in combination affect the overall timeline and cost of the lead optimization phase. Recently, a new concept, generative design with deep learning, has become popular for de novo design of project relevant analogue sets. We have developed a de novo design technology called “derivatization design” that applies artificial-intelligence-assisted forward in silico synthesis for the generation of near neighbor lead analogues as well as scaffold variations. The attractive features of the methodology include synthetic feasibility, reagent availability, and cost data associated with each new molecule; thus, a detailed synthetic assessment is automatically generated during the design. As a result, these practically important data types can become an early part of the ranking and selection process for cycle time reduction. The power of derivatization design is demonstrated in a simple design study of DDR1 inhibitors and a comparison of the produced molecules to a recently published data set obtained with deep generative design.

Keywords: Lead discovery, de novo design, generative design, in silico synthesis, artificial intelligence, synthetic feasibility

Design or identification of new molecular structures for lead optimization programs is a key driver for the success of preclinical drug discovery. A structure of interest to the program can be known or preferably be novel, and it would contribute to answering outstanding questions or advancing the project by improving the profile of the lead series. Conceiving a structure of interest has long been considered to require expertise by medicinal chemists and has been one of the main objectives in computer-aided drug design. Once bona fide hits or leads have been identified, several different techniques, such as substructure or similarity queries of databases, focused library enumeration with traditional and newer methods,¹ de novo drug design based on protein pocket features² or pharmacophore models,³ and even fragment growing⁴ have been devised to propose new molecules that may or may not be “trivial” for human beings to conceive. In many programs, scaffold hopping⁵,⁶ becomes crucial either early during the hit-to-lead phase to generate new hit classes or later in lead optimization when certain properties cannot sufficiently be improved any longer with the existing scaffold motif. Therefore, scaffold hopping is an area of significant interest for software packages.⁷ Traditional techniques for molecular design suffer from inadequate or complete lack of assessment of synthetic feasibility, often leading to major frustration for computational as well as synthetic chemists. Even relatively simple one-step library enumeration techniques will create synthetically very challenging products if no consideration is given to the reaction mechanism or functional group tolerance.

A conceptually very different approach, generative design for de novo molecular structure generation, has recently emerged, creating excitement in the community.⁸ Deep networks that are trained with one or more focused data sets can produce molecule sets that are biased toward the reward functions used in training. This exciting new tool for idea generation, however, continues to suffer from the same issue that traditional techniques could not handle: lack of synthetic feasibility assessment in the design. While training networks with synthesizable molecules clearly enriches the product set in reasonable molecules, rapid wet laboratory synthesis often depends on a simple change in a single atom or atom type in the target molecules: one molecule may be synthetically easy because the necessary reagents are available from suppliers or in the internal repository, while its close relative may require an extensive effort and expanded synthetic timeline. Retrosynthetic analysis of tens or hundreds of thousands of designed molecules is not yet practically feasible.

Due to the availability and affordability of previously unimaginable computational power via on demand cloud infrastructure and the advent of artificial intelligence (AI) in drug discovery over the past decade, a truly revolutionary step change has been made in AI-enabled synthesis design: the computational prediction and evaluation of synthetic routes and synthetic feasibility. Both rule-based retrosynthesis⁹ and deep neural networks¹⁰ trained by large synthetic databases of millions of reaction data points have achieved practically useful predictive power. Moreover, blind studies have shown that even organic chemists highly skilled in the art cannot distinguish human and machine suggested routes to complex molecules, making Corey’s vision of retrosynthetic prediction a reality. Since his pioneering work on the LHASA project¹¹ laid the foundation for rule-based prediction methods in organic synthesis, several groups have created algorithms or software for synthesis prediction; however, these efforts have not found widespread adoption.¹² Among several shortcomings of these methods, the effect of functional groups distant to the reacting moiety has not been sufficiently accounted for despite functional group tolerance being of utmost importance in the determination of reagent compatibility with reaction conditions.¹³ Renewed interest in the field was sparked by several reports by the Grzybowski group on their rule-based artificial intelligence driven retrosynthesis methodology⁹ that ultimately led to the Chematica program, which included approximately 50 000 rules and was capable of solving complex natural product chemistries, as well as synthesis of drug-like small molecules. Another rule-based software developed by Wiley, ChemPlanner, has since become an integral part of SciFinder, while deep learning (DL) retrosynthesis tools developed by IBM¹⁴ (https://rxn.res.ibm.com/) and MIT¹⁵ (https://askcos.mit.edu/) have also been made available for chemists.
The main advantage of rule-based AI for handling organic synthesis lies in expert retraining, which allows synthetic experts to create reactivity and tolerance rules based on their own experience and on the study of vast organic synthetic literature data that may be manually curated. On the other hand, rule-based methods do not scale well because each different reaction type, and even a similar type with a major change in the reaction conditions, requires a large amount of upfront effort. DL methods that are automatically or semiautomatically trained by vast amounts of reaction data will, however, have a very diverse understanding of data patterns for thousands of different reaction types and conditions. This comprehensive training data has a major Achilles heel, though, that will influence the quality and ranking of the retrosynthetic solutions: organic synthetic literature has been found to be plagued with reproducibility issues,¹⁶ and it contains reports that go against conventional organic chemistry wisdom. In the absence of expert review, an unfiltered application of the contaminated organic chemistry literature will yield a range of false positives that could frustrate chemists. On the other side, the synthetic literature is extremely spotty on unsuccessful reactions, and very few examples exist of reactions that do not work. This has been somewhat mitigated by clever generation of artificial negative data for regiochemistry training to get a more balanced training set, but functional group intolerance cannot be created in such ways. Nevertheless, recent literature suggests that both rule-based¹⁷ and DL retrosyntheses⁸ have been well received by chemists and the penetration of computer-assisted reaction prediction into the daily routine of medicinal chemists has become noticeable.

Development of forward reaction prediction has also been attempted, most prevalently in the prediction of product structure(s) under certain reaction conditions for one step reactions. However, generation of conceivable products based on sets of reactants via different reaction types has not been published. This could partially be due to the combinatorial problem of having too many possibilities. This is especially true if we wanted to apply the vast and diverse reaction set typically used for DL training. Rule-based methods that have fewer but more carefully verified reaction types are potentially more suited for forward synthesis because (a) still nearly infinite diversity can be generated in multistep sequences by a few hundred reactions, (b) manual curation may give a higher confidence for the user that each step is indeed practically valid, an especially important factor for multistep sequences, and (c) products containing rare, less known reactions, or steps that require custom catalysts, hazardous conditions, expensive glassware or other burdensome factors are likely to be dismissed or deprioritized during compound selection for synthesis if timelines are important.

We have developed a rule-based AI-assisted technology¹⁸ that can assess the synthetic compatibility of reagents for reactions and generate the main product when selectivity rules indicate clear preference for one. The methodology also allows incorporation of principles of reagent symmetry when addition of excess reagent can shift the product equilibrium to monosubstituted species (e.g., ethylenediamine as a reactant). In all cases, full evaluation of functional group tolerance is included regardless of the position of such moieties in the reactants. It is possible to restrict the number and types of functional group combinations that are allowed in a reactant or in all reactants combined to have specific control on outliers. The technique allows incorporation of different reactant specific side-reactions or ring closures depending on the moieties present in the reactants.

This synthetic feasibility predictor has been built into a retrosynthesis solution finder and a forward design engine that will design molecules via different principles that have been tailored for common tasks such as scaffold hopping, scaffold side-chain design, or multistep chemical space generation with several reactions per step. These capabilities are available in the SynSpace software¹⁸ that will be further detailed elsewhere. Recently, we have also developed a completely different design concept using our synthetic feasibility predictor and the design engine, derivatization design (DD), that can automatically modify a molecular structure (a lead or hit for instance) by systematically evaluating both the accessible reagent and the reaction space around the molecule, yielding synthetically feasible analogues within the relevant chemical space for lead optimization and scaffold hopping.

Herein, we introduce the AI-assisted derivatization design concept for lead analogue design and show how it can be utilized for various design tasks for automated ideation. The results obtained by the method are compared to a recently disclosed¹⁹ data set generated by generative tensorial reinforcement learning (GENTRL) generative design for discoidin domain receptor 1 (DDR1), and implications for lead optimization are discussed.

Overview of the Derivatization Design Technique

Within the last two decades, the number of commercially available reagents or building blocks has exploded with very fine variations being available for many medicinally relevant reagent classes with multiple protecting group types. This advancement in purchasable reagent space compatible with various reactions makes in silico forward synthesis highly capable of covering a large part of the desired chemical space around leads. Derivatization design was created to take full advantage of this paradigm shift in commercial fine reagents and to automatically explore lead analogue space by systematically and fully evaluating the universe of possible virtual reaction products based on simple user specifications.

The technique is composed of multiple components or steps that have been combined into a unified and fully automated design program once a synthetic scheme is chosen by the user (Figure 1). The key process driver is the rule-based AI-assisted engine that in its current version has been parametrized for predicting the outcome of >300 organic transformations with their suitable reactants. It filters and selects compatible and desirable reagents based on their compatibility to all the transformations and can generate products consistent with reactivity, selectivity, and other rules. Thus, it can take advantage of a full compatibility matrix of known reactants to design new molecules. Other program components ensure the selection of suitable reagents based on several different criteria or handle program logistics to control the variation types and their positions in products to fulfill the user-defined preferences.

Figure 1. (A) One of the possible synthetic pathways to Ponatinib. Scheme: (a) Reductive alkylation; (b) Ullmann or Buchwald coupling; (c) Sonogashira coupling. (B) Derivatization design program flow and schematic description of the variations at each section.

The process optionally begins with retrosynthesis to find viable synthetic routes to the target structure (Figure 1A). It is also possible to define a route manually without retrosynthesis; however, retrosynthesis will produce a set of different schemes each of which might be of significant value in adding unique products to lead analogue space due to the incorporation of unique reactant and reaction types that may not be accessible via another route. Typically, it is desirable to use several routes that proceed to the product via very different orders of reactions or via completely different reaction schemes. Reaction sequences that enable higher reactant diversity are preferred over a scheme that has less rich reagent sets (see the Supporting Information).

Once a list of synthetic schemes is selected, the user only has to define in the program a few simple settings that will determine the outcome of the design such as the number of modifications and the positional similarity ranges (see the Supporting Information for the definition of settings a–h).

The overall flow of derivatization design can be divided into three sections that are depicted in Figure 1B. In Section 1, the selected reaction sequence (“original scheme”) is verified, and the compatible reagents are identified. The synthesis know-how engine identifies suitable functional groups for the given reaction at each step and then queries its reagent database for all reactants that possess the required structure and functional group(s), and finally the defined reaction scheme as well as the available reactants that comply with the reaction scheme are verified by the synthesis engine for the target. In Section 2, alternative reactions and compatible relevant reactants are identified for each step in the original scheme. Using the rule-based AI engine, the program identifies the reactions in which the reactants can participate, other than the original one, thus creating a list of “related” reactions. A new reaction becomes “related” if it requires the same reacting functional group present in the reactant of the original scheme. Related reactions in the first reaction step are identified for both reactant classes of a bimolecular reaction, while in the following steps it is only done for the intermediate product of the original scheme. With the new reaction lists at hand, identification of the compatible reagents for every reaction (original and related) is carried out by the AI-assisted engine applying the reaction rules on the entire reagent database. The compatible reagents are then analyzed for relevance: first they are converted into their standardized product form then compared to the same form of the reactant of the target molecule. Conversion to product form makes a more relevant comparison for reagents that possess different reacting functional groups that would otherwise lead to significantly lower pairwise similarities. 
Currently, four different measures are used for comparison: the chemical hashed fingerprint by ChemAxon,²⁰ the ECFP4 fingerprint,²¹ the pharmacophore fingerprint,²⁰ and a shape descriptor.²⁰ Cutoff values for reactant similarity are automatically computed by weighting the user-specified product similarity limit (see user input (c)) by heavy atom count. Reactants that pass at least one of the four similarity filters become part of the reagent collection for the given reaction in each step. For clarity, reactants are analyzed for each reaction in the full reaction list separately, and a reactant that is chemically compatible with multiple reactions may only be used for some of them, depending on its computed similarity values to the original reactant.
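The two Section 2 ideas described above, finding "related" reactions through a shared reacting functional group and accepting a reagent when at least one of the four similarity measures passes, can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the reaction names and functional-group labels in `REACTION_RULES`, the measure keys, and the convention that every measure is a similarity where higher means closer. The actual rule engine, the similarity implementations, and the heavy-atom weighting of the cutoffs are not reproduced.

```python
from typing import Dict, List, Set

# Hypothetical rule table (not from the paper): reaction name -> reacting
# functional groups required across its reactants.
REACTION_RULES: Dict[str, Set[str]] = {
    "amide_coupling": {"carboxylic_acid", "amine"},
    "reductive_alkylation": {"aldehyde", "amine"},
    "sulfonamide_formation": {"sulfonyl_chloride", "amine"},
    "sonogashira_coupling": {"aryl_halide", "terminal_alkyne"},
}

# Illustrative keys for the four relevance measures named in the text.
MEASURES = ("chemical_hashed_fp", "ecfp4", "pharmacophore_fp", "shape")


def related_reactions(original: str, reacting_group: str) -> List[str]:
    """Reactions other than `original` that use the same reacting group."""
    return sorted(
        name
        for name, groups in REACTION_RULES.items()
        if name != original and reacting_group in groups
    )


def passes_relevance(similarities: Dict[str, float],
                     cutoffs: Dict[str, float]) -> bool:
    """Accept a reagent if ANY one of the four measures reaches its cutoff.

    Assumes each value is a similarity (higher = more similar) and that the
    cutoffs were already derived from the user-specified product limit.
    """
    return any(similarities[m] >= cutoffs[m] for m in MEASURES)
```

Because the filter is a logical OR over the four measures, a reagent that looks dissimilar by fingerprint can still be retained on shape or pharmacophore grounds, which is consistent with the stated goal of keeping relevant reagents that carry different reacting functional groups.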

In Section 3, synthetically possible molecules are assembled, and the virtual products are analyzed. The reaction and reagent database created in Section 2 are utilized in conjunction with user preferences that govern the number of desired modifications for unchanged or modified bond connections (variation types) and the total number of variations in the molecules. If an intermediate has reached its user-defined limit for a variation type, then it can no longer participate in any further reaction that would lead to further change of that variation type. If no more modification is allowed, then only the input lead molecule’s original reactants and reactions (identified in Section 1 above) are used in further steps. Intermediates and the final products are filtered through the same four relevance (similarity) measures used for reactants. Shorter sequences can also contribute to the final diversity. In addition to the input lead molecule, additional molecules can also be added as similarity reference for the final analysis. Property and substructure filtration or novelty assessment can optionally be done before final acceptance (see the Supporting Information). All output molecules carry a full data set of synthetic details, required reactant data (such as ID, vendor list, cost, shipping lead time etc.), total reagent cost, combined reagent availability rank, and estimated product cost.
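The variation-limit bookkeeping in Section 3 can be sketched as follows. This is only an illustration of the rule stated above (an intermediate that has exhausted a variation type cannot change that type again, and one with no budget left continues with the original reactants and reactions only); the class names, the variation-type labels `unchanged_bond` and `modified_bond`, and the data layout are hypothetical and not taken from the software.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Limits:
    """User-defined caps (hypothetical layout)."""
    per_type: Dict[str, int]   # cap for each variation type
    total: int                 # cap on all variations in one molecule


@dataclass
class Intermediate:
    """Tracks how many variations of each type a partial product carries."""
    used: Dict[str, int] = field(default_factory=dict)

    def total_used(self) -> int:
        return sum(self.used.values())


def can_apply(inter: Intermediate, variation_type: str, limits: Limits) -> bool:
    """True if one more variation of this type stays within both caps."""
    if inter.total_used() >= limits.total:
        return False
    cap = limits.per_type.get(variation_type, 0)
    return inter.used.get(variation_type, 0) < cap


def only_original_allowed(inter: Intermediate, limits: Limits) -> bool:
    """True when no further modification is allowed for this intermediate,
    so later steps must use the original scheme's reactants and reactions."""
    return not any(can_apply(inter, t, limits) for t in limits.per_type)
```

With the simplest setting used in the case study (one modification per molecule), an intermediate that already carries one variation of either type would be completed along the original scheme only.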

Comparative Case Study for DDR1

Generative design by DL methods has recently become a popular de novo design technique both in reports⁸,²¹ and in the toolkits of pharmaceutical companies and academics. Several architectures such as recurrent neural networks, autoencoders, or generative adversarial networks have all been shown to work for the purpose.⁸ A recent disclosure of DDR1 inhibitor design by GENTRL showed how generative design can create potentially new chemotypes in weeks and that one of six compounds selected for synthesis had good potency and a favorable ADMET profile, although the latter was not designed by the model.¹⁹ Since the entire raw set of 30 000 designed molecules was included in the report, we set out to generate a product set by derivatization design for the same target.

There are numerous crystal structures deposited in the PDB database for DDR1 with different inhibitor chemotypes bound to both the DFG in and DFG out protein conformations as well as the inactive DDR1 conformation. For this study, in order to mimic a real project, we elected to search for inhibitors binding to active DDR1 protein conformations.²² The availability of high-resolution structures allowed us to set up a simple design and analyze protocol: the derivatization design set (DD) and the literature deep generative design set (DGD) were subjected to filtration by calculated physicochemical properties and unwanted substructures (see the Supporting Information), followed by docking to evaluate enrichment for lead optimization (exploitation) as well as for the ability to provide new motifs for new lead development directions (exploration).

While docking scores have been shown by several authors to have a poor correlation with biological activity,²³,²⁴ enrichment of actives in top ranked compounds by Glide and other software has been confirmed in several reports.²⁵,²⁶ Since one of the main goals of our comparative study was to establish design quality by enrichment in potential actives (rather than selecting a few “potent” compounds), we selected docking as a capable tool to measure hit rate. In addition, further evidence indicated that docking could be a reliable predictor in our work, as follows. First, all ligands found in cocrystal structures could be docked to their respective proteins, resulting in good scores and expected poses. Second, six molecules with known activities from the GENTRL set¹⁹ and compound 7r (Figure 2) were also subjected to docking using the standard settings, and the selected docking score cutoff (Glide SP score ≤ −9) was found to classify actives (IC₅₀ <500 nM) and inactives with 86% success rate (6 of 7 molecules), ranking all four actives correctly and only classifying one of three inactives as a false positive.

Figure 2. Eight input molecules used in derivatization design (DD) with potency against DDR1 (IC₅₀ values except for 6BRJ where K_d is shown) and an overlay of bound conformations.

As shown above, derivatization design requires a single active structure for input and one or more active structures for similarity assessment. We have also discussed that more than one reaction scheme can provide significant benefits toward getting better coverage of the synthetically accessible space. While the GENTRL study utilized a set of 1370 DDR1 inhibitors to train the deep network with undisclosed reward settings, in derivatization design we utilized 7 inhibitors as starting points for the design, 6 of which were simply taken from the crystal structure, while for the 6FEW PDB complex the micromolar inhibitor was replaced with the more potent compound that can be considered a midpoint in an optimization. To investigate the contribution produced by different synthetic schemes for Ponatinib-like inhibitors, Ponatinib and its close shorter analogue (compound 7r²⁸) were run by a different synthetic pathway. The various inhibitors cover various parts of the large DDR1 pocket while they have diverse molecular frameworks, and they make different interactions with the protein as seen in the overlay of the conformations extracted from X-ray structures in Figure 2. Since the published training set of the GENTRL method included inhibitors that bind to the inactive DDR1 conformation,²⁷ we also included two PDB structures and their ligands belonging to this form. Although we did not use the inactive protein form for docking, we wanted to use a training set for DD that is similar, albeit much smaller than the one applied in the DGD reference to make a fair comparison between the two designed sets. In summary, the selection of the training set for derivatization design was not influenced by potency and it was based upon the following considerations: availability of cocrystal structure with DDR1, inclusion of at least one example for each known DDR1 form (DFG in, DFG out, inactive form), and structural diversity (Figure 2).
Thus, in total, eight different derivatization design runs were performed with the simplest user setting of one reaction pathway and one modification per molecule allowed (see the Supporting Information for specific settings in each design run). Property filtration was turned off during derivatization design to obtain a raw set first that can be compared to the raw DGD set including the ratio of undesirable molecules produced by the respective design methods. The 8 design runs yielded a total of 72 947 distinct molecules, an average of 9118 per run. A simple property and unwanted substructure filtration removed 23 499 members (78%) of the DGD set and 33 923 compounds (46%) from the DD set. In order to close the gap in the remaining set sizes, a loose Bemis–Murcko (loose BM, LBM) analysis was carried out on the remaining DD compounds and a single member of each framework cluster was retained leading to 6748 compounds for analysis versus 6501 in the DGD set. It should be noted that the last DD filtration step can result in a loss of hits or new motifs within our DD collection sent for docking if the kept lone framework gets a poor docking score due to an unfavorable interaction but other members of the family would dock well. The docked molecules generated by the two different design techniques were further analyzed by their diversity, scaffold composition, LBM composition, and BM composition.
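The last DD filtration step, retaining a single member of each loose Bemis–Murcko framework cluster, amounts to deduplication on a framework key. The sketch below assumes that a framework key has already been computed for every compound by some cheminformatics toolkit; the key strings, the compound IDs, and the "keep the first member seen" policy are illustrative assumptions, since the text does not say which cluster member was retained. The small `remaining` helper only reproduces the set-size bookkeeping.

```python
from typing import Dict, Iterable, List, Tuple


def one_per_framework(compounds: Iterable[Tuple[str, str]]) -> List[str]:
    """Keep one compound ID per framework key.

    `compounds` yields (compound_id, framework_key) pairs. The first compound
    seen for each key is retained (an assumed policy for illustration).
    """
    kept: Dict[str, str] = {}
    for compound_id, framework_key in compounds:
        kept.setdefault(framework_key, compound_id)
    return list(kept.values())


def remaining(total: int, removed: int) -> int:
    """Set size left after a filtration step."""
    return total - removed


# Toy input: three compounds sharing two hypothetical framework keys.
toy = [("cpd-1", "FW-A"), ("cpd-2", "FW-A"), ("cpd-3", "FW-B")]
representatives = one_per_framework(toy)

# Bookkeeping with the DGD numbers quoted in the text.
dgd_left = remaining(30_000, 23_499)
```

For the DGD numbers quoted above, `remaining(30_000, 23_499)` gives the 6501 compounds stated in the text. The sketch also makes the caveat in the paragraph concrete: whichever member is kept stands in for its whole framework family in docking.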

Based on side-chain conformations that caused major changes in the pocket volume, we selected three PDB structures (3ZOS, 5BVN, 6FEW) for docking studies as representatives of the different protein pocket types observed in various PDB entries for active DDR1 protein conformations. Each molecule was docked to all three prepared pockets using Glide, initially using the SP settings followed by XP docking for molecules that had at least one pose that complied with strain, clash, and docking score limits (see the Supporting Information). Finally, XP scores and poses were used for final analyses. Table 1 summarizes the comparative results obtained in the study.
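The two-stage docking funnel described above can be summarized in a short sketch. It does not call Glide or any Schrödinger API; it only models the decision logic on already-computed results. The `Pose` fields, the boolean strain and clash flags, and the choice of summarizing a molecule by its best (most negative) XP score across the three pockets are assumptions for illustration; the actual limits are given in the Supporting Information.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# The three receptor structures named in the text.
POCKETS = ("3ZOS", "5BVN", "6FEW")


@dataclass
class Pose:
    pocket: str
    score: float      # docking score; more negative is better
    strain_ok: bool   # pose complies with the strain limit (assumed flag)
    clash_ok: bool    # pose complies with the clash limit (assumed flag)


def eligible_for_xp(sp_poses: List[Pose], sp_cutoff: float) -> bool:
    """A molecule advances to XP docking if at least one SP pose satisfies
    the strain, clash, and docking score limits."""
    return any(
        p.strain_ok and p.clash_ok and p.score <= sp_cutoff
        for p in sp_poses
    )


def best_xp_score(xp_scores: Dict[str, float]) -> Optional[float]:
    """Best XP score over the pockets that were docked (assumed summary)."""
    return min(xp_scores.values()) if xp_scores else None
```

As a usage example, a molecule with SP poses of −9.4 in 3ZOS (limits satisfied) and −7.0 in 5BVN would advance when `sp_cutoff=-9.0`, and its XP results would then be summarized by the lowest XP score across pockets.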

Table 1. Comparison of the Docked Data Sets Obtained by Derivatization Design (DD) and Deep Generative Design (DGD).

Metric	DD	DGD	Overlap
size of training set^a	8	1370	8
docked set size	6748	6427	0
scaffold count (SC)^a	483	818	124
SC with 2–3 substituents	89%	74%	N/A
LBM count	6748	5394	6
BM count	1404	2828	86
Glide hit rate	34%	9%	N/A
best Glide XP score^b	−17.7	−16.4	N/A
avg hit docking LE^a	0.50	0.46	N/A
avg hit docking LIPE^a	7.8	6.9	N/A
top 50 XP Glide score	−15.5	−13	N/A
new motifs	6	4	1


^a See the Supporting Information.

^b Ponatinib XP Glide score is −15.4.

Results

Comparison of the raw DD and DGD generated sets reveals three major trends (Table 1). (a) There is extremely small overlap based on all metrics we applied: actual molecules, LBM, BM frameworks, or scaffolds (as defined by the most central ring system along with its substitution vectors). (b) The derivatization design product set contains significantly fewer compounds that fail property and substructure filters. (c) The deep generative design set has higher scaffold and BM framework diversity. The latter could be due to the orders of magnitude larger training set of DGD as opposed to the few compounds used for DD. In addition, we found that DD generated a higher percentage of lead-like molecules and that DGD created a higher percentage of very simple fragment-like molecules or highly complex scaffolds with >3 substituents. Analysis of the DD result sets produced by different reaction schemes for Ponatinib-like inhibitors revealed 6–17% overlap among the molecules of these series, demonstrating the value of running multiple derivatization designs with conceptually different reaction pathways.

Most importantly, for lead discovery purposes, we evaluated the enrichment in interesting compounds as predicted by docking scores. Surprisingly, we found that derivatization design produced a much higher hit rate, while the number of docked “actives” in the DGD set is rather low, especially considering the very large training set used to train the deep network. So, it seems that the larger scaffold and framework diversity produced by the deep generative method hurts the hit rate much more than it helps discover new motifs for scaffold hopping. In other words, the increased framework diversity in DGD was reached by moving out of the confines of the shape and interaction map in the DDR1 inhibitor binding pockets. To visualize the sampled lead analogue space, principal component analysis (PCA, see Figure S1 of the Supporting Information) and t-distributed stochastic neighbor embedding (t-SNE) were employed.²⁹ Figure 3 confirms that a significant part of the space sampled only by DGD is rather far from the seven X-ray ligands and that a significant portion of these unique areas produced no hit in the docking study against three diverse DDR1 structures. In fact, the densest area in the DGD map (Figure 3E) is the least productive in hits (Figure 3C). It is noteworthy that this analysis was able to point out the near neighbor characteristics of the two highly similar molecules, compounds 1 and 7r (Figure 3D).

Figure 3. Clustering of generated molecules by t-distributed stochastic neighbor embedding (t-SNE; see the Supporting Information for details) inclusive of the DD and DGD sets and the eight reference ligands used in DD and compound 1. (A) DD and DGD docked sets. (B) Hit molecules versus full docked set created by DD. (C) Hit molecules versus full docked set created by DGD. (D) Positions and identity of the eight reference ligands and compound 1 in the same space. 6BRJ and 6BSD PDB structures belong to inactive DDR1 conformations. (E) Heat map for relative compound density in DGD space. (F) Heat map for relative compound density in DD space.

The number of potentially interesting new motifs was analyzed by removing the trivial analogues that contained known skeletons and chemotypes present in the training set followed by visual inspection of the remaining structures. Both methods produced notable new motifs in similar numbers, albeit with only one overlap, the benzisoxazole scaffold (compound 1 of reference study, Figure 4). It has been debated in the literature whether the transformation of a known secondary aryl carboxamide motif to a conformationally constrained benzisoxazole is a remarkable new revelation or a trivial cyclization routinely suggested in medicinal chemistry programs.^30,31 Nevertheless, it is certainly a potential new motif that could be of interest for evaluation in order to improve multiparameter optimization properties or binding affinity in a project; thus, it is rewarding to see this validated compound class in the DD result set. It was not disclosed in the GENTRL method what exploration/exploitation settings were applied; therefore, derivatization design diversity settings in this study contained no upper similarity limit that can steer designs toward exploration. Consequently, in this study, we conducted both lead optimization (exploitation) and lead modification (exploration) by applying the simplest settings that are not among the most productive ones in our experience. It is worth pointing out that, even with these simplest settings, the method was able to propose the benzisoxazole scaffold as a potential molecule of interest, thereby successfully reproducing the key finding in the reference DGD study (Figure 4), and it also proposed, without training, the previously disclosed³² successful scaffold hop from the metabolically labile spirocyclic imidazoine-4-one to the improved dihydroisoindolone scaffold (Figure 4).

Figure 4. Derivatization design discovers known motifs. (A) Compound 1 proposed by DGD versus benzisoxazole designed by DD. (B) Scaffold hopping by DD from compound 2.13 to dihydroisoindolone.

Moreover, derivatization design performed far better than the deep generative method in producing an enrichment of valuable ideas, as signaled by both a higher hit rate and an improvement in average docking scores (Table 1). Not only was the hit rate significantly higher, but the lead quality of the generated virtual hit molecules was also better for DD, as indicated by higher LE and docking LIPE values (see the Supporting Information). The higher lead-likeness of the entire generated raw virtual set was also evident from the much higher failure rate of the DGD set in the predocking physicochemical filtration step, as discussed earlier. Lastly, both the best docking score and the average of the top 50 docking scores are significantly lower for DD. The best docking score in the derivatization design set was obtained for a Ponatinib analogue containing a new solubilizing group that is predicted to participate in interactions very different from those of the piperazine of Ponatinib, resulting in an improved docking score (Figure 5).

Figure 5. Overlay of bound Ponatinib conformation in 3ZOS (green) and analogue with the lowest docking score generated by derivatization design docked into 3ZOS (magenta). Presumed H-bonding interactions of piperazine NH of Ponatinib and charged interaction of the new analogue are marked with dotted lines.

Discussion

Artificial intelligence has created a significant buzz in the pharmaceutical sector by promising a range of benefits for preclinical and clinical discovery.^33,34 However, adoption of AI-driven tools into the preclinical workflow has been cautious; only in the past few years have deep generative design and deep learning models made their way into commercial packages and internal company processes. Active learning³⁵ and desirability score-driven³⁶ virtual optimizations are becoming popular with the promise of reducing cycle time and the number of compounds synthesized. These virtual optimization methods all require generation of new, preferably novel analogues around the compound(s) selected as the cycle’s starting point. For truly quick cycles, these molecule generators should be automated for ease of use by medicinal and computational chemists. The idea generation tools include two de novo methods, deep generative design and single-step enumeration, as well as similarity or substructure database search. Both de novo methods, however, share a major drawback: the lack of a true synthetic feasibility driver in the design process. Retrospective analysis of synthetic feasibility by SA_Score³⁷ or SCScore³⁸ is possible, but these are flawed predictors. Ideally, therefore, full retrosynthetic analyses would be needed on all virtual products. These would, however, require expanded timelines and major resources not yet available for the analysis of many thousands of new compounds. Moreover, it is also counterintuitive to remove a potentially significant portion of the already designed space by time-consuming retrosynthetic analysis. Bradshaw et al.³⁹ and Korovina et al.⁴⁰ have proposed synthesis-based molecule design models, but these remain unvalidated for real-world multistep examples.

Recently, the synthetic accessibility of deep generative data sets has been discussed.⁴¹ Thus, we calculated the SCScore (DD: 4.6; DGD: 4.2) for both sets and found that the DD set has a slightly higher average SCScore. Interestingly, the calculated SA_score followed a similar trend (average DD: 2.9; DGD: 2.7). One reason for the surprising results is that DGD produced a higher percent of simple fragment-like molecules that have very low scores (Table 1). But more importantly, our results point to a major flaw in both scoring methods that abolishes their value toward establishing synthetic feasibility: the commercial availability of reactants is not considered in either scoring algorithm. Thus, molecules constructed via simple chemistries from complex but commercially available reactants are significantly overestimated, while simple structures that are hard to construct from simple reactants are underestimated (Figure 6). As we show, both scores obtained for an example from the DD set are high despite having a simple synthesis pathway with three (currently) commercially available reactants. Thus, we postulate that neither of the published scores is appropriate to use for the assessment of practical synthetic feasibility or synthetic timelines to compounds as they are rather related to the molecular complexity. True synthetic feasibility can only be fairly established by full retrosynthetic analysis or by a scoring method that assigns zero difficulty to the purchasable parts of the molecule and uses an up-to-date reactant database as reference.
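The kind of score suggested at the end of the preceding paragraph, one that assigns zero difficulty to purchasable parts of the molecule, can be illustrated with a toy sketch. It is not the authors' engine and not SA_score or SCScore: the route tree, the per-step difficulty values, and the catalog contents are all hypothetical placeholders chosen only to show the principle.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A compound in a (hypothetical) synthesis tree."""
    name: str
    step_difficulty: float = 1.0          # assumed cost of the step that forms this compound
    precursors: list = field(default_factory=list)


def route_difficulty(node, catalog):
    """Sum step difficulties over a route, treating purchasable compounds as free."""
    if node.name in catalog:
        return 0.0                         # purchasable: zero difficulty, however complex
    if not node.precursors:
        return float("inf")                # neither purchasable nor makeable in this toy model
    return node.step_difficulty + sum(
        route_difficulty(p, catalog) for p in node.precursors
    )


# Hypothetical two-step route from three reactants, one of them a complex building block.
intermediate = Node("intermediate", 1.0, [Node("amine_A"), Node("aryl_halide_B")])
product = Node("product", 1.0, [intermediate, Node("complex_block_C")])

full_catalog = {"amine_A", "aryl_halide_B", "complex_block_C"}
reduced_catalog = {"amine_A", "aryl_halide_B"}

print(route_difficulty(product, full_catalog))     # two steps, all reactants purchasable
print(route_difficulty(product, reduced_catalog))  # complex block no longer available
```

In this sketch the same product is easy or inaccessible depending only on whether the complex building block is in the catalog, which is the dependence on an up-to-date reactant database that complexity-based scores cannot capture.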

Figure 6. Over- and underestimated synthetic feasibility scores exemplify poor predictability of wet laboratory effort. (A) Overestimated synthetic difficulties by SA_score and SCScore for a simple synthetic sequence using three commercial reactants (eMolecules IDs are shown), one of which is a complex building block. (B) Underestimated synthetic difficulties by SA_score for two longer sequences that apply simple reactants. No separation by SCScore.

On the other hand, we can estimate synthetic feasibility for a traditional enumerative design that does not build on true synthetic knowledge. Thus, we carried out a concept test using a sequence from the DDR1 data set. Ponatinib-like inhibitors can be assembled by several sequences including a three-step scheme of reductive alkylation, Ullmann coupling, and Sonogashira coupling (Figure 1A). For this sequence, we analyzed the number of reagents in our full reagent database that participate in these reactions using (a) our AI-assisted engine and (b) a reaction enumerator using reaction SMIRKS⁴² of relatively medium complexity as commonly applied in enumerator software. Our synthesis engine predicts 124K, 176K, and 112K compatible reactants for the three reactions to participate in a predictable fashion, respectively, while a simple enumerator processes 148K, 253K, and 152K reagents for the same. As a result, we can assume that on average 25% of the products generated by a simple enumerator for a reaction step would not be synthetically feasible or would be very challenging to make using a standard protocol. In this failure rate, the number of falsely enumerated product structures by the simple enumerator due to inadequate reactivity and selectivity rules is not considered, because that would be highly dependent on the SMIRKS and as such more difficult to estimate. In a three-step sequence of these three reactions, 57% of the virtual products would be synthetically challenged not considering the aforementioned reactivity and selectivity issue. We believe that such high uncertainty is not consistent with our desire to reduce cycle time and cost in lead optimization. In our practice, approximately 85% of the products requiring synthetic sequences of three steps or longer with one or more challenging steps could be rapidly synthesized in the laboratory, which equals the synthetic success rate found desirable by others⁴¹ for a single step protocol.
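The percentages quoted above follow from the reagent counts by simple arithmetic, which the short sketch below reproduces. It assumes that the counts map to the three reactions in the order given in the text, that the reagents accepted by the synthesis engine are a subset of those accepted by the enumerator, and that the steps fail independently; the variable and function names are illustrative only.

```python
# Reagent counts quoted in the text, in the order of the three-step scheme.
engine = {
    "reductive alkylation": 124_000,
    "Ullmann coupling": 176_000,
    "Sonogashira coupling": 112_000,
}
enumerator = {
    "reductive alkylation": 148_000,
    "Ullmann coupling": 253_000,
    "Sonogashira coupling": 152_000,
}


def infeasible_fractions(engine, enumerator):
    """Per-step fraction of enumerated reagents rejected by the synthesis engine."""
    return {rxn: 1 - engine[rxn] / enumerator[rxn] for rxn in engine}


def sequence_failure(engine, enumerator):
    """Fraction of products with at least one problematic step (independence assumed)."""
    feasible = 1.0
    for rxn in engine:
        feasible *= engine[rxn] / enumerator[rxn]
    return 1 - feasible


per_step = infeasible_fractions(engine, enumerator)
mean_step = sum(per_step.values()) / len(per_step)
overall = sequence_failure(engine, enumerator)

for rxn, frac in per_step.items():
    print(f"{rxn}: {frac:.1%}")
print(f"mean per step: {mean_step:.1%}")
print(f"three-step sequence: {overall:.1%}")
```

Under these assumptions the per-step values are roughly 16%, 30%, and 26%, which average to about a quarter, and the compounded three-step figure is about 57%, in line with the numbers stated in the text.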

Today, to achieve rapid cycle times in lead generation and optimization, an effective design method must have several important attributes:

1. >90% of proposed products should be synthetically rapidly accessible
2. Must be tunable toward any combination of exploitation or exploration
3. Must be able to produce nontrivial patentable ideas
4. Must be able to produce complex changes for scaffold hopping
5. Must have a reasonably high enrichment factor in productive ideas
6. Must be fast enough within the limitations of the expected (virtual or real) cycle time, likely meaning a few days or less
7. Must be able to produce multipoint variations in lead molecules (if the user organization deems that important)
8. For novel targets or for targets with little prior art, it must be able to achieve all the above with a small training set or even with one starting point

The currently used techniques, deep generative design and traditional enumeration, cannot satisfy all the above criteria. DGD is weak on points 1, 5 (see Table 1), and 8, while simple enumeration fails at points 1, 2, 3, 4, and 7. The only criterion both methods can meet is speed (point 6), which explains why the methods have gained widespread adoption. In contrast, derivatization design meets all 8 criteria for cycle time reduction due to the following specific capabilities or features, which address the factors listed above:

1: The AI-assisted synthetic feasibility engine drives reactant selection and product generation incorporating functional group tolerance, side-reactions, selectivity, regio- and stereochemistry.
2–4: Settings like number of total variations, number of new linkages, total compound similarity, and specific component similarity thresholds as well as lower and upper limits for similarity make the method highly tunable by metrics well understood by medicinal chemists.
3,4: Utilization of four different diversity measures and inclusion of shape and pharmacophore descriptors ensure nontrivial and scaffold hopping ideas.
5: Enrichment factor can be controlled by the total similarity lower and upper value.
6: In our experience, several three to five step derivatization designs can be completed in a day on a single machine.
7: Multisite variations and the depth of change at each site are controlled by the total number of variations allowed and the number of new connections.
8: One ligand is sufficient.
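The role of lower and upper similarity limits mentioned in the list above can be sketched with a generic similarity-window filter. This is an illustration under stated assumptions rather than the settings of the derivatization design software: fingerprints are represented as plain sets of integer bit indices (a stand-in for, e.g., ECFP4 bits), and the thresholds and function names are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def within_window(lead_fp, candidates, lower, upper=1.0):
    """Keep candidates whose similarity to the lead lies in [lower, upper].

    A high lower limit favors close analogues (exploitation); adding an
    upper limit below 1.0 pushes the selection away from the lead (exploration).
    """
    kept = {}
    for name, fp in candidates.items():
        s = tanimoto(lead_fp, fp)
        if lower <= s <= upper:
            kept[name] = s
    return kept


# Toy fingerprints: overlap with the lead decreases from "near" to "far".
lead = set(range(10))
candidates = {
    "near": set(range(1, 11)),    # 9 shared bits out of 11
    "mid": set(range(5, 15)),     # 5 shared bits out of 15
    "far": set(range(20, 30)),    # no shared bits
}

print(sorted(within_window(lead, candidates, lower=0.3)))             # no upper limit
print(sorted(within_window(lead, candidates, lower=0.3, upper=0.7)))  # exploration window
```

Without an upper limit both the near and the moderately similar candidates pass; with the upper limit only the moderately similar one remains, which mirrors how an upper similarity limit steers designs toward exploration.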

Due to these attributes, derivatization design can be effectively used in a project where the purpose is to quickly expand the chemotype diversity from one or a few hit structures to many. This is especially important early in the program when the goal is to evaluate the structural diversity of possible leads and their associated multiparameter profiles before committing to a few selected lead series for further optimization. Deep generative design, however, is not applicable at this stage due to the lack of a sufficient training set unless there is a rich literature of known prior art. Even with the availability of a large and very diverse training set, as in the case of DDR1 inhibitors, we show that the hit rate of DGD is lower despite the large structural diversity of the design output. Moreover, the other popular technique, simple reaction enumeration design, will not turn up new motifs or scaffolds but will only create analogue sets for evaluation. The few limitations of derivatization design include the need for at least one small-molecule hit structure. On the other hand, the starting point can be derived from the scientific or patent literature because derivatization design can tap into patentable novel chemical space from the known input. In addition, derivatization design currently cannot be combined directly with physics-based ranking methods such as free-energy perturbation (FEP), but it can be used to generate analogues for subsequent FEP calculations. Finally, the technique can currently be used only for linear synthetic routes; a convergent synthetic strategy would have to be divided into subbranches.

In later stages of lead optimization, derivatization design offers a very convenient method to strategically vary the lead structure at a single site or, preferably, to allow synergistic combinations of multisite variations. In practice, there is a major preference for a single modification at a time, both for ease of data interpretation and for applying matched molecular pair analyses, a process that has been valuable to medicinal chemistry. However, certain types of changes at one part of the molecule may require fine-tuning at other close or distant areas to produce synergistic combinations that can be interesting but would be overlooked by single-site changes. Whatever one’s preference, derivatization design can easily be directed by the user to do one, the other, or both. At first glance, derivatization design might appear to be solely a reagent variation method by which building blocks are exchanged for new ones in their similarity neighborhood. Indeed, that can be explored, but many other relevant modification types can also be thoroughly investigated: modification of ring classes, sizes, and ring system types (mono-, bi-, spirocyclic) in conjunction with connectivity to the next structural component; change of bond type and connecting atom types; shortening and lengthening distances between functional groups or rings; bioisostere variations; and rigidification. At later stages, the method is likely to be applied to improve the ADMET or pharmacological profile rather than to improve potency or perform scaffold hopping. Thus, it is advantageous to focus the structural variations during derivatization design on the liable area(s) of the lead and to use only those reactants that have the potential to improve the properties of concern. Moreover, good input molecules for derivatization design can be identified by cluster analysis of actives and selection of cluster representatives if a large structure–activity relationship data set is available.

In the current comparative study, we applied simple settings with restrictions of one total modification and an average similarity depth for each run. In our experience, different derivatization design settings, in particular those allowing higher diversity depth at selected sites, result in significantly more interesting new scaffold ideas. Such settings can easily explore scaffold hopping from mono- to bicyclic or to spirocyclic rings with desirable connections. These analogues can rapidly be evaluated in real optimizations because an approximately 85% laboratory success rate has historically been achieved for synthetic schemes containing three to five steps, owing to the expert-trained forward synthesis design engine. Moreover, detailed synthesis and reagent data in the output provide opportunities for various early rankings by the most relevant practical data types that drive the timeline to bioassays.

Improvements currently under development include handling convergent synthetic pathways and adding pharmacophore models to the “relevance” filters. An optimization case study initiated from one of the new DDR1 motifs found herein will be reported elsewhere.

Conclusion

Deep generative design was found to create a more diverse virtual scaffold and framework space than the one produced by derivatization design with the settings applied in this study.

The derivatization design technique introduced herein is an effective and flexibly adoptable design strategy for both early and advanced stage lead generation and optimization. It can rapidly provide closer lead analogues and scaffold variations simultaneously or separately to feed rapid virtual or real optimization cycles. Unlike outputs offered by deep networks, all designed compounds possess practically relevant data including synthetic schemes and reagent availability details. The method appears to outperform generative design by deep networks offering a better hit rate via sampling a more relevant chemical space without need for much training data.

Glossary

Abbreviations

BM: Bemis–Murcko
DD: derivatization design
DDR1: discoidin domain receptor 1
DGD: deep generative design
DL: deep learning
ECFP4: extended-connectivity fingerprints
FEP: free energy perturbation
GENTRL: generative tensorial reinforcement learning
LBM: Loose Bemis–Murcko

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsmedchemlett.0c00540.

Detailed reaction schemes, description of user inputs, specific settings utilized in DDR1 derivatization design, physicochemical and unwanted substructure filters used in prefiltering prior to docking, details on the scaffold analysis method and docking process, and visualization details (PDF)

Author Contributions

The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

This work was supported by the National Research, Development and Innovation Office of Hungary (NKFIH Grant Number 2018-1.1.1-MKI-2018-00046), and by ERDF (GINOP grant number GINOP-2.1.7-15-2016-00622; VEKOP grant number VEKOP-2.1.7-15-2016-00138).

The authors declare no competing financial interest.

Supplementary Material

ml0c00540_si_001.pdf (515.8 KB, pdf)

References

Hoffmann T.; Gastreich M. The Next Level in Chemical Space Navigation: Going Far Beyond Enumerable Compound Libraries. Drug Discovery Today 2019, 24, 1148–1156. 10.1016/j.drudis.2019.02.013.
Yuan Y.; Pei J.; Lai L. LigBuilder 2: A Practical de Novo Drug Design Approach. J. Chem. Inf. Model. 2011, 51, 1083–1091. 10.1021/ci100350u.
Langer T.; Hoffmann R. D. Pharmacophore Modelling: Applications in Drug Discovery. Expert Opin. Drug Discovery 2006, 1, 261–267. 10.1517/17460441.1.3.261.
de Souza Neto L. R.; Moreira-Filho J. T.; Neves B. J.; Riveros Maidana R. L. B.; Ramos Guimaraes A. C.; Furnham N.; Andrade C. H.; Silva F. P. In silico Strategies to Support Fragment-to-Lead Optimization in Drug Discovery. Front. Chem. 2020, 8, 1–18. 10.3389/fchem.2020.00093.
Hu Y.; Stumpfe D.; Bajorath J. Recent Advances in Scaffold Hopping. J. Med. Chem. 2017, 60, 1238–1246. 10.1021/acs.jmedchem.6b01437.
Langdon S. R.; Brown N.; Blagg J. Scaffold Diversity of Exemplified Medicinal Chemistry Space. J. Chem. Inf. Model. 2011, 51, 2174–2185. 10.1021/ci2001428.
Rabal O.; Amr F. I.; Oyarzabal J. Novel Scaffold Fingerprint (SFP): Applications in Scaffold Hopping and Scaffold-Based Selection of Diverse Compounds. J. Chem. Inf. Model. 2015, 55, 1–18. 10.1021/ci500542e.
Vanhaelen Q.; Lin Y.-C.; Zhavoronkov A. The Advent of Generative Chemistry. ACS Med. Chem. Lett. 2020, 11, 1496–1505. 10.1021/acsmedchemlett.0c00088.
Szymkuc S.; Gajewska E. P.; Klucznik T.; Molga K.; Dittwald P.; Startek M.; Bajczyk M.; Grzybowski B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem., Int. Ed. 2016, 55, 5904–5937. 10.1002/anie.201506101.
Segler M.; Preuss M.; Waller M. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555, 604–610. 10.1038/nature25978.
Corey E. J.; Wipke W. T. Computer-assisted Design of Complex Organic Syntheses. Science 1969, 166, 178–192. 10.1126/science.166.3902.178.
Socorro I. M.; Taylor K.; Goodman J. M. ROBIA: A Reaction Prediction Program. Org. Lett. 2005, 7, 3541–3544. 10.1021/ol0512738.
Gensch T.; Teders M.; Glorius F. Approach to Comparing the Functional Group Tolerance of Reactions. J. Org. Chem. 2017, 82, 9154–9159. 10.1021/acs.joc.7b01139.
Schwaller P.; Gaudin T.; Lanyi D.; Bekas C.; Laino T. “Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. Chem. Sci. 2018, 9, 6091–6098. 10.1039/C8SC02339E.
Coley C. W.; Jin W.; Rogers L.; Jamison T. F.; Jaakkola T. S.; Green W. H.; Barzilay R.; Jensen K. F. A Graph-Convolutional Neural Network Model for the Prediction of Chemical Reactivity. Chem. Sci. 2019, 10, 370–377. 10.1039/C8SC04228D.
Baker M. 1,500 Scientists Lift the Lid on Reproducibility. Nature 2016, 533, 452–454. 10.1038/533452a.
Klucznik T.; Mikulak-Klucznik B.; McCormack M. P.; Lima H.; Szymkuc S.; Bhowmick M.; Molga K.; Zhou Y.; Rickershauser L.; Gajewska E. P.; Toutchkine A.; Dittwald P.; Startek M. P.; Kirkovits G. J.; Roszak R.; Adamski A.; Sieredzinska B.; Mrksich M.; Trice S. L. J.; Grzybowski B. A. Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory. Chem. 2018, 4, 522–532. 10.1016/j.chempr.2018.02.002.
Boström J.; Brown D.; Young R.; Keserű G. M. Expanding the Medicinal Chemistry Synthetic Toolbox. Nat. Rev. Drug Discovery 2018, 17, 709–727. 10.1038/nrd.2018.116.
Zhavoronkov A.; Ivanenkov Y. A.; Aliper A.; Veselov M. S.; Aladinskiy V. A.; Aladinskaya A. V.; Terentiev V. A.; Polykovskiy D. A.; Kuznetsov M. D.; Asadulaev A.; Volkov Y.; Zholus A. Z.; Shayakhmetov R. R.; Zhebrak A.; Minaeva L. I.; Zagribelnyy B. A.; Lee L. H.; Soll R.; Madge D.; Xing L.; Guo T.; Aspuru-Guzik A. Deep Learning Enables Rapid Identification of Potent DDR1 Kinase Inhibitors. Nat. Biotechnol. 2019, 37, 1038–1040. 10.1038/s41587-019-0224-x.
ChemAxon. https://docs.chemaxon.com.
Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t.
Grebner C.; Matter H.; Plowright A. T.; Hessler G. Automated De Novo Design in Medicinal Chemistry: Which Types of Chemistry Does a Generative Neural Network Learn? J. Med. Chem. 2020, 63, 8809–8823. 10.1021/acs.jmedchem.9b02044.
Warren G. L.; Andrews C. W.; Capelli A.-M.; Clarke B.; LaLonde J.; Lambert M. H.; Lindvall M.; Nevins N.; Semus S. F.; Senger S.; Tedesco G.; Wall I. D.; Woolven J. M.; Peishoff C. E.; Head M. S. A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912–5931. 10.1021/jm050362n.
Wang Z.; Sun H.; Yao X.; Li D.; Xu L.; Li Y.; Tian S.; Hou T. Comprehensive Evaluation of Ten Docking Programs on a Diverse Set of Protein–ligand Complexes: the Prediction Accuracy of Sampling Power and Scoring Power. Phys. Chem. Chem. Phys. 2016, 18, 12964–12975. 10.1039/C6CP01555G.
Huang N.; Shoichet B. K.; Irwin J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801. 10.1021/jm0608356.
Chen H.; Lyne P. D.; Giordanetto F.; Lovell T.; Li J. On Evaluating Molecular-Docking Methods for Pose Prediction and Enrichment Factors. J. Chem. Inf. Model. 2006, 46, 401–415. 10.1021/ci0503255.
Gao M.; Duan L.; Luo J.; Zhang Z.; Lu X.; Zhang Y.; Zhang Z.; Tu Z.; Xu Y.; Ren X.; Ding K. Discovery and Optimization of 3-(2-(Pyrazolo[1,5-a]pyrimidin-6-yl)ethynyl)benzamides as Novel Selective and Orally Bioavailable Discoidin Domain Receptor 1 (DDR1) Inhibitors. J. Med. Chem. 2013, 56, 3281–3295. 10.1021/jm301824k.
Hanson S. M.; Georghiou G.; Thakur M. K.; Miller W. T.; Rest J. S.; Chodera J. D.; Seeliger M. A. What Makes a Kinase Promiscuous for Inhibitors? Cell Chem. Biol. 2019, 26, 390–399. 10.1016/j.chembiol.2018.11.005.
Van Der Maaten J. J. P.; Hinton G. E. Visualizing High-dimensional Data Using T-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.
Walters W. P.; Murcko M. Assessing the Impact of Generative AI on Medicinal Chemistry. Nat. Biotechnol. 2020, 38, 143–145. 10.1038/s41587-020-0418-2.
Zhavoronkov A.; Aspuru-Guzik A. Reply to ‘Assessing the Impact of Generative AI on Medicinal Chemistry’. Nat. Biotechnol. 2020, 38, 146. 10.1038/s41587-020-0417-3.
Richter H.; Satz A. L.; Bedoucha M.; Buettelmann B.; Petersen A. C.; Harmeier A.; Hermosilla R.; Hochstrasser R.; Burger D.; Gsell B.; Gasser R.; Huber S.; Hug M. N.; Kocer B.; Kuhn B.; Ritter M.; Rudolph M. G.; Weibel F.; Molina-David J.; Kim J.-J.; Santos J. V.; Stihle M.; Georges G. J.; Bonfil R. D.; Fridman R.; Uhles S.; Moll S.; Faul C.; Fornoni A.; Prunotto M. DNA-Encoded Library-Derived DDR1 Inhibitor Prevents Fibrosis and Renal Function Loss in a Genetic Mouse Model of Alport Syndrome. ACS Chem. Biol. 2019, 14, 37–49. 10.1021/acschembio.8b00866.
Griffen E. J.; Dossetter A. G.; Leach A. G. Chemists: AI Is Here; Unite To Get the Benefits. J. Med. Chem. 2020, 63, 8695–8704. 10.1021/acs.jmedchem.0c00163.
Schneider P.; Walters W. P.; Plowright A. T.; Sieroka N.; Listgarten J.; Goodnow R. A. Jr.; Fisher J.; Jansen J. M.; Duca J. S.; Rush T. S.; Zentgraf M.; Hill J. E.; Krutoholow E.; Kohler M.; Blaney J.; Funatsu K.; Luebkemann C.; Schneider G. Rethinking Drug Design in the Artificial Intelligence Era. Nat. Rev. Drug Discovery 2020, 19, 353–364. 10.1038/s41573-019-0050-3.
Reker D.; Schneider G. Active-learning Strategies in Computer-assisted Drug Discovery. Drug Discovery Today 2015, 20, 458–465. 10.1016/j.drudis.2014.12.004.
Cummins D. J.; Bell M. A. Integrating Everything: The Molecule Selection Toolkit, a System for Compound Prioritization in Drug Discovery. J. Med. Chem. 2016, 59, 6999–7010. 10.1021/acs.jmedchem.5b01338.
Ertl P.; Schuffenhauer A. Estimation of Synthetic Accessibility Score of Drug-Like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.
Coley C. W.; Rogers L.; Green W. H.; Jensen K. F. SCScore: Synthetic Complexity Learned from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261. 10.1021/acs.jcim.7b00622.
Bradshaw J.; Paige B.; Kusner M. J.; Segler M. H.; Hernández-Lobato J. M. A Model to Search for Synthesizable Molecules. arXiv (Machine Learning), December 4, 2019, 1906.05221, ver. 2. https://arxiv.org/abs/1906.05221v2.
Korovina K.; Xu S.; Kandasamy K.; Neiswanger W.; Poczos B.; Schneider J.; Xing E. P. ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations. arXiv (Machine Learning), October 22, 2019, 1908.01425, ver. 2. https://arxiv.org/abs/1908.01425v2.
Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 60 (12), 5714–5723. 10.1021/acs.jcim.0c00174.
Daylight SMIRKS: A Reaction Transform Language; Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ml0c00540_si_001.pdf^{(515.8KB, pdf)}

[ref1] Hoffmann T.; Gastreich M. The Next Level in Chemical Space Navigation: Going Far Beyond Enumerable Compound Libraries. Drug Discovery Today 2019, 24, 1148–1156. 10.1016/j.drudis.2019.02.013. [DOI] [PubMed] [Google Scholar]

[ref2] Yuan Y.; Pei J.; Lai L. LigBuilder 2: A Practical de Novo Drug Design Approach. J. Chem. Inf. Model. 2011, 51, 1083–1091. 10.1021/ci100350u. [DOI] [PubMed] [Google Scholar]

[ref3] Langer T.; Hoffmann R. D. Pharmacophore Modelling: Applications in Drug Discovery. Expert Opin. Drug Discovery 2006, 1, 261–267. 10.1517/17460441.1.3.261. [DOI] [PubMed] [Google Scholar]

[ref4] de Souza Neto L. R.; Moreira-Filho J. T.; Neves B. J.; Riveros Maidana R. L. B.; Ramos Guimaraes A. C.; Furnham N.; Andrade C. H.; Silva F. P. In silico Strategies to Support Fragment-to-Lead Optimization in Drug Discovery. Front. Chem. 2020, 8, 1–18. 10.3389/fchem.2020.00093. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref5] Hu Y.; Stumpfe D.; Bajorath J. Recent Advances in Scaffold Hopping. J. Med. Chem. 2017, 60, 1238–1246. 10.1021/acs.jmedchem.6b01437. [DOI] [PubMed] [Google Scholar]

[ref6] Langdon S. R.; Brown N.; Blagg J. Scaffold Diversity of Exemplified Medicinal Chemistry Space. J. Chem. Inf. Model. 2011, 51, 2174–2185. 10.1021/ci2001428. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref7] Rabal O.; Amr F. I.; Oyarzabal J. Novel Scaffold Fingerprint (SFP): Applications in Scaffold Hopping and Scaffold-Based Selection of Diverse Compounds. J. Chem. Inf. Model. 2015, 55, 1–18. 10.1021/ci500542e. [DOI] [PubMed] [Google Scholar]

[ref8] Vanhaelen Q.; Lin Y.-C.; Zhavoronkov A. The Advent of Generative Chemistry. ACS Med. Chem. Lett. 2020, 11, 1496–1505. 10.1021/acsmedchemlett.0c00088. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref9] Szymkuc S.; Gajewska E. P.; Klucznik T.; Molga K.; Dittwald P.; Startek M.; Bajczyk M.; Grzybowski B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem., Int. Ed. 2016, 55, 5904–5937. 10.1002/anie.201506101. [DOI] [PubMed] [Google Scholar]

[ref10] Segler M.; Preuss M.; Waller M. Planning Chemical Syntheses with Deep Neural Networks and Symbolic AI. Nature 2018, 555, 604–610. 10.1038/nature25978. [DOI] [PubMed] [Google Scholar]

[ref11] Corey E. J.; Wipke W. T. Computer-assisted Design of Complex Organic Syntheses. Science 1969, 166, 178–192. 10.1126/science.166.3902.178. [DOI] [PubMed] [Google Scholar]

[ref12] Socorro I. M.; Taylor K.; Goodman J. M. ROBIA: A Reaction Prediction Program. Org. Lett. 2005, 7, 3541–3544. 10.1021/ol0512738. [DOI] [PubMed] [Google Scholar]

[ref13] Gensch T.; Teders M.; Glorius F. Approach to Comparing the Functional Group Tolerance of Reactions. J. Org. Chem. 2017, 82, 9154–9159. 10.1021/acs.joc.7b01139. [DOI] [PubMed] [Google Scholar]

[ref14] Schwaller P.; Gaudin T.; Lanyi D.; Bekas C.; Laino T. Found in Translation”: Predicting Outcomes of Complex Organic Chemistry Reactions using Neural Sequence-to-Sequence Models. Chem. Sci. 2018, 9, 6091–6098. 10.1039/C8SC02339E. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref15] Coley C. W.; Jin W.; Rogers L.; Jamison T. F.; Jaakkola T. S.; Green W. H.; Barzilay R.; Jensen K. F. A Graph-Convolutional Neural Network Model for the Prediction of Chemical Reactivity. Chem. Sci. 2019, 10, 370–377. 10.1039/C8SC04228D. [DOI] [PMC free article] [PubMed] [Google Scholar]

[ref16] Baker M. 1,500 Scientists Lift the Lid on Reproducibility. Nature 2016, 533, 452–454. 10.1038/533452a. [DOI] [PubMed] [Google Scholar]

[ref17] Klucznik T.; Mikulak-Klucznik B.; McCormack M. P.; Lima H.; Szymkuc S.; Bhowmick M.; Molga K.; Zhou Y.; Rickershauser L.; Gajewska E. P.; Toutchkine A.; Dittwald P.; Startek M. P.; Kirkovits G. J.; Roszak R.; Adamski A.; Sieredzinska B.; Mrksich M.; Trice S. L. J.; Grzybowski B. A. Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory. Chem. 2018, 4, 522–532. 10.1016/j.chempr.2018.02.002. [DOI] [Google Scholar]

[ref18] Boström J.; Brown D.; Young R.; Keserű G. M. Expanding the Medicinal Chemistry Synthetic Toolbox. Nat. Rev. Drug Discovery 2018, 17, 709–727. 10.1038/nrd.2018.116. [DOI] [PubMed] [Google Scholar]

[ref19] Zhavoronkov A.; Ivanenkov Y. A.; Aliper A.; Veselov M. S.; Aladinskiy V. A.; Aladinskaya A. V.; Terentiev V. A.; Polykovskiy D. A.; Kuznetsov M. D.; Asadulaev A.; Volkov Y.; Zholus A. Z.; Shayakhmetov R. R.; Zhebrak A.; Minaeva L. I.; Zagribelnyy B. A.; Lee L. H.; Soll R.; Madge D.; Xing L.; Guo T.; Aspuru-Guzik A. Deep Learning Enables Rapid Identification of Potent DDR1 Kinase Inhibitors. Nat. Biotechnol. 2019, 37, 1038–1040. 10.1038/s41587-019-0224-x. [DOI] [PubMed] [Google Scholar]

[ref20] ChemAxon. https://docs.chemaxon.com.

[ref21] Rogers D.; Hahn M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754. 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]

[ref22] Grebner C.; Matter H.; Plowright A. T.; Hessler G. Automated De Novo Design in Medicinal Chemistry: Which Types of Chemistry Does a Generative Neural Network Learn?. J. Med. Chem. 2020, 63, 8809–8823. 10.1021/acs.jmedchem.9b02044. [DOI] [PubMed] [Google Scholar]

[ref23] Warren G. L.; Andrews C. W.; Capelli A.-M.; Clarke B.; LaLonde J.; Lambert M. H.; Lindvall M.; Nevins N.; Semus S. F.; Senger S.; Tedesco G.; Wall I. D.; Woolven J. M.; Peishoff C. E.; Head M. S. A Critical Assessment of Docking Programs and Scoring Functions. J. Med. Chem. 2006, 49, 5912–5931. 10.1021/jm050362n.

[ref24] Wang Z.; Sun H.; Yao X.; Li D.; Xu L.; Li Y.; Tian S.; Hou T. Comprehensive Evaluation of Ten Docking Programs on a Diverse Set of Protein–Ligand Complexes: The Prediction Accuracy of Sampling Power and Scoring Power. Phys. Chem. Chem. Phys. 2016, 18, 12964–12975. 10.1039/C6CP01555G.

[ref25] Huang N.; Shoichet B. K.; Irwin J. J. Benchmarking Sets for Molecular Docking. J. Med. Chem. 2006, 49, 6789–6801. 10.1021/jm0608356.

[ref26] Chen H.; Lyne P. D.; Giordanetto F.; Lovell T.; Li J. On Evaluating Molecular-Docking Methods for Pose Prediction and Enrichment Factors. J. Chem. Inf. Model. 2006, 46, 401–415. 10.1021/ci0503255.

[ref28] Gao M.; Duan L.; Luo J.; Zhang Z.; Lu X.; Zhang Y.; Zhang Z.; Tu Z.; Xu Y.; Ren X.; Ding K. Discovery and Optimization of 3-(2-(Pyrazolo[1,5-a]pyrimidin-6-yl)ethynyl)benzamides as Novel Selective and Orally Bioavailable Discoidin Domain Receptor 1 (DDR1) Inhibitors. J. Med. Chem. 2013, 56, 3281–3295. 10.1021/jm301824k.

[ref27] Hanson S. M.; Georghiou G.; Thakur M. K.; Miller W. T.; Rest J. S.; Chodera J. D.; Seeliger M. A. What Makes a Kinase Promiscuous for Inhibitors? Cell Chem. Biol. 2019, 26, 390–399. 10.1016/j.chembiol.2018.11.005.

[ref29] Van Der Maaten L. J. P.; Hinton G. E. Visualizing High-dimensional Data Using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605.

[ref30] Walters W. P.; Murcko M. Assessing the Impact of Generative AI on Medicinal Chemistry. Nat. Biotechnol. 2020, 38, 143–145. 10.1038/s41587-020-0418-2.

[ref31] Zhavoronkov A.; Aspuru-Guzik A. Reply to ‘Assessing the Impact of Generative AI on Medicinal Chemistry’. Nat. Biotechnol. 2020, 38, 146. 10.1038/s41587-020-0417-3.

[ref32] Richter H.; Satz A. L.; Bedoucha M.; Buettelmann B.; Petersen A. C.; Harmeier A.; Hermosilla R.; Hochstrasser R.; Burger D.; Gsell B.; Gasser R.; Huber S.; Hug M. N.; Kocer B.; Kuhn B.; Ritter M.; Rudolph M. G.; Weibel F.; Molina-David J.; Kim J.-J.; Santos J. V.; Stihle M.; Georges G. J.; Bonfil R. D.; Fridman R.; Uhles S.; Moll S.; Faul C.; Fornoni A.; Prunotto M. DNA-Encoded Library-Derived DDR1 Inhibitor Prevents Fibrosis and Renal Function Loss in a Genetic Mouse Model of Alport Syndrome. ACS Chem. Biol. 2019, 14, 37–49. 10.1021/acschembio.8b00866.

[ref33] Griffen E. J.; Dossetter A. G.; Leach A. G. Chemists: AI Is Here; Unite To Get the Benefits. J. Med. Chem. 2020, 63, 8695–8704. 10.1021/acs.jmedchem.0c00163.

[ref34] Schneider P.; Walters W. P.; Plowright A. T.; Sieroka N.; Listgarten J.; Goodnow R. A. Jr.; Fisher J.; Jansen J. M.; Duca J. S.; Rush T. S.; Zentgraf M.; Hill J. E.; Krutoholow E.; Kohler M.; Blaney J.; Funatsu K.; Luebkemann C.; Schneider G. Rethinking Drug Design in the Artificial Intelligence Era. Nat. Rev. Drug Discovery 2020, 19, 353–364. 10.1038/s41573-019-0050-3.

[ref35] Reker D.; Schneider G. Active-learning Strategies in Computer-assisted Drug Discovery. Drug Discovery Today 2015, 20, 458–465. 10.1016/j.drudis.2014.12.004.

[ref36] Cummins D. J.; Bell M. A. Integrating Everything: The Molecule Selection Toolkit, a System for Compound Prioritization in Drug Discovery. J. Med. Chem. 2016, 59, 6999–7010. 10.1021/acs.jmedchem.5b01338.

[ref37] Ertl P.; Schuffenhauer A. Estimation of Synthetic Accessibility Score of Drug-Like Molecules Based on Molecular Complexity and Fragment Contributions. J. Cheminf. 2009, 1, 8. 10.1186/1758-2946-1-8.

[ref38] Coley C. W.; Rogers L.; Green W. H.; Jensen K. F. SCScore: Synthetic Complexity Learned from a Reaction Corpus. J. Chem. Inf. Model. 2018, 58, 252–261. 10.1021/acs.jcim.7b00622.

[ref39] Bradshaw J.; Paige B.; Kusner M. J.; Segler M. H.; Hernández-Lobato J. M. A Model to Search for Synthesizable Molecules. arXiv (Machine Learning), December 4, 2019, 1906.05221, ver. 2. https://arxiv.org/abs/1906.05221v2.

[ref40] Korovina K.; Xu S.; Kandasamy K.; Neiswanger W.; Poczos B.; Schneider J.; Xing E. P. ChemBO: Bayesian Optimization of Small Organic Molecules with Synthesizable Recommendations. arXiv (Machine Learning), October 22, 2019, 1908.01425, ver. 2. https://arxiv.org/abs/1908.01425v2.

[ref41] Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 60 (12), 5714–5723. 10.1021/acs.jcim.0c00174.

[ref42] Daylight SMIRKS: A Reaction Transform Language; Daylight Chemical Information Systems, Inc. http://www.daylight.com/dayhtml/doc/theory/theory.smirks.html.
