Abstract

Generative machine learning models have become widely adopted in drug discovery and other fields to produce new molecules and explore molecular space, with the goal of discovering novel compounds with optimized properties. These generative models are frequently combined with transfer learning or scoring of the physicochemical properties to steer generative design, yet they often cannot address a wide variety of potential problems and tend to converge into similar molecular space when combined with a scoring function for the desired properties. In addition, the generated compounds may not be synthetically feasible, limiting their usefulness in real-world scenarios. Here, we introduce a suite of automated tools called MegaSyn comprising three components: a new hill-climb algorithm that makes use of SMILES-based recurrent neural network (RNN) generative models; analog generation software; and retrosynthetic analysis coupled with fragment analysis to score molecules for their synthetic feasibility. We show that by deconstructing the targeted molecules and focusing on substructures, combined with an ensemble of generative models, MegaSyn generally performs well for the specific tasks of generating new scaffolds as well as targeted analogs, which are likely synthesizable and druglike. We describe the development, benchmarking, and testing of this suite of tools and propose, through multiple test cases, how they might be used to optimize molecules or prioritize promising lead compounds.
Introduction
We (as well as many other research groups and companies) have regularly used machine learning models to propose molecules for testing and then validated them in vitro with vendor available molecules as a first step.1−3 However, to optimize bioactivity of any hit molecules obtained or maintain activity with improved absorption, distribution, metabolism, excretion, and toxicity (ADME/tox) properties, vendor available compounds may not be sufficient. The most desirable chemical modifications for analogs are rarely available, and thus ways to generate and explore novel molecules are required.
In recent years, generative models have become commonly used as part of the design–make–test cycle4 to produce molecules de novo(5,6) and this field has been reviewed by many others.7−9 These generative models have come from several different architectures [e.g., recurrent neural networks (RNNs),6 variational autoencoder (VAE),10 and generative adversarial networks (GAN)]11 and have been shown to generate valid, novel molecules in the same chemical space as their training sets, with desirable physicochemical properties.12−15 Molecular representation is varied in such generative models, including SMILES, and more recently molecule trees and SELFIES, both of which have enjoyed success in producing 100% valid molecule strings.16,17 We chose the SMILES representation as our basis, as it has seen widespread success and is favored due to the simplicity and ease of molecular representation.18 However, many of these generative models have enjoyed limited success in real-world drug discovery projects due to their narrow range of capabilities and a lack of defined work pipelines for distinct generative tasks. One issue is that the focus of drug discovery projects may be varied, and a single generative design process would likely not work for all scenarios. For instance, in one project, a lead molecule scaffold may require an iterative design to find the most suitable analog, and thus, the generative model employed should only enumerate on the core structure. Conversely, in another case given a set of known active and inactive compounds against a target, the project may wish to discover entirely new scaffolds that do not exist in “patent space” yet have similar desired molecular properties to the known active compounds. While most generative models can utilize the desired physicochemical properties in the training of the generative models, in practice, the goals are often not achievable using generic, out-of-the-box generative models. 
This lowers the practical utility of generative models, as currently proposed. To control the “closeness” of the generated compounds to a molecule of interest, the Tanimoto similarity score19 is often included in the training. Even though the diversity of molecules generated is considerable, generative models retrained on the same parameters often end up in similar local minima of chemistry property space, reducing their usefulness past an initial run.20
To address these and other limitations, we have now created MegaSyn, a suite of algorithms, which takes a similar approach to weak learner ensemble methods such as Random Forests.21 MegaSyn modifies the approach of generative models in two ways. First, instead of training one generative model, MegaSyn trains many “weaker” generative models, starting from a generic model trained on a larger library, such as the druglike library (ChEMBL),22 and iteratively produces generative models that are continuously “focused” down onto target molecule(s) and physicochemical properties of interest. To improve diversity, random branch models are produced from each of these focused generative model nodes until multiple generative models have been created that explore many local minima. While we chose ChEMBL as a database of druglike molecules, the generic model could be trained on other sets of molecules such as NPASS for natural products.23 Second, for any proposed target molecule(s) of interest, MegaSyn first deconstructs target molecules into several substructures along with modifications, which we find improves the ability to discover analogs of interest. As we will illustrate, depending on the desired outcome (i.e., completely new drug scaffolds or enumeration on a common structural core), MegaSyn allows flexibility and balance in the exploration of chemical property space versus focused generative capabilities by traversing the “tree” of generative models based on the desired outcome.
Generating novel molecules is important, but it is also key to evaluate proposed structures for synthesizability and to suggest synthetic pathways for the compounds.8 The technologies involved in proposing, evaluating, planning, and assessing the synthetic feasibility of compound syntheses have been available within the cheminformatics industry for decades,24−28 but their implementation remains relatively state of the art (reviewed recently).29 For example, the earliest efforts in synthesis planning, reaction prediction, and synthetic feasibility assessment developed rule-based approaches such as LHASA, CAMEO, and CAESA (as reviewed in ref (30)). These software tools require collaboration between the chemist and machine to get the best out of their relatively limited functionality.30
In recent years, we have seen considerable development in computer-aided synthesis planning with the collection of tens of thousands of manually curated reaction transformation rules to yield millions of chemical reactions as a network in Chematica,31 which can be used to select the most cost-effective or chemically diverse synthetic pathways.32 While the manual collection of such rules is not scalable, there has also been a shift to use machine learning approaches. One approach has used deep neural networks trained on 3.5 million reactions from the Reaxys database with the extended connectivity fingerprints (ECFP4).33 Another approach has used 15,000 reactions from the USPTO augmented by a set of over 5 million reactions with nonrecorded products to train a neural network.25 Others have developed proof-of-concept tools that they suggest are not ready for practical use such as CompRet, which enumerates a chemical reaction network based on a depth-first proof number search, enumerating all synthetic routes and then recommending synthetic routes using simple scoring functions.34 A template-free self-corrected retrosynthesis predictor was built using a transformer neural network architecture, which improved on prior accuracy rates using the USPTO-50K set.35 Scientists at Pfizer have also demonstrated that a transformer-based retrosynthesis model generated with public USPTO training data could predict over 147,000 reactions from Pfizer electronic notebooks with a top-1 accuracy of 69%, and this number increases with their own data used in training.36 A more recent use of the transformer architecture using transfer learning for retrosynthesis prediction with literature data demonstrated a top-1 accuracy up to 60.7%.37 Various methods have also been used to predict synthetic accessibility such as using the probability of existence of substructures for the compound in question along with the number of symmetry atoms, graph complexity, and number of chiral centers.38 Clearly, these examples are an abbreviated selection and more recent open-source software for retrosynthetic planning includes AiZynthFinder,39 LillyMol,40 ASKCOS,25 and RENATE.41 Comparison of such methods is very limited and has only recently been reported,42 as such efforts require the synthesis of compounds using proposed routes obtained with each method.39
We now describe a generative model that is flexible to address the needs of many drug discovery projects, as well as prototype Pipeline Pilot protocols for automated lead expansion, filtration of analogs, and selection of a representative set that is user-accessible. Because the molecules are generated in an automated fashion, some of the molecules may be difficult or impractical to pursue from a synthetic chemistry perspective. Thus, we also created an automated tool to predict the relative difficulty of synthesis for targeted analog molecules utilizing automated retrosynthetic analysis coupled with a fragment analysis to score molecules for their synthetic feasibility. We have evaluated these tools using a set of FDA-approved drugs as well as a recently published set of natural products.43 We also provide several test cases that recapitulate a recently described known analog of ibogaine44 and analogs of lapatinib with improved predicted properties45 using MegaSyn. These examples show that MegaSyn can virtually generate synthetically feasible compounds with the desired druglike properties.
Experimental Section
To aid in outlining the current study and the various components, we have developed an overview (Figure 1).
Figure 1.
Overview diagram of this study and mapping to figures and supplemental figures to orient the reader.
Activity Models for MegaSyn
All activity models consisted of naïve Bayes-trained models using the scikit-learn package in Python. Data sets were acquired for each target (e.g., HER1, HER2) from ChEMBL target activities except for the blood–brain-barrier (BBB) data set, which was acquired from Wang et al.46 All activities were binarized according to an activity threshold, with 1 indicating active and 0 indicating inactive (Table S1). For input to each model, ECFP6 fingerprints were calculated for each molecule and folded down to a 1024-bit feature vector. Each model was trained and calibrated using isotonic regression with 3-fold cross-validation, from which all statistics were generated. The calibrated prediction scores for each activity model were used as input into the composite score (defined below) when used for training a MegaSyn model.
Evaluation of Variational Autoencoder, Generative Adversarial Networks, and Recurrent Neural Networks
As input, molecules are represented as tokenized SMILES strings. Briefly, each SMILES is tokenized, and each token is represented in a vocabulary (e.g., “c”, “[nH]”, “1”, “=”). Each token in the vocabulary has a corresponding numerical representation (e.g., all “c” are represented by the number 1, all “=” are represented by the number 2, etc.). SMILES are encoded by their integer vocabulary representation and padded to the longest sequence length with zeroes, which were masked during training. Beyond this, several differences exist between models during training.
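The encoding steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the regular expression, the alphabetical ordering of the vocabulary, and the names `tokenize`, `build_vocab`, and `encode` are assumptions made for the example. Only the reservation of 0 for padding follows the text.

```python
import re

# Assumed tokenizer: bracket atoms, two-letter halogens, two-digit ring
# closures (%NN), then any remaining single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

PAD_ID = 0  # 0 is reserved for padding, which is masked during training


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the data to an integer starting at 1."""
    tokens = sorted({tok for smi in smiles_list for tok in tokenize(smi)})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


def encode(smiles_list, vocab):
    """Integer-encode each SMILES and right-pad to the longest sequence."""
    encoded = [[vocab[tok] for tok in tokenize(smi)] for smi in smiles_list]
    max_len = max(len(seq) for seq in encoded)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in encoded]


smiles = ["c1cc[nH]c1", "C=CCl"]
vocab = build_vocab(smiles)
padded = encode(smiles, vocab)
```

Here “[nH]” and “Cl” are each kept as a single token, and the shorter sequence is filled with zeroes up to the length of the longer one.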
Variational autoencoder (VAE): the variational autoencoder utilizes an encoder–decoder architecture to map chemical space into a latent vector.10 The encoder is composed of three LSTM layers of 512 units each followed by a linear layer of 64 units (the latent space). Our decoder is composed of three LSTM layers of 512 units each with a dropout of 0.2 between all layers. We used KL divergence as our loss term with an Adam optimizer (learning rate = 0.0001),47 patience = 10, 200 epochs, and a batch size of 64.
Generative adversarial networks (GANs): we implemented a latentGAN11 architecture for our generative GAN model. Wasserstein GAN48 with gradient penalty was utilized for the GAN model. The heteroencoder was comprised of three LSTM layers of 512 units each with a final linear layer of 64 units (the latent space), while the decoder was comprised of three LSTM layers of 512 units each followed by a linear layer with softmax activation to return the probability of each character in the vocabulary. The autoencoder was trained for 100 epochs with a batch size = 128 and an Adam optimizer with a learning rate = 0.0001 using teacher forcing. The discriminator of the GAN was formed by three linear layers of 256 hidden units each with ReLU activation49 between each layer (except for the last layer). The generator consists of five linear layers of 512 hidden units each with batch normalization = 0.9 and leaky ReLU activation between each layer.
The autoencoder was pretrained using the ChEMBL data set followed by training of the full GAN model.
Recurrent neural networks (RNNs):6 each LSTM-based model is composed of an embedding layer, three LSTM layers (512 hidden units), followed by a linear layer with softmax activation the size of a vocabulary generated from the training data.
MegaSyn Design
MegaSyn was implemented using PyTorch 1.7.1. Each LSTM-based model in the ensemble uses the architecture as described unless noted differently below. Each model is composed of an embedding layer, three LSTM layers (512 hidden units), followed by a linear layer with softmax activation the size of a vocabulary generated from the training data. As input for all models, molecules are represented as tokenized SMILES strings. MegaSyn is composed of three distinct model types: the initial pretrained model, a set of primed models, and finally a set of exploratory models.
Initial Model
The initial model is trained on ChEMBL 28’s ∼2 million compounds.22 Briefly, a batch of 60,000 random ChEMBL 28 SMILES was taken for each training epoch. The loss function for a sequence of encoded SMILES is the negative log-likelihood. The model uses an Adam optimizer with a learning rate = 0.002. Teacher forcing is used to expedite training of the generative model.
Primed Models
For each set of primed models, the initial model is trained for n epochs, with a new agent model saved every two epochs. The target molecule(s) of interest are broken down into substructures based on RECAP rules. Simplified carbon-only versions of these substructures and the original molecule are also generated. The initial model is trained on this set of structures and substructures alone, using the same parameters as the initial model (described above) and teacher forcing. Every i epochs, the model is saved until a set of n primed models has been created. These primed models are also scored to include only druglike molecules, evaluated using a QED score.50 The nature of the hill-climb maximum likelihood estimation (MLE) allows only correctly produced structures to influence the model, thereby restricting the molecules generated to a druglike small-molecule space. We find that 16 total epochs with a model saved after every two epochs represent the gradient of general to diverse reasonably well for a number of target molecules, and thus all models were trained to 16 total epochs unless otherwise specified.
Exploration Models
For each primed model, de novo molecules are generated. The generated molecules are then ranked based on a composite score from any number of criteria. The composite score is represented as below.
For each criterion i in the composite score (e.g., predicted target activity or druglikeness (QED) score),50 the composite score is defined as
$$\text{composite score} = \sum_{i} \log\left(\frac{x_i}{y_i}\right)$$
where x_i is the ith score for the molecule and y_i is the ith desired score. Each score is assumed to be bound between 0 and 1, with 1 being the desired property (in case of the inverse, the score is taken as 1 – y_i). This score usually includes QED, model prediction scores against the target (target model), and any other desired scores, e.g., the average Tanimoto distance from a desired library. This design ensures that higher scores (closer to 1) are always desirable, and thus composite scores greater than the “target” score are always positive, and composite scores below the target score are always negative. We bound the scoring range to [0,1] as the majority of the individual scores are already in the range of [0,1] such as prediction scores, QED, and Tanimoto distance. We chose the sum-of-logs to penalize molecules for not meeting or exceeding all criteria. This prevents, for example, a nontoxic molecule with a high QED score but no predicted activity of interest from dominating the generative space. The scoring function is general: if a molecular property score between 0 and 1 can be assigned to a compound, it can be included in the final composite score, which broadens the range of tasks to which the generative model can be applied. The top 10% of ranked compounds are kept and fed back into the model for training using negative log-likelihood (NLL) and teacher forcing, a training concept called MLE. A new set of molecules is generated after training, and the cycle continues for the specified number of epochs. Importantly, the top 10% of compounds are kept from one epoch to the next; only if a newly generated compound has a score higher than one in the current top 10% list does it replace one in the set. This hill-climb MLE constrains the model to only be trained on molecules generated from the druglike space of the ChEMBL library in this case, restricting the model to generate only molecules with the desirable physicochemical properties.
Eventually, the model will find a substructure minimum and is then capable of generating analogs of this specific substructure. Often, based on the initial seed molecules of the very first iteration, the model will converge to one local minimum. At least four models are trained and generated from each primed model node to obtain models that focus on different substructures of the original target molecule. The top-scoring 10% of compounds found over the training loop for each model are kept.
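The scoring and retention steps described above can be illustrated with a short Python sketch. It assumes the composite score is a sum of log(x_i / y_i) over the criteria, which is one reading of the sum-of-logs description. The `eps` floor, the function names, the toy molecules, and the target values are all hypothetical and are not taken from MegaSyn itself.

```python
import math


def composite_score(scores, targets, eps=1e-6):
    """Sum of log(x_i / y_i) over all criteria.

    Assumption: eps is a small floor added here only to avoid log(0).
    """
    return sum(
        math.log(max(x, eps) / max(y, eps)) for x, y in zip(scores, targets)
    )


def update_elite(elite, generated, score_fn, n_keep):
    """Carry the best-scoring molecules from one epoch to the next.

    `elite` maps molecule -> composite score. A new molecule enters the
    set only if it outscores a current member. In the text, n_keep would
    correspond to 10% of the molecules generated per epoch.
    """
    pool = dict(elite)
    for mol in generated:
        if mol not in pool:
            pool[mol] = score_fn(mol)
    ranked = sorted(pool.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n_keep])


# Hypothetical (QED, predicted activity) pairs and hypothetical targets.
toy_scores = {
    "mol_a": (0.9, 0.9),  # druglike and predicted active
    "mol_b": (0.9, 0.1),  # druglike but predicted inactive
    "mol_c": (0.5, 0.5),  # mediocre on both criteria
}
targets = (0.7, 0.7)


def score(mol):
    return composite_score(toy_scores[mol], targets)


elite = update_elite({}, list(toy_scores), score, n_keep=1)
```

In this toy case the molecule with a high QED but low predicted activity (`mol_b`) scores below the uniformly mediocre one (`mol_c`), which is the behavior the sum-of-logs is meant to enforce.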
Care must be taken when generating molecules “far” from the initial training data that drives the models. It is assumed that chemical structure similarity should correlate with the uncertainty of the model: that the “closer” a structure of a molecule is to the training set, the more likely the model is to be correct. Often, an applicability domain (AD) score is applied, based on a distance metric, to reflect this.51−53 A distinction can be made, however, between the distance from the training set and the distance from the decision boundary of a model, which are two distinct measures: a molecule may be structurally far from the training set, yet if it is far from the decision boundary of the model, it may still be accurately predicted.42 Although the distance from the training data can track reliably with correct model predictions, class probabilities as produced by machine learning models were stronger predictors of misclassification.42 Therefore, weighting the prediction score heavily in our generative model likely represents a more reliable measure of applicability domain for generated molecules.
Automated Analog Generation
Lead Expansion/Enumeration
We have developed a Pipeline Pilot (Biovia, San Diego, version 19.1.0.1964)54 protocol for automated lead expansion, filtration of analogs, and selection of a representative set. For lead expansion, we encoded several different medicinal chemistry strategies to generate potential analogs. Included in these strategies are classical bioisosteric replacement and similarity “bioisosteres” (for which Pipeline Pilot components already exist).55−60 The classical bioisosteres include the replacement of several common functional groups with sterically similar functional groups believed to have similar physicochemical effects in a biological environment. Similarity bioisosteres locate fragments within molecules and replace them with similar fragments based on a user-specified similarity measure (e.g., FCFP_6 and PHFC_2). Another strategy involves the enumeration of heteroatomic regioisomers: heteroatoms are identified and relocated to every possible position within the molecule.61 Finally, several molecular transformations (37 aromatic/phenyl replacement, 2 conformational restriction/expansion, 92 Topliss, 8 Magic methyl) have been encoded to identify modification sites on molecules and automate the enumeration of analogs using common medicinal chemistry approaches.62−64 These approaches include Topliss, Magic Methyl, conformational restriction/relaxation, and ring expansion/contraction. The user can select or deselect the different transformation categories as desired.
The transformations were carried out using the “Perform Reaction from Tag” or “Perform Reaction on Each Molecule” components with the “IfMultipleReactionsPossible” parameter set to “Perform Each Reaction”. An organic filter was applied to remove any transformation products that contain inorganic molecules (under the assumption that these would not be of interest to medicinal chemists). Transformation categories may be selected/deselected. Retrosynthetic analysis (described below) may be selected or deselected. Input is an SD file.
The “Perform Reactions on Each Molecule” component is used for retrosynthetic analysis in a (run to completion) subprotocol with the IfMultipleReactionsPossible parameter set to “Perform All Reactions”. Each successful round of retrosynthesis is saved in a Pipeline Pilot Cache using a “Cache Writer” component with default parameter settings.
Tagging and Scoring
Using these techniques, tens to thousands of analogs are generated for a typical lead molecule, depending on its complexity. These molecules are then examined for any undesirable functional groups such as reactive functional groups and toxicophores.65 Molecules with any of these features are tagged (and can be removed later as desired). The molecules are then scored for synthetic feasibility, using a newly developed algorithm (see below). The molecules are then clustered using FCFP_4 fingerprints, so that a diverse set can be selected if desired. The canonical tautomer is generated for each molecule, and duplicate molecules are removed.
Selection
After the analogs are enumerated, tagged, and scored, the resulting analogs are displayed in a graphical and tabular format. Categorical and numeric charts, such as pie charts and histograms, are then generated along with a tabular output in Pipeline Pilot. The charts and tabular output are linked together such that the user can select subsets of molecules and export them readily.
Automated Retrosynthetic Analysis and Synthetic Feasibility Prediction
Three primary methodologies were used to evaluate synthetic feasibility. The first method involves the fragmentation of known (synthesized) molecules and the relative presence of those fragments in targets. The second method couples automated retrosynthesis with the first method. In addition to using automated retrosynthesis to rate synthetic feasibility, a separate application for retrosynthetic analysis was created. Finally, a weighting mechanism was added to penalize molecular elements that are undesirable from a synthetic perspective.
Fragmentation
To create the fragments used in the first method, two molecular sources were used: eMolecules,66 consisting of 26,400,125 molecules (at the time of download), and ChEMBL version 24, consisting of 1,820,035 molecules.67 Each source was fragmented separately in Pipeline Pilot using the Generate Fragments component. Specifically, ring assemblies (contiguous ring systems), BridgeAssemblies (contiguous ring systems that share two or more bonds), Chains (contiguous atoms not in rings), and BemisMurcko assemblies were generated.68 Canonical SMILES were generated for each fragment. Fragments containing fewer than two atoms were filtered out. Fragments that occurred more than 10 times over the entire molecule set were retained, along with their frequency (occurrence count). Each unique fragment and its frequency were saved in comma-separated files for each source.
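The counting and filtering step can be sketched as follows. The fragmentation itself is performed in Pipeline Pilot, so this example assumes the fragments are already available as canonical SMILES, one list per molecule, with fragments of fewer than two atoms removed. The function names and CSV column headers are illustrative only.

```python
import csv
from collections import Counter


def count_fragments(fragment_lists, min_count=10):
    """Count fragment occurrences over the whole molecule set.

    Keeps fragments seen MORE than min_count times, as in the text.
    Assumption: every occurrence counts, including repeats within
    one molecule.
    """
    counts = Counter()
    for fragments in fragment_lists:
        counts.update(fragments)
    return {frag: n for frag, n in counts.items() if n > min_count}


def write_fragment_table(frequencies, path):
    """Save each unique fragment and its frequency as comma-separated text."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["fragment_smiles", "frequency"])
        ranked = sorted(frequencies.items(), key=lambda item: -item[1])
        for frag, n in ranked:
            writer.writerow([frag, n])
```

One table would be written per source (eMolecules and ChEMBL), giving the lookup used later when scoring new molecules.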
Fragmentation Scoring
Molecules evaluated for synthetic feasibility are fragmented in the same way as the source sets. A baseline score is created by the ratio of fragments of the incoming molecules.69 The baseline score is calculated as follows
The score is then weighted using an algorithm that takes the size of the fragment (number of atoms) and the frequency of occurrence in the source set.
Retrosynthetic Analysis
This was carried out by applying a set of transformations that apply known reactions in reverse.40 Our solution has two sources of these transformations. The primary source is a set of reactions extracted from patents by a group at Eli Lilly (Lilly).40 The secondary source is a set of reactions detailed by a group at AstraZeneca (AZ).70 All of the reactions were reversed so that they could be applied retrosynthetically. In the case of the Lilly reactions,40 a set of 1,929,251 reactions in a format similar to SMIRKS was culled to a set of 8040 reactions simply by looking at the number of characters in the text for each reaction. The idea here was that smaller reactions were more likely to represent the core or substructures of reactants and products and therefore would be applicable to a larger number of molecules. The reactions were then reversed by swapping the products with the reactants and converted from the SMIRKS format to RXN format in Pipeline Pilot. Approximately 10,000 druglike molecules were tested by running each of the 8040 reactions on them. Of the 8040 reactions, 2632 unique reactions were used at least once. This set of reactions was used as the final Lilly reaction set. A much smaller set of ∼45 common reactions was derived from the AZ group.70 These reactions were hand-written SMIRKS that represented common transformations used in organic synthesis. These SMIRKS were reversed by hand. Some were removed due to their promiscuous nature when applied in reverse (e.g., carbon–carbon bond formation reactions).
Once the core set of retrosynthetic reactions were selected and curated, the retrosynthetic analysis tool was developed and subjected to numerous rounds of testing (using experienced medicinal chemists) and enhancement, where various rules were imposed to encourage better outcomes. It was arbitrarily determined that up to five rounds of retrosynthetic reactions should be applied to each molecule. In the first round, each unique set of reaction products is retained. The “size” of each product molecule was determined by the number of non-hydrogen atoms. Most retrosynthetic reactions produce more than one product. For each set of products that are created by an individual reaction, the largest product (selected product) is retained. In rounds 2–5, an additional restraint is imposed. Only the five smallest of the selected products are allowed to the next round. In rounds 4 and 5, another additional restraint is imposed. The selected products must be smaller than the smallest selected product in all other rounds to be moved to the next round or to be reported. Results are reported for each round that is executed with all precursor molecules from each round.
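The round-by-round selection rules can be sketched in Python. This is an interpretation, not the Pipeline Pilot protocol: `retro_step` is a hypothetical callable returning one product list per applicable reverse reaction, `size` stands in for the non-hydrogen atom count, and the round 4 and 5 rule is read as "strictly smaller than the smallest product selected in any earlier round". The toy demonstration uses strings as molecules and string splitting as the "reaction".

```python
import math


def retro_rounds(target, retro_step, size, max_rounds=5, keep=5):
    """Apply up to max_rounds of retrosynthesis with the pruning rules."""
    frontier = [target]
    smallest_so_far = math.inf
    history = []
    for round_no in range(1, max_rounds + 1):
        selected = set()
        for mol in frontier:
            for products in retro_step(mol):
                if products:
                    # Keep the largest product of each reaction.
                    selected.add(max(products, key=size))
        ranked = sorted(selected, key=lambda m: (size(m), m))
        if round_no >= 4:
            # Rounds 4 and 5: must beat every earlier selected product.
            ranked = [m for m in ranked if size(m) < smallest_so_far]
        if round_no >= 2:
            # Rounds 2 to 5: only the `keep` smallest products advance.
            ranked = ranked[:keep]
        if not ranked:
            break
        history.append(ranked)
        smallest_so_far = min(smallest_so_far, size(ranked[0]))
        frontier = ranked
    return history


def split_step(mol):
    """Toy 'reverse reaction': cut a string at every internal position."""
    return [[mol[:i], mol[i:]] for i in range(1, len(mol))]


history = retro_rounds("ABCDEFGH", split_step, size=len)
```

In the toy run, round 1 keeps all seven selected products, rounds 2 and 3 keep five each, and round 4 stops because no product is smaller than the single-character precursor already found.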
Fragmentation and Retro Combined Scoring
The retrosynthetic analysis tool was combined with the fragmentation score to enhance the synthetic feasibility score. For the enhanced scoring, the selected products from the last three executed rounds (if at least three rounds were executed) are scored using the fragmentation scoring system. The highest score is then selected as the consensus score.
Weighting Mechanisms
After reviewing the results (with our own experienced synthetic chemists), it was clear that a certain key weighting mechanism was required to be added for certain features that are difficult to synthesize. The presence of one or more absolute chiral centers is one example of a penalizing feature. The presence of one or more spiro atoms is another example. For each of these elements that are present in the molecule, the score is reduced by a certain relative ratio.
Software Testing
A set of “best-selling 25 small-molecule drugs” were selected as an example of well-known molecules to test the automated retrosynthetic analysis software (Supporting Information, Table S2 and Figure S4). A set of 346 natural products (Canvass) were used to compare with a library of 201 FDA-approved drugs.43
Visualization of FDA-Approved Drugs and Natural Products
The molecular property space of FDA-approved drugs and the Canvass data set43 were compared using a t-distributed stochastic neighbor embedding (t-SNE) plot (see the methods below).
Data Analysis
To determine if the FDA-approved drugs were considered more synthetically feasible than the Canvass natural product library, Bootstrap hypothesis testing was performed on the two data sets.71 Briefly, both data sets (FDA library and Canvass) are combined into one data set. Two data sets of sizes n and m (the size of the FDA library and Canvass library, respectively) are randomly sampled from the combined data set. The mean and standard deviation are calculated. A p-value is calculated by determining the likelihood of the true mean occurring from the bootstrapped sample means.
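A minimal sketch of this kind of test, using only the Python standard library, is given below. Several details are assumptions rather than statements from the text: the test statistic is taken to be the difference in means, the test is two-sided, resampling is with replacement, and the p-value is the fraction of resampled differences at least as extreme as the observed one (with the usual add-one correction).

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_p_value(sample_a, sample_b, n_resamples=10_000, seed=0):
    """Two-sample bootstrap test on the difference in means.

    Both samples are pooled, and two resamples of the original sizes
    n and m are drawn with replacement from the pool on every iteration.
    """
    rng = random.Random(seed)
    observed = mean(sample_a) - mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n, m = len(sample_a), len(sample_b)
    extreme = 0
    for _ in range(n_resamples):
        resample_a = rng.choices(pooled, k=n)
        resample_b = rng.choices(pooled, k=m)
        diff = mean(resample_a) - mean(resample_b)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (extreme + 1) / (n_resamples + 1)
```

Called with the synthetic feasibility scores of the two libraries as `sample_a` and `sample_b`, a small return value would indicate that the observed difference in mean score is unlikely under the pooled (null) distribution.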
t-SNE Plot Generation
All t-SNE plots were generated using the scikit-learn package in Python with default parameters (number of components = 2, perplexity = 30.0, early exaggeration = 12, learning rate = 200, number of iterations = 1000, number of iterations without progress = 300, minimum gradient norm = 1e-07, metric = Euclidean).
Results
Evaluation of Different Generative Approaches
First, we evaluated several different generative model architectures introduced in the literature in recent years: recurrent neural networks (RNNs),6 generative adversarial networks (GANs),11 and variational autoencoders (VAEs),13 comparing the results with the published benchmark resources MOSES72 and GuacaMol.73 To assess the capabilities of each architecture, we used a number of metrics proposed in the literature: validity, whether the generated compounds are theoretically realistic molecules; uniqueness, the fraction of molecules that are unique; novelty, the fraction of generated molecules not in the training set; and finally, the Fréchet ChemNet distance (FCD),74 a measure of how close the distribution of generated data is to that of the molecules in the training set. As comparing architectures is difficult given the ability of different hyperparameter tuning to alter results, we chose hyperparameters based on their initial implementation. We then trained each architecture (RNN, VAE, and GAN) on 1.2 million ChEMBL compounds filtered to between 10 and 50 heavy atoms. We employed early stopping to reduce the length of time to train each model. Finally, we generated 100,000 compounds per architecture. We found that all three architectures performed similarly and were all capable of generating valid, unique, and novel compounds with a good FCD score (Figure 2).72,73 These scores were comparable to those reported in the literature with other benchmarking studies (Figure 2) and suggested that the choice of generative model architecture for MegaSyn was not a significant factor in improving generative model capabilities.
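The first three metrics can be computed with a few set operations, as in the sketch below. The denominators are an assumption (uniqueness over valid molecules, novelty over unique valid molecules), since the text defines each metric only in words. `canonicalize` is a hypothetical hook that returns a canonical SMILES or `None` for an unparsable string; in practice a cheminformatics toolkit would fill this role. The parenthesis check in the demonstration is only a stand-in for real parsing.

```python
def generation_metrics(generated, training_set, canonicalize):
    """Validity, uniqueness, and novelty of a batch of generated SMILES."""
    canonical = [canonicalize(smi) for smi in generated]
    valid = [smi for smi in canonical if smi is not None]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


def toy_canonicalize(smiles):
    """Stand-in validity check: reject unbalanced parentheses."""
    return smiles if smiles.count("(") == smiles.count(")") else None


metrics = generation_metrics(
    generated=["CCO", "CCO", "CC(C", "c1ccccc1"],
    training_set={"CCO"},
    canonicalize=toy_canonicalize,
)
```

In this toy batch, three of four strings pass the check, two of those three are distinct, and one of the two distinct molecules is absent from the training set.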
Figure 2.
Comparison of different model architectures for generative models using our models (CPI) in comparison to values reported from two other published benchmark resources (MOSES72 and GuacaMol).73
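The three set-based metrics described above can be illustrated with a minimal sketch. It is not the benchmarking code used in this study: the function name `generation_metrics`, the choice of denominators (uniqueness among valid molecules, novelty among unique valid molecules), and the stand-in validity check are assumptions made for illustration. Comparing strings directly also assumes that the SMILES have already been canonicalized. FCD is omitted because it requires a trained ChemNet model.

```python
from typing import Callable, Dict, Iterable


def generation_metrics(
    generated: Iterable[str],
    training_set: Iterable[str],
    is_valid: Callable[[str], bool],
) -> Dict[str, float]:
    """Return validity, uniqueness, and novelty as fractions in [0, 1]."""
    generated = list(generated)
    # Validity: fraction of all generated strings that pass the validity check.
    valid = [smi for smi in generated if is_valid(smi)]
    # Uniqueness: fraction of valid molecules that are distinct (assumed denominator).
    unique = set(valid)
    # Novelty: fraction of unique valid molecules absent from the training set.
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example: the validity check is a placeholder that only rejects empty strings.
metrics = generation_metrics(
    generated=["CCO", "CCO", "CCN", "", "c1ccccc1"],
    training_set=["CCO"],
    is_valid=lambda smi: len(smi) > 0,
)
print(metrics)
```

In practice the validity check would be delegated to a cheminformatics toolkit rather than the placeholder used here.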
MegaSyn Design
At its core, MegaSyn uses long short-term memory (LSTM)-based generative models to learn the proper structure of SMILES strings.6 As input, molecules are represented as tokenized SMILES strings. MegaSyn is composed of three distinct model types: the initial pretrained model, a set of primed models, and finally a set of exploratory models (Figure 3).
Figure 3.

MegaSyn architecture. First, an initial model is trained on a drug database (i.e., ChEMBL). Next, a set of primed models are generated by training on a target compound(s). Finally, exploratory models are generated from each primed model node, completing a set of generative models that range from general, druglike molecules to analogs of the target compound(s).
Initial Model
The initial model is trained on the ∼2 million compounds of ChEMBL 28.22,67 The purpose of training this model is to teach it how to create druglike molecules. Once trained, the initial model “knows” how to put together druglike molecules and can then be queried to generate compounds that fall within ChEMBL’s chemical space. This represents the prior knowledge of chemistry: valid chemical structures, and how they are put together atom by atom, are learned in this initial model. This considerable quantity of chemical information is carried over to the subsequent models by transfer learning. The initial model takes the largest amount of time to train; however, once trained, it can be reused for many projects as the prior model, and the overall training time of MegaSyn is small in comparison to a full retraining of a typical generative model on the entire ChEMBL database.
Primed Models
After the initial model is trained, a set of “primed” models are trained (Figure 3). The initial model is first presented with the molecule(s) of interest. Critically, we included a form of target-structure analysis by preprocessing each targeted molecule of interest into a set of substructures. The molecule(s) of interest are broken down into substructures based on RECAP rules using RDKit’s RecapDecompose module.75 Next, simplified carbon-only versions of these substructures and the original molecule are also generated. We found that breaking down the molecule(s) of interest into substructures allowed a different substructure set to be considered by the primed models, improving the analog exploration space around the molecule(s) of interest. Model 1 is trained on this list of structures and substructures for several epochs using teacher forcing. Every i epochs, the model is saved until a set of n primed models have been created. These primed models range from generic exploration of chemical space (early primed models) to focused enumeration of the target molecule(s) (late primed models). The number of training epochs is critical; if too few epochs are used, the primed models explore a very wide chemical space around the target molecule. If too many are used, the model learns to focus only on the specific structure and substructures of the target(s) of interest themselves. We find that 16 total epochs with a model saved after every 2 epochs represent the gradient from general to focused reasonably well for several target molecules, although we note that this may represent the training “distance” from ChEMBL to similar target molecules. Because only a few targets are trained on at a time, primed models can be generated quickly.
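The checkpointing schedule described above (a model saved every i epochs until n primed models exist) can be sketched as follows. The function name `primed_checkpoint_epochs` is hypothetical and only illustrates the arithmetic for the reported setting of 16 epochs with a save every 2 epochs; it does not train anything.

```python
from typing import List


def primed_checkpoint_epochs(total_epochs: int, save_every: int) -> List[int]:
    """Epochs at which a primed model would be saved during priming."""
    if total_epochs < 1 or save_every < 1:
        raise ValueError("total_epochs and save_every must be positive")
    return list(range(save_every, total_epochs + 1, save_every))


# Reported setting: 16 total epochs, one primed model saved every 2 epochs.
checkpoints = primed_checkpoint_epochs(total_epochs=16, save_every=2)
print(checkpoints)       # [2, 4, 6, 8, 10, 12, 14, 16]
print(len(checkpoints))  # 8
```

Early checkpoints correspond to the more general primed models and late checkpoints to the more target-focused ones.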
Exploration-Ensemble Models
Primed models represent nodes along a singular branch from a general druglike library (ChEMBL) to specific analogs of a molecule of interest. To explore more diverse chemical space around each of these nodes, a final set of ensemble models are branched off from each primed model node. For each primed model, de novo molecules are generated (∼2000–10,000 appear sufficient to cover a broad chemical space). The generated molecules are then ranked based on a composite score from a number of criteria. This usually includes QED,50 activity against the target (target model), and any other desired scores. Notably, if a score can be assigned to a compound, it can be included in the final composite score; this provides flexibility in the tasks to which the generative model can be applied. We can weight each objective according to its importance on a scale from 0 to 1, with 1 being extremely important and 0 representing no importance. After the generated set of compounds is scored, the top 10% of ranked compounds are kept and the model is trained on these top compounds, a training concept called hill-climb MLE.76 A new set of molecules is generated after training, and the cycle continues. Importantly, the top 10% of compounds are kept from one epoch to the next; only if a newly generated compound has a score higher than one in the current top 10% list does it replace one in the set. Eventually, the model will find a substructure minimum and is then capable of generating analogs of this specific substructure. Often, based on the initial seed molecules of the very first iteration, the model will converge to one local minimum. At least four models are trained and generated from each primed model node to obtain models that focus on different substructures of the original target molecule. The top-scoring 10% of compounds found over the entire training loop for each model are kept.
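The generate, score, keep, and retrain cycle described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the MegaSyn implementation: `hill_climb`, `sample`, `score`, and `fine_tune` are hypothetical names, the "molecules" are random strings, and the fine-tuning hook merely reweights a toy sampler instead of performing maximum likelihood training of an RNN.

```python
import random
from typing import Callable, Dict, List


def hill_climb(
    sample: Callable[[int], List[str]],
    score: Callable[[str], float],
    fine_tune: Callable[[List[str]], None],
    n_cycles: int,
    n_samples: int,
    keep_fraction: float = 0.1,
) -> Dict[str, float]:
    """Run a hill-climb loop and return the retained top-scoring molecules."""
    n_keep = max(1, int(n_samples * keep_fraction))
    elite: Dict[str, float] = {}
    for _ in range(n_cycles):
        # Score the new batch and pool it with the molecules retained so far.
        for mol in sample(n_samples):
            elite.setdefault(mol, score(mol))
        # Keep the best n_keep molecules seen so far, so a new molecule only
        # enters the set by outscoring a current member.
        best = sorted(elite.items(), key=lambda kv: kv[1], reverse=True)[:n_keep]
        elite = dict(best)
        # Train the generator on the retained set (placeholder hook).
        fine_tune(list(elite))
    return elite


# Toy generator: strings over three tokens, scored by their fraction of "N".
rng = random.Random(0)
token_weights = {"C": 1.0, "N": 1.0, "O": 1.0}


def toy_sample(n: int) -> List[str]:
    tokens = list(token_weights)
    w = [token_weights[t] for t in tokens]
    return ["".join(rng.choices(tokens, weights=w, k=6)) for _ in range(n)]


def toy_score(mol: str) -> float:
    return mol.count("N") / len(mol)


def toy_fine_tune(mols: List[str]) -> None:
    # Shift token weights toward the composition of the retained molecules.
    for t in token_weights:
        token_weights[t] = 1.0 + sum(m.count(t) for m in mols)


top = hill_climb(toy_sample, toy_score, toy_fine_tune, n_cycles=5, n_samples=50)
print(sorted(top.items(), key=lambda kv: -kv[1]))
```

Carrying the retained set across cycles is what makes the best score non-decreasing from one cycle to the next, which matches the replacement rule stated in the text.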
The collection of models is indexed to give flexibility in which regions of chemical space the user can explore. Instead of sampling from a single generative model, MegaSyn randomly samples from a collection of t total models (initial model + (i/n) * 4) in parallel. It should be noted that training multiple models from the initial model takes a limited amount of time, only requiring ∼6 h on a single Nvidia GeForce GTX 1080 Ti GPU to generate 32 models, the number of models generated per MegaSyn case study in this paper. The desired “focus” of the ensemble can be set with a generative specificity parameter, which weights the chance of each model being sampled, either favoring molecules closer to the training target(s) or moving away from the targets to generate novel compounds.
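One way to picture the generative specificity parameter is as a set of sampling probabilities over models ordered from general to target-focused. The linear weighting below is purely an assumption for illustration (the text does not specify the actual scheme), and `model_sampling_weights` is a hypothetical name. The model count follows the case-study setting of one primed model every 2 of 16 epochs and four exploratory models per primed node.

```python
import random
from typing import List


def model_sampling_weights(n_models: int, specificity: float) -> List[float]:
    """Sampling probabilities over models ordered from general to focused.

    specificity = 0 favors the most general models, 1 favors the most
    target-focused models, and 0.5 samples all models uniformly.
    """
    if n_models < 2 or not 0.0 <= specificity <= 1.0:
        raise ValueError("need at least 2 models and specificity in [0, 1]")
    raw = []
    for k in range(n_models):
        position = k / (n_models - 1)  # 0 = most general, 1 = most focused
        raw.append(specificity * position + (1.0 - specificity) * (1.0 - position))
    total = sum(raw)
    return [w / total for w in raw]


n_primed = 16 // 2             # one primed model saved every 2 of 16 epochs
n_exploratory = n_primed * 4   # four exploratory models per primed node
weights = model_sampling_weights(n_exploratory, specificity=0.8)

# Draw which model to sample from for each of five generation requests.
rng = random.Random(0)
picked = rng.choices(range(n_exploratory), weights=weights, k=5)
print(n_exploratory, picked)
```

Any monotonic weighting would serve the same purpose; the linear form is used here only because it is easy to inspect.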
Evaluation of De Novo Molecules Generated from MegaSyn
Case Study 1: Lapatinib Analogs
We decided to evaluate the capability of MegaSyn to generate valid, novel molecules with desired properties by employing several case studies. As an example of our generative approach, we chose to optimize lapatinib, an orally active drug for breast cancer and other tumors (Figure 4A). Lapatinib inhibits EGFR (HER1) and HER2 kinases and thus is commonly used in combination therapy for HER2-positive breast cancer.77 Lapatinib, however, is relatively poor at crossing the blood–brain barrier (BBB), with highly variable metastasis uptake, and is not detected in normal brain tissues.78 We used MegaSyn to design analogs that simultaneously optimize for HER1 and HER2 activities with an improved ability to cross the BBB. All activity models were built using naïve Bayes (Table S1; see methods). Both HER1 and HER2 data sets were reasonably well balanced (∼42% actives and ∼40% actives, respectively). HER1 and HER2 models had ROC values of 0.80 and 0.86 and F1 scores of 0.69 and 0.77, respectively (Table S1). For inputs to the scoring function, we considered a QED score > 0.6, similarity to lapatinib or lapatinib fragments (Tanimoto similarity > 0.6), and prediction scores from machine learning models we constructed for crossing the BBB, HER1 inhibition, HER2 inhibition, and finally hERG inhibition (using HEK293 cell data only) to ensure the molecules avoid this ion channel. We ran MegaSyn for 16 total epochs, saving a primed model node every two epochs and generating four exploratory models per primed model node for a total of 32 RNN-based models. A total of 10,000 molecules were generated from each of the 32 RNN-based models.
Figure 4.

Case study 1. (A) Structure of lapatinib, the target molecule of interest. (B) Number of the top 2000 MegaSyn compounds that fall within the applicability domain of the HER1 and HER2 models.
MegaSyn Explores Diverse Chemical Space
t-SNE plots of the top 200,000 scored molecules show that MegaSyn explores a rich chemical space around lapatinib (Figure 5A), with Tanimoto similarity scores ranging from 0.1 to 0.97. To determine if the generated molecules were within the applicability domain of our models, we applied a modified version of the approach of Aniceto et al.79 (Figure 4B). First, we trained an ensemble random forest classifier using scikit-learn’s RandomForestClassifier (number of trees = 500, class_weight = balanced). Using the ensemble, we calculated the bias (prediction score – true binary value) and standard deviation (STD) of the ensemble predictions. Using the formula (bias * 1-STD) to define a weight, we then calculated the average weighted Tanimoto similarity of every compound to the remaining training data set. We set a threshold corresponding to the 75th percentile of all weights calculated in the training set. Next, for each MegaSyn-generated compound, we used the maximum Tanimoto score to the training set and weighted it by that nearest neighbor’s bias and STD. We rejected the compound from the applicability domain (AD) if the weighted Tanimoto did not surpass the threshold. We found that 37.6 and 44.1% of the top 2000 scoring compounds were in the applicability domain for HER1 and HER2, respectively. For comparison with other generative approaches, we also used a single LSTM-based generative model pretrained on ChEMBL, with the exact same loss function and multiparameter optimization score to drive the generative model. Both MegaSyn and the single LSTM model were trained for the same number of iterations, with each MegaSyn ensemble submodel trained for iterations/N-p*2, where N is the number of ensemble submodels and p is the number of primed models. We then sampled 5000 compounds from the LSTM-based generative model to compare against MegaSyn, from which we also sampled 5000 compounds. We also sampled 5000 compounds from the individual ensemble models to explore the submodel heterogeneity.
MegaSyn had significantly higher multiparameter optimization scores, suggesting that it can find a better composite score maximum (Figure 6A). While we draw analogies to ensemble prediction models, our results suggest that the individual models in the MegaSyn ensemble are not necessarily weaker generative models and instead often discover different local minima due to the probabilistic nature of the generative approach in combination with the hill-climb MLE scoring feedback loop. Each submodel generally found a distinct region of chemical space to focus on and did not converge onto the same spatial regions (Figure 6B). Thus, while the ensemble models generally generate lower-scoring molecules compared to the LSTM due to less training, by chance some ensemble models rapidly converge on an optimal region of chemical space. This suggests that the ensemble approach is better able to avoid becoming trapped in local minima by exploring a larger chemical space with multiple, weaker models compared to any single trained model. When we limit the set to the top 2000 scored generated molecules, molecular diversity is still common, further suggesting that MegaSyn is not just enumerating on a common core structure but exploring diverse options to meet the criteria used in the scoring function (Figure 5B). In contrast to the Tanimoto similarity score, the region in the t-SNE plot with the highest multioptimization score is distinct from the location of lapatinib, suggesting that MegaSyn is potentially capable of finding novel chemical space with better molecular properties than lapatinib (Figure 7A). While the majority of the top 2000 compounds are predicted to cross the BBB (Figure 7B), there is a clear structure–activity relationship for HER1 activity and especially HER2 activity, which shows higher selectivity among the top compounds (Figure 7C,D).
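The nearest-neighbor step of the applicability-domain check can be sketched as follows. This is a simplified illustration, not the procedure used in the study: fingerprints are represented as sets of "on" bit indices, each training compound carries a precomputed reliability weight (a stand-in for the quantity derived from bias and STD), and a query is assumed to be inside the domain when its weighted similarity exceeds the threshold. The names `tanimoto` and `in_applicability_domain`, and all numeric values, are hypothetical.

```python
from typing import List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def in_applicability_domain(
    query: Set[int],
    training: List[Tuple[Set[int], float]],
    threshold: float,
) -> bool:
    """Check a query against (fingerprint, reliability weight) training pairs."""
    # Find the nearest training neighbor by maximum Tanimoto similarity.
    best_sim, best_weight = max(
        ((tanimoto(query, fp), w) for fp, w in training), key=lambda t: t[0]
    )
    # Weight the similarity by that neighbor's reliability and compare
    # (assumed convention: inside the domain if the value exceeds the threshold).
    return best_sim * best_weight > threshold


# Toy training set with placeholder fingerprints and weights.
training = [({1, 2, 3, 4}, 0.9), ({10, 11, 12}, 0.5)]
close_query = in_applicability_domain({1, 2, 3, 5}, training, threshold=0.5)
far_query = in_applicability_domain({10, 20, 30}, training, threshold=0.5)
print(close_query, far_query)
```

With real data, the fingerprints would come from a cheminformatics toolkit and the threshold from the distribution of weighted similarities within the training set.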
We evaluated the atomic contribution to model prediction for lapatinib and two of the top-scoring generated compounds (Figure S1). While the BBB model suggests that the smaller generated compounds have no distinct atom-specific prediction differences (Figure S1), the HER1 model suggests that the core atomic contribution to predicted activity is retained, with a new strong atomic contributor (the carbon atom highlighted in the first top-generated molecule under HER1) in addition (Figure S1). For HER2, however, the strongest atomic contributor is not retained from lapatinib in the top-scoring generated compounds, and instead novel atomic contributors are highlighted, suggesting that the optimization of the generated molecules can “find” distinct properties that allow the generated molecules to still be active against the target (Figure S1). We next evaluated the synthetic feasibility of the top-scoring compounds by using our newly built retrosynthetic analysis tool.
Figure 5.
t-SNE plots of structural diversity of MegaSyn-generated compounds. (A) t-SNE plot based on ECFP6 for 200,000 top-scoring generated molecules colored by Tanimoto similarity to lapatinib. (B) t-SNE plot based on ECFP6 for 2000 top-scoring generated molecules colored by Tanimoto similarity to lapatinib. The blue dot represents lapatinib.
Figure 6.
Comparison of MegaSyn and individual ensemble models (designated “E 1”, “E 2”, etc.) vs a single LSTM model multioptimization score using the same ChEMBL pretrained model for setup. (A) Boxplot showing the multiparameter optimization score for the generated compounds. (B) A t-SNE plot showing the structurally distinct generated chemical space from each submodel in the MegaSyn ensemble (designated E 1, E 2, etc.).
Figure 7.
t-SNE plots based on ECFP6 of the top 2,000 scoring compounds generated by MegaSyn colored by (A) the predicted ability to cross the BBB, (B) multiobjective optimization score, (C) predicted HER1 inhibition, or (D) predicted HER2 inhibition.
Case Studies for Retrosynthetic Analysis
Before scoring the retrosynthetic feasibility of MegaSyn-generated compounds, we first tested the retrosynthetic analysis tool on several examples to illustrate its utility. As an example application, Sorensen et al. recently described a three-step synthesis for the antiviral drug tilorone.80 Our software suggests several approaches to synthesize tilorone (Figure S2). Another molecule tested in this way was the kinase inhibitor axitinib.81 The retrosynthetic analysis results were compared with a known synthesis route (Figure S3). We expanded this analysis to a larger evaluation of the “top 25 selling small-molecule drugs” (Table S2). This resulted in a similar number of alternative synthetic routes for these drugs (Figure S4). Fifteen out of 25 were “retro-synthesized” completely to commercially available reactants (eMolecules was checked for commercial availability). Of these, two drugs required only one step, four required two steps, eight required three steps, and one required five steps to break down into commercially available reactants. In many cases, the retrosynthesis went further than required to reach commercially available reactants (Table S2).
Synthetic Feasibility Prediction
An example of using tilorone for synthetic feasibility prediction is also shown in Figure S5, which illustrates the analysis of results as a whole and the scoring. In addition, we have compared the synthetic feasibility consensus scores of an FDA-approved drug library versus 346 natural products in the Canvass data set43 (Figure 8). This analysis shows a good separation of drugs from natural products using this score. We hence decided to use reference points of a synthetic feasibility score of >60 to indicate synthetic feasibility and a score of >90 to indicate a compound that is more easily synthesizable. The FDA data set and Canvass were statistically significantly different (p = 0.00318), suggesting that the synthetic feasibility tool is capable of discerning difficult-to-synthesize molecules (e.g., natural products) from generally simpler molecules like drugs. Visualization of the chemistry space of these approved drugs and the natural products further demonstrates that they cover different chemical property space areas, with drugs generally focused on the center of the plot while natural products are on the periphery (Figure S6).
Figure 8.

Boxplot comparing the consensus synthetic feasibility score for an FDA-approved library versus 346 natural products in the Canvass data set and 200 of the top-scoring MegaSyn-generated lapatinib analog compounds. 195/200 MegaSyn compounds had a score >60 and 46/200 compounds had a score >90, indicating that the compounds were synthetically feasible and easily synthesizable, respectively.
Synthetic Feasibility of MegaSyn-Generated Compounds
After validating our synthetic feasibility tool, we used the consensus model to score the top 200 MegaSyn-generated lapatinib analogs ranked by the MPO score (Figure 8). Most compounds (97.5%) were scored as synthetically feasible, with nearly a quarter (23%) considered easily synthesizable (Figure 8). This suggests that MegaSyn can generate valid, druglike, readily synthesizable compounds with the desired predicted physicochemical and bioactivity properties.
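The percentages above follow directly from the score cutoffs given in the Figure 8 caption (>60 for synthetically feasible, >90 for easily synthesizable). The sketch below shows that bookkeeping; the function names are hypothetical, the label for scores of 60 or below is an assumption, and the score list is a placeholder constructed only to reproduce the reported counts (195/200 and 46/200), not the actual data.

```python
from typing import Dict, List


def feasibility_label(score: float) -> str:
    """Map a consensus synthetic feasibility score to a category."""
    if score > 90:
        return "easily synthesizable"
    if score > 60:
        return "synthetically feasible"
    return "below feasibility cutoff"  # label chosen for illustration


def summarize(scores: List[float]) -> Dict[str, float]:
    """Percentage of compounds above each cutoff."""
    n = len(scores)
    feasible = sum(s > 60 for s in scores)
    easy = sum(s > 90 for s in scores)
    return {"feasible_pct": 100 * feasible / n, "easy_pct": 100 * easy / n}


# Placeholder scores: 46 above 90, 149 between 60 and 90, 5 at or below 60.
placeholder_scores = [95.0] * 46 + [75.0] * 149 + [40.0] * 5
summary = summarize(placeholder_scores)
print(summary)  # {'feasible_pct': 97.5, 'easy_pct': 23.0}
```

Note that the "feasible" count includes the easily synthesizable compounds, which is how 97.5% and 23% can both be reported for the same 200 molecules.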
Case Study 2: Ibogaine Analogs
As a second, more challenging case study, we chose to potentially improve upon a natural product, ibogaine. Ibogaine is a natural product derived from Tabernanthe iboga (Figure 9A). Recent research has shown that psychedelics such as ibogaine may have therapeutic potential as antiaddictive agents. However, ibogaine has several undesirable properties, including the inhibition of the hERG channel and the induction of a psychedelic experience. In a recent publication, Cameron et al. proposed, synthesized, and tested new ibogaine analogs with the following targeted properties in mind: the analog does not inhibit the hERG channel, it maintains specificity to the 5-HT2A receptor, which is thought to be necessary for the therapeutic action, and it does not induce a psychedelic experience.44 Ultimately, the authors discovered tabernanthalog, an ibogaine derivative with these desired properties.44
Figure 9.

MegaSyn generation of new molecules based on ibogaine. (A) The structures of ibogaine and tabernanthalog. (B) t-SNE plots of the top 2000 generated molecules based on ECFP6 fingerprints colored by Tanimoto similarity. (C) Structures of three randomly sampled molecules from the top 200 compounds. (D) Histogram of the AlogP of the top 50 generated compounds. The AlogP of ibogaine is indicated by the red dashed line.
We have used this paper as a test case and challenged MegaSyn to find tabernanthalog, using the following criteria: activity against 5-HT2A, inactivity against hERG, 5-HT1A, 5-HT1F, and 5-HT2C, similarity to ibogaine and substructures (Tanimoto > 0.6), and lower cLogP than ibogaine. We ran MegaSyn for 16 total epochs, saving a primed model node every two epochs and generating four exploratory models per primed model node for a total of 32 LSTM-based models.
We built machine learning models against 5-HT2A, hERG, 5-HT1A, 5-HT1F, and 5-HT2C to include in the multiobjective scoring function to drive MegaSyn. All activity models were built using naïve Bayes (Table S1; see methods). Most of the models were well balanced and had precision values from 0.68 to 1, F1 scores from 0.6 to 0.96, and ROC values > 0.83. We then generated 100,000 compounds and took the top 50 highest multiobjective scoring compounds. Tabernanthalog was included in the top 50 highest scoring compounds. In addition, MegaSyn captured a wide variety of other related structures, including dissimilar scaffolds to ibogaine (Figure 9B,C). Most of the top 50 compounds had a lower AlogP than ibogaine, suggesting that MegaSyn could find molecules with improved predicted druglike properties (Figure 9D). In addition, the top 10 generated compounds had an MPO score (where MPO here represents a measure of BBB penetration)82 comparable to or better than that of tabernanthalog, and all had a higher MPO score than ibogaine, suggesting that several of the novel MegaSyn-generated compounds have a higher probability of crossing the BBB (Table S3).
Automated Analog Generation
In addition to the de novo design of molecules with MegaSyn, we have also developed an easy-to-use web interface using Pipeline Pilot for running an automated analog generation protocol, which can be used for lead expansion. We encoded several different medicinal chemistry strategies to generate potential analogs. A file of the molecules for which analogs are to be generated is uploaded, and the output consists of a pie chart summarizing the makeup of the molecule analogs and bar charts of their properties (Figure S7). The charts and tabular output are linked together such that the user can select subsets of molecules and readily export them. This tool can also be used with the retrosynthetic analysis described earlier to score the likely synthetic feasibility.
Discussion
The goal of this study was to generate a complementary suite of accessible tools for generative molecular design, computer-assisted synthesis, retrosynthesis, and synthetic viability to propose new analogs or additional molecules as the next steps after the identification of a potential hit. We aimed to make use of existing data and algorithms wherever possible to deliver this additional functionality to provide meaningful synthesis suggestions for each molecule. We have now described these methods for automated lead expansion, filtration of analogs, and selection of a representative set of molecules that is user-accessible. This collection of capabilities can also be combined with other software or machine learning tools to score proposed analogs with models of interest.
Over the past few years, new discoveries in the field of de novo drug design have renewed interest in generating new molecules using machine learning.6−9 RNNs have been used for generating libraries for HTS, hit-to-lead optimization, and fragment-based hit discovery.15,83−87 A feature of these generative models is the ability to optimize multiple parameters such as the physicochemical properties or biological activity. While these new approaches are promising, a critical gap in knowledge is that limited experimental validation data was generated by synthesizing compounds and testing for activity in any of the aforementioned studies, with only a few groups validating their approach by making and testing compounds88 or finding structurally similar compounds from vendors.89 Default or “vanilla” generative models, while capable of generating novel compounds, often do not end up in the desired chemical space. In our conversations with numerous drug discovery experts at various companies, the major complaints regarding generative models are that they either end up enumerating on the same initial target molecules, essentially rediscovering what their medicinal chemists have already proposed (suggesting that the model is too focused), or end up far away from “realistic” drug designs, proposing molecules that are well outside the realm of synthesizability. We suggest that a single model is not sufficient to cover all of the possible tasks requested of a generative model, so we have attempted to circumvent these issues by creating a large enumeration of models, from the very general (little information is considered about the desired molecular space) to the specific (models that generate only analogs of the desired target molecules).
MegaSyn Initial Model Choice
The current MegaSyn models are all initially based on a single model pretrained on the ChEMBL database. This serves two purposes: first, the model has already learned how to compose correct molecule structures from SMILES strings. Second, despite the large number of learned molecules, ChEMBL has the additional bonus of being composed almost entirely of druglike molecules. This works to the advantage of MegaSyn due to the unique training strategy of using the hill-climb MLE algorithm. The use of hill-climb MLE means that only molecules that the initial generative model is capable of generating can be used for training, creating a feedback loop in which only druglike molecules are generated and trained on, which prevents molecules with undesirable properties from being generated. This is further reinforced through the use of a QED score to prevent molecules from straying too far into non-drug-like space. The choice of database used to train the initial model, then, is critical to the success of a generative model. Other databases can be explored for training the initial model to change the desired outcome.
Composite Score Function
The core driver of MegaSyn is the composition of the composite score function, which often includes a score for druglikeness (such as QED) and similarity to target molecule(s) (i.e., Tanimoto similarity) in addition to the primary activity scoring models for potential drug targets. The accuracy and choice of scores, then, are also critical for the success of MegaSyn. The number of possible scores to include is unbounded; the only requirement is that a molecule can be scored and ranked numerically. For example, machine learning models for toxicity (hERG, drug-induced liver toxicity, CYPs, etc.) can be combined with on-target models (i.e., a 5-HT receptor) and off-target models (other 5-HT receptors or other targets to avoid) to create a composite score of dozens of scoring functions. We included a weighting value, from 0 to 1, which allows flexibility in score inclusion; in addition to the important score functions, several “nice-to-have” scores may also be introduced with a lower weight value than the more critical score functions.
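A weighted composite score of this kind can be sketched as a weighted mean of per-objective scores. This is an illustration under stated assumptions rather than the scoring function used by MegaSyn: the name `composite_score`, the normalization by the sum of weights, and the constant placeholder objectives are all hypothetical. Real objectives would call QED, similarity, and activity models, each returning a value on a comparable scale.

```python
from typing import Callable, Dict, Tuple

ScoreFn = Callable[[str], float]


def composite_score(mol: str, objectives: Dict[str, Tuple[ScoreFn, float]]) -> float:
    """Weighted mean of objective scores; each weight must lie in [0, 1]."""
    total, weight_sum = 0.0, 0.0
    for name, (fn, weight) in objectives.items():
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {name!r} must be in [0, 1]")
        total += weight * fn(mol)
        weight_sum += weight
    # Normalizing by the weight sum is an assumption made for this sketch.
    return total / weight_sum if weight_sum else 0.0


# Placeholder objectives returning fixed values instead of model predictions.
objectives = {
    "druglikeness": (lambda m: 0.8, 1.0),   # critical objective
    "similarity": (lambda m: 0.6, 0.5),     # moderately important
    "nice_to_have": (lambda m: 0.2, 0.1),   # low-weight extra objective
}
example = composite_score("CCO", objectives)
print(round(example, 3))
```

Because low-weight objectives contribute little to the mean, "nice-to-have" criteria can be added without overriding the critical ones.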
Case Study Results: Pros and Cons
In the absence of prospective validations of the generative approaches, the use of case studies is a promising alternative to explore the possible application and limitations of generative de novo design software, as demonstrated herein. We illustrated that MegaSyn, even when faced with a natural product (ibogaine), can discover the same molecular analog (tabernanthalog) as proposed by medicinal chemists (using “traditional” medicinal chemistry approaches to design), suggesting that it is capable of supplementing medicinal chemistry exploration. In addition, several of the top-scoring compounds in this case study had molecular scaffolds distinct from ibogaine, highlighting that ibogaine is not considered a “druglike” molecule. Further, the proposed top-scored compounds for lapatinib, while “similar” to lapatinib, possessed improved predicted molecular properties, which was the intent.
The downside of such case studies is that the interpretation of success is only as good as the accuracy of the composite scoring function. While we can judge generated molecules as being reasonable from a chemistry point of view, it remains to be seen whether the other top-scoring compounds are in fact (1) active, (2) nontoxic, and (3) selective, without making them and testing them. These are critical points that have yet to be fully investigated by any generative model proposed to date (to the best of our knowledge), and we do not know whether the bias of using machine learning models to drive generative models also affects the probability of the top-scoring generated compounds being truly active, nontoxic, or selective. We would argue, however, that these same machine learning models could be used to direct drug discovery projects regardless of the origin of the proposed molecules, therefore suggesting that generative models may provide a promising route to finding new molecules to test, especially when combined with retrosynthetic analysis.
Retrosynthetic Analysis and Analog Designer
While some of the tools described herein are likely less sophisticated than the approaches described earlier for computer-assisted synthesis,30−32 retrosynthesis,25,33−37 and synthetic viability tools (e.g., AutoGrow 3,90 chemical stability,91 and others38 to eliminate invalid options), they can be readily implemented in Pipeline Pilot, which is a widely used, commercially available product. Similarly, this approach and software could also be readily reimplemented in open-source tools such as KNIME54,92 or scripted in Python or other languages.
In conclusion, we have demonstrated that MegaSyn can propose synthesizable analogs for molecules based on the integration of various software components (open source and commercial). We have also demonstrated that we can recapitulate synthetic approaches for approved drugs in our case studies and that our synthetic feasibility score can reliably differentiate between approved drugs that are likely to be more synthetically feasible than more complex natural products. While these efforts represent essentially retrospective evaluations of the software developed, this is in line with what has been demonstrated with several more sophisticated tools described earlier. The next step is using the MegaSyn suite of tools to propose analogs, define how to make them, and rank their synthetic feasibility before ultimately selecting molecules to synthesize and testing in vitro. We are currently applying MegaSyn to perform just this on various internal and collaborative research projects. Developing tools like MegaSyn should also consider the potential for dual use of this technology as we have recently reported with the application of MegaSyn to develop VX, several analogs, and its precursors.93 As this is a commercial product, we have control over who has access to it, such that we can implement restrictions or an API for any forward-facing models that may be of a sensitive nature, as has been done elsewhere for other machine learning models such as GPT-3.94
Acknowledgments
Biovia is kindly acknowledged for providing Discovery Studio and Pipeline Pilot. The authors thank Dr. Ian Watson and Dr. Alex M. Clark for early discussions.
Glossary
Abbreviations Used
- ADME/tox: absorption, distribution, metabolism, excretion, and toxicity
- ECFP4: extended connectivity fingerprints
- LSTM: long short-term memory
- RNN: recurrent neural network
- QED: quantitative estimate of druglikeness
- MLE: maximum likelihood estimation
- SMILES: simplified molecular-input line-entry system
- BBB: blood–brain barrier
- FCD: Fréchet ChemNet distance
Supporting Information Available
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.2c01404.
Structures of public molecules and computational models are available upon request. Further details on the cross-validation statistics of models, retrosynthesis of the top 25 selling drugs, scores of ibogaine analogs, atomic contributions for lapatinib analogs, examples of retrosynthetic analysis and synthetic feasibility prediction, visualization of drug and natural product property space, and the Pipeline Pilot analog enumeration interface (PDF)
Supplemental Table 4 (ZIP)
The authors declare the following competing financial interest(s): S.E. is the founder and owner and F.U. is an employee of Collaborations Pharmaceuticals, Inc. C.T.L. and J.C.C. consulted for Collaborations Pharmaceuticals, Inc.
Notes
The authors kindly acknowledge NIH funding: R44GM122196-02A1 and 2R44GM122196-04A1 from NIGMS, 3R43AT010585-01S1 from NCCAM, and 1R43ES031038-01 from NIEHS. “Research reported in this publication was supported by the National Institute of Environmental Health Sciences of the National Institutes of Health under Award Number R43ES031038. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health”.
All data sets used for machine learning model building are available upon request (Table S1). The Bayesian activity models are available upon request as scikit-learn models, provided as a JSON or joblib dump file. Output molecules from MegaSyn test cases are proprietary. Pipeline Pilot software is licensed from Biovia. The MegaSyn software and our Pipeline Pilot protocols are available for licensing.
The MegaSyn software (and likely other similar software) has potential dual-use capabilities, and the authors therefore propose to implement restrictions on who has access to it and the applications it is used for. The authors believe that such precautions are necessary, and these will evolve over time as they integrate software features to limit and prevent dual use.
References
- Vignaux P. A.; Minerali E.; Foil D. H.; Puhl A. C.; Ekins S. Machine Learning for Discovery of GSK3β Inhibitors. ACS Omega 2020, 5, 26551–26561. 10.1021/acsomega.0c03302.
- Vignaux P. A.; Minerali E.; Lane T. R.; Foil D. H.; Madrid P. B.; Puhl A. C.; Ekins S. The Antiviral Drug Tilorone Is a Potent and Selective Inhibitor of Acetylcholinesterase. Chem. Res. Toxicol. 2021, 34, 1296–1307. 10.1021/acs.chemrestox.0c00466.
- Klein J. J.; Baker N.; Foil D. H.; Zorn K. M.; Urbina F.; Puhl A. C.; Ekins S. Using Bibliometric Analysis and Machine Learning to Identify Compounds binding to Sialidase-1. ACS Omega 2021, 6, 3186–3193. 10.1021/acsomega.0c05591.
- Patronov A.; Papadopoulos K.; Engkvist O. Has Artificial Intelligence Impacted Drug Discovery? Methods Mol. Biol. 2022, 2390, 153–176. 10.1007/978-1-0716-1787-8_6.
- Olivecrona M.; Blaschke T.; Engkvist O.; Chen H. Molecular de-novo design through deep reinforcement learning. J. Cheminf. 2017, 9, 48 10.1186/s13321-017-0235-x.
- Segler M. H. S.; Kogej T.; Tyrchan C.; Waller M. P. Generating Focused Molecule Libraries for Drug Discovery with Recurrent Neural Networks. ACS Cent. Sci. 2018, 4, 120–131. 10.1021/acscentsci.7b00512.
- Meyers J.; Fabian B.; Brown N. De novo molecular design and generative models. Drug Discovery Today 2021, 26, 2707–2715. 10.1016/j.drudis.2021.05.019.
- Bhisetti G.; Fang C. Artificial Intelligence-Enabled De Novo Design of Novel Compounds that Are Synthesizable. Methods Mol. Biol. 2022, 2390, 409–419. 10.1007/978-1-0716-1787-8_17.
- Palazzesi F.; Pozzan A. Deep Learning Applied to Ligand-Based De Novo Drug Design. Methods Mol. Biol. 2022, 2390, 273–299. 10.1007/978-1-0716-1787-8_12.
- Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernandez-Lobato J. M.; Sanchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Cent. Sci. 2018, 4, 268–276. 10.1021/acscentsci.7b00572.
- Prykhodko O.; Johansson S. V.; Kotsias P. C.; Arus-Pous J.; Bjerrum E. J.; Engkvist O.; Chen H. A de novo molecular generation method using latent vector based generative adversarial network. J. Cheminf. 2019, 11, 74 10.1186/s13321-019-0397-9.
- Hochreiter S.; Schmidhuber J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. 10.1162/neco.1997.9.8.1735.
- Blaschke T.; Olivecrona M.; Engkvist O.; Bajorath J.; Chen H. Application of Generative Autoencoder in De Novo Molecular Design. Mol. Inf. 2018, 37, 1700123 10.1002/minf.201700123.
- Sanchez-Lengeling B.; Outeiral C.; Guimaraes G. L.; Aspuru-Guzik A. Optimizing distributions over molecular space. An Objective-Reinforced Generative Adversarial Network for Inverse-design Chemistry (ORGANIC), 2017. https://chemrxiv.org/engage/chemrxiv/article-details/60c73d91702a9beea7189bc2.
- Winter R.; Montanari F.; Steffen A.; Briem H.; Noé F.; Clevert D.-A. Efficient multi-objective molecular optimization in a continuous latent space. Chem. Sci. 2019, 10, 8016–8024. 10.1039/C9SC01928F.
- Krenn M.; Häse F.; Nigam A.; Friederich P.; Aspuru-Guzik A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 2020, 1, 045024.
- Jin W.; Barzilay R.; Jaakkola T. Junction Tree Variational Autoencoder for Molecular Graph Generation. 2021, arXiv:1802.04364. arXiv.org e-Print archive. https://arxiv.org/abs/1802.04364.
- Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31–36. 10.1021/ci00057a005.
- Holliday J. D.; Hu C. Y.; Willett P. Grouping of coefficients for the calculation of inter-molecular similarity and dissimilarity using 2D fragment bit-strings. Comb. Chem. High Throughput Screening 2002, 5, 155–166. 10.2174/1386207024607338.
- Thanh-Tung H.; Tran T. On Catastrophic Forgetting and Mode Collapse in Generative Adversarial Networks. 2018, arXiv:1807.04015. arXiv.org e-Print archive. https://arxiv.org/abs/1807.04015.
- Breiman L. Random Forests. Mach. Learn. 2001, 45, 5–32. 10.1023/A:1010933404324.
- Gaulton A.; Hersey A.; Nowotka M.; Bento A. P.; Chambers J.; Mendez D.; Mutowo P.; Atkinson F.; Bellis L. J.; Cibrian-Uhalte E.; Davies M.; Dedman N.; Karlsson A.; Magarinos M. P.; Overington J. P.; Papadatos G.; Smit I.; Leach A. R. The ChEMBL database in 2017. Nucleic Acids Res. 2017, 45, D945–D954. 10.1093/nar/gkw1074.
- Zeng X.; Zhang P.; He W.; Qin C.; Chen S.; Tao L.; Wang Y.; Tan Y.; Gao D.; Wang B.; Chen Z.; Chen W.; Jiang Y. Y.; Chen Y. Z. NPASS: natural product activity and species source database for natural product research, discovery and tool development. Nucleic Acids Res. 2018, 46, D1217–D1222. 10.1093/nar/gkx1026.
- Christ C. D.; Zentgraf M.; Kriegl J. M. Mining electronic laboratory notebooks: analysis, retrosynthesis, and reaction based enumeration. J. Chem. Inf. Model. 2012, 52, 1745–1756. 10.1021/ci300116p.
- Coley C. W.; Barzilay R.; Jaakkola T. S.; Green W. H.; Jensen K. F. Prediction of Organic Reaction Outcomes Using Machine Learning. ACS Cent. Sci. 2017, 3, 434–443. 10.1021/acscentsci.7b00064.
- Proudfoot J. R. Molecular Complexity and Retrosynthesis. J. Org. Chem. 2017, 82, 6968–6971. 10.1021/acs.joc.7b00714.
- Cadeddu A.; Wylie E. K.; Jurczak J.; Wampler-Doty M.; Grzybowski B. A. Organic chemistry as a language and the implications of chemical linguistics for structural and retrosynthetic analyses. Angew. Chem., Int. Ed. 2014, 53, 8108–8112. 10.1002/anie.201403708.
- Todd M. H. Computer-aided organic synthesis. Chem. Soc. Rev. 2005, 34, 247–266. 10.1039/b104620a.
- Nair V. H.; Schwaller P.; Laino T. Data-driven Chemical Reaction Prediction and Retrosynthesis. Chimia 2019, 73, 997–1000. 10.2533/chimia.2019.997.
- Warr W. A. A Short Review of Chemical Reaction Database Systems, Computer-Aided Synthesis Design, Reaction Prediction and Synthetic Feasibility. Mol. Inf. 2014, 33, 469–476. 10.1002/minf.201400052.
- Szymkuć S.; Gajewska E. P.; Klucznik T.; Molga K.; Dittwald P.; Startek M.; Bajczyk M.; Grzybowski B. A. Computer-Assisted Synthetic Planning: The End of the Beginning. Angew. Chem., Int. Ed. 2016, 55, 5904–5937. 10.1002/anie.201506101.
- Badowski T.; Molga K.; Grzybowski B. A. Selection of cost-effective yet chemically diverse pathways from the networks of computer-generated retrosynthetic plans. Chem. Sci. 2019, 10, 4640–4651. 10.1039/C8SC05611K.
- Segler M. H. S.; Waller M. P. Neural-Symbolic Machine Learning for Retrosynthesis and Reaction Prediction. Chem. - Eur. J. 2017, 23, 5966–5971. 10.1002/chem.201605499.
- Shibukawa R.; Ishida S.; Yoshizoe K.; Wasa K.; Takasu K.; Okuno Y.; Terayama K.; Tsuda K. CompRet: a comprehensive recommendation framework for chemical synthesis planning with algorithmic enumeration. J. Cheminf. 2020, 12, 52 10.1186/s13321-020-00452-5.
- Zheng S.; Rao J.; Zhang Z.; Xu J.; Yang Y. Predicting Retrosynthetic Reactions Using Self-Corrected Transformer Neural Networks. J. Chem. Inf. Model. 2020, 60, 47–55. 10.1021/acs.jcim.9b00949.
- Lee A. A.; Yang Q.; Sresht V.; Bolgar P.; Hou X.; Klug-McLeod J. L.; Butler C. R. Molecular Transformer unifies reaction prediction and retrosynthesis across pharma chemical space. Chem. Commun. 2019, 55, 12152–12155. 10.1039/C9CC05122H.
- Bai R.; Zhang C.; Wang L.; Yao C.; Ge J.; Duan H. Transfer Learning: Making Retrosynthetic Predictions Based on a Small Chemical Reaction Dataset Scale to a New Level. Molecules 2020, 25, 2357 10.3390/molecules25102357.
- Fukunishi Y.; Kurosawa T.; Mikami Y.; Nakamura H. Prediction of synthetic accessibility based on commercially available compound databases. J. Chem. Inf. Model. 2014, 54, 3259–3267. 10.1021/ci500568d.
- Genheden S.; Thakkar A.; Chadimova V.; Reymond J. L.; Engkvist O.; Bjerrum E. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J. Cheminf. 2020, 12, 70 10.1186/s13321-020-00472-1.
- Watson I. A.; Wang J.; Nicolaou C. A. A retrosynthetic analysis algorithm implementation. J. Cheminf. 2019, 11, 1 10.1186/s13321-018-0323-6.
- Ghiandoni G. M.; Bodkin M. J.; Chen B.; Hristozov D.; Wallace J. E. A.; Webster J.; Gillet V. J. RENATE: A Pseudo-retrosynthetic Tool for Synthetically Accessible de Novo Design. Mol. Inf. 2021, e2100207.
- Gao W.; Coley C. W. The Synthesizability of Molecules Proposed by Generative Models. J. Chem. Inf. Model. 2020, 60, 5714–5723. 10.1021/acs.jcim.0c00174.
- Kearney S. E.; Zahoranszky-Kohalmi G.; Brimacombe K. R.; Henderson M. J.; Lynch C.; Zhao T.; Wan K. K.; Itkin Z.; Dillon C.; Shen M.; Cheff D. M.; Lee T. D.; Bougie D.; Cheng K.; Coussens N. P.; Dorjsuren D.; Eastman R. T.; Huang R.; Iannotti M. J.; Karavadhi S.; Klumpp-Thomas C.; Roth J. S.; Sakamuru S.; Sun W.; Titus S. A.; Yasgar A.; Zhang Y. Q.; Zhao J.; Andrade R. B.; Brown M. K.; Burns N. Z.; Cha J. K.; Mevers E. E.; Clardy J.; Clement J. A.; Crooks P. A.; Cuny G. D.; Ganor J.; Moreno J.; Morrill L. A.; Picazo E.; Susick R. B.; Garg N. K.; Goess B. C.; Grossman R. B.; Hughes C. C.; Johnston J. N.; Joullie M. M.; Kinghorn A. D.; Kingston D. G. I.; Krische M. J.; Kwon O.; Maimone T. J.; Majumdar S.; Maloney K. N.; Mohamed E.; Murphy B. T.; Nagorny P.; Olson D. E.; Overman L. E.; Brown L. E.; Snyder J. K.; Porco J. A. Jr.; Rivas F.; Ross S. A.; Sarpong R.; Sharma I.; Shaw J. T.; Xu Z.; Shen B.; Shi W.; Stephenson C. R. J.; Verano A. L.; Tan D. S.; Tang Y.; Taylor R. E.; Thomson R. J.; Vosburg D. A.; Wu J.; Wuest W. M.; Zakarian A.; Zhang Y.; Ren T.; Zuo Z.; Inglese J.; Michael S.; Simeonov A.; Zheng W.; Shinn P.; Jadhav A.; Boxer M. B.; Hall M. D.; Xia M.; Guha R.; Rohde J. M. Canvass: A Crowd-Sourced, Natural-Product Screening Library for Exploring Biological Space. ACS Cent. Sci. 2018, 4, 1727–1741. 10.1021/acscentsci.8b00747.
- Cameron L. P.; Tombari R. J.; Lu J.; Pell A. J.; Hurley Z. Q.; Ehinger Y.; Vargas M. V.; McCarroll M. N.; Taylor J. C.; Myers-Turnbull D.; Liu T.; Yaghoobi B.; Laskowski L. J.; Anderson E. I.; Zhang G.; Viswanathan J.; Brown B. M.; Tjia M.; Dunlap L. E.; Rabow Z. T.; Fiehn O.; Wulff H.; McCorvy J. D.; Lein P. J.; Kokel D.; Ron D.; Peters J.; Zuo Y.; Olson D. E. A non-hallucinogenic psychedelic analogue with therapeutic potential. Nature 2021, 589, 474–479. 10.1038/s41586-020-3008-z.
- Rusnak D. W.; Lackey K.; Affleck K.; Wood E. R.; Alligood K. J.; Rhodes N.; Keith B. R.; Murray D. M.; Knight W. B.; Mullin R. J.; Gilmer T. M. The effects of the novel, reversible epidermal growth factor receptor/ErbB-2 tyrosine kinase inhibitor, GW2016, on the growth of human normal and tumor-derived cell lines in vitro and in vivo. Mol. Cancer Ther. 2001, 1, 85–94.
- Wang Z.; Yang H.; Wu Z.; Wang T.; Li W.; Tang Y.; Liu G. In Silico Prediction of Blood-Brain Barrier Permeability of Compounds by Machine Learning and Resampling Methods. ChemMedChem 2018, 13, 2189–2201. 10.1002/cmdc.201800533.
- Kingma D. P.; Ba J. Adam: A Method for Stochastic Optimization. 2014, arXiv:1412.6980v9. arXiv.org e-Print archive. https://arxiv.org/abs/1412.6980v9.
- Arjovsky M.; Chintala S.; Bottou L. Wasserstein GAN. 2017, arXiv:1701.07875v3. arXiv.org e-Print archive. https://arxiv.org/abs/1701.07875v3.
- Agarap A. F. Deep Learning using Rectified Linear Units (ReLU). 2018, arXiv:1803.08375. arXiv.org e-Print archive. https://arxiv.org/abs/1803.08375.
- Bickerton G. R.; Paolini G. V.; Besnard J.; Muresan S.; Hopkins A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 2012, 4, 90–98. 10.1038/nchem.1243.
- Gadaleta D.; Mangiatordi G. F.; Catto M.; Carotti A.; Nicolotti O. Applicability Domain for QSAR models: where theory meets reality. Int. J. Quant. Struct.-Prop. Relat. 2016, 1, 45–63. 10.4018/IJQSPR.2016010102.
- Wang Z.; Chen J.; Hong H. Developing QSAR Models with Defined Applicability Domains on PPARγ Binding Affinity Using Large Data Sets and Machine Learning Algorithms. Environ. Sci. Technol. 2021, 55, 6857–6866. 10.1021/acs.est.0c07040.
- Wang Z.; Chen J.; Hong H. Developing QSAR Models with Defined Applicability Domains on PPARgamma Binding Affinity Using Large Data Sets and Machine Learning Algorithms. Environ. Sci. Technol. 2021, 55, 6857–6866. 10.1021/acs.est.0c07040.
- Warr W. A. Scientific workflow systems: Pipeline Pilot and KNIME. J. Comput.-Aided Mol. Des. 2012, 26, 801–804. 10.1007/s10822-012-9577-7.
- Silverman R. The Organic Chemistry of Drug Design and Drug Action; Elsevier, 2004.
- Brown N. Bioisosteres in Medicinal Chemistry; Wiley-VCH Verlag GmbH & Co. KGaA, 2012.
- Langmuir I. Isomorphism, Isosterism and Covalence. J. Am. Chem. Soc. 1919, 41, 1543–1559. 10.1021/ja02231a009.
- Erlenmeyer H.; Willi E. Zusammenhänge zwischen Konstitution und Wirkung bei Pyrazolonderivaten. Helv. Chim. Acta 1935, 18, 740–743. 10.1002/hlca.193501801101.
- Erlenmeyer H.; Leo M. Über Pseudoatome. Helv. Chim. Acta 1932, 15, 1171–1186. 10.1002/hlca.193201501132.
- Doble M.; Kruthiventi A. K.; Gajanan V. Biotransformations and Bioprocesses; CRC Press, 2004.
- Tyagarajan S.; Lowden C. T.; Peng Z.; Dykstra K. D.; Sherer E. C.; Krska S. W. Heterocyclic Regioisomer Enumeration (HREMS): A Cheminformatics Design Tool. J. Chem. Inf. Model. 2015, 55, 1130–1135. 10.1021/acs.jcim.5b00162.
- Topliss J. G. Utilization of operational schemes for analog synthesis in drug design. J. Med. Chem. 1972, 15, 1006–1011. 10.1021/jm00280a002.
- Schönherr H.; Cernak T. Profound Methyl Effects in Drug Discovery and a Call for New C–H Methylation Reactions. Angew. Chem., Int. Ed. 2013, 52, 12256–12267. 10.1002/anie.201303207.
- de Sena M. Pinheiro P.; Rodrigues A. D.; Maia R. C.; Thota S.; Fraga C. A. M. The Use of Conformational Restriction in Medicinal Chemistry. Curr. Top. Med. Chem. 2019, 19, 1712–1733. 10.2174/1568026619666190712205025.
- Liu D.; Jiang H.; Chen K.; Ji R. A New Approach to Design Virtual Combinatorial Library with Genetic Algorithm Based on 3D Grid Property. J. Chem. Inf. Comput. Sci. 1998, 38, 233–242. 10.1021/ci970086o.
- Anon eMolecules, 2022. https://www.emolecules.com/info/plus/download-database.
- Anon ChEMBL, 2022. https://chembl.gitbook.io/chembl-interface-documentation/downloads.
- Bemis G. W.; Murcko M. A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996, 39, 2887–2893. 10.1021/jm9602928.
- Sheridan R. P.; Hunt P.; Culberson J. C. Molecular Transformations as a Way of Finding and Exploiting Consistent Local QSAR. J. Chem. Inf. Model. 2006, 46, 180–192. 10.1021/ci0503208.
- Hartenfeller M.; Eberle M.; Meier P.; Nieto-Oberhuber C.; Altmann K. H.; Schneider G.; Jacoby E.; Renner S. A collection of robust organic synthesis reactions for in silico molecule design. J. Chem. Inf. Model. 2011, 51, 3093–3098. 10.1021/ci200379p.
- Stine R. An Introduction to Bootstrap Methods: Examples and Ideas. Sociol. Methods Res. 1989, 18, 243–291. 10.1177/0049124189018002003.
- Polykovskiy D.; Zhebrak A.; Sanchez-Lengeling B.; Golovanov S.; Tatanov O.; Belyaev S.; Kurbanov R.; Artamonov A.; Aladinskiy V.; Veselov M.; Kadurin A.; Johansson S.; Chen H.; Nikolenko S.; Aspuru-Guzik A.; Zhavoronkov A. Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models. Front. Pharmacol. 2020, 11, 565644 10.3389/fphar.2020.565644.
- Brown N.; Fiscato M.; Segler M. H. S.; Vaucher A. C. GuacaMol: Benchmarking Models for de Novo Molecular Design. J. Chem. Inf. Model. 2019, 59, 1096–1108. 10.1021/acs.jcim.8b00839.
- Preuer K.; Renz P.; Unterthiner T.; Hochreiter S.; Klambauer G. Frechet ChemNet Distance: A Metric for Generative Models for Molecules in Drug Discovery. J. Chem. Inf. Model. 2018, 58, 1736–1741. 10.1021/acs.jcim.8b00234.
- Lewell X. Q.; Judd D. B.; Watson S. P.; Hann M. M. RECAP-Retrosynthetic Combinatorial Analysis Procedure: A Powerful New Technique for Identifying Privileged Molecular Fragments with Useful Applications in Combinatorial Chemistry. J. Chem. Inf. Comput. Sci. 1998, 38, 511–522. 10.1021/ci970429i.
- Neil D.; Segler M. H. S.; Guasch L.; Ahmed M.; Plumbley D.; Sellwood M.; Brown N. Exploring Deep Recurrent Models with Reinforcement Learning for Molecule Design; ICLR, 2018.
- Higa G. M.; Abraham J. Lapatinib in the treatment of breast cancer. Expert Rev. Anticancer Ther. 2007, 7, 1183–1192. 10.1586/14737140.7.9.1183.
- Saleem A.; Searle G. E.; Kenny L. M.; Huiban M.; Kozlowski K.; Waldman A. D.; Woodley L.; Palmieri C.; Lowdell C.; Kaneko T.; Murphy P. S.; Lau M. R.; Aboagye E. O.; Coombes R. C. Lapatinib access into normal brain and brain metastases in patients with Her-2 overexpressing breast cancer. EJNMMI Res. 2015, 5, 30 10.1186/s13550-015-0103-5.
- Aniceto N.; Freitas A. A.; Bender A.; Ghafourian T. A novel applicability domain technique for mapping predictive reliability across the chemical space of a QSAR: reliability-density neighbourhood. J. Cheminf. 2016, 8, 69 10.1186/s13321-016-0182-y.
- Chen X.-Y.; Ozturk S.; Sorensen E. J. Synthesis of Fluorenones from Benzaldehydes and Aryl Iodides: Dual C–H Functionalizations Using a Transient Directing Group. Org. Lett. 2017, 19, 1140–1143. 10.1021/acs.orglett.7b00161.
- Singer R. A. Commercial Development of Axitinib (AG-013736): Optimization of a Convergent Pd-Catalyzed Coupling Assembly and Solid Form Challenges. In Transition Metal-Catalyzed Couplings in Process Chemistry; Wiley, 2003; pp 165–180.
- Urbina F.; Zorn K. M.; Brunner D.; Ekins S. Comparing the Pfizer Central Nervous System Multiparameter Optimization Calculator and a BBB Machine Learning Model. ACS Chem. Neurosci. 2021, 12, 2247–2253. 10.1021/acschemneuro.1c00265.
- Gupta A.; Muller A. T.; Huisman B. J. H.; Fuchs J. A.; Schneider P.; Schneider G. Erratum: Generative Recurrent Networks for De Novo Drug Design. Mol. Inf. 2018, 37, 1880141 10.1002/minf.201880141.
- Gupta A.; Muller A. T.; Huisman B. J. H.; Fuchs J. A.; Schneider P.; Schneider G. Generative Recurrent Networks for De Novo Drug Design. Mol. Inf. 2018, 37, 1700111 10.1002/minf.201700111.
- Bjerrum E. J.; Threlfall R. Molecular Generation with Recurrent Neural Networks (RNNs). 2017, arXiv:1705.04612. arXiv.org e-Print archive. https://arxiv.org/abs/1705.04612.
- Domenico A.; Nicola G.; Daniela T.; Fulvio C.; Nicola A.; Orazio N. De Novo Drug Design of Targeted Chemical Libraries Based on Artificial Intelligence and Pair-Based Multiobjective Optimization. J. Chem. Inf. Model. 2020, 60, 4582–4593. 10.1021/acs.jcim.0c00517.
- Maziarka Ł.; Pocha A.; Kaczmarczyk J.; Rataj K.; Danel T.; Warchol M. Mol-CycleGAN: a generative model for molecular optimization. J. Cheminf. 2020, 12, 2 10.1186/s13321-019-0404-1.
- Zhavoronkov A.; Ivanenkov Y. A.; Aliper A.; Veselov M. S.; Aladinskiy V. A.; Aladinskaya A. V.; Terentiev V. A.; Polykovskiy D. A.; Kuznetsov M. D.; Asadulaev A.; Volkov Y.; Zholus A.; Shayakhmetov R. R.; Zhebrak A.; Minaeva L. I.; Zagribelnyy B. A.; Lee L. H.; Soll R.; Madge D.; Xing L.; Guo T.; Aspuru-Guzik A. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 2019, 37, 1038–1040. 10.1038/s41587-019-0224-x.
- Putin E.; Asadulaev A.; Vanhaelen Q.; Ivanenkov Y.; Aladinskaya A. V.; Aliper A.; Zhavoronkov A. Adversarial Threshold Neural Computer for Molecular de Novo Design. Mol. Pharmaceutics 2018, 15, 4386–4397. 10.1021/acs.molpharmaceut.7b01137.
- Durrant J. D.; Lindert S.; McCammon J. A. AutoGrow 3.0: an improved algorithm for chemically tractable, semi-automated protein inhibitor design. J. Mol. Graphics Modell. 2013, 44, 104–112. 10.1016/j.jmgm.2013.05.006.
- Clark A. M.; Dole K.; Coulon-Spector A.; McNutt A.; Grass G.; Freundlich J. S.; Reynolds R. C.; Ekins S. Open source bayesian models: 1. Application to ADME/Tox and drug discovery datasets. J. Chem. Inf. Model. 2015, 55, 1231–1245. 10.1021/acs.jcim.5b00143.
- Saubern S.; Guha R.; Baell J. B. KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries. Mol. Inf. 2011, 30, 847–850. 10.1002/minf.201100076.
- Urbina F.; Lentzos F.; Invernizzi C.; Ekins S. Dual use of artificial-intelligence-powered drug discovery. Nat. Mach. Intell. 2022, 4, 189–191. 10.1038/s42256-022-00465-9.
- Brown T. B.; Mann B.; Ryder N.; Subbiah M.; Kaplan J.; Dhariwal P.; Neelakantan A.; Shyam P.; Sastry G.; Askell A.; Agarwal S. R.; Herbert-Voss A.; Krueger G.; Henighan T.; Child R.; Ramesh A.; Ziegler D. M.; Wu J.; Winter C.; Hesse C.; Chen M.; Sigler E.; Litwin M.; Gray S.; Chess B.; Clark J.; Berner C.; McCandlish S.; Radford A.; Sutskever I.; Amodei D. Language Models are Few-Shot Learners. 2020, arXiv:2005.14165. arXiv.org e-Print archive. https://arxiv.org/abs/2005.14165.