Skip to main content
ACS Omega logoLink to ACS Omega
. 2021 May 4;6(19):12557–12566. doi: 10.1021/acsomega.1c00477

Artificial Intelligence-Guided De Novo Molecular Design Targeting COVID-19

Srilok Srinivasan , Rohit Batra , Henry Chan †,, Ganesh Kamath , Mathew J Cherukara †,§, Subramanian K R S Sankaranarayanan †,∥,*
PMCID: PMC8154149  PMID: 34056406

Abstract

graphic file with name ao1c00477_0009.jpg

An extensive search for active therapeutic agents against the SARS-CoV-2 is being conducted across the globe. While computational docking simulations remain a popular method of choice for the in silico ligand design and high-throughput screening of therapeutic agents, it is severely limited in the discovery of new candidate ligands owing to the high computational cost and vast chemical space. Here, we present a de novo molecular design strategy that leverages artificial intelligence (AI) to discover new therapeutic agents against SARS-CoV-2. A Monte Carlo tree search algorithm combined with a multitask neural network surrogate model for expensive docking simulations, and recurrent neural networks for rollouts, is used in an iterative search and retrain strategy. Using Vina scores as the target objective to measure binding to either the isolated spike protein (S-protein) at its host receptor region or to the S-protein/angiotensin converting enzyme 2 receptor interface, we generate several (∼100’s) new therapeutic agents that outperform Food and Drug Administration (FDA) (∼1000’s) and non-FDA molecules (∼million). Our AI strategy is broadly applicable for accelerated design and discovery of chemical molecules with any user-desired functionality.

1. Introduction

In view of the unprecedented human and economic loss resulting from the COVID-19 pandemic, there is an extensive amount of research being conducted across the globe to design effective therapeutic agents and vaccines against SARS-CoV-2, the virus causing the disease. Computational docking simulations have traditionally been used for in silico ligand design and remain a popular method of choice for high-throughput screening of therapeutic agents in the fight against COVID-19.1 However, given the relatively high computing cost of docking simulations, one can typically only screen a few 1000 molecules even with the use of leadership computing resources (for instance, summit at ORNL was used by Smith and Smith1 for their ensemble docking calculations). Despite the vast chemical space2 (millions to billions of molecules) that can be potentially explored, we remain severely limited in the search of new candidate compounds owing to the high computational cost of the ensemble docking simulations employed in traditional in silico approaches.1,3

A crucial first step toward the development of new therapeutic ligands is the identification and characterization of new candidate compounds that are synthetically feasible and effective for the proposed application. Despite advances in high-throughput screening, only a small fraction of the enormous possibilities of organic compounds have been explored. More recently, advances in artificial intelligence (AI) have accelerated this exploration by building cheaper surrogate models, that is, machine learning (ML) models (instead of expensive docking simulations) that are trained against a reasonably elaborate set of training data set derived from ensemble docking simulations.4 A variety of AI methods have been applied to the problem of generating and evaluating new molecules including reinforcement learning,5,6 Bayesian molecular design,7 and generative models, such as variational autoencoders8,9 and recurrent networks.5,10,11 We have shown in a recent work4 that a machine learnt model can be used to quickly and effectively screen active agents, from an existing database, for their potential effectiveness against SARS-CoV-2 based on binding affinity to either the isolated S-protein at its host receptor region or to the S-protein/angiotensin converting enzyme 2 (ACE2) interface complex, thereby potentially limiting and/or disrupting the host–virus interactions. Other strategies for repurposing existing therapeutics based on the principle of “neighborhood behavior” in the chemical space,1214 docking simulations,15,16 quantum mechanical scoring,17 and sequence comparison statistics18 were also recently reported. While these examples are effective in screening and repurposing the molecules within an existing database, we are still limited within the known chemical space.

Here, we demonstrate that one can take full advantage of the recent advances in AI and, in particular, deep learning methods [multi-task neural network (MTNN) and recurrent neural network (RNN)] to efficiently explore the elaborate chemical phase space. We draw inspiration from the success of AI algorithms such as Monte Carlo tree search (MCTS), which have proven to be extremely effective in computer games (Go Game and several others)19 against human competitors. Our AI-based de novo design strategy allows us to rapidly explore the chemical space and identify dozens of new candidates that are both physically viable and outperform existing Food and Drug Administration (FDA) and non-FDA molecules in terms of their Vina scores obtained from the docking simulations. Although our workflow here is demonstrated in the context of accelerated discovery of therapeutic agents against SARS-CoV-2, the approach is much broader and can be applied generally for the inverse design of chemical molecules with user-desired properties.

2. Computational Methods

A schematic of our workflow is shown in Figure 1. Our de novo molecular design is made possible by deploying a cheap surrogate MTNN model in place of docking simulations and exploring the exhaustive SMILES chemical space of millions of prospective molecules using AI algorithms such as MCTS with rollout(s) using RNN. We represent the SMILES strings of the explored molecules as a tree with each character representing a node, so that, any path from the head node to the terminal nodes represents a unique SMILES string. The goal of MCTS is to discover new ligands that strongly bind to the S-protein of the SARS-CoV-2 virus and thereby shield it from the human receptors. The strength of S-protein binding is quantified by the binding Vina scores as defined by Smith and Smith.1 A MTNN model, trained to predict the binding Vina scores for a given SMILES string, is used to evaluate the merit of the nodes. The accuracy of the MTNN model, particularly for molecules with strong binding, is improved using transfer learning strategies. The different components of our workflow are discussed in more detail below.

Figure 1.

Figure 1

(a) Schematic illustration of the MCTS algorithm. The head node contains a dummy character &. A RNN model is used to complete the partial SMILES string, obtained by traversing the tree from the head node to the leaf node selected. S-protein binding Vina scores for the valid SMILES strings are predicted using a MTNN model described in Figure 2, (b) Venn diagram representation of phase space of all the possible SMILES, valid SMILES, and the ligands with high binding energy.

2.1. Monte Carlo Tree Search

We define our search domain as the space of all the possible combinations of “building blocks” of SMILES corresponding to the atoms, number of bonds, rings, and branching. The “building blocks” are identified as a list of unique and indivisible chemical units gathered from SMILES strings within binding DB20 and ZINC21 database. Our search space of SMILES strings contains 174 unique building blocks (see Supporting Information S4). We use the MCTS algorithm within the ChemTS22 python library for molecular generation to sample from this search space. Each node of the MCTS tree contains a “building block” of the SMILES string and a upper confidence bound (UCB) score defined as

2.1. 1

where vi is the number of visits to the node i. The constant C determines the balance between the exploration and exploitation of the tree. The merit of a node, wi, is calculated as

2.1. 2

The head node and the terminal node contain dummy characters “&” and “$”, respectively. Each branch of the tree corresponds to a unique SMILES string representation. The complete SMILES string is built by traversing the tree from the head node to the terminal nodes. Figure 1a illustrates the selection, expansion, playout, and update scores steps of the MCTS. ChemTS22 uses a RNN model during the playouts to sequentially determine the subsequent nodes until the terminal node is reached. For example, if the playout is performed at a node depth of d, the RNN uses a partial SMILES string Spartial = s1, ..., sd, obtained by traversing the path from the head node, to compute a probability distribution for the subsequent building block sd+1. The partial SMILES string is then extended by sampling from the probability distribution and a new probability distribution for sd+2 is computed using the extended partial SMILES string Spartial = s1, ..., sd, sd+1. The complete SMILES string is constructed by iteratively extending until the terminal symbol is obtained. At the end of the playout, we obtain a list of the complete SMILES string. We eliminate the ones which do not satisfy the SMILES grammar before computing the binding energies using the MTNN model. The node scores are computed as

2.1. 3

where Si is the complete SMILES string after playout, BVS(Si) is the S-protein binding Vina score predicted by the MTNN, LP(Si) is a penalty term for large log P values computed using eq 4, and RP(Si) is a penalty term for a large number of rings computed using eq 5.

2.1. 4
2.1. 5

Here, ⟨log P⟩ and NRavg are the mean log P and the number of rings of the molecules in the BindingDB20 and ZINC database.21 NRstd is the standard deviation of the number of rings.

The final step of the MCTS cycle involves updating the UCB scores of all the nodes in the tree. The MCTS cycle of selection, expansion, playouts, and updating the scores is continued until a desired stopping criteria is met.

2.2. RNN for SMILES

The RNN maps an input string s1, ..., sd to a probability distribution P(yd+1 = j) = gj(hd), where hd+1 = f(hd, xd), gj is a softmax activation function and xd is a one-hot coded vector of length 175 representing the input sd. The architecture of the RNN is the same as the one used in the original ChemTS22 framework where two gated recurrent units with a 256 dimensional hidden state are stacked on top of each other and represents the function f.

The training data set consist of the 250k molecules from ZINC database21 used in original ChemTS and an additional 800k molecules from BindingDB database.20 The training of the RNN is performed by minimizing the following loss function

2.2. 6

where D is the relative entropy, xi is the ith SMILES string in the training data, and θ represents the network parameters. The training is performed using a ADAM23 optimizer with a batch size of 512.

2.3. MTNN Model for Binding Energy

A MTNN model was used as a surrogate to quickly predict Vina scores of the sampled candidate SMILES strings during the MCTS search. The architecture of the MTNN model is shown in Figure 2a. The input layer consists of 358-dimensional molecular descriptor, followed by a series of dense layers (with/without dropout) diverging into two branches, one for the isolated S-protein and the other for the interface Vina score predictions. The molecular descriptor was obtained based on our past experience on fingerprinting organic materials, including polymers.24 Our fingerprinting scheme uses the SMILES representation of the molecule to capture its chemical information at multiple length scales, that is, atomic, block-level, and at a slightly higher morphological level. More details on the descriptor definition can be obtained elsewhere.24 Superior performance of multitask networks over their single-task counterparts for several different classes of supervised learning problems motivates the use of such a network architecture here.25 The MTNN model, as implemented in TensorFlow,26 was trained using the data set from Smith and Smith,1 which consists of Vina scores for the S-protein and the S-protein/ACE2 interface systems for nearly 8000 molecules computed using docking simulations. The loss function consisted of the sum of the mean absolute errors (MAEs) of the S-protein and the interface scores, while the root mean square error and correlation coefficient (R2) were computed to measure model performance. To avoid overfitting, the MTNN model used dropout layers. Further, the number of training epochs was determined by tracking the loss function on the validation set (20% of the Smith dataset). The ADAM optimizer23 was used to train the model. We note that the Vina scores for many molecules within the Smith data set were reported to be extremely high (as much as 1,000,000 kcal/mol), while those with favorable binding energetics have Vina scores roughly between −7 and 0 kcal/mol. In order to remove such skewness in the data and train the MTNN model geared toward identifying favorable molecules with lower Vina scores, data points with scores greater than 5 kcal/mol were excluded during the training (by masking their contribution to the cost function). Further for a few cases, the SMILES representation could not be resolved correctly and were filtered out. Overall, the MTNN model when trained on 80% of the Smith data set, resulted in MAEs of 0.21 and 0.81 kcal/mol in Vina scores of the S-protein and the interface systems. Parity plots demonstrating the learning trends for both the training and the validation sets are included in the Supporting Information (Figure S1).

Figure 2.

Figure 2

(a) MTNN architecture to predict Vina scores of different molecules. (b) Parity plot of the S-protein and interface ML models for the training and the test sets, demonstrating the good prediction accuracy achieved by both the ML models. RMSE: root mean square error, MAE: mean absolute error, and R2: correlation coefficient.

2.4. Transfer Learning to Improve Network Performance

While good overall accuracy was achieved using the MTNN model when trained on the Smith data set, the model performance for extreme cases that are more relevant to this work, that is, ligands with very low S-protein or interface Vina scores, was low. This is a known issue with surrogate models, whose loss function is based on the overall data distribution, rather than a specific/extreme part of it. In order to improve the MTNN performance to reliably identify cases with extremely low Vina scores, we retrained our model by including the newly discovered molecules in the training data using a transfer learning approach. The Vina scores from the docking simulations of 177 screened candidates from MCTS search were combined with 510 randomly selected cases from the Smith data set to obtain the “retraining” data set. With the MTNN model weights initialized from the previous training (using Smith dataset), the model was retrained to reduce error on the new retraining data set. Again, error on the validation set (80% of the retraining data set) was used to track the number of epochs, with the results for the retrained MTNN model on the training and the validation sets, as shown in Figure 2b. Overall MAE of 0.70 and 0.27 kcal/mol, respectively, for the interface and the S-protein Vina scores on the validation set denotes the good accuracy achieved by the MTNN model, which is comparable to the 1 kcal/mol accuracy expected from the reference docking simulations. The MTNN architecture was kept fixed during the retraining step. Extending the training data beyond the distribution of initial data set allows the MTNN model to learn the correlation between the SMILES strings and Vina scores over a larger domain of SMILES space.

3. Results and Discussion

Using the MCTS–MTNN-based workflow described above, we search for molecules with low S-protein binding energies. Our workflow discovered many new molecules, not present in the existing databases and have a high affinity toward the S-protein of the virus. In total, we identified 97,973 unique molecules and their performance, as predicted by the surrogate MTNN model, are shown in the Supporting Information (Figure S4). A large fraction of these newly discovered molecules perform better, that is, have better Vina scores compared to top FDA molecules sampled in ref (27).

In order to validate the MTNN predictions on the sampled space of SMILES strings, we perform docking simulations on the molecules whose S-protein and interface Vina scores are both less than −5.8 kcal/mol. We chose this criteria so that the number of more expensive docking simulations are kept in the order of few hundred while the molecules with high binding affinities, as predicted by MTNN, are also included in the docking simulation set. We perform docking calculations, using the AutoDock Vina software.28 Similar to the work of Smith and Smith,1 we estimate the binding affinities of molecular ligands in the binding pocket region of the SARS-CoV-2 S-protein/ACE2 complex based on the Vina score, a hybrid empirical and knowledge-based scoring function that ranks molecular conformations and estimates the free energy of binding based on intermolecular contributions, for example, steric, hydrophobic, hydrogen bonding, and so forth.28 As such, the Vina score can be used as a proxy that correlates to the binding affinity between molecular ligands and the SARS-CoV-2 S-protein/ACE2 receptor. The structure of the ACE2 receptor has a protein data bank ID PDB: 2AJF whereas the structure of the SARS-CoV-2 S-protein has a NCBI Reference Sequence YP_009724390.1, which has the necessary mutations from its predecessor SARS variety SARS-CoV, namely, at L(455), F(486), Q(493), S(494), and N(501), respectively. A total of 12 docking receptors that include six conformations of the SARS-CoV-2 S-protein/ACE2 interface and the corresponding isolated SARS-CoV-2 S-protein receptors, that is, with the ACE2 receptor removed, are used in our docking calculations. These docking receptor conformations are obtained from Smith and Smith,1 which were originally sampled using root mean squared displacement based clustering from 1.3 microsecond long all-atom temperature replica exchange GROMACS simulations of the SARS-CoV-2 S-protein/ACE2 complex in water. Details regarding the construction and modeling of the complex are described here.1 The structure of the docking ligands is converted from SMILES string using Open Babel software.29 As in the work of Smith and Smith,1 we define a 1.2 nm × 1.2 nm × 1.2 nm search space in our docking calculation setup, which encompasses the binding pocket located at the S-protein/ACE2 interface. The same search space is explored for the isolated SARS-CoV-2 S-protein receptor cases. In our docking procedure, we find the top 10 optimized docking configurations for each molecule candidate and pick the best-scoring configuration as the predicted Vina score.

The Vina scores computed using docking simulations for the 177 molecules that satisfies the above criteria are compared against the FDA approved drugs and BindingDB data set, as shown in Figure 3. Our strategy of using a surrogate MTNN model to quickly screen and steer the MCTS search algorithm toward a promising region in the SMILES string space has successfully identified molecules with a better binding affinity than the existing ones (Figure 3). However, we note that the MAE between the MTNN model predictions and the docking simulations are 1.19 and 2.38 kcal/mol for the S-protein and interface Vina scores, respectively (see Supporting Information, Figure S2). Such high errors in the MTNN predictions are expected because the sampled molecules are not representative of the data distribution that was used to train the model but instead lie in a region with extreme Vina scores. In other words, the MTNN model is being perhaps utilized in the extrapolative regime for these new molecules, and its accuracy can be improved using active learning (or retraining).

Figure 3.

Figure 3

Vina scores for the isolated S-protein and S-protein/ACE2 interface using docking calculations. The 170 candidates selected after the first round of RNN + NN models are shown in green, with a few superior candidates displaying low Vina scores. For comparison, previously considered candidates from past works (ref a,1 ref b4) are also included.

To improve the MTNN performance and thereby the efficiency of MCTS search, we retrain the surrogate MTNN model by including the newly discovered molecules in the training data. We use a transfer learning approach as detailed in Section 2.4. Extending the training data beyond the distribution of initial data set allows the MTNN model to learn correlation between the SMILES strings and Vina scores over a larger domain of SMILES space. We next repeat the MCTS search with the improved MTNN model and now identify a significantly larger number of unique molecules (∼300,000) compared to the previous search. The distribution of the MTNN Vina scores for these molecules are shown in Supporting Information (Figure S5). The best molecules were selected by sorting the points by its distance from the line S-proteinVina + interfaceVina ≤ −7.5 kcal/mol, which is the same screening criteria used in ref (4), and selecting the 200 farthest molecules. The Vina scores computed from the docking simulation for these 200 best candidates, sampled using the retrained MTNN model, are compared with the molecules from the first round of MCTS–MTNN as well as FDA approved drugs from the CureFFI database30 and a dataset of common active ingredients from DrugCentral31 (see Figure 4). The MAE in the MTNN predictions are 0.38 and 1.28 kcal/mol for the S-protein and interface Vina scores, respectively, which is less than the original MTNN model. Sampling with an improved MTNN model which can correctly bias the MCTS algorithm to search the promising regions of the SMILES string space thus enables the design of better performing ligands (purple markers in Figure 4).

Figure 4.

Figure 4

Vina scores for the isolated S-protein and S-protein/ACE2 interface using docking calculations. The 200 candidates selected after the second round of RNN + MTNN models are shown in green, with a few superior candidates displaying low Vina scores. For comparison, previously considered candidates from past works (ref a,1 ref b4) are also included.

We note that by extending the data set to include new on-the-fly sampled data and performing the transfer learning, the MTNN model has learnt better the correlation between SMILES string and binding energies. In principle, we can iteratively improve the MTNN model by including diverse molecules sampled by MCTS during transfer learning. The advantage of iterative sampling during any given search cycle is twofold: (i) MCTS designed molecules to perform better with each iteration and (ii) surrogate MTNN model to predict the binding energies get progressively better with each iteration because it learns the correlation over a larger region in the search space. Furthermore, using our MTNN model, we evaluate the performance of some of the commonly used drugs for the treatment of COVID-19 (see Supporting Information, Table S2). Our model successfully predicts the strong binding nature (low Vina scores) of these candidate molecules as revealed by AutoDocking simulations.

Although the goal in this work is to discover molecules with high binding affinities toward the S-protein of the SARS-CoV-2 virus, one can also utilize other metrics related to the druglikeness of the compounds to further determine the viability of the drug. Such analysis are particularly valuable with regard to recent reports32,33 on the performance plateau of classical scoring functions during virtual screening. Here, we evaluate the druglikeness of the identified molecules using the Lipinski’s rule of 5, which requires the following criteria to be met:

  • Number of H-bonds < 5

  • Number of H-acceptors < 10

  • Molecular weight < 500 Da

  • Octanol–water partition coefficient (log P) < 5

Using these above metrics, we further screen the candidates identified by our MCTS–MTNN search. The distribution of the Lipinski attributes for the top 200 molecules are shown in Figure 5. The S-protein Vina scores of the 200 molecules are provided in Table S1 in Supporting Information. Furthermore, the table of the top candidates and their complete Lipinski metrics are provided in the Supporting Information files. Out of the 200 best molecules, 134 molecules satisfy the rule of 5 criteria. Some of the top ranking candidate molecules from this list are depicted in Figure 6.

Figure 5.

Figure 5

Histogram of the Lipinski attributes for the top 200 molecules discovered by the MCTS algorithm. 134 out of the 200 molecules satisfy all the 4 criteria.

Figure 6.

Figure 6

Top candidates identified from this work along with their Vina scores for the S-protein/ACE2 interface (labeled, interface) and the S-protein systems using the ensemble docking simulations. The bottom panel shows the best binding pose for the corresponding molecules at the interface of nCov and ACE2 receptor.

In this work, the therapeutic prowess of the molecules are determined by their Vina scores, which is a measure of the binding energies. The binding energies of the molecules are dictated by the strength of the chemical interaction at the S-protein of the virus as well as the interface formed by the S-protein and human receptors. We identify the common chemical fragments and patterns within the set of strongly binding molecules to gain insights into the chemical features which favors binding. Knowledge of such chemical fragments which make-up the top candidates will allow accelerated design of therapeutic agents against SARS-CoV-2.

First, we analyze the frequency of occurrence of chemical features by splitting each molecule into smaller fragments and counting the number of unique fragments. We use the molecular fragmenting algorithm within RDKit34 to break the molecules across the single bonds and building a catalogue of unique fragments. A fragment can be a part of a bigger fragment which can contribute to further bigger ones. We represent this hierarchy as a directed graph (see Supporting Information S6) and pick only the terminal nodes for our analysis. The frequency of the top 15% fragments within the best molecules sampled by MCTS and the best molecules identified by ref (4) are shown in Figure 7. The strings within the angular brackets describe the functional group attached to the fragment. The fragments found among the MCTS sampled molecules are predominantly different from the molecules identified from the FDA and DrugCentral database in ref (4). Hence, the superior performance of the molecules identified in this work can be partly attributed to the presence of unique fragments favoring strong binding.

Figure 7.

Figure 7

Counts for a given fragment within the 200 best performing molecules sampled by MCTS and the 200 best performing molecules identified from the FDA approved drugs and the DrugCentral database. Only the fragments which are present in more than 15% of the molecules are shown here. The functional group contained within the angular brackets are attached to the fragments.

Based on the fragment distribution, we find that the top candidates from our search tend to include fluoro-alkyl or fluoro-aromatic groups in their molecular makeup. Fluorinated derivatives and substitution are known to play a unique role in medicinal chemistry. The tailored incorporation of a fluoro-alkyl or fluoro-aromatic group (enamine also having cyclic amines) into bioactive compounds influences the binding affinity, stability, lipophilicity, membrane permeability, and bioactivity.35 Examples including some of our top compounds such as ligand 2 and ligand 6 consists of the fluoromethyl group (CF3), which are known to improve adsorption digestion metabolism and extraction and pharmacokinetic properties,35,36 see Figure 6a. It is worth noting that heterocycles and fused rings constitute of about 70% of the drug market with one of the reasons being the limited entropic effects showcased by their molecular rigidity. Nitrogen forms a key atom in such fused rings and heterocycles with a large number of ligands consisting of the azole moiety. The analysis of some of the top fragments for the 200 best performing molecules indicate the presence of unsaturated amines (e.g., ccccn ccccnc), as shown in Figure 7. This suggests the ubiquity of enamine and azole chemistry in drug design—such moieties have had considerable success as antifungal agents. To gain more insights into the performance of the top ranked candidates, we closely analyze our autodocking simulations and evaluate the binding pose of these candidates on the interface of the ACE2 receptor and the virus. We observe that the exocylic nitrogen (NH2) amino group of the cyclic lactum forms hydrogen bond interactions with the serine (S494) amino acid of the n-CoV. In the case of another top-ranked molecule, the benzene moiety interacts with the oxygen of the serine amounting to about −4 to −5 kcal/mol energetic interaction. Based on these observations, we postulate the role of the amino-azole group with the serine residue (S494 one of the five mutating sites from the SARS-CoV 2002 virus), as shown in Figure 6b, to be an influential factor in determining the efficacy of the ligands.

4. Conclusions

In conclusion, we present an AI search strategy (decision trees such as MCTS) to discover therapeutic agents against the SARS-CoV-2 virus. We hypothesize that the therapeutic prowess against the SARS-CoV-2 virus is measured by the binding affinity (Vina scores) of the molecules to either the isolated S-protein of SARS-CoV-2 at its host receptor region or to the S-protein/ACE2 receptor interface. We replace the expensive docking simulations with a surrogate MTNN model to accelerate the exploration of the vast chemical space represented by SMILES strings. The Vina scores predicted by the MTNN surrogate model are used as target objectives in the MCTS–RNN search. Our search and retrain strategy to iteratively explore the design space and concurrently improve the accuracy of the surrogate model is shown to significantly accelerate the discovery; we predict over a 100 new molecules that outperform existing FDA (∼100’s) and non-FDA (∼million) molecules in their therapeutic prowess measured via Vina scores. We perform detailed analysis using other metrics related to the druglikeness of the compounds to further determine the viability of these agents. Finally, we also analyze structural features and identify structural similarities between top performing candidates. We note that our molecular discovery approach using MCTS and RNN is a more general technique which can be applied to design chemical molecules with the desired property as long we have a reasonably fast and accurate surrogate model to screen the SMILES string space and guide the search algorithm.

Acknowledgments

We acknowledge funding from BES Award DE-SC0020201 by DOE to support this research. The use of the Center for Nanoscale Materials, an Office of Science user facility, was supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under contract no. DE-AC02-06CH11357. This research used resources of the National Energy Research Scientific Computing Center, which was supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-05CH11231. An award of computer time was provided by the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program of the Argonne Leadership Computing Facility at the Argonne National Laboratory, which was supported by the Office of Science of the U.S. Department of Energy under contract no. DE-AC02-06CH11357. S.K.R.S.S. acknowledges UIC start-up funds for supporting this research.

Supporting Information Available

The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.1c00477.

  • SMILES building blocks (PDF)

  • Binding energies and Lipinski metrics of top 200 molecules (XLSX)

Author Contributions

S.S., R.B., and S.K.R.S.S. conceived and designed the project. S.S., S.K.R.S.S., and M.J.C. developed the workflow. S.S. performed the sampling of molecules. S.S. and G.K. performed the chemical fragment analysis. R.B. trained the MTNN. H.C. and G.K. performed the docking simulations. S.S., R.B., H.C., G.K., M.J.C., and S.K.R.S.S. interpreted data and wrote the manuscript. All the authors contributed to the preparation of the manuscript.

The authors declare no competing financial interest.

Supplementary Material

ao1c00477_si_001.pdf (819KB, pdf)
ao1c00477_si_002.xlsx (256.2KB, xlsx)

References

  1. Smith M.; Smith J. C.. Repurposing Therapeutics for COVID-19: Supercomputer-Based Docking to the SARS-CoV-2 Viral Spike Protein and Viral Spike Protein–Human ACE2 Interface, 2020.
  2. Reymond J.-L. The chemical space project. Acc. Chem. Res. 2015, 48, 722–730. 10.1021/ar500432k. [DOI] [PubMed] [Google Scholar]
  3. Parks J. M.; Smith J. C. How to discover antiviral drugs quickly. N. Engl. J. Med. 2020, 382, 2261–2264. 10.1056/NEJMcibr2007042. [DOI] [PubMed] [Google Scholar]
  4. Batra R.; Chan H.; Kamath G.; Ramprasad R.; Cherukara M. J.; Sankaranarayanan S. K. R. S. Screening of therapeutic agents for covid-19 using machine learning and ensemble docking studies. J. Phys. Chem. Lett. 2020, 11, 7058–7065. 10.1021/acs.jpclett.0c02278. [DOI] [PubMed] [Google Scholar]
  5. Gupta A.; Müller A. T.; Huisman B. J. H.; Fuchs J. A.; Schneider P.; Schneider G. Generative recurrent networks for de novo drug design. Mol. Inf. 2018, 37, 1700111. 10.1002/minf.201700111. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Ståhl N.; Falkman G.; Karlsson A.; Mathiason G.; Boström J. Deep reinforcement learning for multiparameter optimization in de novo drug design. J. Chem. Inf. Model. 2019, 59, 3166–3176. 10.1021/acs.jcim.9b00325. [DOI] [PubMed] [Google Scholar]
  7. Ikebata H.; Hongo K.; Isomura T.; Maezono R.; Yoshida R. Bayesian molecular design with a chemical language model. J. Comput.-Aided Mol. Des. 2017, 31, 379–391. 10.1007/s10822-016-0008-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Chenthamarakshan V.; Das P.; Padhi I.; Strobelt H.; Lim K. W.; Hoover B.; Hoffman S. C.; Mojsilovic A.. Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models. April 2020, arXiv:2004.01215. arXiv e-Prints. [Google Scholar]
  9. Gómez-Bombarelli R.; Wei J. N.; Duvenaud D.; Hernández-Lobato J. M.; Sánchez-Lengeling B.; Sheberla D.; Aguilera-Iparraguirre J.; Hirzel T. D.; Adams R. P.; Aspuru-Guzik A. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4, 268–276. 10.1021/acscentsci.7b00572. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Bung N.; Krishnan S. R.; Bulusu G.; Roy A.. De Novo Design of New Chemical Entities (NCES) for SARS-COV-2 Using Artificial Intelligence. March 2020, https://chemrxiv.org/articles/De_Novo_Design_of_New_Chemical_Entities_NCEs_for_SARS-CoV-2_Using_Artificial_Intelligence/11998347/1. [DOI] [PMC free article] [PubMed]
  11. Segler M. H. S.; Kogej T.; Tyrchan C.; Waller M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 2018, 4, 120–131. 10.1021/acscentsci.7b00512. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Chakraborti S.; Bheemireddy S.; Srinivasan N.. Repurposing Drugs against Main Protease of SARS-COV-2: Mechanism Based Insights Supported by Available Laboratory and Clinical Data. April 2020, https://chemrxiv.org/articles/Drug_Repurposing_Approach_Targeted_Against_Main_Protease_of_SARS-CoV-2_Exploiting_Neighbourhood_Behaviour_in_3D_Protein_Structural_Space_and_2D_Chemical_Space_of_Small_Molecules/12057846/2. [DOI] [PubMed]
  13. Mahapatra S.; Nath P.; Chatterjee M.; Das N.; Kalita D.; Roy P.; Satapathi S.. Repurposing Therapeutics for COVID-19: Rapid Prediction of Commercially Available Drugs through Machine Learning and Docking. medRxiv. 2020, https://www.medrxiv.org/content/early/2020/04/23/2020.04.05.20054254. [DOI] [PMC free article] [PubMed]
  14. Elmezayen A. D.; Al-Obaidi A.; Şahin A. T.; Yelekçi K. Drug repurposing for coronavirus (covid-19): in silico screening of known drugs against coronavirus 3cl hydrolase and protease enzymes. J. Biomol. Struct. Dyn. 2020, 1–13. 10.1080/07391102.2020.1758791. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Wang J. Fast identification of possible drug treatment of coronavirus disease-19 (covid-19) through computational drug repurposing study. J. Chem. Inf. Model. 2020, 60, 3277–3286. 10.1021/acs.jcim.0c00179. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Mahdian S.; Ebrahim-Habibi A.; Zarrabi M. Drug repurposing using computational methods to identify therapeutic options for covid-19. J. Diabetes Metab. Disord. 2020, 19, 691–699. 10.1007/s40200-020-00546-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  17. Cavasotto C.; Filippo J. D.. In Silico Drug Repurposing for COVID-19: Targeting SARS-COV-2 Proteins through Docking and Quantum Mechanical Scoring. April 2020, https://chemrxiv.org/articles/In_silico_Drug_Repurposing_for_COVID-19_Targeting_SARS-CoV-2_Proteins_through_Docking_and_Quantum_Mechanical_Scoring/12110199/2. [DOI] [PubMed]
  18. Kandeel M.; Al-Nazawi M. Virtual screening and repurposing of fda approved drugs against covid-19 main protease. Life Sci. 2020, 251, 117627. 10.1016/j.lfs.2020.117627. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Silver D.; Hubert T.; Schrittwieser J.; Antonoglou I.; Lai M.; Guez A.; Lanctot M.; Sifre L.; Kumaran D.; Graepel T.; Lillicrap T.; Simonyan K.; Hassabis D. A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 2018, 362, 1140–1144. 10.1126/science.aar6404. [DOI] [PubMed] [Google Scholar]
  20. Gilson M. K.; Liu T.; Baitaluk M.; Nicola G.; Hwang L.; Chong J. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2015, 44, D1045–D1053. 10.1093/nar/gkv1072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Sterling T.; Irwin J. J. ZINC 15—Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55, 2324–2337. 10.1021/acs.jcim.5b00559. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Yang X.; Zhang J.; Yoshizoe K.; Terayama K.; Tsuda K. Chemts: an efficient python library for de novo molecular generation. Sci. Technol. Adv. Mater. 2017, 18, 972–976. 10.1080/14686996.2017.1401424. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Kingma D. P.; Ba J.. Adam: A Method for Stochastic Optimization. Dec 2014, arXiv:1412.6980, arXiv e-Prints. [Google Scholar]
  24. Kim C.; Chandrasekaran A.; Huan T. D.; Das D.; Ramprasad R. Polymer genome: a data-powered polymer informatics platform for property predictions. J. Phys. Chem. C 2018, 122, 17575–17585. 10.1021/acs.jpcc.8b02913. [DOI] [Google Scholar]
  25. Caruana R. Multitask learning. Mach. Learn. 1997, 28, 41–75. 10.1023/a:1007379606734. [DOI] [Google Scholar]
  26. Abadi M.; Agarwal A.; Barham P.; Brevdo E.; Chen Z.; Citro C.; Corrado G. S.; Davis A.; Dean J.; Devin M.; Ghemawat S.; Goodfellow I.; Harp A.; Irving G.; Isard M.; Jia Y.; Jozefowicz R.; Kaiser L.; Kudlur M.; Levenberg J.; Mané D.; Monga R.; Moore S.; Murray D.; Olah C.; Schuster M.; Shlens J.; Steiner B.; Sutskever I.; Talwar K.; Tucker P.; Vanhoucke V.; Vasudevan V.; Viégas F.; Vinyals O.; Warden P.; Wattenberg M.; Wicke M.; Yu Y.; Zheng X.. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015, software available from tensorflow.org. https://www.tensorflow.org/.
  27. Batra R.; Chan H.; Kamath G.; Ramprasad R.; Cherukara M. J.; Sankaranarayanan S.. Screening of Therapeutic Agents for COVID-19 Using Machine Learning and Ensemble Docking Simulations. April 2020, arXiv:2004.03766, arXiv e-Prints. [DOI] [PubMed] [Google Scholar]
  28. Trott O.; Olson A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 2010, 31, 455–461. 10.1002/jcc.21334. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. O’Boyle N. M.; Banck M.; James C. A.; Morley C.; Vandermeersch T.; Hutchison G. R. Open Babel: An open chemical toolbox. J. Cheminf. 2011, 3, 33. 10.1186/1758-2946-3-33. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. CureFFI. 2013, https://www.cureffi.org/2013/10/04/list-of-fda-approved-drugs-and-cns-drugs-with-smiles/.
  31. Ursu O.; Holmes J.; Knockel J.; Bologa C. G.; Yang J. J.; Mathias S. L.; Nelson S. J.; Oprea T. I. DrugCentral: online drug compendium. Nucleic Acids Res. 2017, 45, D932–D939. 10.1093/nar/gkw993. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Wójcikowski M.; Ballester P. J.; Siedlecki P. Performance of machine-learning scoring functions in structure-based virtual screening. Sci. Rep. 2017, 7, 46710. 10.1038/srep46710. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Gusarov S.; Stoyanov S. R. Cosmo-rs-based descriptors for the machine learning-enabled screening of nucleotide analogue drugs against sars-cov-2. J. Phys. Chem. Lett. 2020, 11, 9408–9414. 10.1021/acs.jpclett.0c02836. [DOI] [PubMed] [Google Scholar]
  34. RDKit: Open-Source Cheminformatics. http://www.rdkit.org (accessed April 11, 2013).
  35. Howard K. C.; Dennis E. K.; Watt D. S.; Garneau-Tsodikova S. A comprehensive overview of the medicinal chemistry of antifungal drugs: perspectives and promise. Chem. Soc. Rev. 2020, 49, 2426–2480. 10.1039/C9CS00556K. [DOI] [PubMed] [Google Scholar]
  36. Thamban Chandrika N.; Shrestha S. K.; Ngo H. X.; Tsodikov O. V.; Howard K. C.; Garneau-Tsodikova S. Alkylated piperazines and piperazine-azole hybrids as antifungal agents. J. Med. Chem. 2018, 61, 158–173. 10.1021/acs.jmedchem.7b01138. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

ao1c00477_si_001.pdf (819KB, pdf)
ao1c00477_si_002.xlsx (256.2KB, xlsx)

Articles from ACS Omega are provided here courtesy of American Chemical Society

RESOURCES