Abstract
Traditional drug discovery methods focus on optimizing the efficacy of a drug against a single biological target of interest for a specific disease. However, evidence supports the multitarget theory, i.e., drugs work by exerting their therapeutic effects via interaction with multiple biological targets, which have multiple phenotypic effects. Analytics of drug–protein interactions on a large proteomic scale provides insight into disease systems while also allowing for prediction of putative therapeutics against specific indications. We present a Python package for analysis of drug–proteome and drug–disease relationships implementing the Computational Analysis of Novel Drug Opportunities (CANDO) platform. The CANDO package allows for rapid drug similarity assessment, most notably via an in-house interaction scoring protocol where billions of drug–protein interactions are rapidly scored and the similarity of drug-proteome interaction signatures is calculated. The package also implements a variety of benchmarking protocols for shotgun drug discovery and repurposing, i.e., to determine how every known drug is related to every other in the context of the indications/diseases for which they are approved. Drug predictions are generated through consensus scoring of the most similar compounds to drugs known to treat a particular indication. Support for comparing and ranking novel chemical entities, as well as machine learning modules for both benchmarking and putative drug candidate prediction is also available. The CANDO Python package is available on GitHub at https://github.com/ram-compbio/CANDO, through the Conda Python package installer, and at http://compbio.org/software/.
Graphical Abstract
INTRODUCTION
Drugs and small molecule compounds exert therapeutic effects via the perturbation of multiple macromolecules, especially proteins. Growing evidence suggests small molecule drugs interact with multiple proteins to enact cellular changes, contrary to the “magic bullet” philosophy often practiced in drug discovery.8–10 Therefore, interpreting the totality of protein interactions for drugs provides greater insight into their therapeutic functions, with the potential for more efficient drug discovery. In addition, drug repurposing has emerged as a valuable alternative to traditional drug discovery pipelines, potentially easing the burden associated with common clinical trial failures.11–14 Multiple groups have taken a multitarget approach for predicting drug effects; both Liu and Altman and Zhou et al. computed interactions between large libraries of drugs and proteins to map targets to side effect outcomes.15,16 Similarly, Simon et al. mapped drug–protein interactions to “effect profiles”, of which a given effect is a drug class (for example, calcium channel blocker or stimulant) as opposed to a disease or side effect.17 Numerous groups have used network or systems biology approaches for large scale prediction of drug– disease associations, typically using known drug–protein interactions;18–20 however, no studies have computed proteomic interactions for all drugs for the purpose of benchmarking and prediction for every disease to our knowledge.
We have developed the Computational Analysis of Novel Drug Opportunities (CANDO) platform for analysis of drug interactions on a proteomic scale, adhering to multitarget drug theory,1–7 for the purposes of shotgun drug discovery and repurposing, i.e., to evaluate every drug for every disease. An overview of the platform is provided in Supporting Figure 1. CANDO version 2 (v2) is comprised of a library of 14 606 sequence nonredundant (p-value 10e–7) protein structures extracted from the Protein Data Bank, 2162 human-approved drugs from DrugBank, and 2178 indications/diseases from the Comparative Toxicogenomics Database (CTD), encompassing 18 709 drug–indication associations.21–23 An additional set of 5317 human only protein structures is also available. The platform relates small molecules based on their computed interactions with all protein structures, known as an interaction signature, then assesses a drug repurposing accuracy based on how similar drug–proteomic signatures are for those drugs approved to treat the same indications. The hypothesis is that drugs with similar interaction signatures will have similar behavior in biological systems and will therefore be useful against the same indications.
Here, we present cando.py, a Python package implementing the CANDO platform for convenient analyses of drug–protein interaction signatures with the ultimate goal of making novel putative drug candidate generation easy and accessible. The package may be used for validation of virtual screening methods for applications in drug discovery and repurposing and for extending or developing novel drug discovery and repurposing platforms. The package reads in a matrix of precomputed interaction scores with any number of proteins, along with a drug to indication mapping, which are then benchmarked. Compound–protein interaction signatures for novel compounds/drugs not present in our library are quickly computed and added to the matrix using our default interaction scoring protocol, allowing for direct comparison and ranking relative to other drug signatures in the platform. The package can also read in any drug–drug similarity/distance matrix computed using any third party package, which may be benchmarked or used for drug–disease association prediction.
METHODS: CANDO PLATFORM IMPLEMENTATION
The CANDO platform is implemented in Python as a series of parallel pipelines with modules for the following major protocols (Supporting Figure 1).
Interaction Scoring Protocol.
The pipelines in the CANDO platform are agnostic to the interaction scoring protocol used: The compound–protein interaction scores in CANDO may be derived from high throughput disassociation constant studies, molecular docking simulations, and/or other quantification of structure–activity relationships.24–26 If more than one protocol is used, then it constitutes a different pipeline within the platform. The reference/default compound-protein interaction scores in the CANDO v2 matrices are computed using a bioinformatic docking protocol that compares the structures of query drugs to all ligands known to bind to a given site on a protein.5 Specifically, the COACH algorithm is used to elucidate potential binding sites on each query protein, which uses a consensus approach via three different complementary algorithms that consider substructure or sequence similarity to known binding sites in the PDB.27 COACH output includes a set of cocrystallized ligands for each potential binding site, which are then compared to a compound/drug of interest using chemical fingerprinting methods that binarize the presence or absence of particular molecular substructures. The maximum Tanimoto coefficient between the binary vectors of the query compound and the set of all predicted binding site ligands for a protein serve as a proxy for the binding strength. The final output is a series of interaction scores between every drug/compound and every protein structure in the corresponding libraries.
Benchmarking Protocols.
Each drug/compound is ranked relative to all others based on the pairwise similarity of their proteomic signatures, calculated using the root-mean-square deviation (RMSD) by default, resulting in a ranked list of most similar compounds. By default, all proteins in the library are used for the RMSD calculation but their composition may be varied to allow for more specific queries, both generally or on a per indication basis, which also applies to the canpredict module (discussed below). Other distance metrics, such as cosine distance, may also be used.
The benchmarking protocol (implemented in the canbenchmark module) utilizes a hold-one-out scheme to compute an accuracy for each indication. For a given indication, each approved drug is held-out and the most similar compounds (within various cutoffs) are checked to see if they are also approved for the indication (Figure 2). This protocol is run iteratively and averaged across all indications with two or more drugs approved to provide a drug repurposing accuracy at each cutoff. Both the average indication accuracy (described above) and the pairwise accuracy (the weighted average based on the number of compounds approved for the disease) are outputted, as well as the coverage, which is the number of indications with nonzero accuracy. Benchmarking performance across different versions/pipelines is available in Supporting Figure 2.
Putative Drug Candidate Generation (Prediction) Protocol.
The ranked lists of most similar compounds to each drug, other than those that are used for benchmarking, are investigated as potential novel treatments. A consensus scoring approach is utilized where for each drug associated with a specific indication, the number of times a particular drug shows up within a certain cutoff of each list is counted. The prediction module canpredict then ranks the top compounds by their consensus scores. Figure 1 provides an example with malaria (Plasmodium falciparum). The top consensus scoring drugs include lumefantrine, a known antimalarial drug, and pantoprazole, a proton pump inhibitor that has shown antimalarial activity.28 Another strong candidate is aminoglutethimide, an aromatase inhibitor with uses including Cushing’s syndrome and various cancers. The exact set of proteins used for the drug–drug similarity calculations can be modified, e.g. specifying only Plasmodium proteins.
Putative Indication Prediction Protocol.
The canpredict module can also accept a small molecule compound as input, including novel chemical entities, and suggest novel indications for which they may be useful. First the proteomic signature is computed for all proteins in the platform, then the signature is compared to all other drugs in the platform. The most similar drugs to the query compound within a specified cutoff are probed for a consensus among the diseases for which they are indicated, correlative to the disease-focused canpredict module discussed above. Figure 2 presents the results for both an approved drug, ribavirin, and an investigational compound, LMK-235. Ribavirin receives a consensus score of two at the top10 cutoff for both Breast Neoplasms (MeSH:D001943) and Leukemia, Myeloid, Acute (MeSH:D015470), which is supported by clinical trials for both diseases in which ribavirin is the primary intervention.29,30 The three drugs contributing to these consensus scores are gemcitabine, azacitidine, and decitabine, which are all nucleoside analog anticancer therapies. LMK-235 is an investigational histone deacetylase inhibitor that is yet to begin human trials. The canpredict module output with a top20 cutoff includes both Pain (MeSH:D010146) and Hypertension (MeSH:D006973), which are both supported by in vivo experiments.31,32
AI/Machine Learning Protocols.
The CANDO package also provides support for several machine learning protocols that learn more complex relationships hidden in the drug-proteome interaction signatures to improve performance. The currently supported protocols include support vector machines (SVMs), 1-class SVMs, random forests, and logistic regression, though the latter two are prioritized as they offer insight into feature importance. The modules are trained on the input data to generate models that yield prediction pipelines that are benchmarked using a protocol similar to the one used by canbenchmark: for a given indication, each approved drug is held out while the model is trained on all other drugs approved for the indication in an iterative fashion. In other words, the number of binary classifiers trained corresponds to the number of drugs associated with a particular indication. An equal number of neutral samples are chosen as negative samples during training (except 1-class SVMs), which represent drugs/compounds not associated with the indication. Several metrics are calculated based on the number of times the samples are correctly classified, including the area under the receiver operating characteristic (AUC-ROC) and precision recall (AUPR) curves (Supporting Figure 3). AUC-ROC and AUPR are only available with logistic regression and random forests as they offer classification probabilities. The user may also make predictions for novel or nonassociated compounds after training the classifier on all approved drugs for a particular indication, and both AI/machine learning benchmarking and prediction protocols are amenable to any kind of feature input (e.g., molecular substructures instead of compound-protein interaction scores).
Development and Implementation.
The CANDO software is available in Python 2.7, 3.6, and 3.7. It is available for installation via the Python Anaconda installer. All data necessary for the benchmarking and prediction modules are available for download directly in the package. The source code, API document, and a Jupyter Notebook tutorial are available on GitHub at https://github.com/ram-compbio/CANDO as well as on http://compbio.org/software/.
DISCUSSION AND CONCLUSION
For interaction scoring, in addition to the bioinformatic docking protocol described above, a compound-proteome interaction matrix generated using our state of the art docking program CANDOCK33 with predicted binding energies will be available for use shortly. Indeed, the platform can accept protein–compound interaction, and compound–compound similarity, matrices generated by any method (virtual docking, molecular fingerprinting, gene expression changes, etc.) and benchmark their utility for shotgun drug discovery and repurposing. This is especially useful given that molecular docking and chemical fingerprinting techniques vary greatly in performance.34–36 A Web server hosted on compbio.org that will feature many of the functionalities described is under development.
The multitarget approach to drug discovery is vastly unexplored and shows promise for identifying novel treatments for various diseases based on the results we have obtained using our software. The CANDO Python package allows users to investigate drug–protein interactions on a proteomic scale for the purposes of shotgun drug discovery and repurposing, moving away from the single target and single indication philosophy. The multitarget approach, which in our platform is represented as the synthesis of many virtual screens, is conducive for understanding drug behavior holistically, which will allow for better elucidation of the therapeutic (and adverse) effects these small molecules exert on biological systems. We anticipate that broader use of this platform will inform researchers about potential lead compounds that may be therapeutic for specific indications, leading to accelerated and more efficient drug discovery.
Supplementary Material
ACKNOWLEDGMENTS
The authors thank Liana Bruggemann, Matthew Hudson, Manoj Mammen, and Jim Schuler for their contributions and testing of the CANDO software. This work was supported in part by a National Institute of Health Directors Pioneer Award (DP1OD006779), a National Institute of Health Clinical and Translational Sciences (NCATS) Award (UL1TR001412), NCATS ASPIRE Design Challenge Awards, a National Library of Medicine T15 Award (T15LM012495), a National Cancer Institute/Veterans Affairs Big Data-Scientist Training Enhancement Program Fellowship in Big Data Sciences, and startup funds from the Department of Biomedical Informatics at the University at Buffalo. Startup funds from the Department of Chemistry at Purdue University, Ralph W. and Grace M. Showalter Research Trust award, the Integrative Data Science Initiative award, and the Jim and Diann Robbers Cancer Research Grant for New Investigators award to Gaurav Chopra, as well as additional support in part by a NCATS Clinical and Translational Sciences Award from the Indiana Clinical and Translational Sciences Institute (UL1TR002529), and the Purdue University Center for Cancer Research NIH grant P30CA023168, are also acknowledged. The content is solely the responsibility of the authors and does not represent the official views of the National Institutes of Health
Footnotes
Supporting Information
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acs.jcim.0c00110.
Jupyter Notebook cando.py tutorial (PDF)
Figure S1. Overview of the CANDO drug discovery and repurposing platform. Figure S2. Benchmarking performance across multiple versions and pipelines in the CANDO platform. Figure S3. AI-CANDO platform performance evaluation using the receiver operating characteristic curve (PDF)
API document (PDF)
SMILES strings (TXT)
Complete contact information is available at: https://pubs.acs.org/10.1021/acs.jcim.0c00110
The authors declare no competing financial interest.
Contributor Information
William Mangione, Department of Biomedical Informatics, University at Buffalo, Buffalo, New York 14120, United States;.
Zackary Falls, Department of Biomedical Informatics, University at Buffalo, Buffalo, New York 14120, United States.
Gaurav Chopra, Department of Chemistry, Purdue Institute for Drug Discovery, Integrated Data Science Institute, Purdue University, West Lafayette, Indiana 47907, United States;.
Ram Samudrala, Department of Biomedical Informatics, University at Buffalo, Buffalo, New York 14120, United States;.
REFERENCES
- (1).Minie M; Chopra G; Sethi G; Horst J; White G; Roy A; Hatti K; Samudrala R CANDO and the infinite drug discovery frontier. Drug Discovery Today 2014, 19, 1353–1363. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (2).Sethi G; Chopra G; Samudrala R Multiscale modelling of relationships between protein classes and drug behavior across all diseases using the CANDO platform. Mini-Rev. Med. Chem 2015, 15, 705–717. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (3).Schuler J; Samudrala R Fingerprinting CANDO: Increased Accuracy with Structure-and Ligand-Based Shotgun Drug Repurposing. ACS Omega 2019, 4, 17393–17403. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (4).Mangione W; Samudrala R Identifying protein features responsible for improved drug repurposing accuracies using the CANDO platform: Implications for drug design. Molecules 2019, 24, 167. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (5).Falls Z; Mangione W; Schuler J; Samudrala R Exploration of interaction scoring criteria in the CANDO platform. BMC Res. Notes 2019, 12, 318. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (6).Fine J; Lackner R; Samudrala R; Chopra G Computational chemoproteomics to understand the role of selected psychoactives in treating mental health indications. Sci. Rep 2019, 9, 1–15. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (7).Chopra G; Kaushik S; Elkin PL; Samudrala R Combating ebola with repurposed therapeutics using the CANDO platform. Molecules 2016, 21, 1537. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (8).Xie L; Evangelidis T; Xie L; Bourne PE Drug discovery using chemical systems biology: weak inhibition of multiple kinases may contribute to the anti-cancer effect of nelfinavir. PLoS Comput. Biol 2011, 7, e1002037. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (9).Bolognesi L Polypharmacology in a single drug: multitarget drugs. Curr. Med. Chem 2013, 20, 1639–1645. [DOI] [PubMed] [Google Scholar]
- (10).Horst JA; Laurenzi A; Bernard B; Samudrala R Computational multitarget drug discovery. Polypharm. Drug Discovery 2012, 263. [Google Scholar]
- (11).Yella J; Yaddanapudi S; Wang Y; Jegga A Changing trends in computational drug repositioning. Pharmaceuticals 2018, 11, 57. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (12).Pushpakom S; Iorio F; Eyers PA; Escott KJ; Hopper S; Wells A; Doig A; Guilliams T; Latimer J; McNamee C; Norris A; Sanseau P; Cavalla D; Pirmohamed M Drug repurposing: progress, challenges and recommendations. Nat. Rev. Drug Discovery 2019, 18, 41. [DOI] [PubMed] [Google Scholar]
- (13).Peyvandipour A; Saberian N; Shafi A; Donato M; Draghici S A novel computational approach for drug repurposing using systems biology. Bioinformatics 2018, 34, 2817–2825. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (14).Paranjpe MD; Taubes A; Sirota M Insights into computational drug repurposing for neurodegenerative disease. Trends Pharmacol. Sci 2019, 40, 565. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (15).Liu T; Altman RB Relating essential proteins to drug side- effects using canonical component analysis: a structure-based approach. J. Chem. Inf. Model 2015, 55, 1483–1494. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (16).Zhou H; Gao M; Skolnick J Comprehensive prediction of drug-protein interactions and side effects for the human proteome. Sci. Rep 2015, 5, 11090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (17).Simon Z; Peragovics g.; Vigh-Smeller M; Csukly G; Tombor L; Yang Z; Zahoránszky-Köhalmi G; Végner L; Jelinek B; Hári P; et al. Drug effect prediction by polypharmacology-based interaction profiling. J. Chem. Inf. Model 2012, 52, 134–145. [DOI] [PubMed] [Google Scholar]
- (18).Zeng X; Zhu S; Liu X; Zhou Y; Nussinov R; Cheng F deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics 2019, 35, 5191–5198. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (19).Saberian N; Peyvandipour A; Donato M; Ansari S; Draghici S A new computational drug repurposing method using established disease-drug pair knowledge. Bioinformatics 2019, 35, 3672–3678. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (20).Qabaja A; Alshalalfa M; Alanazi E; Alhajj R Prediction of novel drug indications using network driven biological data prioritization and integration. J. Cheminf 2014, 6, 1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (21).Wishart DS; Feunang YD; Guo AC; Lo EJ; Marcu A; Grant JR; Sajed T; Johnson D; Li C; Sayeeda Z; et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018, 46, D1074–D1082. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (22).Davis AP; Grondin CJ; Johnson RJ; Sciaky D; King BL; McMorran R; Wiegers J; Wiegers TC; Mattingly CJ The comparative toxicogenomics database: update 2017. Nucleic Acids Res. 2017, 45, D972–D978. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (23).Berman HM; Westbrook J; Feng Z; Gilliland G; Bhat TN; Weissig H; Shindyalov IN; Bourne PE International Tables for Crystallography Vol. F: Crystallography of biological macromolecules; Springer, 2006; pp 675–684. [Google Scholar]
- (24).Vaidya A; Jain S; Jain S; Jain AK; Agrawal RK Quantitative structure-activity relationships: a novel approach of drug design and discovery. J. Pharm. Sci. Pharmacol 2014, 1, 219–232. [Google Scholar]
- (25).Hall MD; Yasgar A; Peryea T; Braisted JC; Jadhav A; Simeonov A; Coussens NP Fluorescence polarization assays in high- throughput screening and drug discovery: a review. Methods Appl. Fluoresc 2016, 4, 022001. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (26).Pagadala NS; Syed K; Tuszynski J Software for molecular docking: a review. Biophys. Rev 2017, 9, 91–102. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (27).Yang J; Roy A; Zhang Y Protein-ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 2013, 29, 2588–2595. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (28).Riel MA; Kyle DE; Bhattacharjee AK; Milhous WK Efficacy of proton pump inhibitor drugs against Plasmodium falciparum in vitro and their probable pharmacophores. Antimicrob. Agents Chemother 2002, 46, 2627–2632. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (29).Assouline S; Culjkovic B; Cocolakis E; Rousseau C; Beslu N; Amri A; Caplan S; Leber B; Roy D-C; Miller WH; Borden KLB Molecular targeting of the oncogene eIF4E in acute myeloid leukemia (AML): a proof-of-principle clinical trial with ribavirin. Blood 2009, 114, 257–260. [DOI] [PubMed] [Google Scholar]
- (30).Pettersson F; Yau C; Dobocan MC; Culjkovic-Kraljacic B; Retrouvay H; Puckett R; Flores LM; Krop IE; Rousseau C; Cocolakis E; et al. Ribavirin treatment effects on breast cancers overexpressing eIF4E, a biomarker with prognostic specificity for luminal B-type breast cancer. Clin. Cancer Res 2011, 17, 2874–2884. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (31).Lin T-B; Hsieh M-C; Lai C-Y; Cheng J-K; Chau Y-P; Ruan T; Chen G-D; Peng H-Y Modulation of Nerve Injury- induced HDAC4 Cytoplasmic Retention Contributes to Neuropathic Pain in Rats. Anesthesiology 2015, 123, 199–212. [DOI] [PubMed] [Google Scholar]
- (32).Choi SY; Kee HJ; Sun S; Seok YM; Ryu Y; Kim GR; Kee S-J; Pflieger M; Kurz T; Kassack MU; Jeong MH Histone deacetylase inhibitor LMK235 attenuates vascular constriction and aortic remodelling in hypertension. J. Cell. Mol. Med 2019, 23, 2801–2812. [DOI] [PMC free article] [PubMed] [Google Scholar]
- (33).Fine J; Konc J; Samudrala R; Chopra G CANDOCK: Chemical atomic network based hierarchical flexible docking algorithm using generalized statistical potentials. J. Chem. Inf. Model 2020, 60, 1509–1527. [DOI] [PubMed] [Google Scholar]
- (34).Li J; Fu A; Zhang L An overview of scoring functions used for protein-ligand interactions in molecular docking. Interdiscip. Sci.: Comput. Life Sci 2019, 11, 320. [DOI] [PubMed] [Google Scholar]
- (35).Saikia S; Bordoloi M Molecular docking: challenges, advances and its use in drug discovery perspective. Curr. Drug Targets 2019, 20, 501–521. [DOI] [PubMed] [Google Scholar]
- (36).Maggiora G; Vogt M; Stumpfe D; Bajorath J Molecular similarity in medicinal chemistry: miniperspective. J. Med. Chem 2014, 57, 3186–3204. [DOI] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.