Abstract
Target-based screening is one of the major approaches in drug discovery. Besides the intended target, unexpected drug off-target interactions often occur, and many of them have not been recognized and characterized. The off-target interactions can be responsible for either therapeutic or side effects. Thus, identifying the genome-wide off-targets of lead compounds or existing drugs will be critical for designing effective and safe drugs, and for providing new opportunities for drug repurposing. Although many computational methods have been developed to predict drug-target interactions, they are either less accurate than the one that we propose here or too computationally intensive, which limits their capability for large-scale off-target identification. In addition, the performance of most machine learning-based algorithms has mainly been evaluated on predicting off-target interactions within the same gene family for hundreds of chemicals. It is not clear how these algorithms perform in detecting off-targets across gene families on a proteome scale. Here, we present a fast and accurate off-target prediction method, REMAP, which is based on a dual regularized one-class collaborative filtering algorithm, to explore continuous chemical space, protein space, and their interactome on a large scale. When tested on a reliable, extensive, cross-gene family benchmark, REMAP outperforms the state-of-the-art methods. Furthermore, REMAP is highly scalable: it can screen a dataset of 200 thousand chemicals against 20 thousand proteins within 2 hours. Using the reconstructed genome-wide target profile as the fingerprint of a chemical compound, we predicted that seven FDA-approved drugs can be repurposed as novel anti-cancer therapies. The anti-cancer activity of six of them is supported by experimental evidence.
Thus, REMAP is a valuable addition to the existing in silico toolbox for drug target identification, drug repurposing, phenotypic screening, and side effect prediction. The software and benchmark are available at https://github.com/hansaimlim/REMAP.
Author Summary
High-throughput techniques have generated vast amounts of diverse omics and phenotypic data. However, these sets of data have not yet been fully explored to improve the effectiveness and efficiency of drug discovery, a process which has traditionally adopted a one-drug-one-gene paradigm. Consequently, the cost of bringing a drug to market is astounding and the failure rate is daunting. The failure of the target-based drug discovery is in large part due to the fact that a drug rarely interacts only with its intended receptor, but also generally binds to other receptors. To rationally design potent and safe therapeutics, we need to identify all the possible cellular proteins interacting with a drug in an organism. Existing experimental techniques are not sufficient to address this problem, and will benefit from computational modeling. However, it is a daunting task to reliably screen millions of chemicals against hundreds of thousands of proteins. Here, we introduce a fast and accurate method REMAP for large-scale predictions of drug-target interactions. REMAP outperforms state-of-the-art algorithms in terms of both speed and accuracy, and has been successfully applied to drug repurposing. Thus, REMAP may have broad applications in drug discovery.
Introduction
Conventional one-drug-one-gene drug discovery and drug development is a time-consuming and expensive process. It suffers from high attrition rate and possible unexpected post-market withdrawal [1]. It has been recognized that a drug rarely only binds to its intended target, and off-target interactions (i.e. interactions between the drug and unintended targets) are common [2]. The off-target interaction may lead to adverse drug reactions (ADRs) [3], as demonstrated by the deadly side effect of a Fatty Acid Amide Hydrolase (FAAH) inhibitor in a recent clinical trial [4]. On the other hand, the off-target interaction may be therapeutically useful, thus providing opportunities for drug repurposing and polypharmacology [2]. Therefore, identifying off-target interactions is an important step in drug discovery and development in order to reduce the drug attrition rate and to accelerate the drug discovery and development process, and ultimately to make safer and more affordable drugs.
Many efforts have been devoted to developing statistical machine learning methods for the prediction of unknown drug-target associations by screening large chemical and protein data sets [5]. One of the fundamental assumptions in applying statistical machine learning methods to drug-target interaction prediction is that similar chemicals bind to similar protein targets, and vice versa. Based on this similarity principle, both semi-supervised and supervised machine learning techniques have been applied. The semi-supervised learning methods build statistical models for the k nearest neighbors (k-NN) of the query compound among similar compounds in the database; examples include Parzen-Rosenblatt Window (PRW) [6] and Set Ensemble Analysis (SEA) [7]. Although a large number of 2D and 3D fingerprint representations of chemical structures have been developed, chemical structure similarity measured by the Tanimoto coefficient (TC) or other similarity metrics of fingerprints is not continuously correlated with binding activity. Activity cliffs exist in chemical space, where a small modification of a chemical structure can lead to a dramatic change in binding activity [8]. Thus, chemical structural similarity alone is not sufficient to capture the genome-wide target binding profile, as protein-chemical interaction is determined by both protein structures and chemical structures. New deep learning techniques that can learn non-linear, hierarchical relationships may provide new solutions for representing chemical space [9–12]. However, little work has been done to incorporate protein relationships into the deep learning framework. It remains to be seen whether deep learning is applicable to genome-wide target prediction.
A number of techniques such as Gaussian Interaction Profile (GIP), Weighted Nearest Neighbor (WNN), Regularized Least Squares (RLS) classifier [13, 14], and matrix factorization [15–17] have been developed to integrate chemical and genomic space. Among them, Neighborhood Regularized Logistic Matrix Factorization (NRLMF) [17] and Kernelized Bayesian Matrix Factorization (KBMF) [16] are two of the most successful methods. However, several drawbacks in these algorithms hinder their application to genome-wide off-target predictions. First, several high-performing algorithms such as KBMF are extremely time- and memory-consuming. Second, these algorithms depend on a supervised learning framework that requires negative cases. While publicly available biological and/or chemical databases (e.g. ZINC [18], ChEMBL [19], DrugBank [20], PubChem [21], and UniProt [22]) have enabled large-scale screening of drug-target associations, the known chemical-protein associations are sparse, and the number of reported negative cases (i.e. chemical-protein pairs not associated) is too small to optimally train a prediction algorithm [23]. Using randomly generated negative cases will adversely impact the performance of these algorithms, and algorithmically derived negative cases are often based on unrealistic assumptions [23]. Finally, these algorithms have been mainly evaluated for the prediction of off-targets within the same gene family (e.g. GPCR) using a small benchmark with hundreds of drugs and targets. Their performance in predicting off-targets across gene families on a large scale is uncertain. Indeed, drug cross-reactivity often occurs across fold spaces [2]. Thus, there is an urgent need for in silico prediction methods that are both fast and accurate enough to explore the available data.
Here, we make several contributions to address the aforementioned problems. First, we present an efficient method, REMAP, which formulates the off-target predictions as a dual-regularized One Class Collaborative Filtering (OCCF) problem. Thus, negative data are not needed for the training, but can be used if available. Secondly, REMAP is highly scalable with promising accuracy, thus can be applied to large-scale off-target predictions. Thirdly, we introduce a new benchmark set to evaluate the performance of drug-target interactions across gene families. Finally, we apply REMAP to repurposing existing drugs for new diseases. We identified seven drugs that have anti-cancer activity. Six of them are supported by experimental evidence.
Materials and Methods
Problem formulation
The problem we address here is to predict how likely a chemical is to interact with a target protein, using a chemical-protein association network, chemical-chemical similarity, and protein-protein similarity information. We start by preparing a bipartite network of chemical-protein associations as a sparse n × m matrix R, where n is the number of chemicals and m is the number of proteins: R(i,j) = 1 if the ith chemical is associated with the jth protein, and R(i,j) = 0 otherwise. The chemical-chemical similarity scores are stored in an n × n square matrix C, with C(i,j) representing the similarity score between the ith and jth chemicals (0 ≤ C(i,j) ≤ 1). The protein-protein similarity scores are stored in the same format in an m × m matrix T (0 ≤ T(i,j) ≤ 1). We treat this problem as an analog of user-item preference prediction in which users and items correspond to chemicals and proteins, respectively. The goal is therefore to produce an n × m matrix P in which P(i,j) is the prediction score for the interaction between the ith chemical and the jth protein.
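The input described above can be sketched in a few lines. This is a minimal illustration, assuming associations arrive as (chemical, protein) identifier pairs; the function name `build_association_matrix` and the toy identifiers are hypothetical and not part of REMAP. A dense nested list is used for readability, whereas a real dataset of this size would need a sparse representation.

```python
def build_association_matrix(pairs, chemicals, proteins):
    """Return the n x m 0/1 association matrix R as nested lists.

    pairs     -- iterable of (chemical_id, protein_id) known associations
    chemicals -- list of n chemical identifiers (row order)
    proteins  -- list of m protein identifiers (column order)
    """
    chem_index = {c: i for i, c in enumerate(chemicals)}
    prot_index = {p: j for j, p in enumerate(proteins)}
    # Every unobserved pair starts at 0; 0 means "not observed", not "negative".
    R = [[0] * len(proteins) for _ in chemicals]
    for chem, prot in pairs:
        R[chem_index[chem]][prot_index[prot]] = 1
    return R


# Toy example with hypothetical identifiers.
chemicals = ["chem_a", "chem_b", "chem_c"]
proteins = ["prot_x", "prot_y"]
pairs = [("chem_a", "prot_x"), ("chem_c", "prot_y"), ("chem_c", "prot_x")]
R = build_association_matrix(pairs, chemicals, proteins)
```

Here `chem_b` has an all-zero row, which corresponds to a chemical with no known target.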
Overview of off-target prediction method REMAP
Our prediction method REMAP is based on a one-class collaborative filtering algorithm that recommends items to users based on their preferences [24]. It assumes that similar users will prefer similar items, that unobserved associations are not necessarily negative, and that user-item preferences are analogous to drug-target associations. Assuming that a fairly low number of factors (i.e. fewer features than the total number of chemicals or protein targets) can capture the characteristics that determine the chemical-protein associations, two low-rank matrices, U (chemical side) and V (protein side), are sought such that the reconstruction error $\sum_{i,j}\left(R(i,j) - U(i,:)\,V^T(:,j)\right)^2$ is minimized, where R is the matrix of known chemical-protein associations and $V^T$ is the transpose of the protein-side low-rank matrix V. The two low-rank matrices, $U_{n \times r}$ and $V_{m \times r}$, are obtained by iteratively minimizing the objective function,
(1) $\min_{U,V \ge 0} \sum_{i,j} p_{wt}\left(R(i,j) + p_{imp} - U(i,:)\,V^T(:,j)\right)^2 + p_{reg}\left(\lVert U \rVert^2 + \lVert V \rVert^2\right) + p_{chem}\,\mathrm{tr}\left(U^T (D_C - C)\,U\right) + p_{prot}\,\mathrm{tr}\left(V^T (D_T - T)\,V\right)$
All symbols used in the paper are summarized in Table 1, and the overall process of REMAP is shown in Fig 1. Here, pwt is the penalty weight on the observed and unobserved associations, which indicates the reliability of the assigned probability of true association; pimp is the imputed value (i.e. the probability that an unobserved association is a real association); preg is the regularization parameter to prevent overfitting; pchem is the importance parameter for chemical-chemical similarity; pprot is the importance parameter for protein-protein similarity; and tr(A) is the trace of matrix A (Table 1). In this study, we use a global weight and a global imputation value. However, the weight and imputation values may be determined from a priori knowledge or from the predictions of other machine learning algorithms (i.e. pwt and pimp can be matrices with the same dimensions as the matrix R). The raw predicted score for the ith chemical to bind the jth protein is calculated as $p(i,j) = U(i,:)\,V^T(:,j)$. The raw scores were adjusted based on the ratio of observed positive and negative cases when negative data were available (explained in the prediction score adjustment section). The matrix $U_{n \times r}$ is also referred to as a low-rank drug profile, since its ith row represents the ith drug's behavior in the drug-target interaction network as well as in the drug-drug similarity space, compressed to r features. The REMAP code was originally written in Matlab and modified for drug-target predictions.
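One way to read the two trace terms weighted by pchem and pprot, assuming they take the usual graph-Laplacian form $\mathrm{tr}(U^T(D_C - C)U)$ with a symmetric similarity matrix C and degree matrix $D_C(i,i) = \sum_j C(i,j)$, is through the standard identity below. It is offered as a sketch of the intuition, not as the authors' own derivation.

```latex
\mathrm{tr}\!\left(U^{T}(D_C - C)\,U\right)
  = \sum_{i} D_C(i,i)\,\lVert U(i,:)\rVert^{2}
    - \sum_{i,j} C(i,j)\,U(i,:)\,U(j,:)^{T}
  = \frac{1}{2}\sum_{i,j} C(i,j)\,\lVert U(i,:) - U(j,:)\rVert^{2}
```

Under this reading, a larger pchem pulls the low-rank profiles of structurally similar chemicals closer together. The same argument applies to V and T whenever T is symmetric.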
Table 1. The symbols and the descriptions for numerical calculations.
Symbol | Definition and Description |
---|---|
R | The adjacency matrix of the known drug-target associations |
C, T | The chemical-chemical and the target-target similarity matrices |
C(c1,c2) | The chemical-chemical similarity score for the chemicals c1 and c2 |
dTc(c1,c2) | The Tanimoto dissimilarity coefficient for the chemicals c1 and c2 |
T(p1,p2) | The target-target similarity score for the query protein p1 and the target protein p2 |
dbit(p1,p2) | The bit score for the query protein p1 and the target protein p2 |
DC, DT | The degree matrices of C and T, respectively |
U, V | The chemical-side and the target-side low-rank approximation matrices |
R(i,j) | The element of R at its ith row and jth column |
R(i,:) | The ith row of R |
R(:,j) | The jth column of R |
RT | The transpose matrix of R |
tr(R) | The trace of R |
pwt | The penalty weight on observed and unobserved associations which indicate the reliability of assigned probability of true association |
pimp | The imputed value (i.e. the probability that an unobserved association is a real association) |
preg | The regularization parameter to prevent overfitting |
pchem | The importance parameter for chemical-chemical similarity |
pprot | The importance parameter for protein-protein similarity |
r | The rank of the low-rank approximation matrices |
piter | The number of maximum iterations to minimize the objective function |
p(i,j) | The raw prediction score by REMAP for the ith chemical and the jth protein |
Chemical-chemical similarity
Chemical-chemical similarity scores are one of the required inputs of REMAP. Although a number of metrics have been developed for chemical-chemical similarity, a recent study showed that Tanimoto coefficient-based similarity is highly efficient for fingerprint-based similarity measurement [25]. The fingerprint of choice in this study is the Extended Connectivity Fingerprint (ECFP), which has been successfully applied in the chemical structure-based target prediction method PRW [6]. This choice allows a fair comparison of REMAP with PRW. It would be interesting to compare different fingerprints in a future study.
To calculate a similarity score between two chemicals, c1 and c2, the Tanimoto dissimilarity coefficient, $d_{Tc}(c_1,c_2)$, was obtained using JChem with the Tanimoto metric for the ECFP descriptor type, using the following command in the Unix environment: “ChemAxon/JChem/bin/screenmd target_smi query_smi -k ECFP -g -c -M Tanimoto” [26]. The chemical-chemical similarity score is defined as $C(c_1,c_2) = 1 - d_{Tc}(c_1,c_2)$. Briefly, two chemicals have a higher similarity score if they share more chemical moieties (e.g. functional groups) at more similar relative positions. Chemical similarity scores below 0.5 were treated as noise and set to 0.
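The scoring rule above can be illustrated without JChem. The sketch below assumes each fingerprint is represented as the set of its "on" bit positions, which is a simplification for illustration rather than the JChem ECFP implementation; the function names are hypothetical.

```python
def tanimoto_similarity(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit positions."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        # Two empty fingerprints: define the similarity as 0 (assumption).
        return 0.0
    return len(a & b) / union


def thresholded_similarity(bits_a, bits_b, cutoff=0.5):
    """Similarity score with values below the cutoff treated as noise (set to 0)."""
    score = tanimoto_similarity(bits_a, bits_b)
    return score if score >= cutoff else 0.0


# Toy fingerprints (hypothetical bit positions).
fp1 = {1, 2, 3, 4}
fp2 = {2, 3, 4, 5}   # shares 3 of 5 distinct bits with fp1
fp3 = {1, 9}         # shares 1 of 5 distinct bits with fp1

sim_12 = thresholded_similarity(fp1, fp2)
sim_13 = thresholded_similarity(fp1, fp3)
```

The Tanimoto dissimilarity used in the text is simply one minus the coefficient computed by `tanimoto_similarity`.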
Protein-protein similarity
Protein-protein similarity scores are also one of the required inputs for REMAP. The similarity between two proteins was calculated based on their sequence similarity using NCBI BLAST [27] with an e-value threshold of 1 × 10^-5 and its default options (e.g. 11 for the gap open penalty, 1 for the gap extension penalty, BLOSUM62 for the scoring matrix, and so on). Based on our 10-fold cross validation (see below), e-value thresholds from 1 to 1 × 10^-20 did not significantly affect the performance (S1 Fig). Therefore, we decided to use a moderately stringent threshold (BLAST default is 1 × 10^-3). The similarity score for a query protein p1 and a target protein p2 was calculated as the ratio of the bit score for the pair to the bit score of the self-query. Specifically, the protein-protein similarity score was defined as T(p1,p2) = dbit(p1,p2)/dbit(p1,p1).
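The bit-score ratio can be sketched as follows, assuming the BLAST bit scores have already been parsed into a dictionary keyed by (query, target). The dictionary layout, the function name, and the choice to score pairs without a reported hit as 0 are assumptions for illustration.

```python
def protein_similarity(bit_scores, query, target):
    """T(p1, p2) = dbit(p1, p2) / dbit(p1, p1).

    bit_scores -- dict mapping (query_id, target_id) to a BLAST bit score;
                  pairs with no hit under the e-value threshold are absent.
    """
    self_score = bit_scores[(query, query)]
    # Assumption: a pair without a reported hit gets similarity 0.
    return bit_scores.get((query, target), 0.0) / self_score


# Toy bit scores (hypothetical values).
bit_scores = {
    ("P1", "P1"): 200.0,
    ("P1", "P2"): 50.0,
    ("P2", "P2"): 100.0,
    ("P2", "P1"): 50.0,
}
t_12 = protein_similarity(bit_scores, "P1", "P2")
t_21 = protein_similarity(bit_scores, "P2", "P1")
```

Because the denominator is the query's self-score, the two directions can differ, so a matrix built this way is not guaranteed to be symmetric.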
Benchmark test and data preparation
For benchmark tests, ZINC data was filtered by IC50 ≤ 10 μM, which yielded 31,735 unique chemical-protein associations for 12,384 chemicals and 3,500 proteins (ZINC dataset [18]). Targets that are protein complexes or cell-based tests were excluded. Proteins whose primary sequence is unavailable were also excluded. Protein sequences were obtained from UniProt [22], and the whole protein sequences were used to calculate protein-protein similarity scores.
To assess the predictive power of our algorithm, we performed a 10-fold cross validation on the ZINC dataset described above. We set the parameters as follows: pwt = pimp = preg = 0.1, r = 300, pchem = 0.75, pprot = 0.1, and piter = 400. The optimized values determined by the 10-fold cross validation on the benchmark are shown in S2 Fig. Note that when pchem and pprot are optimized jointly, the best performance is achieved at pchem = 0.25 and pprot = 0.25. To further evaluate REMAP, we compared its performance on the ZINC dataset with several methods for different types of chemicals and proteins: a chemical similarity-based method (PRW [6]), the best-performing matrix factorization methods so far (NRLMF [17] and KBMF with twin kernels (KBMF2K) [16]), a combination of WNN and GIP (WNNGIP [14]), and another type of matrix factorization method (Collaborative Matrix Factorization (CMF) [15]).
To obtain a detailed view of the performance of the methods, we divided the ZINC dataset into 3 categories with 2 subcategories each, based on the connectivity of known chemical-protein associations and the degree of uniqueness of the chemicals. First, all the chemicals in the dataset were classified as having only one known target (NT1), two known targets (NT2), or three or more known targets (NT3). Then, the chemicals in each category were further divided based on either the number of known chemicals (ligands) with which their target proteins are associated (in increments of 5 ligands) or the maximum chemical-chemical similarity score for the chemical in the dataset (in increments of 0.1). The labels used in this paper for the datasets are NTaLb or NTaMaxTcd, where ‘NT’ stands for the Number of known Targets, ‘L’ for the number of known Ligands, and ‘Tc’ for the maximum (Tanimoto coefficient-based) chemical-chemical similarity score for the given chemical in the dataset, with NT = a, b ≤ L ≤ b + 4, and d − 0.1 < Tc ≤ d. For instance, NT2L1 is the dataset label for chemicals having two known targets and proteins having 1 to 5 ligands in the dataset, and NT1Tc0.9 is for chemicals that have one known target and whose most similar chemical has a similarity score between 0.8 and 0.9. Chemicals having three or more known targets are included in the NT3 class, and proteins having twenty-one or more known ligands are included in L21 (not limited to 25). The categories of the ZINC dataset were then used to evaluate the performance of off-target prediction, and their labels indicate the number of known ligands (L) or the maximum structural similarity (Tc) with their corresponding ranges. For example, ‘L21more’ stands for the dataset of proteins having 21 or more known ligands, and ‘Tc0.9to1.0’ stands for maximum structural similarity greater than 0.9 and up to 1.0 (Tc0.5to0.6 is inclusive of 0.5).
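The NTaLb labeling rule can be expressed compactly. The sketch below is an illustration of the binning described above (NT capped at 3, ligand bins of width 5 starting at 1, with everything from 21 upward in L21); the function name `ligand_bin_label` is hypothetical.

```python
def ligand_bin_label(num_targets, num_ligands):
    """Return the NTaLb label for a chemical-protein test pair.

    num_targets -- number of known targets of the chemical (>= 1)
    num_ligands -- number of known ligands of the protein (>= 1)
    """
    # Three or more known targets all fall into the NT3 class.
    nt = min(num_targets, 3)
    # Bins are 1-5, 6-10, 11-15, 16-20, and 21 or more (labelled by lower bound).
    lower = min(5 * ((num_ligands - 1) // 5) + 1, 21)
    return f"NT{nt}L{lower}"


label_a = ligand_bin_label(2, 3)    # two targets, protein with 1 to 5 ligands
label_b = ligand_bin_label(5, 40)   # capped at NT3 and L21
```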
Note that NT1 is equivalent to chemicals without any known target when they are tested for cross validation. Therefore, performances on NT1 datasets reflect the ability to address the cold start problem. In other words, when one known drug-target association is intentionally hidden for the chemicals in the NT1 dataset, the tested chemicals will not have any known target in the training data, and they are less likely to be given a good recommendation of targets. This is analogous to the new user or new item problem reviewed by Su et al. [28].
Measuring prediction accuracy of REMAP by TPR vs. cutoff rank
A typical measure of prediction performance is the Receiver Operating Characteristic (ROC) curve by which one can assess the reliability of the positively predicted results. However, it is difficult to apply the ROC curve on our chemical-protein association datasets since the vast majority of the chemical-protein pairs have not been tested, and thus it is unclear whether the missing entries are actually unassociated or just not yet observed.
In order to assess how reliable the positively predicted results from REMAP are, we needed to define a performance measurement that is analogous to ROC curve but not dependent on the true negatives. Our primary measure of performance is the true positive rate ( Recall or Recovery) at the top 1% of predictions for each chemical. To be specific, the top 1% of predictions includes up to the 35th-ranked predicted target protein for a chemical for our datasets (3,500 possible target proteins for each chemical). Thus, for instance, a TPR of 0.965 at the 35th cutoff rank (top 1%) means that 96.5% of the total tested positive pairs were ranked 35th or better for the tested chemicals.
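The recall-at-cutoff measure can be sketched as below, assuming per-chemical prediction scores are available as dictionaries and the held-out positives are listed as (chemical, protein) pairs. The names and data layout are hypothetical, and ties are broken by the sort order, which is an assumption.

```python
def tpr_at_cutoff(scores_by_chemical, held_out_pairs, cutoff_rank):
    """Fraction of held-out positive pairs ranked at or above cutoff_rank.

    scores_by_chemical -- dict: chemical_id -> {protein_id: predicted score}
    held_out_pairs     -- list of (chemical_id, protein_id) hidden positives
    cutoff_rank        -- e.g. 35 for the top 1% of 3,500 proteins
    """
    hits = 0
    for chem, true_protein in held_out_pairs:
        scores = scores_by_chemical[chem]
        # Rank proteins for this chemical from highest to lowest score.
        ranked = sorted(scores, key=scores.get, reverse=True)
        if true_protein in ranked[:cutoff_rank]:
            hits += 1
    return hits / len(held_out_pairs)


# Toy example with three candidate proteins per chemical.
scores = {
    "c1": {"p1": 0.9, "p2": 0.1, "p3": 0.5},
    "c2": {"p1": 0.2, "p2": 0.3, "p3": 0.8},
}
held_out = [("c1", "p1"), ("c2", "p1")]
tpr_top1 = tpr_at_cutoff(scores, held_out, cutoff_rank=1)
tpr_top3 = tpr_at_cutoff(scores, held_out, cutoff_rank=3)
```

No true negatives enter the calculation, which is the property the text asks of the measure.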
Scalability of REMAP as a matrix factorization algorithm
In order to assess the speed of REMAP for practical uses, we measured its running time by varying the rank parameter or the size of dataset. On the ZINC dataset (12,384 chemicals and 3,500 proteins), up to r = 2,000 was tested, and at fixed r = 200, dataset sizes up to 200,000 chemicals and 20,000 proteins were tested. The number of iterations (piter) was fixed to 400. A single node of CPU with 2.88 GB of memory in the City University of New York High Performance Computing Center (CUNY HPCC) was used for REMAP running time tests. We also compared the running times of different matrix factorization methods with ours. Due to the large time complexity and memory requirement for other algorithms, a multi-core node with up to 700 GB of shared memory system in CUNY HPCC was used for them on the ZINC dataset.
Genome-wide chemical-protein associations
Chemical-protein associations were obtained from the ZINC [18], ChEMBL [19] and DrugBank [20] databases. To obtain reliable chemical-protein association pairs, binding assay records with IC50 information were extracted from the databases, and a cutoff IC50 value of 10 μM was used where applicable. Two chemicals were considered the same if their InChI Keys are identical, and two proteins were considered the same if their UniProt Accessions are identical. For records with IC50 in μg/L (found in ChEMBL), the full molecular weights of the compounds listed in ChEMBL were used to convert μg/L to μM. Chemical-protein pairs were considered associated if IC50 ≤ 10 μM (active pairs), unassociated if IC50 > 10 μM (inactive pairs), ambiguous if records exist in both ranges (ambiguous pairs), and unobserved otherwise (unknown pairs). A total of 198,712 unique chemicals and 3,549 unique target proteins were obtained from the combination of ChEMBL and ZINC, with 228,725 unique chemical-protein active pairs, 76,643 inactive pairs, and 4,068 ambiguous pairs. Of the 198,712 chemicals, 722 were found to be FDA-approved drugs. Furthermore, drug-target relationships were extracted from DrugBank and integrated into the ZINC_ChEMBL dataset above. A total of 199,338 unique chemicals and 6,277 unique proteins were obtained from the combination of ZINC, ChEMBL, and DrugBank, with 233,378 unique chemical-protein active pairs.
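The unit conversion and the four-way labeling rule can be sketched as follows. The function names are hypothetical, and the input is assumed to be the list of all IC50 records (already in μM) for one chemical-protein pair.

```python
def ug_per_l_to_um(conc_ug_per_l, mol_weight_g_per_mol):
    """Convert a concentration from ug/L to uM.

    (ug/L) / (g/mol) = umol/L = uM
    """
    return conc_ug_per_l / mol_weight_g_per_mol


def classify_pair(ic50_values_um, cutoff_um=10.0):
    """Label a chemical-protein pair from all of its IC50 records (in uM)."""
    if not ic50_values_um:
        return "unknown"
    has_active = any(v <= cutoff_um for v in ic50_values_um)
    has_inactive = any(v > cutoff_um for v in ic50_values_um)
    if has_active and has_inactive:
        # Records fall on both sides of the cutoff.
        return "ambiguous"
    return "active" if has_active else "inactive"


# Toy record: 5000 ug/L for a compound of molecular weight 500 g/mol.
ic50_um = ug_per_l_to_um(5000.0, 500.0)
label = classify_pair([ic50_um])
```

A record exactly at the cutoff counts as active, matching the IC50 ≤ 10 μM rule.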
Drug-target interaction profile analysis for drug repurposing
Since REMAP showed promising performance in predicting off-targets for chemicals with at least one known target, it is possible to use REMAP to suggest new uses for some FDA-approved drugs. As the matrix product of UUP (chemical-side low-rank matrix) and VUP (protein-side low-rank matrix) is the predicted drug-target interaction matrix P, the ith row of UUP contains the target interaction profile of the ith drug. Therefore, we analyzed the drug-drug similarities based on the low-rank matrix UUP. We ran REMAP on the combination of the three databases explained above, with the parameters used in the benchmark evaluations. Then, we calculated drug-drug cosine similarities based on the matrix UUP. For the rows of UUP corresponding to FDA-approved drugs, the cosine similarity of drug c1 and drug c2 is calculated as $\cos(c_1,c_2) = \frac{u_{c_1} \cdot u_{c_2}}{\lVert u_{c_1} \rVert \, \lVert u_{c_2} \rVert}$, where $u_c$ denotes the row of UUP for drug c. To search for possibly undiscovered uses of the drugs, we focused on drugs that have high cosine similarity but low Tanimoto similarity (< 0.5). The Markov Cluster (MCL) Algorithm [29, 30] was used to cluster drugs based on the cosine similarity of their low-rank target profiles. Drug-disease associations were obtained from the Comparative Toxicogenomics Database (CTD) [31].
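The profile comparison reduces to a cosine similarity between two rows of the chemical-side low-rank matrix. A minimal sketch, assuming each row is available as a plain list of r numbers; the function name, the toy vectors, and the choice to return 0 for an all-zero row are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two low-rank drug profiles (equal-length lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero profile is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy profiles with r = 3 (hypothetical values).
profile_a = [1.0, 2.0, 3.0]
profile_b = [2.0, 4.0, 6.0]   # same direction as profile_a
profile_c = [3.0, 0.0, -1.0]  # orthogonal to profile_a

sim_ab = cosine_similarity(profile_a, profile_b)
sim_ac = cosine_similarity(profile_a, profile_c)
```

Because the measure depends only on direction, two drugs with proportional profiles score close to 1 regardless of magnitude.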
Prediction score adjustment
The raw prediction score, p(i,j), can be adjusted to better reflect the real data as well as to statistically discriminate the positive and negative predictions. We used the active, inactive and ambiguous pairs obtained from the ChEMBL database to adjust the score. REMAP prediction on the ZINC_ChEMBL dataset showed a clear division between the active and inactive pairs, suggesting that predictions scored around 1.0 are highly likely to be positive (Fig 2A). As mentioned above, however, there is a large difference between the number of active and inactive pairs, which is not likely to reflect the ratio of the actual positive and negative chemical-protein pairs. Greater accuracy is expected by adjusting the prediction scores to reflect such a positive/negative ratio. To estimate the ratio, we first normalized the counts in each bin of the histogram (Fig 2A) and calculated the weights that minimize the sum of squared errors, $E_{sum}(w_1) = \sum_i \left[A_i - \{w_1 p_i + (1 - w_1) N_i\}\right]^2$, where w1 and w2 are the weights on active and inactive pairs, respectively (w1 + w2 = 1.0), and Ai, pi and Ni are the normalized counts in the ith bin of ambiguous, active and inactive pairs, respectively. The optimum adjustment weights were approximately w1 = 0.16 and w2 = 0.84 (Fig 2B). This implies that approximately 16% of total observations are positive. Since the negative/positive ratio is about 5.25 (0.84/0.16), we increased the number of observations for inactive pairs in each bin by 5.25 times and rounded down. The adjusted prediction score for each bin (Bi) was calculated using the increased negative counts.
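(2) $B_i = \dfrac{a_i}{a_i + \lfloor 5.25\, n_i \rfloor}$, where $a_i$ and $n_i$ are the observed numbers of active and inactive pairs in the ith bin.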
It is noted that the prediction score adjustment was not used in the benchmark study, where no negative data were used.
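The weight estimation above is a one-parameter least-squares problem. Rewriting each residual as (Ai − Ni) − w1(pi − Ni) gives the closed-form minimizer w1 = Σ(Ai − Ni)(pi − Ni) / Σ(pi − Ni)². The sketch below uses that closed form; the source does not state how the minimum was actually found, so this is an assumption, and the function name and toy histograms are hypothetical. The result is not clipped to the interval [0, 1].

```python
def fit_active_weight(ambiguous, active, inactive):
    """Least-squares weight w1 so that w1*active + (1 - w1)*inactive ~ ambiguous.

    All three arguments are lists of normalized bin counts of equal length.
    """
    num = sum((a - n) * (p - n) for a, p, n in zip(ambiguous, active, inactive))
    den = sum((p - n) ** 2 for p, n in zip(active, inactive))
    if den == 0:
        raise ValueError("active and inactive histograms are identical")
    return num / den


# Toy normalized histograms over three score bins (hypothetical values).
active = [0.0, 0.2, 0.8]
inactive = [0.7, 0.2, 0.1]
# Built as an exact 16% / 84% mixture so the fit should recover w1 = 0.16.
ambiguous = [0.16 * p + 0.84 * n for p, n in zip(active, inactive)]

w1 = fit_active_weight(ambiguous, active, inactive)
neg_pos_ratio = (1.0 - w1) / w1
```

With w1 = 0.16 the negative/positive ratio is 0.84/0.16 = 5.25, the factor quoted in the text.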
Graphic analysis
Drug-drug clustered network was visualized using Cytoscape [32].
Results
REMAP is highly effective in predicting off-targets even for novel chemicals
We evaluated the performance of the algorithms for chemicals having one, two, or three or more known targets, with varying ranges of maximum chemical-chemical similarity or with proteins having a certain number of known ligands (datasets prepared as explained in the Materials and Methods section). In general, the performance of the algorithms improves as the number of known ligands per protein or the maximum chemical-chemical similarity value increases.
It was noticeable that REMAP performed significantly better than PRW when there was at least one known target for the chemical whose targets are predicted (Figs 3 and 4). REMAP showed greater than 90% recovery at the top 1% when the tested chemicals have at least one known target. All algorithms are sensitive to the number of ligands per target: the more ligands, the higher the accuracy. While PRW also reached reasonably high recovery for some categories (e.g. more than 11 known ligands per protein, or high similarity to the most similar training chemicals), REMAP remained reliable for test chemicals without high similarity to the training chemicals (Figs 3B and 4B). In other words, REMAP is applicable to chemicals that are structurally distant from the chemicals already in the dataset. Except where the target proteins have 1 to 5 known ligands, REMAP performed best among the three algorithms in all cases with at least one known target for the tested chemicals (Figs 3 and 4). In most cases, the differences in performance between REMAP and the other two algorithms are statistically significant. Therefore, in practice, REMAP can predict potential drug targets for chemicals with at least one known target in the training data, even when the chemicals are structurally dissimilar to the training chemicals. With the optimized parameters (see below), ROC-like curves show the general trend of the performance of the three algorithms up to the top 10% of predictions (S3 and S4 Figs).
As shown in Figs 3 and 4, REMAP outperforms the state-of-the-art NRLMF algorithm in most of the tested cases. As NRLMF is sensitive to the rank parameter, we carried out optimizations to determine the optimal rank and number of iterations for NRLMF (S5 Fig). The optimal rank and number of iterations used in the evaluation were 100 and 300, respectively. Moreover, in the current implementation, REMAP is approximately 10 times faster and uses 50% less memory than NRLMF. Consistent with the results of Liu et al. [17], the accuracies of NRLMF are significantly higher than those of KBMF2K, CMF, and WNNGIP in all of the ZINC benchmarks. Overall, REMAP is one of the best-performing methods for genome-wide off-target predictions.
Chemical-chemical similarity based on Tanimoto coefficient significantly helps REMAP’s performance, while protein-protein similarity information contains significant noise
To test whether the chemical-chemical similarity matrix helps prediction, we performed 10-fold cross validation on the ZINC dataset with the contents of the chemical-chemical or the protein-protein similarity matrix controlled. In other words, about half of the non-zero chemical-chemical similarity scores were randomly chosen and removed (set to 0) for the “half-filled chemical similarity” matrix, and all entries are set to 0 for the “zero-filled chemical similarity” matrix. The predictive power of REMAP showed noticeable improvement when all available chemical-chemical similarity pairs were used, compared to the half-filled or the zero-filled similarity matrix (Fig 5A). Similarly, the contents of the protein-protein similarity matrix were controlled (e.g. half-filled protein similarity, and zero-filled protein similarity) while the full chemical similarity matrix was used. Unlike the chemical-chemical similarity, the protein-protein similarity information did not necessarily improve REMAP’s predictive power. The performance was best when a half of the protein-protein similarity information was used together with the full chemical-chemical similarity matrix (Fig 5B). This suggests that there is significant noise in the protein-protein sequence similarity matrix although the information does help prediction. A careful examination of the BLAST-based protein-protein similarity matrix may give an insight into the design of a novel protein-protein similarity metric for drug-target binding activities (see discussion section).
We also performed optimization tests for pchem and pprot on the ZINC dataset. Although the performance was slightly better when the chemical-chemical similarity importance was at its maximum (Fig 6A), the difference was too small to conclude that it is best to fix pchem = 1, and at that setting the prediction may rely too heavily on the chemical-chemical similarity scores. Therefore, to allow flexibility in the use of chemical-chemical similarity information, we set pchem = 0.75, at which the performance was almost as accurate as at pchem = 1. On the other hand, the performance was best when the protein-protein sequence similarity importance, pprot, was 0.1 (Fig 6B), further supporting our claim that protein-protein sequence similarity is not an optimal choice for the prediction of drug-target interactions. When pchem and pprot are optimized jointly, their optimal values are both 0.25 in the 10-fold cross validation benchmark evaluation (S2B Fig).
Our result supports a recent study [25] which showed that the Tanimoto coefficient is efficient for chemical similarity calculation. Chemical fingerprint-based chemical-protein association prediction has been studied by Koutsoukas et al. [6]. By defining bins (target proteins) that can contain certain chemical features based on the chemical fingerprints, Koutsoukas et al. successfully demonstrated that their algorithm, PRW, can efficiently predict unknown chemical-protein associations [6]. While the basic idea of dissecting chemical compounds into functional groups is the same, it should be noted that PRW does not consider information obtained from proteins or from the interactome.
REMAP is readily scalable for large chemical-protein data space
In all our tests, REMAP ran quickly without losing accuracy. On our benchmark dataset (ZINC; 12,384 chemicals and 3,500 proteins), it took approximately 120 seconds to run 400 iterations at a rank of 200 (r = 200, piter = 400). The running time depends linearly on the rank (Fig 7A). REMAP scales better than KBMF2K, a state-of-the-art matrix factorization algorithm that is implemented in Matlab and has been extensively studied for predicting drug-target interactions [16]. KBMF2K took more than 10 days for a matrix of the same size on the same computer system in the ZINC benchmark. Moreover, REMAP was capable of higher-rank factorization, whereas KBMF2K was limited to rank 200 on our system because of its memory requirement (over 100 GB of memory). At a much higher rank (r = 2,000), REMAP required less than one hour on the same dataset (Fig 7A). Time complexity experiments on a larger dataset showed that REMAP completed predictions on a dataset with 200,000 rows and 20,000 columns within 2 hours on a single-core computing system with 2.88 GB of memory, demonstrating its ability to screen the whole human genome of approximately 20,000 proteins in two hours (Fig 7B).
Large scale prediction of drug-target interactions
Since REMAP is scalable and shows superior accuracy based on our benchmark tests, we performed large scale prediction of drug-target interactions on the ZCD dataset (explained in the Materials and Methods section). As explained in the prediction score adjustment section, prediction scores for the active pairs were mostly located between 0.75 and 1.0 (Fig 2A).
Low rank profile based drug-drug similarity analysis
As expected, the percentage of pairs of chemicals that share common targets decreases with the decrease of the chemical structural similarity measured by the Tc of ECFP fingerprints (). The percentage of target-sharing chemical pairs drops below 50% and 0.5% when the Tc is between 0.5 and 0.6, and less than 0.5, respectively (S6 Fig). Thus, it is less likely that the chemical structural similarity alone can reliably detect novel binding relations between two chemicals when the Tc is less than 0.5. It is interesting to see how REMAP performs when the chemical structural similarity fails.
We analyzed the low-rank drug profile (matrix UUP) to check whether it represents the target-binding behavior of the drugs. When filtered by low chemical structure similarity (), there are 899,871 drug-drug pairs. Among them, the profile similarity score () of 91,888 pairs is higher than 0.3. With high profile similarity (), a total of 1,327 drug-drug pairs were found, of which 1,033 pairs shared at least one common known target. S7 Fig shows the percentage of pairs that share a common target in different profile similarity buckets for FDA-approved drugs. This result suggests that REMAP is able to provide a chemical-protein binding profile that cannot be captured by chemical structure similarity alone.
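The selection of drug pairs that are structurally dissimilar yet similar in their low-rank profiles can be sketched as below. Several points are assumptions rather than details taken from this text: cosine similarity is used as the profile similarity measure, the structural cutoff `tc_max = 0.5` is only an illustrative value, and all function and variable names are hypothetical. The profile cutoff of 0.3 follows the value quoted above.

```python
import numpy as np

def cosine_similarity_matrix(profiles):
    """Pairwise cosine similarity between the rows of `profiles`."""
    profiles = np.asarray(profiles, dtype=float)
    norms = np.linalg.norm(profiles, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows as zero vectors
    unit = profiles / norms
    return unit @ unit.T

def dissimilar_structure_similar_profile(profiles, tc, tc_max, profile_min):
    """Return index pairs (i, j), i < j, whose structural similarity is
    below `tc_max` and whose profile similarity is above `profile_min`."""
    tc = np.asarray(tc, dtype=float)
    profile_sim = cosine_similarity_matrix(profiles)
    n = profile_sim.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            if tc[i, j] < tc_max and profile_sim[i, j] > profile_min:
                pairs.append((i, j))
    return pairs

# Toy low-rank profiles for three drugs (hypothetical values).
low_rank_profiles = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]])
# Toy Tanimoto coefficients between the same three drugs.
tanimoto_matrix = np.array([
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.9],
    [0.1, 0.9, 1.0],
])

print(dissimilar_structure_similar_profile(
    low_rank_profiles, tanimoto_matrix, tc_max=0.5, profile_min=0.3))
```

In this toy input only drugs 0 and 1 qualify: their structural similarity is low (0.2), but their profiles point in the same direction.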
When , the percentage of drug pairs that share a common target drops below 50% (S7 Fig). We constructed a drug-drug similarity network by filtering out drug pairs with , then applied the MCL algorithm to the drug-drug network to find clusters of similar drugs. The largest cluster contained a total of 313 drugs, and their relationships to diseases were examined based on the known associations annotated in CTD [31]. We found that the drugs are mostly related to mental disorders; the most closely related diseases include hyperkinesis, dystonia, catalepsy, schizophrenia and basal ganglia diseases. The most frequent known protein targets of the drugs were GPCRs (S1 Table): GPCRs were targeted 1,924 times, whereas kinases were targeted only 55 times. Although it would be interesting to examine this cluster further, validating all of the possible drug-target pairs in the largest cluster may be inefficient.
A smaller cluster contained a total of thirty-one FDA-approved drugs, twenty-six of which are known to target kinases or interact with microtubules (Table 2). Seven drugs in the cluster have not been used for cancer treatment and were found to be closely linked to the anti-cancer drugs (Fig 8 and Table 2). Interestingly, several of them have been tested for anti-cancer activity. For example, colchicine (also known as colchine), an FDA-approved drug for gout treatment, has been shown to have anti-proliferative effects on several human liver cancer cell lines at clinically acceptable concentrations [33]. Griseofulvin, an antifungal antibiotic, appears to be effective against cancer when used together with other anti-cancer drugs [34]. The three anthelmintic drugs, albendazole, mebendazole and niclosamide, have been studied and repurposed for their anti-cancer effects on different types of cancers. Albendazole has been shown to suppress liver cancer cells both in vitro and in vivo [35], and has recently been repurposed for ovarian cancer treatment with a bovine serum albumin-based nanoparticle drug delivery system [36]. Mebendazole showed anti-cancer activity in human lung cancer cell lines [37] and human adrenocortical carcinoma cell lines [38], and it has been repurposed for colon cancer treatment [39]. Both niclosamide and mebendazole showed beneficial effects in glioblastoma in different studies [40, 41]. Aprepitant has been proposed for use in combination with other compounds to improve the efficacy of temozolomide, the current standard drug for glioblastoma treatment [42]. Anti-cancer activity of carbidopa hydrate has not yet been reported. It will be interesting to validate this prediction experimentally.
Table 2. The known uses and target information for the anti-cancer drug cluster in Fig 8B obtained from DrugBank.
Drug name | Approved treatment(s) | Known binding target(s) | Principal mode of action |
---|---|---|---|
Albendazole | Parenchymal neurocysticercosis | F1L7U3, Q71U36, P68371, P83223 | Tubulin polymerization inhibitor |
Aprepitant | Antiemetic | P25103 | Substance P/Neurokinin NK1 receptor antagonist |
Carbidopa hydrate | Reduce adverse effects of levodopa in Parkinson disease treatment | P20711 | DOPA decarboxylase inhibitor |
Colchine | Gout | Q9H4B7, P07437 | N/A (depolymerize microtubule) |
Griseofulvin | Ringworm infection | P10875, P87066, Q99456 | N/A |
Mebendazole | Anthelmintic | Q71U36, P68371 | Tubulin polymerization inhibitor |
Niclosamide | Anthelmintic against tapeworm infections | P40763, O60674, P12931 | disrupt oxidative phosphorylation |
Aza-epothilone B | Breast cancer | Q13509 | Microtubule stabilizer |
Bosutinib | Chronic Myelogenous Leukemia | P11274, P00519, P07948, P08631, P12931, P24941, Q02750, P36507, Q9Y2U5, Q13555 | Tyrosine kinase inhibitor |
Cabazitaxel | Prostate cancer | P68366, Q9H4B7 | Microtubule stabilizer |
Crizotinib | Non-small cell lung cancer | Q9UM73, P08581 | Anaplastic lymphoma kinase inhibitor |
Dabrafenib | Metastatic melanoma | P15056, P04049, P57059, Q8NG66, P53667 | Inhibitor of some mutant BRAF kinases |
Dasatinib | Chronic myeloid leukemia | P00519, P12931, P29317, P06239, P07947, P10721, P09619, P51692, P24684, P06241 | BCR/ABL and Src family tyrosine kinase inhibitor |
Docetaxel | Breast, ovarian and non-small cell lung cancer | Q9H4B7, P10415, P11137, P27816, P10636, O75469 | Microtubule stabilizer |
Erlotinib | Non-small cell lung cancer, pancreatic cancer | P00533, O75469 | N/A (EGFR inhibitor) |
Gefitinib | Non-small cell lung cancer | P00533 | EGFR inhibitor |
Imatinib | Chronic myelogenous leukemia | A9UF02, P10721, O43519, P04629, P07333, P16234, Q08345, P00519, P09619 | Tyrosine kinase inhibitor |
Nilotinib | Various leukemias (investigational) | P00519, P10721 | Tyrosine kinase inhibitor |
Paclitaxel | Lung, ovarian and breast cancers | P10415, Q9H4B7, O75469, P27816, P11137, P10636 | Microtubule stabilizer |
Pazopanib | Renal cell cancer and soft tissue sarcoma | P17948, P35968, P35916, P16234, P09619, P10721, P22607, Q08881, P05230, Q9UQQ2 | Tyrosine kinase inhibitor |
Ponatinib | Chronic myeloid leukemia | P00519, P11274, P10721, P07949, Q02763, P36888, P11362, P21802, P22607, P22455, P06239, P12931, P07948, P35968, P16234 | Bcr-Abl tyrosine kinase inhibitor |
Regorafenib | Metastatic colorectal cancer and gastrointestinal stromal tumors | P07949, P17948, P35968, P35916, P10721, P16234, P09619, P11362, P21802, Q02763, Q16832, P04629, P29317, P04049, P15056, P15759, P42685, P00519 | Multiple kinases inhibitor |
Ruxolitinib | Myelofibrosis | P23458, O60674 | Janus Associated Kinases (JAK) 1 and 2 inhibitor |
Sorafenib | Renal cell carcinoma | P15056, P04049, P35916, P35968, P36888, P09619, P10721, P11362, P07949, P17948 | Inhibitor of Raf kinase, PDGF, VEGFR 2 and 3 |
Sunitinib | Renal cell carcinoma and gastrointestinal stromal tumor | P09619, P17948, P10721, P35968, P35916, P36888, P07333, P16234 | Multi-targeted receptor tyrosine kinase inhibitor |
Trametinib | Metastatic melanoma | Q02750, P36507 | Allosteric inhibitor of mitogen-activated extracellular signal regulated kinase 1 and 2 |
Vandetanib | Broad range tumor types | P15692, P00533, Q13882, Q02763 | Inhibitor of VEGFR |
Vinblastine | Breast, testicular cancers, lymphomas, neuroblastoma | Q71U36, P07437, Q9UJT1, P23258, Q9UJT0, P05412 | N/A (inhibition of mitosis at metaphase) |
Vincristine | Acute lymphocytic leukemia, lymphomas, neuroblastoma, rhabdomyosarcoma | P07437, P68366 | N/A (inhibition of mitosis at metaphase) |
Vindesine | Acute leukemia, malignant lymphoma, Hodgkin’s disease, acute erythraemia, acute panmyelosis | Q9H4B7 | Inhibition of mitosis at metaphase |
Vinorelbine | Non-small cell lung carcinoma | P07437 | N/A (inhibition of mitosis at metaphase) |
Discussion
REMAP improves the predictive power of off-target prediction and drug repurposing
Our extensive benchmark studies show that REMAP outperforms existing algorithms in most cases for off-target prediction. Compared with other state-of-the-art matrix factorization algorithms, the predictive power of REMAP comes from several improvements. First, we formulated drug-target prediction as a one-class collaborative filtering problem; thus, negative data are not required for training. Second, a priori knowledge, including known negative data, can be incorporated into the matrix factorization through imputation and weighting. Finally, by using global imputation and weighting, the algorithm is computationally efficient without significantly sacrificing its performance.
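To make the roles of imputation, weighting, and the two similarity regularizers concrete, the following is a minimal sketch of a weighted low-rank factorization in this spirit. It is not the REMAP implementation (the released software is at https://github.com/hansaimlim/REMAP). It assumes an objective of the form sum of W_ij (R_ij - U_i V_j)^2 plus an L2 penalty and graph-Laplacian penalties built from the chemical and protein similarity matrices, and it uses plain gradient descent for brevity. All function names, parameter names (`w`, `p`, `lam`, `alpha`, `beta`, `lr`), and default values are hypothetical.

```python
import numpy as np

def laplacian(sim):
    """Graph Laplacian D - S of a symmetric similarity matrix S."""
    sim = np.asarray(sim, dtype=float)
    return np.diag(sim.sum(axis=1)) - sim

def objective(R, W, U, V, Lc, Lp, lam, alpha, beta):
    """Weighted squared error plus L2 and similarity regularization."""
    fit = (W * (R - U @ V.T) ** 2).sum()
    l2 = lam * ((U ** 2).sum() + (V ** 2).sum())
    smooth = alpha * np.trace(U.T @ Lc @ U) + beta * np.trace(V.T @ Lp @ V)
    return fit + l2 + smooth

def factorize(A, chem_sim, prot_sim, rank=2, w=0.1, p=0.0, lam=0.1,
              alpha=0.1, beta=0.1, lr=0.01, n_iter=500, seed=0):
    """Factorize a chemical-by-protein matrix A of known positive pairs.

    Unobserved entries are imputed with the global value `p` and given
    the global weight `w`; observed entries keep weight 1.
    """
    A = np.asarray(A, dtype=float)
    rng = np.random.default_rng(seed)
    observed = A > 0
    R = np.where(observed, A, p)      # global imputation
    W = np.where(observed, 1.0, w)    # global weighting
    Lc, Lp = laplacian(chem_sim), laplacian(prot_sim)
    U = 0.1 * rng.standard_normal((A.shape[0], rank))
    V = 0.1 * rng.standard_normal((A.shape[1], rank))
    history = []
    for _ in range(n_iter):
        history.append(objective(R, W, U, V, Lc, Lp, lam, alpha, beta))
        E = W * (U @ V.T - R)         # weighted residual
        grad_U = 2 * (E @ V + lam * U + alpha * (Lc @ U))
        grad_V = 2 * (E.T @ U + lam * V + beta * (Lp @ V))
        U -= lr * grad_U
        V -= lr * grad_V
    history.append(objective(R, W, U, V, Lc, Lp, lam, alpha, beta))
    return U, V, history

# Toy data: 4 chemicals x 3 proteins (hypothetical values).
A = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]], dtype=float)
chem_sim = np.array([[0, .8, .1, 0], [.8, 0, .3, .1],
                     [.1, .3, 0, .7], [0, .1, .7, 0]])
prot_sim = np.array([[0, .5, .1], [.5, 0, .4], [.1, .4, 0]])

U, V, history = factorize(A, chem_sim, prot_sim)
scores = U @ V.T  # predicted scores for every chemical-protein pair
print(history[0], history[-1])
```

Because no entry is treated as a confirmed negative, the unobserved pairs only pull the scores weakly toward the imputed value, while the Laplacian terms encourage similar chemicals (and similar proteins) to receive similar low-rank representations.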
The efficiency and effectiveness of REMAP allow us to predict proteome-wide target binding profiles of hundreds of thousands of chemicals. As the proteome-wide target binding profile is more correlated with phenotypic response than binding to a single target, REMAP will facilitate linking molecular interactions in the test tube with in vivo drug activity. When the multi-target binding profile predicted by REMAP was used as the signature of a chemical compound, seven drugs were found to be associated with anti-cancer therapeutics, although they do not have detectable chemical structural similarity. Among them, the anti-cancer activity of six drugs was supported by experimental evidence. Thus, REMAP could be a useful tool for drug repurposing.
Remaining issues and future directions
Although REMAP showed high potential for genome-wide off-target prediction as discussed above, two issues remain: the cold start problem and suboptimal protein-protein similarity metrics. Like other matrix factorization algorithms such as NRLMF, REMAP suffers from the cold start problem, also known as the new user or new item problem. In a recommender system, it is difficult to recommend a product to a new user who has never purchased or reviewed a product in the database [28]. For novel chemicals that do not have any known target in the dataset, REMAP did not show better performance than PRW. Moreover, if the target of the novel chemical has 5 or fewer known ligands, the recovery of REMAP is lower than 0.5 (S8A Fig). When the novel chemical is similar to chemicals in the database, the recovery of REMAP reached above 90% (S8B Fig). These results suggest that, in practice, existing matrix factorization-based methods, including REMAP, are not the optimal choice if the chemicals of interest do not have any known target. To resolve this issue, it is possible to design an algorithm that combines the benefits of PRW or other algorithms with REMAP. The use of confidence weights and a priori imputation makes it straightforward for REMAP to incorporate additional information. In addition, the time and memory efficiency of REMAP makes it possible to apply active learning to overcome the cold start problem [43–46].
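In practice, the cases affected by the cold start problem can be flagged before prediction by inspecting the known interaction matrix. The sketch below is an illustration under stated assumptions: the function name `cold_start_report` and the toy matrix are hypothetical, rows are taken to be chemicals and columns to be proteins, and the default cutoff of 5 ligands mirrors the observation above.

```python
import numpy as np

def cold_start_report(A, max_ligands=5):
    """Flag rows and columns of a known-interaction matrix that are
    likely to suffer from the cold start problem.

    Returns the indices of chemicals (rows) with no known target and of
    proteins (columns) with `max_ligands` or fewer known ligands.
    """
    known = np.asarray(A) > 0
    targets_per_chemical = known.sum(axis=1)
    ligands_per_protein = known.sum(axis=0)
    cold_chemicals = np.flatnonzero(targets_per_chemical == 0)
    sparse_proteins = np.flatnonzero(ligands_per_protein <= max_ligands)
    return cold_chemicals.tolist(), sparse_proteins.tolist()

# Toy matrix: 4 chemicals x 3 proteins (hypothetical values).
interactions = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 0],   # chemical 2 has no known target
    [1, 0, 0],
])

cold, sparse = cold_start_report(interactions, max_ligands=1)
print(cold, sparse)
```

Chemicals flagged in this way could then be routed to a fingerprint-based method such as PRW instead of, or in addition to, the matrix factorization.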
The suboptimal performance of REMAP may arise from the lack of molecular-level biochemical details in deriving the protein-protein similarity metrics. When testing the ZINC dataset, we found that REMAP performed better when a lower weight was assigned to the protein-protein sequence similarity data (Fig 6B). In addition, the predictive power of REMAP improved when about half of the randomly selected protein-protein similarity scores were removed, further confirming that noise confounds relating global sequence similarity to ligand binding (Fig 5B). It is not surprising that proteins with similar sequences do not necessarily bind to similar chemicals, as protein-ligand interaction is governed by the spatial organization of amino acid residues in the protein structure [47]. Amino acid mutations, post-translational modifications, and conformational dynamics may alter the binding of the ligand through direct modification of the ligand binding site or through allosteric interaction. A protein may also contain multiple binding sites that accommodate different types of ligands. Thus, two proteins with high sequence similarity do not necessarily bind the same ligands because the two proteins may possess different 3D conformations, especially in their binding pockets [47]. In contrast, two proteins with low sequence similarity can bind to the same ligands if their binding pockets are similar [48, 49]. Binding site similarity can be a more biologically sensitive measure of protein-protein similarity for off-target prediction [50–55]. Such work is ongoing.
Conclusion
In silico drug-target screening is an essential step to reduce costly experimental steps in drug development. In this study, we showed that a dual-regularized one-class collaborative filtering algorithm, which belongs to a class of computational methods frequently used in user-item preference recommendation, can be applied to drug-target association prediction. Our study presents REMAP, a collaborative filtering algorithm capable of running whole human genome-level predictions within two hours. Other studies on some types of cancer treatment support our algorithm’s ability to capture drug-drug similarities based on both the drug-target interaction profile and the chemical structural similarity. Our study also shows the limitations of REMAP in evaluating new chemicals and in accommodating biochemical details. Further development of computational tools for better prediction is needed.
Supporting Information
Acknowledgments
We acknowledge Miriam Cohen, Ph.D. for proof-reading the manuscript.
Data Availability
The software and benchmark data are available at https://github.com/hansaimlim/REMAP. All other relevant data are within the paper and its Supporting Information.
Funding Statement
This research was supported by the National Library of Medicine of the National Institutes of Health under the award number R01LM011986 (LX), the National Science Foundation under the award numbers CNS-0958379, CNS-0855217, and ACI-1126113, and the City University of New York High Performance Computing Center at the College of Staten Island. AP is supported, in part, by the 2016 UNI’s Summer Fellowship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
References
- 1. Dickson M, Gagnon JP. The cost of new drug discovery and development. Discovery Medicine. 2009;4(22):172–9.
- 2. Xie L, Xie L, Kinnings SL, Bourne PE. Novel computational approaches to polypharmacology as a means to define responses to individual drugs. Annual review of pharmacology and toxicology. 2012;52:361–79. doi:10.1146/annurev-pharmtox-010611-134630
- 3. Bowes J, Brown AJ, Hamon J, Jarolimek W, Sridhar A, Waldron G, et al. Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nature reviews Drug discovery. 2012;11(12):909–22. doi:10.1038/nrd3845
- 4. Butler D, Callaway E. Scientists in the dark after French clinical trial proves fatal. Nature. 2016;529(7586):263–4. doi:10.1038/nature.2016.19189
- 5. Haggarty SJ, Koeller KM, Wong JC, Butcher RA, Schreiber SL. Multidimensional chemical genetic analysis of diversity-oriented synthesis-derived deacetylase inhibitors using cell-based assays. Chemistry & biology. 2003;10(5):383–96. doi:10.1016/S1074-5521(03)00095-4
- 6. Koutsoukas A, Lowe R, KalantarMotamedi Y, Mussa HY, Klaffke W, Mitchell JB, et al. In silico target predictions: defining a benchmarking data set and comparison of performance of the multiclass naïve bayes and parzen-rosenblatt window. Journal of chemical information and modeling. 2013;53(8):1957–66. doi:10.1021/ci300435j
- 7. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotechnol. 2007;25(2):197–206. doi:10.1038/nbt1284
- 8. Cruz-Monteagudo M, Medina-Franco JL, Perez-Castillo Y, Nicolotti O, Cordeiro MN, Borges F. Activity cliffs in drug discovery: Dr Jekyll or Mr Hyde? Drug Discov Today. 2014;19(8):1069–80. doi:10.1016/j.drudis.2014.02.003
- 9. Dahl GE, Jaitly N, Salakhutdinov R. Multi-task Neural Networks for QSAR Predictions. arXiv:1406.1231 [stat.ML]. 2014.
- 10. Ma J, Sheridan RP, Liaw A, Dahl GE, Svetnik V. Deep neural nets as a method for quantitative structure-activity relationships. J Chem Inf Model. 2015;55(2):263–74. doi:10.1021/ci500747n
- 11. Ramsundar B, Kearnes S, Riley P, Webster D, Konerding D, Pande V. Massively Multitask Networks for Drug Discovery. arXiv:1502.02072 [stat.ML]. 2015.
- 12. Unterthiner T, Mayr A, Klambauer G, Hochreiter S. Toxicity Prediction Using Deep Learning. arXiv:1503.01445 [stat.ML]. 2015.
- 13. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug–target interaction. Bioinformatics. 2011;27(21):3036–43. doi:10.1093/bioinformatics/btr500
- 14. van Laarhoven T, Marchiori E. Predicting drug-target interactions for new drug compounds using a weighted nearest neighbor profile. PloS one. 2013;8(6):e66952. doi:10.1371/journal.pone.0066952
- 15. Zheng X, Ding H, Mamitsuka H, Zhu S, editors. Collaborative matrix factorization with multiple similarities for predicting drug-target interactions. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 2013: ACM.
- 16. Gönen M. Predicting drug–target interactions from chemical and genomic kernels using Bayesian matrix factorization. Bioinformatics. 2012;28(18):2304–10. doi:10.1093/bioinformatics/bts360
- 17. Liu Y, Wu M, Miao C, Zhao P, Li X-L. Neighborhood Regularized Logistic Matrix Factorization for Drug-Target Interaction Prediction. PLoS Comput Biol. 2016;12(2):e1004760. doi:10.1371/journal.pcbi.1004760
- 18. Irwin JJ, Sterling T, Mysinger MM, Bolstad ES, Coleman RG. ZINC: a free tool to discover chemistry for biology. Journal of chemical information and modeling. 2012;52(7):1757–68. doi:10.1021/ci3001277
- 19. Bento AP, Gaulton A, Hersey A, Bellis LJ, Chambers J, Davies M, et al. The ChEMBL bioactivity database: an update. Nucleic acids research. 2013:gkt1031. doi:10.1093/nar/gkt1031
- 20. Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Tzur D, et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic acids research. 2008;36(suppl 1):D901–D6. doi:10.1093/nar/gkm958
- 21. Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, et al. PubChem substance and compound databases. Nucleic acids research. 2015:gkv951. doi:10.1093/nar/gkv951
- 22. Consortium U. UniProt: a hub for protein information. Nucleic acids research. 2014:gku989. doi:10.1093/nar/gku989
- 23. Liu H, Sun J, Guan J, Zheng J, Zhou S. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics. 2015;31(12):i221–i9. doi:10.1093/bioinformatics/btv256
- 24. Yao Y, Tong H, Yan G, Xu F, Zhang X, Szymanski BK, et al., editors. Dual-regularized one-class collaborative filtering. Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management; 2014: ACM.
- 25. Bajusz D, Rácz A, Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? Journal of Cheminformatics. 2015;7(1):1–13. doi:10.1186/s13321-015-0069-3
- 26. ChemAxon. Screen was used for generating pharmacophore descriptors and screening structures, JChem 15.3.2.0. 2015.
- 27. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC bioinformatics. 2009;10(1):1. doi:10.1186/1471-2105-10-421
- 28. Su X, Khoshgoftaar TM. A survey of collaborative filtering techniques. Advances in artificial intelligence. 2009;2009:4. doi:10.1155/2009/421425
- 29. Van Dongen S. A cluster algorithm for graphs. Report-Information systems. 2000;(10):1–40.
- 30. Enright AJ, Van Dongen S, Ouzounis CA. An efficient algorithm for large-scale detection of protein families. Nucleic acids research. 2002;30(7):1575–84. doi:10.1093/nar/30.7.1575
- 31. Davis AP, Grondin CJ, Lennon-Hopkins K, Saraceni-Richards C, Sciaky D, King BL, et al. The Comparative Toxicogenomics Database's 10th year anniversary: update 2015. Nucleic Acids Res. 2015;43(Database issue):D914–20. doi:10.1093/nar/gku935
- 32. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome research. 2003;13(11):2498–504. doi:10.1101/gr.1239303
- 33. Lin Z-Y, Wu C-C, Chuang Y-H, Chuang W-L. Anti-cancer mechanisms of clinically acceptable colchicine concentrations on hepatocellular carcinoma. Life sciences. 2013;93(8):323–8. doi:10.1016/j.lfs.2013.07.002
- 34. Singh P, Rathinasamy K, Mohan R, Panda D. Microtubule assembly dynamics: an attractive target for anticancer drugs. IUBMB life. 2008;60(6):368–75. doi:10.1002/iub.42
- 35. Pourgholami M, Woon L, Almajd R, Akhter J, Bowery P, Morris D. In vitro and in vivo suppression of growth of hepatocellular carcinoma cells by albendazole. Cancer letters. 2001;165(1):43–9. doi:10.1016/S0304-3835(01)00382-2
- 36. Noorani L, Stenzel M, Liang R, Pourgholami MH, Morris DL. Albumin nanoparticles increase the anticancer efficacy of albendazole in ovarian cancer xenograft model. J Nanobiotechnol. 2015;13(1):25. doi:10.1186/s12951-015-0082-8
- 37. Mukhopadhyay T, Sasaki J-I, Ramesh R, Roth JA. Mebendazole elicits a potent antitumor effect on human cancer cell lines both in vitro and in vivo. Clinical cancer research. 2002;8(9):2963–9.
- 38. Martarelli D, Pompei P, Baldi C, Mazzoni G. Mebendazole inhibits growth of human adrenocortical carcinoma cell lines implanted in nude mice. Cancer chemotherapy and pharmacology. 2008;61(5):809–17. doi:10.1007/s00280-007-0538-0
- 39. Nygren P, Fryknäs M, Ågerup B, Larsson R. Repositioning of the anthelmintic drug mebendazole for the treatment for colon cancer. Journal of cancer research and clinical oncology. 2013;139(12):2133–40. doi:10.1007/s00432-013-1539-5
- 40. Bai R-Y, Staedtke V, Aprhys CM, Gallia GL, Riggins GJ. Antiparasitic mebendazole shows survival benefit in 2 preclinical models of glioblastoma multiforme. Neuro-oncology. 2011:nor077. doi:10.1093/neuonc/nor077
- 41. Wieland A, Trageser D, Gogolok S, Reinartz R, Höfer H, Keller M, et al. Anticancer effects of niclosamide in human glioblastoma. Clinical Cancer Research. 2013;19(15):4124–36. doi:10.1158/1078-0432.CCR-12-2895
- 42. Kast RE, Karpel-Massler G, Halatsch M-E. CUSP9* treatment protocol for recurrent glioblastoma: aprepitant, artesunate, auranofin, captopril, celecoxib, disulfiram, itraconazole, ritonavir, sertraline augmenting continuous low dose temozolomide. Oncotarget. 2014;5(18):8052–82. doi:10.18632/oncotarget.2408
- 43. Fujiwara Y, Yamashita Y, Osoda T, Asogawa M, Fukushima C, Asao M, et al. Virtual screening system for finding structurally diverse hits by active learning. Journal of chemical information and modeling. 2008;48(4):930–40. doi:10.1021/ci700085q
- 44. Grave K, Ramon J, Raedt L. Active Learning for High Throughput Screening. In: Jean-Fran J-F, Berthold MR, Horváth T, editors. Discovery Science: 11th International Conference, DS 2008, Budapest, Hungary, October 13–16, 2008 Proceedings. Berlin, Heidelberg: Springer Berlin Heidelberg; 2008. p. 185–96.
- 45. Lang T, Flachsenberg F, von Luxburg U, Rarey M. Feasibility of Active Machine Learning for Multiclass Compound Classification. Journal of Chemical Information and Modeling. 2016;56(1):12–20. doi:10.1021/acs.jcim.5b00332
- 46. Warmuth MK, Liao J, Rätsch G, Mathieson M, Putta S, Lemmen C. Active Learning with Support Vector Machines in the Drug Discovery Process. Journal of Chemical Information and Computer Sciences. 2003;43(2):667–73. doi:10.1021/ci025620t
- 47. Creixell P, Palmeri A, Miller CJ, Lou HJ, Santini CC, Nielsen M, et al. Unmasking determinants of specificity in the human kinome. Cell. 2015;163(1):187–201. doi:10.1016/j.cell.2015.08.057
- 48. Lin H, Sassano MF, Roth BL, Shoichet BK. A pharmacological organization of G protein-coupled receptors. Nature methods. 2013;10(2):140–6. doi:10.1038/nmeth.2324
- 49. Xie L, Bourne PE. Detecting evolutionary relationships across existing fold space, using sequence order-independent profile–profile alignments. Proceedings of the National Academy of sciences. 2008;105(14):5441–6. doi:10.1073/pnas.0704422105
- 50. Xie L, Wang J, Bourne PE. In silico elucidation of the molecular mechanism defining the adverse effect of selective estrogen receptor modulators. PLoS Comput Biol. 2007;3(11):e217. doi:10.1371/journal.pcbi.0030217
- 51. Kinnings SL, Liu N, Buchmeier N, Tonge PJ, Xie L, Bourne PE. Drug discovery using chemical systems biology: repositioning the safe medicine Comtan to treat multi-drug and extensively drug resistant tuberculosis. PLoS Comput Biol. 2009;5(7):e1000423. doi:10.1371/journal.pcbi.1000423
- 52. Xie L, Li J, Xie L, Bourne PE. Drug discovery using chemical systems biology: identification of the protein-ligand binding network to explain the side effects of CETP inhibitors. PLoS Comput Biol. 2009;5(5):e1000387. doi:10.1371/journal.pcbi.1000387
- 53. Durrant JD, Amaro RE, Xie L, Urbaniak MD, Ferguson MA, Haapalainen A, et al. A multidimensional strategy to detect polypharmacological targets in the absence of structural and sequence homology. PLoS Comput Biol. 2010;6(1):e1000648. doi:10.1371/journal.pcbi.1000648
- 54. Xie L, Evangelidis T, Xie L, Bourne PE. Drug Discovery Using Chemical Systems Biology: Weak inhibition of multiple kinases may contribute to the anti-cancer effect of Nelfinavir. PLoS Comput Biol. 2011;7(4):e1002037. doi:10.1371/journal.pcbi.1002037
- 55. Ho Sui SJ, Lo R, Fernandes AR, Caulfield MD, Lerman JA, Xie L, et al. Raloxifene attenuates Pseudomonas aeruginosa pyocyanin production and virulence. Int J Antimicrob Agents. 2012;40(3):246–51. doi:10.1016/j.ijantimicag.2012.05.009