Author manuscript; available in PMC: 2019 Nov 1.
Published in final edited form as: IEEE/ACM Trans Comput Biol Bioinform. 2018 Mar 16;15(6):1960–1967. doi: 10.1109/TCBB.2018.2812189

ANTENNA, a Multi-Rank, Multi-Layered Recommender System for Inferring Reliable Drug-Gene-Disease Associations: Repurposing Diazoxide as a Targeted Anti-Cancer Therapy

Annie Wang 1, Hansaim Lim 2, Shu-Yuan Cheng 3, Lei Xie 4
PMCID: PMC6139288  NIHMSID: NIHMS958999  PMID: 29993812

Abstract

The existing drug discovery process follows a reductionist model of “one-drug-one-gene-one-disease,” which is inadequate to tackle complex diseases involving multiple malfunctioning genes. The availability of big omics data offers opportunities to transform the drug discovery process into a new paradigm of systems pharmacology that focuses on designing drugs to target molecular interaction networks instead of a single gene. Here, we develop a reliable multi-rank, multi-layered recommender system, ANTENNA, to mine large-scale chemical genomics and disease association data for the prediction of novel drug-gene-disease associations. ANTENNA integrates a novel tri-factorization based dual-regularized weighted and imputed One Class Collaborative Filtering (OCCF) algorithm, tREMAP, with a statistical framework based on Random Walk with Restart to assess the reliability of specific predictions. In the benchmark, tREMAP clearly outperforms the single-rank OCCF. We apply ANTENNA to a real-world problem: repurposing old drugs for new clinical indications without effective treatments. We discover that the FDA-approved drug diazoxide can inhibit multiple kinase genes responsible for many diseases including cancer, and can kill triple negative breast cancer (TNBC) cells efficiently (IC50 = 0.87 μM). TNBC is a deadly disease without effective targeted therapies. Our finding demonstrates the power of big data analytics in drug discovery and in developing a targeted therapy for TNBC.

Index Terms: Anti-cancer targeted therapy, big data analytics, data mining, diazoxide, drug discovery, drug repurposing, machine learning, multi-layered network, tri-factorization, triple negative breast cancer, prediction reliability

1 Introduction

The cost of bringing a drug to market has risen to approximately 2.6 billion dollars (Tufts Center for the Study of Drug Development, 2015), and the failure rate is daunting: only about one-third of drugs in phase III clinical trials reach the market. The limited success of the conventional drug discovery process is largely attributed to the wide adoption of a reductionist model of “one-drug-one-gene-one-disease” [1], [2], [3]. As a matter of fact, the onset and progress of many complex diseases such as cancer is a systematic process that involves multiple interacting genes. Thus, it is necessary to design drugs that target gene interaction networks instead of a single gene. Moreover, drug repurposing that reuses existing safe drugs to treat new diseases has emerged as a new paradigm to accelerate drug discovery and development. As the safety profile of existing medicines has already been well documented, the cost of clinical trials can be significantly reduced.

Recent advances in high-throughput technologies have generated abundant chemical genomics data on drug actions and disease genes. These big, complex, heterogeneous data sets provide unprecedented opportunities for identifying genome-wide drug-gene-disease associations, thereby facilitating multi-targeted drug design and drug repurposing. However, several challenges remain in mining chemical genomics and disease association data for drug discovery. First, chemical genomics data from high-throughput screening campaigns are not only extremely large but also highly noisy, biased, and incomplete. Many existing data mining algorithms cannot be directly applied to model chemical genomics data. Second, drug action is a complex process. It starts with drug-gene interactions at the molecular level and manifests as clinical outcomes through biological networks. A single genomics data set can capture only one part of the whole drug action process. Thus, it is necessary to integrate multiple data sets for chemical-gene interactions, gene-disease associations, and chemical-disease associations to model drug action across multiple layers. Finally, one of the fundamental problems in biomedical data mining has not been fully addressed: how to assess the individual reliability of a specific prediction from a data mining agent under a rigorous statistical framework. The reliable and unbiased assessment of the prediction quality for an individual instance is critical for the cost-sensitive drug discovery process. For example, the selection of a novel chemical that is structurally different from patented drugs as a lead compound from a ranked list of candidate chemicals is a billion-dollar decision. Information on the individual predictive reliability of a novel chemical entity based on its weak chemical similarity to existing drugs in terms of bioactivity is invaluable. Most existing data mining tools can only provide an average predictive accuracy based on the population of training data, but not the reliability for a specific new case. For example, in a ranking system, it is not straightforward to determine the threshold for selecting top-ranked hits. For a specific case, the top-ranked hit could be a false positive. In another scenario, the top-N (N>1) ranked hits could all be true positives.

2 Contributions of This Work

To address challenges in the predictive modeling of drug-gene-disease associations as well as unmet needs in the treatment of complex diseases such as cancer, this work makes contributions to both methodology development and translational medicine.

On the side of methodology development, our contribution is twofold. First, we have developed a novel algorithm, tREMAP, based on tri-factorization to solve matrix completion problems in which the row and column sides have significantly different ranks. tREMAP formulates chemical-gene prediction as a multi-rank dual-regularized weighted and imputed One Class Collaborative Filtering (OCCF) problem. Under the OCCF formulation, negative data, which are sparse or even unavailable, are not needed for training. By using element-specific weights and imputation, tREMAP can handle noisy chemical genomics data in which the label is often uncertain. Finally, unlike conventional OCCF algorithms that apply a single rank to all layers, tREMAP assigns a different rank to each layer. This is important because different layers can have dramatically different dimensions and thus different optimal ranks. For example, the dimension of a chemical layer is on the order of millions, while the dimension of a gene layer is only thousands. Our benchmark studies clearly show that tREMAP outperforms the single-rank OCCF method. Second, to accommodate the nature of chemical-gene-disease association data sets, in which observed chemical-disease associations are far sparser than known chemical-gene interactions and few three-way chemical-gene-disease associations exist, we have developed a multi-rank, multi-layered framework, ANTENNA, for inferring novel chemical-gene-disease associations. ANTENNA has three main components. (1) ANTENNA integrates multiple chemical genomics and disease association data sets and links them as a multi-layered network [4], as shown in Fig. 1. (2) ANTENNA uses tREMAP to infer genome-wide novel chemical-gene associations. (3) Based on the genome-wide chemical-gene associations, ANTENNA applies Random Walk with Restart (RWR) and a statistical framework, Enrichment of Topological Similarity (ENTS) [5], to predict chemical-disease associations and assess their reliability.

Fig. 1. Illustration of Multi-Layered Network Model (MULAN) that integrates multiple genomics data sets.

Arguably, the most important contribution of this work is the discovery of a potentially safe and effective targeted therapy for triple negative breast cancer (TNBC). Using ANTENNA, we predicted that the FDA-approved drug diazoxide may inhibit multiple kinase genes. The malfunction of kinases is associated with many diseases such as cancer and Alzheimer’s disease. Among the kinases with the highest percentage of inhibition by diazoxide, one gene, TTK, is specifically over-expressed in patients with TNBC [6], [7]. Thus, we hypothesized that diazoxide may kill TNBC cells. Our predictions were supported by multiple lines of experimental evidence. TNBC is a subgroup of breast cancers that is associated with the most aggressive clinical behavior. No targeted therapy is currently available for the treatment of TNBC. Our finding has great potential for developing a targeted therapy for the effective treatment of TNBC.

3 Relevant Works

In principle, tensor factorization is a powerful method to infer three-way relationships. However, observed three-way chemical-gene-disease relations are extremely sparse. The majority of observed chemical-gene pairs are not associated with any disease. Thus, tensor factorization may not be the best option for this work. OCCF has been applied to a bipartite graph for predicting drug-target interactions [8], but not to inferring multiple drug-gene-disease associations. Moreover, existing OCCF algorithms are mainly based on the formulation of matrix factorization, which only allows a common rank for both rows and columns. FASCINATE is an algorithm that can jointly infer missing links from a multi-layered network model [4]. However, FASCINATE is based on the formulation of a single-rank collective OCCF. Moreover, it can only rank predicted relations [4]. There is no reliability information associated with each individual prediction. This work addresses the drawbacks of matrix factorization, OCCF, and FASCINATE when applied to inferring chemical-gene-disease associations.

4 Experimental and Computational Details

4.1 Overview of Computational and Experimental Procedure

Our primary purpose is to mine chemical genomics and disease association data to identify novel targeted therapies for unmet biomedical problems such as the treatment of TNBC. As shown in Fig. 2, the input of ANTENNA is the existing chemical genomics, drug, and disease databases including DrugBank [9], ZINC [10], ChEMBL [11], and CTD [12]. We first integrate these data sets into a multi-layered chemical-gene-disease network, MULAN. Then we apply tREMAP, a multi-rank dual-regularized weighted imputed OCCF algorithm, to infer novel chemical-gene associations. Next, we use ENTS to predict drug-disease associations and to assess the reliability of each inferred association. The output of ANTENNA is a list of drug-disease associations ranked by their statistical significance. Finally, we experimentally validate the top-ranked predictions.

Fig. 2. Workflow of drug discovery process using ANTENNA, a multi-layered recommender system.

4.2 Construction of Multi-Layered Chemical-Gene-Disease Network (MULAN)

We integrated heterogeneous data sets from genomics into a multi-layered network model, MULAN. In MULAN, each node is a chemical entity (drugs and other chemicals), a biological entity (genes or the proteins they encode), or a phenotypic entity (diseases and side effects). Nodes in the same entity class are linked together by similarities (e.g., chemical-chemical similarity) or interactions (e.g., protein-protein interactions). Nodes that belong to different entity classes reside in different network layers and are linked by known associations (e.g., drug-target interactions, disease-gene associations). Integration of genomics data into a bipartite graph is of proven value [13]. MULAN can be considered as the unification of multiple bipartite graphs; thus, our new method is likely to be more robust than traditional approaches.

Chemical-gene associations, including drug-gene associations, were obtained from the ZINC [14], ChEMBL [15], and DrugBank [9] databases. To obtain reliable chemical-gene association pairs, binding assay records with IC50 information (the concentration of the chemical needed to inhibit 50% of the activity of the target protein) were extracted from the databases, and an IC50 cutoff of 10 μM was used where applicable. Chemical-gene pairs were considered associated if IC50≤10 μM (active pairs), unassociated if IC50>10 μM (inactive pairs), ambiguous if records exist in both ranges (ambiguous pairs), and unobserved otherwise (unknown pairs). A total of 198,712 unique chemicals and 3,549 unique genes were obtained from the combination of ChEMBL and ZINC, with 228,725 unique chemical-gene active pairs, 76,643 inactive pairs, and 4,068 ambiguous pairs. Of the 198,712 chemicals, 722 were found to be FDA-approved drugs. Furthermore, drug-gene relationships were extracted from DrugBank and integrated into the ZINC_ChEMBL dataset above. A total of 199,338 unique chemicals and 6,277 unique genes were obtained from the combination of ZINC, ChEMBL, and DrugBank, with 233,378 unique chemical-gene active pairs. Drug-disease and gene-disease associations were directly obtained from the Comparative Toxicogenomics Database (CTD) [12].
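The labeling rule above can be sketched in a few lines. This is only an illustration under assumptions: the record layout `(chemical, gene, ic50_uM)` and the function name `label_pairs` are hypothetical, and the toy records are not taken from ZINC, ChEMBL, or DrugBank.

```python
from collections import defaultdict

IC50_CUTOFF_UM = 10.0  # the 10 uM cutoff stated in the text


def label_pairs(records):
    """Label chemical-gene pairs from (chemical, gene, ic50_uM) assay records."""
    by_pair = defaultdict(list)
    for chemical, gene, ic50_um in records:
        by_pair[(chemical, gene)].append(ic50_um)

    labels = {}
    for pair, values in by_pair.items():
        has_active = any(v <= IC50_CUTOFF_UM for v in values)
        has_inactive = any(v > IC50_CUTOFF_UM for v in values)
        if has_active and has_inactive:
            labels[pair] = "ambiguous"  # records fall in both ranges
        elif has_active:
            labels[pair] = "active"
        else:
            labels[pair] = "inactive"
    return labels  # pairs absent from the result are "unknown"


# Hypothetical toy records for illustration only.
records = [
    ("chemA", "gene1", 0.5),
    ("chemA", "gene2", 50.0),
    ("chemB", "gene1", 2.0),
    ("chemB", "gene1", 40.0),
]
labels = label_pairs(records)
```

Pairs that never appear in the assay records simply have no entry, which corresponds to the unknown class that OCCF treats as unobserved rather than negative.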

Chemical-chemical similarity scores are one of the required inputs of tREMAP. Although a number of metrics have been developed for chemical-chemical similarity, a recent study showed that Jaccard index-based similarity is highly efficient for fingerprint-based similarity measurement [16]. The fingerprint of choice in this study is the Extended Connectivity Fingerprint (ECFP), which has been successfully applied in the chemical structure-based target prediction method PRW [17]. The Jaccard index is used to calculate a similarity score between two chemicals, c1 and c2.
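As a minimal sketch, the Jaccard index between two fingerprints is the number of shared "on" bits divided by the number of bits set in either fingerprint. The representation below (a set of bit indices) and the example bit values are assumptions for illustration; generating real ECFP fingerprints would require a cheminformatics toolkit, which is not shown.

```python
def jaccard(bits_a, bits_b):
    """Jaccard index between two fingerprints given as sets of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


# Hypothetical fingerprints: 3 shared bits out of 6 distinct bits.
fp_c1 = {3, 17, 42, 128, 900}
fp_c2 = {3, 42, 128, 512}
score = jaccard(fp_c1, fp_c2)
```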

Gene-gene similarity scores are also one of the required inputs for tREMAP. The similarity between two proteins encoded by genes was calculated based on their amino acid sequence similarity using NCBI BLAST [18] with an e-value threshold of 1 × 10⁻⁵ and otherwise default options. A similarity score for query protein $p_1$ to target protein $p_2$ was calculated as the ratio of the bit score for the pair, $d_{bit}(p_1, p_2)$, to the bit score of a self-query. Specifically, for the query protein $p_1$ and the target protein $p_2$, the protein-protein similarity score was defined as $T(p_1, p_2) = d_{bit}(p_1, p_2) / d_{bit}(p_1, p_1)$.
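The normalization can be illustrated as follows. The dictionary layout keyed by `(query, target)` and the bit-score values are hypothetical; in practice they would be parsed from BLAST output. Treating a missing hit as a score of zero is also an assumption of this sketch.

```python
def bit_score_similarity(bit_scores, p1, p2):
    """T(p1, p2) = d_bit(p1, p2) / d_bit(p1, p1); 0.0 when no hit was reported."""
    self_score = bit_scores[(p1, p1)]
    return bit_scores.get((p1, p2), 0.0) / self_score


# Hypothetical bit scores keyed by (query, target).
bit_scores = {
    ("P1", "P1"): 500.0,
    ("P2", "P2"): 800.0,
    ("P1", "P2"): 200.0,
    ("P2", "P1"): 200.0,
}
t12 = bit_score_similarity(bit_scores, "P1", "P2")  # 200 / 500
t21 = bit_score_similarity(bit_scores, "P2", "P1")  # 200 / 800
```

Because the denominator is the query's self-score, the score is not symmetric in general, so some symmetrization step would presumably be needed before it is used as a symmetric similarity matrix; the text does not say how this was done.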

Disease-disease similarity is required for tREMAP to infer chemical-disease associations and can be calculated using distributed word representations [19]. In this work, we do not infer chemical-disease associations directly using tREMAP, since fewer than 0.4% of chemicals have observed associations with one or more diseases. Instead, we use ENTS and the target binding profile of a chemical, which is derived from tREMAP, to infer the chemical-disease associations.

4.3 tREMAP Algorithm

Our prediction method tREMAP is based on a tri-factorization one-class collaborative filtering algorithm. In the case of chemical-gene association, it assumes that similar chemicals will interact with similar genes, and that unobserved associations are not necessarily negative. Assuming that a fairly low number of factors (i.e., fewer features than the total number of chemicals or genes) can capture the characteristics determining the drug-gene associations, two low-rank matrices, F (drug side) and G (gene side), are approximated such that the residual $R - F S G'$, accumulated over all $n \times m$ entries, is minimized, where R is the matrix of known drug-gene interactions and G′ is the transpose of the gene-side low-rank matrix G. The two low-rank matrices, $F_{n \times r_1}$ with rank $r_1$ and $G_{m \times r_2}$ with rank $r_2$, and their connectivity matrix $S_{r_1 \times r_2}$ are obtained by iteratively minimizing the objective function:

$$\min_{F,S,G \ge 0} \sum_{(u,i)} W(u,i)\,\bigl(R(u,i) + P(u,i) - (F S G')(u,i)\bigr)^2 + \lambda_r\bigl(\|F\|^2 + \|S\|^2 + \|G\|^2\bigr) + \lambda_F\,\mathrm{tr}\bigl(F'(D_M - M)F\bigr) + \lambda_G\,\mathrm{tr}\bigl(G'(D_N - N)G\bigr) \tag{1}$$

Here, W(u,i) is the penalty weight on the observed and unobserved associations, which indicates the reliability of the assigned probability of true association; P(u,i) is the imputed value (i.e., the probability that an unobserved association is a real association); and M and N are the symmetric chemical-chemical and gene-gene similarity matrices, respectively. $D_M$ and $D_N$ are the degree matrices of M and N, respectively. $\lambda_r$ is the regularization parameter to prevent overfitting, $\lambda_F$ is the importance parameter for chemical-chemical similarity, $\lambda_G$ is the importance parameter for gene-gene similarity, and tr(A) is the trace of matrix A. The weight and imputation values can be determined from a priori knowledge or from the predictions of other machine learning algorithms. The first term in (1) forces the approximation FSG′ to be close to the observation matrix R. The second term is a regularization term that prevents overfitting. The third and fourth terms force the low-rank feature vectors to be close to each other according to their chemical-chemical or protein-protein similarity scores. Thus, the optimal low-rank matrix F is obtained after minimizing the sum of Euclidean distances between rows, weighted by the chemical-chemical similarity score. The derivation of the formula can be found in [20].
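To make the role of the similarity terms concrete, the sketch below checks numerically that the trace form tr(F′(D_M − M)F) equals half the similarity-weighted sum of squared Euclidean distances between rows of F. This is a standard graph-Laplacian identity for a symmetric M; the matrices here are small random stand-ins, not data from this study.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r1 = 5, 3
F = rng.random((n, r1))        # toy low-rank chemical profile

M = rng.random((n, n))
M = (M + M.T) / 2.0            # symmetric toy similarity matrix
np.fill_diagonal(M, 0.0)
D_M = np.diag(M.sum(axis=1))   # degree matrix of M

# Trace form used in the objective.
trace_form = np.trace(F.T @ (D_M - M) @ F)

# Equivalent pairwise form: 0.5 * sum_ij M_ij * ||F_i - F_j||^2.
pairwise_form = 0.5 * sum(
    M[i, j] * np.sum((F[i] - F[j]) ** 2)
    for i in range(n)
    for j in range(n)
)
```

Minimizing this term therefore pulls the low-rank profiles of highly similar chemicals toward each other, in proportion to their similarity score.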

Similar to the bi-factorization problem in [20], the optimization problem defined in (1) is non-convex. Thus, we seek to find a local optimum by the block coordinate descent method. In (1), DM, M, DN, and N are non-negative matrices. The derivative of (1) with regard to F, G, and S with the non-negativity constraint has a fixed-point solution. To scale up tREMAP in terms of both time and storage, we propose efficient multiplicative updating rules as follows:

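$$F(u,r) \leftarrow F(u,r)\,\frac{\bigl[(1 - wp)\,R\,G S' + wp\,\mathbf{1}_{n \times m}\,G S' + \lambda_F M F\bigr](u,r)}{\bigl[(1 - w)\,R_1 G S' + w\,F\,(S G' G S') + \lambda_r F + \lambda_F D_M F\bigr](u,r)} \tag{2}$$

$$G(i,s) \leftarrow G(i,s)\,\frac{\bigl[(1 - wp)\,R' F S + wp\,\mathbf{1}_{n \times m}'\,(F S) + \lambda_G N G\bigr](i,s)}{\bigl[(1 - w)\,R_1' (F S) + w\,G\,(S' F' F S) + \lambda_r G + \lambda_G D_N G\bigr](i,s)} \tag{3}$$

$$S(r,s) \leftarrow S(r,s)\,\frac{\bigl[(1 - wp)\,F' R\,G + wp\,\bigl(F'\,\mathbf{1}_{n \times m}\,G\bigr)\bigr](r,s)}{\bigl[(1 - w)\,F' R_1 G + w\,F'(F S G')\,G\bigr](r,s) + \lambda_r S(r,s)} \tag{4}$$

where w and p are the weight and the imputed value, respectively, and $\mathbf{1}_{n \times m}$ is the all-ones matrix with the same shape as R. The values of w and p are either set based on a priori knowledge (e.g., the false positive rate of high-throughput screening experiments) or tuned as hyper-parameters. $R_1$ is the sparse matrix whose elements are the values predicted by F, S, and G on the observed cases Θ in R, i.e.,

$$R_1(u,i) = \begin{cases} (F S G')(u,i) & \text{if } (u,i) \in \Theta \\ 0 & \text{otherwise} \end{cases} \tag{5}$$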

We use a block-coordinate descent algorithm to iteratively update F, G, and S.
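The update rules (2)-(4) can be sketched with dense NumPy arrays as below. This is not the authors' implementation: it assumes a binary R whose nonzero entries form the observed set Θ, uniform scalars w and p in [0, 1], squared Frobenius norms in the regularization term, and the plain ratio form of the updates as written in this section. The function names, default hyper-parameters, and toy data are hypothetical.

```python
import numpy as np


def objective(R, F, S, G, M, N, w, p, lam_r, lam_F, lam_G):
    """Eq. (1) with weight 1 on observed entries and weight w, imputed value p elsewhere."""
    X = F @ S @ G.T
    obs = R > 0
    fit = np.sum((R[obs] - X[obs]) ** 2) + w * np.sum((p - X[~obs]) ** 2)
    ridge = lam_r * (np.sum(F ** 2) + np.sum(S ** 2) + np.sum(G ** 2))
    L_M = np.diag(M.sum(axis=1)) - M  # D_M - M
    L_N = np.diag(N.sum(axis=1)) - N  # D_N - N
    graph = lam_F * np.trace(F.T @ L_M @ F) + lam_G * np.trace(G.T @ L_N @ G)
    return fit + ridge + graph


def tri_factorize(R, M, N, r1, r2, w=0.1, p=0.1, lam_r=0.1, lam_F=0.1,
                  lam_G=0.1, n_iter=100, seed=0, eps=1e-12):
    """Block-coordinate multiplicative updates for F (n x r1), S (r1 x r2), G (m x r2)."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    F, S, G = rng.random((n, r1)), rng.random((r1, r2)), rng.random((m, r2))
    ones = np.ones((n, m))
    theta = (R > 0).astype(float)  # mask of observed entries
    D_M, D_N = np.diag(M.sum(axis=1)), np.diag(N.sum(axis=1))
    args = (M, N, w, p, lam_r, lam_F, lam_G)
    history = [objective(R, F, S, G, *args)]
    for _ in range(n_iter):
        # Eq. (2): update F with S and G fixed.
        GS = G @ S.T
        R1 = theta * (F @ GS.T)
        num = (1 - w * p) * (R @ GS) + w * p * (ones @ GS) + lam_F * (M @ F)
        den = (1 - w) * (R1 @ GS) + w * (F @ (GS.T @ GS)) + lam_r * F + lam_F * (D_M @ F)
        F = F * num / (den + eps)
        # Eq. (3): update G with F and S fixed.
        FS = F @ S
        R1 = theta * (FS @ G.T)
        num = (1 - w * p) * (R.T @ FS) + w * p * (ones.T @ FS) + lam_G * (N @ G)
        den = (1 - w) * (R1.T @ FS) + w * (G @ (FS.T @ FS)) + lam_r * G + lam_G * (D_N @ G)
        G = G * num / (den + eps)
        # Eq. (4): update S with F and G fixed.
        R1 = theta * (F @ S @ G.T)
        num = (1 - w * p) * (F.T @ R @ G) + w * p * (F.T @ ones @ G)
        den = (1 - w) * (F.T @ R1 @ G) + w * (F.T @ F @ S @ G.T @ G) + lam_r * S
        S = S * num / (den + eps)
        history.append(objective(R, F, S, G, *args))
    return F, S, G, history


# Toy problem: 8 chemicals, 5 genes, random symmetric similarities.
rng = np.random.default_rng(1)
R = (rng.random((8, 5)) < 0.3).astype(float)
M = rng.random((8, 8)); M = (M + M.T) / 2
N = rng.random((5, 5)); N = (N + N.T) / 2
F, S, G, history = tri_factorize(R, M, N, r1=3, r2=2)
```

A scalable version would keep R and R1 sparse and would never materialize the all-ones matrix (the product of an all-ones matrix with G reduces to column sums of G); the dense form is used here only for readability.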

The raw predicted score for the ith chemical to bind the jth protein can be calculated as $P(i,j) = F(i,:)\,S\,G(j,:)'$. The matrix $F_{n \times r_1}$ is referred to as a low-rank drug profile, since its ith row represents the ith drug’s behavior in the drug-gene association network as well as in the drug-drug similarity space, compressed to $r_1$ features.
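As a small illustration of the scoring step, the sketch below computes the raw scores for one chemical against all genes and returns the top-ranked genes. The factor matrices are tiny hand-made examples and the helper name `top_k_genes` is hypothetical.

```python
import numpy as np


def top_k_genes(F, S, G, chem_idx, k):
    """Rank genes for one chemical by the raw score F(i,:) S G(j,:)'."""
    scores = F[chem_idx] @ S @ G.T          # one score per gene
    order = np.argsort(-scores)[:k]         # highest score first
    return [(int(j), float(scores[j])) for j in order]


# Hand-made factors: 2 chemicals, 3 genes, ranks r1 = r2 = 2.
F = np.array([[1.0, 0.0], [0.0, 1.0]])
S = np.array([[1.0, 2.0], [3.0, 4.0]])
G = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

all_scores = F @ S @ G.T                    # full chemical-by-gene score matrix
top2 = top_k_genes(F, S, G, chem_idx=0, k=2)
```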

4.4 ENTS Algorithm

The rationale of ENTS is that when clusters of instances share common features, a cluster whose members are ranked closely together is more likely to be similar to the new instance than a cluster whose members are ranked randomly or spread out across the ranking. In addition, network topological similarity provides a more robust and accurate global ranking across an entire hypothesis space than pairwise similarity does. Unlike conventional local ranking (e.g., k-nearest neighbors), global instance ranking can support statistical enrichment analysis because it draws valuable information on the ranking for all instances in a cluster from lower, non-randomly ranked cases.

4.4.1 Classification or clustering of database instances

To initialize ENTS, part or all of the instances in the database (training set) are classified based on a target feature T. In ANTENNA, T is the disease associated with a drug. If database instances are not pre-classified, clusters of training data are assembled using T features under unsupervised clustering techniques [21] such as k-means [22], mean-shift [23], affinity propagation [24], or the p-median model [25]. After the classification or clustering, each instance cluster is assigned a unique label (i.e., a specific disease in ANTENNA). These instance clusters are passed to the next step. Note that the instance clusters are not necessarily disjoint; they can overlap.

4.4.2 A weighted graph represents training instance similarity by T-features

After the initialization, ENTS builds a database instance graph: a weighted graph with one node for the T-feature of each training instance and an edge between two nodes only if their pairwise similarity exceeds a certain threshold. The threshold depends on the features and the pairwise similarity metric. Any similarity metric (e.g., Euclidean distance, Jaccard index, Hidden Markov Model, kernel-based similarity) can be applied here. In ANTENNA, we use the cosine similarity of the low-rank profiles of drugs to measure the similarity between drugs.

4.4.3 Network topological similarity

Given a query with known K-feature and the goal to predict its unknown T-feature, ENTS first links the query to all nodes in the training instance graph, where new edges are not found in the training instance graph. The weights of these new edges are only based on K-feature similarity. Then Random Walk with Restart (RWR) is applied to perform a probabilistic traversal of the instance graph across all paths leading away from the query, where the probability of choosing an edge will be proportional to its weight. The algorithm will output a list of all instances in the graph, ranked by the probability that a path from the query will reach the node. In this way, RWR can capture global relationships that may be missed by pair-wise similarity [26].

We modified the RankProp algorithm [27], a variant of RWR. The graph is represented as an adjacency list to save memory and speed up the iterative algorithm. The current implementation is scalable to a graph with millions of nodes and hundreds of millions of edges.
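For orientation, a generic Random Walk with Restart can be sketched as below. This is a textbook power-iteration version on a dense adjacency matrix, not the modified RankProp implementation or the adjacency-list representation described above; the restart probability, tolerance, and toy path graph are assumptions.

```python
import numpy as np


def random_walk_with_restart(A, query, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walker that restarts at `query`."""
    A = np.asarray(A, dtype=float)
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0      # leave isolated nodes as zero columns
    T = A / col_sums                   # column-stochastic: edge choice proportional to weight
    e = np.zeros(A.shape[0])
    e[query] = 1.0                     # restart distribution concentrated on the query
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * (T @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p


# Toy path graph 0 - 1 - 2 - 3 with unit edge weights; node 0 is the query.
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
scores = random_walk_with_restart(A, query=0)
ranking = np.argsort(-scores)          # nodes ordered by proximity to the query
```

On this toy graph the scores decrease with distance from the query, which is the global ranking that the enrichment analysis in the next subsection takes as input.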

4.4.4 Statistical significance of network topological similarity

A network topological search only ranks instances based on their similarity but gives no information on the reliability of the ranking. To assess the statistical significance of the ranking of an instance cluster Ci generated previously, ENTS compares the score distribution of the cluster Ci with that of a randomly drawn cluster of the same size. When the mean of global topological similarity scores in a cluster is used as the statistic, an efficient random-set method is used for the parametric approximation of the null distribution [28]. The random-set method compares an enriched cluster of size m with all other distinct clusters of size m drawn randomly from a case graph on N nodes. The exact distribution of this statistic is intractable, but it can be approximated by a normal distribution with the following mean and variance:

$$\mu = \frac{1}{N}\sum_{j=1}^{N} p_j, \qquad \sigma^2 = \frac{1}{m}\left(\frac{N-m}{N-1}\right)\left[\left(\frac{1}{N}\sum_{j=1}^{N} p_j^2\right) - \left(\frac{1}{N}\sum_{j=1}^{N} p_j\right)^2\right]$$

where $p_j$ is the global topological similarity score of instance j in the graph to the query. The enrichment score of the cluster Ci, i.e., the mean score $\bar{X}$ of its members, is then normalized as $Z = (\bar{X} - \mu)/\sigma$.

A p-value and a Benjamini-Hochberg adjusted false discovery rate (FDR) are then calculated for each Z-score.
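The enrichment statistic and the multiple-testing correction can be sketched as follows. The Z-score follows the mean and variance formulas above; the one-sided upper-tail p-value is an assumption of this sketch (the text does not state the sidedness), and the function names and toy scores are hypothetical.

```python
import math

import numpy as np


def random_set_z(scores, cluster_idx):
    """Z-score of a cluster's mean score under the random-set normal approximation."""
    scores = np.asarray(scores, dtype=float)
    N, m = scores.size, len(cluster_idx)
    mu = scores.mean()
    var = (1.0 / m) * ((N - m) / (N - 1)) * (np.mean(scores ** 2) - mu ** 2)
    return (scores[cluster_idx].mean() - mu) / math.sqrt(var)


def upper_tail_p(z):
    """One-sided p-value P(Z >= z) for a standard normal variable (assumed sidedness)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    ranked = p[order] * n / np.arange(1, n + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce monotonicity
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(ranked, 1.0)
    return adjusted


# Toy global similarity scores for 10 instances and two candidate clusters.
toy_scores = [0.9, 0.8, 0.7, 0.1, 0.1, 0.1, 0.1, 0.1, 0.05, 0.05]
z_top = random_set_z(toy_scores, [0, 1, 2])      # cluster of top-ranked instances
z_bottom = random_set_z(toy_scores, [7, 8, 9])   # cluster of low-ranked instances
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```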

4.5 Combining tREMAP and ENTS to Predict Drug-Disease Association

In ANTENNA, we first use tREMAP to generate the chemical-side low-rank matrix F and the gene-side low-rank matrix G. The ith row of F contains the gene association profile for the ith drug. Then, we calculate drug-drug cosine similarities based on the matrix F and construct a drug-drug similarity graph. For the rows of F corresponding to FDA-approved drugs, the cosine similarity of drug c1 and drug c2 is calculated as $S_{\cos}(c_1, c_2) = \frac{U_{c_1} \cdot U_{c_2}}{\|U_{c_1}\|\,\|U_{c_2}\|}$, where $U_c$ denotes the row of F for drug c. To search for possibly undiscovered uses of the drugs, we focus on drug pairs that have high cosine similarity but low chemical structural similarity (< 0.5). Finally, we cluster drugs based on their directly or indirectly associated diseases annotated in the CTD database [12], and use ENTS to assess and rank the statistical significance of novel drug-disease associations. The final output of ANTENNA is the list of predicted drug-disease associations ranked by FDR.
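The pair selection described above can be sketched as below. The structural similarity cutoff of 0.5 comes from the text; the profile cutoff of 0.9 is a hypothetical stand-in for "high cosine similarity", and the matrices are toy values rather than tREMAP output.

```python
import numpy as np


def cosine_similarity_matrix(F):
    """Pairwise cosine similarity between the rows of the low-rank profile F."""
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # avoid division by zero for empty profiles
    U = F / norms
    return U @ U.T


def repurposing_candidates(F, structural_sim, profile_cutoff=0.9, structure_cutoff=0.5):
    """Drug pairs with similar low-rank profiles but dissimilar chemical structures."""
    cos = cosine_similarity_matrix(F)
    n = F.shape[0]
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if cos[i, j] >= profile_cutoff and structural_sim[i, j] < structure_cutoff
    ]


# Toy data: drugs 0 and 1 share a profile but differ in structure.
F_toy = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]])
structural_sim = np.array([
    [1.0, 0.2, 0.1],
    [0.2, 1.0, 0.8],
    [0.1, 0.8, 1.0],
])
pairs = repurposing_candidates(F_toy, structural_sim)
```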

4.6 Experimental Validation

4.6.1 Kinase binding assay

A kinase is an enzyme that catalyzes the transfer of a phosphate group to another biomolecule. It functions as a molecular switch in many biological processes. The malfunction of kinases is responsible for many diseases such as cancer. There are more than 400 kinases in the human genome, collectively termed the kinome. To rigorously validate the performance of ANTENNA, we employed a competition binding assay to detect the binding of selected drugs to a set of 438 kinases (human kinome). The proprietary KinomeScan assay was performed by DiscoverX (CA). The assay tested the capacity of a drug to disrupt the binding of each DNA-tagged kinase to a support that was in turn bound to the kinase’s known ligand. If binding between the kinase and its known ligand was disrupted in the presence of the drug, this indicated that the drug either competed directly with the known ligand or allosterically altered the kinase’s ability to bind to that ligand. Consistent with the %Control formula below, in which the vehicle-only signal corresponds to 100% of control, DMSO was used as the negative control and a picomolar kinase inhibitor was used as the positive control. Binding levels were quantitated by performing quantitative real-time polymerase chain reaction (qPCR) on the DNA tag of the ligand-bound kinases. qPCR is a molecular biology technique that amplifies a single copy or a few copies of a DNA segment by several orders of magnitude and measures the reaction in real time. The tests were performed at a 100 μM concentration of the tested drug, and results were reported as %Control, calculated as follows, where a lower %Control score indicates a stronger interaction.

$$\%\text{Control} = \frac{\text{test compound signal} - \text{positive control signal}}{\text{negative control signal} - \text{positive control signal}} \times 100$$
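The %Control arithmetic can be written as a one-line function. The signal values in the example are hypothetical qPCR readouts chosen only to show the scale: a test signal equal to the negative control gives 100, and a test signal equal to the positive control gives 0.

```python
def percent_control(test_signal, positive_control_signal, negative_control_signal):
    """%Control = (test - positive) / (negative - positive) * 100."""
    span = negative_control_signal - positive_control_signal
    if span == 0:
        raise ValueError("negative and positive control signals must differ")
    return (test_signal - positive_control_signal) / span * 100.0


# Hypothetical signals: positive control 100, negative control 200.
strong_binder = percent_control(107.0, 100.0, 200.0)   # close to the positive control
no_binding = percent_control(200.0, 100.0, 200.0)      # equal to the negative control
```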

4.6.2 Cancer cell viability assay

MCF-7 cells from ATCC® and MDA-MB-468 cells (a gift of Dr. R Sullivan from Queens Community College, the City University of New York) were used for this study. MCF-7 is a breast cancer cell line. MDA-MB-468 is a triple negative breast cancer cell line, which does not express estrogen receptor (ER), progesterone receptor (PR), or human epidermal growth factor receptor 2 (Her2/neu). Cells were cultured in Dulbecco’s Modified Eagle Medium (DMEM) (Thermo Fisher Scientific) supplemented with 10% fetal bovine serum (Thermo Fisher Scientific) and 50 μg/ml gentamicin (Thermo Fisher Scientific) in a 37 °C, 5% CO2 incubator.

Cell viability was determined by the neutral red assay, which is based on the lysosomal uptake of neutral red dye [29]. Briefly, cells (2 × 10⁴ cells per well) were plated in 96-well plates in a total volume of 200 μl on the day before chemical treatments. Chemicals were dissolved in dimethyl sulfoxide (DMSO) to obtain a 0.1 M stock solution 15 minutes before chemical treatments. Then, various concentrations (0.1–150 μM) of chemicals were prepared in fresh media. The final concentration of DMSO in each well was equal to or less than 0.15%, which is considered non-toxic to cells [30].

After 24 hours of chemical treatment, 20 μl of 0.33% Neutral Red Solution (Sigma Aldrich) was added to each well. After 2 hours of incubation in a 37 °C, 5% CO2 incubator, the dye solution was carefully removed and cells were rinsed twice with 200 μl of Neutral Red Assay Fixative (0.1% CaCl2 in 0.5% formaldehyde) (Sigma Aldrich). The absorbed dye was then solubilized in 200 μl of Neutral Red Assay Solubilization Solution (1% acetic acid in 50% ethanol) (Sigma Aldrich) for 10 minutes at room temperature on a shaker. Absorbance at 540 nm and 690 nm (background) was measured with a BioTek Synergy Mx microplate reader.

Each concentration in each experiment was done in at least triplicate. Multiple experiments were done to obtain IC50 values for each drug and each cell line. The viability was determined based on a comparison with untreated cells which were set as 100% cell viability. The IC50 values which represent the chemical concentration needed to inhibit 50% cell proliferation were calculated from the dose-response curve.
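The text does not specify how the dose-response curve was fitted, so the following is only one plausible sketch: viability is computed from background-corrected absorbance relative to untreated cells, and the IC50 is estimated by linear interpolation on the log-concentration axis between the two doses that bracket 50% viability. The function names and all numbers are hypothetical.

```python
import math


def viability_percent(a540, a690, untreated_a540, untreated_a690):
    """Background-corrected absorbance relative to untreated cells (set to 100%)."""
    return (a540 - a690) / (untreated_a540 - untreated_a690) * 100.0


def ic50_by_interpolation(concentrations_um, viabilities):
    """Estimate IC50 by log-linear interpolation between doses bracketing 50% viability."""
    points = sorted(zip(concentrations_um, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None  # 50% viability is not bracketed by the tested doses


# Hypothetical dose-response data (concentrations in uM, viability in %).
doses = [0.1, 1.0, 10.0, 100.0]
viab = [90.0, 60.0, 40.0, 10.0]
ic50 = ic50_by_interpolation(doses, viab)
half = viability_percent(0.60, 0.10, 1.10, 0.10)
```

A parametric fit (for example, a four-parameter logistic curve) is a common alternative when enough dose points are available; interpolation is used here only because it needs no fitting library.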

5 Results and Discussions

5.1 Performance evaluation of tREMAP

In our published study [8], single-rank REMAP outperformed state-of-the-art methods: a chemical similarity-based method (PRW [17]), the best-performing matrix factorization methods so far (NRLMF [31] and KBMF with twin kernels (KBMF2K) [32]), a combination of WNN and GIP (WNNGIP [33]), and another type of collaborative filtering algorithm (Collaborative Matrix Factorization (CMF) [34]). Here we compare the performance of tREMAP with that of REMAP using two benchmarks. The first benchmark includes 3,494 chemicals, 25 G-protein coupled receptors (GPCRs), and 4,494 observed chemical-GPCR associations. The second benchmark includes 33,684 chemicals, 31 Cytochrome P450 enzymes (CYP450), and 51,699 observed chemical-CYP450 associations.

As shown in Fig. 3, tREMAP clearly outperforms REMAP on both benchmarks. tREMAP ranks around 96% and 87% of true associations in the top 3 for GPCR and CYP450, respectively, while REMAP ranks only around 78% and 60% of true hits in the top 3, respectively.

Fig. 3. Performance comparison of tREMAP with REMAP for GPCR (top) and CYP450 (bottom), respectively. Performance is measured by the recall at the top rank K.
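Recall at the top rank K can be computed as the fraction of held-out true associations that appear within the first K positions of each query's ranked list. The sketch below assumes one ranked gene list per chemical and a set of held-out true genes per chemical; the exact evaluation protocol (for example, how associations were held out) is not described in this section, so the data layout and names are hypothetical.

```python
def recall_at_k(ranked_lists, true_items, k):
    """Fraction of held-out true items found in the top k of each query's ranking."""
    hits = total = 0
    for query, ranking in ranked_lists.items():
        truths = true_items.get(query, set())
        total += len(truths)
        hits += len(truths & set(ranking[:k]))
    return hits / total if total else 0.0


# Hypothetical rankings for two chemicals, each with one held-out true gene.
ranked = {
    "c1": ["g3", "g1", "g2", "g4"],
    "c2": ["g2", "g4", "g1", "g3"],
}
truth = {"c1": {"g1"}, "c2": {"g3"}}
r3 = recall_at_k(ranked, truth, k=3)
r4 = recall_at_k(ranked, truth, k=4)
```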

When evaluated by the application to sequence-structure similarity search, ENTS is superior to Hidden Markov Model and RWR [5].

5.2 Time complexity of tREMAP

Empirically, the running time of tREMAP depends linearly on the number of chemicals and genes, as shown in Fig. 4. On a machine with 2 cores of a 2.18 GHz CPU, it takes around 1,000 seconds for a matrix with 15,000 chemicals, 200 genes, a chemical-side rank of 1,000, and a gene-side rank of 200 to converge.

Fig. 4. Running time of tREMAP vs the number of items. The computational time was measured using 2 cores of 2.18 GHz CPU, for a matrix with 200 genes and varied number of chemicals. The ranks for chemical and gene are fixed as 1,000 and 200, respectively.

5.3 ANTENNA Predictions

By combining tREMAP with ENTS, ANTENNA predicted 21,921 novel drug-disease associations with a Benjamini-Hochberg adjusted false discovery rate (FDR) of less than 0.02. We selected a drug-disease pair for further experimental evaluation based on the following criteria. First, the drug was predicted to bind kinases, as a genome-wide binding assay for kinases is accessible. Second, the associated disease does not have an effective therapy, so that the repurposed drug will have the biggest clinical impact. Third, a cell-based disease model is available, so that we can evaluate the efficacy of the drug.

Based on the above criteria, diazoxide, a safe FDA-approved drug for hypertension, was selected. Diazoxide was predicted to interact with protein kinases. Furthermore, ANTENNA predicted that diazoxide was associated with Triple Negative Breast Cancer (TNBC) with a Benjamini-Hochberg adjusted false discovery rate (FDR) of 0.0108. Thus, diazoxide may be repurposed for the treatment of TNBC, which is the most aggressive type of breast cancer and cannot be treated by any existing targeted therapy. Note that the FDR of the predicted diazoxide-TNBC association is not particularly statistically significant. If this prediction is experimentally validated, we will have more confidence in predictions with lower FDRs.

5.4 Kinase Binding Assay

We validated the binding of diazoxide to kinases using the KinomeScan assay. Fig. 5 displays the binding profile of diazoxide across 438 kinases (kinome). Diazoxide has the highest percentage inhibition of kinases DYRK1A, IRAK1, and TTK, with %Control values of 7.0, 8.9, and 15.0, respectively. Note that the lower the %Control, the higher the inhibition of kinase activity.

Fig. 5. Binding profile of FDA-approved drug diazoxide (100 μM) on 438 kinases determined by KinomeScan assay.

As shown in Table 1, the malfunction of DYRK1A, IRAK1, and TTK is associated with multiple diseases, especially cancers and Alzheimer’s disease. To verify our predictions, we tested the effect of diazoxide on breast cancer cells.

Table 1.

Gene-disease associations of the three kinases most strongly inhibited by diazoxide

Kinase | KinomeScan %Control | Gene-disease association
DYRK1A | 7.0 | Multiple cancer drug-resistance, Alzheimer’s disease
IRAK1 | 8.9 | Breast cancer metastasis, herpesvirus lymphoma, Alzheimer’s disease
TTK | 15.0 | TNBC, hepatocellular carcinoma
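As a small illustration of how the %Control readout is interpreted, the sketch below ranks the three kinases from Table 1 so that the lowest %Control (strongest inhibition, as stated in the text) comes first. The function names and the cutoff of 10.0 are hypothetical and are not taken from the KinomeScan protocol or from the paper.

```python
# %Control values for diazoxide (100 uM) as reported in Table 1.
percent_control = {"TTK": 15.0, "DYRK1A": 7.0, "IRAK1": 8.9}


def rank_by_percent_control(values):
    """Return (kinase, %Control) pairs, lowest %Control first."""
    return sorted(values.items(), key=lambda item: item[1])


def hits_below(values, cutoff):
    """Names of kinases whose %Control falls below a chosen cutoff."""
    return [name for name, pc in rank_by_percent_control(values) if pc < cutoff]


for name, pc in rank_by_percent_control(percent_control):
    print(f"{name}: {pc:.1f}% of control")

# The cutoff of 10.0 is a hypothetical value used for illustration only.
print(hits_below(percent_control, cutoff=10.0))
```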

5.5 Cancer cell viability assay

The cytotoxicity of diazoxide was determined by the neutral red cell viability assay. The IC50 values obtained from estrogen receptor (ER)-positive MCF-7 breast cancer cells and MDA-MB-468 TNBC cells after 24 hours of treatment are shown in Table 2. Diazoxide was much more effective in inhibiting the proliferation of MDA-MB-468 TNBC cells than of MCF-7 cells, with IC50 values of 0.87 ± 0.39 μM and 130.0 ± 70.0 μM, respectively. The IC50 is the concentration of diazoxide that inhibits cell proliferation by 50%. The smaller the IC50 value, the stronger the anti-cancer activity of diazoxide. A chemical compound is generally accepted as active when its IC50 is less than 10 μM. Thus, diazoxide could be a highly effective targeted therapy for the treatment of TNBC at a low concentration.

Table 2.

IC50 values of diazoxide on cancer cells

Cell line | IC50 (mean ± SEM)
MCF-7 (ER positive) | 130.0 ± 70.0 μM
MDA-MB-468 (TNBC) | 0.87 ± 0.39 μM
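The comparison of the IC50 values in Table 2 can be made concrete with a short calculation. The sketch below applies the 10 μM activity threshold mentioned in the text and computes the fold difference between the two cell lines; the function names are hypothetical, and the calculation uses the mean values only, ignoring the reported SEM.

```python
# Mean IC50 values (in micromolar) from Table 2.
ic50_um = {"MCF-7": 130.0, "MDA-MB-468": 0.87}

# Activity threshold stated in the text: active when IC50 < 10 uM.
ACTIVE_CUTOFF_UM = 10.0


def is_active(ic50, cutoff=ACTIVE_CUTOFF_UM):
    """True when the IC50 is below the activity cutoff."""
    return ic50 < cutoff


def fold_difference(ic50_reference, ic50_target):
    """How many times lower the target IC50 is than the reference IC50."""
    return ic50_reference / ic50_target


for cell_line, value in ic50_um.items():
    print(f"{cell_line}: IC50 = {value} uM, active = {is_active(value)}")

fold = fold_difference(ic50_um["MCF-7"], ic50_um["MDA-MB-468"])
print(f"Fold difference (MCF-7 / MDA-MB-468): {fold:.0f}")
```

On the mean values, diazoxide is roughly 150 times more potent against MDA-MB-468 cells than against MCF-7 cells, although the sizeable SEM on both measurements means this ratio should be read as approximate.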

6 CONCLUSIONS

In summary, we have developed ANTENNA, a reliable and accurate multi-rank, multi-layered recommender system. Using ANTENNA, we predicted that the safe, FDA-approved drug diazoxide could bind to kinases whose malfunction is associated with TNBC. The KinomeScan assay confirmed the kinase binding of diazoxide. A cancer cell viability assay further validated that diazoxide is highly effective in inhibiting the proliferation of TNBC cells. These findings suggest that diazoxide can be repurposed as an effective targeted therapy for the treatment of TNBC. Furthermore, diazoxide may be effective in the treatment of other diseases such as hepatocellular carcinoma and Alzheimer’s disease. We are carrying out experiments to verify these predictions. This study demonstrates that big data analytics provides new opportunities for accelerating drug discovery and development, and for realizing the full potential of precision medicine.

Acknowledgments

This work was partly supported by Grant Number R01LM011986 from the National Library of Medicine (NLM) of the National Institutes of Health (NIH), Grant Number R01GM122845 from the National Institute of General Medical Sciences (NIGMS) of the NIH, Grant Number R21TR001722 from the National Center for Advancing Translational Sciences of the NIH, and Grant Number MD007599 from the National Institute on Minority Health and Health Disparities (NIMHD) of the NIH.

Biographies

Annie Wang was born in Manhattan, New York, in 2000. She is currently a senior at the Bronx High School of Science in New York City. Her current research interests are drug development, computational biology, and machine learning. She is interested in pursuing a career in biomedical engineering.

Hansaim Lim was born in Seoul, South Korea. He received the B.A. degree in chemistry from Hunter College of the City University of New York in 2014. He began his Ph.D. studies in biochemistry at the Graduate Center of the City University of New York in 2015 and joined Dr. Lei Xie’s lab at Hunter College for his thesis project. His research interests cover machine learning-based drug activity prediction.

Shu-Yuan Cheng received the M.S. degree and the Ph.D. degree in Toxicology from St. John’s University, Jamaica, New York, in 1996 and 2003, respectively. She has been with John Jay College of Criminal Justice, the City University of New York, New York, since 2008 and is currently an Associate Professor in Toxicology. She is a member of the Society of Toxicology and the Society for Neuroscience. She is also a member of the editorial board of the Journal of Cell Science and Apoptosis. She has received grants from NSF (RUI), NIH (SCORE), and DOJ to support her research. She has published 20 scientific papers in the fields of toxicology, neuroscience, forensic toxicology, biochemistry, and cancer research. Her research interests include the pathogenesis of neurodegeneration, the epidemiology of abused drugs in wastewater in NYC, the pharmacological mechanism of mitomycin C and its analog, and the cytotoxicity of potential anti-cancer drugs.

Lei Xie was born in Jilin, P. R. China. He received the B.S. degree in polymer physics from the University of Science and Technology of China, P. R. China, in 1990, and the M.Sc. degree in Computer Science and the Ph.D. degree in Chemistry from Rutgers University, U.S.A., in 2000. He was an Associate Scientist at Columbia University and Howard Hughes Medical Institute, U.S.A. He worked at the pharmaceutical and biotechnology companies Roche and Eidogen, U.S.A., for several years. He was a Principal Scientist at the San Diego Supercomputer Center from 2006 to 2011. He is currently an Associate Professor at the Department of Computer Science, Hunter College, and The Graduate Center, The City University of New York, U.S.A. His research interests cover data mining, machine learning, biophysics, systems biology, and drug discovery, with over 50 technical publications.

Contributor Information

Annie Wang, Bronx High School of Science, 75 W 205th St, Bronx, NY 10468.

Hansaim Lim, Ph.D. Program in Biochemistry, the City University of New York, 365 5th Avenue, New York, NY 10016.

Shu-Yuan Cheng, Department of Sciences, John Jay College, the City University of New York, 365 5th Avenue, New York, NY 10016.

Lei Xie, Department of Computer Science, Hunter College, and the Graduate Center, the City University of New York, 695 Park Ave, New York, NY 10065.

