Abstract
Background
Drug repurposing offers a promising strategy for drug discovery. It involves identifying new therapeutic indications for existing, marketed drugs, thereby reducing the risks, costs, and time typically required for drug development. Various methods exist for drug repurposing, including high-throughput screening of drug compound libraries, computational (in silico) approaches, and literature-based methods. Numerous methods currently mine the literature for drug repositioning; however, relatively few leverage literature citation networks for this purpose.
Results
We identified 19,553 potential drug pairs for repurposing by analyzing biomedical literature data through the Jaccard coefficient. Our results demonstrated that the literature-based Jaccard coefficient was the most effective of the similarity metrics we compared for identifying drug repurposing opportunities. To refine our selection process, we applied a threshold defined by an upper quantile of the Jaccard coefficient, enabling us to prioritize promising de novo drug repurposing candidates. Among the identified drug pairs, we found several with strong potential for repurposing, including adapalene and bexarotene, guanabenz and tizanidine, and alvimopan and methylnaltrexone.
Conclusion
We created a validation set consisting of both true positives and true negatives for drug pairs using the repoDB dataset, a widely recognized resource for drug repurposing. To evaluate the performance of various similarity metrics for drug pairs, we compared their effectiveness based on AUC, F1 score, and AUCPR using the validation set.
Keywords: Drug repurposing, Literature citation network, Network analysis, Validation the results, Candidate repurposing drug pairs
Background
Developing a new drug is a time-consuming, high-cost, and high-risk endeavor. Recent estimates suggest that the average cost of developing a novel drug ranges from 314 million to 2.8 billion US dollars [1]. It takes approximately 12 to 15 years from the initial concept to the completion of drug development [2]. Despite the significant time and financial investment, nearly
of candidate drugs that enter the first phase of clinical trials fail to receive approval [3]. Compared to traditional drug development, drug repurposing (or repositioning) can significantly reduce the risks and costs associated with drug discovery [4, 5]. Drug repurposing involves identifying new therapeutic indications for already marketed drugs [6]. Various methods are available for drug repurposing, including high-throughput screening of drug compound libraries and in silico approaches [7, 8]. In silico drug repurposing screening involves several approaches, including the computation of chemical similarity between ligands and drug targets, the development of machine learning models, the application of deep learning algorithms, literature-based methods and network-based approaches [9–13].
Ligand-based approaches rely on the principle that structurally similar compounds often exhibit similar biological properties. These methods are widely used to analyze and predict the activity of ligands for novel targets [14]. Machine learning and deep learning methods leverage publicly available databases (such as phenotypic profiling data, electronic health records, etc.) and sources of information like compound structures to facilitate drug repurposing [15]. Network-based drug repurposing can be performed by quantifying the proximity between disease genes and drug targets within the human interactome [16]. Literature-based methods typically enable drug repurposing by mining large-scale repositories of scientific literature to identify and curate repurposed drugs [17]. For example, textual semantics can identify and curate new drug repurposing candidates by analyzing relationships between diseases, genes, and drugs [18].
The methods outlined above offer various technical approaches for drug repurposing, but they also come with certain limitations. Machine learning techniques, for instance, require careful selection of features and targets to effectively identify potential drug repurposing candidates. This process involves decisions on which publicly available databases to use and how to optimally extract relevant drug features. Deep learning methods encounter the ‘black box’ challenge, making it difficult to explain the rationale behind the repurposed drug results. Human protein–protein interaction-based drug repurposing methods, in turn, face the issue of an incomplete human interactome, which limits their effectiveness. Textual semantics also presents several challenges. First, there is no unified standard for describing the same diseases and symptoms across different publications during the text mining stage. Second, various studies employ different syntactic structures when analyzing the relationships between biological entities during the semantic analysis stage. Third, obtaining drug repurposing results at the data analysis stage requires complex computations. If any of these stages encounters issues, the resulting drug repurposing outcomes may be inaccurate.
To address the limitations of the methods mentioned above, we propose a novel approach for drug repurposing that leverages the vast amount of literature data accumulated over more than a century. According to OpenAlex, there are approximately 200 million scientific articles available. OpenAlex is a fully open scientific knowledge graph that includes metadata for a wide range of works, such as journal articles and books, along with disambiguated author information, institutions, etc.
We aim to build connections between drugs and literature through genes associated with the literature, as suggested by previous studies [19]. In other words, we established a connection between drugs and literature through the links between drug-target coding genes and the literature. As a result, this study primarily focuses on drugs with known targets.
In this study, drug repurposing was achieved through pairwise combinations of all drugs with known targets. Utilizing drug combinations can enhance the success rate of drug repurposing screenings [20]. It is widely believed that ’similar drugs’ exhibit similar therapeutic effects. Targets, proteome, and transcriptome networks can be used to establish similarity between drug pairs [21, 22]. Once drug-drug similarity is established, it becomes possible to further explore their shared indications for treating various diseases [23]. Li and Lu also proposed a novel method for computational drug repurposing based on drug pairwise similarity [24].
For pairwise combinations of drugs, we constructed a citation network based on literature related to the drugs. The literature-based similarities between drug pairs were then calculated using this citation network. This approach allowed us to assess the overall impact of different types of data on drug-drug similarity [25]. The various types of data include chemical compounds of drugs, biological information of drugs, etc. The inspiration for this study came from the idea that for literature related to two drugs, the higher the overlap between the literature, the more similar the two drugs are likely to be. Since the relationship between drugs and literature is established through the links between drug targets and the literature, the literature-based drug-drug similarity is actually calculated based on literature-based target-target similarity. In other words, the more identical the literature is between different targets, the closer the relationship between those targets. As a result, we expect a high degree of similarity between these targets. We also considered using the references of articles related to drugs for drug repurposing, based on the assumption that the citation of literature by authors follows a normative pattern. In reality, literature citations are not arbitrary; they follow a certain logic and structure.
Meanwhile, we created a validation set containing true positives and true negatives for drug pairs, sourced from the repoDB database, a standard dataset for drug repurposing. We then compared the literature-based similarities with human interactome-based separation using the validation set, evaluating performance in terms of AUC, F1 score, and AUCPR. The results showed that literature-based Jaccard similarity outperformed the other similarity measures in terms of AUC and F1 score. Finally, we ranked the Jaccard similarities of drug pairs from highest to lowest. De novo drug repurposing candidates were identified using a threshold defined as an upper quantile of the Jaccard similarities. We also selected ten drug pairs with detailed information and drew several novel conclusions.
Results
Data structure
We collected 1978 FDA-approved or clinically investigational drugs, each with at least two targets, from a previous study [26]. The average number of targets per drug is 6, with a median of 3 and a maximum of 256.
A histogram of the number of targets for these drugs is shown in Fig. 1a, which illustrates that most drugs have fewer than 20 targets. In total, the 1978 drugs are associated with 2254 unique targets. The average number of articles related to these targets is 249, the median is 108, and the maximum is 6563. A histogram of the number of articles for these targets is presented in Fig. 1b. The average number of articles per drug is 2658, with a median of 1397 and a maximum of 70,878. The histogram of the number of articles for these drugs is shown in Fig. 1c.
Fig. 1.
a A histogram showing the number of targets for different drugs. b A histogram showing the number of articles associated with different targets. c A histogram showing the number of articles associated with different drugs
Literature-based measure of drug–drug relationships
To make the literature-based approach for drug repurposing effective, we need to establish a measure of literature-based similarity between two drugs, such as the Jaccard coefficient or logarithmic ratio similarity. To better understand the relationship between literature-based similarity and biological and pharmacological properties, we examined the correlation between literature-based similarity and various biological and pharmacological similarities (e.g., GO similarities, chemical similarity, clinical similarity, co-expression similarity, and sequence similarity) through graphical analysis. If the literature-based similarity between a drug pair increases in tandem with the biological and pharmacological similarities, this suggests a positive correlation between literature-based similarity and GO, chemical, clinical, co-expression, and sequence similarities. This implies that these drugs may have potential for repurposing.
Figure 2a–g present boxplots of biological and pharmacological similarities across multiple intervals of the literature-based Jaccard coefficient, which helps confirm the effectiveness of literature-based similarity for drug repurposing.
Fig. 2.
The interplay between the Jaccard coefficient of drug pairs and GO similarity, presented from left to right: a biological processes; b molecular function; c cellular component. The interplay between the Jaccard coefficient of drug pairs and factors such as d chemical similarity; e clinical similarity; f drug target-encoding gene co-expression patterns across human tissues; g drug target protein sequence similarity. h The interplay between the Jaccard coefficient of drug pairs and human interactome-based separation
We found that the Jaccard coefficient of drug-drug pairs is positively correlated with GO, chemical, clinical, co-expression, and sequence similarities, as shown in Fig. 2a–g. When there is overlap in the literature related to two drugs, the extent of this overlap reflects their pharmacological relationship. Specifically, a larger Jaccard coefficient for a drug pair corresponds to higher GO, chemical, clinical, co-expression, and sequence similarities (Fig. 2a–g). Additionally, we compared the Jaccard coefficient to the logarithmic ratio similarity and found that the Jaccard coefficient outperforms the logarithmic ratio similarity in drug repurposing (Supplementary Fig. 1a–g).
Would this conclusion still hold if the articles related to drugs were assigned completely at random? To test this, we downloaded all articles with PMIDs from OpenAlex and, for each drug, randomly selected as many articles as the drug actually has. We needed to control for the publication years of the randomly selected articles, as the literature related to drugs follows a chronological pattern. For each drug, the randomly selected articles were required to be published no earlier than the year of the earliest related publication and no later than the year of the most recent publication. We then calculated the literature-based Jaccard coefficient using Eq. (1). We expected that the small number of intersections between the randomized literature of paired drugs would result in a low Jaccard coefficient. Additionally, we performed a nonparametric Mann–Whitney U test for independent samples to determine whether there was a significant difference between the Jaccard coefficients of articles related to paired drugs and those randomly assigned to the paired drugs.
The small red rectangles on the bar plot in Fig. 3 are error bars, which represent the standard error of the mean (SEM). The p value, obtained from the Mann–Whitney U test, indicates the significance of the difference between the two groups (Fig. 3). It shows that the Jaccard coefficients differ significantly between articles related to paired drugs and those randomly assigned to the paired drugs.
Fig. 3.

A barplot showing the Jaccard coefficient of articles related to paired drugs versus articles randomly assigned to paired drugs
It is clear that the number of articles shared by paired drugs is not random; instead, the overlap of articles related to paired drugs reflects the degree of similarity between them. The relationship between the Jaccard coefficient for articles randomly assigned to paired drugs and the biological and pharmacological similarities is shown in Supplementary Fig. 2a–g. Figure 2h is used to confirm the relationship between the literature-based Jaccard coefficient and human interactome-based separation. The relationship between the literature-based logarithmic ratio similarity and human interactome-based separation is shown in Supplementary Fig. 1h. When the footprints of two drug-target modules are topologically separated in the interactome, the drugs are pharmacologically distinct. This implies that the human interactome-based separation should decrease as the literature-based Jaccard coefficient between a drug-drug pair increases. Figure 2h shows this trend, suggesting that the literature-based Jaccard coefficient is as effective as the human interactome-based separation in drug repurposing. The results of the literature-based Jaccard coefficient for each drug pair are summarized in the file Jaccard_Coefficient_Result.xls in the supplementary materials. Additionally, we plotted the relationship between human interactome-based separation and biological and pharmacological similarity in Supplementary Fig. 3.
The reliability of drug repurposing
The standard database for drug repurposing, named repoDB, was established in 2017 [27]. The repoDB includes 1571 drugs and 2051 UMLS disease concepts, with a total of 6677 approved and 4123 failed drug-indication pairs. Among the 1978 drugs we focused on, 723 overlap with repoDB. The number of unique pairwise combinations of these 723 drugs is 10,125, consisting of 3328 true negative drug pairs and 6797 true positive drug pairs. The definitions of true positive and true negative drug pairs are provided in Fig. 7. Briefly, a drug pair is considered a true positive if the two drugs share common indications that have been approved by the FDA. If two drugs share common indications, but any of those indications are not FDA-approved, then the drug pair is considered a true negative. We generated the validation set using both true positive and true negative drug pairs, resulting in an obviously imbalanced dataset. The validation set is available for researchers and can be found in the supplementary materials as Drug_Validation_Set.xls.
Fig. 7.
The evidence for classifying paired drugs as true positive or true negative (a true positive drug pairs: drug pairs share the same FDA-approved indications; b true negative drug pairs: drug pairs have different FDA-approved indications; c true negative drug pairs: drug pairs have the same indication but are not FDA-approved; d Invalid drug pairs: drug pairs do not meet any of the criteria for true positives or true negatives, possibly due to lacking sufficient evidence or data)
We proposed two literature-based similarity measures for drug repurposing: the Jaccard coefficient and the logarithmic ratio similarity. Additionally, several published studies have utilized protein–protein interactions to facilitate drug repurposing. To assess the performance of the three measures (literature-based Jaccard coefficient, logarithmic ratio similarity, and human interactome-based separation), we plotted the ROC and precision–recall curves and calculated the corresponding AUC and AUCPR, as shown in Fig. 4.
Fig. 4.
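The ROC curves and precision–recall curves for the Jaccard coefficient, separation, and the logarithmic ratio similarity (a ROC curve and AUC for the Jaccard coefficient; b ROC curve and AUC for the logarithmic ratio similarity; c ROC curve and AUC for separation; d precision–recall curve and AUCPR for the Jaccard coefficient; e precision–recall curve and AUCPR for the logarithmic ratio similarity; f precision–recall curve and AUCPR for separation)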
Figure 4a–c show the ROC curves for the three measures, along with their corresponding AUC values. The literature-based Jaccard coefficient outperforms both the literature-based logarithmic ratio similarity and the human interactome-based separation in terms of AUC. Figure 4d–f show the precision–recall curves for the three measures, along with their corresponding AUCPR values. The human interactome-based separation outperforms the other two measures in terms of AUCPR. Table 1 lists the F1 score, precision, and recall for all three measures, with the Jaccard coefficient showing the highest F1 score and recall. Since the F1 score is commonly used for imbalanced datasets, it is the most appropriate metric among the three. Therefore, we found that the literature-based Jaccard coefficient outperforms the other two measures in drug repurposing based on F1 score and recall.
Table 1.
F1 scores, precision, and recall for three measures of drug repurposing
| Metric | F1 score | Precision | Recall |
|---|---|---|---|
| Jaccard coefficient | 0.593 | 0.779 | 0.558 |
| Logarithmic ratio similarity | 0.559 | 0.751 | 0.523 |
| Separation | 0.565 | 0.795 | 0.529 |
Selection of de novo candidates for drug repurposing
We selected de novo candidates for drug repurposing using a threshold defined as an upper quantile of the literature-based Jaccard coefficient. The quantile level was set between 0.01 and 0.10, in steps of 0.01. The number of selected candidates for drug repurposing at each level is recorded in Table 2.
Table 2.
The number of selected drug repurposing candidates under different thresholds of the Jaccard coefficient
| Quantile level | Number of drug pairs |
|---|---|
| 0.1 | 195,530 |
| 0.09 | 175,973 |
| 0.08 | 156,425 |
| 0.07 | 136,874 |
| 0.06 | 117,320 |
| 0.05 | 97,769 |
| 0.04 | 78,211 |
| 0.03 | 58,659 |
| 0.02 | 39,107 |
| 0.01 | 19,553 |
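The quantile-based selection can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `select_top_quantile`, the toy drug identifiers, and the rule of keeping all ties at the cutoff are assumptions made for illustration. The last line only checks that 1978 drugs yield 1,955,253 unordered pairs, so a level of 0.01 corresponds to roughly 19,553 pairs.

```python
import math

def select_top_quantile(scores, level):
    """Return the pairs whose score lies at or above the upper-`level` quantile.

    `scores` maps a drug pair to its similarity; `level` is a fraction such as 0.01.
    Assumption: ties at the cutoff are all kept, so slightly more than
    level * n pairs may be returned.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    k = max(1, math.ceil(level * len(ranked)))
    cutoff = ranked[k - 1][1]
    return [pair for pair, score in ranked if score >= cutoff]

# Hypothetical similarities for four drug pairs (illustration only)
toy = {("d1", "d2"): 0.9, ("d1", "d3"): 0.5, ("d2", "d3"): 0.5, ("d1", "d4"): 0.1}
print(select_top_quantile(toy, 0.25))       # [('d1', 'd2')]
print(len(select_top_quantile(toy, 0.5)))   # 3, because of the tie at 0.5
print(math.comb(1978, 2))                   # 1955253 unordered drug pairs
```

How ties are handled is a design choice; a different convention would change the counts slightly at each level.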
Ten drug pairs were selected based on their Jaccard coefficient values, and the indications for each drug were identified using repoDB, as detailed in Table 1 of the supplementary material. We also saved the drug repurposing results for
in the supplementary materials under the file name Drug_Validation_Set.xls. Out of the 19,553 drug pairs, 2024 had
equal to 0. From these, we selected 20 drugs and identified 50 drug pairs with corresponding drug information, which are recorded in Table 2 of the supplementary materials. The supplementary materials also contain two files: Drug_Targets.xls, which provides the target data for each drug, and Drug_Pairs_Common_Targets.xls, which lists paired drugs that share common targets.
Conclusions
In this article, we proposed a literature-based similarity approach for drug repurposing. The relationship between drugs and literature was established through the linkage between drug-target coding genes and relevant publications. We introduced two measures, the Jaccard coefficient and the logarithmic ratio similarity, to facilitate drug repurposing using literature data. The literature-based similarities were computed from a literature citation network, which was constructed using the references of articles related to drugs. To validate the effectiveness of the literature-based similarity, we computed various biological and pharmacological similarities, including GO similarity, chemical similarity, clinical similarity, co-expression similarity, and sequence similarity, and examined their correlation with the literature-based similarity. To further assess its performance, we compared the literature-based similarity to the widely used human interactome-based separation. Additionally, we utilized the repoDB dataset to create a validation set for verifying the reliability of drug repurposing. The F1 score showed that the Jaccard coefficient outperformed both the logarithmic ratio similarity and separation. We also calculated the Jaccard coefficient for all drug pairs and ranked them from largest to smallest. An upper quantile of the Jaccard coefficient was then defined as the threshold for selecting de novo candidates for drug repurposing.
To demonstrate the effectiveness of the Jaccard coefficient similarity in drug repurposing, we selected ten drug pairs with high Jaccard coefficient values. Among these, seven drug pairs share common indications, while three do not. Although the three drug pairs lack identical indications, this result still provides valuable insights for drug repurposing or potential drug combinations. In particular, it is worth exploring situations where two drugs with completely different indications can still be repurposed for each other. Relying solely on relationships such as the ‘drug A-indication-drug B’ paradigm used in some text mining methods may cause us to overlook valuable drug repurposing opportunities. Because our method does not rely on such shared-indication patterns, it can surface candidates that these approaches would miss.
In brief, to achieve better drug repurposing results, it is essential to develop novel literature-based similarities and compare them with
. Future studies could focus on integrating semantics and literature citation networks to further enhance drug repurposing efforts. Our proposed literature-based similarity approach is also applicable to the study of drug combinations. The main challenge of the proposed approach lies in determining the appropriate threshold for literature-based similarity to select de novo candidates. This is complicated by the lack of a theoretical basis for using a quantile of literature-based similarity as a threshold and by the arbitrary choice of the quantile level.
Methods
Literature-based similarities and human interactome-based separation
Drug discovery has its roots in modern times, dating back to the 19th century [28]. Over time, a vast amount of literature on drug research has been accumulated.
Figure 5a illustrates the number of published articles on drugs from 1954 to 2002, clearly showing an exponential growth in the volume of drug-related publications over the years. Our goal is to extract valuable information from drug-related literature to support the process of drug repurposing.
Fig. 5.

a The number of published articles on drugs across different years. b A flowchart for calculating the similarity between paired drugs
The relationship among drugs, targets, and literature was obtained from NCBI NIH via the link between genes and publications, with data collected in November 2023 [29]. Each biomedical publication is assigned a unique PMID, from which we can retrieve information such as references, publication year, author details, and more from OpenAlex. The literature citation network was constructed using references from articles related to drugs. The process involved: first, identifying articles related to drugs A and B through the drug-target-literature triplet; second, extracting the references for each article related to the drugs; and finally, building the literature citation network, as shown in Fig. 6. Given n drugs, the total number of drug pairs is $\binom{n}{2} = n(n-1)/2$. Literature-based similarities between drug pairs can be measured using methods such as the Jaccard coefficient, the logarithmic ratio, and others. Figure 5b presents a flowchart for calculating the literature-based similarity of paired drugs. The Jaccard coefficient is defined as follows:
$$ J(A, B) = \frac{|A \cap B|}{|A \cup B|} \tag{1} $$
Fig. 6.

An example of a citation network for a drug pair
The Jaccard coefficient J(A, B) measures the similarity between drug A and drug B, where A represents the set of papers related to drug A, and B represents the set of papers related to drug B. In Fig. 6, the nodes directly connected to drugs A and B are labeled as A and B, respectively. The arrowed lines represent the references cited in the literature associated with each drug. The number of blue curves indicates the count of articles related to drug A that are cited within the references of articles related to drug B. Similarly, the orange curve conveys the same meaning but in reverse, representing the number of articles related to drug B in the references of articles related to drug A. The value of J(A, B) ranges from 0 to 1, with values closer to 1 indicating greater similarity between the two drugs. To create a more compact distribution of biological and pharmacological similarities for paired drugs on the Jaccard coefficient, we take the logarithm of the Jaccard coefficient. In addition to the Jaccard coefficient, if we consider only the sizes of the two sets, we can construct two new indicators with a structure similar to that of the Jaccard coefficient as follows,
| 2 |
| 3 |
The notation used in these indicators is consistent with that in the Jaccard coefficient. The performance of these indicators in drug repositioning is illustrated and explained in Figs. 4 and 5 of the supplementary materials.
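As a minimal sketch of the Jaccard coefficient between the article sets of two drugs, the following Python function works on sets of PMIDs. The function name, the example PMIDs, and the convention of returning 0 for two empty sets are assumptions for illustration, not part of the original study.

```python
def jaccard(papers_a, papers_b):
    """Jaccard coefficient between two collections of article identifiers."""
    a, b = set(papers_a), set(papers_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty article sets are treated as dissimilar
    return len(a & b) / len(union)

# Hypothetical PMIDs, for illustration only
drug_a = {101, 102, 103, 104}
drug_b = {103, 104, 105}
print(jaccard(drug_a, drug_b))  # 2 shared articles / 5 distinct articles = 0.4
```

Using sets makes duplicated identifiers harmless and keeps the computation linear in the number of articles per drug.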
Another literature-based similarity measure is the logarithmic ratio, defined as follows:
| 4 |
Here,
represents the number of articles related to drug B in the reference list of articles associated with drug A, and
denotes the total number of articles related to drug A. The larger the value, the greater the similarity between the paired drugs. Recently, drug repurposing via protein–protein interactions has garnered significant attention [30–32]. Therefore, we compared literature-based drug similarity with the currently popular human interactome-based drug similarity. The experimentally validated protein–protein interactions (PPIs) are constructed using distinct proteins from various data sources, as outlined in previous studies [26]. The network proximity of drug-target modules A and B, as represented in the human interactome, is measured by their separation
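$$ s_{AB} = \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \tag{5} $$
which compares the mean shortest distance within the interactome between the targets of each drug, denoted as $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$, to the mean shortest distance, $\langle d_{AB} \rangle$, between A-B target pairs. Note that $\langle \cdot \rangle$ denotes the average of the shortest distances.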
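The separation measure can be sketched in plain Python on a small unweighted network. This is an illustration under stated assumptions, not the implementation used in the study: the network is a hypothetical adjacency dictionary, each mean is taken over all target pairs (published implementations may instead use nearest-target distances), and disconnected pairs are skipped.

```python
from collections import deque
from itertools import combinations, product
from statistics import mean

def shortest_path_lengths(graph, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def mean_distance(graph, pairs):
    """Mean shortest distance over node pairs; disconnected pairs are skipped."""
    lengths = []
    for u, v in pairs:
        d = shortest_path_lengths(graph, u).get(v)
        if d is not None:
            lengths.append(d)
    return mean(lengths) if lengths else float("nan")

def separation(graph, targets_a, targets_b):
    """Mean A-B distance minus the average of the within-A and within-B means."""
    d_aa = mean_distance(graph, combinations(sorted(targets_a), 2))
    d_bb = mean_distance(graph, combinations(sorted(targets_b), 2))
    d_ab = mean_distance(graph, product(targets_a, targets_b))
    return d_ab - (d_aa + d_bb) / 2

# Hypothetical four-protein chain: p1 - p2 - p3 - p4
ppi = {"p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p2", "p4"], "p4": ["p3"]}
print(separation(ppi, {"p1", "p2"}, {"p3", "p4"}))  # 1.0, modules are apart
print(separation(ppi, {"p1", "p2"}, {"p1", "p2"}))  # -0.5, modules overlap
```

A drug with a single target has no within-drug pairs, so this sketch returns NaN in that case; the drugs considered in the text each have at least two targets.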
Verifying the reliability of drug repurposing
Brown and Patel established the foundational database for drug repurposing in 2017 [33]. This database includes information on 1571 currently approved drugs, as annotated in DrugBank. It includes the drug names, DrugBank IDs, indications, indication IDs, and trial status for each drug. If the paired drugs share the same indications, and all of these indications are FDA-approved, the paired drugs are considered true positives. Otherwise, if the paired drugs share the same indications but one or more of these indications have not been FDA-approved, the paired drugs are classified as true negatives.
Figure 7 illustrates the detailed evidence for classifying paired drugs as true positives or true negatives. We ultimately obtained a validation set consisting of 10,125 drug pairs from the repoDB dataset, which included 6797 true positive drug pairs and 3328 true negative drug pairs. The validation set was therefore imbalanced, with the number of true positive drug pairs exceeding that of true negative pairs. We computed literature-based similarity metrics, such as the Jaccard coefficient and logarithmic ratio similarity, for these 10,125 drug pairs. The corresponding similarities were saved in Drug_Validation_Set.xls, available in the supplementary materials. The true positive rates (TPRs) and false positive rates (FPRs) were calculated at different similarity thresholds. We used the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the precision–recall curve (PRC), the area under the precision–recall curve (AUCPR), and the F1 score to assess the reliability of drug repurposing. These metrics evaluate the performance of the classification model across all classification thresholds.
The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR and FPR are defined as follows:
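$$ \mathrm{TPR} = \frac{\mathrm{TPs}}{\mathrm{TPs} + \mathrm{FNs}} \tag{6} $$
$$ \mathrm{FPR} = \frac{\mathrm{FPs}}{\mathrm{FPs} + \mathrm{TNs}} \tag{7} $$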
where TPs, FNs, FPs, and TNs represent true positives, false negatives, false positives, and true negatives, respectively. The Area Under the Curve (AUC) value ranges from 0 to 1, with a higher AUC indicating better performance of the classification method. The precision–recall curve illustrates the tradeoff between precision and recall at different threshold settings. A larger area under the precision–recall curve signifies both high recall and high precision. Precision and recall are defined as follows:
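$$ \mathrm{Precision} = \frac{\mathrm{TPs}}{\mathrm{TPs} + \mathrm{FPs}} \tag{8} $$
Note that recall and the true positive rate (TPR) are identical, so we omit a separate formula for recall here. The F1 score is the harmonic mean of precision and recall and can be calculated as follows:
$$ F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9} $$
Its value ranges from 0 to 1, with 1 indicating perfect precision and recall. The F1 score is also a widely used metric for evaluating models on imbalanced datasets.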
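The threshold-based metrics described in this section can be sketched as follows. The function names, the toy scores and labels, and the convention of returning 0 when a denominator is empty are assumptions for illustration; the study's own evaluation code is not shown in the text.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when pairs scoring >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label:
            tp += 1
        elif predicted:
            fp += 1
        elif label:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def classification_metrics(tp, fp, tn, fn):
    """TPR (recall), FPR, precision, and F1 from confusion counts."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "recall": tpr, "f1": f1}

# Hypothetical similarity scores and labels (1 = true positive pair)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 1, 0]
counts = confusion_counts(scores, labels, threshold=0.35)
print(counts)                           # (2, 1, 1, 1)
print(classification_metrics(*counts))
```

Sweeping the threshold over all observed scores and collecting the (FPR, TPR) or (recall, precision) points gives the ROC and precision–recall curves, respectively.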
Five types of drug similarities
To validate the literature-based measure of drug-drug relationships, we calculated five types of drug profiles: Gene Ontology (GO) similarity, chemical similarity, co-expression similarity, sequence similarity, and clinical similarity, based on a previous study [26].
The Gene Ontology (GO) annotations for drug target-coding genes were downloaded from https://www.geneontology.org/. We used annotations supported by experimental or literature-derived evidence, excluding computational inference, across the three GO categories: biological process (BP), molecular function (MF), and cellular component (CC). The semantic comparison of GO annotations provides a quantitative method to assess the similarity between genes and gene products. The overall GO similarity between two drugs, A and B, is defined as follows:
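$$ S_{GO}(A, B) = \frac{1}{n_{\text{pairs}}} \sum_{a \in A,\, b \in B} S_{GO}(a, b) \tag{10} $$
where a and b represent the drug targets of drug A and drug B, respectively, $n_{\text{pairs}}$ is the number of target pairs, and the normalized sum averages over all pairs of a and b with $a \in A$ and $b \in B$. The GO similarity between two targets, $S_{GO}(a, b)$, is computed using a graph-based semantic similarity measurement algorithm implemented in R [34].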
The chemical structure information (in SMILES format) was downloaded from the DrugBank database (v5.1.8), and the MACCS fingerprints for each drug were computed using the ‘rcdk’ R package. If two drug molecules have a and b bits set in their MACCS fragment bit-strings, with c of these bits being set in the fingerprints of both drugs, the Tanimoto coefficient (T) for the drug-drug pair is defined as:
$$T = \frac{c}{a + b - c} \tag{11}$$
The Tanimoto coefficient (T) is commonly used in drug discovery and development, with values ranging from 0 to 1. Here, 0 indicates no common bits, while 1 indicates that all bits are identical.
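The Tanimoto coefficient can be illustrated with a short Python sketch in which each fingerprint is represented as the set of indices of its set bits. The set representation, the function name, and the convention of returning 0 for two empty fingerprints are assumptions made for illustration; the study itself computed MACCS fingerprints with the ‘rcdk’ R package.

```python
from __future__ import annotations


def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto coefficient T = c / (a + b - c) for two fingerprints,
    each given as the set of positions of its set bits."""
    a = len(bits_a)            # bits set in the first fingerprint
    b = len(bits_b)            # bits set in the second fingerprint
    c = len(bits_a & bits_b)   # bits set in both fingerprints
    denominator = a + b - c
    # Assumed convention: two empty fingerprints share nothing, so T = 0.
    return c / denominator if denominator else 0.0


if __name__ == "__main__":
    # Hypothetical bit positions, not real MACCS keys of any drug.
    drug_a = {1, 2, 3, 4}
    drug_b = {3, 4, 5}
    print(tanimoto(drug_a, drug_b))  # c = 2, a = 4, b = 3
```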
To calculate co-expression similarity, we obtained RNA-seq data (RPKM values) for 32 tissues from the GTEx V6 database (https://gtexportal.org/). Genes whose RPKM value met the expression cutoff in more than the required fraction of the samples in each tissue were considered tissue-expressed genes, using the thresholds described in a previous study [35]. The Pearson correlation coefficient, PCC(a,b), between drug targets a and b was calculated to measure the degree of co-expression between drug targets associated with drug-treated diseases. The co-expression similarity between two drugs, A and B, was calculated by averaging PCC(a,b) across all pairs of targets a and b with $a \in A$ and $b \in B$, as shown below:
$$S_{co}(A,B) = \langle PCC(a,b) \rangle = \frac{1}{n_{pairs}} \sum_{a \in A,\, b \in B} PCC(a,b) \tag{12}$$
where $n_{pairs}$ is the number of target pairs.
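The averaging of PCC values over target pairs can be sketched in Python as follows. The dictionary of expression profiles, the gene labels, and the function names are hypothetical, and the sketch assumes that each target already has an expression vector measured over the same samples.

```python
from __future__ import annotations

import math
from itertools import product


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("vectors must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    norm_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("PCC is undefined for a constant vector")
    return cov / (norm_x * norm_y)


def coexpression_similarity(
    targets_a: list[str],
    targets_b: list[str],
    expression: dict[str, list[float]],
) -> float:
    """Average PCC(a, b) over all target pairs with a in A and b in B."""
    pccs = [
        pearson(expression[a], expression[b])
        for a, b in product(targets_a, targets_b)
    ]
    if not pccs:
        raise ValueError("both drugs need at least one target")
    return sum(pccs) / len(pccs)


if __name__ == "__main__":
    # Hypothetical expression profiles over four samples.
    expression = {
        "G1": [1.0, 2.0, 3.0, 4.0],
        "G2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with G1
        "G3": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with G1
    }
    print(coexpression_similarity(["G1"], ["G2", "G3"], expression))
```

In this toy case the two PCC values (+1 and -1) cancel, which shows how a plain average can hide strong but opposite correlations.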
The canonical protein sequences of drug targets in Homo sapiens were downloaded from the UniProt database (https://www.uniprot.org/). The protein sequence similarity, $S_P(a,b)$, between two drug targets, a and b, was calculated using the Smith-Waterman algorithm [36], which performs local sequence alignment by comparing sequence fragments of all possible lengths. The overall sequence similarity between two drugs, A and B, was determined as follows:
$$S_P(A,B) = \langle S_P(a,b) \rangle = \frac{1}{n_{pairs}} \sum_{a \in A,\, b \in B,\, a \neq b} S_P(a,b) \tag{13}$$
where $a \in A$ and $b \in B$ under the condition $a \neq b$, and $n_{pairs}$ is the number of such pairs. This condition ensures that, for drugs with common targets, pairs where a target would be compared to itself are excluded.
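The following Python sketch illustrates the two steps described above: a Smith-Waterman local alignment score between two sequences, and the average over target pairs that skips any pair in which a target would be compared to itself. The scoring parameters (match, mismatch, linear gap penalty), the normalisation by self-alignment scores, and all names are assumptions chosen for illustration; the text does not specify the substitution matrix, gap model, or normalisation that was used.

```python
from __future__ import annotations

import math


def smith_waterman_score(s: str, t: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score under a simple linear gap penalty."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        curr = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def target_similarity(seq_a: str, seq_b: str) -> float:
    """Score scaled to [0, 1] by the self-alignment scores (assumed choice)."""
    scale = math.sqrt(smith_waterman_score(seq_a, seq_a)
                      * smith_waterman_score(seq_b, seq_b))
    return smith_waterman_score(seq_a, seq_b) / scale if scale else 0.0


def drug_sequence_similarity(targets_a: list[str], targets_b: list[str],
                             sequences: dict[str, str]) -> float | None:
    """Average target similarity over pairs (a, b) with a != b.

    Returns None when no valid pair remains, e.g. when both drugs have
    the same single target.
    """
    values = [
        target_similarity(sequences[a], sequences[b])
        for a in targets_a
        for b in targets_b
        if a != b  # skip a shared target compared with itself
    ]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    # Hypothetical target identifiers and toy sequences.
    sequences = {"P1": "ACGT", "P2": "ACGT", "P3": "TTTT"}
    print(drug_sequence_similarity(["P1", "P2"], ["P1", "P3"], sequences))
```

The comparison `a != b` is made on target identifiers, not on sequences, so two distinct targets with identical sequences still contribute a pair.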
The Anatomical Therapeutic Chemical (ATC) classification system codes were used to calculate the clinical similarity between drug pairs. Clinical similarity is commonly employed to predict new drug targets [37]. The ATC codes for the drugs used in this study were downloaded from the DrugBank database (v5.1.8). The clinical similarity at the ith level ($S_i$) between drugs A and B is defined from their ATC codes as follows:
$$S_i(A,B) = \frac{|ATC_i(A) \cap ATC_i(B)|}{|ATC_i(A) \cup ATC_i(B)|} \tag{14}$$
where $ATC_i$ represents all ATC codes at the ith level, and $|\cdot|$ indicates the cardinality. The clinical similarity between drugs A and B is defined by the score $S_{ATC}(A,B)$ as follows:
$$S_{ATC}(A,B) = \frac{1}{n} \sum_{i=1}^{n} S_i(A,B) \tag{15}$$
where n = 5 is the number of levels of ATC codes. If a drug has multiple ATC codes, the clinical similarity is calculated for each ATC code, and the average of these values is used as the overall clinical similarity.
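A minimal Python sketch of this ATC-based score is given below. It relies on the standard ATC code layout, in which the five levels correspond to prefixes of 1, 3, 4, 5, and 7 characters. Reading the multiple-code rule as an average over all pairs of codes from the two drugs is an interpretation, and the function names are hypothetical.

```python
from __future__ import annotations

# Prefix lengths of the five ATC levels, e.g. A / A10 / A10B / A10BA / A10BA02.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def level_similarity(codes_a: set[str], codes_b: set[str], length: int) -> float:
    """|ATC_i(A) & ATC_i(B)| / |ATC_i(A) | ATC_i(B)| at one ATC level."""
    level_a = {code[:length] for code in codes_a}
    level_b = {code[:length] for code in codes_b}
    union = level_a | level_b
    return len(level_a & level_b) / len(union) if union else 0.0


def code_pair_similarity(code_a: str, code_b: str) -> float:
    """Mean of the level similarities over the five ATC levels."""
    total = sum(
        level_similarity({code_a}, {code_b}, length)
        for length in ATC_PREFIX_LENGTHS
    )
    return total / len(ATC_PREFIX_LENGTHS)


def clinical_similarity(codes_a: list[str], codes_b: list[str]) -> float | None:
    """Average over all code pairs (assumed reading of the multi-code rule).

    Returns None when either drug has no ATC code.
    """
    values = [code_pair_similarity(a, b) for a in codes_a for b in codes_b]
    return sum(values) / len(values) if values else None


if __name__ == "__main__":
    # Two codes that agree on the first three levels only.
    print(clinical_similarity(["A10BA02"], ["A10BB01"]))
    # A drug with two codes: the per-code scores are averaged.
    print(clinical_similarity(["A10BA02"], ["A10BB01", "C01AA05"]))
```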
Supplementary information
Additional file serves as a comprehensive supplement to the main research article, providing additional explanations and extended discussions. All figures and tables presented in this document are designed to provide further clarity and transparency regarding the study’s conclusions.
Acknowledgements
The authors thank the OpenAlex Community Group and the DrugBank database (v5.1.8) for providing the literature data and drug information that made this research possible.
Abbreviations
- FDA: Food and Drug Administration
- TPRs: True positive rates
- FPRs: False positive rates
- ROC: Receiver operating characteristic
- AUC: Area under the ROC curve
- PRC: Precision–recall curve
- AUCPR: Area under the precision–recall curve
- GO: Gene ontology
- BP: Biological process
- MF: Molecular function
- CC: Cellular component
- PCC: Pearson correlation coefficient
- ATC: Anatomical therapeutic chemical
- SEM: Standard error of the mean
Author contributions
YM conceived the study. XL and YM performed the data analysis. XL, XJ, and YM wrote and critically revised the manuscript. All authors read and approved the final manuscript.
Funding
This work was supported by the National Natural Science Foundation of China [Grant No. 62006109] and the Stable Support Plan Program of Shenzhen Natural Science Fund [Grant No. 20220814165010001].
Data availability
For detailed information on data and materials availability, please refer to the ‘Availability of Data and Materials’ section in the Supplementary.docx file.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-025-06237-7.
References
- 1. Wouters OJ, McKee M, Luyten J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA. 2020;323:844–53.
- 2. Hughes J, Rees S, Kalindjian S, Philpott K. Principles of early drug discovery. Br J Pharmacol. 2011;162(6):1239–49.
- 3. Wong CH, Siah KW, Lo AW. Estimation of clinical trial success rates and related parameters. Biostatistics. 2018;20(2):273–86.
- 4. Yi H, Xiaowen D, Yuan X, Guomeng X, Haichun L, Tao L, Yadong C, Yanmin Z. Drug repositioning: progress and challenges in drug discovery for various diseases. Eur J Med Chem. 2022;234:114239.
- 5. Jourdan J-P, Bureau R, Rochais C, Dallemagne P. Drug repositioning: a brief overview. J Pharm Pharmacol. 2020;72(9):1145–51.
- 6. Sahoo BM, Ravi KBVV, Sruti J, Mahapatra MK, Banik BK, Borah P. Drug repurposing strategy (DRS): emerging approach to identify potential therapeutics for treatment of novel coronavirus infection. Front Mol Biosci. 2021;8:628144.
- 7. Zhichao L, Hong F, Kelly R, Xiaowei X, Donna LM, William S, Weida T. In silico drug repositioning: what we need to know. Drug Discov Today. 2013;18(3):110–5.
- 8. Ryan PT, Paul MD, Edward MS, Stephen LK, Hani AA. A high-throughput screening approach to repurpose FDA-approved drugs for bactericidal applications against Staphylococcus aureus small-colony variants. mSphere. 2018;3(5):e00422-18.
- 9. Guney E, Menche J, Vidal M, Barabási A-L. Network-based in silico drug efficacy screening. Nat Commun. 2016;7(1):10331.
- 10. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotechnol. 2007;25(2):197–206.
- 11. Yang F, Zhang Q, Ji X, Zhang Y, Li W, Peng S, Xue F. Machine learning applications in drug repurposing. INSC. 2022;14(1):15–21.
- 12. Yang H-T, Ju J-H, Wong Y-T, Shmulevich I, Chiang J-H. Literature-based discovery of new candidates for drug repurposing. Brief Bioinform. 2016;18(3):488–97.
- 13. Yi H-C, You Z-H, Wang L, Su X-R, Zhou X, Jiang T-H. In silico drug repositioning using deep learning and comprehensive similarity measures. BMC Bioinform. 2021;22(3):293.
- 14. March-Vila E, Pinzi L, Sturm N, Tinivella A, Engkvist O, Chen H, Rastelli G. On the integration of in silico drug design methods for drug repurposing. Front Pharmacol. 2017;8:272508.
- 15. Ziaurrehman T, Markus V-K, Tero A. Artificial intelligence, machine learning, and drug repurposing in cancer. Expert Opin Drug Dis. 2021;16(9):977–89.
- 16. Cheng F, Desai RJ, Handy DE, Wang R, Schneeweiss S, Barabási A-L, Loscalzo J. Network-based approach to prediction and population-based validation of in silico drug repurposing. Nat Commun. 2018;9(1):2691.
- 17. Yang H-T, Ju J-H, Wong Y-T, Shmulevich I, Chiang J-H. Literature-based discovery of new candidates for drug repurposing. Brief Bioinform. 2016;18(3):488–97.
- 18. Gopal J, Prakash Sinnarasan VS, Venkatesan A. Identification of repurpose drugs by computational analysis of disease–gene–drug associations. J Comput Biol. 2021;28(10):975–84.
- 19. Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16(9):2006643.
- 20. Wei S, Philip ES, Wei Z. Drug combination therapy increases successful drug repositioning. Drug Discov Today. 2016;21(7):1189–95.
- 21. Jin G, Wong STC. Toward better drug repositioning: prioritizing and integrating existing methods into efficient pipelines. Drug Discov Today. 2014;19(5):637–44.
- 22. Zhichao L, Hong F, Kelly R, Xiaowei X, Donna LM, William S, Weida T. In silico drug repositioning: what we need to know. Drug Discov Today. 2013;18(3):110–5.
- 23. March-Vila E, Pinzi L, Sturm N, Tinivella A, Engkvist O, Chen H, Rastelli G. On the integration of in silico drug design methods for drug repurposing. Front Pharmacol. 2017;8(298):272508.
- 24. Li J, Lu Z. A new method for computational drug repositioning using drug pairwise similarity. In: 2012 IEEE international conference on bioinformatics and biomedicine; 2012. pp. 1–4.
- 25. Shyam Sundar D, Pritish R, Ibrahim Roshan K. Literature-based drug–drug similarity for drug repurposing: impact of medical subject headings term refinement and hierarchical clustering. Future Med Chem. 2022;14(18):1309–23.
- 26. Cheng F, Kovács IA, Barabási A-L. Network-based prediction of drug combinations. Nat Commun. 2019;10(1):1197.
- 27. Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017;4(1):170029.
- 28. Pina AS, Hussain A, Roque ACA. An historical overview of drug discovery. Totowa, NJ: Humana Press; 2010.
- 29. Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16(9):2006643.
- 30. Li H, Xiao H, Lin L, Jou D, Kumari V, Lin J, Li C. Drug design targeting protein–protein interactions (PPIs) using multiple ligand simultaneous docking (MLSD) and drug repositioning: discovery of raloxifene and bazedoxifene as novel inhibitors of IL-6/GP130 interface. J Med Chem. 2014;57(3):632–41.
- 31. Ma J, Wang J, Ghoraie LS, Men X, Haibe-Kains B, Penggao D. A comparative study of cluster detection algorithms in protein–protein interaction for drug target discovery and drug repurposing. Front Pharmacol. 2019;10(19):109.
- 32. Soleimani Zakeri NS, Pashazadeh S, MotieGhader H. Drug repurposing for Alzheimer’s disease based on protein–protein interaction network. Biomed Res Int. 2021;2021(1):1280237.
- 33. Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017;4(1):170029.
- 34. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8.
- 35. Cheng F, Kovács IA, Barabási A-L. Network-based prediction of drug combinations. Nat Commun. 2019;10(1):1197.
- 36. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
- 37. Cheng F, Li W, Wu Z, Wang X, Zhang C, Li J, Liu G, Tang Y. Prediction of polypharmacological profiles of drugs by the integration of chemical, side effect, and therapeutic space. J Chem Inf Model. 2013;53(4):753–62.






