BMC Bioinformatics. 2025 Aug 1;26:203. doi: 10.1186/s12859-025-06237-7

Literature data-based de novo candidates for drug repurposing

Xianglong Liang 1, Xin Jiang 2,3, Yifang Ma 1,4
PMCID: PMC12317455  PMID: 40750838

Abstract

Background

Drug repurposing offers a promising strategy for drug discovery. It involves identifying new therapeutic indications for existing, marketed drugs, thereby reducing the risks, costs, and time typically required for drug development. Various methods exist for drug repurposing, including high-throughput screening of drug compound libraries, computational in silico approaches, and literature-based methods. Numerous methods currently mine the literature for drug repositioning; however, relatively few leverage literature citation networks for this purpose.

Results

We identified 19,553 potential drug pairs for repurposing by analyzing biomedical literature data through the Jaccard coefficient. Our results demonstrated that the literature-based Jaccard coefficient was the most effective similarity metric for identifying drug repurposing opportunities. To refine our selection process, we applied a threshold defined by an upper quantile of the Jaccard coefficient, enabling us to prioritize promising de novo drug repurposing candidates. Among the identified drug pairs, we found several with strong potential for repurposing, including adapalene and bexarotene, guanabenz and tizanidine, and alvimopan and methylnaltrexone.

Conclusion

We created a validation set consisting of both true positives and true negatives for drug pairs using the repoDB dataset, a widely recognized resource for drug repurposing. To evaluate the performance of various similarity metrics for drug pairs, we compared their effectiveness based on AUC, F1 score, and AUCPR using the validation set.

Keywords: Drug repurposing, Literature citation network, Network analysis, Validation of results, Candidate repurposing drug pairs

Background

Developing a new drug is a time-consuming, high-cost, and high-risk endeavor. Recent estimates suggest that the average cost of developing a novel drug ranges from 314 million to 2.8 billion US dollars [1]. It takes approximately 12 to 15 years from the initial concept to the completion of drug development [2]. Despite the significant time and financial investment, nearly Inline graphic of candidate drugs that enter the first phase of clinical trials fail to receive approval [3]. Compared to traditional drug development, drug repurposing (or repositioning) can significantly reduce the risks and costs associated with drug discovery [4, 5]. Drug repurposing involves identifying new therapeutic indications for already marketed drugs [6]. Various methods are available for drug repurposing, including high-throughput screening of drug compound libraries and in silico approaches [7, 8]. In silico drug repurposing screening involves several approaches, including the computation of chemical similarity between ligands and drug targets, the development of machine learning models, the application of deep learning algorithms, literature-based methods, and network-based approaches [9–13].

Ligand-based approaches rely on the principle that structurally similar compounds often exhibit similar biological properties. These methods are widely used to analyze and predict the activity of ligands for novel targets [14]. Machine learning and deep learning methods leverage publicly available databases (such as phenotypic profiling data, electronic health records, etc.) and sources of information like compound structures to facilitate drug repurposing [15]. Network-based drug repurposing can be performed by quantifying the proximity between disease genes and drug targets within the human interactome [16]. Literature-based methods typically enable drug repurposing by mining large-scale repositories of scientific literature to identify and curate repurposed drugs [17]. For example, textual semantics can identify and curate new drug repurposing candidates by analyzing relationships between diseases, genes, and drugs [18].

The methods outlined above offer various technical approaches for drug repurposing, but they also come with certain limitations. Machine learning techniques, for instance, require careful selection of features and targets to effectively identify potential drug repurposing candidates. This process involves decisions on which publicly available databases to use and how to optimally extract relevant drug features. Deep learning methods encounter the 'black box' challenge, making it difficult to explain the rationale behind the repurposed drug results. Human protein–protein interaction-based drug repurposing methods, in turn, face the issue of an incomplete human interactome, which limits their effectiveness. Textual semantics also presents several challenges. First, at the text mining stage, there is no unified standard for describing the same diseases and symptoms across the literature. Second, at the semantic analysis stage, different studies employ different syntactic structures when analyzing the relationships between biological entities. Third, obtaining drug repurposing results at the data analysis stage requires complex computations. If any of these stages encounters issues, the resulting drug repurposing outcomes may be inaccurate.

To address the limitations of the methods mentioned above, we propose a novel approach for drug repurposing that leverages the vast amount of literature data accumulated over more than a century. According to OpenAlex, there are approximately 200 million scientific articles available. OpenAlex is a fully open scientific knowledge graph that includes metadata for a wide range of works, such as journal articles and books, along with disambiguated author information, institutions, etc.

We aim to build connections between drugs and literature through genes associated with the literature, as suggested by previous studies [19]. In other words, we established a connection between drugs and literature through the links between drug-target coding genes and the literature. As a result, this study primarily focuses on drugs with known targets.

In this study, drug repurposing was achieved through pairwise combinations of all drugs with known targets. Utilizing drug combinations can enhance the success rate of drug repurposing screenings [20]. It is widely believed that ’similar drugs’ exhibit similar therapeutic effects. Targets, proteome, and transcriptome networks can be used to establish similarity between drug pairs [21, 22]. Once drug-drug similarity is established, it becomes possible to further explore their shared indications for treating various diseases [23]. Li and Lu also proposed a novel method for computational drug repurposing based on drug pairwise similarity [24].

For pairwise combinations of drugs, we constructed a citation network based on literature related to the drugs. The literature-based similarities between drug pairs were then calculated using this citation network. This approach allowed us to assess the overall impact of different types of data on drug-drug similarity [25]. The various types of data include chemical compounds of drugs, biological information of drugs, etc. The inspiration for this study came from the idea that for literature related to two drugs, the higher the overlap between the literature, the more similar the two drugs are likely to be. Since the relationship between drugs and literature is established through the links between drug targets and the literature, the literature-based drug-drug similarity is actually calculated based on literature-based target-target similarity. In other words, the more identical the literature is between different targets, the closer the relationship between those targets. As a result, we expect a high degree of similarity between these targets. We also considered using the references of articles related to drugs for drug repurposing, based on the assumption that the citation of literature by authors follows a normative pattern. In reality, literature citations are not arbitrary; they follow a certain logic and structure.

Meanwhile, we created a validation set containing true positives and true negatives for drug pairs, sourced from the repoDB database, a standard dataset for drug repurposing. We then compared the literature-based similarities with human interactome-based separation using the validation set, evaluating performance in terms of AUC, F1 score, and AUCPR. The results showed that literature-based Jaccard similarity outperformed the other similarity measures in terms of AUC and F1 score. Finally, we ranked the Jaccard similarities of drug pairs from highest to lowest. De novo drug repurposing candidates were identified using a threshold defined as an upper quantile of the Jaccard similarities. We also selected ten drug pairs with detailed information and drew several novel conclusions.

Results

Data structure

We collected 1978 FDA-approved or clinically investigational drugs, each with at least two targets, from a previous study [26]. The average number of targets per drug is 6, with a median of 3 and a maximum of 256.

A histogram of the number of targets for these drugs is shown in Fig. 1a, which illustrates that most drugs have fewer than 20 targets. In total, the 1978 drugs are associated with 2254 unique targets. The average number of articles related to these targets is 249, the median is 108, and the maximum is 6563. A histogram of the number of articles for these targets is presented in Fig. 1b. The average number of articles per drug is 2658, with a median of 1397 and a maximum of 70,878. The histogram of the number of articles for these drugs is shown in Fig. 1c.

Fig. 1

a A histogram showing the number of targets for different drugs. b A histogram showing the number of articles associated with different targets. c A histogram showing the number of articles associated with different drugs

Literature-based measure of drug–drug relationships

To make the literature-based approach for drug repurposing effective, we need to establish a measure of literature-based similarity between two drugs, such as the Jaccard coefficient or logarithmic ratio similarity. To better understand the relationship between literature-based similarity and biological and pharmacological properties, we examined the correlation between literature-based similarity and various biological and pharmacological similarities (e.g., GO similarities, chemical similarity, clinical similarity, co-expression similarity, and sequence similarity) through graphical analysis. If the literature-based similarity between a drug pair increases in tandem with the biological and pharmacological similarities, this suggests a positive correlation between literature-based similarity and GO, chemical, clinical, co-expression, and sequence similarities. This implies that these drugs may have potential for repurposing.

Figure 2a–g present boxplots of biological and pharmacological similarities across multiple intervals of the literature-based Jaccard coefficient, which helps confirm the effectiveness of literature-based similarity for drug repurposing.

Fig. 2

The interplay between the Jaccard coefficient of drug pairs and GO similarity, presented from left to right: a biological processes; b molecular function; c cellular component. The interplay between the Jaccard coefficient of drug pairs and factors such as d chemical similarity; e clinical similarity; f drug target-encoding gene co-expression patterns across human tissues; g drug target protein sequence similarity. h The interplay between the Jaccard coefficient of drug pairs and human interactome-based separation

We found that the Jaccard coefficient of drug-drug pairs is positively correlated with GO, chemical, clinical, co-expression, and sequence similarities: as the Jaccard coefficient rises, each of these biological and pharmacological similarities increases (Fig. 2a–g). When there is overlap in the literature related to two drugs, the extent of this overlap therefore reflects their pharmacological relationship. Additionally, we compared the Jaccard coefficient to the logarithmic ratio similarity and found that the Jaccard coefficient outperforms the logarithmic ratio similarity in drug repurposing (Supplementary Fig. 1a–g).

Does the conclusion still hold if the articles related to drugs are assigned completely at random? To test this, we downloaded all articles with PMIDs from OpenAlex and randomly selected an equal number of articles for each drug. We needed to control for the publication years of the randomly selected articles, as the literature related to drugs follows a chronological pattern. For each drug, the randomly selected articles were required to be published no earlier than the year of the earliest related publication and no later than the year of the most recent publication. We then calculated the literature-based Jaccard coefficient using Eq. (1). We expected that the small number of intersections between the randomized literature of paired drugs would result in a low Jaccard coefficient. Additionally, we performed a nonparametric Mann–Whitney U test for independent samples to determine whether there was a significant difference between the Jaccard coefficients of articles related to paired drugs and those randomly assigned to the paired drugs.
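The year-matched random control described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names, the `pool_by_year` structure, and the toy PMIDs are hypothetical, and the sketch assumes the background pool contains at least as many articles in the year window as the drug has related articles.

```python
import random

def sample_year_matched(drug_articles, pool_by_year, rng):
    """Draw a random article set of the same size as the drug's real set,
    restricted to the drug's own publication-year window.

    drug_articles: dict mapping PMID -> publication year for one drug.
    pool_by_year:  dict mapping year -> list of PMIDs in the background corpus
                   (hypothetical structure; assumed large enough to sample from).
    """
    first, last = min(drug_articles.values()), max(drug_articles.values())
    candidates = [pmid
                  for year in range(first, last + 1)
                  for pmid in pool_by_year.get(year, [])]
    # random.Random.sample raises ValueError if the pool is too small.
    return set(rng.sample(candidates, len(drug_articles)))

def jaccard(a, b):
    """Jaccard coefficient of two article sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy background corpus: 50 made-up PMIDs per year, encoded as year*100 + i.
toy_pool = {year: [year * 100 + i for i in range(50)] for year in range(1990, 2011)}
toy_rng = random.Random(0)
random_a = sample_year_matched({1: 1995, 2: 2000, 3: 1998}, toy_pool, toy_rng)
random_b = sample_year_matched({4: 1996, 5: 2003}, toy_pool, toy_rng)
print(jaccard(random_a, random_b))
```

The resulting randomized coefficients could then be compared with the observed ones using a two-sample Mann–Whitney U test, as the text describes.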

The small red rectangles on the bar plot in Fig. 3 are error bars representing the standard error of the mean (SEM). The p value, obtained from the Mann–Whitney U test, shows that the Jaccard coefficients differ significantly between articles related to paired drugs and those randomly assigned to the paired drugs (Fig. 3).

Fig. 3

A barplot showing the Jaccard coefficient of articles related to paired drugs versus articles randomly assigned to paired drugs

The number of articles shared by paired drugs is therefore not random; instead, the overlap of articles related to paired drugs reflects the degree of similarity between them. The Jaccard coefficient for articles randomly assigned to paired drugs, plotted against biological and pharmacological similarity, is shown in Supplementary Fig. 2a–g. Figure 2h is used to confirm the relationship between the literature-based Jaccard coefficient and human interactome-based separation. The relationship between the literature-based logarithmic ratio similarity and human interactome-based separation is shown in Supplementary Fig. 1h. When the footprints of two drug-target modules are topologically separated (Inline graphic), the drugs are pharmacologically distinct. This implies that the human interactome-based separation should decrease as the literature-based Jaccard coefficient between a drug-drug pair increases. Figure 2h shows that the literature-based Jaccard coefficient is as effective as human interactome-based separation in drug repurposing. The results of the literature-based Jaccard coefficient for each drug pair are summarized in the file Jaccard_Coefficient_Result.xls in the supplementary materials. Additionally, we plotted the relationship between human interactome-based separation and biological and pharmacological similarity in Supplementary Fig. 3.

The reliability of drug repurposing

The standard database for drug repurposing, repoDB, was established in 2017 [27]. It includes 1571 drugs and 2051 UMLS disease concepts, with a total of 6677 approved and 4123 failed drug-indication pairs. Among the 1978 drugs we focused on, 723 overlap with repoDB. From the pairwise combinations of these 723 drugs, we obtained 10,125 labeled drug pairs, consisting of 3328 true negative drug pairs and 6797 true positive drug pairs. The definitions of true positive and true negative drug pairs are provided in Fig. 7. Briefly, a drug pair is considered a true positive if the two drugs share common indications that have all been approved by the FDA. If two drugs share common indications but any of those indications is not FDA-approved, the drug pair is considered a true negative. We generated the validation set from both true positive and true negative drug pairs, resulting in an imbalanced dataset. The validation set is available for researchers and can be found in the supplementary materials as Drug_Validation_Set.xls.

Fig. 7

The evidence for classifying paired drugs as true positive or true negative (a true positive drug pairs: drug pairs share the same FDA-approved indications; b true negative drug pairs: drug pairs have different FDA-approved indications; c true negative drug pairs: drug pairs have the same indication but are not FDA-approved; d Invalid drug pairs: drug pairs do not meet any of the criteria for true positives or true negatives, possibly due to lacking sufficient evidence or data)

We proposed two literature-based similarity measures for drug repurposing: the Jaccard coefficient and the logarithmic ratio similarity. Additionally, several published studies have utilized protein–protein interactions to facilitate drug repurposing. To assess the performance of the three measures (literature-based Jaccard coefficient, logarithmic ratio similarity, and human interactome-based separation), we plotted the ROC and precision–recall curves and calculated the corresponding AUC and AUCPR, as shown in Fig. 4.

Fig. 4

The ROC curves and precision–recall curves for the two literature-based measures and separation (a, b ROC curves and AUC for the two literature-based measures; c ROC curve and AUC for separation; d, e precision–recall curves and AUCPR for the two literature-based measures; f precision–recall curve and AUCPR for separation)

Figure 4a–c show the ROC curves for the three measures, along with their corresponding AUC values. The literature-based Jaccard coefficient outperforms both the literature-based logarithmic ratio similarity and the human interactome-based separation in terms of AUC. Figure 4d–f show the precision–recall curves for the three measures, along with their corresponding AUCPR values. The human interactome-based separation outperforms the other two measures in terms of AUCPR. Table 1 lists the F1 score, Precision and Recall for all three measures, with the Jaccard coefficient showing a higher F1 score and Recall than the other two measures. Since the F1 score is commonly used for imbalanced datasets, it is the most appropriate metric among the three. Therefore, we found that the literature-based Jaccard coefficient outperforms the other two measures in drug repurposing based on F1 score and Recall.

Table 1.

F1 scores for three measures of drug repurposing

Metric | F1 score | Precision | Recall
Jaccard coefficient | 0.593 | 0.779 | 0.558
Logarithmic ratio similarity | 0.559 | 0.751 | 0.523
Separation | 0.565 | 0.795 | 0.529

Selection of de novo candidates for drug repurposing

We selected de novo candidates for drug repurposing using a threshold defined as the upper Inline graphicth quantile value of the literature-based Jaccard coefficient, i.e., Inline graphic. The Inline graphic value was set between 0.01 and 0.10, with an interval of 0.01. The number of selected candidates for drug repurposing is recorded in Table 2.

Table 2.

The number of selected drug repurposing candidates under different thresholds of the Jaccard coefficient

Quantile level | Number of drug pairs
0.1 | 195,530
0.09 | 175,973
0.08 | 156,425
0.07 | 136,874
0.06 | 117,320
0.05 | 97,769
0.04 | 78,211
0.03 | 58,659
0.02 | 39,107
0.01 | 19,553
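The quantile-based selection can be sketched as below. This is a hedged illustration only: the function name `select_top_fraction` and the toy scores are hypothetical, and the tie-handling rule (keep every pair whose score equals the threshold) is an assumption rather than something the text specifies. As a sanity check on the scale involved, 1978 drugs give `math.comb(1978, 2)` = 1,955,253 pairs, and 1% of that rounds up to 19,553.

```python
import math

def select_top_fraction(scores, alpha):
    """Select drug pairs in the upper `alpha` fraction of similarity scores.

    scores: dict mapping (drug_a, drug_b) -> similarity (e.g. Jaccard coefficient).
    alpha:  fraction in (0, 1]; requires a non-empty `scores`.
    Returns (threshold, selected_pairs). Pairs tied with the threshold are
    kept (an assumption), so the selection may exceed alpha * N.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    k = math.ceil(alpha * len(ranked))
    threshold = ranked[k - 1][1]
    selected = [pair for pair, score in ranked if score >= threshold]
    return threshold, selected

# Toy example with made-up drug names and scores.
toy_scores = {("d1", "d2"): 0.30, ("d1", "d3"): 0.05,
              ("d2", "d3"): 0.12, ("d1", "d4"): 0.01}
threshold, candidates = select_top_fraction(toy_scores, alpha=0.25)
print(threshold, candidates)
print(math.comb(1978, 2))
```

Sweeping `alpha` from 0.01 to 0.10 in steps of 0.01 would produce one candidate list per quantile level.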

Ten drug pairs were selected based on their Inline graphic values, and the indications for each drug were identified using repoDB, as detailed in Table 1 of the supplementary material. We also saved the drug repurposing results for Inline graphic in the supplementary materials under the file name Drug_Validation_Set.xls. Out of the 19,553 drug pairs, 2024 had Inline graphic equal to 0. From these, we selected 20 drugs and identified 50 drug pairs with corresponding drug information, which are recorded in Table 2 of the supplementary materials. The supplementary materials also contain two files: Drug_Targets.xls, which provides the target data for each drug, and Drug_Pairs_Common_Targets.xls, which lists paired drugs that share common targets.

Conclusions

In this article, we proposed a literature-based similarity approach for drug repurposing. The relationship between drugs and literature was established through the linkage between drug-target coding genes and relevant publications. We introduced two measures, the Jaccard coefficient and the logarithmic ratio similarity, to facilitate drug repurposing using literature data. The Inline graphic was computed based on a literature citation network. The literature citation network was constructed using the references from articles related to drugs. To validate the effectiveness of the literature-based similarity, we computed various biological and pharmacological similarities, including GO similarity, chemical similarity, clinical similarity, co-expression similarity, and sequence similarity, and examined how they correlate with the literature-based similarity. To further assess its performance, we compared the literature-based similarity to the widely used human interactome-based separation. Additionally, we utilized the repoDB dataset to create a validation set for verifying the reliability of drug repurposing. The F1 score demonstrated that the Jaccard coefficient outperformed both the logarithmic ratio similarity and separation. We also calculated the Jaccard coefficient for all drug pairs and ranked them from largest to smallest. An upper quantile of the Jaccard coefficient was then defined as the threshold for selecting de novo candidates for drug repurposing.

To demonstrate the effectiveness of the Jaccard coefficient similarity in drug repurposing, we selected ten drug pairs with higher Jaccard coefficient values. Among these, seven drug pairs share common indications, while three do not. Although the three drug pairs lack identical indications, this result still provides valuable insights for drug repurposing or potential drug combinations. In particular, it is worth exploring situations where two drugs with completely different indications can still be repurposed for each other. Relying solely on relationships such as the 'drug A-indication-drug B' paradigm in some text mining methods may cause us to overlook valuable drug repurposing opportunities. Our proposed method offers distinct advantages compared to other drug repurposing approaches.

In brief, to achieve better drug repurposing results, it is essential to develop novel literature-based similarities and compare them with Inline graphic. Future studies could focus on integrating semantics and literature citation networks to further enhance drug repurposing efforts. Our proposed literature-based similarity approach is also applicable to the study of drug combinations. The challenge with the proposed drug repurposing approach lies in determining the appropriate threshold for literature-based similarity to select de novo candidates. This is complicated by the lack of a theoretical basis for using the quantile of literature-based similarity as a threshold, and the arbitrary nature of the Inline graphic value.

Methods

Literature-based similarities and human interactome-based separation

Modern drug discovery dates back to the 19th century [28]. Since then, a vast amount of literature on drug research has accumulated.

Figure 5a illustrates the number of published articles on drugs from 1954 to 2002, clearly showing an exponential growth in the volume of drug-related publications over the years. Our goal is to extract valuable information from drug-related literature to support the process of drug repurposing.

Fig. 5

a The number of published articles on drugs across different years. b A flowchart for calculating the similarity between paired drugs

The relationship among drugs, targets, and literature was obtained from NCBI (NIH) via the link between genes and publications, with data collected in November 2023 [29]. Each biomedical publication is assigned a unique PMID, from which we can retrieve information such as references, publication year, and author details from OpenAlex. The literature citation network was constructed using references from articles related to drugs. The process involved: first, identifying articles related to drugs A and B through the drug-target-literature triplet; second, extracting the references for each article related to the drugs; and finally, building the literature citation network, as shown in Fig. 6. Given n drugs, the total number of drug pairs is $\binom{n}{2} = n(n-1)/2$. Literature-based similarities between drug pairs can be measured using methods such as the Jaccard coefficient and the logarithmic ratio. Figure 5b presents a flowchart for calculating the literature-based similarity of paired drugs. The Jaccard coefficient is defined as follows:

$$J(AB) = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$

Fig. 6

An example of a citation network for a drug pair

The Jaccard coefficient J(AB) measures the similarity between drug A and drug B, where A represents the set of papers related to drug A, and B represents the set of papers related to drug B. In Fig. 6, the nodes directly connected to drugs A and B are labeled as A and B, respectively. The arrowed lines represent the references cited in the literature associated with each drug. The number of blue curves indicates the count of articles related to drug A that are cited within the references of articles related to drug B. Similarly, the orange curve conveys the same meaning but in reverse, representing the number of articles related to drug B in the references of articles related to drug A. The value of J(AB) ranges from 0 to 1, with values closer to 1 indicating greater similarity between the two drugs. To create a more compact distribution of biological and pharmacological similarities for paired drugs on the Jaccard coefficient, we take the logarithm of the Jaccard coefficient. In addition to the Jaccard coefficient, if we consider only the sizes of the two sets, we can construct two new indicators with a structure similar to that of the Jaccard coefficient as follows,

graphic file with name 12859_2025_6237_Article_Equ2.gif 2
graphic file with name 12859_2025_6237_Article_Equ3.gif 3

The notation used in these indicators is consistent with that in the Jaccard coefficient. The performance of these indicators in drug repositioning is illustrated and explained in Figs. 4 and 5 of the supplementary materials.
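As a minimal sketch of the Jaccard coefficient in Eq. (1), the following assumes that the article set of a drug is the union of the articles linked to its targets, which is how the text describes the drug-target-literature link. The function names and the toy identifiers are hypothetical, and the citation-network step (references of the related articles) is not modeled here.

```python
from itertools import combinations

def drug_article_sets(drug_targets, target_articles):
    """Map each drug to the union of PMIDs linked to its targets.

    drug_targets:    dict drug -> iterable of target identifiers.
    target_articles: dict target -> set of PMIDs (hypothetical toy data below).
    """
    return {drug: set().union(*(target_articles.get(t, set()) for t in targets))
            for drug, targets in drug_targets.items()}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(article_sets):
    """Jaccard coefficient for every unordered drug pair."""
    return {(d1, d2): jaccard(article_sets[d1], article_sets[d2])
            for d1, d2 in combinations(sorted(article_sets), 2)}

# Toy data: two drugs sharing one target.
toy_targets = {"drugA": ["T1", "T2"], "drugB": ["T2", "T3"]}
toy_articles = {"T1": {1, 2}, "T2": {3, 4}, "T3": {5}}
print(pairwise_jaccard(drug_article_sets(toy_targets, toy_articles)))
```

In the toy data the two drugs share two of five distinct articles, so their coefficient is 2/5.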

Another literature-based similarity measure is the logarithmic ratio, defined as follows:

graphic file with name 12859_2025_6237_Article_Equ4.gif 4

Here, Inline graphic represents the number of articles related to drug B in the reference list of articles associated with drug A, and Inline graphic denotes the total number of articles related to drug A. The larger the value, the greater the similarity between the paired drugs. Recently, drug repurposing via protein–protein interactions has garnered significant attention [30–32]. Therefore, we compared literature-based drug similarity with the currently popular human interactome-based drug similarity. The experimentally validated protein–protein interactions (PPIs) are assembled from various data sources, as outlined in previous studies [26]. The network proximity of drug-target modules A and B, as represented in the human interactome, is measured by their separation:

$$s_{AB} = \langle d_{AB} \rangle - \frac{\langle d_{AA} \rangle + \langle d_{BB} \rangle}{2} \tag{5}$$

which compares the mean shortest distance within the interactome between the targets of each drug, denoted as $\langle d_{AA} \rangle$ and $\langle d_{BB} \rangle$, to the mean shortest distance, $\langle d_{AB} \rangle$, between A-B target pairs. Note that $\langle \cdot \rangle$ denotes the average of the shortest distances.
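The separation measure can be sketched on a toy network as follows. This is an illustration under assumptions: the averages are taken over all target pairs (excluding self-pairs within a drug), which is one reading of "mean shortest distance" and may differ from the implementation used in the cited studies; the graph, function names, and node labels are hypothetical. Each drug is assumed to have at least two mutually reachable targets.

```python
from collections import deque
from itertools import combinations, product

def shortest_path_lengths(graph, source):
    """Breadth-first search distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def mean_distance(graph, pairs):
    """Average shortest-path length over reachable (u, v) pairs."""
    lengths = []
    for u, v in pairs:
        d = shortest_path_lengths(graph, u).get(v)
        if d is not None:  # skip disconnected pairs
            lengths.append(d)
    return sum(lengths) / len(lengths)

def separation(graph, targets_a, targets_b):
    """Mean A-B distance minus the average of the within-A and within-B means."""
    a, b = sorted(targets_a), sorted(targets_b)
    d_aa = mean_distance(graph, combinations(a, 2))
    d_bb = mean_distance(graph, combinations(b, 2))
    d_ab = mean_distance(graph, product(a, b))
    return d_ab - (d_aa + d_bb) / 2

# Toy interactome: a path 1 - 2 - 3 - 4 stored as an adjacency list.
toy_graph = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(separation(toy_graph, {1, 2}, {3, 4}))  # separated modules
print(separation(toy_graph, {1, 2}, {1, 2}))  # fully overlapping modules
```

On the toy path, the two disjoint modules give a positive value and the overlapping modules a negative one, matching the interpretation that larger separation indicates topologically distinct target modules.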

Verifying the reliability of drug repurposing

Brown and Patel established the foundational database for drug repurposing in 2017 [33]. This database includes information on 1571 currently approved drugs, as annotated in DrugBank. It includes the drug names, DrugBank IDs, indications, indication IDs, and trial status for each drug. If the paired drugs share the same indications, and all of these indications are FDA-approved, the paired drugs are considered true positives. Otherwise, if the paired drugs share the same indications but one or more of these indications have not been FDA-approved, the paired drugs are classified as true negatives.
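The labeling rule stated above can be sketched as a small function. This is a hedged illustration: the input structure (a dict from indication to an approved flag) and the function name are hypothetical and do not reflect the actual repoDB file layout. Pairs with no shared indication are returned as `None` here; Fig. 7 describes additional cases for such pairs that this sketch does not model.

```python
def classify_pair(indications_a, indications_b):
    """Label a drug pair from its shared indications.

    indications_x: dict mapping indication -> True if FDA-approved for that
                   drug, False otherwise (hypothetical representation).
    Returns "TP" if all shared indications are approved for both drugs,
    "TN" if at least one shared indication is not approved,
    and None if the drugs share no indication.
    """
    shared = set(indications_a) & set(indications_b)
    if not shared:
        return None
    all_approved = all(indications_a[i] and indications_b[i] for i in shared)
    return "TP" if all_approved else "TN"

# Toy records with made-up approval statuses.
drug_x = {"hypertension": True, "migraine": False}
print(classify_pair(drug_x, {"hypertension": True}))  # shared and approved
print(classify_pair(drug_x, {"migraine": True}))      # shared, not approved for drug_x
print(classify_pair(drug_x, {"asthma": True}))        # nothing shared
```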

Figure 7 illustrates the detailed evidence for classifying paired drugs as true positives or true negatives. We ultimately obtained a validation set consisting of 10,125 drug pairs from the repoDB dataset, which included 6797 true positive drug pairs and 3328 true negative drug pairs. The validation set was imbalanced, with the number of true positive drug pairs exceeding that of true negative pairs. We computed literature-based similarity metrics, such as the Jaccard coefficient and logarithmic ratio similarity, for these 10,125 drug pairs. The corresponding similarities were saved in Drug_Validation_Set.xls, available in the supplementary materials. The true positive rates (TPRs) and false positive rates (FPRs) were calculated at different similarity thresholds. We used the receiver operating characteristic (ROC) curve, the area under the ROC curve (AUC), the precision–recall curve (PRC), the area under the precision–recall curve (AUCPR), and the F1 score to assess the reliability of drug repurposing. These metrics evaluate the performance of the classification model across all classification thresholds.

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The TPR and FPR are defined as follows:

$$\mathrm{TPR} = \frac{\mathrm{TPs}}{\mathrm{TPs} + \mathrm{FNs}} \tag{6}$$
$$\mathrm{FPR} = \frac{\mathrm{FPs}}{\mathrm{FPs} + \mathrm{TNs}} \tag{7}$$

where TPs, FNs, FPs, and TNs represent true positives, false negatives, false positives, and true negatives, respectively. The Area Under the Curve (AUC) value ranges from 0 to 1, with a higher AUC indicating better performance of the classification method. The precision–recall curve illustrates the tradeoff between precision and recall at different threshold settings. A larger area under the precision–recall curve signifies both high recall and high precision. Precision and recall are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{8}$$

Recall is identical to the true positive rate (TPR) defined above, so its formula is not repeated here. The $F_1$ score is the harmonic mean of precision and recall and is calculated as follows:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{9}$$

Its value ranges from 0 to 1, with 1 indicating perfect precision and recall. The $F_1$ score is also widely used for evaluating models on imbalanced datasets.
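The threshold-based metrics described above can be illustrated with a short Python sketch. It is not the authors' evaluation code: the function names are hypothetical, a pair is assumed to be called positive when its similarity is at or above the threshold, and the AUC is approximated with the trapezoidal rule over every distinct score.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(scores: Sequence[float], labels: Sequence[int],
                     threshold: float) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN); a pair is predicted positive if score >= threshold."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def threshold_metrics(scores: Sequence[float], labels: Sequence[int],
                      threshold: float) -> Dict[str, float]:
    """TPR (recall), FPR, precision and F1 at a single threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if (precision + tpr) else 0.0
    return {"tpr": tpr, "fpr": fpr, "precision": precision, "f1": f1}


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Trapezoidal area under the ROC curve, using each distinct score as a threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        m = threshold_metrics(scores, labels, t)
        points.append((m["fpr"], m["tpr"]))
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy data (illustrative only, not taken from the validation set).
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]
print(threshold_metrics(toy_scores, toy_labels, 0.5))
print(roc_auc(toy_scores, toy_labels))
```

In practice a library routine would usually be preferred over this hand-rolled version; the sketch only makes the definitions of Eqs. (6) to (9) concrete.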

Five types of drug similarities

To validate the literature-based measure of drug-drug relationships, we calculated five types of drug profiles: Gene Ontology (GO) similarity, chemical similarity, co-expression similarity, sequence similarity, and clinical similarity, based on a previous study [26].

The Gene Ontology (GO) annotations for drug target-coding genes were downloaded from https://www.geneontology.org/. We used annotations supported by experimental or literature-derived evidence for all three GO aspects, namely biological process (BP), molecular function (MF), and cellular component (CC), and excluded annotations inferred computationally. The semantic comparison of GO annotations provides a quantitative method to assess the similarity between genes and gene products. The overall GO similarity between two drugs, A and B, is defined as follows:

$$S_{GO}(A,B) = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B} S_{GO}(a,b) \tag{10}$$

where $a$ and $b$ represent the drug targets of drug A and drug B, respectively, and the sum divided by the number of pairs $n_{\mathrm{pairs}}$ is the average over all pairs of $a$ and $b$ with $a \in A$ and $b \in B$. The gene-level GO similarity, $S_{GO}(a,b)$, is computed using a graph-based semantic similarity measurement algorithm implemented in R [34].
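The drug-level averaging described above can be sketched in Python as follows. This is only an illustration: the gene-level similarities are assumed to be precomputed elsewhere (for example with the R package cited in [34]) and passed in as a lookup table, and the function name, gene identifiers, and values are hypothetical.

```python
from itertools import product
from typing import Dict, Sequence, Tuple


def go_similarity(targets_a: Sequence[str], targets_b: Sequence[str],
                  gene_sim: Dict[Tuple[str, str], float]) -> float:
    """Average the gene-level GO similarity over all target pairs (a in A, b in B)."""
    pairs = list(product(targets_a, targets_b))
    if not pairs:
        raise ValueError("each drug needs at least one target")
    total = 0.0
    for a, b in pairs:
        # The lookup table is treated as symmetric: try (a, b), then (b, a).
        key = (a, b) if (a, b) in gene_sim else (b, a)
        total += gene_sim[key]
    return total / len(pairs)


# Hypothetical gene-level similarities, for illustration only.
toy_gene_sim = {
    ("G1", "G3"): 0.8,
    ("G1", "G4"): 0.2,
    ("G2", "G3"): 0.5,
    ("G2", "G4"): 0.5,
}
print(go_similarity(["G1", "G2"], ["G3", "G4"], toy_gene_sim))
```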

The chemical structure information (in SMILES format) was downloaded from the DrugBank database (v5.1.8), and the MACCS fingerprints for each drug were computed using the ‘rcdk’ R package. If two drug molecules have $a$ and $b$ bits set in their MACCS fragment bit-strings, with $c$ of these bits being set in the fingerprints of both drugs, the Tanimoto coefficient ($T$) for the drug-drug pair is defined as:

$$T = \frac{c}{a + b - c} \tag{11}$$

The Tanimoto coefficient (T) is commonly used in drug discovery and development, with values ranging from 0 to 1. Here, 0 indicates no common bits, while 1 indicates that all bits are identical.
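As a minimal sketch of Eq. (11), the following Python function computes the Tanimoto coefficient from two fingerprints represented as sets of "on" bit positions. The fingerprint generation itself (done with ‘rcdk’ in the study) is not reproduced here; the bit positions below are made up for illustration, and returning 0 when both fingerprints are empty is an assumed convention.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto coefficient T = c / (a + b - c) for two sets of on-bit positions."""
    a = len(bits_a)           # bits set in the first fingerprint
    b = len(bits_b)           # bits set in the second fingerprint
    c = len(bits_a & bits_b)  # bits set in both fingerprints
    denominator = a + b - c
    # Assumed convention: two empty fingerprints get similarity 0.
    return c / denominator if denominator else 0.0


# Illustrative bit positions only (not real MACCS keys of any drug).
print(tanimoto({1, 2, 3, 4}, {3, 4, 5}))
```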

To calculate co-expression similarity, we obtained RNA-seq data (RPKM values) for 32 tissues from the GTEx V6 database (https://gtexportal.org/). Genes whose RPKM met the expression cutoff in the required fraction of samples in each tissue were considered tissue-expressed genes, using the thresholds described in a previous study [35]. The Pearson correlation coefficient, $PCC(a,b)$, between drug targets $a$ and $b$ was calculated to measure the degree of co-expression between drug targets associated with drug-treated diseases. The co-expression similarity between two drugs, A and B, denoted here $S_{CE}(A,B)$, was calculated by averaging $PCC(a,b)$ across all pairs of targets $a$ and $b$ with $a \in A$ and $b \in B$, as shown below:

$$S_{CE}(A,B) = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B} PCC(a,b) \tag{12}$$
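The co-expression calculation can be sketched in Python as below. The sketch assumes that each target gene has a single expression vector measured over the same samples; the tissue filtering step is omitted, and the gene names and expression values are hypothetical.

```python
import math
from typing import Dict, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("vectors must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def coexpression_similarity(targets_a: Sequence[str], targets_b: Sequence[str],
                            expression: Dict[str, Sequence[float]]) -> float:
    """Average PCC(a, b) over all target pairs with a in A and b in B."""
    values = [pearson(expression[a], expression[b])
              for a in targets_a for b in targets_b]
    if not values:
        raise ValueError("each drug needs at least one target")
    return sum(values) / len(values)


# Hypothetical expression values, for illustration only.
toy_expression = {
    "T1": [1.0, 2.0, 3.0, 4.0],
    "T2": [2.0, 4.0, 6.0, 8.0],
    "T3": [4.0, 3.0, 2.0, 1.0],
}
print(coexpression_similarity(["T1"], ["T2", "T3"], toy_expression))
```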

The canonical protein sequences of drug targets in Homo sapiens were downloaded from the UniProt database (https://www.uniprot.org/). The protein sequence similarity, $S_P(a,b)$, between two drug targets, $a$ and $b$, was calculated using the Smith-Waterman algorithm [36], which performs local sequence alignment by comparing sequence segments of all possible lengths. The overall sequence similarity between two drugs, A and B, was determined as follows:

$$S_P(A,B) = \frac{1}{n_{\mathrm{pairs}}} \sum_{a \in A,\, b \in B,\, a \neq b} S_P(a,b) \tag{13}$$

where $a \in A$ and $b \in B$ under the condition $a \neq b$, and $n_{\mathrm{pairs}}$ is the number of such pairs. This condition ensures that, for drugs with common targets, pairs in which a target would be compared to itself are excluded.
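The following Python sketch illustrates the two steps: a Smith-Waterman local alignment score for a pair of sequences, and the drug-level average that skips pairs in which a target would be compared to itself. The scoring scheme (fixed match, mismatch, and linear gap values) is an assumption made for brevity, since the scoring parameters are not given here; protein alignments normally use a substitution matrix. The target identifiers and sequences are hypothetical, and returning 0 when no valid pair remains is an assumed convention.

```python
from typing import Dict, Sequence


def smith_waterman_score(s: str, t: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score of s and t (simple scoring, linear gap penalty)."""
    previous = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        current = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            diagonal = previous[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


def sequence_similarity(targets_a: Sequence[str], targets_b: Sequence[str],
                        sequences: Dict[str, str]) -> float:
    """Average alignment score over target pairs (a in A, b in B) with a != b."""
    scores = [smith_waterman_score(sequences[a], sequences[b])
              for a in targets_a for b in targets_b if a != b]
    # Assumed convention: no valid pair (e.g. one shared target only) gives 0.
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical target identifiers and sequences, for illustration only.
toy_sequences = {"P1": "AAAA", "P2": "AAAT", "P3": "TTTT"}
print(sequence_similarity(["P1", "P2"], ["P2", "P3"], toy_sequences))
```

The drug-level function averages raw alignment scores; whether and how the scores are normalized before averaging is not specified in this section.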

The Anatomical Therapeutic Chemical (ATC) classification system codes were used to calculate the clinical similarity between drug pairs. Clinical similarity is commonly employed to predict new drug targets [37]. The ATC codes for the drugs used in this study were downloaded from the DrugBank database (v5.1.8). The clinical similarity at the $i$th level ($S_i$) between drugs A and B is defined via the ATC codes as follows:

$$S_i(A,B) = \frac{|ATC_i(A) \cap ATC_i(B)|}{|ATC_i(A) \cup ATC_i(B)|} \tag{14}$$

where $ATC_i$ represents all ATC codes at the $i$th level, and $|\cdot|$ indicates the cardinality of a set. The clinical similarity between drugs A and B is defined by the score $S_{ATC}(A,B)$ as follows:

$$S_{ATC}(A,B) = \frac{1}{n} \sum_{i=1}^{n} S_i(A,B) \tag{15}$$

where $n$ represents the five levels of ATC codes. If a drug has multiple ATC codes, the clinical similarity is calculated for each ATC code, and the average of these values is used as the overall clinical similarity.
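A minimal Python sketch of the ATC-based clinical similarity is given below. It relies on the standard ATC code layout, in which the five levels correspond to prefixes of 1, 3, 4, 5, and 7 characters. How drugs with several ATC codes are handled is interpreted here as averaging the score over all pairs of codes, which is an assumption about the description above; the function names are hypothetical.

```python
from typing import Iterable, Sequence

# Prefix lengths of the five ATC levels, e.g. A / A10 / A10B / A10BA / A10BA02.
ATC_PREFIX_LENGTHS = (1, 3, 4, 5, 7)


def level_similarity(codes_a: Iterable[str], codes_b: Iterable[str], level: int) -> float:
    """Jaccard overlap of the ATC prefixes of two code sets at one level (1 to 5)."""
    length = ATC_PREFIX_LENGTHS[level - 1]
    set_a = {code[:length] for code in codes_a}
    set_b = {code[:length] for code in codes_b}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def code_pair_similarity(code_a: str, code_b: str) -> float:
    """Mean of the level similarities over the five ATC levels for two single codes."""
    levels = range(1, len(ATC_PREFIX_LENGTHS) + 1)
    return sum(level_similarity([code_a], [code_b], lvl) for lvl in levels) / len(ATC_PREFIX_LENGTHS)


def clinical_similarity(codes_a: Sequence[str], codes_b: Sequence[str]) -> float:
    """Average the per-code similarity over all code pairs (assumed interpretation)."""
    pairs = [(a, b) for a in codes_a for b in codes_b]
    if not pairs:
        raise ValueError("each drug needs at least one ATC code")
    return sum(code_pair_similarity(a, b) for a, b in pairs) / len(pairs)


# Two codes sharing the first four levels and differing only at the fifth.
print(code_pair_similarity("A10BA02", "A10BA01"))
```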

Supplementary information

The additional file serves as a comprehensive supplement to the main research article, providing additional explanations and extended discussions. All figures and tables presented in that document are intended to provide further clarity and transparency regarding the study’s conclusions.

Supplementary Information

Supplementary Material 1. (287.5KB, xls)
Supplementary Material 3. (271.5KB, xls)

Acknowledgements

The authors thank the OpenAlex Community Group and the DrugBank database (v5.1.8) for providing the literature data and drug information that made this research possible.

Abbreviations

FDA: Food and Drug Administration
TPRs: True positive rates
FPRs: False positive rates
ROC: Receiver operating characteristic
AUC: Area under the ROC curve
PRC: Precision–recall curve
AUCPR: Area under the precision–recall curve
GO: Gene ontology
BP: Biological process
MF: Molecular function
CC: Cellular component
PCC: Pearson correlation coefficient
ATC: Anatomical therapeutic chemical
SEM: Standard error of the mean

Author contributions

YM conceived the study. XL and YM performed the data analysis. XL, XJ, and YM wrote and critically revised the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China [Grant No. 62006109] and the Stable Support Plan Program of Shenzhen Natural Science Fund [Grant No. 20220814165010001].

Data availability

For detailed information on data and materials availability, please refer to the ‘Availability of Data and Materials’ section in the Supplementary.docx file.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12859-025-06237-7.

References

  • 1. Wouters OJ, McKee M, Luyten J. Estimated research and development investment needed to bring a new medicine to market, 2009–2018. JAMA. 2020;323:844–53.
  • 2. Hughes J, Rees S, Kalindjian S, Philpott K. Principles of early drug discovery. Br J Pharmacol. 2011;162(6):1239–49.
  • 3. Wong CH, Siah KW, Lo AW. Estimation of clinical trial success rates and related parameters. Biostatistics. 2018;20(2):273–86.
  • 4. Yi H, Xiaowen D, Yuan X, Guomeng X, Haichun L, Tao L, Yadong C, Yanmin Z. Drug repositioning: progress and challenges in drug discovery for various diseases. Eur J Med Chem. 2022;234:114239.
  • 5. Jourdan J-P, Bureau R, Rochais C, Dallemagne P. Drug repositioning: a brief overview. J Pharm Pharmacol. 2020;72(9):1145–51.
  • 6. Sahoo BM, Ravi KBVV, Sruti J, Mahapatra MK, Banik BK, Borah P. Drug repurposing strategy (DRS): emerging approach to identify potential therapeutics for treatment of novel coronavirus infection. Front Mol Biosci. 2021;8:628144.
  • 7. Zhichao L, Hong F, Kelly R, Xiaowei X, Donna LM, William S, Weida T. In silico drug repositioning – what we need to know. Drug Discov Today. 2013;18(3):110–5.
  • 8. Ryan PT, Paul MD, Edward MS, Stephen LK, Hani AA. A high-throughput screening approach to repurpose fda-approved drugs for bactericidal applications against staphylococcus aureus small-colony variants. MSphere. 2018;3(5):10-11280042218.
  • 9. Guney E, Menche J, Vidal M, Barabási A-L. Network-based in silico drug efficacy screening. Nat Commun. 2016;7(1):10331.
  • 10. Keiser MJ, Roth BL, Armbruster BN, Ernsberger P, Irwin JJ, Shoichet BK. Relating protein pharmacology by ligand chemistry. Nat Biotechnol. 2007;25(2):197–206.
  • 11. Yang F, Zhang Q, Ji X, Zhang Y, Li W, Peng S, Xue F. Machine learning applications in drug repurposing. INSC. 2022;14(1):15–21.
  • 12. Yang H-T, Ju J-H, Wong Y-T, Shmulevich I, Chiang J-H. Literature-based discovery of new candidates for drug repurposing. Brief Bioinform. 2016;18(3):488–97.
  • 13. Yi H-C, You Z-H, Wang L, Su X-R, Zhou X, Jiang T-H. In silico drug repositioning using deep learning and comprehensive similarity measures. BMC Bioinform. 2021;22(3):293.
  • 14. March-Vila E, Pinzi L, Sturm N, Tinivella A, Engkvist O, Chen H, Rastelli G. On the integration of in silico drug design methods for drug repurposing. Front Pharmacol. 2017;8:272508.
  • 15. Ziaurrehman T, Markus V-K, Tero A. Artificial intelligence, machine learning, and drug repurposing in cancer. Expert Opin Drug Dis. 2021;16(9):977–89.
  • 16. Cheng F, Desai RJ, Handy DE, Wang R, Schneeweiss S, Barabási A-L, Loscalzo J. Network-based approach to prediction and population-based validation of in silico drug repurposing. Nat Commun. 2018;9(1):2691.
  • 17. Yang H-T, Ju J-H, Wong Y-T, Shmulevich I, Chiang J-H. Literature-based discovery of new candidates for drug repurposing. Brief Bioinform. 2016;18(3):488–97.
  • 18. Gopal J, Prakash Sinnarasan VS, Venkatesan A. Identification of repurpose drugs by computational analysis of disease–gene–drug associations. J Comput Biol. 2021;28(10):975–84.
  • 19. Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16(9):2006643.
  • 20. Wei S, Philip ES, Wei Z. Drug combination therapy increases successful drug repositioning. Drug Discov Today. 2016;21(7):1189–95.
  • 21. Jin G, Wong STC. Toward better drug repositioning: prioritizing and integrating existing methods into efficient pipelines. Drug Discov Today. 2014;19(5):637–44.
  • 22. Zhichao L, Hong F, Kelly R, Xiaowei X, Donna LM, William S, Weida T. In silico drug repositioning – what we need to know. Drug Discov Today. 2013;18(3):110–5.
  • 23. March-Vila E, Pinzi L, Sturm N, Tinivella A, Engkvist O, Chen H, Rastelli G. On the integration of in silico drug design methods for drug repurposing. Front Pharmacol. 2017;8(298):272508.
  • 24. Li J, Lu Z. A new method for computational drug repositioning using drug pairwise similarity. In: 2012 IEEE international conference on bioinformatics and biomedicine; 2012. pp. 1–4.
  • 25. Shyam Sundar D, Pritish R, Ibrahim Roshan K. Literature-based drug–drug similarity for drug repurposing: Impact of medical subject headings term refinement and hierarchical clustering. Future Med Chem. 2022;14(18):1309–23.
  • 26. Cheng F, Kovács IA, Barabási A-L. Network-based prediction of drug combinations. Nat Commun. 2019;10(1):1197.
  • 27. Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017;4(1):170029.
  • 28. Pina AS, Hussain A, Roque ACA. An historical overview of drug discovery. Totowa, NJ: Humana Press; 2010.
  • 29. Stoeger T, Gerlach M, Morimoto RI, Nunes Amaral LA. Large-scale investigation of the reasons why potentially important genes are ignored. PLoS Biol. 2018;16(9):2006643.
  • 30. Li H, Xiao H, Lin L, Jou D, Kumari V, Lin J, Li C. Drug design targeting protein–protein interactions (PPIs) using multiple ligand simultaneous docking (MLSD) and drug repositioning: Discovery of raloxifene and bazedoxifene as novel inhibitors of il-6/gp130 interface. J Med Chem. 2014;57(3):632–41.
  • 31. Ma J, Wang J, Ghoraie LS, Men X, Haibe-Kains B, Penggao D. A comparative study of cluster detection algorithms in protein–protein interaction for drug target discovery and drug repurposing. Front Pharmacol. 2019;10(19):109.
  • 32. Soleimani Zakeri NS, Pashazadeh S, MotieGhader H. Drug repurposing for Alzheimer’s disease based on protein–protein interaction network. Biomed Res Int. 2021;2021(1):1280237.
  • 33. Brown AS, Patel CJ. A standard database for drug repositioning. Sci Data. 2017;4(1):170029.
  • 34. Yu G, Li F, Qin Y, Bo X, Wu Y, Wang S. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics. 2010;26(7):976–8.
  • 35. Cheng F, Kovács IA, Barabási A-L. Network-based prediction of drug combinations. Nat Commun. 2019;10(1):1197.
  • 36. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol. 1981;147(1):195–7.
  • 37. Cheng F, Li W, Wu Z, Wang X, Zhang C, Li J, Liu G, Tang Y. Prediction of polypharmacological profiles of drugs by the integration of chemical, side effect, and therapeutic space. J Chem Inf Model. 2013;53(4):753–62.


Articles from BMC Bioinformatics are provided here courtesy of BMC
