Abstract
Objectives
This study extends prior research by combining a chronological pharmacovigilance network approach with machine-learning (ML) techniques to predict adverse drug events (ADEs) based on the drugs’ similarities in terms of the proteins they target in the human body. The research focuses specifically on predicting drug-ADE associations for a set of 8 common and high-risk ADEs.
Materials and methods
A large collection of annotated MEDLINE biomedical articles was used to construct a drug-ADE network, and the network was further equipped with information about drugs’ target proteins. Several network metrics were extracted and used as predictors in ML algorithms to predict the existence of network edges (ie, associations or relationships).
Results
Gradient boosted trees (GBTs) as an ensemble ML algorithm outperformed other prediction methods in identifying the drug-ADE associations with an overall accuracy of 92.8% on the validation sample. The prediction model was able to predict drug-ADE associations, on average, 3.84 years earlier than they were actually mentioned in the biomedical literature.
Conclusion
While network analysis and ML techniques were used separately in prior ADE studies, our results showed that, in combination, they reinforce each other and yield better predictions. Moreover, our results highlight the superior capability of ensemble-type ML methods in capturing drug-ADE patterns compared to regular (ie, singular) ML algorithms.
Keywords: adverse drug events, network analysis, machine learning, prediction, target proteins, ensemble models
INTRODUCTION
Today, every new drug to be approved by healthcare authorities and marketed by pharmaceutical companies has to pass through numerous clinical trials, which on average take 10 to 15 years.1 These clinical trials aim mainly at ensuring efficacy and safety of the drug. A considerable number of drugs fail to get US Food and Drug Administration (FDA) approval due to the potential threats their usage involves, even though they might show effectiveness with regard to treating some specific diseases.2 Nevertheless, even such tough regulations and approval procedures do not 100% guarantee the safety of a drug, as those trials themselves involve several limitations and may fail to capture some potential—in some cases, serious—safety issues.3,4
A classic example of such cases is Rofecoxib, a non-steroidal anti-inflammatory drug approved in 1999 that quickly became popular among physicians. The drug was originally intended to treat acute pain and osteoarthritis, but it later turned out to cause heart attacks in more than 100 000 patients and ended up being withdrawn by the FDA in 2004. Apart from the lives threatened during that time, this possibly avoidable problem also imposed huge losses on pharmaceutical and insurance companies.
Pharmacovigilance (a.k.a. drug safety surveillance) is a field of science that monitors drugs during their lifecycle to detect, assess, and understand their potential adverse effects and to prevent the resulting harm and injuries. Although pharmacovigilance activities begin soon after drug discovery, their role becomes more critical after drug approval, when patients start to take the drug.
In pharmacovigilance terminology, an adverse drug event (ADE) refers to any injury occurring to a patient caused by administering a drug. It should be noted that there is still no consensus on this terminology across pharmacovigilance and pharmacoepidemiology studies. Some studies2,4–6 define ADE as any injury that does not necessarily have a causal relationship with the drug (eg, injuries due to human errors) and therefore use the more specific term adverse drug reaction (ADR) to refer to the injuries directly caused by the drug. However, in the present study, we stick with the term ADE, while we emphasize that by ADE, we mean a drug-induced (ie, causally related) injury in patients. It is estimated that in the United States, each ADE case in community hospitals on average costs $3000.3,4 Also, ADEs are reported by the Australian Commission on Safety and Quality in HealthCare (ACSQHC) to cause about 400 000 admissions to general practitioners in Australia with a population of only 23 million.3
Given the great potential health and financial threats mentioned, and considering the fact that today the trend is toward faster approval processes and smaller clinical trials, especially in oncology and rare diseases,2 a great amount of research has been done in the past decade to find faster and more effective ways to detect, predict, understand, and prevent ADEs before they affect too many (or ideally any) people.
In this study, we extend the extant literature on ADE prediction by proposing a chronological network analytics approach that can help pharmacovigilance practitioners save lots of time, money, and, more importantly, lives by enabling them to predict potential ADEs prior to drug approval. The proposed approach uses historical information of known drug-ADE relationships in addition to similarities between new and approved drugs, in terms of the proteins they target in human bodies, and tries to predict potential ADEs.
The remainder of this article is organized as follows: the following section reviews the extant literature on detection, prediction, and understanding of ADEs and states the research goals. Then, we explain the materials and methods used to conduct the study followed by the results. Finally, we discuss the contributions of our study and conclude with a few potential future research directions.
BACKGROUND
Resources for ADE studies
Before discussing different approaches used in prior ADE studies, in this section, we discuss various data sources used by researchers to conduct those studies. Four main types of data sources have been identified in the literature. The following 4 sub-sections introduce these resources and mention prior research conducted using each.
Spontaneous reporting systems
As an effort to rapidly detect and prevent ADEs in the post-marketing phase, many countries and international organizations have run spontaneous reporting systems (SRSs)—systems designed to allow patients and professionals to submit their reports of suspected ADEs. This includes the World Health Organization’s (WHO) Individual Case Safety Reports (ICSR) database, the yellow card system of Medicines and Healthcare products Regulatory Agency (MHRA) in the UK, and the FDA Adverse Event Reporting System (FAERS) in the United States.3 Although SRSs were the main source for ADE studies for several years, their limitations such as over-reporting and voluntary submissions3,7 made pharmacovigilance practitioners look for more efficient alternatives.
Electronic health records
During the past decade, electronic health records (EHR) have been widely used in the healthcare industry to help practitioners in the collection, storage, and tracking of patients’ information. The vast amount of data collected by EHRs along with their increasing availability have made them interesting resources for pharmacovigilance researchers and enabled them to detect ADE signals closer to real time.8 Yet, using EHR data involves challenges such as complex data preprocessing requirements and multiple standards across different databases.7
Social media
Recently, social media have been introduced as novel resources for conducting ADE as well as other healthcare studies. Virtual communities such as health forums (eg, DailyStrength and PatientsLikeMe) and social networks (eg, Twitter and Facebook) are places where people discuss their daily health-related experiences and concerns. Such information, although noisy, is likely to appear there long before it is reported to any SRS or recorded in any EHR,9,10 and this has made social media precious resources for early detection of ADEs.
Biomedical literature
Recently, researchers have recognized biomedical literature as well as chemical and biological databases as feature-rich sources for ADE studies. Databases such as PubMed, PubChem, KEGG, and DrugBank are rich sources of information about drugs, their chemical and biological characteristics, and their identified ADEs.
ADE studies: detection, prediction, and understanding
Due to considerable potential costs and damages of ADEs, in the past decades, there has been a great deal of research on this issue in many disciplines, including pharmacology, economics, and information systems. While the ultimate goal of all of these studies is to identify drugs’ potential ADEs and prevent losses of lives and money thereof, they pursue different tools and strategies to achieve that goal. We believe that ADE studies can be classified into 3 distinct categories, namely, detection, prediction, and understanding.
Detection studies are the largest group of ADE research works focused on finding new and undetected ADE signals (ie, associations, not necessarily causal) between the existent drugs (already in the market) and adverse events. The signals detected by these studies need to be assessed and verified by clinical trials. ADE detection studies heavily rely on applying statistical11,12 or data mining5–8,13–16 methods and quasi-experimental settings to the historical data from SRSs,7,11,12,17 EHRs,6,8,14,15,18 or social media5,13,16,19,20 to extract signals from them.
In the ADE prediction studies, on the other hand, instead of detecting signals for the existent drugs using collected data from their past usage experiences, the focus is on creating signals for the new drugs before they cause any adverse events to patients. The strategy in this group of studies is mainly to find similarities between the existent and the new drugs and thereby to predict ADEs for the new drugs given the already known relationships between their similar existent drugs with the corresponding ADEs. The statistical regression-based methods21,22 as well as machine-learning (ML) techniques23–25 are the dominant methods used by researchers for this purpose. Also, in terms of data sources, prediction studies heavily rely on the biomedical literature as well as drug databases, including chemical, physical, and biological information of drugs, as such resources enable them to identify drug similarities. Just like ADE detection studies, this group of studies also serves as a signal detector, but the difference is they capture signals for new drugs as well.
The last group of ADE studies in our taxonomy is those focusing on verifying ADE signals and understanding the mechanism through which the drug causes the ADE. Pharmacoepidemiology and pharmacometrics studies fall into this group, as they use mathematical and parametric models of biology, pharmacology, and physiology to clarify and understand mechanisms of both beneficial and adverse molecular interactions.2 Several different types of models have been used by researchers in this group, among which pharmacokinetics and pharmacodynamics26–30 are the most popular modeling approaches. The former focuses on modeling how the organism affects the drug, whereas the focus in the latter is on studying the effect of the drug on the organism; researchers therefore usually employ them together, as complements to each other, to determine optimal dosing as well as the beneficial and adverse effects of drugs. In terms of data sources, this group of studies mostly relies on drug databases and EHR historical transactions.
Network analysis and pharmacovigilance
Although network analysis (NA) has been widely used in many areas of science, including sociology, communication, biology, economics, and computer science starting from a few decades ago, its application in pharmacovigilance studies is hardly older than 10 years. The main reason for that could be the lack of appropriate information systems and infrastructures for collecting the data required for constructing networks in large scale before the early 2000s.
Networks have been used in pharmacovigilance research with a variety of data sources and for different purposes (not limited to ADE prediction, which is the case in our study). Some researchers, including Ball et al31 and Botsis et al,32 used network representations of vaccines and their reported ADEs in the FDA’s VAERS to identify the frequent patterns of interactions. Also Zhang et al33 showed that patterns identified in vaccine–vaccine networks can contribute to the vaccine ontology knowledge base. A recent study by Kim et al34 on hospitalized patients with hematologic malignancies revealed that network centrality metrics can be used to identify the most important causes for drug-related problems (DRPs) by constructing a cause-DRP network using ward pharmacists’ documentations in hospital settings.
Apart from the mentioned studies that have used descriptive and qualitative techniques to extract information/knowledge from networks, there are also a few studies focused on using networks of drugs and ADEs for predicting their associations. For instance, Atias and Sharan22 and Cami et al21 in their studies used a diffusion process and a logistic regression model, respectively, with NA to make ADE predictions. Nevertheless, to the best of our knowledge, NA has not been combined with ML methods in the literature so far for the prediction purposes, and the present study is the first one to do so.
Research goals
While statistical and ML techniques have been widely used with various data sources for pharmacovigilance prediction purposes,19,25,35–38 we found only a few studies that have utilized the incredible potential of network analysis approaches to explore drug-ADE associations. Specifically, Atias and Sharan22 applied a network-based diffusion process to predict drugs’ ADEs. Also, in a later study, Cami et al21 employed a logistic regression (LR) technique in a network approach using data from biomedical literature and chemical databases to predict drug-ADE associations.
We extend the ADE prediction research by employing a chronological pharmacovigilance network (CPN) along with machine-learning techniques to predict drugs’ ADEs. For this purpose, we use biomedical literature citations as the main source of data for extracting previously identified drug-ADE associations. Additionally, we incorporate information about the target proteins of drugs into our network structure to make it more informative for training machine-learning algorithms.
A target protein is a chemically definable molecular structure that will undergo a specific interaction with chemicals that we call drugs because they are administered to treat or diagnose a disease.39 In other words, drugs act by binding to specific target proteins and changing their biochemical or biophysical activities to treat their indicated diseases.40 Given that, we argue that knowledge about the similarity of drugs, in terms of the proteins they target, can contribute to the quality of ADE predictions. Moreover, we believe that drug-ADE relationships are complex enough that machine-learning algorithms, and especially ensemble models, capture them more effectively than statistical methods (eg, LR).
MATERIALS AND METHODS
Materials
We integrated data from 2 sources, namely, National Library of Medicine’s (NLM) MEDLINE and the DrugBank’s database of drug-target proteins, in order to operationalize our approach towards modeling of the CPN. MEDLINE, a subset of the PubMed database, is a bibliographic database of biomedical information from multiple disciplines that includes more than 29M citations starting from 1946. What sets MEDLINE apart from the rest of PubMed is the added value of using the NLM-controlled vocabulary, Medical Subject Headings (MeSH), for indexing, cataloging, and searching for biomedical documents. Also, DrugBank is a freely accessible online drug database, including biological, chemical, and genetic information of 10 986 approved and experimental drugs.
First, we selected a sample of 8 common and high-risk ADEs reported in the literature8 (acute renal failure, myocardial infarction, leukopenia, agranulocytosis, rhabdomyolysis, neutropenia, thrombocytopenia, and anemia) and collected all MEDLINE articles mentioning at least 1 of them as the ADE identified in the article. To this end, we used a search strategy based on NLM’s MeSH thesaurus (see the Appendix). NLM indexers select the most appropriate MeSH indexes to summarize the full content of an article after reading the full text.41
The initially downloaded dataset involved 10 890 unique publications mentioning associations among 657 drugs with 769 ADEs. However, considering only drugs approved by the FDA by December 2017, we ended up with a dataset including 9672 publications, 582 drugs, and 732 ADEs (ie, the 8 original high-risk ADEs as well as 724 others that were mentioned in the included articles).
Second, we used the DrugBank42 database to extract target proteins associated with each FDA-approved drug. While most drugs target only a few proteins in the human body, some have many targets.40 In addition to the 582 drugs in the initial dataset, we included information about 217 other drugs having at least 1 common target with 1 of those 582 drugs. Therefore, the integrated dataset used in the study involved 799 approved drugs and 732 ADEs. The publication years as well as the drugs’ approval dates were also imported into our data to be used in constructing training and validation datasets for the model-building stage. All of the drugs and ADEs were then mapped to their unique terms from the NLM’s Unified Medical Language System (UMLS) for consistency.
Method
Network construction
A chronological approach was employed to construct drug-drug and drug-ADE relationships in the network. The ultimate goal in pharmacovigilance is to identify as many ADEs as possible in the pre-marketing phase. Hence, in order to have a valid prediction model, one is allowed to use only drug information and known drug-ADE associations that are available prior to the time of the drug approval. Given this idea and using the dates of publications and drug approvals, we used all of the information available prior to 2001 to predict drug-ADE associations for the drugs marketed during 2001 to 2017.
First, a network was constructed in which both drugs and ADEs were considered as vertices. An undirected edge was created between 2 drugs if they had at least 1 common target protein. Additionally, a drug was connected to an ADE in the network if there was at least 1 PubMed article published before 2001 mentioning such association. The network involved all of the 799 drug vertices (regardless of their approval dates) and 10 094 drug-drug edges indicating common target proteins, as well as 5264 drug-ADE edges representing pre-2001 identified associations. We kept aside drug-ADE relationships recognized (for the first time) during 2001 to 2017 to validate our prediction model, as they were unknown at the time of prediction (ie, beginning of 2001). Figure 1 provides a visualization of the network created.
Figure 1.
Drug-ADE network created by Cytoscape v3.6. Triangle (blue) nodes represent drugs, and circular (orange) nodes represent ADEs. Yellow links between drugs indicate the existence of at least one common target protein shared by the connected drugs. Also, gray links between drugs and ADEs indicate an association mentioned in at least one PubMed article for the corresponding drug and ADE.
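The two edge rules described above can be illustrated with a minimal sketch. The drug names, target proteins, and report years below are hypothetical toy values, and the adjacency-map representation is an assumption made for illustration; it is not the tooling used in the study.

```python
from itertools import combinations

# Hypothetical toy inputs; names and values are illustrative only.
drug_targets = {
    "drug_A": {"P1", "P2"},
    "drug_B": {"P2"},
    "drug_C": {"P3"},
}
# (drug, ADE, publication year) triples, as might be extracted from MEDLINE.
reports = [
    ("drug_A", "anemia", 1995),
    ("drug_B", "anemia", 1998),
    ("drug_B", "neutropenia", 2004),  # published after the cutoff, so held out
]
CUTOFF_YEAR = 2001


def build_network(drug_targets, reports, cutoff_year):
    """Return an undirected adjacency map {node: set(neighbors)}."""
    adjacency = {drug: set() for drug in drug_targets}
    # Drug-drug edge: the two drugs share at least one target protein.
    for d1, d2 in combinations(sorted(drug_targets), 2):
        if drug_targets[d1] & drug_targets[d2]:
            adjacency[d1].add(d2)
            adjacency[d2].add(d1)
    # Drug-ADE edge: at least one article published before the cutoff year.
    for drug, ade, year in reports:
        if year < cutoff_year and drug in adjacency:
            adjacency[drug].add(ade)
            adjacency.setdefault(ade, set()).add(drug)
    return adjacency


network = build_network(drug_targets, reports, CUTOFF_YEAR)
```

In this toy example, the post-cutoff report never enters the network, which mirrors how the associations first recognized during 2001 to 2017 are kept aside for validation.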
Network metrics
Drug-ADE links were considered as the unit of analysis in this study. Since the focus of our study was on a set of 8 common and critical ADEs, we created our dataset by considering all possible combinations of the 799 drugs with those ADEs (ie, 6392 records). Once the network was constructed, we extracted 7 similarity-based and 4 centrality-based metrics for each record to be used as link predictors. These metrics have been proposed in the network analysis literature for link prediction purposes.21,43,44
The 4 centrality-based metrics we used were the absolute difference, ratio, product, and sum of the degree centralities of the drug and ADE vertices involved in each link. All of these metrics were used in similar studies21,43 to capture assortativity1 (absolute difference and ratio) and preferential attachment2 (sum and product).
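As a rough illustration, the degree-based predictors named here and in Table 4 can be computed from an adjacency map as follows. The toy network and the function name are hypothetical, and the direction of the ratio (drug degree divided by ADE degree) is an assumption, since the text does not specify it.

```python
# Hypothetical toy network: {node: set(neighbors)}.
adjacency = {
    "drug_A": {"drug_B", "anemia"},
    "drug_B": {"drug_A", "anemia"},
    "drug_C": set(),
    "anemia": {"drug_A", "drug_B"},
}


def degree_metrics(adjacency, drug, ade):
    """Degree-based link predictors for one drug-ADE pair."""
    d_deg = len(adjacency[drug])
    a_deg = len(adjacency[ade])
    return {
        "abs_degree_diff": abs(d_deg - a_deg),
        # Assumption: ratio is drug degree over ADE degree.
        "degree_ratio": d_deg / a_deg if a_deg else float("nan"),
        "degree_sum": d_deg + a_deg,
        "degree_product": d_deg * a_deg,
    }


pair_metrics = degree_metrics(adjacency, "drug_A", "anemia")
isolated_metrics = degree_metrics(adjacency, "drug_C", "anemia")
```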
Table 1 lists the similarity-based predictors extracted from the network along with their definitions. While all of the similarity metrics are defined based on the notion of commonality of neighborhoods between the 2 nodes of interest, each reflects a different aspect of similarity. In these definitions, Γ(i) and D_i denote the set of neighbors and the degree of node i, respectively. Also, d and a denote a drug and an ADE, respectively. Therefore, Γ(d) ∩ Γ(a) refers to the set of common neighbors of a drug and an ADE; similarly, Γ(d) ∪ Γ(a) refers to the set of all of their neighbors.
Table 1.
Similarity metrics and their formulaic definitions
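For reference, the 5 standard indices named in this study (Jaccard, Dice, geometric, Simpson, and Adamic/Adar) are commonly defined in the link-prediction literature as shown below. These are the textbook forms written in the neighbor-set and degree notation introduced above; the exact variants used in the study may differ in detail (for example, in the logarithm base of the Adamic/Adar index).

```latex
\begin{align}
\mathrm{Jaccard}(d,a)     &= \frac{|\Gamma(d)\cap\Gamma(a)|}{|\Gamma(d)\cup\Gamma(a)|} \\
\mathrm{Dice}(d,a)        &= \frac{2\,|\Gamma(d)\cap\Gamma(a)|}{D_d + D_a} \\
\mathrm{Geometric}(d,a)   &= \frac{|\Gamma(d)\cap\Gamma(a)|^{2}}{D_d\,D_a} \\
\mathrm{Simpson}(d,a)     &= \frac{|\Gamma(d)\cap\Gamma(a)|}{\min(D_d,\,D_a)} \\
\mathrm{Adamic/Adar}(d,a) &= \sum_{z\,\in\,\Gamma(d)\cap\Gamma(a)} \frac{1}{\log D_z}
\end{align}
```

All five grow with the number of common neighbors and differ only in how that count is normalized, or, for Adamic/Adar, in how each common neighbor is weighted by its own degree.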
The network metrics were obtained with the help of the igraph package in R, a comprehensive package for network analysis.
Apart from the 5 standard similarity metrics mentioned, we also incorporated 2 derived similarity metrics for each drug-ADE pair. First, for each drug-ADE pair, we calculated the average Jaccard similarity of the corresponding drug with all of the drugs connected to the ADE. To calculate this variable, we constructed a network including only the drugs (and no ADEs) and extracted the Jaccard similarity of each drug with each of those connected drugs. We believe that such a variable reflects how chemically similar a new drug is, in general, to the drugs connected to the ADE, and therefore how likely it is to cause the same ADE as they do. Based on the same logic and in a similar manner, for each drug-ADE pair in our dataset, we also incorporated the average distance from the corresponding drug to all of the drugs connected to the ADE (ie, the second derived variable). While the first derived variable captures the general similarity of each drug with the connected drugs based on their direct neighborhoods, the second one takes into account indirect links as well.
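The 2 derived variables can be sketched on a toy drug-only network as follows. The graph, the drug names, and the function names are hypothetical. How unreachable drugs are treated when averaging distances is not stated in the text, so skipping them here is an assumption.

```python
from collections import deque

# Hypothetical drug-only network (an edge means a shared target protein).
drug_graph = {
    "new_drug": {"drug_A"},
    "drug_A": {"new_drug", "drug_B"},
    "drug_B": {"drug_A"},
}
# Hypothetical list of drugs already linked to the ADE of interest.
drugs_linked_to_ade = ["drug_A", "drug_B"]


def jaccard(graph, u, v):
    """Jaccard similarity of the neighbor sets of u and v."""
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0


def shortest_path_length(graph, source, target):
    """Breadth-first search; returns None if target is unreachable."""
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    return None


def derived_metrics(graph, drug, linked_drugs):
    """Average Jaccard similarity and average distance to the linked drugs."""
    others = [d for d in linked_drugs if d != drug]
    jaccs = [jaccard(graph, drug, d) for d in others]
    dists = [shortest_path_length(graph, drug, d) for d in others]
    reachable = [x for x in dists if x is not None]  # assumption: skip unreachable
    avg_jacc = sum(jaccs) / len(jaccs) if jaccs else 0.0
    avg_dist = sum(reachable) / len(reachable) if reachable else None
    return avg_jacc, avg_dist


avg_jacc, avg_dist = derived_metrics(drug_graph, "new_drug", drugs_linked_to_ade)
```

In this toy case the new drug has no direct neighborhood overlap with drug_A but is still 1 step away from it, which shows how the distance-based variable picks up indirect links that the Jaccard-based variable misses.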
In the end, a binary target variable was created for each drug-ADE pair to indicate whether that association actually exists according to the MEDLINE citations.
Training and validation data
Once we formed the dataset using the network, we applied the following rules to divide the dataset into training and validation subsets to train the prediction models and test their efficiency.
Drug-ADE pairs whose association was first reported after 2001, regardless of the drug approval year, were placed into the validation dataset. All remaining pairs involving drugs approved after 2001 were also added to the validation set. All other pairs formed the training dataset. Applying these rules, we ended up with a training dataset containing 5357 records with 1087 (ie, 20.3%) positive responses (target = 1). Also, the validation set contained 1035 records with a response rate of 14.6% (ie, 151 positives).
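The partitioning rules above can be expressed as a small function. This is a sketch under stated assumptions: the function and argument names are hypothetical, and "after 2001" is read as "in 2001 or later", in line with the use of pre-2001 information for training.

```python
CUTOFF_YEAR = 2001  # assumption: "after 2001" is read as year >= 2001


def assign_subset(first_report_year, approval_year, cutoff=CUTOFF_YEAR):
    """Assign a drug-ADE pair to 'training' or 'validation'.

    first_report_year is None when no association was ever reported.
    """
    discovered_late = first_report_year is not None and first_report_year >= cutoff
    approved_late = approval_year >= cutoff
    return "validation" if (discovered_late or approved_late) else "training"


examples = {
    "old drug, old association": assign_subset(1995, 1990),
    "old drug, new association": assign_subset(2005, 1990),
    "new drug, no association": assign_subset(None, 2003),
    "old drug, no association": assign_subset(None, 1980),
}
```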
Prediction model
We used the training dataset to train and build our prediction models. Four different classification algorithms were employed, namely, artificial neural network (ANN), gradient boosted trees (GBTs), random forests (RFs), and LR.
Due to the unbalanced proportion of positive and negative responses in the training data, the synthetic minority oversampling technique (SMOTE)50 was applied to create a balanced training (model-building) dataset, thereby avoiding biases in the training of the models. The KNIME analytics platform version 3.5.1 (a free and open source analytics software platform) was used to build the classification models. Figure 2 shows a flowchart-like graphical depiction of the data preparation and model-building methods and procedures.
Figure 2.
Flowchart-like graphical depiction of the methods and procedures.
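The core idea of SMOTE, generating synthetic minority-class records by interpolating between a minority record and one of its nearest minority neighbors, can be sketched as below. This is a simplified illustration, not the KNIME node used in the study; the function name, the toy points, and the values of k and the seed are all assumptions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Generate n_synthetic points by interpolating between minority neighbors.

    Requires at least two minority points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point (squared Euclidean distance).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Hypothetical minority-class feature vectors (eg, two network metrics per pair).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Because every synthetic point lies on a segment between two real minority points, oversampling adds positives without simply duplicating existing records.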
RESULTS
Models accuracy
Table 2 shows the prediction results of the best models of each algorithm on the validation data. As shown, RF and GBT, the 2 ensemble-type algorithms, provided more accurate results than ANN and LR.3 Also, overall, GBT turned out to be the best model among all with an overall accuracy of 92.8% and the ability to correctly predict 72.8% of real drug-ADE associations in the validation data (ie, sensitivity). This suggests that given historical information about drug-ADE associations as well as the target proteins of drugs, our best model was able to predict 110/151 (ie, 72.8%) of drug-ADE associations that were actually discovered during a 17-year period after building the prediction network. In addition, the positive predictive value (PPV) for the GBT model indicates that out of 143 pairs predicted as associations by this model, 110 (ie, 76.9%) were real associations reported in MEDLINE. Also overall, the PPV values highlight the superiority of the 2 ensemble models over the individual models (ie, ANN and LR) in which only around half of the positive predictions were correct.
Table 2.
Prediction models’ accuracy statistics
Model | Accuracy | Sensitivity | PPV | AUROC |
---|---|---|---|---|
ANN | 85.5% | 65.6% | 50.3% | 0.868 |
RF | 92.1% | 64.9% | 77.2% | 0.893 |
GBT | 92.8% | 72.8% | 76.9% | 0.916 |
LR | 85.7% | 56.3% | 50.9% | 0.793 |
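As a worked check, the GBT row of Table 2 can be recovered from the counts given in the text (1035 validation records, 151 actual associations, 143 positive predictions, 110 true positives). The remaining confusion-matrix cells are derived here by subtraction; the variable names are illustrative.

```python
# Counts reported in the text for the GBT model on the validation set.
n_validation = 1035
actual_positives = 151
predicted_positives = 143
true_positives = 110

# Remaining confusion-matrix cells, derived by subtraction.
false_positives = predicted_positives - true_positives
false_negatives = actual_positives - true_positives
true_negatives = n_validation - actual_positives - false_positives

accuracy = (true_positives + true_negatives) / n_validation
sensitivity = true_positives / actual_positives
ppv = true_positives / predicted_positives

print(f"accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} ppv={ppv:.4f}")
```

The derived sensitivity and PPV match the reported 72.8% and 76.9%, and the derived accuracy (961/1035, about 92.85%) agrees with the reported 92.8% to within rounding.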
In the only similar study we are aware of in the literature, conducted by Cami et al,21 historical drug-ADE associations along with drugs’ taxonomical and intrinsic properties (eg, molecular weight, atom count, and so on) from pre-2005 years were used in multiple LR models to predict associations identified during 2005 to 2010. Compared to their best model (Area Under Receiver Operating Characteristic curve [AUROC] = 0.869), 2 of our prediction models (RF and GBT) provide superior results, while prediction power of our ANN model is also comparable to theirs.
Our further investigation revealed that from the 110 true positive predictions made by the GBT model, 29 were related to post-2001 marketed drugs, which, given 42 actual positive associations, means a 69% true positive rate for these new drugs. The true positive rate for older drugs was 74.3% (ie, 81/109 actual associations). Moreover, it turned out that out of 143 positive predictions, 102 were related to pre-2001 marketed drugs, which (given that 81 of the true positive cases were pre-2001 marketed drugs) suggests a PPV of 79.4% (81/102) for this group. Also, 41 positive predictions were related to post-2001 marketed drugs, resulting in a PPV of 70.7% (ie, 29/41). These statistics seem reasonable, given that the higher number of historical publications about the older drugs makes the model better trained for classifying their associations.
Furthermore, in terms of sensitivity, our approach outperforms Cami et al’s, as their best-reported model had a sensitivity of 61.2% compared to 72.8% for our model. While this difference might be argued to be due to the narrower focus of our study (ie, including 8 ADEs), we believe it mostly has to do with the more informative nature of the network we used to train our models as well as the ability of ML techniques to capture complex/nonlinear relationships compared to statistical methods such as LR. As Table 2 shows, our LR model did not perform as well as the other 3 ML methods. Nevertheless, it is still comparable and complementary to the models provided by Cami et al.21 Even comparing our results to those of the studies that have employed ML techniques (mostly using drugs’ structural variables as predictors) with a non-network approach,23–25 our approach outperforms theirs in terms of most of the accuracy statistics. Table 3 indicates that, especially in terms of sensitivity and PPV, using an ensemble ML model along with the network approach has significantly improved ADE predictions.
Table 3.
Comparison of model results with the best results reported by similar studies
Article | Network approach | Model | Chem. | Bio. | Other | Acc | Sens | PPV | AUROC |
---|---|---|---|---|---|---|---|---|---|
Liu et al.24 | No | SVM | Yes | Yes | Yes | 0.967 | 0.631 | 0.662 | 0.952 |
Huang et al.23 | No | SVM | Yes | Yes | No | NR | NR | NR | 0.760 |
Cami et al.21 | Yes | LR | Yes | No | Yes | NR | 0.608 | NR | 0.869 |
Huang et al.25 | No | SVM | No | Yes | No | 0.675 | 0.632 | NR | 0.771 |
Present study | Yes | GBT | No | Yes | No | 0.928 | 0.728 | 0.769 | 0.916 |
Chem. indicates whether chemical features of drugs are used for ADE prediction. Bio. indicates whether biological features of drugs are used for ADE prediction. Other indicates whether other features (eg, taxonomical, phenotypical, etc.) of drugs are used for ADE prediction.
Variables importance
Having the superior prediction model identified, we further investigated how each of the predictors contributed to the model accuracy. To this end, we dropped predictor variables 1 at a time from our data and ran the best prediction model. Each time, we recorded the model’s AUROC to be compared to that of the original model. Table 4 indicates the amount of decrease in AUROC after dropping each predictor along with the relative importance of variables based on normalized AUROC differences.
Table 4.
Variable importance statistics
Dropped variable | New AUROC | AUROC_diff | Relative importance |
---|---|---|---|
Degree_product | 0.86 | 0.056 | 1 |
Degree_ratio | 0.864 | 0.052 | 0.875 |
Degree_sum | 0.872 | 0.045 | 0.656 |
Geometric index | 0.884 | 0.032 | 0.250 |
Avg_Jacc_connected | 0.885 | 0.031 | 0.219 |
Adamic/Adar index | 0.887 | 0.029 | 0.156 |
Simpson index | 0.887 | 0.029 | 0.156 |
Dice index | 0.888 | 0.028 | 0.125 |
Jaccard index | 0.889 | 0.027 | 0.094 |
Avg_dist_connected | 0.891 | 0.025 | 0.031 |
Abs_degree_diff | 0.892 | 0.024 | 0 |
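The relative importance column appears consistent with a min-max normalization of the AUROC decreases, which is an inference from the table values and not a formula stated in the text. A minimal sketch of that normalization follows, using a subset of the rounded values in Table 4; the function name is hypothetical, and because the inputs are rounded, not every table entry is reproduced exactly this way.

```python
BASELINE_AUROC = 0.916  # AUROC of the full GBT model (Table 2)

# AUROC after dropping each predictor (subset of Table 4, rounded values).
auroc_without = {
    "Degree_product": 0.860,
    "Degree_ratio": 0.864,
    "Geometric index": 0.884,
    "Abs_degree_diff": 0.892,
}


def relative_importance(auroc_without, baseline):
    """Min-max normalize the AUROC decrease caused by dropping each predictor."""
    diffs = {name: baseline - auc for name, auc in auroc_without.items()}
    lo, hi = min(diffs.values()), max(diffs.values())
    return {name: (diff - lo) / (hi - lo) for name, diff in diffs.items()}


importance = relative_importance(auroc_without, BASELINE_AUROC)
```

Under this reading, a relative importance of 0 does not mean a predictor is useless; it only marks the predictor whose removal reduced the AUROC the least.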
As shown in this table, Degree_product, Degree_ratio, and Degree_sum, representing preferential attachment as well as assortativity of drug-ADE pairs, turned out to have the highest contribution to the predictive power of the best (ie, GBT) model. This suggests that our centrality-based predictors generally played a more important role than similarity-based metrics. Of the 3 top predictors, 2 (Degree_product and Degree_sum) were also among the top 3 in the study performed by Cami et al.21 Degree_product was also identified as a strong predictor in the work conducted by Liben-Nowell & Kleinberg.43 Interestingly, the results show that one of the derived variables, namely, Avg_Jacc_connected, was the fifth most important predictor with a relative importance of around 22%. Also consistent with prior research,21 Abs_degree_diff was the least important predictor of network links.
Finally, by investigating our true positive predictions and considering the actual years that corresponding drug-ADE associations were identified for the first time, we realized that, on average, our model was able to predict ADEs 3.84 years (SD = 1.97 years) before they were mentioned in PubMed articles. Table 5 summarizes the associations predicted by the model for the 8 ADEs of interest along with the top associated drug predicted for each. The “average probability” column in this table shows the average across all the real associations, not just those that were correctly predicted. The results show that, disregarding a few exceptions, the model performance in predicting associations across the ADEs of interest was roughly the same. This suggests generalizability of the proposed approach, as it has performed equally well with regard to various ADEs.
Table 5.
Summary of predicted associations by ADE
ADE | Real associations | Predicted associations | Average time saving (years) | Average probability | Top associated drug |
---|---|---|---|---|---|
Acute renal failure | 32 | 26 (81%) | 3.62 | 0.7655 | Ceftazidime (2)* |
Agranulocytosis | 12 | 10 (83%) | 5.70 | 0.8293 | Albendazole (6) |
Anemia | 9 | 6 (67%) | 2.67 | 0.7277 | Ribavirin (3) |
Leukopenia | 4 | 4 (100%) | 3 | 0.9436 | Dexamethasone (1) |
Myocardial infarction | 27 | 18 (67%) | 3.72 | 0.7035 | Doxazosin (5) |
Neutropenia | 17 | 12 (71%) | 3.83 | 0.7739 | Flucytosine (2) |
Rhabdomyolysis | 40 | 27 (68%) | 3.92 | 0.6910 | Doxylamine (0) |
Thrombocytopenia | 10 | 7 (70%) | 3.50 | 0.7517 | Tamoxifen (0) |
* Numbers in parentheses after the drug names in the last column indicate the number of years by which the model predicted their associations with the corresponding ADE earlier than they were published in PubMed.
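As a quick consistency check on Table 5, the minimal Python sketch below recomputes the column totals from the rows above. It assumes that the "Predicted associations" column counts correctly predicted (true positive) associations and that the overall time saving is the average of the per-ADE values weighted by those counts; both are readings of the table rather than statements from the authors.

```python
# Rows of Table 5: ADE -> (real associations, predicted associations,
# average time saving in years).
table5 = {
    "Acute renal failure": (32, 26, 3.62),
    "Agranulocytosis": (12, 10, 5.70),
    "Anemia": (9, 6, 2.67),
    "Leukopenia": (4, 4, 3.00),
    "Myocardial infarction": (27, 18, 3.72),
    "Neutropenia": (17, 12, 3.83),
    "Rhabdomyolysis": (40, 27, 3.92),
    "Thrombocytopenia": (10, 7, 3.50),
}

total_real = sum(real for real, _, _ in table5.values())
total_predicted = sum(pred for _, pred, _ in table5.values())
missed = total_real - total_predicted

# Average time saving weighted by the number of predicted associations.
weighted_saving = (
    sum(pred * years for _, pred, years in table5.values()) / total_predicted
)

print(total_real, total_predicted, missed, round(weighted_saving, 2))
```

Under these assumptions the totals are 151 real and 110 predicted associations, leaving 41 missed associations, and the weighted mean time saving is 3.84 years; both figures agree with the values quoted in the text.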
Analysis of prediction errors
Even though our prediction model performed well in terms of common accuracy metrics, it is always insightful to qualitatively analyze the cases that a model fails to accurately predict. Such an undertaking may involve both the drug-ADE pairs that were predicted to be associated while they actually were not (ie, the false positive cases) and the drug-ADE pairs that were actually associated whereas the model failed to predict their association correctly (ie, the false negative cases).
We found 41 false negative predictions made by the model. Our further investigation revealed that 20 (ie, around half) of them are related to drugs approved after 2008. More specifically, 6 drugs, all approved after 2008, account for 17 (ie, 41%) of the false negative predictions. We then looked into the known associated ADEs, other than the 8 ADEs of interest, for each of those 6 drugs before 2001 (ie, when they were still experimental drugs), which were used to train the prediction model. We found that, compared to the average (ie, 6.58), the number of known associations for most of those 6 drugs was considerably lower, with only 1 drug having more than 5 known ADEs. Given these findings, we believe that 1 main reason for the model making those false negative predictions could be the relatively low number of known ADE associations (ie, network edges) involving those drugs in the training dataset. Since we used only network metrics as the predictor variables, such a lack of drug-ADE edges may affect all of the predictor variables related to the corresponding drugs. Of course, 1 way to address this issue is to change the cutoff point for data partitioning (which is currently 2001), so that our training data include more of the known MEDLINE citations involving the more recently approved drugs. In the present study, however, changes in the cutoff year considerably affect the size of the validation dataset,4 which could jeopardize the validity of the prediction model.
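The trade-off described above can be illustrated with a minimal Python sketch of a chronological split. The record layout (drug, ADE, year of first mention), the treatment of the cutoff year itself as part of the validation period, and the toy records are all assumptions made for illustration; the study's actual partitioning rule may differ in detail.

```python
def chronological_split(pairs, cutoff_year=2001):
    """Split (drug, ade, first_mention_year) records at a cutoff year.

    Pairs first mentioned before the cutoff form the training set; the
    remaining pairs form the validation set (an assumed convention).
    """
    train = [p for p in pairs if p[2] < cutoff_year]
    validation = [p for p in pairs if p[2] >= cutoff_year]
    return train, validation


# Hypothetical toy records (not data from the study).
toy_pairs = [
    ("drugA", "anemia", 1995),
    ("drugA", "neutropenia", 1999),
    ("drugB", "anemia", 2002),
    ("drugC", "rhabdomyolysis", 2004),
    ("drugC", "anemia", 2010),
]

for cutoff in (2001, 2003):
    train, validation = chronological_split(toy_pairs, cutoff)
    print(cutoff, len(train), len(validation))
```

Moving the cutoff later shifts records from the validation set into the training set, which is why a later cutoff gives newer drugs more training edges at the cost of a smaller validation sample.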
Our predictions also involved 33 false positive cases. Again, to further investigate the potential causes of those classification errors, we looked into the specific drugs and ADEs involved. We found that around 61% (ie, 20) of these cases were related to relatively older drugs, approved in the early 1990s or even earlier. For such drugs, due to the numerous biomedical studies conducted on them over time, the number of known ADE associations and consequently their degree centrality in the network tend to be higher than those of newer drugs. This directly inflates the centrality-based predictors of drug-ADE pairs, namely, Degree_ratio, Degree_sum, and Degree_product. Moreover, as shown earlier, these were the 3 most influential predictors of network links in our study. Table 6 compares the average values of these 3 predictors for the false positive, true positive, and true negative cases. Clearly, the predictor values in the false positive cases are far from those of the true negative cases and very close to those of the cases correctly predicted as positive.
Table 6.
Comparing top predictors’ values in false positive, true positive, and true negative predictions
Predictor | False positives | True positives | True negatives |
---|---|---|---|
Degree_sum | 234.82 | 237.55 | 203.05 |
Degree_product | 7131.76 | 7615.02 | 3344.65 |
Degree_ratio | 0.21 | 0.23 | 0.11 |
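A comparison of the kind shown in Table 6 can be produced by grouping validation records by prediction outcome and averaging a predictor within each group. The sketch below is a minimal illustration; the record fields (`y_true`, `y_pred`) and the toy values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict
from statistics import mean


def outcome(y_true, y_pred):
    """Label a prediction as TP, FP, TN, or FN (1 = association exists)."""
    if y_pred == 1:
        return "TP" if y_true == 1 else "FP"
    return "FN" if y_true == 1 else "TN"


def mean_by_outcome(records, feature):
    """Average one predictor within each prediction-outcome group."""
    groups = defaultdict(list)
    for rec in records:
        groups[outcome(rec["y_true"], rec["y_pred"])].append(rec[feature])
    return {name: mean(values) for name, values in groups.items()}


# Hypothetical toy validation records.
toy_records = [
    {"y_true": 1, "y_pred": 1, "Degree_sum": 240},
    {"y_true": 1, "y_pred": 1, "Degree_sum": 230},
    {"y_true": 0, "y_pred": 1, "Degree_sum": 236},
    {"y_true": 0, "y_pred": 0, "Degree_sum": 200},
    {"y_true": 0, "y_pred": 0, "Degree_sum": 210},
    {"y_true": 1, "y_pred": 0, "Degree_sum": 190},
]
summary = mean_by_outcome(toy_records, "Degree_sum")
print(summary)
```

In this toy example, as in Table 6, the false positive group sits close to the true positive group and well above the true negative group on the degree-based predictor.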
Overall, our findings suggest that for older drugs, the centrality-based predictor values are inflated by the higher number of citations involving them, and that the other predictor variables are not informative enough to help the model distinguish those cases from the real positive cases. Hence, incorporating some network-independent covariates suggested in the literature (eg, molecular or chemical features of drugs) could probably address this issue to some extent and help the model to better differentiate between positive and negative cases.
DISCUSSION AND CONCLUSION
In this study, we proposed a new approach to predict ADEs by constructing drug-ADE networks, using biomedical citations as well as drug target proteins information, and then employing network metrics as predictors of associations in ML algorithms.
While both NA approaches and ML techniques had been employed separately in the past, to the best of our knowledge, the present study is the first to employ ML together with an NA approach in a single study. The promising results we obtained suggest that combining these 2 powerful tools can improve on the results obtained from each separately. Our proposed approach outperformed the prior studies (see Table 3), even though the number of predictor variables used in this study is lower than that of similar studies.
We believe that part of these superior results is owed to the power of ensemble ML algorithms. As shown in our results, the 2 ensemble algorithms (RF and GBT) considerably outperformed the other 2 approaches, most likely because ensemble algorithms are better able to capture sophisticated patterns in the data. While statistical and regular ML techniques train a single model (either linear or nonlinear) to reflect the relationship between the variables, ensemble algorithms build hundreds of prediction models and combine them. RF fits each tree to a different random sample of the data and, to predict a new case, takes a vote among the trees; GBT fits its trees sequentially, with each tree correcting the errors of those before it, and sums their outputs. Either way, instead of a single model, which is subject to sample randomization errors, many models are employed to yield predictions. Given the superior performance of GBT in this study, future research may put more emphasis on employing GBT as well as other ensemble algorithms, such as extreme gradient boosting (XGBoost) and ensemble Bayesian models, for classification and prediction purposes in pharmacovigilance studies.
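The sample-and-vote idea described above can be illustrated with a minimal pure-Python sketch of bagging. It is not the KNIME RF or GBT configuration used in this study: one-split decision stumps stand in for full trees, the function names are hypothetical, the toy data are invented, and boosting (which adds models sequentially rather than voting over independent samples) is not shown.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split rule (feature j, threshold t) with the fewest errors."""
    best = None
    for j in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[j]
            for above in (0, 1):  # label predicted when x[j] > t
                errors = sum(
                    (above if xi[j] > t else 1 - above) != yi for xi, yi in rows
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    _, j, t, above = best

    def predict(x):
        return above if x[j] > t else 1 - above

    return predict


def fit_bagged_stumps(rows, n_models=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(rows) for _ in rows]) for _ in range(n_models)]


def predict_vote(models, x):
    """Final prediction is the majority vote across all models."""
    return Counter(m(x) for m in models).most_common(1)[0][0]


# Hypothetical toy data: one feature, label 1 when the feature is large.
toy_rows = [((1,), 0), ((2,), 0), ((3,), 0), ((8,), 1), ((9,), 1), ((10,), 1)]
models = fit_bagged_stumps(toy_rows)
print(predict_vote(models, (0.5,)), predict_vote(models, (9.5,)))
```

Individual stumps trained on unlucky samples can place their threshold poorly, but the vote across many stumps is far less sensitive to any one sample, which is the property the paragraph above attributes to ensembles.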
The results also suggest that assortativity and preferential attachment (ie, centrality-based metrics) are better predictors of network edges than similarity-based metrics (eg, the Jaccard coefficient). This is in line with the results from Cami et al21 and Liben-Nowell & Kleinberg.43 Additionally, we introduced 2 derived similarity-based network metrics, namely, Avg_Jacc_connected and Avg_Distance_connected, for predicting network edges, and the former turned out to be among the top 5 most important predictors. In terms of relative importance, Table 4 shows that this derived variable contributed to the quality of the model around 50% more than the Adamic/Adar index and around 100% more than the Jaccard index, 2 popular similarity-based metrics. This suggests that considering the similarity of a drug with the drugs already associated with an ADE provides more useful information in predicting drug-ADE associations than considering the similarity of that drug with the ADE itself.
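One possible reading of Avg_Jacc_connected is sketched below in Python: for a candidate drug-ADE pair, average the Jaccard coefficient between the drug and every other drug already linked to that ADE. This definition, the use of generic neighbor sets as the basis of the Jaccard coefficient, and the names `neighbors` and `ade_drugs` are assumptions made for illustration, and the toy data are invented.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def avg_jacc_connected(drug, ade, neighbors, ade_drugs):
    """Mean Jaccard similarity between `drug` and the drugs linked to `ade`.

    neighbors: dict mapping each drug to its set of network neighbors.
    ade_drugs: dict mapping each ADE to the drugs already associated with it.
    """
    others = [d for d in ade_drugs.get(ade, ()) if d != drug]
    if not others:
        return 0.0
    total = sum(jaccard(neighbors[drug], neighbors[d]) for d in others)
    return total / len(others)


# Hypothetical toy data: neighbor sets here are shared target proteins.
neighbors = {
    "drugA": {"P1", "P2"},
    "drugB": {"P2", "P3"},
    "drugC": {"P1", "P2", "P4"},
}
ade_drugs = {"anemia": ["drugB", "drugC"]}

score = avg_jacc_connected("drugA", "anemia", neighbors, ade_drugs)
print(round(score, 3))
```

The point of such a metric is that it compares a drug with other drugs, which are nodes of the same kind, rather than with the ADE node itself.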
Although the present study is particularly focused on 8 highly common and risky ADEs, we argue that the high accuracy of our predictions does not depend on that choice, because we did not incorporate any information about the ADEs or their relationships in building our prediction models. All of the information used to train our prediction models consisted of historical drug-ADE associations and drug target proteins. Hence, we believe that replicating our approach on a larger scale and with a higher number of ADEs would yield results of comparable, if not better, quality.
Another limitation of this study is that it does not account for the strength of drug-ADE associations in the construction of the network. In network analysis, using the strengths of associations as the linkage weights and extracting weighted metrics is a popular and informative approach, provided that the weights are assigned to the links in a meaningful way. Using the frequency of citations mentioning a given association as the strength of that association is not a sound or even meaningful way of weighting the network edges, because this frequency does not necessarily reflect the strength of association and might very well be due, for instance, to the high amount of risk involved in the corresponding ADE. Therefore, in this study, we used an unweighted network for the analysis. Future research could extend our approach by developing a way to score drug-ADE associations and using weighted network metrics in building the prediction models.
While our best model performed well in terms of sensitivity, it still made 33 false positive and 41 false negative predictions. Even though we analyzed some potential reasons for these prediction errors, we suspect that a portion of the false positive cases, especially those involving recently approved drugs, might actually be real drug-ADE associations that have not yet been studied and mentioned in biomedical citations. This could also be the case in other ADE prediction studies where the models yield a considerable number of false positives. Future research may focus on such cases resulting from ADE predictions and try to investigate them using clinical trials or by analyzing patient transactions from EHR data using methods such as prescription sequence symmetry analysis.51,52
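The error counts above can be turned into summary metrics with a short calculation. The sketch below assumes 110 true positives (the sum of the "Predicted associations" column of Table 5) and a validation set of 1035 records (inferred from the footnote stating that a 2003 cutoff leaves 732 records, a decrease of 303). Both figures are reconstructions rather than numbers stated directly by the authors, so the derived metrics may differ from those reported elsewhere in the paper.

```python
# Counts taken or reconstructed from the text (see assumptions above).
tp = 110                 # assumed: sum of predicted associations in Table 5
fn = 41                  # false negatives reported in the text
fp = 33                  # false positives reported in the text
n_validation = 732 + 303  # assumed validation size, inferred from the footnote

tn = n_validation - tp - fn - fp

accuracy = (tp + tn) / n_validation
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
precision = tp / (tp + fp)

print(f"TN={tn} accuracy={accuracy:.4f} sensitivity={sensitivity:.4f} "
      f"specificity={specificity:.4f} precision={precision:.4f}")
```

Under these assumptions the accuracy comes out at about 92.85%, in line with the overall accuracy of 92.8% reported for the validation sample.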
Given the relatively high accuracy of predictions resulting from employing a network approach, both in this study and in the few other similar works, we strongly encourage future researchers to utilize the power of networks for prediction purposes in pharmacovigilance. In particular, we believe that incorporating more data sources to construct more informative training networks can lead to even better predictions in the future. Specifically, chemical, physical, and molecular features of drugs (eg, molecular weight, heavy atom count, melting point) can be added to the model as covariates to enhance its prediction power. We believe that one big methodological advantage of our study is producing quality results using a considerably lower number of predictors than prior studies21,22,24,25 and relying mostly on the power of networks and ensemble ML algorithms to identify patterns. Nevertheless, as discussed in the Results section, incorporating some additional covariates can potentially improve the model while maintaining its simplicity. Databases such as DrugBank and PubChem are freely accessible and rich sources of information about drugs that can be used for this purpose.
We used biomedical literature citations as the only resource for the known drug-ADE associations in constructing the network. There are, however, other resources, such as the side effect resource (SIDER) database (http://www.sideeffects.embl.de) or commercial databases such as Lexicomp (http://www.lexi.com), that can be used for this purpose as well. Future research may extend our approach by incorporating multiple resources to add as many drug-ADE links as possible to the network, as doing so can enrich the information content of the network and potentially improve the quality and accuracy of the predictions.
Similarly, with regard to the drug targets, we used only a single source (ie, DrugBank). Prior research53 suggested that the network-based organization of DrugBank data, particularly the drug similarity network (DSN), can potentially contribute to the prediction of side effects, and our results support that suggestion; nevertheless, relying on DrugBank involves some potential limitations. DrugBank is primarily focused on labeling targets from a pharmacokinetic point of view and possibly includes some determinants of drug disposition labeled as drug targets. We are not sure, though, whether the existence of such instances has improved or limited our model performance: on one hand, they may make the DSN richer in information, but on the other hand, the nature of drug similarities may not be the same across the network.
Finally, we believe that the chronological setting used in the present study, in which the drug-ADE network is constructed from chronological drug approvals and known ADE associations, may be extended by future researchers into a longitudinal study. By constructing multiple drug-ADE networks at different time points, such a study could examine whether the evolution of this network over time enriches its informativeness and yields better predictions, both in general and with regard to specific associations.
FUNDING
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
CONTRIBUTORS
Both authors have been involved in all stages of this research, from the idea creation to writing and editing the manuscript, with contribution rates of roughly 70% and 30%.
Conflict of interest statement. The authors have no competing interests to declare.
APPENDIX
This appendix describes the search strategy used for extracting data from the National Library of Medicine’s (NLM) MEDLINE database of biomedical citations. MEDLINE is a subset of the PubMed database and includes more than 26 million biomedical citations starting from 1946 onward. Each article is carefully read and annotated by a group of trained indexers using a vocabulary system called Medical Subject Headings (MeSH). The MeSH thesaurus is a controlled vocabulary system produced by NLM to be used for indexing, cataloging, and searching for biomedical citations and health-related documents. After carefully reading an article, the NLM indexing experts select the most appropriate descriptors and subheadings (a.k.a. qualifiers) that best describe the content.
To extract required data for the present study, we downloaded all the MEDLINE citations from PubMed (https://www.ncbi.nlm.nih.gov/pubmed) with the “AE” MeSH subheading, which is used to indicate mentions of adverse effects, and containing at least 1 of the 8 high-risk ADEs of interest indexed as “Chemically induced.” These 2 MeSH indices, together, specify the drug and the adverse event mentioned as an association in a given article. For instance, the combination of “acetaminophen/AE” and “Acute kidney failure/chemically induced” for a given article indicates that a drug-ADE association suggesting the potential adverse effect of acetaminophen on developing kidney failure is mentioned in that study.
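The pairing rule in the acetaminophen example can be sketched in a few lines of Python. The sketch assumes that each heading is available as a simple "descriptor/qualifier" string, as written in the example above; real MEDLINE records encode headings differently (eg, several qualifiers per descriptor), so the parsing here is a simplification and the function name is hypothetical.

```python
from itertools import product


def pairs_from_article(mesh_headings):
    """Return the (drug, ADE) pairs implied by one article's MeSH headings.

    A descriptor tagged "AE" is treated as a drug; a descriptor tagged
    "chemically induced" is treated as an adverse event.
    """
    drugs, ades = set(), set()
    for heading in mesh_headings:
        descriptor, _, qualifier = heading.partition("/")
        qualifier = qualifier.strip().lower()
        if qualifier == "ae":
            drugs.add(descriptor.strip().lower())
        elif qualifier == "chemically induced":
            ades.add(descriptor.strip().lower())
    # Every tagged drug is paired with every tagged ADE in the same article.
    return sorted(product(drugs, ades))


article = [
    "acetaminophen/AE",
    "Acute kidney failure/chemically induced",
    "Humans",
]
print(pairs_from_article(article))
```

An article carrying only one of the two tags yields no pair, which mirrors the requirement that both indices appear together.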
The search was done multiple times, each time for one of the ADEs of interest; however, at the end, we removed duplicated citations from our records. For each article, the following information was collected: article PubMed ID (PMID), MeSH descriptors, subheadings, substances, and date of publication.
Since drugs’ target protein and date of approval information were to be extracted from another resource (ie, DrugBank), we then used the Unified Medical Language System (UMLS) to map the drug and ADE terms. UMLS is a biomedical terminology integration system handling more than 150 terminologies, including MeSH. It integrates various alternatives of the same biomedical concepts and assigns each concept a unique identifier (CUI) across the whole database. All the drug and ADE terms in the collected dataset were mapped to their corresponding UMLS terms, and the CUI associated with each was queried and added to the dataset.
The list of FDA-approved drugs along with their target proteins was downloaded from DrugBank’s (https://www.drugbank.ca) Therapeutic Target Database (TTD) ver. 6.1.01 and mapped to UMLS terms as well. The list was then used to filter the articles collected from MEDLINE, so that we kept only articles involving approved drugs and excluded studies focusing on experimental drugs or chemical compounds.
Finally, drug-ADE pairs were created by matching mentions of the “AE” and “Chemically induced” tags in the same publications, and the corresponding publication dates were assigned to the created pairs. Repeated pairs were then identified, and redundancies were removed by maintaining only the earliest drug-ADE mention (based on dates). Also using the DrugBank data, drug-drug pairs were created by matching the drugs sharing at least one target protein.
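The two pair-building steps described above can be sketched in Python as follows. The input layouts (a list of (drug, ADE, year) mentions and a mapping from each drug to its set of target proteins), the function names, and the toy data are assumptions made for illustration and do not reproduce the study's actual data structures.

```python
from itertools import combinations


def earliest_mentions(mentions):
    """Collapse repeated (drug, ade, year) mentions to the earliest year."""
    first = {}
    for drug, ade, year in mentions:
        key = (drug, ade)
        if key not in first or year < first[key]:
            first[key] = year
    return first


def drug_drug_pairs(targets):
    """Pair drugs that share at least one target protein."""
    return sorted(
        (a, b)
        for a, b in combinations(sorted(targets), 2)
        if targets[a] & targets[b]
    )


# Hypothetical toy inputs.
mentions = [
    ("drugA", "anemia", 2004),
    ("drugA", "anemia", 1998),
    ("drugB", "anemia", 2001),
]
targets = {
    "drugA": {"P1", "P2"},
    "drugB": {"P2"},
    "drugC": {"P9"},
}

print(earliest_mentions(mentions))
print(drug_drug_pairs(targets))
```

Keeping only the earliest mention is what allows each drug-ADE edge to be placed on one side of the chronological cutoff.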
The created pairs were then used as the input to both Cytoscape v3.6.0 and the igraph package in R to create network visualization and metrics, respectively.
Footnotes
Assortativity is defined as the extent to which highly central drugs tend to connect more frequently to high- or low-central ADEs.21
Preferential attachment denotes that the probability that a new edge has a specific node x as an endpoint is proportional to the current number of neighbors of x.43
The parameter settings for the best models in KNIME were as follows:
– RF: split criterion: Gini index; number of models: 400; no limit on the tree depth or node size.
– ANN: number of hidden layers: 2; number of neurons per layer: 5; maximum number of iterations: 60.
– GBT: number of models: 300; learning rate: 0.3; tree depth limit: 4; no attribute sampling.
For instance, we changed the cutoff year to 2003, and we ended up with only 732 records (ie, a decrease of 303 records) in the validation dataset.
REFERENCES
- 1. Iizuka T. Experts’ agency problems: evidence from the prescription drug market in Japan. Rand J Econ 2007; 383: 844–62.
- 2. Trame MN, Biliouris K, Lesko LJ, Mettetal JT. Systems pharmacology to predict drug safety in drug development. Eur J Pharm Sci 2016; 94: 93–5.
- 3. Karimi S, Wang C, Metke-Jimenez A, Gaire R, Paris C. Text and data mining techniques in adverse drug reaction detection. ACM Comput Surv 2015; 474: 1.
- 4. Zeng Q, Kogan S, Ash N, Greenes RA, Boxwala AA. Characteristics of consumer terminology for health information retrieval. Methods Inf Med 2002; 4104: 289–98.
- 5. Nikfarjam A, Sarker A, O’Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc 2015; 223: 671–81.
- 6. Reps JM, Aickelin U, Hubbard RB. Refining adverse drug reaction signals by incorporating interaction variables identified using emergent pattern mining. Comput Biol Med 2016; 69: 61–70.
- 7. Harpaz R, Vilar S, DuMouchel W, et al. Combing signals from spontaneous reports and electronic health records for detection of adverse drug reactions. J Am Med Inform Assoc 2013; 203: 413–9.
- 8. Trifirò G, Pariente A, Coloma PM, et al. Data mining on electronic health record databases for signal detection in pharmacovigilance: which events to monitor? Pharmacoepidemiol Drug Saf 2009; 1812: 1176–84.
- 9. Leaman R, Wojtulewicz L, Sullivan R, Skariah A, Yang J, Gonzalez G. Towards internet-age pharmacovigilance: extracting adverse drug reactions from user posts to health-related social networks. In: BioNLP ’10 Proceedings of the 2010 Workshop on Biomedical Natural Language Processing, Uppsala, Sweden; July 2010: 117–25.
- 10. Benton A, Ungar L, Hill S, et al. Identifying potential adverse effects using the web: a new approach to medical hypothesis generation. J Biomed Inform 2011; 446: 989–96.
- 11. Cai R, Liu M, Hu Y, et al. Identification of adverse drug-drug interactions through causal association rule discovery from spontaneous adverse event reports. Artif Intell Med 2017; 76: 7–15.
- 12. van Puijenbroek EP, Bate A, Leufkens HGM, Lindquist M, Orre R, Egberts ACG. A comparison of measures of disproportionality for signal detection in spontaneous reporting systems for adverse drug reactions. Pharmacoepidemiol Drug Saf 2002; 111: 3–10.
- 13. Yang CC, Jiang L, Yang H, Tang X. Detecting signals of adverse drug reactions from health consumer contributed content in social media. In: Proceedings of ACM SIGKDD Workshop on Health Informatics, Beijing, China; 2012.
- 14. Friedman C. Discovering novel adverse drug events using natural language processing and mining of the electronic health record. In: Conference on Artificial Intelligence in Medicine in Europe, Verona, Italy; 2009: 1–5.
- 15. Harpaz R, Haerian K, Chase HS, Friedman C. Mining electronic health records for adverse drug effects using regression based methods. In: Proceedings of the 1st ACM International Health Informatics Symposium, Arlington, VA, USA; 2010: 100–7.
- 16. Liu X, Chen H. AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In: International Conference on Smart Health, Beijing, China; 2013: 134–50.
- 17. DuMouchel W. Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. Am Stat 1999; 533: 177–90.
- 18. Haerian K, Varn D, Vaidya S, Ena L, Chase HS, Friedman C. Detection of pharmacovigilance‐related adverse events using electronic health records and automated methods. Clin Pharmacol Ther 2012; 922: 228–34.
- 19. Liu J, Zhao S, Zhang X. An ensemble method for extracting adverse drug events from social media. Artif Intell Med 2016; 70: 62–76.
- 20. Hoang T, Liu J, Pratt N, et al. Detecting signals of detrimental prescribing cascades from social media. Artif Intell Med 2016; 71: 43–56.
- 21. Cami A, Arnold A, Manzi S, Reis B. Predicting adverse drug events using pharmacological network models. Sci Transl Med 2011; 3114: 114ra127.
- 22. Atias N, Sharan R. An algorithmic framework for predicting side effects of drugs. J Comput Biol 2011; 183: 207–18.
- 23. Huang L, Wu X, Chen JY. Predicting adverse drug reaction profiles by integrating protein interaction networks with drug structures. Proteomics 2013; 132: 313–24.
- 24. Liu M, Wu Y, Chen Y, et al. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. J Am Med Inform Assoc 2012; 19 (e1): e28–35.
- 25. Huang L-C, Wu X, Chen JY. Predicting adverse side effects of drugs. BMC Genomics 2011; 12 (Suppl 5): S11.
- 26. Lazaar AL, Yang L, Boardley RL, et al. Pharmacokinetics, pharmacodynamics and adverse event profile of GSK2256294, a novel soluble epoxide hydrolase inhibitor. Br J Clin Pharmacol 2016; 815: 971–9.
- 27. Wedemeyer R-S, Blume H. Pharmacokinetic drug interaction profiles of proton pump inhibitors: an update. Drug Saf 2014; 374: 201–11.
- 28. Vazzana M, Andreani T, Fangueiro J, et al. Tramadol hydrochloride: pharmacokinetics, pharmacodynamics, adverse side effects, co-administration of drugs and new drug delivery systems. Biomed Pharmacother 2015; 70: 234–8.
- 29. Chiang C, Zhang P, Wang X, et al. Translational high‐dimensional drug interaction discovery and validation using health record databases and pharmacokinetics models. Clin Pharmacol Ther 2018; 1032: 287–95.
- 30. Albrecht D, Ellis D, Canafax DM, et al. Pharmacokinetics and pharmacodynamics of tecarfarin, a novel vitamin K antagonist oral anticoagulant. Thromb Haemost 2017; 11704: 706–17.
- 31. Ball R, Botsis T. Can network analysis improve pattern recognition among adverse events following immunization reported to VAERS. Clin Pharmacol Ther 2011; 902: 271–8.
- 32. Botsis T, Ball R. Network analysis of possible anaphylaxis cases reported to the US vaccine adverse event reporting system after H1N1 influenza vaccine. Stud Health Technol Inform 2011; 169: 564–8.
- 33. Zhang Y, Tao C, He Y, Kanjamala P, Liu H. Network-based analysis of vaccine-related associations reveals consistent knowledge with the vaccine ontology. J Biomed Sem 2013; 41: 33–8.
- 34. Kim MG, Jeong CR, Kim H, et al. Network analysis of drug-related problems in hospitalized patients with hematologic malignancies. Supportive Care in Cancer 2018; 268: 2737–42.
- 35. Bender A, Scheiber J, Glick M, et al. Analysis of pharmacology data and the prediction of adverse drug reactions and off‐target effects from chemical structure. ChemMedChem 2007; 26: 861–73.
- 36. Pouliot Y, Chiang AP, Butte AJ. Predicting adverse drug reactions using publicly available PubChem BioAssay data. Clin Pharmacol Ther 2011; 901: 90–9.
- 37. LaBute MX, Zhang X, Lenderman J, Bennion BJ, Wong SE, Lightstone FC. Adverse drug reaction prediction using scores produced by large-scale drug-protein target docking on high-performance computing machines. PLoS One 2014; 99: e106298.
- 38. Hammann F, Gutmann H, Vogt N, Helma C, Drewe J. Prediction of adverse drug reactions using decision tree modeling. Clin Pharmacol Ther 2010; 881: 52–9.
- 39. Imming P, Sinning C, Meyer A. Drugs, their targets and the nature and number of drug targets. Nat Rev Drug Discov 2006; 510: 821.
- 40. Yildirim MA, Il Goh K, Cusick ME, Barabási AL, Vidal M. Drug-target network. Nat Biotechnol 2007; 2510: 1119–26.
- 41. Avillach P, Dufour JC, Diallo G, et al. Design and validation of an automated method to detect known adverse drug reactions in MEDLINE: a contribution from the EU-ADR project. J Am Med Inform Assoc 2013; 203: 446–52.
- 42. Wishart DS, Knox C, Guo AC, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 2006; 3490001: D668–72.
- 43. Liben-Nowell D, Kleinberg J. The link-prediction problem for social networks. J Am Soc Inf Sci 2007; 587: 1019–31.
- 44. Zhou T, Lü L, Zhang Y-C. Predicting missing links via local information. Eur Phys J B 2009; 714: 623–30.
- 45. Jaccard P. The distribution of the flora in the alpine zone. New Phytol 1912; 112: 37–50.
- 46. Dice LR. Measures of the amount of ecologic association between species. Ecology 1945; 263: 297–302.
- 47. Adamic LA, Adar E. Friends and neighbors on the web. Soc Networks 2003; 253: 211–30.
- 48. Simpson GG. Notes on the measurement of faunal resemblance. Am J Sci 1960; 2582: 300–11.
- 49. Bass JIF, Diallo A, Nelson J, Soto JM, Myers CL, Walhout AJM. Using networks to measure similarity between genes: association index selection. Nat Methods 2013; 1012: 1169.
- 50. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res 2002; 16: 321–57.
- 51. Tsiropoulos I, Andersen M, Hallas J. Adverse events with use of antiepileptic drugs: a prescription and event symmetry analysis. Pharmacoepidemiol Drug Saf 2009; 186: 483–91.
- 52. Pratt N, Chan EW, Choi N, et al. Prescription sequence symmetry analysis: assessing risk, temporality, and consistency for adverse drug reactions across datasets in five countries. Pharmacoepidemiol Drug Saf 2015; 248: 858–64.
- 53. Barneh F, Jafari M, Mirzaie M. Updates on drug–target network; facilitating polypharmacology and data integration by growth of DrugBank database. Brief Bioinform 2015; 176: 1070–80.