Computational and Mathematical Methods in Medicine
2022 Apr 1;2022:9547317. doi: 10.1155/2022/9547317

Similarity-Based Method with Multiple-Feature Sampling for Predicting Drug Side Effects

Zixin Wu 1, Lei Chen 1,
PMCID: PMC8993545  PMID: 35401786

Abstract

Drugs can treat diseases but can also bring side effects. Undetected and unacceptable side effects of approved drugs can greatly harm the human body and bring huge risks to pharmaceutical companies. Traditional experimental methods for determining side effects have several drawbacks, such as low efficiency and high cost. One alternative is to design computational methods. Previous studies modeled a binary classification problem by pairing drugs and side effects; however, their classifiers extract only one feature from each type of drug association. The present work proposes a novel multiple-feature sampling scheme that can extract several features from one type of drug association. Thirteen classification algorithms were employed to construct classifiers with features yielded by this scheme. Their performance was greatly improved compared with that of classifiers using features yielded by the original scheme. The best performance was observed for the classifier based on random forest, with an MCC of 0.8661, an AUROC of 0.969, and an AUPR of 0.977. Finally, one key parameter of the multiple-feature sampling scheme was analyzed.

1. Introduction

Drugs are important in treating various diseases; however, their therapeutic effects are accompanied by negative effects called side effects. In the pharmaceutical field, a drug side effect is classified as an adverse drug reaction (ADR): a harmful or unintended reaction to an approved drug that is unrelated to the purpose of its use under normal usage and dosage. Some market-approved drugs may generate unacceptable side effects that can harm the human body and bring high risks to pharmaceutical companies. For example, fluconazole and atorvastatin have potential hepatotoxicity and nephrotoxicity and can increase transaminase levels when used in specific patients, such as those with liver disease. Side effects are one of the major obstacles to launching new drugs and can delay their development. Thus, determining all the side effects of a given drug is an important topic in drug development. Although solid clinical trials can reliably identify side effects, they are time consuming and expensive and thus cannot meet the demand of large-scale tests. Rapid and cheap methods for the identification of drug side effects must therefore be developed.

Many advanced computational algorithms have been proposed [1–5], providing strong technical support for various medical problems. Several computational methods have been developed for the identification of drug side effects. Most of them are machine learning-based techniques that deeply investigate current information on drug side effects and learn patterns that can be used to predict the side effects of a new drug. Some early methods consisted of an individual binary classifier for each side effect [6–10]; hence, they always contain several binary classifiers that must be executed simultaneously to determine all side effects of a given drug. In view of this situation, other techniques were built directly with multilabel classifiers [11–16] that treat side effects as labels and drugs as samples. Recommender systems were also proposed to predict drug side effects [17–19]. Recent works paired drugs and side effects as samples, converting the original problem into binary classification [20–22]. A key step in developing such binary classifiers is to extract essential properties from each drug–side effect pair. Some researchers used a similarity-based scheme to extract features [21, 22]; for convenience, they extracted only one feature from each type of drug association, a process referred to here as the single-feature sampling scheme. However, this approach may omit some essential information. To continue this line of research, a novel feature extraction scheme that retains the essential information of each drug–side effect pair must be developed.

In this study, an efficient binary classifier was proposed for the identification of drug side effects. Drugs and side effects were again paired as samples [20–22]. The single-feature sampling scheme [21, 22] was generalized to extract essential features from each pair. Named the multiple-feature sampling scheme, this newly proposed strategy can generate multiple features from each type of drug association. The classic machine learning algorithm random forest (RF) [23] was adopted as the prediction engine. According to the 10-fold cross-validation results, the performance of this classifier was better than that of the previous classifier using the original single-feature sampling scheme for feature extraction. Further tests suggested that classifiers built with other classification algorithms and features yielded by the multiple-feature sampling scheme were all superior to those built with the same classification algorithm and features generated by the original scheme. This finding indicates the power of the features generated by the proposed feature extraction scheme.

2. Materials and Methods

2.1. Benchmark Dataset

Data on 841 drugs and their 824 side effects [20–22] were extracted from SIDER (http://sideeffects.embl.de/) [24], a public database collecting information on marketed drugs and their ADRs. The original data contained 888 drugs and 1385 side effects. Side effects annotated to no more than five drugs were excluded. Furthermore, drugs without the properties mentioned in Section 2.2 were discarded. From the remaining 841 drugs and 824 side effects, 57,058 drug–side effect pairs were obtained. Each pair indicates that the drug in the pair has the side effect in the same pair. Given that these pairs indicate known relationships between one drug and one side effect, they were termed positive samples and comprised the positive dataset (PDS).

In addition to the PDS, a negative dataset (NDS) was necessary to build an efficient binary classifier. A total of 57,058 drug–side effect pairs were produced by randomly pairing one drug with one side effect [20, 21]; pairs already present in the PDS were excluded, so none of these pairs could be labeled as positive samples. These pairs therefore constituted one NDS. Because different NDSs may influence the performance of the classifier, four other NDSs were also generated. Finally, five datasets, each containing the PDS and one NDS, were produced and denoted DS1, DS2, DS3, DS4, and DS5.
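
As an illustration, the following minimal Python sketch shows how such a negative dataset could be assembled by random pairing; the function and variable names are hypothetical, and the exact sampling details of the original studies [20, 21] may differ.

```python
import random

def generate_negative_pairs(drugs, side_effects, positive_pairs, n_samples, seed=0):
    """Randomly pair drugs with side effects, excluding pairs already in the PDS."""
    rng = random.Random(seed)
    positives = set(positive_pairs)          # known (drug, side effect) pairs
    negatives = set()
    while len(negatives) < n_samples:
        pair = (rng.choice(drugs), rng.choice(side_effects))
        if pair not in positives:            # keep only pairs absent from the PDS
            negatives.add(pair)
    return sorted(negatives)

# e.g. five negative datasets, each the same size as the PDS:
# nds_list = [generate_negative_pairs(drugs, side_effects, pds, len(pds), seed=i)
#             for i in range(5)]
```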

2.2. Drug Association Obtained from Different Drug Properties

Two drugs with strong associations always share similar functions [25–29]. Side effects can be deemed one type of drug function. Thus, classifiers can be constructed by adopting features derived from drug associations. Several types of drug associations can be measured and quantified from different aspects of drugs. For easy comparison, the drug associations used in a previous study [21] were adopted; their brief descriptions are as follows.

2.2.1. Drug Fingerprint Association

The simplified molecular input line entry system (SMILES) string [30] is a widely used scheme for drug representation. Fingerprints can be extracted from this string using existing software, such as RDKit [31]. The association of two drugs can be evaluated by comparing their fingerprints. Here, ECFP_4 fingerprints and the Tanimoto coefficient were used to measure this association between any two drugs. For formulation, this association between drugs d1 and d2 was denoted by Gf(d1, d2).
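
A minimal sketch of this computation with RDKit is given below; note that ECFP_4 corresponds to a Morgan fingerprint of radius 2, and the 2048-bit length is an assumption, as the fingerprint size is not stated in the source.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint_association(smiles1, smiles2, n_bits=2048):
    """Gf(d1, d2): Tanimoto coefficient between ECFP_4 (Morgan radius-2) fingerprints."""
    mol1 = Chem.MolFromSmiles(smiles1)
    mol2 = Chem.MolFromSmiles(smiles2)
    fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1, 2, nBits=n_bits)
    fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2, 2, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp1, fp2)

# Example with two illustrative SMILES strings (aspirin and ibuprofen):
# fingerprint_association("CC(=O)OC1=CC=CC=C1C(=O)O", "CC(C)Cc1ccc(cc1)C(C)C(=O)O")
```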

2.2.2. Drug Structural Association

In addition to the SMILES string, another popular drug representation scheme is the graph-based method, in which each drug is represented by a graph with nodes depicting atoms and edges indicating bonds. The association of two drugs can be assessed by considering the similarity of the two corresponding graphs. "SIMCOMP" (https://www.genome.jp/tools/simcomp/), reported in KEGG [32, 33], was built on this idea. This tool outputs the associations of a given drug with other drugs as scores between 0 and 1. This association for drugs d1 and d2 was denoted by Gs(d1, d2).

2.2.3. Drug Anatomical Therapeutic Chemical (ATC) Code Association

The ATC system is widely accepted and used for drug classification. Each drug in this system is assigned five levels of ATC codes that indicate its essential properties. For two drugs, their association can be measured according to their ATC codes. This study used the same method as in [21] to evaluate drug associations based on ATC codes. For convenience, the association of drugs d1 and d2 was denoted by Ga(d1, d2).

2.2.4. Drug Literature Association

Given the extensive literature on drugs, the association of two drugs can be measured from their co-occurrence in the literature using natural language processing methods. The well-known public database STITCH (version 4.0, http://stitch4.embl.de/) [34] provides such associations, which were directly employed in this study. The "textmining" score was extracted from the downloaded file "chemical_chemical.links.detailed.v4.0.tsv." For drugs d1 and d2, their literature association was denoted by Gtm(d1, d2).
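
A hedged sketch of how such scores might be read from the downloaded file is shown below; the column names and the 0–1000 score scale are assumptions about the STITCH file layout rather than details given in the source.

```python
import csv

def load_textmining_scores(path="chemical_chemical.links.detailed.v4.0.tsv"):
    """Collect the text-mining channel from the STITCH chemical-chemical links file.

    Assumes a tab-separated file whose header includes 'chemical1', 'chemical2',
    and 'textmining' columns and that scores are integers on a 0-1000 scale.
    """
    scores = {}
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            score = int(row["textmining"])
            if score > 0:
                # rescale to [0, 1] so that Gtm(d1, d2) matches the other associations
                scores[(row["chemical1"], row["chemical2"])] = score / 1000.0
    return scores
```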

2.2.5. Drug Target Protein Association

Target proteins are basic properties of drugs. Hence, the association of two drugs can be estimated by comparing their target proteins. In this study, the target proteins of drugs were retrieved from DrugBank (https://go.drugbank.com/) [35]. Each drug was encoded into a binary vector by applying a one-hot scheme to its target proteins. The direction cosine of two such vectors was defined as the association of the two drugs. For formulation, this association between drugs d1 and d2 was denoted by Gt(d1, d2).
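
A minimal sketch of this encoding and similarity computation is given below; the function name, argument names, and the example target identifiers are illustrative only.

```python
import numpy as np

def target_association(targets1, targets2, all_targets):
    """Gt(d1, d2): direction cosine between one-hot target-protein vectors of two drugs."""
    v1 = np.array([1.0 if t in targets1 else 0.0 for t in all_targets])
    v2 = np.array([1.0 if t in targets2 else 0.0 for t in all_targets])
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(v1.dot(v2) / denom) if denom > 0 else 0.0

# Example with hypothetical target-protein identifiers:
# target_association({"P23219", "P35354"}, {"P23219"}, ["P23219", "P35354", "P08183"])
```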

2.3. Feature Engineering

In Section 2.2, five types of drug associations that have been used to extract features representing drug–side effect pairs [21, 22] were described. These features indicate the linkage between the drug and the side effect in a pair. However, the previous scheme extracts only one feature from each type of drug association and thus cannot fully capture this linkage. This study proposes a novel feature extraction scheme, called the multiple-feature sampling scheme, which can extract multiple features from one type of drug association. Some notation is needed for a clear description. For one drug–side effect pair p = <d, s>, where d denotes a drug and s a side effect, let S be the set of drugs in the training dataset that are annotated with side effect s; if d itself is annotated with s in the training dataset, it is excluded from S. For one type of drug association, all association values between d and the drugs in S are collected and sorted in decreasing order, yielding a candidate feature list denoted by Ψk(p), where k ∈ {f, s, a, tm, t} indicates the type of drug association used to construct the list. Previous studies chose only the top value in this list as the exclusive feature [21, 22]. Selecting several values from this list can capture more information about the linkage between drug d and side effect s. On the basis of different selection models, two strategies were proposed, namely, the discrete and continuous strategies. Their procedures are shown in Figure 1.

Figure 1. Procedures of the multiple-feature sampling scheme to extract essential features from a drug–side effect pair. For a pair of drug d and side effect s, drugs having side effect s are extracted from the training dataset. The association scores between d and these drugs constitute a candidate feature list. The discrete strategy selects discrete values in this list as features, and the continuous strategy picks some top values in this list as features.
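
For illustration, the following minimal Python sketch builds the candidate feature list Ψk(p); the data structures and names are hypothetical and not taken from the original implementation.

```python
def candidate_feature_list(drug, side_effect, drug_to_side_effects, association):
    """Build the candidate feature list Psi_k(p) for the pair p = <drug, side_effect>.

    drug_to_side_effects: dict mapping each training drug to its known side effects;
    association: a function G_k(d1, d2) for one type of drug association.
    """
    # S: training drugs annotated with the side effect, excluding the query drug itself
    S = [d for d, effects in drug_to_side_effects.items()
         if side_effect in effects and d != drug]
    # association values between the query drug and every drug in S, in decreasing order
    return sorted((association(drug, d) for d in S), reverse=True)
```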

2.3.1. Discrete Strategy

In this strategy, several values are selected from the list Ψk(p) to reflect the distribution of values in the list. In this way, the selected values can more fully indicate the linkage between drug d and side effect s. This is achieved by selecting discrete positions in the list; for example, the value at the first place or the value at the top q% place can be selected. These values comprise a set of features for one type of drug association.

2.3.2. Continuous Strategy

This strategy differs from the discrete one. Given that the linkage of drug d and side effect s is mainly reflected by the top values in the list, these values should be selected together because they may jointly contain the essential information. For an integer q between 1 and 100, the top q% of values in the list Ψk(p) were selected as features.
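
The following sketch illustrates both selection strategies applied to a candidate feature list. It assumes a non-empty list; the indexing convention for the "top q% place" is an assumption, and the source does not specify how the variable-length continuous selections are assembled into fixed-length feature vectors.

```python
import math

def discrete_features(candidates, percents=(5, 10, 15, 20)):
    """Discrete strategy: the top value plus the values at the top q% places."""
    features = [candidates[0]]
    for q in percents:
        # 1-based rank ceil(n * q / 100), clipped to the list length
        rank = min(len(candidates), max(1, math.ceil(len(candidates) * q / 100)))
        features.append(candidates[rank - 1])
    return features

def continuous_features(candidates, q=20):
    """Continuous strategy: all values ranked within the top q% of the list."""
    k = max(1, math.ceil(len(candidates) * q / 100))
    return candidates[:k]
```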

2.4. Classification Algorithm

A proper classification algorithm is important in building an efficient classifier. In this study, RF [23] was adopted to construct the classifier. RF is one of the most classic classification algorithms and has been used to set up many classifiers in bioinformatics [36–41].

RF is an ensemble classification algorithm containing several decision trees, each of which is constructed by two random selection procedures. The first procedure selects samples: given a dataset with n samples, n samples are randomly selected with replacement from this dataset. The second procedure selects the features used to split each node; the number of selected features is much smaller than the total number of features. After the predefined number of decision trees has been constructed, RF integrates them by majority voting: for a query sample, each decision tree gives its prediction, and the majority prediction is the predicted result of RF. Although a single decision tree is a relatively weak classifier, RF is extremely powerful and has always been an important candidate for building different classifiers.

In this study, “RandomForest” in Weka [42] was directly used to implement the abovementioned RF. Default parameters were adopted, and the number of decision trees was set to 100.

In addition to RF, the following classification algorithms were used to build corresponding classifiers: support vector machine (SVM) with polynomial and RBF kernels [43], Adaboost M1 [44], Bagging [45], Bayesian network [46], Naive Bayes [47], K-nearest neighbor (KNN) [48], decision tree (C4.5) [49], PART [50], logistic regression [51], multilayer perceptron (MLP) [52], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [53]. The goal was to confirm that the features yielded by the multiple-feature sampling scheme are more effective than those yielded by the single-feature sampling scheme. For convenience, the corresponding tools in Weka were used to implement these classification algorithms with default parameters. These algorithms adopt different principles and procedures for classification, so their use can fully test the utility of the proposed feature sampling scheme. If, for each of these classification algorithms, the classifier with features yielded by the multiple-feature sampling scheme is superior to that with the previous features, the robustness of the novel features is confirmed.

2.5. Accuracy Measurement

Ten-fold cross-validation [54–59] was adopted to evaluate the performance of all constructed classifiers. This method randomly divides the original dataset into ten parts. Each part is singled out in turn as the test set, and the remaining parts constitute the training set. Samples in the test set are predicted by the classifier trained on the training set. Thus, each sample is tested exactly once.
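
The classifiers in this work were built in Weka; the sketch below shows an analogous random forest with 100 trees evaluated by 10-fold cross-validation in scikit-learn, as an illustration under those assumptions rather than the authors' exact setup.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

def evaluate_rf(X, y, n_trees=100, n_folds=10, seed=1):
    """10-fold cross-validated RF predictions; X is the pair feature matrix, y the labels."""
    clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    # probability of the positive class for every sample, each predicted once as a test instance
    y_score = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")[:, 1]
    y_pred = (y_score >= 0.5).astype(int)
    return y_pred, y_score
```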

For a binary classification problem, four entries can be counted by comparing the predicted and true classes of each sample, that is, true positive (TP), false positive (FP), true negative (TN), and false negative (FN). The following measurements were based on these four entries: sensitivity (SN) (also called recall), specificity (SP), prediction accuracy (ACC), Matthews correlation coefficient (MCC) [20, 21, 37, 60–63], precision, and F1-measure. Their definitions are as follows:

$$\mathrm{SN}\ (\mathrm{recall}) = \frac{TP}{TP + FN}, \quad (1)$$
$$\mathrm{SP} = \frac{TN}{TN + FP}, \quad (2)$$
$$\mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN}, \quad (3)$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TN + FN)(TN + FP)(TP + FN)(TP + FP)}}, \quad (4)$$
$$\mathrm{precision} = \frac{TP}{TP + FP}, \quad (5)$$
$$\text{F1-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \quad (6)$$

ACC, MCC, and F1-measure use all four entries and thus are more important than the other three measurements. The receiver operating characteristic (ROC) curve [64] and the precision-recall (PR) curve were further employed to fully assess the performance of the constructed classifiers. These curves indicate the performance of classifiers under different thresholds. The ROC curve takes 1 − SP as the x-axis and SN as the y-axis, and the PR curve takes recall as the x-axis and precision as the y-axis. The areas under these two curves (AUROC and AUPR) are important measurements for evaluating classifier performance. Among the abovementioned measurements, MCC was selected as the main one.
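
For illustration, the sketch below computes these measurements from cross-validation outputs with scikit-learn; average_precision_score is used here as a stand-in for AUPR, which is an assumption about how the area was computed.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

def summarize(y_true, y_pred, y_score):
    """Compute the measurements described above from cross-validation outputs."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return {
        "SN (recall)": recall_score(y_true, y_pred),
        "SP": tn / (tn + fp),
        "ACC": accuracy_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred),
        "F1-measure": f1_score(y_true, y_pred),
        "AUROC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),   # approximation of AUPR
    }
```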

3. Results and Discussion

A novel feature extraction method was proposed to extract essential features from drug–side effect pairs. On the basis of these features, efficient classifiers to predict drug side effects were established. All procedures are illustrated in Figure 2.

Figure 2. Entire procedure of the method for identification of drug side effects. The positive dataset (reported drug–side effect pairs) is retrieved from SIDER, and five negative datasets are randomly generated. Five drug properties, obtained from four public databases or tools, are used to extract features with the multiple-feature sampling scheme. Random forest is adopted to build the model, which is further evaluated by 10-fold cross-validation.

3.1. Performance of the RF Classifiers with Discrete Strategy

The discrete strategy picks some discrete values in the candidate feature list. Given that the top value in this list is the most important and has previously been selected as the exclusive feature [21, 65], it is always picked as one feature. As mentioned in Section 2.3, the value located at the top q% place in the list was also selected; in this study, q was set to 5, 10, 15, and 20. Values with high ranks in the candidate feature list are more important than those with low ranks; that is, the top value is the most important, followed by the values at 5%, 10%, 15%, and 20%. Incremental feature selection was adopted to generate the four feature subsets listed in column 1 of Table 1. With each feature subset derived from the five types of drug associations, an RF classifier was built on each of the five datasets and evaluated by 10-fold cross-validation. The average performance is listed in Table 1. MCC followed an increasing trend as the values at the top 5%, 10%, 15%, and 20% were added, and the other five measurements generally followed the same trend. The RF classifiers with all selected features (the top value and those at 5%, 10%, 15%, and 20%) generated the highest MCC of 0.7172. This finding indicates that the features yielded by the multiple-feature sampling scheme are quite efficient for the identification of drug side effects.

Table 1.

Performance of the RF classifiers with discrete strategy.

Feature sampling SN SP ACC MCC Precision F1 measure
Top + 5% 0.8072 0.8694 0.8383 0.6780 0.8608 0.8331
Top + 5% + 10% 0.8209 0.8829 0.8519 0.7058 0.8751 0.8467
Top + 5% + 10% + 15% 0.8214 0.8907 0.8561 0.7145 0.8825 0.8505
Top + 5% + 10% + 15% + 20% 0.8201 0.8944 0.8573 0.7172 0.8860 0.8514

The ROC and PR curves of these four RF classifiers were investigated, and the results are shown in Figure 3. All AUROCs and AUPRs were higher than 0.900 and 0.910, respectively, further suggesting the good performance of the RF classifiers with the discrete strategy.

Figure 3. Receiver operating characteristic (ROC) curves and precision-recall (PR) curves of RF classifiers with the single-feature sampling scheme and the multiple-feature sampling scheme (discrete strategy): (a) ROC curves; (b) PR curves.

3.2. Performance of RF Classifiers with Continuous Strategy

Different from the discrete strategy, the continuous strategy selects values from the candidate feature list in a continuous way. As mentioned in Section 2.3, the top q% of values in the candidate feature list can be chosen as features. Here, four q values (10, 20, 30, and 40) were tested, yielding four feature subsets. An RF classifier was again built on each of the five datasets by using the feature subsets derived from the five types of drug associations. Each classifier was assessed by 10-fold cross-validation, and the average performance is listed in Table 2. When q = 20 (top 20%), the RF classifier yielded the highest MCC of 0.8661, with an ACC of 0.9312, an F1-measure of 0.9278, an SN of 0.8852, an SP of 0.9771, and a precision of 0.9747. Compared with the RF classifiers with the discrete strategy, the best RF classifier with the continuous strategy had higher measurements, particularly MCC (by 15%), ACC (by 7%), and F1-measure (by 7%). These results indicate that the features obtained by the continuous strategy are more powerful for identifying drug side effects than those yielded by the discrete strategy.

Table 2.

Performance of the RF classifiers with continuous strategy.

Feature sampling SN SP ACC MCC Precision F1 measure
Top 10% 0.8737 0.9644 0.9190 0.8416 0.9609 0.9152
Top 20% 0.8852 0.9771 0.9312 0.8661 0.9747 0.9278
Top 30% 0.8844 0.9770 0.9307 0.8652 0.9747 0.9273
Top 40% 0.8834 0.9775 0.9305 0.8648 0.9751 0.9270

The ROC and PR curves of RF classifiers with continuous strategy were plotted as shown in Figure 4. All ROC curves were close to the point (0, 1), and all PR curves were close to the point (1, 1). The AUROCs and AUPRs were all quite high. Compared with AUROCs and AUPRs for discrete strategy, those for continuous strategy were generally higher. This finding further confirmed that the features yielded by continuous strategy were more powerful than those yielded by discrete strategy.

Figure 4. Receiver operating characteristic (ROC) curves and precision-recall (PR) curves of RF classifiers with the multiple-feature sampling scheme (continuous strategy): (a) ROC curves; (b) PR curves.

3.3. Comparison of RF Classifiers with Single- and Multiple-Feature Sampling

A multiple-feature sampling scheme was proposed to extract essential features from each drug–side effect pair. Previous studies [21, 22] picked only the top value as the feature, a technique called the single-feature sampling scheme. This section compares the RF classifiers under these two feature sampling schemes.

The average performance of the RF classifiers with the single-feature sampling scheme is listed in Table 3. The MCC was 0.5997, the ACC was 0.7999, and the F1-measure was 0.7988; the other three measurements (SN, SP, and precision) were 0.7948, 0.8049, and 0.8030, respectively. The best-performing (highest MCC) RF classifiers with the discrete and continuous strategies were selected for comparison and are also listed in Table 3. The MCCs for the two strategies were 0.7172 and 0.8661, both higher than that of the RF classifier with the single-feature sampling scheme. The same conclusion holds for the other five measurements. The ROC and PR curves of the RF classifier with the single-feature sampling scheme were also plotted (Figure 3) and always lay below those of the RF classifiers with the discrete strategy. The AUROC and AUPR of the RF classifier with the single-feature sampling scheme were 0.870 and 0.878, respectively, which were also lower than those of the RF classifiers with the discrete strategy. For the RF classifier with the continuous strategy, the AUROCs and AUPRs (Figure 4) were even better than those of the RF classifiers with the discrete strategy and were also higher than those of the RF classifier with the single-feature sampling scheme. All these results imply that the features yielded by the multiple-feature sampling scheme contain more essential information on drug–side effect pairs than those obtained by the single-feature sampling scheme, providing RF with improved performance.

Table 3.

Comparison of RF classifiers with single- and multiple-feature sampling schemes.

Scheme SN SP ACC MCC Precision F1 measure
Single sampling 0.7948 0.8049 0.7999 0.5997 0.8030 0.7988
Multiple sampling (discrete strategy) 0.8201 0.8944 0.8573 0.7172 0.8860 0.8514
Multiple sampling (continuous strategy) 0.8852 0.9771 0.9312 0.8661 0.9747 0.9278

3.4. Performance of Other Classifiers with Multiple-Feature Sampling Scheme

The RF classifiers with features yielded by multiple-feature sampling (discrete strategy) were superior to those with features yielded by single-feature sampling, and the RF classifiers with the continuous strategy were better than those with the discrete strategy. However, whether this result depends on the choice of classification algorithm must be explored. In this section, the 12 classification algorithms mentioned in Section 2.4 were tested. Classifiers with these algorithms were constructed on all feature subsets used for RF and evaluated by 10-fold cross-validation. The predicted results are listed in Tables S1–S24.

The performance of the classifiers with single-feature sampling and the best performance of the classifiers with multiple-feature sampling are listed in Table 4. The classifiers with multiple-feature sampling (discrete strategy) were generally better than those with single-feature sampling, and those with the continuous strategy were superior to those with the discrete strategy and single-feature sampling. For visual confirmation, a radar graph was plotted for each of ACC, MCC, and F1-measure, as illustrated in Figure 5. For each measurement, the area enclosed by the curve of the classifiers with multiple-feature sampling (continuous strategy) was the largest, followed by that of the classifiers with multiple-feature sampling (discrete strategy); the area enclosed by the curve of the classifiers with single-feature sampling was the smallest. On the basis of these results, the multiple-feature sampling scheme is more efficient at capturing the essential properties of drug–side effect pairs than the single-feature sampling scheme, and the continuous strategy is better than the discrete strategy.

Table 4.

Performance of classifiers with different classification algorithms and feature extraction schemes.

Classification algorithm Feature extraction scheme ACC MCC F1-measure
SVM (polynomial kernel) Single sampling 0.6487 0.2997 0.6252
Multiple sampling Discrete strategy 0.6989 0.4240 0.6357
Continuous strategy 0.9152 0.8356 0.9101

SVM (RBF kernel) Single sampling 0.6608 0.3276 0.6251
Multiple sampling Discrete strategy 0.6987 0.4188 0.6415
Continuous strategy 0.9191 0.8428 0.9147

Adaboost M1 Single sampling 0.6693 0.3435 0.6392
Multiple sampling Discrete strategy 0.6574 0.3186 0.6287
Continuous strategy 0.9024 0.8102 0.8963

Bagging Single sampling 0.7909 0.5828 0.7848
Multiple sampling Discrete strategy 0.8386 0.6799 0.8317
Continuous strategy 0.9273 0.8580 0.9240

Bayesian network Single sampling 0.7007 0.4076 0.6722
Multiple sampling Discrete strategy 0.6950 0.3980 0.6614
Continuous strategy 0.8473 0.7236 0.8225

Naive Bayes Single sampling 0.6368 0.2822 0.5859
Multiple sampling Discrete strategy 0.6272 0.2616 0.5782
Continuous strategy 0.8528 0.7329 0.8296

KNN Single sampling 0.7652 0.5321 0.7740
Multiple sampling Discrete strategy 0.7918 0.5838 0.7931
Continuous strategy 0.9071 0.8148 0.9054

Decision tree Single sampling 0.7635 0.5315 0.7471
Multiple sampling Discrete strategy 0.8154 0.6333 0.8080
Continuous strategy 0.9170 0.8359 0.9142

PART Single sampling 0.6986 0.4015 0.6753
Multiple sampling Discrete strategy 0.8022 0.6105 0.7874
Continuous strategy 0.9192 0.8402 0.9166

Logistic regression Single sampling 0.6501 0.3008 0.6383
Multiple sampling Discrete strategy 0.7690 0.5442 0.7515
Continuous strategy 0.9157 0.8353 0.9115

Multilayer perceptron Single sampling 0.6680 0.3438 0.6352
Multiple sampling Discrete strategy 0.8139 0.6305 0.8052
Continuous strategy 0.8616 0.7299 0.8688

RIPPER Single sampling 0.7037 0.4090 0.6904
Multiple sampling Discrete strategy 0.7546 0.5156 0.7382
Continuous strategy 0.9215 0.8460 0.9181

Figure 5. Radar graphs showing the performance of classifiers with single- and multiple-feature sampling schemes: (a) MCC; (b) ACC; (c) F1-measure. Classifiers with the multiple-feature sampling scheme (continuous strategy) provide the best performance.

3.5. Analysis of the Parameter of Continuous Strategy

For the continuous strategy, the parameter q is a key factor that determines the number of selected features from the candidate feature list. Here, its influence on the performance of classifiers was investigated.

For the RF classifiers, the highest MCC of 0.8661 was achieved when q = 20 (Table 2). For classifiers with other classification algorithms, q = 20 generally yielded the best performance, as shown in Figure 6. Among the 13 classifiers with different classification algorithms, 10 (76.92%) provided the best performance when q = 20, and two yielded the best performance when q = 30. This phenomenon is reasonable: when q is extremely small, some essential information on drug–side effect pairs is not included, whereas when q is large, considerable noise may be introduced. The current investigation suggests that q can be taken in the interval [20, 30].

Figure 6. Performance of classifiers with the continuous strategy under different parameter values.

4. Conclusions

This study presents a novel investigation on drug side effects. The contributions have two aspects: one is the multiple-feature sampling scheme that can extract essential features from drug–side effect pairs, and the other is a novel computational method for the identification of drug side effects based on the features yielded by this scheme. Classifiers were built with different classification algorithms. By comparison, the classifiers using features yielded by the multiple-feature sampling scheme performed better than those using features yielded by the single-feature sampling scheme. The proposed classifiers can be useful tools for identifying drug side effects, and the novel feature extraction scheme can be applied to other similar biological or medical problems.

Acknowledgments

This work was supported by the Natural Science Foundation of Shanghai (17ZR1412500).

Data Availability

The original data used to support the findings of this study are available at SIDER and in supplementary information files.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

Supplementary Materials

Table S1: performance of SVM (polynomial kernel) classifier with discrete strategy. Table S2: performance of SVM (polynomial kernel) classifier with continuous strategy. Table S3: performance of SVM (RBF kernel) classifier with discrete strategy. Table S4: performance of SVM (RBF kernel) classifier with continuous strategy. Table S5: performance of Adaboost M1 classifier with discrete strategy. Table S6: performance of Adaboost M1 classifier with continuous strategy. Table S7: performance of Bagging classifier with discrete strategy. Table S8: performance of Bagging classifier with continuous strategy. Table S9: performance of Bayesian network classifier with discrete strategy. Table S10: performance of Bayesian network classifier with continuous strategy. Table S11: performance of Naive Bayes classifier with discrete strategy. Table S12: performance of Naive Bayes classifier with continuous strategy. Table S13: performance of KNN classifier with discrete strategy. Table S14: performance of KNN classifier with continuous strategy. Table S15: performance of decision tree classifier with discrete strategy. Table S16: performance of decision tree classifier with continuous strategy. Table S17: performance of PART classifier with discrete strategy. Table S18: performance of PART classifier with continuous strategy. Table S19: performance of logistic regression classifier with discrete strategy. Table S20: performance of logistic regression classifier with continuous strategy. Table S21: performance of multilayer perceptron classifier with discrete strategy. Table S22: performance of multilayer perceptron classifier with continuous strategy. Table S23: performance of RIPPER classifier with discrete strategy. Table S24: performance of RIPPER classifier with continuous strategy.

References

  • 1.Onan A., Korukoğlu S., Bulut H. A multiobjective weighted voting ensemble classifier based on differential evolution algorithm for text sentiment classification. Expert Systems with Applications . 2016;62:1–16. doi: 10.1016/j.eswa.2016.06.005. [DOI] [Google Scholar]
  • 2.Onan A., Korukoğlu S., Bulut H. Ensemble of keyword extraction methods and classifiers in text classification. Expert Systems with Applications . 2016;57:232–247. doi: 10.1016/j.eswa.2016.03.045. [DOI] [Google Scholar]
  • 3.Onan A., Korukoğlu S. Artificial Intelligence Perspectives in Intelligent Systems . Springer; 2016. Exploring performance of instance selection methods in text sentiment classification; pp. 167–179. [DOI] [Google Scholar]
  • 4.Onan A., Korukoğlu S., Bulut H. A hybrid ensemble pruning approach based on consensus clustering and multi- objective evolutionary algorithm for sentiment classification. Information Processing & Management . 2017;53(4):814–833. doi: 10.1016/j.ipm.2017.02.008. [DOI] [Google Scholar]
  • 5.Onan A., Toçoğlu M. A. A term weighted neural language model and stacked bidirectional LSTM based framework for sarcasm identification. IEEE Access . 2021;9:7701–7722. doi: 10.1109/ACCESS.2021.3049734. [DOI] [Google Scholar]
  • 6.Pauwels E., Stoven V., Yamanishi Y. Predicting drug side-effect profiles: a chemical fragment-based approach. BMC Bioinformatics . 2011;12(1):p. 169. doi: 10.1186/1471-2105-12-169. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Jamal S., Goyal S., Shanker A., Grover A. Predicting neurological adverse drug reactions based on biological, chemical and phenotypic properties of drugs using machine learning models. Scientific Reports . 2017;7(1):p. 872. doi: 10.1038/s41598-017-00908-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Zheng Y., Peng H., Ghosh S., Lan C., Li J. Inverse similarity and reliable negative samples for drug side-effect prediction. BMC Bioinformatics . 2019;19(Suppl 13):p. 554. doi: 10.1186/s12859-018-2563-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Liu M., Wu Y., Chen Y., et al. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association . 2012;19(e1):e28–e35. doi: 10.1136/amiajnl-2011-000699. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Dey S., Luo H., Fokoue A., Hu J., Zhang P. Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinformatics . 2018;19(Suppl 21):p. 476. doi: 10.1186/s12859-018-2544-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Chen L., Huang T., Zhang J., et al. Predicting drugs side effects based on chemical-chemical interactions and protein-chemical interactions. BioMed Research International . 2013;2013:8. doi: 10.1155/2013/485034.485034 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Zhang W., Liu F., Luo L., Zhang J. Predicting drug side effects by multi-label learning and ensemble learning. BMC Bioinformatics . 2015;16(1):p. 365. doi: 10.1186/s12859-015-0774-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Atias N., Sharan R. An algorithmic framework for predicting side effects of drugs. Journal of Computational Biology . 2011;18(3):207–218. doi: 10.1089/cmb.2010.0255. [DOI] [PubMed] [Google Scholar]
  • 14.Muñoz E., Novácek V., Vandenbussche P. Y. Facilitating prediction of adverse drug reactions by using knowledge graphs and multi-label learning models. Briefings in Bioinformatics . 2019;20(1):190–202. doi: 10.1093/bib/bbx099. [DOI] [PubMed] [Google Scholar]
  • 15.Zhang W., Chen Y., Tu S., Liu F., Qu Q. Drug side effect prediction through linear neighborhoods and multiple data source integration. IEEE International Conference on Bioinformatics and Biomedicine; 2016; Shenzhen, Guangdong, China. pp. 427–434. [Google Scholar]
  • 16.Munoz E., Novacek V., Vandenbussche P. Y. Using drug similarities for discovery of possible adverse reactions. American Medical Informatics Association Annual Symposium Proceedings; 2016; USA. pp. 924–933. [PMC free article] [PubMed] [Google Scholar]
  • 17.Ding Y. J., Tang J. J., Guo F. Identification of drug-side effect association via multiple information integration with centered kernel alignment. Neurocomputing . 2019;325:211–224. doi: 10.1016/j.neucom.2018.10.028. [DOI] [Google Scholar]
  • 18.Guo X., Zhou W., Yu Y., Ding Y., Tang J., Guo F. A novel triple matrix factorization method for detecting drug-side effect association based on kernel target alignment. BioMed Research International . 2020;2020:11. doi: 10.1155/2020/4675395.4675395 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ding Y., Tang J., Guo F. Identification of drug-side effect association via semi-supervised model and multiple kernel learning. IEEE Journal of Biomedical and Health Informatics . 2019;23(6):2619–2632. doi: 10.1109/JBHI.2018.2883834. [DOI] [PubMed] [Google Scholar]
  • 20.Zhao X., Chen L., Guo Z. H., Liu T. Predicting drug side effects with compact integration of heterogeneous networks. Current Bioinformatics . 2019;14(8):709–720. doi: 10.2174/1574893614666190220114644. [DOI] [Google Scholar]
  • 21.Zhao X., Chen L., Lu J. A similarity-based method for prediction of drug side effects with heterogeneous information. Mathematical Biosciences . 2018;306:136–144. doi: 10.1016/j.mbs.2018.09.010. [DOI] [PubMed] [Google Scholar]
  • 22.Liang H., Chen L., Zhao X., Zhang X. Prediction of drug side effects with a refined negative sample selection strategy. Computational and Mathematical Methods in Medicine . 2020;2020:16. doi: 10.1155/2020/1573543.1573543 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Breiman L. Random forests. Machine Learning . 2001;45(1):5–32. doi: 10.1023/A:1010933404324. [DOI] [Google Scholar]
  • 24.Kuhn M., Campillos M., Letunic I., Jensen L. J., Bork P. A side effect resource to capture phenotypic effects of drugs. Molecular Systems Biology . 2010;6(1):p. 343. doi: 10.1038/msb.2009.98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Hu L. L., Chen C., Huang T., Cai Y. D., Chou K. C. Predicting biological functions of compounds based on chemical-chemical interactions. PLoS One . 2011;6(12, article e29491) doi: 10.1371/journal.pone.0029491. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Chen L., Zeng W. M., Cai Y. D., Feng K. Y., Chou K. C. Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities. PLoS One . 2012;7(4, article e35254) doi: 10.1371/journal.pone.0035254. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Chen L., Lu J., Zhang N., Huang T., Cai Y. D. A hybrid method for prediction and repositioning of drug anatomical therapeutic chemical classes. Molecular BioSystems . 2014;10(4):868–877. doi: 10.1039/c3mb70490d. [DOI] [PubMed] [Google Scholar]
  • 28.Chen L., Liu T., Zhao X. Inferring anatomical therapeutic chemical (ATC) class of drugs using shortest path and random walk with restart algorithms. Biochimica et Biophysica Acta-Molecular Basis of Disease . 2018;1864(6):2228–2240. doi: 10.1016/j.bbadis.2017.12.019. [DOI] [PubMed] [Google Scholar]
  • 29.Liang H. Y., Hu B., Chen L., Wang S., Aorigele Recognizing novel chemicals/drugs for anatomical therapeutic chemical classes with a heat diffusion algorithm. Biochimica et Biophysica Acta-Molecular Basis of Disease . 2020;1866(11, article 165910) doi: 10.1016/j.bbadis.2020.165910. [DOI] [PubMed] [Google Scholar]
  • 30.Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. Journal of Chemical Information and Computer Sciences . 1988;28(1):31–36. doi: 10.1021/ci00057a005. [DOI] [Google Scholar]
  • 31.Landrum G. RDKit: open-source cheminformatics. 2006. http://www.rdkit.org .
  • 32.Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research . 2017;45(D1):D353–D361. doi: 10.1093/nar/gkw1092. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Kanehisa M., Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Research . 2000;28(1):27–30. doi: 10.1093/nar/28.1.27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Kuhn M., Szklarczyk D., Pletscher-Frankild S., et al. STITCH 4: integration of protein–chemical interactions with user data. Nucleic Acids Research . 2014;42(Database issue):401–407. doi: 10.1093/nar/gkt1207. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Wishart D. S., Feunang Y. D., Guo A. C., et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Research . 2018;46(D1):D1074–D1082. doi: 10.1093/nar/gkx1037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Carlos M., Zoran K., Juan S. Predicting non-deposition sediment transport in sewer pipes using random forest. Water Research . 2021;189:p. 116639. doi: 10.1016/j.watres.2020.116639. [DOI] [PubMed] [Google Scholar]
  • 37.Jia Y., Zhao R., Chen L. Similarity-based machine learning model for predicting the metabolic pathways of compounds. IEEE Access . 2020;8:130687–130696. doi: 10.1109/ACCESS.2020.3009439. [DOI] [Google Scholar]
  • 38.Urista D. V., Carrué D. B., Otero I., et al. Prediction of antimalarial drug-decorated nanoparticle delivery systems with random forest models. Biology . 2020;9(8):p. 198. doi: 10.3390/biology9080198. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Lv Z. B., Zhang J., Ding H., Zou Q. RF-PseU: a random forest predictor for RNA pseudouridine sites. Frontiers in Bioengineering and Biotechnology . 2020;8:p. 10. doi: 10.3389/fbioe.2020.00134. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Baranwal M., Magner A., Elvati P., Saldinger J., Violi A., Hero A. O. A deep learning architecture for metabolic pathway prediction. Bioinformatics . 2020;36(8):2547–2553. doi: 10.1093/bioinformatics/btz954. [DOI] [PubMed] [Google Scholar]
  • 41.Yang Y., Chen L. Identification of drug–disease associations by using multiple drug and disease networks. Current Bioinformatics . 2022;17(1):48–59. [Google Scholar]
  • 42.Witten I. H., Frank E. Data Mining: Practical Machine Learning Tools and Techniques . 2nd ed. San Francisco: Morgan Kaufmann; 2005. [Google Scholar]
  • 43.Cortes C., Vapnik V. Support-vector networks. Machine Learning . 1995;20(3):273–297. doi: 10.1007/BF00994018. [DOI] [Google Scholar]
  • 44.Freund Y., Schapire R. E. Thirteenth International Conference on ML . Citeseer; 1996. Experiments with a new boosting algorithm. [Google Scholar]
  • 45.Breiman L. Bagging predictors. Machine Learning . 1996;24(2):123–140. doi: 10.1007/BF00058655. [DOI] [Google Scholar]
  • 46.Lee S., Shimoji S. BAYESNET: Bayesian Classification Network Based on Biased Random Competition Using Gaussian Kernels. IEEE International Conference on Neural Networks; 1993; San Francisco, CA, USA. [Google Scholar]
  • 47.Rish I. An empirical study of the naive Bayes classifier. IJCAI 2001 workshop on empirical methods in artificial intelligence; 2001; IBM New York, USA. [Google Scholar]
  • 48.Cover T., Hart P. Nearest neighbor pattern classification. IEEE Transactions on Information Theory . 1967;13(1):21–27. doi: 10.1109/TIT.1967.1053964. [DOI] [Google Scholar]
  • 49.Quinlan R. C4.5: Programs for Machine Learning. San Mateo, CA, USA: Morgan Kaufmann Publishers; 1993. [Google Scholar]
  • 50.Frank E., Witten I. H. Generating accurate rule sets without global optimization. 15th International Conference on Machine Learning; 1998; San Francisco, CA, USA. pp. 144–151. [Google Scholar]
  • 51.Sumner M., Frank E., Hall M. European Conference on Principles of Data Mining and Knowledge Discovery . Springer; 2005. Speeding up logistic model tree induction. [DOI] [Google Scholar]
  • 52.Pal S. K., Mitra S. Multilayer perceptron, fuzzy sets, classification . IEEE; 1992. [DOI] [PubMed] [Google Scholar]
  • 53.Cohen W. W. Fast effective rule induction. Machine Learning Proceedings 1995; 1995; Morgan Kaufmann Publishers, Inc; [DOI] [Google Scholar]
  • 54.Kohavi R. International joint Conference on artificial intelligence . Lawrence Erlbaum Associates Ltd.; 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. [Google Scholar]
  • 55.Zhang Y.-H., Li Z., Zeng T., et al. Detecting the multiomics signatures of factor-specific inflammatory effects on airway smooth muscles. Frontiers in Genetics . 2021;11, article 599970 doi: 10.3389/fgene.2020.599970. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Zhang Y. H., Li H., Zeng T., et al. Identifying transcriptomic signatures and rules for SARS-CoV-2 infection. Frontiers in Cell and Development Biology . 2021;8, article 627302 doi: 10.3389/fcell.2020.627302. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Pan X., Li H., Zeng T., et al. Identification of protein subcellular localization with network and functional embeddings. Frontiers in Genetics . 2021;11, article 626500 doi: 10.3389/fgene.2020.626500. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Zhu Y., Hu B., Chen L., Dai Q. iMPTCE-Hnetwork: a multi-label classifier for identifying metabolic pathway types of chemicals and enzymes with a heterogeneous network. Computational and Mathematical Methods in Medicine . 2021;2021:12. doi: 10.1155/2021/6683051.6683051 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Zhou J.-P., Chen L., Guo Z.-H. iATC-NRAKEL: an efficient multi-label classifier for recognizing anatomical therapeutic chemical classes of drugs. Bioinformatics . 2020;36(5):1391–1396. doi: 10.1093/bioinformatics/btz757. [DOI] [PubMed] [Google Scholar]
  • 60.Matthews B. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Structure . 1975;405(2):442–451. doi: 10.1016/0005-2795(75)90109-9. [DOI] [PubMed] [Google Scholar]
  • 61.Zhang Y.-H., Zeng T., Chen L., Huang T., Cai Y. D. Determining protein-protein functional associations by functional rules based on gene ontology and KEGG pathway. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics . 2021;1869(6, article 140621) doi: 10.1016/j.bbapap.2021.140621. [DOI] [PubMed] [Google Scholar]
  • 62.Chen L., Chu C., Zhang Y. H., et al. Identification of drug-drug interactions using chemical interactions. Current Bioinformatics . 2017;12(6):526–534. doi: 10.2174/1574893611666160618094219. [DOI] [Google Scholar]
  • 63.Chen L., Wang S., Zhang Y. H., et al. Identify key sequence features to improve CRISPR sgRNA efficacy. IEEE Access . 2017;5:26582–26590. doi: 10.1109/ACCESS.2017.2775703. [DOI] [Google Scholar]
  • 64.Egan J. Signal Detection Theory and ROC Analysis . New York: Academic Press; 1975. [Google Scholar]
  • 65.Liu Z., Guo F., Gu J., et al. Similarity-based prediction for anatomical therapeutic chemical classification of drugs by integrating multiple data sources. Bioinformatics . 2015;31(11):1788–1795. doi: 10.1093/bioinformatics/btv055. [DOI] [PubMed] [Google Scholar]

