Abstract
In-silico prediction of repurposable drugs is an effective drug discovery strategy that supplements de novo drug discovery. Reduced development time, lower cost and the absence of severe side effects are significant advantages of drug repositioning. Recent artificial intelligence (AI) approaches have substantially improved the throughput and accuracy of drug repurposing. However, the growing number of drugs and targets and their massive number of interactions produce imbalanced data, which may not be suitable as direct input to a classification model. Here, we propose DTI-SNNFRA, a framework for predicting drug-target interaction (DTI), based on shared nearest neighbour (SNN) and fuzzy-rough approximation (FRA). It uses sampling techniques to collectively reduce the vast search space covering the available drugs, targets and millions of interactions between them. DTI-SNNFRA operates in two stages: first, it uses SNN followed by a partitioning clustering for sampling the search space. Next, it computes the degree of fuzzy-rough approximation and selects a proper degree threshold to undersample the negative samples from all possible interaction pairs between drugs and targets obtained in the first stage. Finally, classification is performed using the positive and selected negative samples. We have evaluated the efficacy of DTI-SNNFRA using AUC (Area under ROC Curve), Geometric Mean, and F1 Score. The model achieves a high ROC-AUC prediction score of 0.95. The predicted drug-target interactions are validated through an existing drug-target database (Connectivity Map (Cmap)).
1 Introduction
Drug repositioning, also known as drug repurposing or drug reprofiling, is a drug development strategy that predicts the interactions between drugs and targets from existing drug-target databases [1]. There are two types of drug-target interaction: competitive inhibitors and allosteric inhibitors. Competitive inhibitors adhere to the target’s active site to suppress reactions. Allosteric inhibitors bind to the target’s allosteric site, which in turn prevents reactions, corrects metabolic imbalance, and kills pathogens to cure diseases. There exist several synthesized compounds whose target profiles and effects are still unknown. Research into the properties of compounds, their reactions/responses to drugs, and targets has generated large, complex databases that need efficient computational methods to analyze and predict drug-target interaction. New drug design requires more than 13.5 years and the cost exceeds 1.8 billion dollars [2, 3]. Moreover, new drugs may have unwanted side effects on patients. Therefore, owing to known side effects and easier government approval, drug repurposing enables pharmaceutical companies to launch existing authorized drugs and compounds in the market for new therapeutic purposes [4]. Drug repositioning usually reinvestigates, for new therapeutic indications, existing drugs that were denied approval.
Practical laboratory experiments to discover the interactions among drugs and targets are expensive, time-consuming and labour-intensive. Therefore, in-silico approaches are gaining attention, in which virtual screening is performed first, and then possible candidates go through experimental verification. Docking simulation is a type of in-silico approach that needs 3D structure analysis of drug and target molecules to determine the potential binding sites. Despite the excellent accuracy of this process, the unavailability of proper 3D structures of drugs and targets and the long processing time hinder docking simulation. Chemogenomics was introduced to tackle this problem; in it, the chemical space and genomic space are mined together to find potential compounds such as imaging probes and drug leads [4]. Plenty of machine learning techniques based on similarity computation, matrix factorization, network models, feature vectors, and deep learning models for DTI prediction are prevalent in the literature [1, 5]. Similarity-based approaches find how similar a new drug and target are to known drug-target pairs based on the pharmacological similarities between drugs and the genomic similarity of protein sequences. Here, similarity measures may be either chemical-based, ligand-based, expression-based, side effect-based, or annotation-based [4]. The disadvantage of this approach is that only a tiny proportion of drug-target interaction pairs are known and available for comparison. There are many matrix factorization algorithms in which, given an interaction matrix $X_{n \times m}$, the main goal is to decompose it into two lower-order matrices, $Y_{n \times k}$ and $Z_{m \times k}$, such that $X = YZ^{T}$ with $k < n, m$ [4]. The matrix completion technique is then used to compute the missing data that help in the DTI prediction task. In feature-based methods [4], the drug and target vectors are concatenated. 
A binary or real-valued label is then appended that denotes the interaction outcome or affinity score for each drug-target pair. Examples of feature-based methods include the Bagging-based Ensemble method (BE-DTI), which employs dimensionality reduction and active learning [6]. In [7], feature sub-spacing is applied first, and then three different dimensionality reduction techniques, namely Singular Value Decomposition (SVD), Partial Least Squares (PLS), and Laplacian Eigenmaps (LapEig), are used to prepare training data. Decision tree and kernel ridge regression classifiers are used there as base learners. Network-based models such as TL-HGBI and DrugNet utilize heterogeneous networks not only to predict the drug but also to recommend the way of treatment [2, 4]. In [8], matrix inverse computation is used to compute the relevance grade between two nodes in a weighted network of drug-target interactions. Deep learning-based DTI prediction utilizes the biological, topological, and physicochemical information of the drugs and targets to compute feature vectors/matrices [4, 9]. These models can capture the inherent drug-target interactions better than other state-of-the-art feature computation methods and classifiers. However, deep learning techniques sometimes cannot be applied due to the unavailability of sufficient data.
In this article, a feature-based method, DTI-SNNFRA, is proposed. Here, we have represented each drug or target by a feature vector. Initially, all the approved drug-target pairs are considered as a set of positive samples. The remaining unannotated and non-approved interaction pairs from which interaction may be predicted can be initially treated as a set of negative samples. Here, the drug-target interaction prediction task is a class imbalance problem, as most interaction pairs are unannotated. Our proposed framework predicts DTI in two phases that considerably reduce the unannotated drug-target pairs’ search space. In the first phase, from each known drug-target interaction pair, the shared nearest-neighbours (SNN) of the associated drug and target are computed using their feature vectors. Then, SNNs of the drug are clustered, and each cluster’s centroid is taken as a representative. Representative targets are also determined similarly. These representative drugs and targets are used to form drug-target pairs that are fewer and are probable candidates for possible interactions. The pairs obtained in this way are treated as negative interaction pairs.
Despite the reduction in search space, the training set created in this way is highly imbalanced. To counter this problem, in the second phase, our prediction model computes a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and a target for each of the remaining unannotated pairs. Based on different threshold cut-off values of this score, we initially divide all the unannotated drug-target pairs into positive and negative classes. The size of the negative sample set obtained in this way depends on the threshold cut-off, and if it is significantly larger than the size of the positive sample set, the drug-target interaction training dataset remains imbalanced. On the other hand, if the number of unannotated negative samples is considerably less than the number of approved positive samples, oversampling is carried out by the Adaptive Synthetic Sampling Method (ADASYN). This produces a reduced and balanced training set that can be used by any general classifier. We have applied several state-of-the-art classifiers such as SVM, decision tree, random forest, and RUSBoost to assess the correctness of the predicted interactions.
In section 2 of this article, the datasets utilized in this work, along with the method and algorithms, are explained. In section 2.3, a brief description and definition of the fuzzy-rough set based lower and upper approximations are outlined. In section 3, results and discussions are presented, and finally section 4 draws the conclusion.
2 Materials and methods
In this section, we describe the datasets used in this work, key ideas of our algorithms, and some background of the fuzzy-rough set. The building block of the proposed DTI-SNNFRA method is shown in Fig 1.
Fig 1. Building block of proposed DTI-SNNFRA method.
2.1 Dataset preparation
In this article, the drug-target interaction data is taken from the DrugBank database [10] (version 4.3, released on 17 Nov. 2015) and from the dataset mentioned in [11]. In dataset 1 [10], the number of drugs is 5877, the number of targets is 3348, and the number of interactions between the drugs and targets is 12674. Here, a drug or a target is represented by its feature vector. The drug feature vector is computed by the Rcpi [12] package and the PROFEAT [13] web server. It is represented by constitutional, topological, and geometrical descriptors. The target feature vector is computed using different types of compositions, such as amino acid, pseudo-amino acid, and CTD (composition, transition, distribution) descriptors. The numbers of features for the drugs and targets of dataset 1 are 193 and 1290, respectively.
In dataset 2 [11], the number of drugs is 1862, the number of targets is 1554, and the number of interactions between the drugs and targets is 4809. Here, each drug is represented by a binary vector known as the PubChem fingerprint. Each element of this vector indicates the presence or absence of one of the 881 chemical substructures. Similarly, each target in dataset 2 is represented as a fingerprint, an 876-dimensional binary vector. Each element of this vector indicates the presence or absence of one of 876 different protein domains, as mentioned in the Pfam database [14]. The drug feature vector and target feature vector are then concatenated to form the drug-target pair feature vector, which for dataset 1 (193 drug features followed by 1290 target features) can be represented as:
$$[d_1, d_2, \ldots, d_{193}, t_1, t_2, \ldots, t_{1290}]$$
These drug-target pair feature vectors are then normalized to the range [0, 1] by the min-max method to avoid bias towards any feature.
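The concatenation and min-max scaling described above can be illustrated with a minimal standard-library sketch. The function names and the toy feature values are assumptions made for illustration only; they are not the descriptors or the code used in this work.

```python
from typing import List


def pair_vector(drug_vec: List[float], target_vec: List[float]) -> List[float]:
    """Concatenate a drug feature vector with a target feature vector."""
    return list(drug_vec) + list(target_vec)


def min_max_normalize(rows: List[List[float]]) -> List[List[float]]:
    """Scale every column of `rows` to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ]
        for row in rows
    ]


# Toy (hypothetical) features: two drugs with 2 features, two targets with 1.
drugs = {"d1": [1.0, 10.0], "d2": [3.0, 30.0]}
targets = {"t1": [0.5], "t2": [2.5]}

pairs = [pair_vector(drugs[d], targets[t]) for d in drugs for t in targets]
normalized = min_max_normalize(pairs)
```

Scaling is applied column by column, so every feature contributes on the same [0, 1] scale regardless of its original unit.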
2.2 Workflow of the proposed framework
In this section, the necessary steps of our proposed method are described.
2.2.1 Step 1: Finding positive and negative drug-target pairs
After the normalization, only the drugs and targets which have known interactions in the interaction matrix are used to form the positive samples for the classifiers. However, the number of unannotated and non-approved interaction pairs derived from the interaction matrix is significantly greater than the number of positive samples. Note that high dimensionality and a large number of samples may have adverse effects on the prediction task. Finding characteristically similar drugs and targets using nearest-neighbour search facilitates new drug-target prediction. Determining the nearest neighbours using distance measures is sensitive to the dimensionality and the distribution of the dataset. For the popular L1 and L2 distance functions in Minkowski space, it is known that, for particular data distributions, as the dimensionality of the dataset increases, the relative difference between the distances from an independently selected point to its closest and farthest data points goes to 0. For this reason, primary distance functions such as L1, L2, and cosine are not suitable for high-dimensional data. In this context, computing shared nearest neighbours (SNN) using the primary distance functions, instead of computing nearest neighbours directly, reduces the disadvantage of higher dimensions [15]. Assume the dataset S consists of n = |S| objects and k ∈ N+. For each individual drug (or target) x ∈ S, let NNk(x) ⊆ S represent the k nearest neighbours of x. It is computed using the L2 distance measure. The overlap between the computed k-nearest-neighbour sets of the objects x and y is represented as:
$$SNN_k(x, y) = NN_k(x) \cap NN_k(y) \tag{1}$$
Algorithm 1 provides the procedure to compute shared nearest neighbours, and Algorithm 2 outlines how the training dataset is prepared for the classifiers.
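The shared-nearest-neighbour computation can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions: the function names, the toy points, the brute-force neighbour search and the index-based tie-breaking are all hypothetical choices, not the implementation used in this work.

```python
import math
from typing import List, Sequence, Set


def k_nearest(points: Sequence[Sequence[float]], k: int) -> List[Set[int]]:
    """NN_k(x) for every point, using the L2 distance (ties broken by index)."""
    neighbours = []
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        neighbours.append(set(others[:k]))
    return neighbours


def shared_neighbours(neighbours: List[Set[int]], x: int, y: int) -> Set[int]:
    """Overlap of the k-nearest-neighbour sets of objects x and y."""
    return neighbours[x] & neighbours[y]


def all_shared_neighbours(neighbours: List[Set[int]], i: int) -> Set[int]:
    """Union of the overlaps between object i and every other object r != i."""
    shared: Set[int] = set()
    for r in range(len(neighbours)):
        if r != i:
            shared |= shared_neighbours(neighbours, i, r)
    return shared


# Five toy 2-D feature vectors (hypothetical data).
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
nn = k_nearest(points, k=2)
snn_of_first = all_shared_neighbours(nn, 0)
```

The brute-force search is quadratic in the number of objects; it is used here only to keep the sketch short.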
Suppose the total numbers of drugs and targets are M and N, respectively. Assume drug $d_i$ ($1 \le i \le M$) interacts with target $t_j$ ($1 \le j \le N$). For this $d_i$, the indices of all drugs in $\bigcup_r SNN_k(d_i, d_r)$, for all $1 \le r \le M$ with $r \ne i$, are identified and assigned to snnDi. Similarly, for the target $t_j$, the indices of all targets in $\bigcup_r SNN_k(t_j, t_r)$, for all $1 \le r \le N$ with $r \ne j$, are identified and assigned to snnTj. Then the drugs in snnDi and the targets in snnTj are clustered using k-medoids clustering, and the cluster centres (medoids) are selected as representatives of snnDi and snnTj. The Calinski-Harabasz criterion is used here to determine the number of clusters. These representative drugs and targets from snnDi and snnTj are used to construct Cartesian product pairs. Subsequently, the corresponding drug vector and target vector are concatenated for each Cartesian product pair, and the resulting vectors are included in the negative sample set. Forming negative samples by the above SNN approach followed by k-medoids clustering reduces the inclusion of irrelevant drug-target pairs. For example, in dataset 1, the number of approved drug-target pairs is 12674, and the number of all possible pairs from which interaction may be predicted is 19663522. The number of drug-target pairs selected by SNN followed by k-medoids clustering is 45933, which is approximately a 428-fold reduction in the number of samples.
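The representative-selection step can be illustrated with a small sketch. It assumes a simple alternating k-medoids procedure with a fixed number of clusters and a deterministic initialisation; the selection of the number of clusters by the Calinski-Harabasz criterion is omitted, and all names and toy values are hypothetical.

```python
import math
from itertools import product


def k_medoids(points, k, max_iter=100):
    """Alternating k-medoids. Returns the sorted indices of the k medoids.

    Initialisation takes the first k points (an assumption made so the
    sketch is reproducible).
    """
    medoids = list(range(k))
    for _ in range(max_iter):
        # Assignment step: attach every point to its closest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimises the total distance to its cluster.
        new_medoids = []
        for old, members in clusters.items():
            if not members:  # keep the medoid of an empty cluster unchanged
                new_medoids.append(old)
                continue
            new_medoids.append(min(
                members,
                key=lambda c: sum(math.dist(points[c], points[o]) for o in members),
            ))
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return sorted(medoids)


# Toy shared-nearest-neighbour sets for one known drug-target pair.
snn_drugs = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
             [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
snn_targets = [[0.0], [0.1], [5.0], [5.2]]

drug_medoids = k_medoids(snn_drugs, k=2)
target_medoids = k_medoids(snn_targets, k=2)

# Cartesian product of the representatives, concatenated into pair vectors.
negative_pairs = [
    snn_drugs[d] + snn_targets[t]
    for d, t in product(drug_medoids, target_medoids)
]
```

Here six drugs and four targets would give 24 candidate pairs, whereas the two representatives per side give only four, which mirrors the search-space reduction described above on a toy scale.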
2.2.2 Step 2: Decision table preparation and average approximation degree computation
The positive and negative sets of samples obtained in section 2.2.1 are divided into m and n groups, respectively. Each group from the negative set, say nj, is paired in turn with each of the m groups from the positive set, so that m decision tables are prepared. Each decision table is used to compute the fuzzy rough upper approximation degree of each sample in the nj group. The m upper approximation degrees of each sample in the nj group are then averaged to obtain its average upper approximation degree. This average upper approximation degree computation is given in Algorithm 3.
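The grouping and averaging scheme can be sketched as follows. The sketch assumes a round-robin split into groups and takes the per-table degree computation as a caller-supplied function `degree_fn`; both are hypothetical choices, and the stand-in `toy_degree` below is not the fuzzy-rough computation itself.

```python
from statistics import mean


def split_into_groups(samples, n_groups):
    """Round-robin split of `samples` into `n_groups` lists."""
    return [list(samples[i::n_groups]) for i in range(n_groups)]


def average_upper_degrees(positives, negatives, m, n, degree_fn):
    """Average, for every negative sample, its degree over m decision tables.

    `degree_fn(pos_group, neg_group)` must return one degree per sample of
    `neg_group`, computed on the decision table formed by the two groups.
    """
    pos_groups = split_into_groups(positives, m)
    averaged = {}
    for neg_group in split_into_groups(negatives, n):
        per_table = [degree_fn(pos_group, neg_group) for pos_group in pos_groups]
        for idx, sample in enumerate(neg_group):
            averaged[sample] = mean(table[idx] for table in per_table)
    return averaged


def toy_degree(pos_group, neg_group):
    """Stand-in degree: 1 minus the distance to the closest positive sample."""
    return [max(0.0, 1.0 - min(abs(x - p) for p in pos_group)) for x in neg_group]


# One-dimensional toy samples (hypothetical values).
positives = [0.0, 1.0, 2.0, 4.0]
negatives = [0.25, 3.0]
averaged = average_upper_degrees(positives, negatives, m=2, n=1, degree_fn=toy_degree)
```

Splitting into groups keeps each decision table small, so the quadratic cost of the similarity computation is paid on group-sized tables rather than on the whole sample set.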
2.2.3 Step 3: Under-sampling based on approximation degree
A fuzzy rough grade membership is computed for every negative sample using the interactions of all positive samples via Algorithm 3. This fuzzy-rough upper approximation degree indicates the possible degree of interaction on a scale from 0 to 1. A threshold value near 1, called th1, can be chosen to select the samples whose fuzzy-rough upper approximation degree is greater than or equal to th1. Another threshold value near 0, called th0, can be chosen to select the samples whose fuzzy-rough upper approximation degree is less than or equal to th0. Both the th0-based and th1-based sample selections under-sample the negative sample set.
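As a minimal illustration of the two-threshold selection, the sketch below follows the sampling rule of Algorithm 3: samples with a degree of at least the upper threshold are moved towards the positive side, and samples with a degree of at most the lower threshold are kept as negatives. The pair identifiers, the degree values and the threshold values are hypothetical.

```python
def select_by_threshold(degrees, th0, th1):
    """Split unannotated pairs by their average upper approximation degree.

    Returns (likely_positive, reliable_negative): pairs with degree >= th1 and
    pairs with degree <= th0. Pairs strictly in between are left out.
    """
    likely_positive = [pair for pair, deg in degrees.items() if deg >= th1]
    reliable_negative = [pair for pair, deg in degrees.items() if deg <= th0]
    return likely_positive, reliable_negative


# Hypothetical degrees for four unannotated drug-target pairs.
degrees = {"d1-t1": 0.97, "d1-t2": 0.04, "d2-t1": 0.55, "d2-t2": 0.10}
likely_positive, reliable_negative = select_by_threshold(degrees, th0=0.10, th1=0.95)
```

Lowering th0 shrinks the negative set, which is how the threshold controls the balance between the two classes.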
2.2.4 Step 4: Oversampling, if required
The datasets used here have several approved drug-target pairs, which are treated as a set of positive samples. The remaining pairs, which are unannotated, may or may not interact with each other. These unannotated (and also non-approved) interaction pairs, from which interaction is predicted, are enormous in number. For example, in dataset 1, the number of approved drug-target pairs is 12674, and the number of remaining unannotated pairs is 19663522. Initially, we reduce the number of unannotated pairs (i.e. those initially treated as a set of negative samples) by using shared nearest neighbours in Step 1 (section 2.2.1). The number of unannotated negative samples selected by SNN remains higher than the number of positive samples. Our prediction model then computes a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and a target for each of the remaining unannotated pairs. Based on different threshold cut-off values of this score, we initially divide all the unannotated drug-target pairs into positive and negative classes. The size of the negative sample set obtained in this way depends on the threshold cut-off, and if it is significantly larger than the size of the positive sample set, the drug-target interaction training dataset remains imbalanced. Therefore, we select one threshold value of the grade membership degree to under-sample the negative samples so that the numbers of negative and positive samples are approximately equal. In this case, no oversampling is needed. However, if we select a threshold value for which the number of negative samples is less than the number of positive samples, oversampling of the negative samples is required to balance the negative and positive samples.
2.2.5 Step 5: Interaction prediction
The dataset obtained in section 2.2.4 is then used to train the classifiers and to predict drug-target interactions among the pairs of the negative set.
2.3 Fuzzy rough set
Assume that the drug-target pairs obtained from the given interaction matrix and the SNN-based initial filtering constitute a decision table. In this table, every row is described by m features, i.e. C = {fi: 1 ≤ i ≤ m}, and one decision attribute D = {d}. In this decision table, the extent to which two objects are indiscernible is determined by calculating the fuzzy indiscernibility relation (FIR). Subsequently, this indiscernibility relation is used to determine the fuzzy-rough lower and upper approximations. The fuzzy lower and upper approximations of a concept Y, using a fuzzy similarity relation (either a fuzzy equivalence or a tolerance relation) in accordance with Radzikowska’s model, are defined as [16]:
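$$(R_P \downarrow Y)(x) = \inf_{y \in U} I\big(R_P(x, y), Y(y)\big) \tag{2}$$
$$(R_P \uparrow Y)(x) = \sup_{y \in U} T\big(R_P(x, y), Y(y)\big) \tag{3}$$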
Here, in Eqs 2 and 3, I indicates a fuzzy implicator, T denotes a t-norm, and $R_P$ is the fuzzy similarity relation computed from the feature subset P ⊆ C. The fuzzy similarity relation $R_P$ used in the fuzzy lower and upper approximations of Eqs 2 and 3 can be calculated for the feature subset P ⊆ C as follows.
$$R_P(x, y) = T_{f \in P} \{ R_f(x, y) \} \tag{4}$$
Here, $R_f(x, y)$ denotes the similarity degree between objects x and y with respect to feature f. Some examples of fuzzy similarity relations are given below:
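$$R_f(x, y) = 1 - \frac{|f(x) - f(y)|}{|f_{max} - f_{min}|} \tag{5}$$
$$R_f(x, y) = \exp\left(-\frac{(f(x) - f(y))^2}{2\sigma_f^2}\right) \tag{6}$$
$$R_f(x, y) = \max\left(\min\left(\frac{f(y) - (f(x) - \sigma_f)}{f(x) - (f(x) - \sigma_f)}, \frac{(f(x) + \sigma_f) - f(y)}{(f(x) + \sigma_f) - f(x)}\right), 0\right) \tag{7}$$
where $\sigma_f^2$ stands for the variance of feature f.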
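For illustration, three fuzzy similarity relations that are commonly used in the fuzzy-rough literature (a range-normalised difference, a Gaussian form and a triangular form) can be sketched as below, together with their combination over a feature subset by a t-norm. Treating these three forms as the ones meant in Eqs 5 to 7 is an assumption, and the function names and toy values are hypothetical.

```python
import math
from functools import reduce


def sim_range(a, b, f_min, f_max):
    """Range-normalised similarity: 1 - |a - b| / |f_max - f_min|."""
    return 1.0 - abs(a - b) / abs(f_max - f_min)


def sim_gaussian(a, b, sigma):
    """Gaussian similarity: exp(-(a - b)^2 / (2 * sigma^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def sim_triangular(a, b, sigma):
    """Triangular similarity of half-width sigma centred on a, clipped at 0."""
    return max(min((b - (a - sigma)) / sigma, ((a + sigma) - b) / sigma), 0.0)


def lukasiewicz(u, v):
    """Lukasiewicz t-norm: max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)


def relation_over_features(per_feature_sims, t_norm=lukasiewicz):
    """Combine per-feature similarity degrees into one degree with a t-norm."""
    return reduce(t_norm, per_feature_sims)


# Two toy objects with features already scaled to [0, 1].
x, y = [0.2, 0.9], [0.4, 0.8]
per_feature = [sim_range(a, b, 0.0, 1.0) for a, b in zip(x, y)]
r_xy = relation_over_features(per_feature)
```

With the Lukasiewicz t-norm the combined degree drops quickly as the number of features grows, which is one practical reason to reduce the feature dimension before computing the relation.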
Upper approximation degree computation
In Fig 1, the fuzzy rough upper approximation degree is computed as follows:
Computing fuzzy indiscernibility relation of conditional attributes using the Lukasiewicz t-norm and tolerance relation, as mentioned in section 2.3.
Computing fuzzy indiscernibility relation of decision attribute using its crisp relation.
Computing fuzzy upper approximation using the Lukasiewicz t-norm as per the Eq 3.
This fuzzy upper approximation degree can be used to select the samples from the negative samples set.
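The three steps above can be sketched on a toy decision table. The sketch assumes features scaled to [0, 1], a per-feature tolerance relation of the form 1 − |a − b|, the Lukasiewicz t-norm both for combining features and for the upper approximation, and a crisp (0/1) decision attribute; the names and values are hypothetical and this is not the implementation used in this work.

```python
from functools import reduce


def lukasiewicz(u, v):
    """Lukasiewicz t-norm: max(u + v - 1, 0)."""
    return max(u + v - 1.0, 0.0)


def similarity(x, y):
    """Fuzzy similarity of two objects over all conditional attributes."""
    return reduce(lukasiewicz, (1.0 - abs(a - b) for a, b in zip(x, y)))


def upper_degree(x, table):
    """Upper approximation degree of x with respect to the positive class.

    `table` is a list of (features, label) rows with label in {0, 1}; the
    supremum over the finite table is a maximum.
    """
    return max(
        lukasiewicz(similarity(x, feats), float(label))
        for feats, label in table
    )


# Toy decision table: two positive rows and one negative row.
table = [([0.1, 0.2], 1), ([0.9, 0.8], 1), ([0.5, 0.5], 0)]
degrees = [upper_degree(feats, table) for feats, _ in table]
```

Because the decision attribute is crisp, the degree of a negative sample reduces to its similarity to the most similar positive sample, which is why it can be read as an interaction strength.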
Data preprocessing for upper approximation degree computation
To reduce the dimension of the feature vectors of the two datasets, we have utilized a dimensionality reduction method called incremental PCA. The feature dimensions of a drug, target, and drug-target pair are 193, 1290, and 1483 for dataset 1 and 881, 876, and 1757 for dataset 2. Incremental PCA is used to reduce the high computational cost of the fuzzy similarity computation (see Eq 4). This fuzzy similarity relation is further used to compute the upper or lower approximation. The computational complexity of computing the upper/lower approximation is $O(|N|^2 \times |D|)$, where |N| is the size of the universe and |D| is the number of decision classes. The computational complexity of the fuzzy similarity relation is $O(|N|^2 \times |C|)$, where |C| is the number of attributes. For a single attribute, the computational complexity of the similarity relation is $O(|N|^2 \times 1)$. For the attribute set C, |C| similarity relations are held in memory, which incurs a high computational cost. The situation becomes even worse for a high-dimensional dataset. To tackle this issue, we use incremental PCA, which processes the whole data by splitting it into mini-batches. Each batch easily fits into memory and is given as input to the incremental PCA one at a time. Note that classical PCA and its variations (sparse PCA, kernel PCA) may also be applicable here, but they result in a high computational cost; for high-dimensional data in particular, they may not be feasible in practice.
Algorithm 1: sharedNN
Input: D = feature matrix for the drug
T = feature matrix for the target
Output: shared nearest neighbors represented by feature vectors
k ← Neighborhood size
X ← D or T
n ← sampleSize(X)
distances = pairWiseDistance(X)
sorted, indexes = sort(distances, ascendOrder)
for i ← 1 to n do
    sharedNN = []
    for j ← 1 to n, j ≠ i do
        C = intersect(indexes(i, 2:k + 1), indexes(j, 2:k + 1))
        sharedNN = sharedNN ⋃ X(C)
Algorithm 2: Dataset Preparation
Input: DT = drug-target interaction matrix
D = feature matrix for the drug
T = feature matrix for the target
Output: labeled TrainingDataSet
P ← {} % P = positive samples set
N ← {} % N = negative samples set
for i ← 1 to m do
    for j ← 1 to n do
        if DT(i, j) = 1 then
            P ← P ∪ concat(drugVeci, targetVecj)
            /* drugVeci: ith drug vector, targetVecj: jth target vector */
            tempDi ← sharedNN(drugVeci)
            snnDi ← optimalKmedoidsCentroids(tempDi)
            tempTj ← sharedNN(targetVecj)
            snnTj ← optimalKmedoidsCentroids(tempTj)
            N ← N ∪ cartesianProductPairConcat(snnDi, snnTj)
TrainingDataset ← P ∪ N
Algorithm 3: Average FRUA degree computation and sampling.
Data: Imbalanced TrainingDataset with M samples {xi, yi} where i = 1 to M, xi is a d-dimensional vector in the drug-target pair feature space and yi ∈ {0, 1}. Assume Mp and Mq represent the number of minority and majority class samples respectively, such that Mp ≤ Mq and Mp + Mq = M
Result: BalancedTraingDataset
Begin
function upperAproxCalc(decisionTable)
begin
    uDegree ← {} /* upper approximation degree vector */
    objCount ← sizeof(decisionTable) /* No. of objects in decision table */
    for k ← 1 to objCount do
        uDegree(k) ← fuzzy-rough upper approximation degree of the kth object as per Eq 3 /* here C: conditional attributes set */
    return uDegree
end
/* Split Mp and Mq into m and n groups respectively */
split(Mp) → m groups
split(Mq) → n groups
totalNoGroupPair ← m × n /* total no. of group pairs between m and n */
allGroupPairIndices ← cartesianProduct(seq(1: m), seq(1: n)) /* It holds 1 to m × n indices where ith index holds ith pair */
for i ← 1 to totalNoGroupPair do
    (groupIndexOfm, groupIndexOfn) ← allGroupPairIndices(i) /* groupIndexOfm, groupIndexOfn: group index nos. from the m and n groups respectively */
    decisionTablei ← positive samples taken from groupIndexOfm ∪ negative samples taken from groupIndexOfn
    Ui ← upperAproxCalc(decisionTablei) /* Ui holds the upper approx. degree of all samples in both groups */
FRUA:
    FRUA(x) ← average of the upper approx. degrees of x over all groupIndexOfm ∈ seq(1:m), for each sample x in each group groupIndexOfn ∈ seq(1:n)
Sampling:
    tp and tq are the thresholds for Mp and Mq
    Z ← ∅
    for x ∈ Mq do
        if FRUA(x) ≥ tp then
            Mp ← Mp ∪ x
        if FRUA(x) ≤ tq then
            Z ← Z ∪ x
    BalancedTraingDataset = ADASYN(Mp ∪ Z)
End
3 Results and discussions
3.1 Performance metrics
This section reports the experimental results using three metrics: ROC-AUC, F1 score, and Geometric Mean score [17]. The ROC-AUC provides a single score that can be used to compare models. It ranges from 0 to 1, where 1 indicates a perfect model, 0.5 represents a model with no prediction skill, and values less than 0.5 indicate that the prediction skill is worse than no skill. The ROC-AUC performance evaluation is insensitive to highly imbalanced datasets. How well a model predicts the positive class and the negative class is represented by the sensitivity and the specificity, respectively. The sensitivity and specificity can be integrated into a single score called the geometric mean, given by sqrt(Sensitivity * Specificity), where Sensitivity = TruePositive / (TruePositive + FalseNegative) and Specificity = TrueNegative / (FalsePositive + TrueNegative).
The F1-score can be used to achieve a balance between Precision and Recall. It is also used where class imbalance is present. All three scores are calculated using 5-fold cross-validation, and the average AUC, F1-score and G-mean score are computed. Note that datasets 1 and 2, as mentioned in section 2, are used for prediction.
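The sensitivity, specificity, geometric mean and F1-score can be computed directly from the four confusion-matrix counts. The sketch below uses the definitions given above together with the usual definition of precision; the counts in the example are hypothetical and are not results from this work.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, G-mean and F1 from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (fp + tn)   # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": f1,
    }


# Hypothetical counts for one cross-validation fold.
scores = imbalance_metrics(tp=80, fp=10, tn=90, fn=20)
```

The G-mean drops to zero whenever either class is entirely misclassified, which makes it a stricter summary than accuracy on imbalanced data.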
3.2 Proposed method vs some state-of-the-art sampling techniques
The proposed method deals with the imbalanced classification problem in drug-target interaction prediction. We have compared it with five state-of-the-art sampling techniques for imbalanced datasets: RUS, SMOTE, ADASYN, SMOTEENN, and SMOTETomek. Four classifiers, namely decision tree (DT), random forest (RF), SVM, and RUSBoost, are used to evaluate the performance of our proposed method. The ROC-AUC, F1, and G-Mean scores of the proposed method, shown in Fig 2, are better than those of all the sampling methods. RUS and SMOTE perform poorly here on high-dimensional training data, as noted in [18]. ADASYN pays much attention to those samples of the minority class that are harder to learn. As our proposed method initially uses SNN, there may not be many samples that are harder to learn or that are outliers. For this reason, directly using ADASYN, unlike our proposed method, does not produce satisfactory results here. The Tomek links in SMOTETomek and the edited nearest neighbours in SMOTEENN are used to clean noisy samples or marginal outliers in the training data. SMOTEENN and SMOTETomek do not perform well because there are no noisy samples or marginal outliers in the training data (owing to the shared nearest neighbours computation).
Fig 2. Panels (A) and (B) show the performance on the two datasets.
The AUC, F1 and G-mean scores under the decision tree, random forest and support vector machine classification models are shown for various sampling methods.
3.3 Comparisons with state-of-the-art methods
We have compared the proposed method with five state-of-the-art methods: DeepPurpose [19], RLS-avg (Regularized Least Squares-Average) [20], RLS-kron (Regularized Least Squares-Kronecker product) [21], EnsemDT [7], and EnsemKRR [7]. DeepPurpose [19] is a deep learning-based method for drug-target interaction prediction. It is an encoder-decoder framework that uses eight encoders for a compound (drug) and seven encoders for an amino acid sequence (target). For this encoding, it uses deep neural networks, 1D convolutional neural networks, recurrent neural networks, transformer encoders, and message-passing neural networks. The drug-target pairs of our method, along with their fuzzy-rough upper approximation scores, are compatible with the input data of the DeepPurpose model. The results in Table 1 show that the proposed method performs better than DeepPurpose in terms of ROC-AUC score on the same data. For each of the remaining methods, we have utilized three different dimensionality reduction techniques, namely Singular Value Decomposition (SVD), Partial Least Squares (PLS), and Laplacian Eigenmaps (LapEig), for the preparation of training data. The results in Table 1 show that our proposed method achieves satisfactory ROC-AUC results (0.955, 0.961, 0.951 and 0.947 for dataset 1 and 0.930, 0.943, 0.970 and 0.912 for dataset 2 using the DT, RF, SVM and RUSBoost classifiers, respectively).
Table 1. Comparisons with the five state-of-the-art methods.
| Methods | Dataset 1 (AUC) | Dataset 2 (AUC) |
|---|---|---|
| RLS-avg, SVD | 0.912 | 0.899 |
| RLS-avg, PLS | 0.915 | 0.918 |
| RLS-avg, LapEig | 0.909 | 0.916 |
| RLS-kron, SVD | 0.889 | 0.873 |
| RLS-kron, PLS | 0.899 | 0.913 |
| RLS-kron, LapEig | 0.889 | 0.874 |
| EnsemDT, SVD | 0.899 | 0.914 |
| EnsemDT, PLS | 0.902 | 0.898 |
| EnsemDT, LapEig | 0.901 | 0.914 |
| EnsemKRR, SVD | 0.942 | 0.931 |
| EnsemKRR, PLS | 0.941 | 0.930 |
| EnsemKRR, LapEig | 0.941 | 0.930 |
| DeepPurpose | 0.938 | 0.911 |
| Proposed (DT) | 0.955 | 0.930 |
| Proposed (RF) | 0.961 | 0.943 |
| Proposed (SVM) | 0.951 | 0.970 |
| Proposed (RUSBoost) | 0.947 | 0.912 |
We have only provided the ROC-AUC scores of all these competing methods due to unavailability of the F1 and G-Mean scores in [7]. The parameters of RLS-avg, RLS-kron, EnsemDT, and EnsemKRR are set to the default values as specified in [7, 20, 21].
3.4 Tuning of hyperparameters
The proposed method performs grid search-based hyperparameter tuning for computing ROC-AUC, F1, and G-Mean scores. For the DT classifier, the best ROC-AUC, F1, and G-Mean scores for dataset 1 are obtained using the hyperparameter combination criterion: gini, maxDepth: 9, minSamplesLeaf: 1, minSamplesSplit: 6. For dataset 2, the best ROC-AUC, F1, and G-Mean scores are achieved by criterion: gini, maxDepth: 9, minSamplesLeaf: 1, minSamplesSplit: 4. In the case of the RF classifier, for both dataset 1 and dataset 2, the best hyperparameter combination is determined as criterion: gini, maxDepth: 20, minSamplesLeaf: 3, minSamplesSplit: 8, nEstimators: 200, giving ROC-AUC scores of 0.961 and 0.943, respectively. Fig 3(A) and 3(B) demonstrate the variation of the AUC score of the decision tree with respect to only two hyperparameters, tree_depth and max_feature. In Fig 3(C), a heatmap is shown for the hyperparameters (n_estimators, max_depths) of the random forest model. The maximum depth of the tree is determined by expanding nodes until all leaves are pure or until all leaves contain fewer than minSamplesSplit samples. The number of features for both RF and DT is maxFeatures = sqrt(nFeatures). The best hyperparameter combination for SVM on dataset 1 is determined as kernel: RBF, C: 10.0, gamma: 0.1. For dataset 2, the best ROC-AUC, F1, and G-Mean scores of 0.97, 0.93, and 0.929 are achieved using kernel: RBF, C: 1.0, gamma: 0.1. Fig 3(D) shows the ROC-AUC scores with respect to two hyperparameters (C, gamma) for dataset 2.
Fig 3.
Fig (A) and (B) represent the hyperparameters of decision tree called max feature and tree depth vs AUC graph for dataset 1, respectively. In (C), the hyperparameters of random forest along with the AUC scores are shown in the heatmap. Fig. (D) represents one heatmap for AUC scores of SVM for two hyperparameters called C and gamma.
To prepare negative drug-target pairs, the number of nearest neighbours is 11, which is later used to compute the shared nearest neighbours. We observed that for 11 nearest neighbours, the shared nearest neighbours computation step determines the number of drugs and targets that have a good balance between the number of samples and feature dimension.
3.5 Feature selection and comparisons
In Fig 4(A) and 4(B), the prediction scores in terms of ROC-AUC values are shown for both datasets with and without feature selection. In our method, after SNN computation followed by k-medoids clustering, we computed a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and a target for each of the unannotated pairs. Based on different threshold cut-off values of this score, we divided all the unannotated drug-target pairs into positive and negative classes. The negative samples detected from the unannotated pairs via the fuzzy rough upper approximation score and the initially obtained annotated positive samples constitute the input data for RUSBoostClassifier. Fig 4(A) and 4(B) show the RUSBoostClassifier results for different threshold cut-off values of the fuzzy rough upper approximation score. In these experiments, we used the holdout strategy with a training-to-testing ratio of 70:30. In Table 1, the ROC-AUC scores of RUSBoostClassifier for one threshold cut-off value, for dataset 1 and dataset 2, are obtained by hyperparameter tuning using grid search. The best hyperparameters are determined as nEstimators: 500, learningRate: 1.0, which produce ROC-AUC scores of 0.9477 and 0.912 for dataset 1 and dataset 2, respectively. The RUSBoostClassifier is used here because it mitigates the class imbalance problem during learning by random under-sampling of the samples at each iteration of boosting. For feature selection, the feature importance scores were computed using XGBoost and random forest. For both feature importance computation methods, the positive and negative samples are split into many groups, where the numbers of positive and negative samples in each group are approximately equal. Each positive-negative group pair was individually passed to the XGBoost and random forest classifiers for computing the feature importance. 
Finally, the feature importance scores are averaged and the top 100 features are retained for prediction. Without feature selection, the average execution times over 50 thresholds for dataset 1 and dataset 2 are 617.66 s and 232.07 s, respectively. With feature selection, the average execution times over 50 thresholds for dataset 1 and dataset 2 are 232.07 s and 77.61 s, respectively.
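The group-wise averaging of feature importance described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the grouping (the full positive set paired with disjoint, similarly sized chunks of negatives) is one possible reading of the text, and `mean_gap_importance` is a simple stand-in for the XGBoost and random forest importance scores used in the paper.

```python
import random
from statistics import mean


def balanced_groups(positives, negatives, seed=0):
    """Pair the positive set with disjoint chunks of negatives of about equal size.

    Assumption: the last chunk may be smaller than the positive set.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = max(1, len(positives))
    chunks = [shuffled[i:i + size] for i in range(0, len(shuffled), size)]
    return [(positives, chunk) for chunk in chunks]


def mean_gap_importance(pos, neg):
    """Stand-in importance (assumption): absolute gap between class means per feature."""
    n_features = len(pos[0])
    return [
        abs(mean(row[j] for row in pos) - mean(row[j] for row in neg))
        for j in range(n_features)
    ]


def top_k_features(positives, negatives, k, importance=mean_gap_importance):
    """Average the importance over all balanced groups; return the k best feature indices."""
    groups = balanced_groups(positives, negatives)
    per_group = [importance(pos, neg) for pos, neg in groups]
    n_features = len(per_group[0])
    averaged = [mean(scores[j] for scores in per_group) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: averaged[j], reverse=True)
    return ranked[:k]
```

Any importance function with the same signature, for example one wrapping a tree ensemble, could be passed as `importance`; with `k = 100` the sketch mirrors the top-100 selection reported here.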
Fig 4.
(A) and (B) show threshold vs AUC for dataset 1 and dataset 2, respectively, with and without feature selection. (C) and (D) show M vs sensitivity for the two datasets using five thresholds. (E) and (F) show classification errors for dataset 1 and dataset 2, respectively, using one threshold.
3.6 Sensitivity vs number of base learners and classification errors
Fig 4(C) and 4(D) plot M against sensitivity for both datasets, where M is the number of base learners, ranging from 1 to 50. This experiment is carried out for a few threshold values. For each threshold, the variation of the ROC-AUC is minimal. The classification error, i.e. the proportion of samples that the classifier misclassified, is reported in Fig 4(E) and 4(F).
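For reference, the two quantities plotted in Fig 4(C) to 4(F) can be computed from true and predicted labels as in the following minimal sketch. The function name and the 0/1 label encoding (1 for an interacting pair) are assumptions made for illustration.

```python
def sensitivity_and_error(y_true, y_pred):
    """Return (sensitivity, classification_error) for binary labels in {0, 1}."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    wrong = sum(1 for t, p in pairs if t != p)          # all misclassified samples
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return sensitivity, wrong / len(pairs)
```

Sensitivity is TP / (TP + FN), and the classification error is the share of misclassified samples among all samples, so both would be evaluated once per value of M to trace curves of this kind.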
3.7 Drug-target interaction of the proposed method
Table 2 lists some existing and predicted drug-target interactions. To test the efficacy of the proposed method, we omitted several known interactions from the training data, trained our model with the remaining data, and verified the prediction results. The model successfully recovered (predicted) those omitted known interactions. Seven drugs for the target Serine hydroxymethyltransferase, cytosolic are predicted correctly, five of which are listed in Table 2. For the same target, we predicted five additional interactions with drugs. The table similarly shows correctly predicted and novel interactions for other targets and drugs. Fig 5 shows some drug-target interactions, along with some interactions between treatment areas and drugs.
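The hold-out check described in this section amounts to a few set operations. The sketch below is illustrative only; the function name and the representation of an interaction as a (drug, target) tuple are assumptions.

```python
def holdout_recovery(predicted, held_out, training_known):
    """Compare predictions with interactions that were hidden during training.

    Returns (recovered, novel, recovery_rate), where `novel` holds predicted
    pairs that are neither held out nor known from the training data.
    """
    predicted = set(predicted)
    held_out = set(held_out)
    training_known = set(training_known)
    recovered = predicted & held_out
    novel = predicted - held_out - training_known
    rate = len(recovered) / len(held_out) if held_out else float("nan")
    return recovered, novel, rate
```

Pairs in `recovered` correspond to the "correct prediction of existing interactions" column of Table 2, and pairs in `novel` to the "novel predicted interactions" column.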
Table 2. Drug target interaction verification and new interaction by the proposed method.
| Target or drug | Interacting entities | Correct prediction of existing interactions | Novel predicted interactions |
|---|---|---|---|
| Target name: Serine hydroxymethyltransferase, cytosolic | Drugs | Mimosine | Pyridostigmine |
| | | Pyridoxal phosphate | Willardiine |
| | | Glycine | acetamides |
| | | tetrahydrofolic acids | Betamipron |
| | | N-Pyridoxyl-Glycine-5-Monophosphate | Tyrosine |
| Target name: Monoamine oxidase | Drugs | Amphetamine | Diethylpropion |
| | | Phentermine | Ethinamate |
| | | Tranylcypromine | Alprenolol |
| | | Phenelzine | Phenylephrine |
| | | Selegiline | Probenecid |
| Drug name: alpha-D-glucose 6-phosphate | Targets | Glucose-6-phosphate isomerase | Peptide deformylase |
| | | Glycogen phosphorylase, muscle form | Adenylate kinase isoenzyme 1 |
| | | Aldose reductase | Adenosylhomocysteinase |
| | | Glutamine–fructose-6-phosphate aminotransferase [isomerizing] | Phosphoheptose isomerase |
| | | Hexokinase-1 | Low molecular weight2 tyrosine protein phosphatase |
| Drug name: Adenosine-5-Diphosphoribose | Targets | MutT/nudix family protein | Enoyl-[acyl-carrier-protein] reductase [NADH] FabI |
| | | p-hydroxy-benzoate hydroxylase | GDP-mannose 6-dehydrogenase |
| | | Glyceraldehyde-3-phosphate dehydrogenase | RNA-directed RNA polymerase |
| | | Lactaldehyde reductase | Serine hydroxymethyltransferase |
| | | Elongation factor 2 | Bifunctional protein BirA |
Fig 5. Some drug-target interactions with treatment areas of the drugs.
3.8 Drug-target interaction validation
To verify our drug-target interaction predictions, we used the Connectivity Map (Cmap) [22] prediction results provided by the Broad Institute. Drug and target names in the DrugBank dataset are represented differently in Cmap. Therefore, we converted between DrugBank IDs and Cmap identifiers using the webchem R package [23]. This R package retrieves chemical information from the web using a suite of 14 web services.
Our predicted drug-target pairs for the DrugBank dataset were passed to the webchem package, which fetched information only from Wikidata. Because the other web services in the webchem suite lacked the required information, we did not obtain a complete matching between our predictions and the Cmap predictions. Table 3 lists 50 drug-target interaction pairs predicted by our method. The thirty-four interaction pairs that are also available in the Cmap prediction database are marked in bold face.
Table 3. Drug-target interactions by proposed method.
| Drug | Target | FruaScore | Drug | Target | FruaScore |
|---|---|---|---|---|---|
| DB04094 | Q9Y296 | 0.933385 | DB00839 | Q09428 | 0.814468 |
| DB03750 | P0CG47 | 0.933299 | DB00476 | P28335 | 0.810978 |
| DB03988 | Q9Y296 | 0.933073 | DB00450 | P35462 | 0.806337 |
| DB03320 | Q9Y296 | 0.932387 | DB00776 | P35498 | 0.804604 |
| DB08242 | P0AEK4 | 0.932214 | DB00929 | P43119 | 0.803532 |
| DB08137 | P0AEK4 | 0.932189 | DB00433 | P35462 | 0.802923 |
| DB07153 | P16184 | 0.932128 | DB00794 | Q14524 | 0.799097 |
| DB00992 | Q9Y296 | 0.932054 | DB00917 | P21731 | 0.798244 |
| DB04789 | P16184 | 0.932053 | DB01121 | Q14524 | 0.795084 |
| DB07000 | P0AEK4 | 0.932018 | DB00645 | Q14524 | 0.793230 |
| DB04197 | Q9Y296 | 0.932002 | DB00850 | P35367 | 0.764447 |
| DB07281 | P0AEK4 | 0.931912 | DB04846 | P08913 | 0.759809 |
| DB03448 | P0A884 | 0.931780 | DB00782 | P08172 | 0.758948 |
| DB04796 | P14867 | 0.931678 | DB01365 | P08913 | 0.751881 |
| DB02456 | P0A884 | 0.931636 | DB01121 | Q9NY46 | 0.751538 |
| DB04680 | P0CG29 | 0.931635 | DB03719 | P30542 | 0.747386 |
| DB01248 | P07437 | 0.922451 | DB00670 | P08172 | 0.745866 |
| DB00518 | P07437 | 0.919137 | DB07954 | P30542 | 0.744886 |
| DB00391 | P00915 | 0.915100 | DB00794 | Q9Y5Y9 | 0.730465 |
| DB01248 | Q13509 | 0.914888 | DB00776 | Q9Y5Y9 | 0.710952 |
| DB01248 | P68363 | 0.911210 | DB00252 | Q9Y5Y9 | 0.709006 |
| DB05294 | Q15303 | 0.904014 | DB00999 | Q08460 | 0.594489 |
| DB00361 | P68363 | 0.897636 | DB01119 | Q08460 | 0.589146 |
| DB01121 | P35499 | 0.824893 | DB00356 | Q08460 | 0.583733 |
| DB04846 | P07550 | 0.816920 | DB03719 | P29274 | 0.556650 |
We have also observed that most of the predicted drug-target interaction pairs shown in Table 3, e.g. (DB01248, P07437), (DB04846, P07550), (DB00839, Q09428), (DB00450, P35462), (DB00776, Q9Y5Y9) and (DB00776, P35498), are also reported in [24–28].
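The matching step of this section, including the effect of incomplete identifier conversion, can be sketched as follows. This is a hypothetical illustration: the function name, the dictionary-based identifier map, and the toy identifiers in the usage lines are assumptions and do not reproduce the webchem workflow or any real DrugBank or Cmap entries.

```python
def match_against_reference(predicted, id_map, reference):
    """Split predicted (drug_id, target) pairs into matched, unmatched and unmappable.

    id_map    : dict from the prediction-side drug ID to the reference-side name
    reference : set of (drug_name, target) pairs in the reference database
    """
    matched, unmatched, unmappable = [], [], []
    for drug_id, target in predicted:
        name = id_map.get(drug_id)
        if name is None:
            # No identifier conversion available, so the pair cannot be checked.
            unmappable.append((drug_id, target))
        elif (name, target) in reference:
            matched.append((drug_id, target))
        else:
            unmatched.append((drug_id, target))
    return matched, unmatched, unmappable


# Toy usage with made-up identifiers.
predicted = [("D1", "T1"), ("D2", "T2"), ("D3", "T3")]
id_map = {"D1": "drug_a", "D2": "drug_b"}
reference = {("drug_a", "T1")}
matched, unmatched, unmappable = match_against_reference(predicted, id_map, reference)
```

Pairs that end up in `unmappable` explain why a complete matching cannot be expected when the identifier conversion is partial. With the counts reported for Table 3, the matched fraction is 34 / 50 = 0.68.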
4 Conclusion
In this article, we presented a novel computational approach for drug-target interaction prediction that utilizes existing drug-target data. There are two critical issues in this domain: the massive number of drugs and targets creates a vast search space, and the drug-target interaction dataset is highly imbalanced because only a tiny number of drug-target interactions have been unveiled so far. Thus, the number of negative samples is much larger than the number of positive samples.
Here, we have used shared nearest neighbours rather than a fixed number of nearest neighbours, as they are more effective in higher-dimensional datasets. The reason is that the overlap between the neighbourhoods of a pair of drugs (or targets) inside the same cluster is typically substantially larger than the overlap between the neighbourhoods of a pair of drugs (or targets) belonging to different clusters. Moreover, to tackle the imbalanced dataset, these shared nearest neighbours are further grouped by k-medoids. The representative medoids for the drug and the target are then considered as new possible drug-target interaction pairs for each known drug-target pair. Additionally, to deal further with the imbalance, we have computed the degree of fuzzy-rough upper approximation of all the possible interaction pairs in the negative samples to perform undersampling. Then, by selecting a threshold on the computed degrees, the sizes of the negative and positive sample sets are balanced. This upper approximation degree-based undersampling of the negative samples improves the prediction scores. Computing the degree of fuzzy-rough upper approximation is challenging because the dimension of the interaction pairs is exceptionally high. The execution time of this computation is directly proportional to the number of features. Therefore, further investigation of fuzzy-rough set based feature selection followed by fuzzy-rough upper approximation computation may improve the prediction score. Instead of a single threshold for undersampling, multiple threshold-based undersampling may be investigated for tackling imbalanced datasets. Moreover, oversampling the positive samples to balance them with the negative samples may be explored to improve the prediction score. We believe that DTI-SNNFRA may be a promising framework for drug-target interaction prediction.
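The neighbourhood-overlap argument for shared nearest neighbours can be made concrete with a small sketch. It assumes Euclidean distance and a brute-force k-nearest-neighbour search, and the function names are hypothetical; the feature representation and neighbourhood size used in DTI-SNNFRA may differ.

```python
import math


def k_nearest(points, k):
    """Return, for each point, the set of indices of its k nearest other points."""
    neighbours = []
    for i, p in enumerate(points):
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        neighbours.append(set(order[:k]))
    return neighbours


def snn_similarity(neighbours, i, j):
    """Shared-nearest-neighbour similarity: size of the overlap of two k-NN sets."""
    return len(neighbours[i] & neighbours[j])


# Two well-separated toy clusters of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
neighbours = k_nearest(points, k=3)
within = snn_similarity(neighbours, 0, 1)   # both points in the first cluster
between = snn_similarity(neighbours, 0, 4)  # points in different clusters
```

In this toy example the within-cluster pair shares two neighbours, while the between-cluster pair shares none, which is the behaviour the overlap argument relies on.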
Supporting information
(ZIP)
Data Availability
All relevant data are within the manuscript and its Supporting information files.
Funding Statement
The author(s) received no specific funding for this work.
References
- 1. Sachdev K, Gupta MK. A comprehensive review of feature based methods for drug target interaction prediction. Journal of Biomedical Informatics. 2019;93:103159. doi:10.1016/j.jbi.2019.103159
- 2. Cui Z, Gao YL, Liu JX, Wang J, Shang J, Dai LY. The computational prediction of drug-disease interactions using the dual-network L2,1-CMF method. BMC Bioinformatics. 2019;20(1):5. doi:10.1186/s12859-018-2575-6
- 3. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction via class imbalance-aware ensemble learning. BMC Bioinformatics. 2016;17(19):509. doi:10.1186/s12859-016-1377-y
- 4. Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Briefings in Bioinformatics. 2020.
- 5. D’Souza S, Prema KV, Balaji S. Machine learning models for drug–target interactions: current knowledge and future directions. Drug Discovery Today. 2020;25(4):748–756. doi:10.1016/j.drudis.2020.03.003
- 6. Sharma A, Rani R. BE-DTI’: Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning. Computer Methods and Programs in Biomedicine. 2018;165:151–162. doi:10.1016/j.cmpb.2018.08.011
- 7. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction using ensemble learning and dimensionality reduction. Methods. 2017;129:81–88. doi:10.1016/j.ymeth.2017.05.016
- 8. Seal A, Ahn YY, Wild DJ. Optimizing drug-target interaction prediction based on random walk on heterogeneous networks. Journal of Cheminformatics. 2015;7:40. doi:10.1186/s13321-015-0089-z
- 9. Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discovery Today. 2018;23(6):1241–1250. doi:10.1016/j.drudis.2018.01.039
- 10. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al. DrugBank 3.0: a comprehensive resource for ‘Omics’ research on drugs. Nucleic Acids Research. 2010;39(suppl_1):D1035–D1041. doi:10.1093/nar/gkq1126
- 11. Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y. Identification of chemogenomic features from drug–target interaction networks using interpretable classifiers. Bioinformatics. 2012;28(18):i487–i494. doi:10.1093/bioinformatics/bts412
- 12. Cao DS, Xiao N, Xu QS, Chen AF. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2014;31(2):279–281. doi:10.1093/bioinformatics/btu624
- 13. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research. 2006;34(suppl_2):W32–W37. doi:10.1093/nar/gkl305
- 14. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research. 2015;44(D1):D279–D285. doi:10.1093/nar/gkv1344
- 15. Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In: Gertz M, Ludäscher B, editors. Scientific and Statistical Database Management. Berlin, Heidelberg: Springer Berlin Heidelberg; 2010. p. 482–500.
- 16. Jensen R, Shen Q. New Approaches to Fuzzy-Rough Feature Selection. IEEE Transactions on Fuzzy Systems. 2009;17(4):824–838. doi:10.1109/TFUZZ.2008.924209
- 17. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–874. doi:10.1016/j.patrec.2005.10.010
- 18. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):106. doi:10.1186/1471-2105-14-106
- 19. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics. 2020. doi:10.1093/bioinformatics/btaa1005
- 20. van Laarhoven T, Nabuurs S, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27(21):3036–3043. doi:10.1093/bioinformatics/btr500
- 21. van Laarhoven T, Marchiori E. Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLOS ONE. 2013;8:1–6. doi:10.1371/journal.pone.0066952
- 22. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences. 2010;107(33):14621–14626. doi:10.1073/pnas.1000138107
- 23. Szöcs E. webchem: retrieve chemical information from the web; 2015. Available from: doi:10.5281/zenodo.33823
- 24. Matesanz R, Barasoain I, Yang CG, Wang L, Li X, de Inés C, et al. Optimization of Taxane Binding to Microtubules: Binding Affinity Dissection and Incremental Construction of a High-Affinity Analog of Paclitaxel. Chemistry and Biology. 2008;15(6):573–585. doi:10.1016/j.chembiol.2008.05.008
- 25. Yao EH, Fukuda N, Matsumoto T, Katakawa M, Yamamoto C, Han Y, et al. Effects of the antioxidative beta-blocker celiprolol on endothelial progenitor cells in hypertensive rats. American Journal of Hypertension. 2008;21(9):1062–1068. doi:10.1038/ajh.2008.233
- 26. Asano K, Cortes P, Garvin JL, Riser BL, Rodríguez-Barbero A, Szamosfalvi B, et al. Characterization of the rat mesangial cell type 2 sulfonylurea receptor. Kidney International. 1999;55(6):2289–2298. doi:10.1046/j.1523-1755.1999.00485.x
- 27. Gao HR, Shi TF, Yang CX, Zhang D, Zhang GW, Zhang Y, et al. The effect of dopamine on pain-related neurons in the parafascicular nucleus of rats. Journal of Neural Transmission (Vienna, Austria: 1996). 2010;117(5):585–591. doi:10.1007/s00702-010-0398-3
- 28. Vohora D, Saraogi P, Yazdani M, Bhowmik M, Khanam R, Pillai K. Recent advances in adjunctive therapy for epilepsy: focus on sodium channel blockers as third-generation antiepileptic drugs. Drugs of Today (Barcelona, Spain: 1998). 2010;46(4):265–277. doi:10.1358/dot.2010.46.4.1445795