DTI-SNNFRA: Drug-target interaction prediction by shared nearest neighbors and fuzzy-rough approximation

Sk Mazharul Islam; Sk Md Mosaddek Hossain; Sumanta Ray

doi:10.1371/journal.pone.0246920

. 2021 Feb 19;16(2):e0246920. doi: 10.1371/journal.pone.0246920

DTI-SNNFRA: Drug-target interaction prediction by shared nearest neighbors and fuzzy-rough approximation

Sk Mazharul Islam ¹, Sk Md Mosaddek Hossain ^2,^*, Sumanta Ray ²

Editor: Alberto Fernández-Hilario³

PMCID: PMC7894953 PMID: 33606741

Abstract

In-silico prediction of repurposable drugs is an effective drug discovery strategy that supplements de-nevo drug discovery from scratch. Reduced development time, less cost and absence of severe side effects are significant advantages of using drug repositioning. Most recent and most advanced artificial intelligence (AI) approaches have boosted drug repurposing in terms of throughput and accuracy enormously. However, with the growing number of drugs, targets and their massive interactions produce imbalanced data which may not be suitable as input to the classification model directly. Here, we have proposed DTI-SNNFRA, a framework for predicting drug-target interaction (DTI), based on shared nearest neighbour (SNN) and fuzzy-rough approximation (FRA). It uses sampling techniques to collectively reduce the vast search space covering the available drugs, targets and millions of interactions between them. DTI-SNNFRA operates in two stages: first, it uses SNN followed by a partitioning clustering for sampling the search space. Next, it computes the degree of fuzzy-rough approximations and proper degree threshold selection for the negative samples’ undersampling from all possible interaction pairs between drugs and targets obtained in the first stage. Finally, classification is performed using the positive and selected negative samples. We have evaluated the efficacy of DTI-SNNFRA using AUC (Area under ROC Curve), Geometric Mean, and F1 Score. The model performs exceptionally well with a high prediction score of 0.95 for ROC-AUC. The predicted drug-target interactions are validated through an existing drug-target database (Connectivity Map (Cmap)).

1 Introduction

Drug development strategies, also known as drug repositioning or drug repurposing or drug reprofiling, predict the interaction among drugs and targets from the existing drug-target databases [1]. There are two types of drug-target interaction: competitive inhibitors and allosteric inhibitors. Competitive inhibitors adhere to the target’s active site to suppress reactions. Allosteric inhibitors bind to the target’s allosteric site, which in turn prevents reactions, correct metabolic imbalance, and kills pathogens to cure diseases. There exist several synthesized compounds whose target profiles and effects are still unknown. The research and findings of compounds’ properties, their reactions/responses to drugs, and targets have generated large, complex databases that need efficient computational methods to analyze and predict drug-target interaction. New drug design requires more than 13.5 years and the cost exceeds 1.8 billion dollars [2, 3]. Moreover, new drugs may have unwanted side effects on patients. Therefore, due to known side effects and easier government approval, drug-repurposing facilitate pharmaceutical companies to launch existing authorized drugs and compounds in the market for new therapeutic purposes [4]. Drug repositioning usually reinvestigates existing drugs which were denied approval due to new therapeutic indications.

Practical laboratory experiments to discover the interactions among the drugs and targets are expensive, time-consuming and labour-intensive. Therefore, in-silico approaches are gaining attention, in which virtual screening is initially accomplished, and then possible candidates go through experimental verification. Docking simulations is a type of in-silico approach that need 3D structure analysis of drugs and target molecules to determine the potential binding sites. Despite the excellent accuracy of this process, unavailability of the proper 3D structure of drugs and targets, and long processing time hinders the docking simulation. Chemogenomics was introduced to tackle this problem in which the chemical space and genomic space are mined together to find the potential compounds such as imaging probes and drug leads [4]. Plenty of machine learning techniques based on similarity computation, matrix factorization, network models, features vectors, and deep learning models for DTIs prediction are prevalent in the literature [1, 5]. Similarity-based approaches find how a new drug and target is similar to known drug-target pairs based on the pharmacological similarities between drugs and the genomic similarity of protein sequences. Here, similarity measures may be either chemical-based, ligand-based, expression-based, side effect-based, or annotation-based [4]. But the disadvantage of this approach is that only a tiny proportion of drug-target interaction pairs are known and available for comparison. There are many matrix factorization algorithms, in which given an interaction matrix X_n×m, the main goal is to decompose it into two lower-order matrices, Y_n×k and Z_m×k such that X = YZ^T with k < n, m [4]. The matrix completion technique is then used to compute the missing data that help in the DTI prediction task. In feature-based [4] methods, the drug and target vector are concatenated. A binary or real label is then appended that denotes interaction outcome or affinity score for each drug-target pairs. Examples of features-based methods include the Bagging-based Ensemble method(BE-DTI) that employs dimensionality reduction, and active learning [6]. In [7], first feature sub-spacing and then three different dimensionality reduction techniques, namely Singular Value Decomposition(SVD), Partial Least Squares (PLS), and Laplacian Eigenmaps (LapEig) are used to prepare training data. They have used decision tree and kernel ridge regression classifiers as base learners. Network-based models such as TL-HGBI, DrugNet utilizes heterogeneous networks not only to predict the drug but also recommend the way of treatments [2, 4]. In [8], the matrix inverse computation is used to compute relevance grade between two nodes in a weighted network of drug-target interactions. Deep learning-based DTIs prediction utilizes the biological, topological, and physicochemical information of the drugs and targets to compute feature vectors/matrix [4, 9]. They can capture the inherent drug-target interactions over other state-of-the-art feature computation methods and classifiers. Deep learning techniques sometimes can not be applied due to the unavailability of sufficient data.

In this article, a feature-based method, DTI-SNNFRA, is proposed. Here, we have represented each drug or target by a feature vector. Initially, all the approved drug-target pairs are considered as a set of positive samples. The remaining unannotated and non-approved interaction pairs from which interaction may be predicted can be initially treated as a set of negative samples. Here, the drug-target interaction prediction task is a class imbalance problem, as most interaction pairs are unannotated. Our proposed framework predicts DTI in two phases that considerably reduce the unannotated drug-target pairs’ search space. In the first phase, from each known drug-target interaction pair, the shared nearest-neighbours (SNN) of the associated drug and target are computed using their feature vectors. Then, SNNs of the drug are clustered, and each cluster’s centroid is taken as a representative. Representative targets are also determined similarly. These representative drugs and targets are used to form drug-target pairs that are fewer and are probable candidates for possible interactions. The pairs obtained in this way are treated as negative interaction pairs.

Despite the reduction in search space, the obtained training set created in this way is highly imbalanced. To encounter this problem, in the second phase, our prediction model computes a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and target for each of the remaining unannotated pairs. Based on this score’s different threshold cut-off values, we have initially divided all the unannotated drug-target pairs into positive and negative classes. The size of the so obtained negative samples is dependent on the threshold cut-off, and if it is significantly larger than the size of the positive samples, then the drug-target interaction training dataset remains imbalanced. On the other hand, if the number of unannotated negative samples is considerably less than the approved positive samples, oversampling is carried out by an Adaptive Synthetic Sampling Method (ADASYN). It produces a reduced and balanced training set that can be used by any general classifier. We have applied several state-of-the-art classifiers such as SVM, decision tree, random forest, and RUSBoost to find predicted interactions’ correctness.

In section 2 of this article, the datasets utilized in this work along with method and algorithms, is explained. In section 2.3, a brief description and definition of the fuzzy-rough set based lower and upper approximation are outlined. In section 3, results and discussions are presented and finally section 4 draws the conclusion.

2 Materials and methods

In this section, we describe the datasets used in this work, key ideas of our algorithms, and some background of the fuzzy-rough set. The building block of the proposed DTI-SNNFRA method is shown in Fig 1.

2.1 Dataset preparation

In this article, the drug-target interaction data is taken from the DrugBank database [10] (version 4.3, released on 17 Nov. 2015) and from dataset mentioned in [11]. In dataset 1 [10], the number of drugs is 5877, targets are 3348, and the number of interactions between the drugs and targets is 12674. Here, a drug or a target is represented by its feature vector. The drug feature vector is computed by Rcpi [12] package, and the PROFEAT [13] web server. It is represented by constitutional, topological, and geometrical descriptors. The target feature vector is computed using different types of compositions, such as amino acid, pseudo-amino acid, and CTD (composition, transition, distribution) descriptors. The number of features for drug and target of dataset 1 are 193 and 1290, respectively.

In dataset 2 [11], the number of drugs is 1862, targets are 1554, and the number of interactions between the drugs and targets is 4809. Here, each drug is represented by a binary vector known as PubChem fingerprint. Each element of this vector exhibits the existence and non-existence of one of the 881 chemical substructures. Similarly, each target in the dataset 2 is also represented as a fingerprint of an 876-dimensional binary vector. Here, each element of this vector indicates the existence and non-existence of 876 different protein domains, as mentioned in the Pfam database [14]. The drug feature vector and target feature vector are then concatenated to represent the drug- target pair feature vector and can be represented for dataset 1 as:

{d_{1}, d_{2}, ….., d_{193}, t_{1}, t_{2}, ….., t_{1290}}

These drug-target pairs feature vectors are then normalized in the range [0, 1] by min-max method for avoiding bias towards any feature.

2.2 Workflow of the proposed framework

In this section, the necessary steps of our proposed method are described.

2.2.1 Step 1: Finding positive and negative drug-target pairs

After the normalization, only the drugs and targets which have known interactions in the interaction matrix are used to form the positive samples for classifiers. But the number of unannotated and non-approved interaction pairs derived from the interaction matrix is significantly greater than the number of positive samples. Note that the high dimensionality and numerous samples may have diverse effects in the prediction task. Finding characteristically similar drugs and targets using the nearest-neighbour search facilitates new drug-target prediction. Determination of the nearest-neighbours using similarity distance measures are sensitive to the dimensionality and the distribution of the dataset. The popular similarity function L₁ and L₂ in Minkowski space infers the fact that, for particular data distribution, if the dataset’s dimensionality is increased then the relative difference of the distance of the closest and farthest data point of an independently selected point goes to 0. For this reason, the primary distance functions like L₁, L₂, and cosine, etc. are not suitable for high dimensional data. In this context, computing shared nearest neighbours (SNN) using the primary distance functions instead of computing nearest neighbours reduce the disadvantage of higher dimensions [15]. Assume the dataset S consisting of n = |S| objects and k ∈ N⁺. For each individual drug (or target), let NN_k(x) ⊆ S represents k-nearest-neighbors of x ∈ S. It is computed using L₂ similarity measure. The overlap between the computed k-nearest-neighbors sets of the objects x and y is represented as:

\begin{matrix} S N N_{k} (x, y) = | N N_{k} (x) \cap N N_{k} (y) | \end{matrix}

(1)

The Algorithm 1, provides the procedure to compute shared nearest neighbours, and the Algorithm 2.3, outlines how the training dataset is prepared for classifiers.

Suppose, the total number of drugs and targets are M and N. Assume drug d_{i,i ∈ M} interacts with target t_{j,j ∈ N}. Now for this d_i, the indices of all drugs in ⋃SNN_k(d_i, d_r), ∀r ∈ M and i ≠ r are identified and assigned to snnD_i. Similarly, for the target t_j, the indices of all targets in ⋃SNN_k(t_j, t_r), ∀r ∈ N and j ≠ r are identified and assigned to snnT_j. Then all the drugs and targets in snnD_i and snnT_j are clustered using the k-medoids clustering and centroids are selected as a representatives of snnD_i and snnT_j. The Calinski-Harabasz criterion is used here to determine the correct number of clusters. These representatives drugs and targets from snnD_i and snnT_j are used to construct cartesian product pairs. Subsequently, the corresponding drug vector and target vector are concatenated for each cartesian product pair, which are included in the negative samples set. Forming negative samples by the above SNN approach followed by k-medoids clustering reduces the inclusion of the irrelevant drug-target pairs. For example, in dataset 1, the number of approved drug-target pairs is 12674, and the number of all possible pairs from which interaction may be predicted is 19663522. The number of drug-target pairs selected by the SNN followed by k-medoids clustering is 45933, which indicates 427 times samples removal.

2.2.2 Step 2: Decision table preparation and average approximation degree computation

The positive and negative sets of samples obtained in 2.2.1 are divided into m and n groups, respectively. Each group from the negative set, say, n_j is taken m times with m group from the positive set, and m number of the decision table is prepared. Each decision table is used to compute the fuzzy rough upper approximation degree of each sample in the n_j group. The m number of upper approximation degree of each sample in the n_j group are then taken for average upper approximation degree computation. In Algorithm 3, We have mentioned this average upper approximation degree computation.

2.2.3 Step 3: Under-sampling based on approximation degree

A fuzzy rough grade membership is computed for every negative sample using all positive samples’ interactions via Algorithm 3. This fuzzy-rough upper approximation degree possibly indicates the possible interaction degree value between 0 to 1 scale. Now, one threshold value near 1 called th1 can be assumed to select many samples whose fuzzy-rough upper approximation degree is smaller than or equal to th1. Another one threshold value near 0 called th0 can be assumed to select many samples whose fuzzy-rough upper approximation degree are less than or equal to th0. This th0 and th1 based sample selection both under-samples the negative samples set.

2.2.4 Step 4: Oversampling, if required

The datasets used here has several approved drug-target pairs, which are treated as a set of positive samples. The remaining pairs that are unannotated may or may not interact with each other. These unannotated (and also non-approved) interaction pairs are enormous, from which interaction is predicted. For example, in dataset 1, the number of approved drug-target pairs is 12674, and the number of remaining unannotated pairs is 19663522. Initially, we have reduced the number of unannotated pairs (i.e. initially treated as a set of negative samples), by using Shared Nearest Neighbor in Step 2.2.1. The number of unannotated negative samples, previously selected by SNN, remains higher than positive samples. Our prediction model then computes a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and target for each of the remaining unannotated pairs. Based on different threshold cut-off values of this score, we have initially divided all the unannotated drug-target pairs into positive and negative classes. The size of the so obtained negative samples is dependent on the threshold cut-off, and if it is significantly larger than the size of the positive samples, then the drug-target interaction training dataset remains imbalanced. Therefore, we have selected one threshold value of grade membership degree to under-sample the negative samples to get an approximately equal number of negative and positive samples. In this case, no oversampling is needed. However, if we select different threshold values where the number of negative samples is less than the number of positive samples, the oversampling of negative samples is required to balance negative and positive samples.

2.2.5 Step 5: Interaction prediction

As obtained in section 2.2.4, the dataset is then used to predict the negative set’s drug-target interaction pairs.

2.3 Fuzzy rough set

Assume that the drug-target pairs obtained by the given interaction matrix and SNN-based initial filtering constitute a decision table called $ℐ T$ . In this table, every row is denoted by m numbers of features i.e. C = {f_i: 1 ≤ i ≤ m} and one decision attribute D = {d}. In this $ℐ T$ , how two objects are indiscernible is determined by calculating fuzzy indiscernibility relation (FIR). Subsequently, this indiscernibility relation is taken to determine fuzzy-rough lower and upper approximation. The fuzzy lower and upper approximations using fuzzy similarity relation (either fuzzy equivalence or tolerance relation), in pursuance of Radzikowska’s model, to approximate a concept Y are defined as [16]:

\begin{matrix} μ_{\underline{R_{P}} Y} (x) = \underset{y \in I T}{i n f} I (μ_{R_{P}} (x, y), μ_{Y} (y)) \end{matrix}

(2)

\begin{matrix} μ_{\bar{R_{P}} Y} (x) = \underset{y \in I T}{s u p} T (μ_{R_{P}} (x, y), μ_{Y} (y)) \end{matrix}

(3)

Here, in Eqs 2 and 3, I indicates a fuzzy implicator, T denotes a t-norm and R_P is the fuzzy similarity relation computed by the features subset P ⊆ C. To calculate the fuzzy similarity relation R_P, which is used in fuzzy lower and upper approximations as mentioned in the Eqs 2 and 3, for the features subset P ⊆ C the following equation may be taken.

\begin{matrix} μ_{R_{P}} (x, y) = ⋂_{f \in P} {μ_{R_{f}} (x, y)} \end{matrix}

(4)

Here, $μ_{R_{f}} (x, y)$ denotes the similarity degree between object x and y with respect to feature f. Some examples of fuzzy similarity relations are given below:

\begin{matrix} μ_{R_{f}} (x, y) = 1 - \frac{| f (x) - f (y) |}{| f_{m a x} - f_{m i n} |} \end{matrix}

(5)

\begin{matrix} μ_{R_{f}} (x, y) = e x p (- \frac{{(f (x) - f (y))}^{2}}{2 σ^{2}}) \end{matrix}

(6)

$μ_{R_{f}} (x, y)$

\begin{matrix} = m a x (m i n (\frac{(f (y) - (f (x) - σ_{f}))}{(f (x) - (f (x) - σ_{f}))}, \frac{(f (x) + σ_{f}) - f (y))}{(f (x) + σ_{f}) - f (x))}), 0) \end{matrix}

(7)

where σ² stands for the variance of feature f.

Upper approximation degree computation

In Fig 1, the fuzzy rough upper approximation degree is computed as follows:

Computing fuzzy indiscernibility relation of conditional attributes using the Lukasiewicz t-norm and tolerance relation, as mentioned in section 2.3.
Computing fuzzy indiscernibility relation of decision attribute using its crisp relation.
Computing fuzzy upper approximation using the Lukasiewicz t-norm as per the Eq 3.

This fuzzy upper approximation degree can be used to select the samples from the negative samples set.

Data preprocessing for upper approximation degree computation

To reduce the dimension of feature vectors of the two datasets, we have utilized a dimensionality reduction method called incremental PCA. The feature dimension of a drug, target, and drug-target pair is 193, 1290, and 1483 for dataset1 and 881, 876, and 1757 for dataset2. To reduce the high computational cost of the fuzzy similarity computation (see Eq 4), we used incremental PCA to reduce feature dimension. This fuzzy similarity relation is further used to compute the upper or lower approximation. The computational complexity to compute the upper/lower approximation is O(|N|² × |D|) where |N| is the size of the Universe and |D| is the number of the decision classes. The computational complexity of the fuzzy similarity relation is O(|N|² × |C|) where |C| is the number of attributes. For one single attribute, the similarity relation’s computational complexity is O(|N|² × 1). For the attribute set C, there exist |C| number of similarity relations in memory which incurs high computational cost. The situation goes, even more, worse for a high-dimensional dataset. To tackle this issue, we use incremental PCA which process the whole data by splitting it into mini-batches. Each batch can easily fit into the memory and is given as input to the incremental PCA at a time. Please note that the classical PCA and its variation (sparse-PCA, kernel-PCA) may also be applicable here, but this will results high computational cost, particularly for high dimensional data the algorithm may not be feasible in reality.

Algorithm 1: sharedNN

Input: D = feature matrix for the drug

T = feature matrix for the target

Output: shared nearest neighbors represented by feature vectors

k ← Neighborhood size

X ← D or T

n ← sampleSize(X)

distances = pairWiseDistance(X)

sorted, indexes = sort(distances, ascendOrder)

for i ← 1 to n do

sharedNN = []

for j ← 1 to n do

C = intersect(indexes(i, 2:k + 1),

indexes(j, 2:k + 1))

sharedNN = sharedNN ⋃ X(C)

Algorithm 2: Dataset Preparation

Input: DT = drug-target interaction matrix

D = feature matrix for the drug

T = feature matrix for the target

Output: labeled TrainingDataSet

P ← {} % P = positive samples set

N ← {} % N = negative samples set

for i ← 1 to m do

for j ← 1 n do

if DT(i, j) = 1 then

P ← P ∪ concat(drugVec_i, targetVec_j)

/* drugVec_i: i^th drug vector, targetVec_j: j^th target vector */

tempD_i ← sharedNN(drugVec_i)

snnD_i ← optimalKmedoidsCentroids(tempD_i)

tempT_j ← sharedNN(targetVec_j)

snnT_j ← optimalKmedoidsCentroids(tempT_j)

N ← N ∪ cartesianProductPairConcat(snnD_i, snnT_j)

TrainingDataset ←P ∪ N

Algorithm 3: Average FRUA degree computation and sampling.

Data: Imbalanced TrainingDataset $ℐ$ with M samples {x_i, y_i} where i = 1 to M and x_i is an d-dimensional vector in drug-target pair feature space and y_i ∈ {0, 1}. Assume M_p and M_q represent number of minority and majority class samples respectively, such that M_p ≤ M_q and M_p+ M_q = M

Result: BalancedTraingDataset

Begin

function upperAproxCalc(decisionTable)

begin

uDegree → {} /* upper approximation degree vector */

objCount → sizeof(decisionTable) /* No. of object in decision table */

for k ← 1 to objCount do

$u D e g r e e (k) \leftarrow μ_{\bar{R_{C}} Y} (d e c i s i o n T a b l e_{k})$

here C: conditional attributesp set as per Eq 3

end

/* Split M_p and M_qinto m and n groups respectively */

split(M_p) → m groups

split(M_q) → n groups

totalNoGroupPair ← m × n /* total no. of group pairs between m and n */

allGroupPairIndices ← cartesianProduct(seq(1: m), seq(1: n)) /* It holds 1 to m × n indices where i^th index holds i^th pair */

for i ← 1 to totalNoGroupPair do

allGroupPairIndices(i)→(groupIndexOf_m, groupIndexOf_n) /*

groupIndexOf_m, groupIndexOf_n: m^th and n^th group index no. from m and n groups respectively */ $d e c i s i o n T a b l e_{i} \to (P_{g r o u p I n d e x O f} w i t h p o s i t i v e l a b e l) \cup (N_{g r o u p I n d e x O f} w i t h n e g a t i v e l a b e l)$ /*

$P_{g r o u p I n d e x O f}$ : set of positive samples taken from groupIndexOf_m, $N_{g r o u p I n d e x O f_{n}}$ : set of negative samples taken from groupIndexOf_n */ U_i ← upperAproxCalc(decisionTable_i) U_i holds upper approx. degree of all samples in $P_{g r o u p I n d e x O f_{m}}$ and upper approx. degree of all samples in $N_{g r o u p I n d e x O f}$ */

FRUA: $(\frac{1}{m} \sum (u p p e r A p p r o x D e g r e e o f N_{g r o u p I n d e x O f} |$

for each groupIndexOf_n ∈ seq(1:n) and ∀ groupIndexOf_m))

Sampling:

t_p and t_q are the thresolds for M_p and M_q

Z → ∅

for x ∈ M_q do

if FRUA(x) ≥ t_p then

M_p ← M_p ∪ x

if FRUA(x) ≤ t_q then

Z ← Z ∪ x

BalancedTraingDataset = ADASYN(M_p ∪ Z)

End

3 Results and discussions

3.1 Performance metrics

This section explains the experimental results by using three metrics referred to as ROC-AUC scores, F1 scores, and Geometric Mean scores [17]. The ROC-AUC provides a single score used to compare the models. It ranges from 0 to 1 where 1 indicates the perfect model and 0.5 represents a model having no prediction skill and the values less than 0.5 indicate that the prediction skill is worse than no skill. The ROC-AUC performance evaluation is insensitive to highly imbalanced datasets. How well a model predicts the positive class and the negative class are represented by the sensitivity and specificity. The sensitivity and specificity together can be integrated into a single score called geometric mean is represented by sqrt(Sensitivity * Specificity) where the Sensitivity = TruePositive / (TruePositive + FalseNegative) and Specificity = TrueNegative / (FalsePositive + TrueNegative).

The F1-score can be used to achieve a balance between Precision and Recall. It is also used where the class imbalance is present. All three scores are calculated using 5-fold cross-validation, and the average AUC, F1-score and G-mean score is computed. Note that the datasets 1 and 2 as mentioned in section 2 are used for prediction.

3.2 Proposed method vs some state-of-the-art sampling techniques

The proposed method deals with imbalance classification problems for drug-target interaction prediction. We have compared it with the five state-of-the-art sampling techniques known as RUS, SMOTE, ADASYN, SMOTEENN, and SMOTETomek to deal with the imbalanced dataset. Four classifiers, namely, decision tree(DT), random forest (RF), SVM, and RUSBoost are used to evaluate our proposed method’s performance. The ROC-AUC, F1, sand G-Mean scores of the proposed method, in Fig 2, are better than all the sampling methods. The RUS and SMOTE are performing poorly here for high-dimensional training data specified in [18]. ADASYN pays much attention to those samples of the minority class that are harder to learn. As our proposed method initially uses SNN, there may not be many samples that are harder to learn or the outliers. For this reason, directly using ADASYN, unlike our proposed method, is not producing satisfactory results here. The Tomek’s link in SMOTETomek and edited nearest-neighbours in SMOTEENN is used to clean the noisy samples or marginal outliers in training data. The SMOTEENN and SMOTETomek are not performing well because there are no noisy samples or marginal outliers (due to shared nearest neighbours computation) in the training data.

Fig 2 — The AUC, F1 and G-mean scores under the classification models of decision tree, random forest and support vector machine, respectively are demonstrated using various sampling methods.

3.3 Comparisons with state-of-the-art methods

We have compared the proposed method with five state-of-the-art methods, DeepPurpose [19], RLS-avg (Regularized Least Squares-Average) [20] and RLS-kron (Regularized Least Squares-Kronecker product) [21], EnsemDT [7], and EnsemKRR [7]. The DeepPurpose [19] is a deep learning-based method for drug-target interaction prediction. It is an encoder-decoder framework that uses eight encoders for a compound (drug) and seven encoders for an amino acid sequence (target). For this encoding, it uses deep neural networks, 1D convolutional neural networks, recurrent neural networks, transformer encoders, and message-passing neural networks. The drug-target pairs, along with their fuzzy-rough upper approximation scores of our method, are compatible with the input data of the DeepPurpose model. The results in Table 1, show that the proposed method performs better than the DeepPurpose for ROC-AUC score with the same data. For each of the remaining methods, we have utilized three different dimensionality reduction techniques, namely Singular Value Decomposition(SVD), Partial Least Squares (PLS), and Laplacian Eigenmaps (LapEig) for the preparation of training data. The results in Table 1, show that our proposed method has satisfactory ROC-AUC results (0.955, 0.961, 0.951, 0.947 for dataset-1 and 0.930, 0.943, 0.970 and 0.912 for dataset 2 using DT, RF, SVM and RUSBoost classifier respectively.

Table 1. Comparisons with the five state-of-the-arts methods.

Methods		Dataset 1	Dataset 2
Methods		AUC	AUC
RLS-avg, SVD		0.912	0.899
RLS-avg, PLS		0.915	0.918
RLS-avg, LapEig		0.909	0.916
RLS-kron, SVD		0.889	0.873
RLS-kron, PLS		0.899	0.913
RLS-kron, LapEig		0.889	0.874
EnsemDT, SVD		0.899	0.914
EnsemDT, PLS		0.902	0.898
EnsemDT, LapEig		0.901	0.914
EnsemKRR, SVD		0.942	0.931
EnsemKRR, PLS		0.941	0.930
EnsemKRR, LapEig		0.941	0.930
DeepPurpose		0.938	0.911
Proposed	DT	0.955	0.930
	RF	0.961	0.943
	SVM	0.951	0.970
	RUSBoost	0.947	0.912

Open in a new tab

We have only provided the ROC-AUC scores of all these competing methods due to unavailability of the F1 and G-Mean scores in [7]. The parameters of RLS-avg, RLS-kron, EnsemDT, and EnsemKRR are set to the default values as specified in [7, 20, 21].

3.4 Tuning of hyperparameters

The proposed method performs grid search-based hyperparameter tuning for computing ROC-AUC, F1, and G-Mean scores. For the DT classifier, we have observed that the best ROC-AUC, F1, and G-Mean scores are obtained using the hyperparameters combination is criterion: gini, maxDepth: 9, minSamplesLeaf: 1, minSamplesSplit: 6 for dataset 1. For dataset 2, the best ROC-AUC, F1, and G-Mean scores have been achieved by criterion: gini, maxDepth: 9, minSamplesLeaf: 1, minSamplesSplit: 4. In the case of RF classifier, for dataset 1 and dataset 2, the best hyperparameters combination is determined as criterion: gini, maxDepth: 20, minSamplesLeaf: 3, minSamplesSplit: 8, nEstimators: 200 for ROC-AUC scores of 0.961 and 0.943, respectively. Fig 3(A) and 3(B) demonstrate the variation of the AUC score of the decision tree with respect to only two hyperparameters called tree_depth and max_feature. In Fig 3(C), a heatmap is shown only for hyperparameters (n_estimators, max_depths) for the random forest model. The maximum depth of the tree is decided as nodes are expanded until all leaves are pure or until all leaves contain less than minSamplesSplit samples. The number of features for both the RF and DT is equal to maxFeatures = sqrt(nFeatures). The best hyperparameters combinations in SVM for dataset 1 are determined as kernel: RBF, C: 10.0, gamma: 0.1. As for dataset 2, the best ROC-AUC, F1, and G-Mean scores are 0.97, 0.93, and 0.929 achieved using kernel: RBF, C: 1.0, gamma: 0.1. Fig 3(D) represents the ROC-AUC scores with two hyperparameters (C, gamma) for dataset 2.

Fig 3 — Fig (A) and (B) represent the hyperparameters of decision tree called max feature and tree depth vs AUC graph for dataset 1, respectively. In (C), the hyperparameters of random forest along with the AUC scores are shown in the heatmap. Fig. (D) represents one heatmap for AUC scores of SVM for two hyperparameters called C and gamma.

To prepare negative drug-target pairs, the number of nearest neighbours is 11, which is later used to compute the shared nearest neighbours. We observed that for 11 nearest neighbours, the shared nearest neighbours computation step determines the number of drugs and targets that have a good balance between the number of samples and feature dimension.

3.5 Feature selection and comparisons

In Fig 4(A) and 4(B), the prediction scores in terms of ROC-AUC values have been shown for both datasets considering feature selection and no feature selection. In our method, after SNN computation followed by k-medoids clustering, we have computed a fuzzy rough upper approximation score (grade membership degree) as the strength of the interaction between a drug and a target for each of the unannotated pairs. Based on different threshold cut-off values of this score, we divided all the unannotated drug-target pairs into positive and negative classes. Negative samples detected from the unannotated pairs via fuzzy rough upper approximation score and the initially obtained annotated positive samples constitute the input data for RUSBoostClassifier. For different threshold cut-off values of fuzzy rough upper approximation scores, the RUSBoostClassifier produces the Fig 4(A) and 4(B). In these experiments, we used the holdout strategy for training with the training and testing ratio of 70:30. Table 1, the ROC-AUC scores of RUSBoostClassifier for one threshold cut-off value, for dataset 1 and dataset 2, are obtained by executing hyperparameters tuning using grid search. The best hyperparameters are determined as nEstimators: 500, learningRate: 1.0 which produces 0.9477 and 0.912 for ROC-AUC for dataset 1 and dataset 2. The RUSBoostClassifier is used here because it mitigates the class imbalance problem during learning by the random under-sampling of the samples at each iteration of boosting. For feature selection, the features importance scores have been computed using XGBoost and random forest. These two feature importance computation methods split the positive and negative samples into many groups, where the number of positive and, negative samples in each group is approximately equal. All the positive and negative group pairs were individually taken by the XGBoost and random forest classifiers for computing the feature importance. Finally, average feature importance scores are computed and top 100 features are taken for prediction. The average execution time, without feature selection, over 50 thresholds for dataset 1 and dataset 2 are 617.66 sec., and 232.07 sec., respectively. When feature selection is considered, the average execution time, over 50 thresholds, for dataset 1 and dataset 2 are 232.07 sec., and 77.61 sec., respectively.

3.6 Sensitivity vs number of base learners and classification errors

In Fig 4(C) and 4(D), two plots represent the M vs Sensitivity graph for both datasets where M represents the number of base learner that is ranging from 1 to 50. This experiment is carried out for a few threshold values. For each threshold, the variation of the ROC-AUC is minimal. The classification error indicates the proportion of samples that the classifier misclassified are also reported in Fig 4(E) and 4(F).

3.7 Drug-target interaction of the proposed method

In Table 2, some existing and predicted drug-target interactions have been provided. To test the efficacy of the proposed method, we have omitted several known interactions from training data. Then, we have trained our model with the remaining data and verified our prediction results. We have observed that our prediction model has even successfully recovered (predicted) those omitted known interactions. Seven drugs for the target Serine hydroxymethyltransferase, cytosolic are predicted correctly, and among them, five are listed in Table 2. For the same target, we predicted five additional interactions with drugs. Similarly, we have displayed results of some correctly predicted and novel drug-target interactions in this table. In Fig 5, some drug-target interactions have been shown, along with some interactions between the treatment areas and drugs.

Table 2. Drug target interaction verification and new interaction by the proposed method.

		Correct prediction of existing interactions	Novel Predicted interactions
Target name: Serine hydroxymethyl transferase, cytosolic	Drugs	Mimosine	Pyridostigmine
		Pyridoxal phosphate	Willardiine
		Glycine	acetamides
		tetrahydrofolic acids	Betamipron
		N-Pyridoxyl-Glycine-5-Monophosphate	Tyrosine
Target name: Monoamine oxidase	Drugs	Amphetamine	Diethylpropion
		Phentermine	Ethinamate
		Tranylcypromine	Alprenolol
		Phenelzine	Phenylephrine
		Selegiline	Probenecid
Drug name: alpha-D- glucose 6-phosphate	Targets	Glucose-6-phosphateisomerase	Peptide deformylase
		Glycogen phosphorylase, muscle form	Adenylate kinase isoenzyme 1
		Aldose reductase	Adenosylhomocysteinase
		Glutamine–fructose-6-phosphate aminotransferase [isomerizing]	Phosphoheptose isomerase
		Hexokinase-1	Low molecular weight2 tyrosine protein phosphatase
Drug name: Adenosine-5- Diphospho- ribose	Targets	MutT/nudix family protein	Enoyl-[acyl-carrierprotein] reductase [NADH] FabI
		p-hydroxy-benzoate hydroxylase	GDP-mannose6-dehydrogenase
		Glyceraldehyde-3-phosphate dehydrogenase	RNA-directed RNA polymerase
		Lactaldehyde reductase	Serine hydroxymethyl-transferase
		Elongation factor 2	Bifunctional protein BirA

Open in a new tab

3.8 Drug-target interaction validation

To verify our drug-target interaction prediction results, we have used the Connectivity Map (Cmap) [22] prediction results provided by the Broad Institute. The drug name and target name in the Drugbank dataset have different representations in Cmap. Therefore, we have performed the conversion between Drugbank ID and Cmap using the webchem R package [23]. This R package retrieves the chemical information from the web using a suite of 14 web services.

Our prediction results of drug-target pairs for Drugbank dataset are utilized in the webchem packages, which only fetches information from the Wikidata. Due to lack of information in the suite of web services, except the Wikidata, as provided by webchem R package, we have not obtained complete matching between our prediction and Cmap predictions. In Table 3, a list of 50 drug-target interaction pairs is shown that has been predicted by our method. Thirty-four interaction pairs which are also available in the Cmap predicted database is marked in bold face.

Table 3. Drug-target interactions by proposed method.

Drug	Target	FruaScore	Drug	Target	FruaScore
DB04094	Q9Y296	0.933385	DB00839	Q09428	0.814468
DB03750	P0CG47	0.933299	DB00476	P28335	0.810978
DB03988	Q9Y296	0.933073	DB00450	P35462	0.806337
DB03320	Q9Y296	0.932387	DB00776	P35498	0.804604
DB08242	P0AEK4	0.932214	DB00929	P43119	0.803532
DB08137	P0AEK4	0.932189	DB00433	P35462	0.802923
DB07153	P16184	0.932128	DB00794	Q14524	0.799097
DB00992	Q9Y296	0.932054	DB00917	P21731	0.798244
DB04789	P16184	0.932053	DB01121	Q14524	0.795084
DB07000	P0AEK4	0.932018	DB00645	Q14524	0.793230
DB04197	Q9Y296	0.932002	DB00850	P35367	0.764447
DB07281	P0AEK4	0.931912	DB04846	P08913	0.759809
DB03448	P0A884	0.931780	DB00782	P08172	0.758948
DB04796	P14867	0.931678	DB01365	P08913	0.751881
DB02456	P0A884	0.931636	DB01121	Q9NY46	0.751538
DB04680	P0CG29	0.931635	DB03719	P30542	0.747386
DB01248	P07437	0.922451	DB00670	P08172	0.745866
DB00518	P07437	0.919137	DB07954	P30542	0.744886
DB00391	P00915	0.915100	DB00794	Q9Y5Y9	0.730465
DB01248	Q13509	0.914888	DB00776	Q9Y5Y9	0.710952
DB01248	P68363	0.911210	DB00252	Q9Y5Y9	0.709006
DB05294	Q15303	0.904014	DB00999	Q08460	0.594489
DB00361	P68363	0.897636	DB01119	Q08460	0.589146
DB01121	P35499	0.824893	DB00356	Q08460	0.583733
DB04846	P07550	0.816920	DB03719	P29274	0.556650

Open in a new tab

We have also observed that most of predicted drug-target interaction pairs e.g. (DB01248, P07437), (DB04846, P07550), (DB00839, Q09428), (DB00450, P35462), (DB00776, Q9Y5Y9), (DB00776, P35498) shown in Table 3, are also reported in [24–28].

4 Conclusion

In this article, a novel computational approach for drug-target interaction prediction is presented utilizing existing drug-target data. There are two critical issues in this domain: a massive amount of drugs and targets creating a vast search space and highly imbalanced drug-target interactions dataset as there is a tiny number of drug-target interactions unveiled so far. Thus, the size of the negative samples is much larger than the size of the positive samples.

Here, we have used shared nearest neighbours rather than taking a fixed number of nearest neighbours as it is more effective in the higher dimensional dataset. The reason behind this is, typically, the size of the overlapped items within the neighbourhoods of a pair of drugs (or targets) inside the same cluster is substantially larger than the neighbourhoods of a pair of drugs (or targets) belonging to different clusters. Moreover, to tackle the curse of the imbalanced dataset, these shared nearest neighbours are further grouped by k-medoids. The representative centroids of k-medoids for the drug and target are then considered new possible drug-target interaction pairs for each known drug-target pair. Additionally, to deal with imbalanced dataset further, we have computed the degree of fuzzy-rough upper approximation of all the possible interaction pairs in the negative samples to perform undersampling. After that, selecting a threshold of the computed degrees, the size of the negative and positive samples sets are balanced. This upper approximation degree-based undersampling of the negative samples causes improvement in the prediction scores. Computation of degree in the fuzzy-rough upper approximation is challenging as interaction pairs’ dimension is exceptionally high. The execution time of this fuzzy-rough upper approximation degree is directly proportional to the number of features. Therefore, further investigation on fuzzy-rough set based feature selection followed by fuzzy-rough upper approximation computation may improve the prediction score. Instead of using a single threshold for undersampling, multiple threshold-based undersampling may be investigated for tackling the curse of imbalanced datasets. Moreover, the positive samples’ oversampling to balance with the number of negative samples may be explored to improve the prediction score. We believe that DTI-SNNFRA may be a promising framework for drug-target interaction prediction.

Supporting information

S1 File

(ZIP)

Click here for additional data file.^{(9MB, zip)}

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Sachdev K, Gupta MK. A comprehensive review of feature based methods for drug target interaction prediction. Journal of Biomedical Informatics. 2019;93:103159 10.1016/j.jbi.2019.103159 [DOI] [PubMed] [Google Scholar]
2. Cui Z, Gao YL, Liu JX, Wang J, Shang J, Dai LY. The computational prediction of drug-disease interactions using the dual-network L2,1-CMF method. BMC Bioinformatics. 2019;20(1):5 10.1186/s12859-018-2575-6 [DOI] [PMC free article] [PubMed] [Google Scholar]
3. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction via class imbalance-aware ensemble learning. BMC Bioinformatics. 2016;17(19):509 10.1186/s12859-016-1377-y [DOI] [PMC free article] [PubMed] [Google Scholar]
4. Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Briefings in Bioinformatics. 2020; [DOI] [PMC free article] [PubMed] [Google Scholar]
5. D’Souza S, Prema KV, Balaji S. Machine learning models for drug–target interactions: current knowledge and future directions. Drug Discovery Today. 2020;25(4):748–756. 10.1016/j.drudis.2020.03.003 [DOI] [PubMed] [Google Scholar]
6. Sharma A, Rani R. BE-DTI’: Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning. Computer Methods and Programs in Biomedicine. 2018;165:151–162. 10.1016/j.cmpb.2018.08.011 [DOI] [PubMed] [Google Scholar]
7. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction using ensemble learning and dimensionality reduction. Methods. 2017;129:81–88. 10.1016/j.ymeth.2017.05.016 [DOI] [PubMed] [Google Scholar]
8. Seal A, Ahn YY, Wild DJ. Optimizing drug-target interaction prediction based on random walk on heterogeneous networks. Journal of cheminformatics. 2015;7:40–40. 10.1186/s13321-015-0089-z [DOI] [PMC free article] [PubMed] [Google Scholar]
9. Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discovery Today. 2018;23(6):1241–1250. 10.1016/j.drudis.2018.01.039 [DOI] [PubMed] [Google Scholar]
10. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al. DrugBank 3.0: a comprehensive resource for ‘Omics’ research on drugs. Nucleic Acids Research. 2010;39(suppl_1):D1035–D1041. 10.1093/nar/gkq1126 [DOI] [PMC free article] [PubMed] [Google Scholar]
11. Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y. Identification of chemogenomic features from drug–target interaction networks using interpretable classifiers. Bioinformatics. 2012;28(18):i487–i494. 10.1093/bioinformatics/bts412 [DOI] [PMC free article] [PubMed] [Google Scholar]
12. Cao DS, Xiao N, Xu QS, Chen AF. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2014;31(2):279–281. 10.1093/bioinformatics/btu624 [DOI] [PubMed] [Google Scholar]
13. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research. 2006;34(suppl_2):W32–W37. 10.1093/nar/gkl305 [DOI] [PMC free article] [PubMed] [Google Scholar]
14. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research. 2015;44(D1):D279–D285. 10.1093/nar/gkv1344 [DOI] [PMC free article] [PubMed] [Google Scholar]
15. Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In: Gertz M, Ludäscher B, editors. Scientific and Statistical Database Management. Berlin, Heidelberg: Springer Berlin Heidelberg; 2010. p. 482–500. [Google Scholar]
16. Jensen R, Shen Q. New Approaches to Fuzzy-Rough Feature Selection. Fuzzy Systems, IEEE Transactions on. 2009;17(4):824–838. 10.1109/TFUZZ.2008.924209 [DOI] [Google Scholar]
17. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–874. 10.1016/j.patrec.2005.10.010 [DOI] [Google Scholar]
18. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):106 10.1186/1471-2105-14-106 [DOI] [PMC free article] [PubMed] [Google Scholar]
19. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics. 2020; 10.1093/bioinformatics/btaa1005 [DOI] [PMC free article] [PubMed] [Google Scholar]
20. Laarhoven TV, Nabuurs S, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27 21:3036–43. 10.1093/bioinformatics/btr500 [DOI] [PubMed] [Google Scholar]
21. van Laarhoven T, Marchiori E. Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLOS ONE. 2013;8:1–6. 10.1371/journal.pone.0066952 [DOI] [PMC free article] [PubMed] [Google Scholar]
22. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences. 2010;107(33):14621–14626. 10.1073/pnas.1000138107 [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Szöcs E. webchem: retrieve chemical information from the web; 2015. Available from: 10.5281/zenodo.33823. [DOI]
24. Matesanz R, Barasoain I, Yang CG, Wang L, Li X, de Inés C, et al. Optimization of Taxane Binding to Microtubules: Binding Affinity Dissection and Incremental Construction of a High-Affinity Analog of Paclitaxel. Chemistry and Biology. 2008;15(6):573–585. 10.1016/j.chembiol.2008.05.008 [DOI] [PubMed] [Google Scholar]
25. Yao EH, Fukuda N, Matsumoto T, Katakawa M, Yamamoto C, Han Y, et al. Effects of the antioxidative beta-blocker celiprolol on endothelial progenitor cells in hypertensive rats. American journal of hypertension. 2008;21 9:1062–8. 10.1038/ajh.2008.233 [DOI] [PubMed] [Google Scholar]
26. Asano K, Cortes P, Garvin JL, Riser BL, Rodríguez-Barbero A, Szamosfalvi B, et al. Characterization of the rat mesangial cell type 2 sulfonylurea receptor. Kidney International. 1999;55(6):2289–2298. 10.1046/j.1523-1755.1999.00485.x [DOI] [PubMed] [Google Scholar]
27. Gao HR, Shi TF, Yang CX, Zhang D, Zhang GW, Zhang Y, et al. The effect of dopamine on pain-related neurons in the parafascicular nucleus of rats. Journal of neural transmission (Vienna, Austria: 1996). 2010;117(5):585–591. 10.1007/s00702-010-0398-3 [DOI] [PubMed] [Google Scholar]
28. Vohora D, Saraogi P, Yazdani M, Bhowmik M, Khanam R, Pillai K. Recent advances in adjunctive therapy for epilepsy: focus on sodium channel blockers as third-generation antiepileptic drugs. Drugs of today (Barcelona, Spain: 1998). 2010;46(4):265–277. 10.1358/dot.2010.46.4.1445795 [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File

(ZIP)

Click here for additional data file.^{(9MB, zip)}

Data Availability Statement

All relevant data are within the manuscript and its Supporting information files.

[pone.0246920.ref001] 1. Sachdev K, Gupta MK. A comprehensive review of feature based methods for drug target interaction prediction. Journal of Biomedical Informatics. 2019;93:103159 10.1016/j.jbi.2019.103159 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref002] 2. Cui Z, Gao YL, Liu JX, Wang J, Shang J, Dai LY. The computational prediction of drug-disease interactions using the dual-network L2,1-CMF method. BMC Bioinformatics. 2019;20(1):5 10.1186/s12859-018-2575-6 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref003] 3. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction via class imbalance-aware ensemble learning. BMC Bioinformatics. 2016;17(19):509 10.1186/s12859-016-1377-y [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref004] 4. Bagherian M, Sabeti E, Wang K, Sartor MA, Nikolovska-Coleska Z, Najarian K. Machine learning approaches and databases for prediction of drug–target interaction: a survey paper. Briefings in Bioinformatics. 2020; [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref005] 5. D’Souza S, Prema KV, Balaji S. Machine learning models for drug–target interactions: current knowledge and future directions. Drug Discovery Today. 2020;25(4):748–756. 10.1016/j.drudis.2020.03.003 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref006] 6. Sharma A, Rani R. BE-DTI’: Ensemble framework for drug target interaction prediction using dimensionality reduction and active learning. Computer Methods and Programs in Biomedicine. 2018;165:151–162. 10.1016/j.cmpb.2018.08.011 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref007] 7. Ezzat A, Wu M, Li XL, Kwoh CK. Drug-target interaction prediction using ensemble learning and dimensionality reduction. Methods. 2017;129:81–88. 10.1016/j.ymeth.2017.05.016 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref008] 8. Seal A, Ahn YY, Wild DJ. Optimizing drug-target interaction prediction based on random walk on heterogeneous networks. Journal of cheminformatics. 2015;7:40–40. 10.1186/s13321-015-0089-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref009] 9. Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discovery Today. 2018;23(6):1241–1250. 10.1016/j.drudis.2018.01.039 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref010] 10. Knox C, Law V, Jewison T, Liu P, Ly S, Frolkis A, et al. DrugBank 3.0: a comprehensive resource for ‘Omics’ research on drugs. Nucleic Acids Research. 2010;39(suppl_1):D1035–D1041. 10.1093/nar/gkq1126 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref011] 11. Tabei Y, Pauwels E, Stoven V, Takemoto K, Yamanishi Y. Identification of chemogenomic features from drug–target interaction networks using interpretable classifiers. Bioinformatics. 2012;28(18):i487–i494. 10.1093/bioinformatics/bts412 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref012] 12. Cao DS, Xiao N, Xu QS, Chen AF. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics. 2014;31(2):279–281. 10.1093/bioinformatics/btu624 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref013] 13. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Research. 2006;34(suppl_2):W32–W37. 10.1093/nar/gkl305 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref014] 14. Finn RD, Coggill P, Eberhardt RY, Eddy SR, Mistry J, Mitchell AL, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Research. 2015;44(D1):D279–D285. 10.1093/nar/gkv1344 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref015] 15. Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A. Can Shared-Neighbor Distances Defeat the Curse of Dimensionality? In: Gertz M, Ludäscher B, editors. Scientific and Statistical Database Management. Berlin, Heidelberg: Springer Berlin Heidelberg; 2010. p. 482–500. [Google Scholar]

[pone.0246920.ref016] 16. Jensen R, Shen Q. New Approaches to Fuzzy-Rough Feature Selection. Fuzzy Systems, IEEE Transactions on. 2009;17(4):824–838. 10.1109/TFUZZ.2008.924209 [DOI] [Google Scholar]

[pone.0246920.ref017] 17. Fawcett T. An introduction to ROC analysis. Pattern Recognition Letters. 2006;27(8):861–874. 10.1016/j.patrec.2005.10.010 [DOI] [Google Scholar]

[pone.0246920.ref018] 18. Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics. 2013;14(1):106 10.1186/1471-2105-14-106 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref019] 19. Huang K, Fu T, Glass LM, Zitnik M, Xiao C, Sun J. DeepPurpose: a deep learning library for drug–target interaction prediction. Bioinformatics. 2020; 10.1093/bioinformatics/btaa1005 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref020] 20. Laarhoven TV, Nabuurs S, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27 21:3036–43. 10.1093/bioinformatics/btr500 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref021] 21. van Laarhoven T, Marchiori E. Predicting Drug-Target Interactions for New Drug Compounds Using a Weighted Nearest Neighbor Profile. PLOS ONE. 2013;8:1–6. 10.1371/journal.pone.0066952 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref022] 22. Iorio F, Bosotti R, Scacheri E, Belcastro V, Mithbaokar P, Ferriero R, et al. Discovery of drug mode of action and drug repositioning from transcriptional responses. Proceedings of the National Academy of Sciences. 2010;107(33):14621–14626. 10.1073/pnas.1000138107 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0246920.ref023] 23.Szöcs E. webchem: retrieve chemical information from the web; 2015. Available from: 10.5281/zenodo.33823. [DOI]

[pone.0246920.ref024] 24. Matesanz R, Barasoain I, Yang CG, Wang L, Li X, de Inés C, et al. Optimization of Taxane Binding to Microtubules: Binding Affinity Dissection and Incremental Construction of a High-Affinity Analog of Paclitaxel. Chemistry and Biology. 2008;15(6):573–585. 10.1016/j.chembiol.2008.05.008 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref025] 25. Yao EH, Fukuda N, Matsumoto T, Katakawa M, Yamamoto C, Han Y, et al. Effects of the antioxidative beta-blocker celiprolol on endothelial progenitor cells in hypertensive rats. American journal of hypertension. 2008;21 9:1062–8. 10.1038/ajh.2008.233 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref026] 26. Asano K, Cortes P, Garvin JL, Riser BL, Rodríguez-Barbero A, Szamosfalvi B, et al. Characterization of the rat mesangial cell type 2 sulfonylurea receptor. Kidney International. 1999;55(6):2289–2298. 10.1046/j.1523-1755.1999.00485.x [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref027] 27. Gao HR, Shi TF, Yang CX, Zhang D, Zhang GW, Zhang Y, et al. The effect of dopamine on pain-related neurons in the parafascicular nucleus of rats. Journal of neural transmission (Vienna, Austria: 1996). 2010;117(5):585–591. 10.1007/s00702-010-0398-3 [DOI] [PubMed] [Google Scholar]

[pone.0246920.ref028] 28. Vohora D, Saraogi P, Yazdani M, Bhowmik M, Khanam R, Pillai K. Recent advances in adjunctive therapy for epilepsy: focus on sodium channel blockers as third-generation antiepileptic drugs. Drugs of today (Barcelona, Spain: 1998). 2010;46(4):265–277. 10.1358/dot.2010.46.4.1445795 [DOI] [PubMed] [Google Scholar]

PERMALINK

DTI-SNNFRA: Drug-target interaction prediction by shared nearest neighbors and fuzzy-rough approximation

Sk Mazharul Islam

Sk Md Mosaddek Hossain

Sumanta Ray

Roles

Abstract

1 Introduction

2 Materials and methods

Fig 1. Building block of proposed DTI-SNNFRA method.

2.1 Dataset preparation

2.2 Workflow of the proposed framework

2.2.1 Step 1: Finding positive and negative drug-target pairs

2.2.2 Step 2: Decision table preparation and average approximation degree computation

2.2.3 Step 3: Under-sampling based on approximation degree

2.2.4 Step 4: Oversampling, if required

2.2.5 Step 5: Interaction prediction

2.3 Fuzzy rough set

Upper approximation degree computation

Data preprocessing for upper approximation degree computation

3 Results and discussions

3.1 Performance metrics

3.2 Proposed method vs some state-of-the-art sampling techniques

Fig 2. Fig (A) and (B) represents the performance on two datasets.

3.3 Comparisons with state-of-the-art methods

Table 1. Comparisons with the five state-of-the-arts methods.

3.4 Tuning of hyperparameters

Fig 3.

3.5 Feature selection and comparisons

Fig 4.

3.6 Sensitivity vs number of base learners and classification errors

3.7 Drug-target interaction of the proposed method

Table 2. Drug target interaction verification and new interaction by the proposed method.

Fig 5. Some drug-target interactions with treatment areas of the drugs.

3.8 Drug-target interaction validation

Table 3. Drug-target interactions by proposed method.

4 Conclusion

Supporting information

Data Availability

Funding Statement

References

Associated Data

Supplementary Materials

Data Availability Statement

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases