PLOS ONE. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042

Predicting activatory and inhibitory drug–target interactions based on structural compound representations and genetically perturbed transcriptomes

Won-Yung Lee 1, Choong-Yeol Lee 1, Chang-Eop Kim 1,*
Editor: Jinn-Moon Yang 2
PMCID: PMC10096289  PMID: 37043429

Abstract

A computational approach to identifying drug–target interactions (DTIs) is a credible strategy for accelerating drug development and understanding the mechanisms of action of small molecules. However, current methods to predict DTIs have mainly focused on identifying simple interactions, requiring further experiments to understand the mechanism of drug action. Here, we propose AI-DTI, a novel method that predicts activatory and inhibitory DTIs by combining mol2vec and genetically perturbed transcriptomes. We trained the model on large-scale DTIs annotated with their mode of action (MoA) and found that our model outperformed a previous model that predicted activatory and inhibitory DTIs. Data augmentation of target feature vectors enabled the model to predict DTIs for a wide range of druggable targets. Our method achieved substantial performance on an independent dataset where the target was unseen in the training set and on a high-throughput screening dataset where positive and negative samples were explicitly defined. Also, our method successfully rediscovered approximately half of the DTIs for drugs used in the treatment of COVID-19. These results indicate that AI-DTI is a practically useful tool for guiding drug discovery processes and generating plausible hypotheses that can reveal unknown mechanisms of drug action.

1. Introduction

Identifying drug–target interactions (DTIs) is an essential step in drug discovery and repurposing. Proper understanding of DTIs can lead to fast optimization of small molecules derived from phenotypic screening and elucidation of the mechanism of action for experimental drugs [1]. However, identifying a candidate drug for a putative target by relying solely on in vivo and biochemical approaches often takes 2–3 years with tremendous economic costs [2, 3]. Computational approaches have emerged as an alternative strategy for reducing the workload and resources by efficiently identifying potential DTIs. This strategy has the potential to accelerate the drug development process by prioritizing candidate compounds for putative targets or vice versa.

Conventional methods for predicting DTIs can be broadly categorized into docking simulations and ligand-based approaches [4, 5]. However, their predictions are often unreliable when the 3D structure of a protein or target is unavailable or when an insufficient number of ligands is known for the target, respectively [6]. Recently, chemogenomic approaches have emerged as an alternative, enabling large-scale predictions by leveraging recent advances in network-based approaches or machine learning techniques [7–11]. For example, Yunan et al. proposed DTI-Net, a network-integrated pipeline that predicts DTIs by constructing a heterogeneous network using the information collected from various sources [12]. Other researchers have proposed deep learning-based methods, such as convolutional neural networks, graph convolutional networks (GCNs), and natural language processing, to predict novel DTIs [13–15]. Despite their state-of-the-art performance, these models predict simple interactions without the mode of action, necessitating further experimental validation to fully understand the mechanisms of action of the drug.

Researchers have thus attempted to develop models that predict DTIs together with their mode of action. Specifically, Sawada et al. proposed a model that predicts activatory and inhibitory DTIs by combining transcriptome profiles measured after compound treatment and genetic perturbation [16]. Although the method showed the possibility of predicting DTIs with modes of action using transcriptome data, it did not provide a satisfactory tool that could be applied for drug discovery. First, employing compound-induced transcriptome profiles as the representative vector of a compound significantly limited the range of predictable compounds. Second, the number of predictable activatory and inhibitory targets was 74 and 755, respectively, which covered only a fraction of the druggable targets. Finally, the employed algorithm, called joint learning, could not learn nonlinear relationships between input vectors and DTI labels, and thus, the performance of DTI prediction was insufficient. Therefore, there is still a pressing need to develop a method with superior performance for predicting a wide range of activatory and inhibitory targets for novel compounds and natural products.

In this paper, we present AI-DTI, a new computational methodology for predicting activatory and inhibitory DTIs by integrating the mol2vec method and genetically perturbed transcriptomes (Fig 1). Employing mol2vec enabled our method to expand the drug space to most compounds with 2D structures. In addition, by inferring the target vector representation based on the protein–protein interaction (PPI) network, the number of predictable targets was expanded to cover a majority of druggable targets. We compared the performance of various classifiers on the training dataset and selected the optimized classifier with the best performance. The prediction capacity of our model was also evaluated on independent datasets with unseen DTI pairs and on high-throughput biological assay results. Finally, as a case study, we evaluated whether our method could be applied to the prediction of DTIs for a novel disease, coronavirus disease 2019 (COVID-19). All these results demonstrate that AI-DTI is a practically useful tool for predicting unknown activatory and inhibitory DTIs, which provides new insights into drug discovery and helps in understanding modes of drug action.

Fig 1. Overview of the AI-DTI pipeline.

(A) A structure of AI-DTI. (B) Feature vector generation for activatory and inhibitory drug-target interactions (DTIs). For activatory DTIs, the feature vector was represented as a concatenation of the compound vector and aggregated gene overexpression signatures. For inhibitory DTIs, the feature vector was represented as a concatenation of the compound vector and aggregated gene knockdown signatures.

2. Materials and methods

2.1. Employing mol2vec-based compound features

The vector representation of compounds was obtained using mol2vec [17], a word2vec-inspired model that learns the vector representations of molecular substructures. Mol2vec applies the word2vec algorithm to the corpus of compounds by considering compound substructures derived from the Morgan fingerprint as “words” and compounds as “sentences”. The vector representations of molecular substructures are encoded to point in directions similar to those that are chemically related, and entire compound representations are obtained by summing the vectors of the individual substructures. Among the mol2vec versions, we implemented the skip-gram model with a window size of 10 and 300-dimensional embedding of Morgan substructures, which demonstrated the best prediction capabilities in several compound property and bioactivity datasets.
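As a rough illustration of the summation step described above, the following sketch builds a compound vector by adding up substructure embeddings. It is a minimal sketch under stated assumptions, not the mol2vec implementation: the identifiers `id_a` and `id_b`, the random toy embeddings, and the fallback key `UNSEEN` for substructures missing from the vocabulary are all hypothetical stand-ins for a trained skip-gram model.

```python
import numpy as np

EMBEDDING_DIM = 300  # embedding size stated in the text


def compound_vector(substructure_ids, embeddings, unseen_key="UNSEEN"):
    """Sum the embeddings of the substructure identifiers of one compound."""
    total = np.zeros(EMBEDDING_DIM)
    for sub_id in substructure_ids:
        # Assumption: identifiers absent from the vocabulary share one fallback vector.
        total += embeddings.get(sub_id, embeddings[unseen_key])
    return total


# Toy vocabulary: random vectors stand in for trained substructure embeddings.
rng = np.random.default_rng(0)
toy_embeddings = {key: rng.normal(size=EMBEDDING_DIM)
                  for key in ("id_a", "id_b", "UNSEEN")}

# "id_missing" is not in the vocabulary, so it contributes the fallback vector.
vec = compound_vector(["id_a", "id_b", "id_missing"], toy_embeddings)
```

Because the compound vector is a plain sum, its dimensionality does not depend on the number of substructures in the molecule.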

2.2. Constructing genetically perturbed transcriptome-based target features

The genetically perturbed transcriptome of the L1000 dataset was downloaded from the Gene Expression Omnibus (accession number: GSE92742), which contains 473,647 signatures. Each signature (i.e., transcriptome profile) consists of moderated z-scores of 978 landmark genes whose expression levels were measured directly and 11,350 genes whose expression values were inferred from them. A landmark gene is one whose expression has been determined to be informative for characterizing the transcriptome and which is measured directly in the L1000 assay. In our study, level 5 landmark gene data were used to represent the target vector. Level 5 data are a normalized dataset suggested by the LINCS team for use without additional processing. Among the types of perturbations, “cDNA for overexpression of wild-type gene” and “consensus signature from shRNAs for loss of function” were considered vector representations for activatory and inhibitory targets, respectively. From the downloaded data, the gene expression signatures of the landmark gene set were parsed using the cmapPy module [18], resulting in 36,720 gene knockdown and 22,205 gene overexpression signatures.

The parsed data contained multiple gene expression profiles for single genetic perturbations measured in various cell lines and/or under different perturbing conditions, which necessitated further preprocessing. To obtain a representative vector for each target, we applied a weighted average procedure in two steps: aggregation across experimental conditions and aggregation across cell lines (Fig 2B). Before introducing the weighted average procedure itself, we describe how the aggregation was applied. Specifically, the signatures measured after the same genetic perturbation in a specific cell line, but with different perturbational doses or times, were first aggregated by weighted averaging. Then, the representative vector for a particular target was obtained by reapplying weighted averaging to these signatures (i.e., the aggregated signatures measured after the same genetic perturbation in different cell lines). This two-step process reduces the potential bias in the representative vector that occurs when the number of genetically perturbed signatures is skewed toward a particular cell line. The process also allows us to compute embeddings for at least 15% more targets than when using gene expression in a single cell line (S1 Table).

Fig 2. Schematics of aggregation and inference for a genetically perturbed transcriptome.

(A) Weighted averaging for combining individual signatures into consensus gene signatures. Individual profiles were weighted by the sum of their correlations to other individual signatures and then averaged. (B) Generation of target vectors by cross-perturbation and cross-cell line aggregation. Multiple signatures measured after the same genetic perturbation in a specific cell, but with different perturbational doses or times, were first aggregated by weighted averaging. The same procedure was performed between multiple signatures measured after the same genetic perturbation in different cell lines. (C) Examples of inferring target vector representation. AGT, AGTR1, and KNG1 interacted with ACE in the PPI network, and their inhibitory target vectors were available. The inhibitory target vector of ACE was inferred by aggregating these three representative inhibitory target vectors using weighted averaging.

The weighted average procedure is a method of weighted averaging of multiple signatures based on a pairwise correlation matrix (Fig 2A) [19, 20]. Suppose that $x_t \in \mathbb{R}^{978}$ ($t = 1, 2, \ldots, n$) is a vector representing an L1000 signature of the landmark genes (normalized gene expression profiles for directly measured genes) for a specific functional perturbation, where $t$ indexes the signatures and $n$ represents the total number of signatures to be aggregated. To generate the aggregated vector $x_{Agg}$, a pairwise correlation matrix $R_{n \times n}$ is defined as the Spearman coefficient between signature pairs,

$$R = \begin{pmatrix} \rho_{11} & \cdots & \rho_{1n} \\ \vdots & \ddots & \vdots \\ \rho_{n1} & \cdots & \rho_{nn} \end{pmatrix} \tag{1}$$

where $\rho_{ij}$ denotes the Spearman correlation coefficient between the signature pairs $x_i$ and $x_j$ for $i, j \in \{1, 2, \ldots, n\}$.

The weight vector ($w$) is obtained by summing across the columns of $R$ after excluding trivial self-correlation and then normalizing them,

$$w = \frac{1}{j^{T}(R - I)\,j}\,(R - I)\,j, \tag{2}$$

where $I$ denotes the identity matrix, and $j$ denotes a column vector of 1s.

Finally, $x_{Agg}$ is obtained from the average of $x_t$ based on the weight vector $w$,

$$x_{Agg} = \sum_{t=1}^{n} w_t x_t, \tag{3}$$

where $w_t$ denotes the $t$-th entry of $w$. By aggregating the signatures across experimental conditions and cell lines, we obtained the representative vectors of 3,114 and 4,345 activatory and inhibitory targets, respectively, embedded in 978 dimensions (Fig 2B). The obtained target vector representation was used as target features of DTIs in an original dataset to be constructed later.
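The three equations above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' code: the function names are hypothetical, Spearman correlation is computed as the Pearson correlation of ranks without tie averaging (adequate for continuous z-scores), and no safeguard is included for the case where the summed correlations are close to zero or negative.

```python
import numpy as np


def aggregation_weights(signatures):
    """Weights of Eq (2) for an (n_signatures x n_genes) array."""
    x = np.asarray(signatures, dtype=float)
    n = x.shape[0]
    if n == 1:
        return np.ones(1)
    # Ranks within each signature (assumption: ties are not averaged).
    ranks = np.argsort(np.argsort(x, axis=1), axis=1).astype(float)
    r = np.corrcoef(ranks)                    # Eq (1): pairwise Spearman matrix
    row_sums = (r - np.eye(n)) @ np.ones(n)   # (R - I) j, self-correlation removed
    return row_sums / row_sums.sum()          # divide by j^T (R - I) j


def weighted_average(signatures):
    """Aggregated signature of Eq (3): sum over t of w_t * x_t."""
    x = np.asarray(signatures, dtype=float)
    return aggregation_weights(x) @ x


# Toy example with three short "signatures"; the first agrees best with the others.
toy = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                [1.0, 2.0, 3.0, 5.0, 4.0],
                [2.0, 1.0, 3.0, 4.0, 5.0]])
weights = aggregation_weights(toy)
consensus = weighted_average(toy)
```

Under this scheme, applying `weighted_average` first within each cell line and then to the per-cell-line results would reproduce the two-step aggregation described earlier.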

The target list of the obtained vectors contained only a fraction of the druggable targets, thus significantly limiting the target space of our method. Therefore, the target space was extended by inferring the vector representation of activatory or inhibitory signatures based on the PPI network. The PPI network was constructed from the STRING database (v 11.0) [21] by setting the organism to “Homo sapiens” and requiring an interaction score > 0.9 (highest confidence score suggested by STRING). The vector representation of an activatory or inhibitory target was inferred by aggregating the vector representations of its interacting proteins in the PPI network using the weighted averaging procedure (Fig 2C). To ensure the quality of the data, we limited the inferred targets to proteins with at least three neighbours whose target vectors are available. The inferred target vector representation was used as target features of DTIs in an additional dataset to be constructed later.
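The neighbour selection described above can be sketched as follows. This is an illustrative sketch under several assumptions: the edge list format `(protein_a, protein_b, score)`, the 0 to 1 score scale, the toy scores, and the function name are hypothetical, and a plain mean is used as a placeholder aggregator where the text applies correlation-weighted averaging.

```python
import numpy as np


def infer_target_vector(target, ppi_edges, known_vectors, aggregate=None,
                        min_score=0.9, min_neighbours=3):
    """Infer a target vector from high-confidence PPI neighbours, or return None."""
    neighbours = set()
    for protein_a, protein_b, score in ppi_edges:
        if score > min_score and target in (protein_a, protein_b):
            other = protein_b if protein_a == target else protein_a
            # Keep only neighbours that already have a measured target vector.
            if other != target and other in known_vectors:
                neighbours.add(other)
    if len(neighbours) < min_neighbours:
        return None  # too few informative neighbours to infer a vector
    stacked = np.vstack([known_vectors[name] for name in sorted(neighbours)])
    if aggregate is None:
        # Placeholder: plain mean; the text uses correlation-weighted averaging.
        aggregate = lambda matrix: matrix.mean(axis=0)
    return aggregate(stacked)


# Toy data mirroring the ACE example of Fig 2C (scores and vectors are made up).
known = {"AGT": np.array([1.0, 0.0, 0.0, 0.0]),
         "AGTR1": np.array([0.0, 1.0, 0.0, 0.0]),
         "KNG1": np.array([0.0, 0.0, 1.0, 0.0])}
edges = [("ACE", "AGT", 0.95), ("AGTR1", "ACE", 0.99),
         ("ACE", "KNG1", 0.92), ("ACE", "REN", 0.50)]
ace_vector = infer_target_vector("ACE", edges, known)
```

Passing a different callable as `aggregate` swaps in another aggregation rule without changing the neighbour filtering.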

2.3. Collection of activatory and inhibitory DTIs

Known activatory and inhibitory DTIs used as ground truth in our model were obtained from the Therapeutic Target Database (TTD) 2.0 (accessed October 15, 2020) [22] and DrugBank 5.1.7 (accessed January 12, 2021) [23]. We selected DTIs that explicitly defined activatory or inhibitory interactions (“activator” or “agonist” for activatory DTIs and “inhibitor” or “antagonist” for inhibitory DTIs). Identifiers of compounds and targets with their annotations were standardized to PubChem compound IDs and gene symbols, respectively. The chemical structures in our dataset were retrieved in canonical SMILES format using the Python package PubChemPy. From TTD, we obtained 2,925 activatory and 32,417 inhibitory DTIs between 24,145 compounds and 2,117 targets. From DrugBank, we obtained 919 activatory and 4,719 inhibitory DTIs between 1,600 compounds and 1,022 targets.

2.4. Dataset construction

Two types of datasets were constructed, which we refer to as the original dataset and the additional dataset (Fig 3A). Specifically, the original dataset was constructed by selecting known activatory and inhibitory DTI pairs that include a compound for which ECFPs can be calculated and a target for which transcriptome data are available. The additional dataset was constructed by selecting DTI pairs that include a compound for which ECFPs can be calculated and a target for which inferred transcriptome data are available. Finally, the integrated dataset was constructed by combining these two datasets and contained 1,755 activatory DTIs between 1,265 compounds and 273 targets, and 17,873 inhibitory DTIs between 12,259 compounds and 1,034 targets (Table 1).

Fig 3. Construction of a dataset of known DTIs and features.

(A) Training dataset construction. Transcriptome profiles were obtained from the L1000 array data and then aggregated to generate a representative target vector. The mol2vec method was used to generate representative vectors for compounds. DTIs with modes of action were collected from the TTD. The original dataset was constructed by selecting activatory and inhibitory DTI pairs that include a compound for which ECFPs can be calculated and an original target (i.e., a target for which genetically perturbed transcriptome data are available). The additional dataset was constructed by selecting activatory and inhibitory DTI pairs that include a compound for which ECFPs can be calculated and an additional target (i.e., a target for which inferred transcriptome data are available). (B) Independent dataset construction. Two independent datasets, the DrugBank and LIT-PCBA datasets, were constructed to evaluate the reliability of predictions for DTIs unseen in the training datasets.

Table 1. Overview of the drug–target interaction dataset for model training and external validation.

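| Set | Dataset | Activatory: no. of compounds | Activatory: no. of targets | Activatory: no. of DTIs | Inhibitory: no. of compounds | Inhibitory: no. of targets | Inhibitory: no. of DTIs |
|---|---|---|---|---|---|---|---|
| Internal | Original dataset | 457 | 87 | 513 | 6,789 | 702 | 8,328 |
| Internal | Additional dataset | 887 | 186 | 1,242 | 6,781 | 332 | 9,545 |
| Internal | Integrated dataset | 1,265 | 273 | 1,755 | 12,259 | 1,034 | 17,873 |
| External | DrugBank* | 374 | 172 | 808 | 1,217 | 909 | 4,267 |
| External | LIT-PCBA | 130,412 | 3 | 30/318,066# | 302,567 | 12 | 10,003/2,480,671# |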

*A dataset with only DTIs unseen in the TTD dataset.

#Number of active and inactive DTIs.

Two independent datasets, the DrugBank dataset and the LIT-PCBA dataset, were constructed to measure the generalization ability of the trained model in predicting unseen DTIs (Fig 3B). We selected these two datasets because DrugBank provides broad and highly versatile DTI information, and LIT-PCBA offers systematically integrated high-throughput screening assay results. It is noteworthy that LIT-PCBA reports whether compounds are active or inactive for a specific target, so evaluation on this dataset measures generalization performance in a setting where both negative and positive samples are explicitly defined. Each dataset was constructed by gathering known activatory and inhibitory DTI pairs that include a compound with ECFPs and a target with (inferred) transcriptome data. Collectively, the DrugBank dataset contained 808 activatory DTIs and 4,267 inhibitory DTIs (all positive samples), and the LIT-PCBA dataset contained 318,096 activatory DTIs (30 positive and 318,066 negative samples) and 2,490,674 inhibitory DTIs (10,003 positive and 2,480,671 negative samples).

2.5. Training classifier models

The features of activatory and inhibitory DTIs were fed into classifier models to predict their interactions. The DTI prediction performance was evaluated using logistic regression (Logit), random forest (RF), multilayer perceptron (MLP), and cascade deep forest (CDF) models. RF is an ensemble model that combines the probabilistic predictions of a number of decision tree-based classifiers to improve the generalization capability over a single estimator. MLP is a supervised learning algorithm that can learn nonlinear models. The architecture and hyperparameters of the MLP models in this study are summarized in S1 Fig. CDF employs a cascade structure, where the model consists of a multilayered architecture, and each level consists of an RF and extra trees [24]. Each level of the cascade receives feature information processed by the preceding level and conveys its result to the next level. A key feature of the updated CDF model, deep forest 21, is that it automatically determines the number of cascade levels suitable for the training data by terminating the training procedure when adding cascade levels no longer yields significant performance improvements [25]. Greedy-search methods were used to select optimized hyperparameters of CDF as follows: the number of estimators in each cascade layer, the number of trees in the forest, the function to measure the quality of a split, maximum depth, minimum number of samples to be split, maximum features, and minimum impurity decrease. The detailed search range of the optimization is summarized in S2 Table.

2.6. Performance evaluation

The performance on the training set was evaluated using fivefold cross-validation (CV). In each fold, a training set was constructed from a randomly selected 80% of the known DTI pairs (assigned as positive samples) together with randomly selected DTI pairs (assigned as negative samples), and the test set was constructed from the remaining 20% of the known DTI pairs and a matching number of randomly selected DTI pairs. For each fold of the predictive model, the following metrics were calculated:

$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{True positive rate} = \frac{TP}{TP + FN}$$
$$\text{False positive rate} = \frac{FP}{FP + TN},$$

where TP is true positive, FP is false positive, FN is false negative, and TN is true negative. We plotted the receiver operating characteristic (ROC) curves based on different recall and false positive rates and precision-recall curves based on different precision and recall values under the conditions of different classification cut-off values. Area under the receiver operating characteristic curve (AUROC) and Area under the precision-recall curve (AUPR) were calculated over each fold, and their average values were recorded as measures of model performance. AUPR provides a better assessment in highly skewed datasets, whereas AUROC is prone to be an overoptimistic metric. Thus, we used AUPR as the key metric for model selection [26, 27].

To evaluate the performance in the specific threshold in the ROC curve, the enrichment in true actives at a constant x% false positive rate over random picking (EFx%) was calculated as follows:

$$EF_{x\%} = \frac{TP_{x\%} / (TP_{x\%} + FP_{x\%})}{(TP_{x\%} + FN_{x\%}) / (TP_{x\%} + TN_{x\%} + FP_{x\%} + FN_{x\%})}$$

where $TP_{x\%}$, $FP_{x\%}$, $FN_{x\%}$ and $TN_{x\%}$ are the numbers of true positive, false positive, false negative and true negative samples at the threshold showing a false positive rate of x%, respectively. The $EF_{x\%}$ value represents the enriched true positive rate compared to the expected value at the threshold representing a false positive rate of x%. For example, the $EF_{1\%}$ value indicates the enrichment of the true positive rate compared to the chance level (0.01) at the threshold where the false positive rate is 0.01.
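To make the enrichment calculation concrete, the sketch below computes it as the precision among the top-ranked samples at a fixed false positive rate divided by the overall fraction of positives. It is a minimal sketch under assumptions: the function name is hypothetical, the cut-off is taken as the longest ranked prefix whose false positive count stays within x% of the negatives, and tied scores are not treated specially.

```python
import numpy as np


def enrichment_factor(y_true, scores, fpr_level=0.01):
    """Precision at a fixed false positive rate, divided by the positive prevalence."""
    labels = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    ranked = labels[np.argsort(-scores, kind="stable")]  # best-scored first
    n_pos = int(ranked.sum())
    n_neg = ranked.size - n_pos
    # Largest number of false positives allowed at the requested rate.
    allowed_fp = int(np.floor(fpr_level * n_neg + 1e-9))
    fp_cumulative = np.cumsum(~ranked)
    # Length of the longest ranked prefix with at most allowed_fp false positives.
    k = int(np.searchsorted(fp_cumulative, allowed_fp, side="right"))
    if k == 0 or n_pos == 0:
        return 0.0
    tp = int(ranked[:k].sum())
    precision = tp / k                  # TP / (TP + FP)
    prevalence = n_pos / ranked.size    # (TP + FN) / all samples
    return precision / prevalence


# Toy ranking: 5 actives on top, then 100 inactives, then 5 missed actives.
toy_labels = np.array([1] * 5 + [0] * 100 + [1] * 5)
toy_scores = np.linspace(1.0, 0.0, toy_labels.size)
ef_1 = enrichment_factor(toy_labels, toy_scores, fpr_level=0.01)
```

In the toy ranking, one false positive is allowed at a 1% false positive rate, so the cut-off keeps six samples (five actives and one inactive) against a prevalence of 10 in 110.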

3. Results

3.1. Overview of AI-DTI

We developed AI-DTI for the in silico identification of activatory and inhibitory DTIs. AI-DTI is composed of two models that predict activatory DTIs and inhibitory DTIs (Fig 1A). When an input query (drug–target pair) is received, AI-DTI transforms it into activatory and inhibitory DTI feature vectors and then estimates their interaction scores using the respective prediction model. The feature vector of a DTI was constructed by concatenating the compound vector calculated by mol2vec and the genetically perturbed transcriptome (gene overexpression signatures for activatory DTIs and gene knockdown signatures for inhibitory DTIs, Fig 1B). The mol2vec method transforms the 2D structural information of a compound into a continuous multidimensional vector. A genetically perturbed transcriptome reflects the response of the biological system following genetic perturbations, which is associated with the biological responses when drugs activate or inhibit specific gene targets [28–30].
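The feature construction described above amounts to a concatenation, sketched below. This is an illustration under assumptions, not the authors' code: the function name is hypothetical, the random inputs stand in for real mol2vec and L1000 vectors, and the 1,278-dimensional size simply follows from joining the 300-dimensional compound embedding with the 978 landmark-gene values.

```python
import numpy as np

COMPOUND_DIM = 300  # mol2vec embedding size
TARGET_DIM = 978    # number of landmark genes


def build_feature_vectors(compound_vec, overexpression_vec, knockdown_vec):
    """Return the (activatory, inhibitory) feature vectors for one drug-target pair."""
    # Activatory model input: compound vector + gene overexpression signature.
    activatory = np.concatenate([compound_vec, overexpression_vec])
    # Inhibitory model input: compound vector + gene knockdown signature.
    inhibitory = np.concatenate([compound_vec, knockdown_vec])
    return activatory, inhibitory


rng = np.random.default_rng(1)
compound = rng.normal(size=COMPOUND_DIM)
overexpression = rng.normal(size=TARGET_DIM)
knockdown = rng.normal(size=TARGET_DIM)
act_features, inh_features = build_feature_vectors(compound, overexpression, knockdown)
```

Each of the two vectors would then be scored by its own classifier, matching the two-model structure in Fig 1A.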

The procedure for constructing AI-DTI and demonstrating its performance on independent datasets consisted of three steps. First, we constructed three types of datasets (the original, additional, and integrated datasets) consisting of DTI labels and their feature vectors. The original and additional datasets were constructed by selecting known pairs of activatory and inhibitory DTIs that contained targets for which transcriptome data could be measured or inferred, respectively. The integrated dataset refers to the union of these two datasets. We assigned the known DTIs for each mode of action in each dataset as positive samples. Due to the lack of an adequate gold-standard negative set, negative samples were necessarily generated by randomly selecting non-interacting pairs from the drugs and targets in each dataset. Second, we trained various classifiers to discriminate positive (activatory or inhibitory) and negative DTIs on the original dataset and selected the optimized one with the best performance. The performance of the classifier was also evaluated on the additional and integrated datasets. Finally, the generalization performance of the optimized classifier trained on the integrated dataset was measured using independent datasets consisting of unseen DTIs [23, 31].

3.2. Selecting an optimized classifier for AI-DTI

We first aimed to select the optimized classifier of the model on the original dataset. The dataset contained 1,755 and 17,873 activatory and inhibitory DTIs (assigned as positive samples). Because a gold-standard negative dataset is not available, we randomly selected as many non-positive pairs as there were positive samples and assigned them as negative samples. We trained Logit, RF, MLP, and CDF models on the dataset and then evaluated their performance. Performance was evaluated by repeating 5-fold CV 5 times with different data splits. Our results showed that the CDF model yielded the highest AUROC and AUPR values when predicting both activatory and inhibitory targets (Table 2). We subsequently optimized the CDF model and found that the highest AUROC and AUPR values were obtained in all situations when the following hyperparameters were selected: ‘500’ as the number of trees and ‘8’ as the number of estimators (S2 Table). The optimized CDF model achieved AUROC and AUPR values of 0.880 and 0.899 for predicting activatory DTIs and 0.935 and 0.946 for predicting inhibitory DTIs, respectively.

Table 2. Assessment of performance using the original datasets through fivefold cross-validation.

All values are mean±S.D. Columns are grouped by sample ratio (1:1 or 1:10) and DTI type.

| Model | 1:1 Activatory AUROC | 1:1 Activatory AUPR | 1:1 Inhibitory AUROC | 1:1 Inhibitory AUPR | 1:10 Activatory AUROC | 1:10 Activatory AUPR | 1:10 Inhibitory AUROC | 1:10 Inhibitory AUPR |
|---|---|---|---|---|---|---|---|---|
| Logit | 0.725±0.032 | 0.690±0.041 | 0.823±0.007 | 0.806±0.006 | 0.697±0.027 | 0.166±0.016 | 0.823±0.007 | 0.340±0.004 |
| RF | 0.868±0.022 | 0.890±0.021 | 0.921±0.004 | 0.932±0.003 | 0.858±0.026 | 0.559±0.054 | 0.923±0.004 | 0.729±0.010 |
| MLP | 0.841±0.029 | 0.851±0.035 | 0.920±0.004 | 0.918±0.006 | 0.835±0.020 | 0.379±0.038 | 0.916±0.004 | 0.559±0.018 |
| CDF# | 0.876±0.021 | 0.899±0.020 | 0.934±0.004 | 0.945±0.004 | 0.871±0.021 | 0.611±0.046 | 0.936±0.004 | 0.775±0.011 |

Boldface indicates the highest value for each performance metric. Logit, logistic regression; RF, random forest; MLP, multilayer perceptron; CDF, cascade deep forest.

# CDF model with 2 estimators in each cascade layer and 100 trees in each forest.

In practice, DTI prediction is an imbalanced classification problem in which positive labels are sparse, so performance measured on a dataset with balanced positive and negative samples does not fully reflect real drug discovery scenarios. To mimic the practical situation in which positive DTIs are sparse, we also performed an additional CV test in which the test data contained ten times more negative samples than positive samples. With this experimental setup, the known DTIs (i.e., positive samples) account for only 9% of the total dataset, allowing a performance assessment closer to the situation of real drug discovery. Although the scores dropped compared to the previous test, we observed that the optimized CDF and RF models still achieved high AUPR values (Table 2). The AUPR of the MLP model was substantially lower than that of these two models, indicating that the performance of the MLP model was insufficient on the skewed dataset. Given that it achieved the highest performance in this experimental setup, we decided to employ the optimized CDF model as the classifier of AI-DTI in subsequent analyses.
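The random negative sampling used in both the balanced and the 1:10 settings can be sketched as follows. This is a minimal illustration under assumptions rather than the authors' procedure: the function name, the toy compound and target labels, and the exhaustive enumeration of candidate pairs (practical only for small examples) are all hypothetical.

```python
import random


def sample_negatives(positives, compounds, targets, ratio=1, seed=0):
    """Randomly draw `ratio` times as many non-positive pairs as there are positives."""
    positive_set = set(positives)
    # Every compound-target combination that is not a known DTI is a candidate.
    candidates = [(c, t) for c in compounds for t in targets
                  if (c, t) not in positive_set]
    n_negatives = ratio * len(positive_set)
    if n_negatives > len(candidates):
        raise ValueError("not enough non-positive pairs for the requested ratio")
    return random.Random(seed).sample(candidates, n_negatives)


# Toy data: 3 known DTIs among 7 compounds and 6 targets (42 possible pairs).
toy_positives = [("c1", "t1"), ("c2", "t2"), ("c3", "t3")]
toy_compounds = [f"c{i}" for i in range(1, 8)]
toy_targets = [f"t{i}" for i in range(1, 7)]

balanced = sample_negatives(toy_positives, toy_compounds, toy_targets, ratio=1)
skewed = sample_negatives(toy_positives, toy_compounds, toy_targets, ratio=10)

# With ten negatives per positive, positives make up 1/11 (about 9%) of the data.
positive_fraction = len(toy_positives) / (len(toy_positives) + len(skewed))
```

Since randomly drawn pairs are only presumed to be non-interacting, some sampled negatives may be undiscovered true interactions, which is the limitation the text acknowledges when noting the lack of a gold-standard negative set.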

3.3. Performance comparison with previous models

The performance of our model was compared with that of previous approaches. We first focused on performance evaluation across different molecular embedding methods. Performance was measured under the same conditions as above, in which 5-fold CV was repeated 5 times using a dataset in which the same number of positive and negative samples were selected. We found that the combination of mol2vec and CDF showed the highest performance for both AUROC and AUPR (Table 3). This result is consistent with a previous report in which mol2vec showed superior performance in the prediction of compound properties and bioactivity compared to Morgan fingerprints, chemical descriptors and some deep learning-based embedding models [17]. Taken together, we showed that mol2vec can provide rich information that helps to accurately classify activatory and inhibitory DTIs.

Table 3. Assessment of performance across compound embedding methods.

| Embedding | Activatory AUROC | Activatory AUPR | Inhibitory AUROC | Inhibitory AUPR |
|---|---|---|---|---|
| MACCS | 0.883±0.023 | 0.852±0.028 | 0.941±0.005 | 0.923±0.006 |
| Morgan | 0.861±0.026 | 0.837±0.033 | 0.935±0.004 | 0.923±0.006 |
| Mol2vec | 0.885±0.024 | 0.863±0.027 | 0.945±0.004 | 0.934±0.005 |

Boldface indicates the highest value for each performance metric. Logit, logistic regression; RF, random forest; ERT, extremely randomized trees; MLP, multilayer perceptron; CDF, cascade deep forest.

# CDF model with 2 estimators in each cascade layer and 100 trees in each forest.

The performance of our model was also compared with joint learning, a previous approach proposed by Sawada et al. [16]. They constructed feature vectors based on drug-induced signatures and trained classifier models that predict activatory and inhibitory DTIs for each target using joint learning. For comparison under the same conditions, we selected DTIs and their features from the original dataset for which drug-induced signatures were available in the L1000 dataset. We obtained 55 activatory DTIs between 28 targets and 47 drugs and 592 inhibitory DTIs between 217 targets and 367 drugs. We note that every target among the activatory DTIs had an insufficient number of DTIs (fewer than 5). Since this sparsity makes it difficult to properly assign a positive DTI to each fold during CV experiments, we trained the models and evaluated their performance only on the inhibitory DTI dataset, which has a sufficient number of DTIs for each target. In a comprehensive comparative evaluation over the hyperparameters of joint learning and the classifiers of AI-DTI, we found that AI-DTI showed higher AUROC and AUPR than joint learning (S3 Table). These results indicate that feature vectors calculated using mol2vec could be more useful than drug-induced signatures in predicting DTIs.

3.4. AI-DTI can predict diverse druggable targets

The drawback of previous models using genetically perturbed transcriptomes is that the range of predictable targets is constrained to the targets for which the genetically perturbed transcriptome has been measured. For example, in a previous study [16], the number of predictable activatory and inhibitory DTI targets was only 77 and 769, respectively, covering only a fraction of druggable targets. To broaden the applicability of our method, we attempted to expand the target space of our model by inferring target vectors based on PPI networks (see Materials and Methods, Fig 2C). The assumption behind this method is that the genetically perturbed transcriptome of a protein is correlated with those of its functionally interacting proteins. The inference procedure calculates a representative vector for a target whose genetically perturbed transcriptome was not measured, thus enabling the construction of input features for a wider range of targets. To check the reliability of the method, we first measured the correlation between the inferred data and the genetically perturbed transcriptome. We estimated 1,673 activatory target vectors and 2,805 inhibitory target vectors for genes with a measured transcriptome and computed the Spearman correlation between the inferred vector and the genetically perturbed transcriptome of the same gene. For comparison, we also calculated the Spearman correlations between genetically perturbed transcriptomes and the inferred vectors of other genes. We found that the correlations for the same gene were significantly higher than those for other genes (p < 0.001 for both activatory and inhibitory targets, S2 Fig).

We then evaluated the predictive performance for DTIs in the additional dataset, where target vectors consist of the inferred transcriptome. The performance of the model was measured in the same manner as in the above experiments. The results showed that the CDF model achieved satisfactory AUROC and AUPR values on the additional dataset (Table 4), indicating that activatory and inhibitory DTIs can be accurately predicted even with inferred target vectors. We observed no significant difference in performance between training a single model on the integrated original and additional datasets and training separate models on each dataset (Table 4). For ease of use, we decided to conduct subsequent analyses using a model trained on the integrated dataset, which incorporates the original dataset and the additional dataset. It is noteworthy that our model trained on these datasets can predict more than 70% of druggable targets (targets that appeared in TTD), indicating that AI-DTI can be employed to predict a wide range of drug targets.

Table 4. Assessment of the performance of the optimized CDF model for various datasets.

Values are mean±S.D. Columns are grouped by sample ratio (1:1 or 1:10).

| Dataset | Activatory AUROC (1:1) | Activatory AUPR (1:1) | Inhibitory AUROC (1:1) | Inhibitory AUPR (1:1) | Activatory AUROC (1:10) | Activatory AUPR (1:10) | Inhibitory AUROC (1:10) | Inhibitory AUPR (1:10) |
|---|---|---|---|---|---|---|---|---|
| Original dataset | 0.880±0.029 | 0.899±0.019 | 0.935±0.003 | 0.946±0.003 | 0.873±0.007 | 0.629±0.033 | 0.939±0.004 | 0.780±0.007 |
| Additional dataset | 0.873±0.011 | 0.869±0.013 | 0.953±0.002 | 0.957±0.002 | 0.864±0.012 | 0.430±0.030 | 0.955±0.002 | 0.800±0.008 |
| Separate model* | 0.875±0.011 | 0.878±0.011 | 0.944±0.002 | 0.952±0.002 | 0.867±0.009 | 0.488±0.022 | 0.947±0.002 | 0.790±0.004 |
| Integrated model# | 0.875±0.010 | 0.881±0.008 | 0.943±0.002 | 0.951±0.001 | 0.869±0.008 | 0.489±0.023 | 0.946±0.002 | 0.786±0.005 |

Boldface indicates the highest value for each performance metric between the separate model and the integrated model.

*Models trained on the original and additional datasets separately.

#Models trained on an integrated dataset.

3.5. AI-DTI achieved substantial performance on independent datasets

To test the generalization ability of the model, the performance of AI-DTI was further evaluated on independent datasets. We obtained activatory and inhibitory DTIs from DrugBank and selected DTIs that met the following two criteria: (1) DTI pairs that include a compound for which ECFPs can be calculated and a target for which (inferred) transcriptome profiles are available and (2) DTIs that were not seen during the training phase (Table 1). Because data leakage can significantly affect the measured performance of a model, we carefully confirmed that this process identified independent DTIs that were unseen in the training phase. All remaining non-positive drug–target pairs were assigned as negative samples. We predicted interaction scores using AI-DTI and evaluated its performance in predicting activatory and inhibitory DTIs. Our models achieved satisfactory AUROC and AUPR values (Fig 4A). Note that only a small fraction of the samples in these datasets are positive (1.26% and 0.03% for activatory and inhibitory DTIs, respectively), and this imbalance is an unfavourable condition for DTI classification. Even on these highly skewed datasets, the optimized CDF-based model achieved the highest AUROC: 0.773 for activatory DTIs and 0.723 for inhibitory DTIs. The precision-recall curves also showed that the optimized CDF model performed better than the other models. Taken together, these results indicate that our model generalizes to DTIs whose targets were unseen during the training phase.
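To make the effect of class imbalance on the two metrics concrete, the following is a minimal sketch that computes AUROC and a step-wise AUPR estimate (average precision) on a made-up ranking with 2 positives among 20 samples. The labels, scores, and function names are illustrative assumptions, and average precision is only one common way to summarize the precision-recall curve; the estimator used in the study may differ.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k
    return total / hits

# Toy ranking (assumption): positives sit at ranks 1 and 4 out of 20 samples.
labels = [1, 0, 0, 1] + [0] * 16
scores = [0.95, 0.90, 0.85, 0.80] + [0.70 - 0.02 * i for i in range(16)]

positive_rate = sum(labels) / len(labels)  # AUPR expected from random scoring
print("AUROC:", round(auroc(labels, scores), 3))
print("Average precision:", average_precision(labels, scores))
print("Positive rate (chance-level AUPR):", positive_rate)
```

The chance level of AUROC stays at 0.5 regardless of the positive rate, whereas the chance level of AUPR equals the positive rate, which is why AUPR values are much harder to keep high on highly skewed datasets.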

Fig 4. Assessment of performance on independent datasets.

(A) Performance curves for activatory (top) and inhibitory (bottom) DTIs on the DrugBank dataset. *CDF model with 8 estimators in each cascade layer and 500 trees in each forest. Logit, logistic regression; RF, random forest; MLP, multilayer perceptron; CDF, cascade deep forest. (B) Performance comparison between our optimized model and the conventional virtual screening methods on LIT-PCBA. *CDF model with 8 estimators in each cascade layer and 500 trees in each forest.

Evaluating performance on a dataset in which non-positive samples are assigned as negative samples does not fully reflect practical drug discovery scenarios. We therefore evaluated the performance of AI-DTI on another benchmark dataset, LIT-PCBA [31]. A key feature of LIT-PCBA is that it systematically integrates high-throughput screening datasets consisting of compounds that are active or inactive against targets. Performance evaluation on this dataset can therefore reflect more realistic drug discovery scenarios in which negative samples, as well as positive samples, are explicitly defined. We predicted activatory and inhibitory DTIs using the optimized CDF model and compared the performance with that of the three baseline virtual screening (VS) methods presented by LIT-PCBA, i.e., the 2D fingerprint similarity method, the 3D shape similarity method, and molecular docking. To reduce bias in the performance estimate, we trained our method ten times using different negative samples and measured the mean performance on a fully processed target set. The results show that AI-DTI achieved a higher mean AUROC than the conventional VS methods optimized with a max-pooling approach (Fig 4B). Specifically, our method achieved the highest AUROC values in all target sets (FEN1, IDH1, KAT2A, and VDR) for which the other VS methods produced AUROC values worse than chance and in three target sets (ALDH1, FEN1, and KAT2A) that were unseen in the training phase. Moreover, our method achieved higher EF1% values than the conventional methods for all activatory ligands (ESR_ago, FEN1, and OPRK1) and one inhibitory ligand (FEN1). It is worth noting that AI-DTI is a large-scale method that can predict a wide range of targets, whereas the other comparative models are local models built separately to predict specific protein targets. In summary, we found that our model offers superior performance in classifying active and inactive compounds in high-throughput screening datasets containing DTIs with an unseen target and/or an unseen compound.
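For readers unfamiliar with the EF1% metric, a minimal sketch of the enrichment factor is given below. It assumes the commonly used definition, namely the hit rate among the top-scoring x% of compounds divided by the hit rate in the whole library; the function name and the 1,000-compound toy library are illustrative assumptions, not data from LIT-PCBA.

```python
def enrichment_factor(labels, scores, fraction=0.01):
    """EF at `fraction`: (actives in top / size of top) / (all actives / all compounds)."""
    n = len(labels)
    n_top = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    actives_top = sum(label for _, label in ranked[:n_top])
    actives_all = sum(labels)
    return (actives_top / n_top) / (actives_all / n)

# Toy library (assumption): 1,000 compounds, 10 actives, scores decrease with index.
n_compounds = 1000
active_indices = {0, 1, 2, 500, 600, 700, 800, 900, 950, 990}
labels = [1 if i in active_indices else 0 for i in range(n_compounds)]
scores = [1.0 - i / n_compounds for i in range(n_compounds)]

ef1 = enrichment_factor(labels, scores, fraction=0.01)
print("EF1%:", round(ef1, 2))
```

Here 3 of the 10 top-ranked compounds are active against a background rate of 1%, so the screen is about 30 times more enriched than random selection. An EF1% of 1 corresponds to random ranking.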

3.6. AI-DTI can predict DTIs for novel diseases

Another way to assess the practicality of a DTI prediction model is to test whether it can aid the drug discovery process for unseen diseases. In other words, evaluating DTI prediction performance for an unseen disease assesses the model's general ability to guide target-based drug discovery. We thus tested whether AI-DTI could identify the DTIs of candidate drugs for COVID-19 treatment. Validated DTIs for COVID-19 were collected from DrugBank, and DTIs that met the following two criteria were selected: (1) DTIs containing a compound for which ECFPs could be calculated and a target with (inferred) transcriptome profiles available and (2) DTIs that were not seen during the training phase. We employed the optimized CDF-based models to predict the activatory or inhibitory interaction scores for the validated DTIs. We regarded a validated DTI as rediscovered when its predicted score exceeded the default threshold of our model (0.5). To assess how uncommon a predicted score is, we constructed a reference distribution and determined the relative rank of the score (i.e., top %) within it. The reference distribution was defined as the distribution of interaction scores calculated using our method for DTI pairs between 2500 FDA-approved drugs and the target of interest.
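The rank-based "top %" can be sketched as follows. This is a minimal illustration under stated assumptions: the rank is taken as one plus the number of reference scores strictly greater than the query score (the handling of ties in the study is not specified here), the function name is hypothetical, and the 2,500 reference scores are synthetic values chosen so that the query lands at rank 538.

```python
def top_percentage(score, reference_scores):
    """Return (rank, top %) of `score` within a reference distribution (rank 1 = highest)."""
    rank = 1 + sum(1 for s in reference_scores if s > score)
    return rank, 100.0 * rank / len(reference_scores)

# Synthetic reference distribution (assumption): 2,500 drug scores for one target,
# of which 537 are higher than the query score of 0.89.
reference_scores = [1.0] * 537 + [0.89] + [0.1] * 1962

rank, pct = top_percentage(0.89, reference_scores)
print(f"rank {rank} of {len(reference_scores)} -> top {pct:.2f}%")
```

A rank of 538 among 2,500 drugs corresponds to the top 21.52%, which is the relationship between the rank and percentage columns of Table 5.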

We found that approximately half of the DTIs (12/25) were successfully rediscovered by our method, of which three activatory and five inhibitory DTIs were in the top 5% (Table 5). It is noteworthy that the targets of two activatory DTIs (metenkephalin–OPRM1 and metenkephalin–OPRD1) and two inhibitory DTIs (ifenprodil–GRIN1 and ifenprodil–GRIN2B) were included in the extended dataset, so the high ranks of these DTIs support the reliability of our results within the extended target space. On the other hand, the low true positive rate for the inhibitory DTIs (33%, 6/18) might raise concerns about the reliability of our method's predictions. However, except for three targets (TNF, HMGB1, and JAK1), none of the scores between the FDA-approved drugs and the targets showing false-negative results exceeded the default threshold, which indicates that these false-negative results did not affect the precision of the predictions. We also calculated a confusion matrix for DTIs between FDA-approved drugs and COVID-19-related targets and found that AI-DTI achieved an F1-score more than twice the chance level (S3 Fig). To facilitate drug repurposing, we used our method to compile a list of FDA-approved drugs yielding high scores for COVID-19 targets (S4 and S5 Tables).

Table 5. Predicted results for validated DTIs related to COVID-19.

| Mode of action | Target name | Gene symbol | Drug name | Score* | Top percentage (rank)# |
|---|---|---|---|---|---|
| Activation | Peroxisome proliferator-activated receptor gamma | PPARG | Ibuprofen | **0.89** | 21.52 (538) |
| Activation | Peroxisome proliferator-activated receptor alpha | PPARA | Ibuprofen | **0.61** | 41.12 (1028) |
| Activation | Nuclear receptor subfamily 1 group I member 2 | NR1I2 | Dexamethasone | **0.75** | 5.12 (128) |
| Activation | Annexin A1 | ANXA1 | Dexamethasone | 0.17 | 48.36 (1209) |
| Activation | Glucocorticoid receptor | NR3C1 | Dexamethasone | **0.97** | 1.48 (37) |
| Activation | Mu-type opioid receptor | OPRM1 | Metenkephalin | **0.92** | 1.16 (29) |
| Activation | Delta-type opioid receptor | OPRD1 | Metenkephalin | **0.96** | 1.28 (32) |
| Inhibition | Tumour necrosis factor | TNF | Chloroquine | 0.21 | 86.84 (2171) |
| Inhibition | Glutathione S-transferase A2 | GSTA2 | Chloroquine | 0.11 | 74.64 (1866) |
| Inhibition | Glutathione S-transferase Mu 1 | GSTM1 | Chloroquine | 0.15 | 71.68 (1792) |
| Inhibition | Toll-like receptor 9 | TLR9 | Chloroquine | 0.09 | 59.08 (1477) |
| Inhibition | High mobility group protein B1 | HMGB1 | Chloroquine | 0.10 | 65.44 (1636) |
| Inhibition | Tubulin beta chain | TUBB | Colchicine | 0.10 | 58.84 (1471) |
| Inhibition | Prostaglandin G/H synthase 2 | PTGS2 | Ibuprofen | **0.95** | 2.84 (71) |
| Inhibition | Cystic fibrosis transmembrane conductance regulator | CFTR | Ibuprofen | 0.19 | 80.56 (2014) |
| Inhibition | Glutamate receptor ionotropic, NMDA 2B | GRIN2B | Ifenprodil | **1.00** | 0.08 (2) |
| Inhibition | Glutamate receptor ionotropic, NMDA 1 | GRIN1 | Ifenprodil | **0.78** | 6.28 (157) |
| Inhibition | G protein-activated inward rectifier potassium channel 1 | KCNJ3 | Ifenprodil | 0.07 | 93.24 (2331) |
| Inhibition | G protein-activated inward rectifier potassium channel 4 | KCNJ5 | Ifenprodil | 0.07 | 93.24 (2331) |
| Inhibition | G protein-activated inward rectifier potassium channel 2 | KCNJ6 | Ifenprodil | 0.07 | 93.24 (2331) |
| Inhibition | Histone deacetylase 1 | HDAC1 | Fingolimod | **0.69** | 17 (425) |
| Inhibition | Tyrosine-protein kinase JAK3 | JAK3 | Baricitinib | **0.60** | 2.44 (61) |
| Inhibition | Tyrosine-protein kinase JAK1 | JAK1 | Baricitinib | 0.44 | 2.76 (69) |
| Inhibition | Tyrosine-protein kinase JAK2 | JAK2 | Baricitinib | **0.65** | 4.2 (105) |
| Inhibition | Protein-tyrosine kinase 2-beta | PTK2B | Baricitinib | 0.14 | 27.2 (680) |

* Scores that exceed the model’s default threshold are in boldface.

# The top percentage and rank were calculated against the score for 2500 FDA-approved drugs and the corresponding genes.
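As a worked check of the rediscovery counts, the sketch below applies the default threshold (0.5) to the rounded scores listed in Table 5. The variable and function names are illustrative; because the table values are rounded to two decimals, this is only a tally of the reported scores and not a re-run of the model.

```python
THRESHOLD = 0.5  # default decision threshold stated in the text

# (gene symbol, drug, score) copied from Table 5
activatory = [
    ("PPARG", "Ibuprofen", 0.89), ("PPARA", "Ibuprofen", 0.61),
    ("NR1I2", "Dexamethasone", 0.75), ("ANXA1", "Dexamethasone", 0.17),
    ("NR3C1", "Dexamethasone", 0.97), ("OPRM1", "Metenkephalin", 0.92),
    ("OPRD1", "Metenkephalin", 0.96),
]
inhibitory = [
    ("TNF", "Chloroquine", 0.21), ("GSTA2", "Chloroquine", 0.11),
    ("GSTM1", "Chloroquine", 0.15), ("TLR9", "Chloroquine", 0.09),
    ("HMGB1", "Chloroquine", 0.10), ("TUBB", "Colchicine", 0.10),
    ("PTGS2", "Ibuprofen", 0.95), ("CFTR", "Ibuprofen", 0.19),
    ("GRIN2B", "Ifenprodil", 1.00), ("GRIN1", "Ifenprodil", 0.78),
    ("KCNJ3", "Ifenprodil", 0.07), ("KCNJ5", "Ifenprodil", 0.07),
    ("KCNJ6", "Ifenprodil", 0.07), ("HDAC1", "Fingolimod", 0.69),
    ("JAK3", "Baricitinib", 0.60), ("JAK1", "Baricitinib", 0.44),
    ("JAK2", "Baricitinib", 0.65), ("PTK2B", "Baricitinib", 0.14),
]

def rediscovered(rows, threshold=THRESHOLD):
    """Return the DTIs whose predicted score exceeds the threshold."""
    return [(gene, drug) for gene, drug, score in rows if score > threshold]

act_hits = rediscovered(activatory)
inh_hits = rediscovered(inhibitory)
total_hits = len(act_hits) + len(inh_hits)
total_dtis = len(activatory) + len(inhibitory)

print(f"Activatory: {len(act_hits)}/{len(activatory)}")
print(f"Inhibitory: {len(inh_hits)}/{len(inhibitory)}")
print(f"Overall: {total_hits}/{total_dtis}")
```

The tally gives 6/7 activatory and 6/18 inhibitory DTIs above the threshold, i.e., 12/25 overall, in agreement with the proportions quoted in the text.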

4. Discussion

Accurately identifying DTIs together with their mode of action is a crucial step in drug development and in understanding how drugs act. Here, we present AI-DTI, a novel computational approach for identifying activatory and inhibitory targets of small molecules. By leveraging a mol2vec model and genetically perturbed transcriptomes, AI-DTI accurately predicted activatory and inhibitory DTIs for a wide range of small molecules and drug targets. The comprehensive evaluation demonstrated that AI-DTI accurately predicts activatory and inhibitory DTI pairs, even in datasets containing sparse positive samples, DTI pairs unseen in the training phase, and high-throughput biological assay results. A case study of COVID-19 DTIs shows that AI-DTI can be used to prioritize activatory and inhibitory DTIs even for unseen diseases.

We believe that AI-DTI can contribute to drug discovery and to research on the mechanisms of drugs. In drug discovery, our method can be applied to discover candidate compounds for diseases involving a variety of targets by providing large-scale predictions between a series of small molecules and a wide range of targets. In addition, because it predicts DTIs using only 2D structures, our method can generate plausible hypotheses about the mechanisms of action of compounds, including novel compounds and natural products, for which known target information is scarce.

Among the classifiers we employed, the CDF model yielded the highest performance in our comprehensive experiments. Unlike deep learning models, the CDF model automatically determines its complexity in a data-dependent way with relatively few parameters and achieves excellent performance across various domains, including simple DTI prediction [7, 9, 25]. Although a direct performance comparison is difficult owing to differences in datasets, our method using the CDF model not only outperformed the previous model that predicts activatory and inhibitory DTIs but was also competitive with some state-of-the-art models that predict only simple interactions and require functional annotations of compounds, such as drug-drug interactions, drug-disease relationships, and drug side effects [12, 32, 33].

We showed that AI-DTI is a practical tool that accurately predicts DTIs and their modes of action; however, this study has several limitations that leave room for improvement. First, prediction performance may be further improved by applying advanced algorithms, such as graph convolutional networks (GCNs), which have recently been reported to show state-of-the-art performance [14]. Since that model still requires functional annotations of drugs, such as drug-drug interactions, an interesting future study would be to develop a model that predicts DTIs more accurately, even for novel compounds. Second, we used transcriptome profiles of cells transduced with cDNA and shRNA as target vectors, which may suffer from a low signal-to-noise ratio due to background noise. The performance of our model may be improved further by using large-scale datasets created with advanced techniques, such as CRISPR. A future direction of our work is to develop a versatile predictive model that accurately predicts DTIs with various modes of action.

Supporting information

S1 Table. Distribution of targets whose genetically perturbed transcriptome was measured by cell line.

(XLSX)

S2 Table. Search range and selected hyperparameter values for the cascade deep forest models.

(XLSX)

S3 Table. Performance comparison between joint learning and AI-DTI.

(XLSX)

S4 Table. Candidate FDA-approved drugs for COVID-19-related activatory targets.

(XLSX)

S5 Table. Candidate FDA-approved drugs for COVID-19-related inhibitory targets.

(XLSX)

S1 Fig. The architecture and hyperparameters of the MLP models.

(TIF)

S2 Fig. Distribution of Spearman correlation coefficients between the inferred data and genetically perturbed transcriptome for the same gene (Within pair) and the other genes (Between pair).

(TIF)

S3 Fig. Predictive performance of AI-DTI on DTIs between FDA-approved drugs and COVID-19-related targets.

(TIF)

Data Availability

All relevant data are within the manuscript and its Supporting information files or at: https://bitbucket.org/NNSM/ai_dti.

Funding Statement

This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number HF20C0087) awarded to CK, the National Research Foundation of Korea (NRF), funded by the Korean government (MSIT) (grant number NRF-2020R1A6A3A13075094) awarded to WL, and the Ministry of Food and Drug Safety in 2021 (grant 21173MFDS561) awarded to CK. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Hughes JP, Rees SS, Kalindjian SB, Philpott KL. Principles of early drug discovery. Br J Pharmacol. 2011;162: 1239–1249. doi: 10.1111/j.1476-5381.2010.01127.x
2. Kapetanovic IM. Computer-aided drug discovery and development (CADDD): In silico-chemico-biological approach. Chem Biol Interact. 2008. doi: 10.1016/j.cbi.2006.12.006
3. Hassan Baig M, Ahmad K, Roy S, Mohammad Ashraf J, Adil M, Haris Siddiqui M, et al. Computer Aided Drug Design: Success and Limitations. Curr Pharm Des. 2016. doi: 10.2174/1381612822666151125000550
4. Lee WY, Lee CY, Kim YS, Kim CE. The methodological trends of traditional herbal medicine employing network pharmacology. Biomolecules. 2019;9: 362. doi: 10.3390/biom9080362
5. Fang J, Liu C, Wang Q, Lin P, Cheng F. In silico polypharmacology of natural products. Brief Bioinform. 2017;19: 1153–1171. doi: 10.1093/bib/bbx045
6. Mousavian Z, Masoudi-Nejad A. Drug-target interaction prediction via chemogenomic space: Learning-based methods. Expert Opinion on Drug Metabolism and Toxicology. 2014. doi: 10.1517/17425255.2014.950222
7. Chu Y, Kaushik AC, Wang X, Wang W, Zhang Y, Shan X, et al. DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Brief Bioinform. 2021;22: 451–462. doi: 10.1093/bib/bbz152
8. Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, et al. Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci. 2020;11: 1775–1797. doi: 10.1039/c9sc04336e
9. Zeng X, Zhu S, Hou Y, Zhang P, Li L, Li J, et al. Network-based prediction of drug-target interactions using an arbitrary-order proximity embedded deep forest. Bioinformatics. 2020. doi: 10.1093/bioinformatics/btaa010
10. Wang Z, Zhou M, Arnold C. Toward heterogeneous information fusion: bipartite graph convolutional networks for in silico drug repurposing. Bioinformatics. 2020;36: i525–i533. doi: 10.1093/bioinformatics/btaa437
11. Olayan RS, Ashoor H, Bajic VB. DDR: Efficient computational method to predict drug-Target interactions using graph mining and machine learning approaches. Bioinformatics. 2018;34: 1164–1173. doi: 10.1093/bioinformatics/btx731
12. Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun. 2017;8. doi: 10.1038/s41467-017-00680-8
13. Lee I, Keum J, Nam H. DeepConv-DTI: Prediction of drug-target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol. 2019;15: 1–21. doi: 10.1371/journal.pcbi.1007129
14. Zhao T, Hu Y, Valsdottir LR, Zang T, Peng J. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief Bioinform. 2020;00: 1–10. doi: 10.1093/bib/bbaa044
15. Zhang YF, Wang X, Kaushik AC, Chu Y, Shan X, Zhao MZ, et al. SPVec: A Word2vec-Inspired Feature Representation Method for Drug-Target Interaction Prediction. Front Chem. 2020;7: 1–11. doi: 10.3389/fchem.2019.00895
16. Sawada R, Iwata M, Tabei Y, Yamato H, Yamanishi Y. Predicting inhibitory and activatory drug targets by chemically and genetically perturbed transcriptome signatures. Sci Rep. 2018; 1–4. doi: 10.1038/s41598-017-18315-9
17. Jaeger S, Fulle S, Turk S. Mol2vec: Unsupervised Machine Learning Approach with Chemical Intuition. J Chem Inf Model. 2018;58: 27–35. doi: 10.1021/acs.jcim.7b00616
18. Enache OM, Lahr DL, Natoli TE, Litichevskiy L, Wadden D, Flynn C, et al. The GCTx format and cmap{Py, R, M, J} packages: Resources for optimized storage and integrated traversal of annotated dense matrices. Bioinformatics. 2019;35: 1427–1429. doi: 10.1093/bioinformatics/bty784
19. Subramanian A, Narayan R, Corsello SM, Root DE, Wong B, Golub TR, et al. A Next Generation Connectivity Map: L1000 Platform and the First 1,000,000 Profiles. Cell. 2017;171: 1437–1452.e17.
20. Smith I, Greenside PG, Natoli T, Lahr DL, Wadden D, Tirosh I, et al. Evaluation of RNAi and CRISPR technologies by large-scale gene expression profiling in the Connectivity Map. PLoS Biol. 2017;15: 1–23. doi: 10.1371/journal.pbio.2003213
21. Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, et al. STRING v11: Protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019. doi: 10.1093/nar/gky1131
22. Wang N, Zhao G, Zhang Y, Wang X, Zhao L, Xu P, et al. A Network Pharmacology Approach to Determine the Active Components and Potential Targets of Curculigo Orchioides in the Treatment of Osteoporosis. Med Sci Monit. 2017. doi: 10.12659/msm.904264
23. Wishart DS, Feunang YD, Guo AC, Lo EJ, Marcu A, Grant JR, et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 2018. doi: 10.1093/nar/gkx1037
24. Zhou ZH, Feng J. Deep forest: Towards an alternative to deep neural networks. IJCAI International Joint Conference on Artificial Intelligence. 2017.
25. Zhou ZH, Feng J. Deep forest. Natl Sci Rev. 2019;6: 74–86. doi: 10.1093/nsr/nwy108
26. van Laarhoven T, Nabuurs SB, Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011. doi: 10.1093/bioinformatics/btr500
27. Davis J, Goadrich M. The relationship between precision-recall and ROC curves. ACM International Conference Proceeding Series. 2006.
28. Huang CT, Hsieh CH, Chung YH, Oyang YJ, Huang HC, Juan HF. Perturbational Gene-Expression Signatures for Combinatorial Drug Discovery. iScience. 2019. doi: 10.1016/j.isci.2019.04.039
29. Noh H, Shoemaker JE, Gunawan R. Network perturbation analysis of gene transcriptional profiles reveals protein targets and mechanism of action of drugs and influenza A viral infection. Nucleic Acids Res. 2018. doi: 10.1093/nar/gkx1314
30. Spreafico R, Soriaga LB, Grosse J, Virgin HW, Telenti A. Advances in genomics for drug development. Genes. 2020. doi: 10.3390/genes11080942
31. Rognan D. LIT-PCBA: An Unbiased Data Set for Machine Learning and Virtual Screening. 2020.
32. Zheng X, Ding H, Mamitsuka H, Zhu S. Collaborative matrix factorization with multiple similarities for predicting drug-Target interactions. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013. doi: 10.1145/2487575.2487670
33. Zong N, Kim H, Ngo V, Harismendy O. Deep mining heterogeneous networks of biomedical linked data to predict novel drug-target associations. Bioinformatics. 2017. doi: 10.1093/bioinformatics/btx160
PLoS One. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042.r001

Author response to previous submission


Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

11 Apr 2022

Decision Letter 0

Jinn-Moon Yang

24 Jun 2022

PONE-D-22-10335

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 07 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jinn-Moon Yang

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Please note that PLOS ONE has specific guidelines on code sharing for submissions in which author-generated code underpins the findings in the manuscript. In these cases, all author-generated code must be made available without restrictions upon publication of the work. Please review our guidelines at https://journals.plos.org/plosone/s/materials-and-software-sharing#loc-sharing-code and ensure that your code is shared in a way that follows best practice and facilitates reproducibility and reuse.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: In this manuscript, the authors presented a method to predict the activatory and inhibitory Drug-target Interactions. The structural compound representations (Mol2vec) and genetically perturbed transcriptomes (978 landmark genes) were used as the features for machine learning. The training datasets were collected from the Therapeutic Target Database 2.0 and DrugBank 5.1.7. The independent test sets were constructed from Drugbank and LIT-PCBA. Four machine learning methods were used for model training, and cascade deep forest has the best predictive performance. The topic is essential, but there are several major concerns that the authors need to address.

1. The proposed method has only one prediction model for active DTIs. The predicted result for a DTI is either activatory or inhibitory. However, many relationships between drugs and targets are inactive. The application of the prediction model could be pretty limited to the availability of knowing interaction between drug and target.

2. How many mol2vec-based compound features were used for a compound? Were they 300-dimensional embedding of Morgan substructures? How does the window size of 10 work?

3. There are 1,278 features if the number of compound features is 300. The number of activatory DTIs in the original and additional are 457 and 887, respectively. It may cause the model to overfit if the number of features exceeds the number of positive/negative samples.

3. The EFx%, precision, and true and false positive rate equations were wrong without parentheses.

4. Why is the number of inactive activatory DTIs (318,066) more than the number of compounds (130,412) multiplied by the number of targets (2) in the LIT-PCBA set?

5. In lines 212-213, active and inactive DTIs should be described more clearly.

6. In Table 4, the authors compared their method with the previous model (ref. 18). How to get the predictive performance of independent sets using the previous prediction model?

7. In line 437, the authors indicated that approximately half of the DTIs (12/25) were successfully rediscovered. It is similar to random if the proposed model was for predicting the DTI is either activatory or inhibitory.

Reviewer #2: Reviewer Response:

The reviewer appreciates the efforts put forth by the authors. However, there are several concerns regarding the study that hinder its overall effectiveness.

Major:

- A fundamental issue with the submitted manuscript seems to be the combination of two types of features – chemical features and gene expression. The two types have different meanings in drug design. An inhibitor is usually designed to inhibit the activity of a target protein instead of changing the expression level of the target protein. For example, Gefitinib is an EGFR inhibitor, and the drug inhibits EGFR activity. When treated with the drug, the expression level of p-EGFR is inhibited, while the expression level of EGFR is similar. Therefore, combining the fingerprints of Gefitinib with the gene knock-down signature of EGFR may not be reasonable for use as features to predict the drug-target interaction.

- Regarding the feature generation, the use of mol2vec is interesting. However, concatenation of gene expression is a major flaw. We can use HDAC inhibitors as an example. Many HDAC inhibitors contain a hydroxamic moiety. The fingerprint for each compound would be different; however, this information is concatenated with additional information (known gene expression). Adding gene expression would then train a model to “assume” all compounds containing a hydroxamic moiety would induce the same gene expression. The information (mol2vec and the concatenated gene expression) would be better used separately.

- In Figure 2, the authors show that they aggregated the transcriptome signatures from L1000 profiles. However, the information is aggregated across different cell lines. Following the workflow in Figure 2, it seems the authors combined the expression across different cancers (such as breast cancer and colorectal cancer cell lines). Combining these results does not seem reasonable.

- The authors' reference model was developed by Sawada et al., which displayed information based on chemical treatment, gene knockdown, or gene overexpression. All three types of information by Sawada et al. are used as separate features. These steps seem more practical when compared to the authors’ submitted manuscript.

- Figure 4 is unclear. There are 2 sets of ROC and Precision-Recall Curve, but it is hard to determine which graph fits for predicting activity/inhibition by the authors in sections 3.4-3.5.

- Regarding Figure 4B, the AUC for AI-DTI (Mean = 0.61) is quite low. Are the compounds truly distinct from the training set?

- A comparison between mol2vec and other popular chemical representations may be needed and would increase the study’s novelty.

- The authors sought to test their model for COVID-19 DTI. However, insufficient explanation for why the analysis was performed or how it fits with the DTI study was given.

- Authors previously modified their manuscript to address reviewer concerns. However, these areas (in red) as well as the original contents contain grammatical issues or are not written clearly for readers.

Minor:

- A confusion matrix illustrating the COVID-19 DTI results would be helpful.

- Authors randomly selected their negative samples. The reviewer recommends using an additional database for negative samples (i.e. ChEMBL, PubChem, etc). Compounds can be separated based on reported bioactivity (IC50, Ki, or Kd) as a cutoff for negative samples.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042.r003

Author response to Decision Letter 0


5 Aug 2022

Reviewer #1:

In this manuscript, the authors presented a method to predict the activatory and inhibitory Drug-target Interactions. The structural compound representations (Mol2vec) and genetically perturbed transcriptomes (978 landmark genes) were used as the features for machine learning. The training datasets were collected from the Therapeutic Target Database 2.0 and DrugBank 5.1.7. The independent test sets were constructed from Drugbank and LIT-PCBA. Four machine learning methods were used for model training, and cascade deep forest has the best predictive performance. The topic is essential, but there are several major concerns that the authors need to address.

* The proposed method has only one prediction model for active DTIs. The predicted result for a DTI is either activatory or inhibitory. However, many relationships between drugs and targets are inactive. The application of the prediction model could be pretty limited to the availability of knowing interaction between drug and target.

-> We appreciate the comments.

To address the reviewer's concern: the purpose of our model is not to classify the mode of action among known DTIs, but to predict whether a DTI of interest is active or inactive, together with its mode of action (activatory or inhibitory). In addition, we found that AI-DTI predicted DTIs more accurately than previous models and that the range of predictable compounds and targets was expanded. Therefore, the applicability of our model is not limited by the availability of known interactions between drugs and targets. Thanks to this comment, we realized that the previous manuscript did not appropriately describe the aims of the model. To reduce potential confusion, we revised the description of the build process and goals of AI-DTI in the manuscript (lines 274-286).

* How many mol2vec-based compound features were used for a compound? Were they 300-dimensional embedding of Morgan substructures? How does the window size of 10 work?

-> As the reviewer mentioned, we used a mol2vec model that receives Morgan substructures as input and converts them into a 300-dimensional vector. To explain the window size, it is useful to describe how mol2vec is structured. Mol2vec is an NLP-inspired model for calculating compound embeddings. It is trained by computing vectors that best reconstruct the substructures contained in the Morgan fingerprint and their surrounding substructures. The window size of mol2vec is the number of surrounding substructures considered in the training phase.
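To make the role of the window size concrete, the following is a minimal sketch, assuming a skip-gram-style pairing in which each substructure is paired with up to `window` neighbours on each side. The function name `context_pairs` and the substructure identifiers are hypothetical and are not taken from the mol2vec code.

```python
def context_pairs(sentence, window):
    """Return (center, context) pairs as a skip-gram-style model would see them.

    `sentence` is the ordered list of substructure identifiers of one compound;
    `window` is the number of neighbours considered on each side of the center.
    """
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical Morgan substructure identifiers for one compound (assumption).
sentence = ["s1", "s2", "s3", "s4", "s5"]

narrow = context_pairs(sentence, window=1)   # only immediate neighbours
wide = context_pairs(sentence, window=10)    # window exceeds sentence length
```

With a window of 10 and a short compound, every substructure is paired with every other one, so the window only limits context for compounds with long substructure sequences.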

* There are 1,278 features if the number of compound features is 300. The number of activatory DTIs in the original and additional are 457 and 887, respectively. It may cause the model to overfit if the number of features exceeds the number of positive/negative samples.

-> Thanks for pointing out this important issue.

As the reviewer points out, overfitting is a major issue in the application of machine learning models. The reviewer noted that it could arise from the ratio between the number of input features and the number of samples.

To minimize overfitting, we tuned several regularization-related hyperparameters of the cascade deep forest, such as max_depth, the number of trees, and the number of tolerant rounds. Moreover, the trained model showed superior performance on various independent datasets, indicating that it did not suffer from significant overfitting.

* The EFx%, precision, and true and false positive rate equations were wrong without parentheses.

-> Thank you for pointing out our mistakes.

We added parentheses to the equations.
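For reference, the standard definitions of these metrics can be written with explicit grouping as follows. This is a sketch of the conventional formulas, not a reproduction of the manuscript's notation; the function and argument names are assumptions.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)


def true_positive_rate(tp, fn):
    """TPR (recall) = TP / (TP + FN)."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """FPR = FP / (FP + TN)."""
    return fp / (fp + tn)


def enrichment_factor(actives_in_top, n_top, actives_total, n_total):
    """EFx% = (actives in top x% / size of top x%) / (all actives / all samples)."""
    return (actives_in_top / n_top) / (actives_total / n_total)
```

Without the parentheses, an expression such as `tp / tp + fp` evaluates to `1 + fp`, which is why the grouping matters.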

* Why is the number of inactive activatory DTIs (318,066) more than the number of compounds (130,412) multiply the number of targets (2) in the LIT-PCBA set?

-> Thank you for noticing our mistake.

The number of targets has been updated with the correct information, which matches the information on the LIT-PCBA homepage (https://drugdesign.unistra.fr/LIT-PCBA/).

* In lines 212-213, active and inactive DTIs should be described more clearly.

-> As the reviewer requested, we now describe active and inactive DTIs in more detail (lines 201-204).

* In Table 4, the authors compared their method with the previous model (ref. 18). How to get the predictive performance of independent sets using the previous prediction model?

-> We appreciate the reviewer pointing out this critical issue.

Before revision, the performance of the previous model was taken from the supplementary tables of the previous study. This means that the performance of our model and that of the previous model were measured on different datasets. We realized that this was an unfair comparison, so we constructed a common dataset and compared our model with the previous model on it. After a comprehensive comparative evaluation over the hyperparameters of joint learning and the classifiers of AI-DTI, we found that AI-DTI showed higher AUROC and AUPR than joint learning, even on the constructed dataset (Supplementary Table S2).

* In line 437, the authors indicated that approximately half of the DTIs (12/25) were successfully rediscovered. It is similar to random if the proposed model was for predicting the DTI is either activatory or inhibitory.

-> As we mentioned in our response to the reviewer’s comment 1, the chance level of DTI rediscovery is not 0.5 when the imbalance between active and inactive DTIs is taken into account. To support this quantitatively, we calculated a confusion matrix on the COVID-19 dataset and found that the F1-score was more than twice the random chance level. We also found that the prediction results of AI-DTI were significantly associated with known versus unknown DTIs (p < 0.05; Fisher's exact test and chi-square test for activatory and inhibitory DTIs, respectively).
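The comparison against chance can be sketched as follows. This is an illustration under stated assumptions, not the authors' calculation: the chance baseline assumes a random labeller whose expected precision equals the prevalence of true DTIs and whose expected recall equals its positive-prediction rate, and the false positive and true negative counts are made-up placeholders rather than the values in the supplementary confusion matrix.

```python
def f1_score(tp, fp, fn):
    """F1 = harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def chance_f1(prevalence, predicted_positive_rate):
    """Approximate F1 of a random labeller (assumption: expected precision
    equals prevalence, expected recall equals the positive-prediction rate)."""
    p, q = prevalence, predicted_positive_rate
    return 2 * p * q / (p + q)


# Placeholder confusion matrix: 12 of 25 known DTIs recovered (as in the text);
# fp and tn are hypothetical values chosen only for illustration.
tp, fn, fp, tn = 12, 13, 20, 155
total = tp + fn + fp + tn

model_f1 = f1_score(tp, fp, fn)
baseline_f1 = chance_f1((tp + fn) / total, (tp + fp) / total)
```

Because known DTIs are a small fraction of all candidate pairs, the chance baseline is driven by the prevalence rather than by 0.5.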

Reviewer #2: Reviewer Response:

The reviewer appreciates the efforts put forth by the authors. However, there are several concerns regarding the study that hinder its overall effectiveness.

Major:

* A fundamental issue with the submitted manuscript seems to be the combination of two types of features – chemical features and gene expression. The two types have different meanings in drug design. An inhibitor is usually designed to inhibit the activity of a target protein instead of changing the expression level of the target protein. For example, Gefitinib is an EGFR inhibitor, and the drug inhibits EGFR activity. When treated with the drug, the expression level of p-EGFR is inhibited, while the expression level of EGFR is similar. Therefore, combining the fingerprints of Gefitinib with the gene knock-down signature of EGFR may be not reasonable for use as features to predict the drug-target interaction.

-> Thanks for pointing out such an important point.

When focusing on a single gene or protein, we agree with the reviewer that perturbations of protein function and of gene expression have different meanings in drug design. Specifically, when a drug inhibits a protein, the expression level of that protein does not change, as the reviewer mentioned, but the inhibition can influence the expression levels of other genes, and this response can be inferred from the genetically perturbed transcriptome. Based on this assumption, several researchers have developed machine learning models that predict DTIs using KO gene signatures as target feature vectors [Sawada et al., Xie et al., Lee et al.]. The results of our study also indicate that DTIs can be accurately classified using the signature measured after gene perturbation, which supports the usefulness of genetic perturbation signatures in predicting DTIs.

# Reference

Sawada, Ryusuke, et al. "Predicting inhibitory and activatory drug targets by chemically and genetically perturbed transcriptome signatures." Scientific reports 8.1 (2018): 1-9.

Xie, Lingwei, et al. "Deep learning-based transcriptome data classification for drug-target interaction prediction." BMC genomics 19.7 (2018): 93-102.

Lee, Hanbi, and Wankyu Kim. "Comparison of target features for predicting drug-target interactions by deep neural network based on large-scale drug-induced transcriptome data." Pharmaceutics 11.8 (2019): 377.

* Regarding the feature generation, the use of mol2vec is interesting. However, concatenation of gene expression is a major flaw. We can use HDAC inhibitors as an example. Many HDAC inhibitors contain a hydroxamic moiety. The fingerprint for each compound would be different; however, this information is concatenated with additional information (known gene expression). Adding gene expression would then train a model to “assume” all compounds containing a hydroxamic moiety would induce the same gene expression. The information (mol2vec and the concatenated gene expression) would be better used separately.

-> Thanks for pointing out a potential issue with our model.

Regarding the reviewer's concern, the aim of concatenating a target vector and a compound vector is not to investigate the causality between them, but to provide an input feature for evaluating the reliability of a DTI. During training, the impact of the issue raised by the reviewer can be reduced in a data-driven manner. Consider the same example of a DTI dataset focused on HDAC. In this dataset, DTIs contain combinations of a compound with a hydroxamic moiety and an HDAC-perturbed gene expression signature, and they are used to train a classification model. If the presence or absence of the hydroxamic moiety does not significantly affect the discrimination of DTIs, the trained model will not give much weight to the features derived from the hydroxamic moiety when predicting DTIs.

The reviewer also suggests using the target features and the compound features separately. There are pros and cons to both. An approach that uses only compound information is a ligand-based approach, and it shows high predictive performance when sufficient information on known ligands for a specific protein is available. In contrast, an approach that concatenates target and compound feature vectors falls within the chemogenomic approach and can efficiently predict a wide range of DTIs. AI-DTI was proposed for predicting targets of natural products or experimental small molecules for which evidence is lacking. Therefore, AI-DTI concatenates compound features and target features to benefit from this broader coverage.
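The concatenated input discussed here can be sketched as below, assuming a 300-dimensional compound embedding and a 978-gene target signature (which together give the 1,278 features mentioned by Reviewer #1). The function name `make_dti_features` and all vector values are hypothetical placeholders, not real embeddings or signatures.

```python
COMPOUND_DIM = 300  # mol2vec embedding size stated in the response
TARGET_DIM = 978    # landmark genes in the perturbed transcriptome


def make_dti_features(compound_vec, target_vec):
    """Concatenate a compound vector and a target vector into one DTI feature vector."""
    if len(compound_vec) != COMPOUND_DIM or len(target_vec) != TARGET_DIM:
        raise ValueError("unexpected feature dimensions")
    return list(compound_vec) + list(target_vec)


# Placeholder values (assumption): two different compounds, one shared target.
target_signature = [0.0] * TARGET_DIM
compound_a = [0.1] * COMPOUND_DIM
compound_b = [0.2] * COMPOUND_DIM

x_a = make_dti_features(compound_a, target_signature)
x_b = make_dti_features(compound_b, target_signature)
```

In this layout, two compounds paired with the same target share the last 978 entries and differ only in the first 300, so any difference in the prediction for the two pairs has to come from the compound part.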

* In Figure 2, the authors show that they aggregated the transcriptome signatures from L1000 profiles. However, the information is aggregated across different cell lines. Following the workflow in Figure 2, it seems the authors combined the expression across different cancers (such as breast cancer and colorectal cancer cell lines). Combining these results does not seem reasonable.

-> As the reviewer notes, drug-induced signatures in a single cell line are a mixture of cell line-specific responses and drug effects. Cell line-specific responses risk acting as noise when predicting DTIs in a general context. One way to reduce this risk is to compare the responses to the same drug across multiple cell lines and derive a common response. To this end, we applied similarity-based weighted aggregation, which allowed us to obtain representative responses to drug administration while reducing cell-specific responses.
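One common form of similarity-based weighted aggregation weights each signature by its mean correlation with the other signatures, so that discordant cell lines contribute less. The sketch below assumes that scheme; the exact weighting used by the authors is not given in this response, and the function names, the `floor` parameter, and the toy signatures are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def aggregate(signatures, floor=0.01):
    """Weighted mean of signatures; weight = mean correlation with the others,
    clipped below at `floor` and normalised to sum to 1."""
    k = len(signatures)
    raw = []
    for i in range(k):
        others = [pearson(signatures[i], signatures[j]) for j in range(k) if j != i]
        raw.append(max(sum(others) / len(others), floor))
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(signatures[0])
    merged = [sum(w * s[g] for w, s in zip(weights, signatures)) for g in range(n_genes)]
    return merged, weights


# Toy signatures over four genes: two concordant cell lines and one discordant.
signatures = [
    [1.0, -1.0, 0.5, 0.0],
    [0.9, -1.1, 0.6, 0.1],
    [-1.0, 1.0, 0.0, 0.5],
]
merged, weights = aggregate(signatures)
```

The discordant third signature receives the smallest weight, which is the intended effect of down-weighting cell line-specific responses.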

* The authors' reference model was developed by Sawada et al., which displayed information based on chemical treatment, gene knockdown, or gene overexpression. All three types of information by Sawada et al. are used as separate features. These steps seem more practical when compared to the authors’ submitted manuscript.

-> We appreciate the reviewer pointing out an important issue in terms of practicality.

To address the issue, we performed a comprehensive comparison with the previous model. We first constructed a common dataset that allows performance to be compared without bias, and then compared our method with the previous approach on it. After a comprehensive comparative evaluation over the hyperparameters of joint learning and the classifiers of AI-DTI, we found that AI-DTI showed higher AUROC and AUPR than joint learning (Supplementary Table S2). In addition, the previous approach requires drug-induced signatures for prediction, whereas our method can predict DTIs for most compounds with a 2D structure and for a wider range of targets. Taken together, we found that our model was more useful than the previous model in terms of prediction performance and the predictable range of drugs and targets.

* Figure 4 is unclear. There are 2 sets of ROC and Precision-Recall Curve, but it is hard to determine which graph fits for predicting activity/inhibition by the authors in sections 3.4-3.5.

-> We appreciate the reviewer pointing out the ambiguity of the figure.

We added annotations indicating which graphs correspond to activatory and inhibitory DTIs.

* Regarding Figure 4B, the AUC for AI-DTI (Mean = 0.61) is quite low. Are the compounds truly distinct from the training set?

-> There are three main reasons for the low AUC value on the LIT-PCBA dataset. First, the performance was measured on an independent dataset. Second, the source of the dataset is heterogeneous with respect to the training set (expert-curated DTIs for model training vs. high-throughput screening experiments in the independent dataset). Finally, the problem was more difficult because of severe data imbalance. Nevertheless, our method showed superior performance compared to the conventional method, which supports the usefulness of our model.

We found that a small number of compounds contained in known DTIs appear in the independent dataset as DTIs for different targets, but no drug and target pair matched exactly. Therefore, we consider that LIT-PCBA consists of DTIs unseen in the training phase.

* A comparison between mol2vec and other popular chemical representations may be needed and would increase the study’s novelty.

-> Thank you for your suggestions to improve the novelty of the study.

According to the reviewer’s suggestion, we measured the prediction performance after changing the compound representation to MACCS and Morgan fingerprints and found that mol2vec still outperformed these methods (Table 3).

* The authors sought to test their model for COVID-19 DTI. However, insufficient explanation for why the analysis was performed or how it fits with the DTI study was given.

-> We appreciate the reviewer pointing out the potential ambiguity in our manuscript.

Our motivation for conducting a case study on COVID-19 was to evaluate whether AI-DTI could help the drug discovery process for an unseen disease. To clarify the goals of the analysis on COVID-19, we supplemented the rationale for the case study in section 3.6 (lines 435-439).

* Authors previously modified their manuscript to address reviewer concerns. However, these areas (in red) as well as the original contents contain grammatical issues or are not written clearly for readers.

-> To reflect the reviewer's comments, we comprehensively reviewed and revised the manuscript. In particular, the last section of the introduction (lines 79-84), the second paragraph of the overview of the results (lines 274-284), and the performance evaluation against the existing model (lines 319-349) have been comprehensively rewritten.

Minor:

* A confusion matrix illustrating the COVID-19 DTI results would be helpful.

-> We appreciate the reviewer’s recommendation.

Following the reviewer’s suggestion, we calculated a confusion matrix focused on DTIs between FDA-approved drugs and COVID-19-related targets, and found that AI-DTI achieved an F1-score more than twice the chance level (Supplementary Figure 3).

* Authors randomly selected their negative samples. The reviewer recommends using an additional database for negative samples (i.e. ChEMBL, PubChem, etc). Compounds can be separated based on reported bioactivity (IC50, Ki, or Kd) as a cutoff for negative samples.

-> Thanks for the reviewer's suggestion.

If a sufficient number of experimentally validated negative samples could be obtained, the reliability of the DTI prediction model could be improved. Unfortunately, as the reviewer knows, the number of verified negative DTI samples is very limited. For example, Souri et al. obtained negative samples of inactive DTIs from ChEMBL, but there were only 2,057 of them, which would be insufficient to train the classifiers.

# Reference

Amiri Souri, E., et al. "Novel drug-target interactions via link prediction and network embedding." BMC bioinformatics 23.1 (2022): 1-16.

Another issue with using experimentally validated results as negative DTIs is the heterogeneity of target and drug types between positive and negative samples. This would induce a ‘shortcut learning’ issue, in which the classifier learns to distinguish the types of drugs and targets appearing in each dataset rather than evaluating the reliability of the DTI pair. On the other hand, randomly selecting negative DTIs from drug and target pairs included in the positive DTI dataset avoids this risk. Therefore, we selected negative samples by randomly sampling non-interacting pairs from the drugs and targets in each dataset.
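The sampling strategy described here can be sketched as follows, assuming that negatives are drawn uniformly from drug and target pairs that are not listed as positives and whose drug and target both occur in the positive set. The function name `sample_negatives`, the seed handling, and the toy identifiers are assumptions for illustration.

```python
import random


def sample_negatives(positives, n_negatives, seed=0):
    """Sample non-interacting (drug, target) pairs from the drugs and targets
    that appear in the positive set."""
    positives = set(positives)
    drugs = sorted({d for d, _ in positives})
    targets = sorted({t for _, t in positives})
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in positives]
    rng = random.Random(seed)
    return rng.sample(candidates, min(n_negatives, len(candidates)))


# Toy positive DTIs (hypothetical identifiers).
positives = [("d1", "t1"), ("d2", "t2"), ("d3", "t1")]
negatives = sample_negatives(positives, n_negatives=2)
```

Because every sampled drug and target also occurs among the positives, a classifier cannot separate the classes simply by recognising which drugs or targets appear in which set.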

Attachment

Submitted filename: 220802_revision_comments.docx

Decision Letter 1

Jinn-Moon Yang

6 Sep 2022

PONE-D-22-10335R1

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Oct 21 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jinn-Moon Yang

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Most comments have been addressed.

But the EFx% equation is still wrong without parentheses in numerator.

Reviewer #2: 1. The reviewer appreciates the authors' comments. While we agree that “inhibition of a protein can influence gene other expression levels”, the authors continue to assume that all compounds would result in the same gene expression level as the knock-down/overexpressed query target (i.e. same off-target hits, same protein-ligand binding mechanism, etc). As such, combining the compound fingerprints with knock-down/overexpression of genes as input features for establishing a model appears unreasonable.

The reviewer also appreciates the references given. Unfortunately, the referenced papers used transcriptome signatures as standalone features. Compared to the presented study, the referenced articles appear more reasonable for establishing a useful DTI network.

For example, Gefitinib, Erlotinib, and Lapatinib are EGFR inhibitors and the three compounds would result in different gene expressions. However, in the authors' model, the fingerprints of the three compounds will combine with the knock-down gene signature of EGFR. This assumes that the three compounds would result in the same gene expression level with the knock-down gene signature of EGFR. Therefore, the concatenation of gene expression and chemical information continues to be a major flaw to this study.

2. The reviewer understands the authors’ attempts at reducing prediction noise. Unfortunately, aggregating the information across different cell lines is unreasonable in establishing an effective DTI. Aggregating this information could potentially produce transcriptional signatures that are no longer relevant to a given disease. As the goal of the authors appear to be identifying a drug’s target interactions, this would be problematic. As a result, aggregating information across cell lines would greatly impair the effectiveness of the given model.

3. Regarding the dataset, were the independent dataset randomly selected and removed from the training set? Was there an attempt at balancing the dataset and seeing how that would affect performance?

Reviewer #3: The authors made proper revision, the paper is acceptable in its current form, please proceed with next stage.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

PLoS One. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042.r005

Author response to Decision Letter 1


13 Sep 2022

Reviewer #1: Most comments have been addressed.

But the EFx% equation is still wrong without parentheses in numerator.

-> Thanks for pointing out our mistake.

We added parentheses to the EFx% equation.

===============

Reviewer #2: 1. The reviewer appreciates the authors' comments. While we agree that “inhibition of a protein can influence gene other expression levels”, the authors continue to assume that all compounds would result in the same gene expression level as the knock-down/overexpressed query target (i.e. same off-target hits, same protein-ligand binding mechanism, etc). As such, combining the compound fingerprints with knock-down/overexpression of genes as input features for establishing a model appears unreasonable.

The reviewer also appreciates the references given. Unfortunately, the referenced papers used transcriptome signatures as standalone features. Compared to the presented study, the referenced articles appear more reasonable for establishing a useful DTI network.

For example, Gefitinib, Erlotinib, and Lapatinib are EGFR inhibitors and the three compounds would result in different gene expressions. However, in the authors' model, the fingerprints of the three compounds will combine with the knock-down gene signature of EGFR. This assumes that the three compounds would result in the same gene expression level with the knock-down gene signature of EGFR. Therefore, the concatenation of gene expression and chemical information continues to be a major flaw to this study.

-> We agree with the reviewer’s opinion and believe that an approach without performance evaluations may be unreasonable or majorly flawed. To this end, we previously conducted various tasks to evaluate the efficiency of our model and found that our model outperformed the previous model and showed superior performance in various independent datasets.

The reviewer also expressed concerns about combining the target feature with the compound feature. There are pros and cons to using the features separately and to using them together. One advantage of concatenating the target and compound feature vectors is that the model can efficiently predict a wide range of DTIs. In particular, AI-DTI can predict DTIs with mode of action for compounds for which only 2D structures are available, because it requires only structure-based information on the compound and the genetically perturbed transcriptome as input features. As a representative example, our previous study suggested that AI-DTI is an efficient method for identifying candidate flavonoids for NAFLD, a representative liver disease [WY Lee et al., 2022]. The approaches cited by the reviewer as being more efficient may be limited by their inability to predict DTIs for such natural products, for which the transcriptome has not been measured. Taken together, we showed the reliability of the model through a comprehensive performance evaluation, which can resolve potential concerns about its reasonableness or flaws.

# Reference

- Lee, Won-Yung, et al. "Identifying candidate flavonoids for non-alcoholic fatty liver disease by network-based strategy." Frontiers in Pharmacology (2022): 1718.
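
The concatenation debated in this exchange can be made concrete with a minimal sketch. It is not the authors' implementation: the function name, vector lengths, and values are hypothetical, chosen only to show that compounds paired with the same target share the target part of the input and differ in the compound part.

```python
from typing import Dict, List, Sequence


def build_pair_feature(compound_vec: Sequence[float],
                       target_vec: Sequence[float]) -> List[float]:
    """Concatenate a compound embedding and a target signature into one input vector."""
    return list(compound_vec) + list(target_vec)


# Hypothetical toy values (not data from the study): one target signature
# shared by three structurally different compounds.
egfr_knockdown = [0.8, -1.2, 0.1]
compounds: Dict[str, List[float]] = {
    "compound_a": [0.1, 0.4],
    "compound_b": [0.3, -0.2],
    "compound_c": [-0.5, 0.9],
}

pairs = {name: build_pair_feature(vec, egfr_knockdown)
         for name, vec in compounds.items()}
```

In this layout a classifier sees one row per drug-target pair, so the three rows differ only in their compound-derived entries.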

2. The reviewer understands the authors’ attempts at reducing prediction noise. Unfortunately, aggregating the information across different cell lines is unreasonable in establishing an effective DTI. Aggregating this information could potentially produce transcriptional signatures that are no longer relevant to a given disease. As the goal of the authors appears to be identifying a drug’s target interactions, this would be problematic. As a result, aggregating information across cell lines would greatly impair the effectiveness of the given model.

-> As the reviewer notes, drug-induced transcripts in a single cell line may be biased towards the specific response of the measured cell line. Cell line-specific responses risk acting as noise in DTI predictions in a general context that is agnostic to specific diseases. One way to reduce this risk is to derive a common signature by aggregating the responses of multiple cell lines administered the same drug. By applying similarity-based weighted aggregation, we were able to obtain representative responses to drug administration while reducing cell-specific responses.
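
A similarity-based weighted aggregation of this kind can be sketched as follows. The exact weighting scheme is not given in this response, so the sketch makes assumptions: each signature is weighted by its mean Pearson correlation with the other signatures, weights are floored at a small hypothetical `min_weight`, and all numbers are toy values.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def aggregate_signatures(signatures, min_weight=0.01):
    """Weighted average of per-cell-line signatures.

    Each signature's weight is its mean correlation with the others,
    floored at min_weight and normalized to sum to 1.
    """
    if len(signatures) == 1:
        return list(signatures[0])
    raw = []
    for i, s in enumerate(signatures):
        others = [pearson(s, t) for j, t in enumerate(signatures) if j != i]
        raw.append(max(sum(others) / len(others), min_weight))
    total = sum(raw)
    weights = [w / total for w in raw]
    n_genes = len(signatures[0])
    return [sum(w * s[g] for w, s in zip(weights, signatures))
            for g in range(n_genes)]


# Toy signatures: three concordant cell lines and one discordant one.
cell_a = [1.0, 2.0, 3.0, 4.0]
cell_b = [1.1, 2.1, 2.9, 4.2]
cell_c = [0.9, 2.0, 3.2, 3.9]
outlier = [4.0, 3.0, 2.0, 1.0]
consensus = aggregate_signatures([cell_a, cell_b, cell_c, outlier])
```

With these toy inputs the discordant signature receives only the floor weight, so the consensus follows the three concordant cell lines.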

3. Regarding the dataset, was the independent dataset randomly selected and removed from the training set? Was there an attempt at balancing the dataset and seeing how that would affect performance?

-> We appreciate the reviewer's valuable comments.

To answer the reviewer's question, random selection is used in the classifier training phase to obtain negative samples, but not in the generation of the independent datasets. In the independent datasets, all unknown drug-target interaction pairs were assigned as negative samples. We therefore interpreted the reviewer's comment as raising three points: 1) whether to balance at the training stage, 2) the risk of data leakage in the independent datasets, and 3) whether balancing the independent datasets is possible.

Data balancing is a pivotal factor that significantly affects prediction performance during classifier training. A previous study showed that balancing positive and negative samples is an efficient way to improve the performance of classifiers in predicting DTIs [Sawada et al., 2018]. Similarly, we found superior performance when the ratio of positive to negative samples was set to 1:1 than when the ratio was 1:10 (results not shown).
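
The 1:1 balancing described here can be sketched as random sampling of unlabeled pairs as presumed negatives. This is an illustration under assumptions, not the authors' pipeline: the function name, the `ratio` and `seed` parameters, and the drug and target identifiers are all hypothetical.

```python
import random


def sample_negatives(positives, drugs, targets, ratio=1, seed=0):
    """Draw up to `ratio` unlabeled (drug, target) pairs per positive pair as presumed negatives."""
    positive_set = set(positives)
    candidates = [(d, t) for d in drugs for t in targets
                  if (d, t) not in positive_set]
    n_needed = min(len(candidates), ratio * len(positive_set))
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    return rng.sample(candidates, n_needed)


# Hypothetical identifiers, for illustration only.
drugs = ["d1", "d2", "d3", "d4"]
targets = ["t1", "t2", "t3"]
positives = [("d1", "t1"), ("d2", "t2"), ("d3", "t3")]

negatives = sample_negatives(positives, drugs, targets, ratio=1)
```

Setting `ratio=10` would correspond to the 1:10 configuration, provided enough unlabeled pairs exist.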

The reviewer mentioned the risk of data leakage in the independent datasets. Confirming that an independent dataset consists of unseen samples is important for evaluating the generalization ability of a model. To this end, we compiled DTIs from other sources, such as DrugBank and LIT-PCBA, into independent datasets. We removed the DTIs that overlapped with the training dataset so that the independent datasets consisted only of unseen samples.

Finally, the reviewer commented on balancing the independent datasets. Random selection of negative samples is also possible in independent datasets, but the usefulness of this method should be reconsidered given the actual drug development environment. DTI prediction is an imbalanced problem, with significantly fewer positive samples than negative samples. If an independent dataset is processed into a balanced dataset, there is a greater risk that performance on that dataset will not match performance in actual applications. Therefore, evaluating on independent data without balancing may be a more realistic way to assess performance in practical drug discovery.

# Reference

- Sawada, Ryusuke, et al. "Predicting inhibitory and activatory drug targets by chemically and genetically perturbed transcriptome signatures." Scientific reports 8.1 (2018): 1-9.
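
The argument about evaluating on imbalanced independent data can be illustrated with simple arithmetic. The sketch below assumes a classifier with a fixed true-positive rate and false-positive rate; the rates and sample counts are hypothetical and are not results from the study.

```python
def expected_precision(tpr, fpr, n_pos, n_neg):
    """Expected precision given true/false-positive rates and class sizes."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    return tp / (tp + fp) if (tp + fp) else 0.0


# Same hypothetical classifier, evaluated at two class ratios.
balanced = expected_precision(tpr=0.8, fpr=0.05, n_pos=100, n_neg=100)
imbalanced = expected_precision(tpr=0.8, fpr=0.05, n_pos=100, n_neg=10_000)
```

Under these assumed rates, precision falls from about 0.94 at a 1:1 ratio to about 0.14 at 1:100, which is why a balanced test set can overstate performance in screening settings.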

Attachment

Submitted filename: 220912_re_revision_response to reviewer.docx

Decision Letter 2

Jinn-Moon Yang

21 Oct 2022

PONE-D-22-10335R2

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Dec 05 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jinn-Moon Yang

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

Reviewer #3: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: (No Response)

Reviewer #2: The reviewer appreciates the efforts put forth by Lee et al. However, there remain several concerns regarding the submitted manuscript. These concerns were mentioned previously, but were not sufficiently addressed.

1. The authors concatenated gene expression with chemical information, with the assumption that all compounds would produce the same gene expression level (same off-target hits, protein-ligand binding mechanism, etc). In the example given previously, Gefitinib, Erlotinib, and Lapatinib are EGFR inhibitors, but as reported previously, can produce different gene expression profiles in cell lines (Wei et al., 2020). As such, combining the compound fingerprints with knock-down/overexpression of genes, with the assumption that they have the same expression profile, as input features for establishing a model appears unreasonable.

Wei, Nan, et al. “transcriptome profiling of acquired gefitinib resistant lung cancer cells reveals dramatically changed transcription programs and new treatment targets.” Frontiers in Oncology (2020):

2. The reviewer appreciates the authors’ explanation of reducing prediction noise in the submitted manuscript. Unfortunately, it does not assuage issues with response bias. Again, aggregation of information across cell lines could potentially produce transcriptional signatures that are no longer relevant to a given disease. Because the authors’ goal is to identify a drug’s target interactions, this continues to be problematic.

The work presented by the authors, aggregating and then giving weight to responses across different cell lines, would require additional analysis (such as studying an example drug and its associated pathways) to prove the effectiveness of the model.

Reviewer #3: The revision is satisfactory, the reviewer's concerns were addressed, and it is therefore acceptable in its current form.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

PLoS One. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042.r007

Author response to Decision Letter 2


3 Dec 2022

1. The authors concatenated gene expression with chemical information, with the assumption that all compounds would produce the same gene expression level (same off-target hits, protein-ligand binding mechanism, etc). In the example given previously, Gefitinib, Erlotinib, and Lapatinib are EGFR inhibitors, but as reported previously, can produce different gene expression profiles in cell lines (Wei et al., 2020). As such, combining the compound fingerprints with knock-down/overexpression of genes, with the assumption that they have the same expression profile, as input features for establishing a model appears unreasonable.

Wei, Nan, et al. “transcriptome profiling of acquired gefitinib resistant lung cancer cells reveals dramatically changed transcription programs and new treatment targets.” Frontiers in Oncology (2020):

-> Thank you for raising this potential practical issue. To address the reviewer’s concern, we conducted a case study focusing on the EGFR inhibitors suggested by the reviewer.

Although gefitinib, erlotinib, and lapatinib are all EGFR inhibitors, each drug has different target information (Figure A; the figures are included in the uploaded response-to-reviewers file). For example, lapatinib has additional targets such as HER2 and eEK-2K, so its signature can be expected to differ from signatures measured after EGFR silencing. In contrast, gefitinib and erlotinib have few or no other targets, so their signatures can be expected to be relatively similar to signatures measured after EGFR silencing.

To evaluate these expectations, we systematically measured the correlation between the signatures of the EGFR inhibitors and of EGFR silencing using CMap data (GEO number: GSE92742), which contains 205,034 drug-induced signatures. We found that the correlations for signatures measured after administration of erlotinib and gefitinib were higher than those for lapatinib. Interestingly, the signatures of gefitinib and erlotinib calculated by applying the aggregation method showed correlation values of 0.538 and 0.68, respectively (Figure B).

To check the significance of the observed result, we calculated the relative ranking of each correlation value with the signature measured after EGFR silencing. Relative ranks were calculated by comparing the drug's correlation value with those of all drug-perturbed signatures (n=205,034) in the GSE dataset. The relative rankings of the aggregated signatures of gefitinib and erlotinib were 0.995 and 1, respectively, expressed as percentile ranks, which supports the similarity between genetically perturbed signatures and drug-induced signatures (Figure C).

On the other hand, we found that the correlation value and relative ranking of lapatinib with data aggregation were only -0.13 and 0.21, respectively. This indicates that drugs with multiple targets are not suitable for embedding information about a single target. Rather, the genetically perturbed transcriptome, which is adjusted for off-target bias, would be more suitable for embedding information about a specific target. The discrepancies in the expression profiles of EGFR inhibitors in the paper cited by the reviewer highlight the multi-target nature of the drugs and support the view that drug-induced transcriptomes are not an optimal basis for target embedding.
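
The relative ranking used in this case study can be sketched as the fraction of background correlation values that a query value meets or exceeds. The authors' exact definition, including how ties are handled, is not stated in this response, so the function below is an assumption, and the background values are toy numbers rather than data from GSE92742.

```python
def relative_rank(value, background):
    """Fraction of background values that are less than or equal to `value`."""
    if not background:
        raise ValueError("background must not be empty")
    return sum(1 for b in background if b <= value) / len(background)


# Toy background of correlations with a knockdown signature (hypothetical).
background = [-0.4, -0.1, 0.0, 0.05, 0.2, 0.3, 0.5, 0.6]

rank_high = relative_rank(0.55, background)   # query exceeds most of the background
rank_low = relative_rank(-0.13, background)   # query falls near the bottom
```

A rank close to 1 means the drug signature resembles the knockdown signature more than nearly all background signatures do, while a rank near 0 indicates little or opposite similarity.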

2. The reviewer appreciates the authors’ explanation of reducing prediction noise in the submitted manuscript. Unfortunately, it does not assuage issues with response bias. Again, aggregation of information across cell lines could potentially produce transcriptional signatures that are no longer relevant to a given disease. Because the authors’ goal is to identify a drug’s target interactions, this continues to be problematic.

The work presented by the authors, aggregating and then giving weight to responses across different cell lines, would require additional analysis (such as studying an example drug and its associated pathways) to prove the effectiveness of the model.

-> As shown in the case study above, the aggregation method is an efficient way to obtain generalized vector embeddings. We also found that the aggregation method can broaden the number of predictable targets. We counted the number of targets for which the genetically perturbed transcriptome was measured in each cell line. The result showed that the number of targets that can be additionally predicted through the aggregation method increases by at least 80 and 11 (approximately 15%) for activatory and inhibitory targets, respectively. To reflect this finding, we revised the description of the aggregation method in lines 131-133 of the manuscript.

Attachment

Submitted filename: 221128_R3_revision_response.docx

Decision Letter 3

Jinn-Moon Yang

14 Dec 2022

PONE-D-22-10335R3

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

PLOS ONE

Dear Dr. Kim,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process. Authors should provide new results and evidence to address the reviewer's comments.

Please submit your revised manuscript by Jan 28 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jinn-Moon Yang

Academic Editor

PLOS ONE

Additional Editor Comments:

Authors should provide new results and evidence to address the reviewer's comments.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: The reviewer appreciates the efforts put forth by Lee et al. However, major concerns remain regarding their submitted manuscript.

1. Chiefly, the authors concatenated gene expression with chemical information, with the assumption that all compounds would produce the same expression level. Concatenating gene expression with chemical information would present inaccurate results. This is due to compounds producing different gene expression levels. The authors give kinase inhibitors as examples. While Gefitinib and Erlotinib are selective EGFR inhibitors, both inhibitors also exhibit a number of off-targets, with large-scale screening available on the Guide to Pharmacology website.

As the authors mentioned, their results indicate that “drugs with multi targets are not suitable for embedding information of a single target.” In their example, the selectivity profiles show that Gefitinib and Erlotinib have different off-target inhibitory patterns throughout the human kinome. As a result, it would not seem reasonable to concatenate knock-down/overexpression of genes with chemical information.

2. We appreciate the effort put forth by the authors. However, there are concerns that continue not to be adequately addressed. Again, aggregation of information across cell lines could potentially produce transcriptional signatures that are no longer relevant to a given disease. Concatenation of gene expression with chemical information would not assuage these concerns. Because the authors’ goal is to identify drug target interactions, this continues to be problematic.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********

PLoS One. 2023 Apr 12;18(4):e0282042. doi: 10.1371/journal.pone.0282042.r009

Author response to Decision Letter 3


14 Jan 2023

The reviewer appreciates the efforts put forth by Lee et al. However, major concerns remain regarding their submitted manuscript.

1. Chiefly, the authors concatenated gene expression with chemical information, with the assumption that all compounds would produce the same expression level. Concatenating gene expression with chemical information would present inaccurate results. This is due to compounds producing different gene expression levels. The authors give kinase inhibitors as examples. While Gefitinib and Erlotinib are selective EGFR inhibitors, both inhibitors also exhibit a number of off-targets, with large-scale screening available on the Guide to Pharmacology website.

As the authors mentioned, their results indicate that “drugs with multi targets are not suitable for embedding information of a single target.” In their example, the selectivity profiles show that Gefitinib and Erlotinib have different off-target inhibitory patterns throughout the human kinome. As a result, it would not seem reasonable to concatenate knock-down/overexpression of genes with chemical information.

-> Thank you for the valuable feedback provided by the reviewer. We appreciate the effort put forth in reviewing our manuscript.

Regarding the concern about concatenating gene expression with chemical information, we understand the reviewer's perspective that different compounds may produce different gene expression levels. However, our approach uses an aggregation method that aims to control for non-biological noise in transcripts, inspired by the method of Subramanian et al. Additionally, rather than using raw biological data directly, our approach uses a representative vector for each entity, which is an essential step in a machine learning approach. Studies such as Fernández-Torras et al. have shown that this kind of embedding for drugs can effectively characterize the drug itself and other drugs, with performance similar to that of the drug-treatment transcriptome.

Furthermore, the reviewer's concern that concatenating gene expression with chemical information would present inaccurate results is not supported by current research in the field. Methods for calculating embeddings for targets and compounds using gene expression and chemical fingerprinting are already widely used in the prediction of drug-target interactions, as shown in the studies by Lim et al. and Bagherian et al. Our study has already demonstrated that combining these methods can accurately predict activatory and inhibitory drug-target interactions for most compounds. Therefore, we believe that this point should be understood as a contribution to the novelty of our study, not an issue of accuracy.

Regarding the concern about off-target effects, our approach takes these limitations into account by using a genetically perturbed transcriptome as the embedding for the target. This approach aims to ensure that the results are specific to the target, rather than being influenced by off-target effects. Our experiments on EGFR inhibition in the previous round showed that this method characterizes specific target information more efficiently than the drug-treated transcriptome does. We understand the reviewer's concern, but we believe that our approach has been validated by the experiments we performed and that the concerns raised by the reviewer were already addressed in previous rounds of review.

# References

Subramanian, Aravind, et al. "A next generation connectivity map: L1000 platform and the first 1,000,000 profiles." Cell 171.6 (2017): 1437-1452.

Fernández-Torras, Adrià, et al. "Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque." Nature Communications 13.1 (2022): 1-18.

Lim, Sangsoo, et al. "A review on compound-protein interaction prediction methods: data, format, representation and model." Computational and Structural Biotechnology Journal 19 (2021): 1541-1556.

Bagherian, Maryam, et al. "Machine learning approaches and databases for prediction of drug–target interaction: a survey paper." Briefings in bioinformatics 22.1 (2021): 247-269.

2. We appreciate the effort put forth by the authors. However, there are concerns that continue not to be adequately addressed. Again, aggregation of information across cell lines could potentially produce transcriptional signatures that are no longer relevant to a given disease. Concatenation of gene expression with chemical information would not assuage these concerns. Because the authors’ goal is to identify drug target interactions, this continues to be problematic.

-> We appreciate the feedback provided by the reviewer and understand their concerns.

However, we would like to clarify that the transcriptional signatures used in this study, specifically the genetically perturbed transcriptome, are not associated with any specific disease. The purpose of our study is to develop a machine-learning model that can predict drug-target interactions, and the aggregation method is used to compute representative vector embeddings for more diverse targets. Although our model is not specialized for any specific disease, we would like to highlight that our model has been successfully applied to various diseases, as demonstrated in case studies of COVID-19 and in our previous study on non-alcoholic fatty liver disease [Lee et al.].

Furthermore, we believe it is important to avoid raising potential issues that did not appear in our study; accordingly, we have taken care not to present concerns that the paper does not address. We remain open to any further revisions and adjustments deemed necessary to ensure that our study meets the standards of academic research.

# Reference

Lee, Won-Yung, et al. "Identifying candidate flavonoids for non-alcoholic fatty liver disease by network-based strategy." Frontiers in Pharmacology (2022): 1718.

Attachment

Submitted filename: 230113_R4_comment response.docx

Decision Letter 4

Jinn-Moon Yang

7 Feb 2023

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

PONE-D-22-10335R4

Dear Dr. Kim,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Jinn-Moon Yang

Academic Editor

PLOS ONE

Acceptance letter

Jinn-Moon Yang

3 Apr 2023

PONE-D-22-10335R4

Predicting Activatory and Inhibitory Drug–target Interactions based on Structural Compound Representations and Genetically Perturbed Transcriptomes

Dear Dr. Kim:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Jinn-Moon Yang

Academic Editor

PLOS ONE

Associated Data

    Supplementary Materials

    S1 Table. Distribution of targets whose genetically perturbed transcriptome was measured by cell line.

    (XLSX)

    S2 Table. Search range and selected hyperparameter values for the cascade deep forest models.

    (XLSX)

    S3 Table. Performance comparison between joint learning and AI-DTI.

    (XLSX)

    S4 Table. Candidate FDA-approved drugs for COVID-19-related activatory targets.

    (XLSX)

    S5 Table. Candidate FDA-approved drugs for COVID-19-related inhibitory targets.

    (XLSX)

    S1 Fig. The architecture and hyperparameters of the MLP models.

    (TIF)

    S2 Fig. Distribution of Spearman correlation coefficients between the inferred data and genetically perturbed transcriptome for the same gene (Within pair) and the other genes (Between pair).

    (TIF)

    S3 Fig. Predictive performance of AI-DTI on DTIs between FDA-approved drugs and COVID-19-related targets.

    (TIF)

    Attachment

    Submitted filename: renamed_e7942.docx

    Attachment

    Submitted filename: 220802_revision_comments.docx

    Attachment

    Submitted filename: 220912_re_revision_response to reviewer.docx

    Attachment

    Submitted filename: 221128_R3_revision_response.docx

    Attachment

    Submitted filename: 230113_R4_comment response.docx

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files or at: https://bitbucket.org/NNSM/ai_dti.


    Articles from PLOS ONE are provided here courtesy of PLOS
