Scientific Reports. 2025 Jun 3;15:19495. doi: 10.1038/s41598-025-03932-6

Predicting drug-target interactions using machine learning with improved data balancing and feature engineering

Md Alamin Talukder 1, Mohsin Kazi 2, Ammar Alazab 3,4
PMCID: PMC12134243  PMID: 40461636

Abstract

Drug-Target Interaction (DTI) prediction is a vital task in drug discovery, yet it faces significant challenges such as data imbalance and the complexity of biochemical representations. This study makes several contributions to address these issues, introducing a novel hybrid framework that combines advanced machine learning (ML) and deep learning (DL) techniques. The framework leverages comprehensive feature engineering, utilizing MACCS keys to extract structural drug features and amino acid/dipeptide compositions to represent target biomolecular properties. This dual feature extraction method enables a deeper understanding of chemical and biological interactions, enhancing predictive accuracy. To address data imbalance, Generative Adversarial Networks (GANs) are employed to create synthetic data for the minority class, effectively reducing false negatives and improving the sensitivity of the predictive model. The Random Forest Classifier (RFC) is utilized to make precise DTI predictions, optimized for handling high-dimensional data. The proposed framework’s scalability and robustness were validated across diverse datasets, including BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50. For the BindingDB-Kd dataset, the GAN+RFC model achieved remarkable performance metrics: accuracy of 97.46%, precision of 97.49%, sensitivity of 97.46%, specificity of 98.82%, F1-score of 97.46%, and ROC-AUC of 99.42%. Similarly, for the BindingDB-Ki dataset, the model attained an accuracy of 91.69%, precision of 91.74%, sensitivity of 91.69%, specificity of 93.40%, F1-score of 91.69%, and ROC-AUC of 97.32%. On the BindingDB-IC50 dataset, the model achieved an accuracy of 95.40%, precision of 95.41%, sensitivity of 95.40%, specificity of 96.42%, F1-score of 95.39%, and ROC-AUC of 98.97%. These results demonstrate the efficacy of the GAN-based approach in capturing complex patterns, significantly improving DTI prediction outcomes. In conclusion, the proposed GAN-based hybrid framework sets a new benchmark in computational drug discovery by addressing critical challenges in DTI prediction. Its robust performance, scalability, and generalizability contribute substantially to therapeutic development and pharmaceutical research.

Keywords: Drug-Target interaction, Generative adversarial networks, Machine learning, Random forest classifier, Data imbalance, Computational drug discovery

Subject terms: Drug discovery, Computer science

Introduction

The discovery of new drugs is a crucial and time-consuming process in modern medicine, with pharmaceutical companies continuously striving to identify effective treatments for various diseases1,2. According to recent statistics, the global pharmaceutical market is expected to reach a value of $1.5 trillion by 2025, driven by an increasing demand for new and innovative therapies3. However, despite these advancements, the process of drug development is still plagued by high costs and long timelines, with many potential drug candidates failing during the clinical trial phases. A key component of successful drug discovery is understanding the interactions between drugs and their target proteins, which can significantly influence the efficacy and safety of therapeutic agents4.

Drug-Target Interaction (DTI) prediction is a critical aspect of drug discovery, as it helps in identifying potential drug candidates that can interact with target proteins to exert their desired therapeutic effects5. Recent statistics show that approximately 60-70% of drug candidates fail due to poor efficacy or adverse effects, highlighting the importance of accurate DTI prediction6. Traditional experimental methods for predicting DTIs are costly, time-consuming, and labor-intensive. These challenges, along with the sheer complexity of biochemical systems, have led researchers to explore computational approaches that can predict DTIs more efficiently. However, the complexity of these systems, including the vast diversity of drug molecules and target proteins, still poses significant challenges for computational models.

Recent advances in Artificial Intelligence (AI) and Machine Learning (ML)7 models have shown great promise in improving the accuracy and efficiency of DTI predictions8. Traditional DTI prediction models, such as those based on molecular docking or ligand-based methods, often struggle to handle the high-dimensional and noisy data inherent in biological systems. In contrast, ML models have demonstrated superior performance by learning complex patterns from large datasets. These methods can address issues such as data imbalance, which is a common problem in DTI datasets where the number of non-interacting pairs far outweighs the interacting ones. With these advancements, ML-based models offer a scalable and robust solution for accelerating drug discovery, enabling researchers to make more informed decisions in the early stages of drug development.

Wei et al.9 proposed DeepLPI, a deep learning (DL)-based model combining a ResNet-based 1D CNN and a bi-directional LSTM (biLSTM) to predict protein-ligand interactions. Raw drug molecular and target protein sequences were encoded into dense vector representations and processed through two ResNet-based 1D CNN modules to extract features. These features were concatenated and passed through the biLSTM network, followed by an MLP module for final prediction. The model was trained and tested on the BindingDB and Davis datasets. In the BindingDB dataset, DeepLPI achieved an AUC-ROC of 0.893, sensitivity of 0.831, and specificity of 0.792 on the training set, and an AUC-ROC of 0.790, sensitivity of 0.684, and specificity of 0.773 on the test set. When compared to baseline methods like DeepCDA and DeepDTA, DeepLPI outperformed them, demonstrating high accuracy and robust generalization capability. These results indicate that DeepLPI has the potential to identify new drug-target interactions and improve drug discovery. Pei et al.10 proposed a label aggregation method with pair-wise retrieval and a representation aggregation method with point-wise retrieval of nearest neighbors, which efficiently boosted DTA prediction performance during the inference phase without any training cost. Additionally, an extension called Ada-kNN-DTA was introduced, featuring instance-wise and adaptive aggregation with lightweight learning. Results from four benchmark datasets demonstrated that kNN-DTA significantly outperformed previous state-of-the-art (SOTA) methods. On the BindingDB IC50 and Ki testbeds, kNN-DTA achieved new records of RMSE 0.684 and 0.750. The Ada-kNN-DTA extension further improved performance, reaching RMSE values of 0.675 and 0.735. These results confirmed the effectiveness of the proposed methods. Further analyses and results across different settings highlighted the great potential of the kNN-DTA approach, establishing it as a powerful tool for enhancing DTA prediction with minimal computational cost. Zhu et al.11 developed MDCT-DTA, a novel model for drug-target affinity (DTA) prediction, which combined multi-scale diffusion and interactive learning. To overcome the limitations of existing approaches, the model incorporated a multi-scale graph diffusion convolution (MGDC) module to effectively capture intricate interactions among drug molecular graph nodes. A CNN-Transformer Network (CTN) block was also introduced to model the interdependencies between amino acids, enhancing the model’s representation and learning capabilities. Additionally, a local inter-layer information interaction structure was designed to analyze relationships between drug and protein features, improving the robustness and representativeness of structural features. The model’s effectiveness was evaluated on BindingDB benchmark dataset. Experimental results showed that MDCT-DTA accurately predicted drug-target binding affinities, achieving an MSE of 0.475 on the BindingDB dataset. These outcomes highlighted the model’s potential, offering new insights into DTA prediction and advancing the development of robust predictive frameworks. Schuh et al.12 developed BarlowDTI, a novel approach for DTI prediction that utilized the Barlow Twins architecture for feature extraction, focusing on the structural properties of target proteins. The model achieved state-of-the-art performance across the BindingDB-kd benchmark using one-dimensional input data, with a ROC-AUC score of 0.9364. 
By employing a gradient boosting machine as the predictor, it ensured fast, resource-efficient predictions. Analysis of co-crystal structures demonstrated that BarlowDTI effectively identified catalytically active and stabilizing residues, showcasing its generalization capabilities. This innovation enhanced DTI prediction efficiency and accuracy, contributing to drug discovery advancements and a deeper understanding of molecular interactions. Guichaoua et al.13 introduced a novel approach to address two key challenges in DTI prediction: the need for large, high-quality datasets and scalable prediction methods. They developed LCIdb, a curated, extensive DTI dataset with enhanced molecule and protein space coverage, surpassing traditional benchmarks. Additionally, they proposed Komet (Kronecker Optimized METhod), a scalable prediction pipeline using a three-step framework with efficient computations and the Nyström approximation. Komet’s Kronecker interaction module effectively balances expressiveness and computational complexity. Implemented as open-source software with GPU parallelization, Komet achieved superior scalability and performance, with a ROC-AUC of 0.70 on BindingDB, outperforming existing DL methods. Pei et al.14 proposed a novel framework that incorporated three simple yet effective strategies to enhance Drug-Target Affinity (DTA) prediction. First, a multi-task training approach was employed, which jointly learned DTA prediction and masked language modeling (MLM) on paired drug-target datasets. Second, a semi-supervised learning technique was introduced, utilizing large-scale unpaired molecular and protein data to improve representation learning, unlike traditional pre-training methods that only considered either molecules or proteins in isolation. Third, a cross-attention module was integrated to strengthen the interaction between drug and target representations. Extensive experiments conducted on real-world benchmark datasets, such as BindingDB-IC50, demonstrated that the proposed framework significantly outperformed existing models. It achieved state-of-the-art results, including an RMSE of 0.712 on the BindingDB-IC50 dataset, reflecting over a 5% improvement compared to previous best-performing methods.

Problem statements

Despite significant advancements in DTI prediction, several critical challenges remain:

  1. Integration of Chemical and Biological Information: Current methodologies struggle to effectively combine chemical fingerprint representations of drugs and biomolecular features of targets, limiting their capacity to capture complex biochemical and structural relationships necessary for accurate DTI prediction.

  2. Data Imbalance in Experimental Datasets: Imbalanced datasets, where the minority class of positive drug-target interactions is underrepresented, lead to biased models that exhibit reduced sensitivity and higher rates of false negatives in prediction tasks.

  3. Limitations of Traditional Drug Discovery Approaches: Conventional methods for drug discovery are time-consuming, expensive, and lack scalability, making them unsuitable for addressing the increasing complexity and speed required in modern drug development.

  4. Threshold Optimization for Drug-Target Interaction: Existing methods lack systematic evaluation for selecting optimal thresholds to classify drug-target interactions, leading to inaccuracies. This research addresses this gap through experimental analysis of candidate thresholds to reliably balance the classes and improve drug-target interaction prediction.

These gaps highlight the necessity for innovative computational frameworks that address data integration, imbalance, and scalability to accelerate drug discovery and enhance predictive accuracy.

Objectives

The primary objective of this research is to design a hybrid framework that integrates advanced feature engineering, data balancing, and ML techniques for accurate DTI prediction. The study aims to unify drug fingerprints and target compositions into a single feature representation to enhance predictive performance. It also addresses data imbalance by employing Generative Adversarial Networks (GANs) to improve sensitivity and reduce false negatives. Furthermore, the framework is designed to be scalable and robust, ensuring its applicability across diverse datasets and drug discovery scenarios. By achieving these objectives, the research aims to contribute significantly to the development of robust computational tools for accelerating drug discovery processes.

Contributions

This study makes the following significant contributions to the field of DTI prediction:

  1. Develop a Hybrid Model for DTI Prediction: Design and implement a novel hybrid framework that integrates ML and DL techniques to enhance the accuracy and reliability of DTI predictions.

  2. Express Chemical and Biological Information: Utilize advanced techniques to extract structural features of drugs through MACCS keys and biomolecular features of targets via amino acid and dipeptide compositions. This approach captures both chemical and biological intricacies, enabling a comprehensive understanding of drug-target interactions and significantly enhancing predictive accuracy in computational drug discovery.

  3. Address Data Imbalance Challenges: Employ Generative Adversarial Networks (GANs) to generate synthetic data for the minority class, mitigating the adverse effects of data imbalance and improving the sensitivity of the prediction model.

  4. Demonstrate Scalability and Robustness: Validate the scalability of the proposed framework across datasets of different sizes and distributions to ensure its applicability in diverse drug discovery scenarios.

These contributions collectively advance the field of computational drug discovery, providing a scalable, accurate, and robust solution for predicting DTIs and accelerating drug development processes.

Research questions

  • RQ1: How can the proposed ML model improve the prediction accuracy of drug-target interactions compared to traditional ML/DL models?

  • RQ2: What are the benefits of the GAN data balancing technique for DTI prediction?

  • RQ3: Can the proposed method enhance predictive performance and scalability in drug discovery?

Hypothesis

The proposed hybrid framework, integrating machine learning (ML) and deep learning (DL) techniques, aims to significantly improve drug-target interaction (DTI) prediction by addressing limitations in data representation, imbalance, and scalability. By unifying drug fingerprints (MACCS keys) and biomolecular compositions (amino acid and dipeptide sequences), the model enhances feature representation, leading to higher predictive accuracy. Additionally, the incorporation of Generative Adversarial Networks (GANs) to generate synthetic data for the minority class effectively mitigates data imbalance, improving sensitivity and reducing false negatives in predictions. The framework is designed for scalability and robustness, ensuring adaptability across datasets of varying sizes and distributions, making it highly applicable to diverse drug discovery scenarios. By systematically validating these aspects, the proposed approach is expected to outperform traditional ML/DL models in predictive performance, contributing to the development of scalable, data-driven drug discovery solutions.

The remaining sections of this paper are structured as follows: Sect. “Materials and Method” provides a detailed explanation of our research materials and methods used in our experiment. The results, including the environment, performance metrics, and evaluation, are outlined in Sect. “Performance Analysis”. Section “Discussion” presents the comparison analysis with existing works and research questions and hypothesis validation. Lastly, Sect. “Conclusion” presents the conclusion and future work.

Materials and method

This study proposes a novel methodology that integrates robust feature extraction, data balancing using GANs, and advanced ML and DL models to improve the accuracy and scalability of DTI prediction. The methodology (Fig. 1) begins with dataset preparation using the BindingDB datasets obtained from the TDC library. Binding affinities are binarized with thresholds of 10, 20, and 30 nM, categorizing interactions into positive (binding) and negative (non-binding). Drug compounds are represented using MACCS fingerprints generated via the RDKit library, while protein targets are encoded through amino acid composition (AAC) and dipeptide composition (DC), normalized by sequence length. The MACCS keys enhance feature extraction by capturing intricate substructural patterns within the drug molecules. Invalid rows resulting from failed fingerprint generation or sequence analysis are excluded to ensure data integrity. The resulting feature matrix is standardized using StandardScaler to facilitate model training. To address class imbalance, a Generative Adversarial Network (GAN) is employed. The GAN architecture includes a generator, which synthesizes realistic feature vectors for the minority class, and a discriminator, which differentiates between real and synthetic data. This iterative process generates high-quality synthetic samples, effectively balancing the dataset. The combined real and synthetic dataset is split into training (80%) and testing (20%) subsets using train_test_split. The study evaluates several ML and DL models. Traditional ML models, including the Decision Tree Classifier (DTC), Random Forest Classifier (RFC), and Multilayer Perceptron (MLP), are compared with advanced DL models such as a Fully Connected Neural Network (FCNN), a Multi-Head Attention-integrated FCNN (MHA-FCNN), DeepLPI, BarlowDTI, and Komet. Model performance is assessed using metrics such as accuracy, precision, recall, F1-score, and ROC-AUC. Results demonstrate that the GAN+RFC combination outperforms traditional models, achieving superior predictive accuracy and scalability. This comprehensive methodology highlights the importance of robust feature engineering and data balancing in advancing computational drug discovery.

Fig. 1. The proposed drug-target interaction prediction architecture.

Dataset collection

In this study, we utilized three datasets from the BindingDB resource: BindingDB_Kd, BindingDB_Ki, and BindingDB_IC50. These datasets are critical for predicting drug-target interactions (DTIs), specifically focusing on the binding affinities between drug-like small molecules and protein targets. BindingDB15 is a public, web-accessible database that contains experimentally measured binding affinities, primarily for interactions between proteins and drug-like small molecules. It serves as a valuable resource for computational drug discovery, facilitating the analysis and prediction of DTIs. Each record in the database includes detailed information on the target proteins, the interacting compounds, and their respective binding affinities. The prediction task for these datasets is framed as a regression problem, which we converted into a binary classification task by applying thresholds of 10, 20, and 30 nM.

Given the amino acid sequence of the target protein and the SMILES (Simplified Molecular Input Line Entry System) string representation of the compound, the goal is to predict the binding affinity between the two entities. Binding affinity is measured in terms of $K_d$ (dissociation constant), $K_i$ (inhibition constant), or $IC_{50}$ (half-maximal inhibitory concentration), which quantifies the strength of the interaction.

Dataset statistics:

The statistics of the datasets (Table 1) used in our experiments are as follows:

  • BindingDB_Kd: Contains 52,284 DTI pairs, 10,665 unique drugs, and 1,413 unique proteins.

  • BindingDB_Ki: Contains 375,032 DTI pairs, 174,662 unique drugs, and 3,070 unique proteins.

  • BindingDB_IC50: Contains 991,486 DTI pairs, 549,205 unique drugs, and 5,078 unique proteins.

Table 1. Statistics of BindingDB datasets.

Dataset DTI Pairs Unique drugs Unique proteins
BindingDB_Kd 52,284 10,665 1,413
BindingDB_Ki 375,032 174,662 3,070
BindingDB_IC50 991,486 549,205 5,078

Relevance to the study:

These datasets provide a diverse and comprehensive set of interactions, making them ideal for training and evaluating ML models aimed at predicting DTIs. The large number of data points in BindingDB_Ki enables robust training of models, while the smaller BindingDB_Kd dataset offers an opportunity to evaluate model performance on datasets with more limited samples. By using these datasets, our study aims to enhance the predictive accuracy of binding affinity while addressing challenges such as data imbalance and feature representation.

Data preprocessing

To ensure consistency and enhance the quality of the data for model training, several preprocessing steps were applied. These steps included affinity harmonization and binarization, which transformed the regression task into a classification task.

Affinity harmonization:

The binding affinities in the datasets often originate from diverse experimental conditions, leading to variations in their reported values. To address this inconsistency, we harmonized the affinities using the mean mode:

data.harmonize_affinities(mode='mean')

The data.harmonize_affinities function works as follows:

  • Input: A dataset containing binding affinity values under various experimental conditions.

  • Process: The function calculates the mean affinity for each set of measurements and adjusts all affinity values to align with this mean. This reduces variance caused by different conditions and produces a standardized dataset.

  • Output: A dataset with harmonized affinity values, suitable for use in predictive modeling.

By harmonizing the affinities, we minimize noise and ensure that the data is consistent across all samples.
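As an illustration of this step, the following minimal sketch assumes the Therapeutics Data Commons (TDC) DTI loader and that 'mean' is an accepted mode string; it is a hedged example rather than the exact code used in the study.

```python
from tdc.multi_pred import DTI

# Load one of the BindingDB datasets from the Therapeutics Data Commons (TDC).
data = DTI(name="BindingDB_Kd")

# Collapse repeated measurements of the same drug-target pair into their mean,
# matching the "mean mode" harmonization described above.
data.harmonize_affinities(mode="mean")
```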

Affinity binarization:

The binding affinity values were binarized to convert the continuous affinity data into binary interaction labels. This step is crucial for the drug-target interaction (DTI) task, where the goal is to distinguish between strong and weak interactions. A threshold ($T_h$) was applied to categorize the affinities:

  • Affinity values below $T_h$ (e.g., below 10 nM): Indicate a strong interaction (label = 1).

  • Affinity values equal to or above $T_h$: Indicate no significant interaction (label = 0).

The binarization was performed using the following code:

data.binarize(threshold=10, order='descending')

The data.binarize function is implemented as follows:

  • Input: A dataset of continuous binding affinity values and a threshold ($T_h$).

  • Parameters:
    • threshold: The value(s) used to determine interaction strength.
    • order: Specifies the relationship between values and interaction strength. Here, 'descending' indicates that lower affinity values correspond to stronger interactions.
  • Process:
    1. The function iterates through the dataset and compares each affinity value to the specified threshold.
    2. If the affinity value is below the threshold, it assigns a binary label of 1 (interaction).
    3. If the value is equal to or above the threshold, it assigns a binary label of 0 (no interaction).
  • Output: A dataset with binary interaction labels for each binding affinity value.

Specifically:

  • Thresholds: 10, 20, 30.

  • Binary Labels:
    • A value of 1 indicates an interaction (affinity $< T_h$).
    • A value of 0 indicates no interaction (affinity $\geq T_h$).

The threshold value (Th = 10) used in the classification task is expressed in nanomolar (nM), the standard concentration unit for binding affinity measures such as Kd, Ki, and IC50, which quantify the strength of interaction between a drug and its target. With the threshold set to 10 nM and the order specified as "descending," binding affinity values below 10 nM are classified as positive interactions (label = 1), indicating strong binding between molecules. Conversely, values equal to or above 10 nM are labeled as negative (label = 0), signifying no significant interaction. The positive-to-negative ratio in the dataset is determined by counting the number of data points classified as positive (strong interactions) and negative (no interaction) based on this threshold. The positive-to-negative ratio of the dataset for Th = 10, 20, and 30 is shown in Table 2. The table provides the count of data points classified as positive (1) and negative (0) for the BindingDB_Kd, BindingDB_Ki, and BindingDB_IC50 datasets, illustrating the distribution of interactions based on the threshold.
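A hedged sketch of the binarization and class counting described above, assuming the TDC binarize call updates the stored labels in place and that get_data returns a DataFrame whose label column is named 'Y':

```python
# Binarize the harmonized affinities at one of the studied thresholds.
# With order="descending", values below the threshold are labeled 1 (interaction).
data.binarize(threshold=10, order="descending")

df = data.get_data()              # assumed accessor returning drug, target and label columns
print(df["Y"].value_counts())     # positive-to-negative counts, as summarized in Table 2
```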

Table 2. Positive-to-negative interactions of DTI datasets.

Dataset Threshold Class 0 / Negative Class 1 / Positive
BindingDB_Kd 10 38,910 3,326
BindingDB_Kd 20 37,915 4,321
BindingDB_Kd 30 37,385 4,851
BindingDB_Ki 10 227,658 69,027
BindingDB_Ki 20 208,998 87,687
BindingDB_Ki 30 197,572 99,113
BindingDB_IC50 10 664,750 102,154
BindingDB_IC50 20 621,782 145,122
BindingDB_IC50 30 597,037 169,867

Significance of preprocessing:

These preprocessing steps are crucial to mitigating data inconsistencies and aligning the data format with the requirements of binary classification tasks. By harmonizing and binarizing the data, the preprocessing pipeline enhances the interpretability and predictive accuracy of the models, especially in tasks involving datasets with varying scales of binding affinity values.

Feature engineering

Feature engineering is a critical step in the development of ML models, transforming raw data into meaningful representations to enhance predictive performance. In this study, feature engineering was applied to extract molecular fingerprints for drugs and amino acid compositions for protein sequences, followed by data scaling and feature integration.

Drug fingerprint extraction:

For drug molecules represented by SMILES strings, we utilized MACCS (Molecular ACCess System) keys as descriptors. These are binary fingerprints capturing the presence or absence of specific substructures in the molecular structure. The RDKit library was employed to compute MACCS keys for each drug molecule:

fingerprint = MACCSkeys.GenMACCSKeys(Chem.MolFromSmiles(smiles))

To handle invalid or non-parsable SMILES strings, rows with missing fingerprints were dropped. Each MACCS key was then converted into a feature vector, representing the structural properties of the drug molecules.
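A minimal sketch of this step using RDKit; the helper name maccs_fingerprint is illustrative rather than taken from the original code:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

def maccs_fingerprint(smiles: str):
    """Return the 167-bit MACCS key vector for a SMILES string, or None if it cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # invalid or non-parsable SMILES -> row is dropped later
        return None
    fp = MACCSkeys.GenMACCSKeys(mol)     # RDKit ExplicitBitVect of length 167
    return np.array(list(fp), dtype=int)
```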

Protein sequence representation

Protein sequences are vital for understanding biological interactions, as their structure and composition play a significant role in determining biochemical behavior. To encode this information effectively, computational techniques are employed to extract meaningful representations. Two prominent methods, amino acid composition (AAC) and dipeptide composition (DC), are utilized to capture both individual amino acid frequencies and sequential dependencies within the sequences. These representations ensure that protein data is structured appropriately for machine learning models, aiding in predictive tasks such as drug-target interaction (DTI) modeling.

Amino acid composition:

Amino acid composition (AAC) quantifies the normalized frequency of each of the 20 standard amino acids in a protein sequence. This representation provides a high-level overview of the sequence’s biochemical properties. Mathematically, AAC for a given amino acid is computed using the formula:

$\mathrm{AAC}(a_i) = \dfrac{N(a_i)}{L}$, where $N(a_i)$ is the number of occurrences of amino acid $a_i$ in the sequence and $L$ is the total sequence length.

This method ensures uniform feature scaling and enables seamless integration into predictive models. The Python Counter class is employed to efficiently compute amino acid occurrences, followed by normalization with respect to the total sequence length, encoding the protein’s structural characteristics.
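A minimal AAC sketch using Python's Counter, as described above; the helper name aac_features is illustrative:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"     # the 20 standard amino acids

def aac_features(sequence: str) -> list:
    """Normalized frequency of each standard amino acid in a protein sequence."""
    counts = Counter(sequence)
    length = max(len(sequence), 1)       # guard against empty sequences
    return [counts.get(aa, 0) / length for aa in AMINO_ACIDS]
```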

Dipeptide composition:

To enrich protein sequence representation further, dipeptide composition (DC) is calculated. Dipeptides represent pairs of consecutive amino acids, capturing intricate sequential dependencies within the sequence. Using Python’s itertools.product, all possible dipeptides are generated from the 20 standard amino acids, yielding 400 unique combinations. The normalized frequency of each dipeptide is then computed using the Counter class, ensuring the representation accounts for sequence length variations. This detailed representation complements AAC by incorporating sequential patterns, enabling machine learning models to better understand the intricacies of protein structures and improving predictive accuracy for DTIs.
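A corresponding dipeptide-composition sketch; normalizing by the number of overlapping pairs is one reasonable reading of the length normalization described above, and the helper name dc_features is illustrative:

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]   # 400 combinations

def dc_features(sequence: str) -> list:
    """Normalized frequency of each of the 400 dipeptides (overlapping residue pairs)."""
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = max(len(sequence) - 1, 1)
    return [pairs.get(dp, 0) / total for dp in DIPEPTIDES]
```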

Feature integration:

The drug fingerprint features and amino acid composition features were concatenated to form a unified feature matrix $X$, ensuring each row represents a drug-target pair. The feature matrix $X$ was standardized using StandardScaler to ensure that all features were on the same scale:

$x' = \dfrac{x - \mu}{\sigma}$, where $\mu$ and $\sigma$ are the mean and standard deviation of each feature, respectively.

This preprocessing step was crucial to prevent features with larger scales from dominating the learning process during model training.
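A minimal sketch of the concatenation and scaling step, assuming the per-pair fingerprint and composition arrays are already available (variable and function names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def build_feature_matrix(drug_fps, target_compositions):
    """Concatenate drug fingerprints and target composition vectors row-wise,
    then z-score every column so no single feature dominates training."""
    X = np.hstack([np.asarray(drug_fps, dtype=float),
                   np.asarray(target_compositions, dtype=float)])
    return StandardScaler().fit_transform(X)
```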

Output feature matrix:

The resulting feature matrix $X$ has the following structure:

  • Drug Fingerprints: Binary features representing the MACCS keys for each drug.

  • Target Compositions: Normalized amino acid compositions for each protein sequence.

  • Concatenated Features: Combined representation of drug-target interactions.

The target variable $y$ consists of binary labels indicating the presence or absence of interactions.

Data validation and shape:

The shape of the feature matrix $X$ was verified to confirm successful feature extraction and integration, ensuring alignment with the dimensions of the target labels. Standardization further optimized the feature matrix for ML tasks.

SMILES (Simplified Molecular Input Line Entry System) strings provide a compact and widely used representation of chemical structures. To ensure a consistent and interpretable feature representation, we utilized MACCS (Molecular ACCess System) keys as descriptors for the drug molecules. These are binary fingerprints designed to capture the presence or absence of specific substructures in the molecular structure, making them effective for computational modeling. For target proteins, we computed Amino Acid Composition (AAC), which captures the normalized frequencies of the 20 standard amino acids in the sequence, providing a comprehensive overview of the protein’s composition. The drug fingerprint features and protein AAC features were concatenated to form a unified feature matrix, representing each drug-target pair. Our approach was validated using three benchmark datasets, BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50, where SMILES strings were used to represent drug sequences. This choice allows for scalability and compatibility with the datasets while leveraging the structural information effectively.

Data balancing using generative adversarial networks (GANs)

Class imbalance in datasets is a significant challenge for ML models, particularly in binary classification tasks. In the context of DTI prediction, the imbalance between interacting (minority) and non-interacting (majority) samples can bias the model toward the majority class, leading to suboptimal performance. To address this, we employ a Generative Adversarial Network (GAN) to synthesize realistic samples for the minority class. The GAN comprises two components: a generator and a discriminator. The generator learns to produce synthetic samples that resemble the minority class distribution, while the discriminator learns to differentiate between real and synthetic samples. Through adversarial training, where the generator aims to “fool” the discriminator, the generator progressively improves its ability to produce high-quality synthetic data. This process enhances the representation of the minority class in the dataset, addressing imbalances and reducing bias in model training.

The GAN-based approach (Algorithm 1) begins by training the discriminator with both real samples from the minority class and synthetic samples generated by the generator. Once the discriminator achieves adequate performance, the generator is trained to produce samples that the discriminator cannot distinguish from real data. After sufficient training epochs, the generator generates synthetic samples in quantities sufficient to balance the dataset. These synthetic samples are then combined with the original dataset, ensuring a balanced distribution of classes. This approach not only mitigates the limitations of under-sampling or over-sampling methods but also maintains the diversity of data, leading to improved model performance and robustness.

The following algorithm (Algorithm 1) outlines the steps for generating synthetic samples using a GAN to balance the dataset.

Algorithm 1. Data balancing using GANs.
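A minimal PyTorch sketch of the adversarial oversampling loop summarized in Algorithm 1; layer sizes, learning rate, and epoch count are illustrative assumptions, not the settings reported in this study.

```python
import numpy as np
import torch
import torch.nn as nn

def gan_oversample(X_minority, n_synthetic, latent_dim=32, epochs=200, batch_size=64, lr=2e-4):
    """Train a small GAN on minority-class feature vectors and sample synthetic rows."""
    X = torch.tensor(np.asarray(X_minority), dtype=torch.float32)
    n_features = X.shape[1]

    generator = nn.Sequential(
        nn.Linear(latent_dim, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, n_features),
    )
    discriminator = nn.Sequential(
        nn.Linear(n_features, 64), nn.ReLU(),
        nn.Linear(64, 32), nn.ReLU(),
        nn.Linear(32, 1), nn.Sigmoid(),
    )
    bce = nn.BCELoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)

    for _ in range(epochs):
        # Discriminator step: real minority rows vs. freshly generated rows.
        idx = torch.randint(0, X.shape[0], (batch_size,))
        real, noise = X[idx], torch.randn(batch_size, latent_dim)
        fake = generator(noise).detach()
        d_loss = (bce(discriminator(real), torch.ones(batch_size, 1)) +
                  bce(discriminator(fake), torch.zeros(batch_size, 1)))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: push generated rows toward being classified as real.
        noise = torch.randn(batch_size, latent_dim)
        g_loss = bce(discriminator(generator(noise)), torch.ones(batch_size, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    with torch.no_grad():
        return generator(torch.randn(n_synthetic, latent_dim)).numpy()
```

The synthetic rows returned by such a generator would then be appended to the real data with minority-class labels before the train/test split, as described above.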

Fig. 2. Data distribution of BindingDB-Kd dataset.

Fig. 3. Data distribution of BindingDB-Ki dataset.

Fig. 4. Data distribution of BindingDB-IC50 dataset.

Machine and deep learning algorithms

The performance of ML and DL algorithms is critical in analyzing complex data. In this study, we employed several algorithms, including Decision Tree Classifier (DTC), Multilayer Perceptron Classifier (MLP), Random Forest Classifier (RFC), Fully Connected Neural Network (FCNN), and Multi-Head Attention Fully Connected Neural Network (MHA-FCNN). Each algorithm was evaluated to assess its suitability for predicting drug-target interactions (DTIs) using protein sequences and SMILES strings.

Decision tree classifier (DTC):

DTC is a simple and interpretable algorithm that partitions data into subsets based on feature values. It creates a tree-like structure where each node represents a decision, and the leaves represent the outcome. While DTC is computationally efficient and easy to implement, it can suffer from overfitting, especially with complex datasets16. In our study, DTC serves as a baseline model to compare with advanced algorithms.

Multilayer perceptron classifier (MLP):

MLP is a feedforward neural network that consists of multiple layers of neurons. Each neuron applies an activation function to a weighted sum of its inputs. MLP can learn non-linear relationships between features and is well-suited for structured datasets17. In this study, MLP helps bridge traditional ML methods with more advanced DL techniques by evaluating its performance on drug-target interaction data.

Random forest classifier (RFC):

RFC is an ensemble learning method that constructs multiple decision trees during training and combines their outputs for final predictions. It is robust to overfitting and effective for handling high-dimensional and complex data18. As the proposed model, RFC leverages its ability to model non-linear interactions between protein sequences and SMILES strings, providing a benchmark for our hybrid approach.
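A minimal scikit-learn sketch of how the balanced, scaled data can be split 80/20 and fed to an RFC; the variable names are illustrative and hyperparameters are left at their defaults because they are not reported here:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X_balanced / y_balanced: scaled real plus GAN-generated samples (illustrative names).
X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(random_state=42)   # default hyperparameters assumed
rfc.fit(X_train, y_train)
y_pred = rfc.predict(X_test)
y_prob = rfc.predict_proba(X_test)[:, 1]        # probability scores for ROC-AUC
```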

Fully connected neural network (FCNN):

FCNN, also known as a dense neural network, connects every neuron in one layer to every neuron in the next. This architecture allows it to learn complex patterns in data, making it well-suited for raw representations such as protein sequences and SMILES strings. The FCNN used in this study has three hidden layers with dropout for regularization. Its flexibility in learning intricate patterns provides valuable insights into the data.

Multi-head attention fully connected neural network (MHA-FCNN):

MHA-FCNN integrates multi-head attention mechanisms with FCNN layers to enhance the learning process. The attention mechanism focuses on relevant parts of the input, improving the model’s ability to capture meaningful patterns. By combining attention with dense layers, MHA-FCNN achieves better feature representation, particularly for sequences and structured chemical data, making it a powerful tool for DTI prediction. Table 3 shows the Parameter settings for the FCNN and MHA-FCNN models.

Table 3. Parameter settings for the FCNN and MHA-FCNN models.

Parameter FCNN MHA-FCNN
Number of Dense Layers 3 3
Hidden Neurons [128, 64, 32] [128, 64, 32]
Dropout Rate 0.3 0.3
Attention Mechanism N/A Multi-Head Attention
Optimizer Adam Adam
Learning Rate 0.001 0.001
Loss Function Binary Cross Entropy Binary Cross Entropy
Activation Function (Hidden Layers) ReLU ReLU
Activation Function (Output) Sigmoid Sigmoid

By comparing these algorithms, this study evaluates the balance between simplicity, interpretability, and the ability to extract meaningful features from complex datasets. The complementary strengths of RFC and MHA-FCNN provide robust tools for predicting DTIs, enhancing computational drug discovery methodologies.
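For concreteness, a minimal Keras sketch of the FCNN configuration listed in Table 3; placing a dropout layer after every hidden layer is an assumption, as the table only specifies the overall rate:

```python
from tensorflow.keras import layers, models, optimizers

def build_fcnn(n_features: int) -> models.Sequential:
    """FCNN per Table 3: dense layers [128, 64, 32], dropout 0.3, ReLU hidden
    activations, sigmoid output, Adam (lr=0.001), binary cross-entropy loss."""
    model = models.Sequential([
        layers.Input(shape=(n_features,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```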

Performance analysis

Environment

The experiments were conducted on an HP ProBook 400 GB Notebook PC featuring a 12th Gen Intel(R) Core(TM) i3-1115G4 processor running at 8.00 GHz and configured with 16.00 GB of RAM. The experiments utilized Python libraries such as TensorFlow, PyTorch, and Scikit-learn for model development and evaluation.

Evaluation metrics

To evaluate the performance of our predictive models for drug-target interaction, we employed a variety of metrics to assess our proposed model for the interaction prediction tasks. These metrics provide a comprehensive understanding of the model’s effectiveness and reliability.

  1. Accuracy: The proportion of correctly classified instances (both positive and negative) out of the total predictions.
    $\text{Accuracy} = \dfrac{TP + TN}{TP + TN + FP + FN}$
  2. Sensitivity (Recall): The ability of the model to correctly identify positive instances.
    $\text{Sensitivity} = \dfrac{TP}{TP + FN}$
  3. Specificity: The ability of the model to correctly identify negative instances.
    $\text{Specificity} = \dfrac{TN}{TN + FP}$
  4. F1-Score: The harmonic mean of precision and recall, balancing the trade-off between false positives and false negatives.
    $F1 = 2 \times \dfrac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  5. Receiver Operating Characteristic - Area Under the Curve (ROC-AUC) Score: The ROC-AUC score evaluates a model’s ability to distinguish between classes across various threshold values. It ranges from 0 to 1, where a higher score indicates better performance, with 1 representing perfect classification and 0.5 suggesting random guessing.

  6. Mean Absolute Error (MAE): The average of the absolute differences between predicted and actual values, quantifying the prediction accuracy.
    $\text{MAE} = \dfrac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$
  7. Mean Squared Error (MSE): The average of the squared differences between predicted and actual values. This metric penalizes larger errors more significantly, making it highly sensitive to outliers.
    $\text{MSE} = \dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
  8. Root Mean Squared Error (RMSE): The square root of the MSE, provides an error measurement in the same units as the predicted values. This metric helps interpret the magnitude of errors more intuitively.
    $\text{RMSE} = \sqrt{\text{MSE}} = \sqrt{\dfrac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
  9. Confusion Matrix: A two-dimensional table that presents the counts of true positive (TP), true negative (TN), false positive (FP), and false negative (FN) predictions. It offers a detailed visualization of the model’s performance across binary classification tasks.
                        Predicted Positive   Predicted Negative
    Actual Positive     TP                   FN
    Actual Negative     FP                   TN
    Where: $TP$ = True Positives (correctly predicted interactions); $TN$ = True Negatives (correctly predicted non-interactions); $FP$ = False Positives (incorrectly predicted interactions); $FN$ = False Negatives (incorrectly predicted non-interactions).

    These metrics collectively evaluate the regression and classification aspects of our models, ensuring robust analysis of their predictive capabilities.

  10. Statistical Significance Testing (Friedman Test)

    To ensure the reliability of model comparisons, we utilized the Friedman test, a non-parametric statistical test designed to detect differences across multiple models. The Friedman test evaluates whether at least one model outperforms the others significantly. It is commonly applied to rank-based results from multiple experiments.
    $\chi_F^2 = \dfrac{12N}{k(k+1)} \left[ \sum_{j=1}^{k} R_j^2 - \dfrac{k(k+1)^2}{4} \right]$
    Where: $\chi_F^2$ is the Friedman test statistic, $N$ the number of datasets or observations, $k$ the number of models being compared, and $R_j$ the sum of ranks assigned to model $j$ across datasets.

    The null hypothesis states that all models perform equally well. A significant $p$-value (< 0.05) indicates rejection of the null hypothesis, suggesting differences in model performance.

    By combining these metrics and statistical tests, we gained comprehensive insights into both the predictive performance and statistical robustness of our approach.
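A minimal sketch of how these metrics and the Friedman test can be computed with scikit-learn and SciPy; the per-model score arrays passed to the Friedman test are illustrative placeholders:

```python
import numpy as np
from scipy.stats import friedmanchisquare
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix,
                             mean_absolute_error, mean_squared_error)

acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred)
sens = recall_score(y_test, y_pred)                      # sensitivity / recall
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
spec = tn / (tn + fp)                                    # specificity
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob)
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Friedman test over per-dataset scores of competing models (illustrative arrays).
stat, p_value = friedmanchisquare(scores_rfc, scores_fcnn, scores_mha_fcnn)
if p_value < 0.05:
    print("At least one model differs significantly from the others")
```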

Result analysis

We present a comprehensive analysis of the experimental results obtained from evaluating various machine learning (ML) and deep learning (DL) models on the benchmark drug-target interaction datasets BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50. The analysis focuses on key performance metrics such as Accuracy, Precision, Sensitivity, Specificity, F1-score, Kappa, MCC, ROC-AUC, MAE, MSE, and RMSE across the binarization thresholds of 10, 20, and 30. The aim is to assess the predictive capabilities and robustness of traditional classifiers and modern neural architectures in learning complex patterns from biochemical interaction data. By comparing the performance of each model under consistent experimental settings, we highlight the strengths and limitations of the respective approaches, offering insights into their suitability for real-world drug discovery applications.

The BindingDB-Kd dataset was evaluated using a diverse set of traditional machine learning (ML) and deep learning (DL) models across three classification thresholds: 10, 20, and 30 (Table 4). A comprehensive performance comparison was conducted using multiple evaluation metrics including Accuracy, Precision, Sensitivity (Recall), Specificity, F1-score, Cohen’s Kappa, Matthews Correlation Coefficient (MCC), ROC-AUC, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE). At the 10 threshold, the Random Forest Classifier (RFC) demonstrated the most superior performance among all models, achieving the highest accuracy of 97.46%, along with the best results across most metrics such as Precision (97.49%), Specificity (98.82%), ROC-AUC (99.42), and lowest error rates (MAE and MSE of 2.54). The Komet model also showed competitive performance with an accuracy of 97.40%, followed closely by MHA-FCNN and FCNN, which maintained strong F1-scores and ROC-AUC values above 99.2, indicating the robustness of deep learning architectures in capturing nonlinear patterns. When evaluated at the 20 threshold, a slight decline in performance was observed across all models due to the increased classification difficulty. Nonetheless, RFC continued to dominate with the best accuracy of 96.56%, F1-score of 96.56%, and a leading ROC-AUC of 99.20. Komet and MHA-FCNN remained strong contenders, with accuracies of 96.28% and 96.10%, respectively, and high Specificity values, showcasing their generalizability in slightly noisier settings. At the more relaxed 30 threshold, the trend of gradual performance degradation persisted across the board. However, RFC remained the top performer with an accuracy of 96.27%, Precision of 96.33%, and a ROC-AUC of 99.12, reaffirming its reliability across varied thresholds. DeepLPI and Komet followed closely, achieving accuracies of 96.00% and 95.97%, respectively, with strong ROC-AUC and error metrics. Overall, the RFC consistently outperformed all other models at every threshold, reflecting its strong capability in handling structured biochemical data. Deep learning models such as MHA-FCNN, DeepLPI, and BarlowDTI also exhibited impressive and stable results, particularly in Specificity and ROC-AUC, making them suitable for high-confidence classification in drug-target binding prediction tasks.

Table 4. Performance analysis of ML and DL models on BindingDB-Kd.

Dataset Threshold Model Accuracy Precision Sensitivity Specificity F1score Kappa MCC ROC-AUC MAE MSE RMSE
BindingDB_Kd 10 DTC 96.47 96.47 96.47 96.45 96.47 92.93 92.93 96.66 3.53 3.53 18.80
MLP 97.13 97.14 97.13 97.91 97.13 94.26 94.27 99.15 2.87 2.87 16.95
RFC 97.46 97.49 97.46 98.82 97.46 94.91 94.95 99.42 2.54 2.54 15.95
FCNN 97.14 97.20 97.14 98.84 97.14 94.28 94.34 99.28 2.86 2.86 16.91
MHA-FCNN 97.21 97.25 97.21 98.70 97.21 94.42 94.46 99.29 2.79 2.79 16.70
DeepLPI 97.03 97.04 97.03 97.88 97.02 94.05 94.06 99.34 2.97 2.97 17.25
BarlowDTI 97.16 97.17 97.16 97.96 97.16 94.32 94.33 99.28 2.84 2.84 16.85
Komet 97.40 97.43 97.40 98.59 97.40 94.81 94.83 99.39 2.60 2.60 16.11
20 DTC 95.26 95.26 95.26 95.15 95.26 90.52 90.52 95.50 4.74 4.74 21.77
MLP 95.98 96.00 95.98 97.04 95.98 91.96 91.98 98.79 4.02 4.02 20.06
RFC 96.56 96.62 96.56 98.39 96.56 93.12 93.18 99.20 3.44 3.44 18.55
FCNN 96.08 96.15 96.08 98.00 96.08 92.16 92.22 99.02 3.92 3.92 19.81
MHA-FCNN 96.10 96.21 96.10 98.47 96.10 92.21 92.31 98.92 3.90 3.90 19.74
DeepLPI 96.27 96.31 96.27 97.82 96.27 92.54 92.58 99.12 3.73 3.73 19.32
BarlowDTI 96.33 96.37 96.33 97.82 96.33 92.66 92.70 99.08 3.67 3.67 19.16
Komet 96.28 96.35 96.28 98.14 96.28 92.56 92.63 99.07 3.72 3.72 19.28
30 DTC 94.92 94.92 94.92 95.04 94.92 89.85 89.85 95.18 5.08 5.08 22.53
MLP 95.92 95.95 95.92 97.18 95.92 91.84 91.87 98.77 4.08 4.08 20.20
RFC 96.27 96.33 96.27 98.18 96.27 92.53 92.60 99.12 3.73 3.73 19.32
FCNN 95.55 95.69 95.55 98.34 95.54 91.09 91.23 98.88 4.45 4.45 21.10
MHA-FCNN 95.60 95.66 95.60 97.39 95.60 91.20 91.25 98.79 4.40 4.40 20.98
DeepLPI 96.00 96.08 96.00 98.13 96.00 92.00 92.08 98.95 4.00 4.00 20.00
BarlowDTI 95.93 95.95 95.93 96.99 95.93 91.87 91.89 98.84 4.07 4.07 20.16
Komet 95.97 96.06 95.97 98.22 95.97 91.94 92.04 98.92 4.03 4.03 20.06

On the BindingDB-Ki dataset (Table 5), which focuses on inhibition constant (Ki) data, similar improvements were observed when applying the GAN balancing technique. In the WOB experiment, the RFC model recorded an accuracy of 88.72%, precision of 88.73%, sensitivity of 88.72%, and specificity of 89.67%. In contrast, after applying GAN-based balancing, the RFC model achieved an accuracy of 92.77%, precision of 92.94%, sensitivity of 92.77%, and specificity of 97.89%. The F1-score improved to 92.93%, and the ROC-AUC increased to 93.86%, highlighting the effectiveness of the GAN+RFC hybrid model. These results confirm that the RFC model, when combined with the GAN balancing technique, outperforms other models and provides superior performance in predicting drug-target interactions in the BindingDB-Ki dataset. The p-value of 0.0916 indicated that the differences in performance between the models were not statistically significant, but RFC still emerged as the top performer across the metrics. These findings underscore the reliability and accuracy of RFC in predicting drug-target interactions, both for BindingDB-Kd and BindingDB-Ki datasets, establishing it as the proposed model for DTI prediction in this study. The performance of several machine learning (ML) and deep learning (DL) models was evaluated on the BindingDB-Ki dataset across three different threshold levels: 10, 20, and 30. The analysis encompassed key metrics such as Accuracy, Precision, Sensitivity, Specificity, F1-score, Kappa, MCC, ROC-AUC, MAE, MSE, and RMSE. At a threshold of 10, the Random Forest Classifier (RFC) outperformed all other models, achieving the highest accuracy of 91.69%, precision of 91.74%, and sensitivity of 91.69%. It also yielded the highest Kappa (83.39), MCC (83.44), and ROC-AUC (97.32) scores, while maintaining the lowest MAE, MSE, and RMSE values (8.31, 8.31, and 28.82, respectively). This demonstrates its robustness in both classification performance and error minimization. Deep learning models such as DeepLPI, Komet, and BarlowDTI also performed competitively, with Komet achieving a strong ROC-AUC of 96.52% and low RMSE of 30.66. When the threshold was increased to 20, RFC once again demonstrated superior performance, achieving 89.78% accuracy, 89.81% precision, and 96.28% ROC-AUC-outperforming all other models. Meanwhile, DeepLPI experienced a significant performance drop under this threshold, particularly with a low sensitivity of 8.94% and ROC-AUC of 34.43, indicating its sensitivity to threshold tuning. Komet, however, maintained robust metrics with an accuracy of 88.63% and ROC-AUC of 95.69%, further confirming its consistency. At the 30 threshold level, RFC continued to lead with an accuracy of 88.62%, precision of 88.64%, and ROC-AUC of 95.49%. It again registered the lowest error metrics, with MAE, MSE, and RMSE of 11.38, 11.38, and 33.73, respectively. Among DL models, DeepLPI and BarlowDTI showed improved performance compared to the 20-threshold scenario, with DeepLPI achieving 86.96% accuracy and BarlowDTI slightly ahead at 87.09%. Across all thresholds, RFC consistently demonstrated strong generalizability and resilience, making it the top-performing model for the BindingDB-Ki dataset. Komet and DeepLPI exhibited competitive performance, especially at lower thresholds, while models like DTC and FCNN, although reasonably accurate, lagged in terms of precision, AUC, and error minimization. 
Notably, the inclusion of attention mechanisms in models such as MHA-FCNN did not significantly enhance performance in this context.

Table 5. Performance analysis of ML and DL models on BindingDB-Ki.

Dataset Threshold Model Accuracy Precision Sensitivity Specificity F1score Kappa MCC ROC-AUC MAE MSE RMSE
BindingDB_Ki 10 DTC 89.68 89.68 89.68 89.92 89.68 79.36 79.36 90.70 10.32 10.32 32.13
MLP 89.61 89.66 89.61 91.35 89.60 79.22 79.27 96.14 10.39 10.39 32.24
RFC 91.69 91.74 91.69 93.40 91.69 83.39 83.44 97.32 8.31 8.31 28.82
FCNN 89.09 89.74 89.09 95.42 89.05 78.19 78.83 95.63 10.91 10.91 33.03
MHA-FCNN 88.88 89.54 88.88 95.29 88.83 77.77 78.42 95.45 11.12 11.12 33.35
DeepLPI 90.52 90.58 90.52 92.28 90.52 81.05 81.10 96.63 9.48 9.48 30.79
BarlowDTI 90.27 90.27 90.27 90.40 90.27 80.54 80.54 96.69 9.73 9.73 31.20
Komet 90.60 90.96 90.60 95.21 90.58 81.21 81.56 96.52 9.40 9.40 30.66
20 DTC 87.24 87.24 87.24 87.69 87.24 74.48 74.49 88.43 12.76 12.76 35.72
MLP 86.93 86.97 86.93 88.53 86.93 73.86 73.90 94.51 13.07 13.07 36.15
RFC 89.78 89.81 89.78 91.30 89.77 79.55 79.59 96.28 10.22 10.22 31.97
FCNN 86.69 87.26 86.69 92.87 86.63 73.37 73.94 94.32 13.31 13.31 36.49
MHA-FCNN 86.67 86.96 86.67 91.08 86.64 73.34 73.63 94.05 13.33 13.33 36.51
DeepLPI 53.56 67.21 53.56 8.94 42.03 7.06 15.59 34.43 46.44 46.44 68.15
BarlowDTI 88.20 88.20 88.20 88.33 88.20 76.39 76.39 95.38 11.80 11.80 34.36
Komet 88.63 88.67 88.63 90.24 88.63 77.26 77.30 95.69 11.37 11.37 33.72
30 DTC 85.91 85.92 85.91 86.26 85.91 71.83 71.83 87.23 14.09 14.09 37.53
MLP 85.26 85.30 85.26 86.86 85.25 70.52 70.56 93.29 14.74 14.74 38.40
RFC 88.62 88.64 88.62 89.68 88.62 77.24 77.26 95.49 11.38 11.38 33.73
FCNN 84.35 84.88 84.35 90.43 84.30 68.72 69.24 92.57 15.65 15.65 39.56
MHA-FCNN 84.31 84.57 84.31 88.50 84.28 68.63 68.88 92.24 15.69 15.69 39.61
DeepLPI 86.96 86.96 86.96 86.64 86.96 73.92 73.92 94.63 13.04 13.04 36.11
BarlowDTI 87.09 87.09 87.09 86.81 87.09 74.18 74.18 94.54 12.91 12.91 35.93
Komet 87.07 87.09 87.07 87.96 87.07 74.15 74.16 94.61 12.93 12.93 35.95

The performance of various Machine Learning (ML) and Deep Learning (DL) models was evaluated on the BindingDB-IC50 dataset using three different activity thresholds: 10, 20, and 30 nM. Each model was assessed using multiple metrics, including Accuracy, Precision, Sensitivity, Specificity, F1-score, Cohen’s Kappa, Matthews Correlation Coefficient (MCC), ROC-AUC, Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) (Table 6). Across all thresholds, the Random Forest Classifier (RFC) consistently demonstrated the highest overall performance. At the 10 nM threshold, RFC achieved the best results, including an accuracy of 95.40%, precision of 95.41%, F1-score of 95.39%, and the highest ROC-AUC of 98.97%, while also recording the lowest error metrics (MAE: 4.60, MSE: 4.60, RMSE: 21.46). These results highlight its robust capability in binary classification tasks related to compound activity. In contrast, other models such as MLP, DeepLPI, and BarlowDTI also demonstrated strong performance, particularly in terms of ROC-AUC and specificity, but generally lagged slightly behind RFC in terms of accuracy and consistency across metrics. For instance, BarlowDTI and Komet achieved accuracies of 94.59% and 94.49% respectively, while maintaining competitive AUCs above 98.00%. At the 20 nM threshold, a similar trend was observed, with RFC again leading in accuracy (93.75%) and other metrics such as F1-score (93.75), MCC (87.52), and ROC-AUC (98.37). Although most models showed a marginal decrease in performance compared to the 10 nM threshold, RFC retained a noticeable edge. When the threshold was raised to 30 nM, overall model performances declined slightly, as expected due to reduced class separability. Nevertheless, RFC continued to outperform the others with an accuracy of 92.90%, an F1-score of 92.89%, and a ROC-AUC of 98.01%, while also maintaining lower error rates (MAE: 7.10, RMSE: 26.65). Other models, including DeepLPI and Komet, still performed relatively well, though with a larger performance gap from RFC. Overall, these results establish RFC as the most effective model for IC50 classification on the BindingDB dataset across various thresholds, indicating its strong generalization ability and suitability for compound activity prediction tasks. Moreover, the closely competitive performances of DL-based models like FCNN, MHA-FCNN, and DeepLPI reinforce the potential of deep learning, especially when paired with interpretability enhancements in future work.

Table 6. Performance analysis of ML and DL models on BindingDB-IC50.

Dataset Threshold Model Accuracy Precision Sensitivity Specificity F1score Kappa MCC ROC-AUC MAE MSE RMSE
BindingDB_IC50 10 DTC 94.12 94.12 94.12 94.04 94.12 88.23 88.23 94.77 5.88 5.88 24.26
MLP 94.27 94.39 94.27 96.92 94.26 88.53 88.66 98.36 5.73 5.73 23.94
RFC 95.40 95.41 95.40 96.42 95.39 90.79 90.81 98.97 4.60 4.60 21.46
FCNN 93.92 94.37 93.92 98.93 93.91 87.85 88.29 98.13 6.08 6.08 24.65
MHA-FCNN 94.16 94.45 94.16 98.20 94.15 88.32 88.61 98.05 5.84 5.84 24.16
DeepLPI 94.02 94.08 94.02 95.95 94.02 88.04 88.10 97.86 5.98 5.98 24.46
BarlowDTI 94.59 94.70 94.59 97.03 94.59 89.18 89.29 98.23 5.41 5.41 23.26
Komet 94.49 94.76 94.49 98.41 94.48 88.97 89.25 98.10 5.51 5.51 23.48
20 DTC 92.04 92.04 92.04 92.11 92.04 84.09 84.09 92.91 7.96 7.96 28.21
MLP 91.97 92.16 91.97 95.33 91.96 83.93 84.12 97.36 8.03 8.03 28.35
RFC 93.75 93.78 93.75 95.02 93.75 87.50 87.52 98.37 6.25 6.25 25.00
FCNN 91.53 92.10 91.53 97.30 91.51 83.07 83.63 97.05 8.47 8.47 29.09
MHA-FCNN 91.70 91.98 91.70 95.79 91.68 83.40 83.68 96.88 8.30 8.30 28.81
DeepLPI 92.46 92.58 92.46 95.14 92.46 84.92 85.04 97.34 7.54 7.54 27.46
BarlowDTI 92.45 92.48 92.45 93.67 92.45 84.91 84.93 97.54 7.55 7.55 27.47
Komet 92.29 92.72 92.29 97.32 92.27 84.58 85.01 97.34 7.71 7.71 27.77
30 DTC 90.83 90.83 90.83 90.83 90.83 81.66 81.66 91.77 9.17 9.17 30.28
MLP 90.42 90.61 90.42 93.87 90.41 80.85 81.04 96.61 9.58 9.58 30.95
RFC 92.90 92.92 92.90 94.13 92.89 85.79 85.82 98.01 7.10 7.10 26.65
FCNN 90.00 90.69 90.00 96.52 89.96 80.00 80.69 96.24 10.00 10.00 31.62
MHA-FCNN 90.17 90.64 90.17 95.58 90.14 80.34 80.81 96.17 9.83 9.83 31.35
DeepLPI 91.13 91.31 91.13 94.48 91.11 82.25 82.43 96.69 8.87 8.87 29.79
BarlowDTI 90.79 90.85 90.79 88.99 90.79 81.59 81.64 97.25 9.21 9.21 30.34
Komet 91.32 91.62 91.32 95.56 91.31 82.64 82.94 96.96 8.68 8.68 29.46

The confusion matrix analysis of the proposed GAN+RFC hybrid model across the BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50 datasets reveals its strong capability in accurately predicting drug-target interactions, particularly under conditions of class imbalance. On the BindingDB-Kd dataset (Fig. 5), the GAN+RFC model exhibits excellent performance, with a high number of true positives and true negatives for both the “Yes” and “No” classes. The model effectively minimizes false positives and false negatives, especially for the “Yes” class, which denotes the presence of drug-target interactions. This outcome underscores the effectiveness of the GAN-based data augmentation in mitigating class imbalance and enhancing the classifier’s sensitivity to minority class patterns. The hybrid framework not only improves the model’s precision and recall but also enhances its ability to generalize across different interaction types. Similarly, on the BindingDB-Ki dataset (Fig. 6), the confusion matrix further validates the robust predictive power of the GAN+RFC model. The classifier achieves a high proportion of correct predictions for both classes, with significantly reduced misclassification rates. The “Yes” class again benefits from the data balancing mechanism provided by the GAN, resulting in an increased number of correctly predicted positive interactions. This highlights the model’s ability to learn meaningful features that distinguish between interacting and non-interacting drug-target pairs, even in imbalanced data scenarios. On the BindingDB-IC50 dataset (Fig. 7), the confusion matrix demonstrates consistent performance, with the GAN+RFC model maintaining high accuracy across both classes. Despite the complexity and variability in this dataset, the model achieves reliable results, suggesting its adaptability and robustness in different drug interaction measurement conditions (Kd, Ki, and IC50). In summary, the confusion matrix analysis across all three datasets affirms the effectiveness of the hybrid GAN+RFC model. Its ability to handle imbalanced data while maintaining high classification accuracy makes it a powerful and generalizable tool for drug-target interaction prediction in computational drug discovery pipelines.

Fig. 5. Confusion matrix of ML and DL models on BindingDB-Kd data using GAN.

Fig. 6. Confusion matrix of ML and DL models on BindingDB-Ki data using GAN.

Fig. 7. Confusion matrix of ML and DL models on BindingDB-IC50 data using GAN.

The classification report analysis across the BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50 datasets underscores the superior performance of the proposed hybrid GAN+RFC model compared with traditional machine learning and deep learning approaches (Table 7). The application of GAN for data balancing, coupled with the robust classification capabilities of the Random Forest, resulted in consistently high precision, recall, and F1-scores, particularly for the “Yes” class, which is often underrepresented in imbalanced datasets.

On the BindingDB-Kd dataset, the GAN+RFC model achieved the highest F1-score of 97.42% for the “Yes” class, with a precision of 98.78% and a recall of 96.09%. These results indicate the model’s strong ability to identify true drug-target interactions while minimizing false positives and false negatives. While models such as FCNN and MHA-FCNN also performed well, their slightly lower recall values suggest a relatively higher tendency to misclassify positive samples compared to the GAN+RFC model.

A similar trend is observed on the BindingDB-Ki dataset. Here, the GAN+RFC model again outperformed its counterparts, achieving a precision of 93.22%, recall of 90.00%, and F1-score of 91.58% for the “Yes” class. This consistent performance highlights the model’s generalizability across datasets with different bioactivity measures. Although deep learning models such as DeepLPI and BarlowDTI showed competitive results, especially in precision, their recall values were comparatively lower, which could lead to missed positive interactions, a critical concern in drug discovery.

On the BindingDB-IC50 dataset, the GAN+RFC model maintained its dominance, delivering an F1-score of 95.34% for the “Yes” class. Notably, this performance surpasses models such as FCNN and MHA-FCNN, which achieved slightly lower recall despite high precision. The results reaffirm the effectiveness of combining generative data augmentation with a powerful ensemble classifier, ensuring both high accuracy and robustness in imbalanced classification settings.

In conclusion, the classification report analysis demonstrates that the hybrid GAN+RFC model consistently achieves superior performance across all three datasets, with particular strength in identifying positive drug-target interactions. By effectively addressing class imbalance while maintaining high predictive accuracy, this hybrid approach provides a reliable and scalable solution for drug-target interaction prediction, positioning itself as a strong candidate for deployment in real-world drug discovery pipelines.
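A per-class report such as Table 7 can be generated directly from the predictions; the following minimal sketch uses scikit-learn’s classification_report, again with placeholder labels rather than the study’s data.

```python
# Minimal sketch of producing per-class precision/recall/F1 figures as in Table 7.
# `y_test` and `y_pred` are placeholders, not the study's predictions.
from sklearn.metrics import classification_report

y_test = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

print(classification_report(y_test, y_pred,
                            target_names=["No", "Yes"], digits=4))
```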

Table 7.

Classification report analysis of ML and DL models on DTI datasets.

BindingDB-Kd BindingDB-Ki BindingDB-IC50
ML/DL Class Precision Recall F1-Score ML/DL Class Precision Recall F1-Score ML/DL Class Precision Recall F1-Score
DTC No 96.49 96.45 96.47 DTC No 89.40 89.92 89.66 DTC No 94.20 94.04 94.12
DTC Yes 96.44 96.49 96.46 DTC Yes 89.95 89.43 89.69 DTC Yes 94.03 94.20 94.11
MLP No 96.41 97.91 97.15 MLP No 88.20 91.35 89.74 MLP No 92.05 96.92 94.42
MLP Yes 97.87 96.35 97.10 MLP Yes 91.11 87.88 89.47 MLP Yes 96.74 91.60 94.10
RFC No 96.20 98.82 97.49 RFC No 90.25 93.40 91.80 RFC No 94.50 96.42 95.45
RFC Yes 98.78 96.09 97.42 RFC Yes 93.22 90.00 91.58 RFC Yes 96.34 94.37 95.34
FCNN No 95.59 98.84 97.19 FCNN No 84.62 95.42 89.70 FCNN No 89.94 98.93 94.22
FCNN Yes 98.80 95.43 97.09 FCNN Yes 94.81 82.81 88.41 FCNN Yes 98.81 88.90 93.59
MHA-FCNN No 95.85 98.70 97.26 MHA-FCNN No 84.38 95.29 89.51 MHA-FCNN No 90.88 98.20 94.40
MHA-FCNN Yes 98.66 95.72 97.17 MHA-FCNN Yes 94.65 82.52 88.17 MHA-FCNN Yes 98.03 90.12 93.91
DeepLPI No 96.24 97.88 97.05 DeepLPI No 89.07 92.28 90.65 DeepLPI No 92.40 95.95 94.14
DeepLPI Yes 97.84 96.17 97.00 DeepLPI Yes 92.07 88.78 90.39 DeepLPI Yes 95.78 92.08 93.89
BarlowDTI No 96.42 97.96 97.19 BarlowDTI No 90.09 90.40 90.24 BarlowDTI No 92.54 97.03 94.73
BarlowDTI Yes 97.92 96.36 97.13 BarlowDTI Yes 90.45 90.14 90.29 BarlowDTI Yes 96.86 92.15 94.45
Komet No 96.31 98.59 97.44 Komet No 87.11 95.21 90.98 Komet No 91.26 98.41 94.70
Komet Yes 98.55 96.22 97.37 Komet Yes 94.77 86.04 90.19 Komet Yes 98.27 90.55 94.25

Statistical analysis

The performance of the models across the updated datasets was further assessed using the Friedman test, a non-parametric statistical test designed to detect differences in performance among multiple models when the assumptions of parametric tests, such as normality, are not met. The test was applied to the BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50 datasets. For the BindingDB-Kd dataset, the Friedman test statistic was 7.0000, with a p-value of 0.4289. Since the p-value is considerably greater than the standard significance threshold of 0.05, we fail to reject the null hypothesis of equal performance and conclude that there are no statistically significant differences among the models. This indicates that all models (RFC, MLP, DTC, FCNN, MHA-FCNN, DeepLPI, BarlowDTI, and Komet) exhibited comparable performance despite variations in individual metric scores. A similar trend was observed for the BindingDB-Ki and BindingDB-IC50 datasets, where the Friedman test also returned a statistic of 7.0000 and a p-value of 0.4289. These consistent results across all three datasets confirm that the observed differences in model metrics are not statistically significant and that the models perform similarly on these datasets.
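As a rough illustration of the procedure (not a reproduction of the reported statistic), the Friedman test can be run with SciPy on per-model score vectors; the three vectors below reuse the GAN-balanced BindingDB-Kd metric rows from Table 9 purely as example inputs.

```python
# Hedged sketch of the Friedman test described above, using SciPy.
# Each list holds one model's scores over the same set of metrics
# (values taken from Table 9's GAN-balanced BindingDB-Kd rows for illustration).
from scipy.stats import friedmanchisquare

rfc = [97.46, 97.49, 97.46, 98.82, 97.46, 99.42]
mlp = [97.13, 97.14, 97.13, 97.91, 97.13, 99.15]
dtc = [96.47, 96.47, 96.47, 96.45, 96.47, 96.66]

stat, p_value = friedmanchisquare(rfc, mlp, dtc)
print(f"Friedman statistic={stat:.4f}, p-value={p_value:.4f}")
# p-value > 0.05 -> fail to reject the null hypothesis of equal model performance.
```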

Analysis on varying degrees of imbalanced datasets

The performance analysis on datasets with varying degrees of imbalance demonstrates the effectiveness of balancing techniques, particularly GAN-based augmentation, in improving classification results across different levels of class imbalance (Table 8). The analysis covers three datasets, BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50, and evaluates the impact of extreme, moderate, and balanced class distributions.

For the BindingDB-Kd dataset, the extreme-imbalance case showed a noticeable drop in performance, with an accuracy of 93.69% and a relatively low F1-score of 93.27%. When the dataset became moderately imbalanced, performance improved slightly, with an accuracy of 94.64% and an F1-score of 94.12%. The most significant improvement occurred when the dataset was balanced using GAN augmentation, yielding an accuracy of 97.46%, precision of 97.49%, sensitivity of 97.46%, and specificity of 98.82%. The F1-score of 97.46% further underscores the effectiveness of GAN in handling class imbalance, leading to a well-rounded improvement in model performance.

Similarly, for the BindingDB-Ki dataset, the extreme imbalance led to lower performance, with an accuracy of 84.87% and an F1-score of 84.81%. With moderate balancing, accuracy rose to 87.22% and the F1-score increased to 86.97%. The balanced case (using 10+GAN) produced a notable enhancement, reaching an accuracy of 91.69%, precision of 91.74%, sensitivity of 91.69%, and specificity of 93.40%, with an F1-score of 91.69%. This demonstrates that GAN augmentation can significantly enhance model performance, especially for datasets with initially high class imbalance.

For the BindingDB-IC50 dataset, extreme class imbalance led to an accuracy of 88.76% and an F1-score of 88.57%, which are comparatively lower than those of the other two datasets. After moderate balancing, performance improved to an accuracy of 91.99% and an F1-score of 91.71%. The balanced case using 10+GAN provided the most significant boost, achieving an accuracy of 95.40%, precision of 95.41%, sensitivity of 95.40%, and specificity of 96.42%, with an F1-score of 95.39%. This improvement across all performance metrics highlights the substantial benefit of GAN augmentation, particularly in addressing extreme class imbalance.

Overall, the analysis underlines the importance of addressing class imbalance to achieve optimal performance. For all three datasets, the GAN-based balancing technique significantly improved the model’s ability to correctly classify both classes, particularly the underrepresented “Yes” class. By balancing the datasets, the 10+GAN approach consistently delivered superior performance across accuracy, precision, sensitivity, specificity, and F1-score, making it a highly effective solution for imbalanced drug-target interaction prediction tasks.
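The effect of the binarization threshold on class balance can be sketched as follows; the mock affinity values and cutoffs are assumptions for illustration only and do not reflect the exact preprocessing used to build Table 8.

```python
# Hedged sketch of how varying a binarization cutoff changes class balance
# (cf. the Threshold column of Table 8). The mock affinity values and the
# cutoffs are illustrative assumptions, not the paper's exact preprocessing.
import numpy as np

rng = np.random.default_rng(0)
affinity = rng.lognormal(mean=3.0, sigma=1.5, size=100_000)  # mock Kd-like values

for cutoff in (30, 10):
    labels = (affinity <= cutoff).astype(int)   # 1 = "Yes" interaction (stronger binding)
    pos, neg = labels.sum(), (labels == 0).sum()
    print(f"cutoff={cutoff:>3}: Yes={pos}, No={neg}, imbalance ratio={neg / pos:.2f}")
```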

Table 8.

Performance analysis on varying degrees of imbalanced datasets.

Dataset Balancing Class 0 / No Class 1 / Yes Accuracy Precision Sensitivity Specificity F1-score Threshold
BindingDB_Kd Extreme Imbalanced 37385 4851 93.69 93.24 93.69 98.08 93.27 30
Moderate Imbalanced 38910 3326 94.64 94.10 94.64 98.63 94.12 10
Balanced 38910 38910 97.46 97.49 97.46 98.82 97.46 10+GAN
BindingDB_Ki Extreme Imbalanced 197572 99113 84.87 84.77 84.87 89.45 84.81 30
Moderate Imbalanced 227658 69027 87.22 86.86 87.22 93.20 86.97 10
Balanced 227658 227658 91.69 91.74 91.69 93.40 91.69 10+GAN
BindingDB_IC50 Extreme Imbalanced 597037 169867 88.76 88.47 88.76 94.01 88.57 30
Moderate Imbalanced 664750 102154 91.99 91.57 91.99 96.54 91.71 10
Balanced 664750 664750 95.40 95.41 95.40 96.42 95.39 10+GAN

Effect of data balancing

The inclusion of data balancing, achieved through the GAN module, significantly enhances the performance of the models across all benchmark datasets: BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50 (Table 9). This improvement is evident when comparing models trained on unbalanced datasets with those utilizing the GAN module for data balancing. Specifically, the proposed GAN+RFC framework consistently outperforms other configurations in accuracy, sensitivity, specificity, and AUC-ROC, while also reducing error metrics such as MAE, MSE, and RMSE. On the BindingDB-Kd dataset, for instance, GAN+RFC achieves an accuracy of 97.46%, a sensitivity of 97.46%, and an AUC-ROC of 99.42, compared with an accuracy of 94.64% and an AUC-ROC of 93.17 for the unbalanced RFC model. Similarly, across the BindingDB-Ki and BindingDB-IC50 datasets, the GAN-enhanced models demonstrate marked improvements in predictive metrics, underscoring the critical role of data balancing in mitigating bias and improving generalizability. These results highlight the transformative impact of the GAN module in effectively capturing and enhancing patterns in imbalanced datasets, leading to robust and reliable predictive performance.
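A minimal sketch of this balancing idea is given below, assuming a toy feature matrix: a small vanilla GAN (generator and discriminator implemented in PyTorch) is fit on minority-class feature vectors, synthetic samples are drawn to equalize the classes, and an RFC is trained on the balanced set. The layer sizes, training schedule, and data are illustrative assumptions, not the paper’s exact configuration.

```python
# Hedged sketch of GAN-based minority oversampling followed by RFC training.
# Toy data, architectures, and epochs are assumptions for illustration only.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_features = 32
X_major = rng.normal(0.0, 1.0, size=(2000, n_features)).astype(np.float32)   # class "No"
X_minor = rng.normal(1.5, 1.0, size=(300, n_features)).astype(np.float32)    # class "Yes"

latent_dim = 16
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
real = torch.from_numpy(X_minor)

for epoch in range(200):
    # Discriminator step: real minority samples vs. generated fakes.
    z = torch.randn(len(real), latent_dim)
    fake = G(z).detach()
    d_loss = bce(D(real), torch.ones(len(real), 1)) + bce(D(fake), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator step: try to fool the discriminator.
    z = torch.randn(len(real), latent_dim)
    g_loss = bce(D(G(z)), torch.ones(len(real), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Generate enough synthetic minority rows to equalize the two classes.
n_synth = len(X_major) - len(X_minor)
with torch.no_grad():
    X_synth = G(torch.randn(n_synth, latent_dim)).numpy()

X_bal = np.vstack([X_major, X_minor, X_synth])
y_bal = np.concatenate([np.zeros(len(X_major)), np.ones(len(X_minor) + n_synth)])
rfc = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_bal, y_bal)
```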

Table 9.

Effect of data balancing.

Dataset Data Balancing Model Accuracy Precision Sensitivity Specificity F1-score Kappa MCC ROC-AUC MAE MSE RMSE
BindingDB_Kd No DTC 92.84 92.70 92.84 96.30 92.76 51.91 51.93 76.97 7.16 7.16 26.76
MLP 94.15 93.73 94.15 97.63 93.89 58.10 58.43 88.84 5.85 5.85 24.18
RFC 94.64 94.10 94.64 98.63 94.12 58.17 59.67 93.17 5.36 5.36 23.16
GAN DTC 96.47 96.47 96.47 96.45 96.47 92.93 92.93 96.66 3.53 3.53 18.80
MLP 97.13 97.14 97.13 97.91 97.13 94.26 94.27 99.15 2.87 2.87 16.95
RFC 97.46 97.49 97.46 98.82 97.46 94.91 94.95 99.42 2.54 2.54 15.95
BindingDB_Ki No DTC 84.26 84.24 84.26 89.81 84.25 56.01 56.01 80.51 15.74 15.74 39.67
MLP 84.45 83.73 84.45 92.53 83.90 53.79 54.19 87.68 15.55 15.55 39.44
RFC 87.22 86.86 87.22 93.20 86.97 63.03 63.18 91.49 12.78 12.78 35.75
GAN DTC 89.68 89.68 89.68 89.92 89.68 79.36 79.36 90.70 10.32 10.32 32.13
MLP 89.61 89.66 89.61 91.35 89.60 79.22 79.27 96.14 10.39 10.39 32.24
RFC 91.69 91.74 91.69 93.40 91.69 83.39 83.44 97.32 8.31 8.31 28.82
BindingDB_IC50 No DTC 89.74 89.71 89.74 94.14 89.73 55.52 55.52 80.79 10.26 10.26 32.03
MLP 89.94 89.18 89.94 95.78 89.44 52.33 52.79 89.34 10.06 10.06 31.72
RFC 91.99 91.57 91.99 96.54 91.71 63.03 63.30 93.41 8.01 8.01 28.30
GAN DTC 94.12 94.12 94.12 94.04 94.12 88.23 88.23 94.77 5.88 5.88 24.26
MLP 94.27 94.39 94.27 96.92 94.26 88.53 88.66 98.36 5.73 5.73 23.94
RFC 95.40 95.41 95.40 96.42 95.39 90.79 90.81 98.97 4.60 4.60 21.46

Comparative study of ACC and DC

The performance comparison between Amino Acid Composition (ACC) and Dipeptide Composition (DC) using various machine learning (ML) and deep learning (DL) models is essential for understanding how different feature extraction methods affect results on Drug-Target Interaction (DTI) datasets (Table 10). This study evaluates models on three datasets, BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50, using key performance metrics such as Accuracy, Precision, Sensitivity, Specificity, F1-Score, Kappa, MCC, ROC-AUC, MAE, MSE, and RMSE.

For the BindingDB-Kd dataset, Amino Acid Composition (ACC) generally outperforms Dipeptide Composition (DC). With ACC, the Random Forest Classifier (RFC) achieved the highest performance, with an accuracy of 97.46%, precision of 97.49%, sensitivity of 97.46%, specificity of 98.82%, and an F1-score of 97.46%. With DC, the RFC model delivered a slightly lower accuracy of 97.09%; precision and sensitivity remained strong, but overall performance lagged behind ACC on key metrics such as F1-score and ROC-AUC.

The BindingDB-Ki dataset shows a similar trend, with ACC features exhibiting superior performance over DC. For ACC, the RFC model achieved an accuracy of 91.69%, notably higher than the 91.61% obtained by the same model with DC. The RFC with ACC also outperformed in precision, sensitivity, and F1-score, reaching an F1-score of 91.69% compared with 91.61% for DC.

The BindingDB-IC50 dataset also supports the trend observed in the other datasets, with ACC outperforming DC. The RFC model again showed the best performance for ACC, achieving an accuracy of 95.40%, precision of 95.41%, sensitivity of 95.40%, and specificity of 96.42%. For DC, the RFC model exhibited slightly lower values across the metrics, with an accuracy of 95.37% and comparable reductions in precision and sensitivity.

Across all datasets and models tested, Amino Acid Composition (ACC) consistently led to better accuracy, precision, sensitivity, and F1-score than Dipeptide Composition (DC). This comparative study therefore suggests that ACC is the more effective protein representation for Drug-Target Interaction tasks, yielding consistently stronger performance across multiple metrics.
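For reference, the two protein descriptors compared here can be computed with simple frequency counts; the sketch below uses standard definitions of amino acid composition (20 dimensions) and dipeptide composition (400 dimensions), while the exact normalization used in the study is an assumption.

```python
# Hedged sketch of the two protein descriptors compared in Table 10:
# amino acid composition (20-dim) and dipeptide composition (400-dim).
# The frequency-based definitions are standard; the paper's exact
# normalization is an assumption.
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def amino_acid_composition(seq: str) -> list[float]:
    seq = seq.upper()
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def dipeptide_composition(seq: str) -> list[float]:
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [pairs.count(dp) / len(pairs) for dp in DIPEPTIDES]

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy target sequence
print(len(amino_acid_composition(seq)), len(dipeptide_composition(seq)))  # 20 400
```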

Table 10.

Comparative analysis of ACC and DC using ML/DL models on DTI datasets.

Dataset Composition Model Accuracy Precision Sensitivity Specificity F1-score Kappa MCC ROC-AUC MAE MSE RMSE
BindingDB_Kd ACC DTC 96.47 96.47 96.47 96.45 96.47 92.93 92.93 96.66 3.53 3.53 18.80
MLP 97.13 97.14 97.13 97.91 97.13 94.26 94.27 99.15 2.87 2.87 16.95
RFC 97.46 97.49 97.46 98.82 97.46 94.91 94.95 99.42 2.54 2.54 15.95
DC DTC 96.01 96.01 96.01 95.75 96.01 92.02 92.02 96.24 3.99 3.99 19.97
MLP 96.95 96.96 96.95 97.75 96.95 93.90 93.91 99.43 3.05 3.05 17.47
RFC 97.09 97.12 97.09 98.22 97.09 94.18 94.21 99.31 2.91 2.91 17.06
BindingDB_Ki ACC DTC 89.68 89.68 89.68 89.92 89.68 79.36 79.36 90.70 10.32 10.32 32.13
MLP 89.61 89.66 89.61 91.35 89.60 79.22 79.27 96.14 10.39 10.39 32.24
RFC 91.69 91.74 91.69 93.40 91.69 83.39 83.44 97.32 8.31 8.31 28.82
DC DTC 89.75 89.75 89.75 89.82 89.75 79.49 79.49 90.83 10.25 10.25 32.02
MLP 90.71 90.77 90.71 92.62 90.70 81.41 81.47 97.01 9.29 9.29 30.49
RFC 91.61 91.64 91.61 92.77 91.61 83.22 83.25 97.39 8.39 8.39 28.96
BindingDB_IC50 ACC DTC 94.12 94.12 94.12 94.04 94.12 88.23 88.23 94.77 5.88 5.88 24.26
MLP 94.27 94.39 94.27 96.92 94.26 88.53 88.66 98.36 5.73 5.73 23.94
RFC 95.40 95.41 95.40 96.42 95.39 90.79 90.81 98.97 4.60 4.60 21.46
DC DTC 94.11 94.11 94.11 94.10 94.11 88.22 88.22 94.80 5.89 5.89 24.27
MLP 94.69 94.71 94.69 95.77 94.69 89.38 89.41 98.85 5.31 5.31 23.04
RFC 95.37 95.39 95.37 96.32 95.37 90.74 90.75 98.99 4.63 4.63 21.52

Complexity analysis

The computational efficiency of machine learning (ML) and deep learning (DL) models plays a critical role in their practical application, particularly in Drug-Target Interaction (DTI) prediction tasks. In this section, we analyze the training time, prediction time, and total execution time of the models across the three DTI datasets: BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50. Table 11 presents the complexity analysis results for the models evaluated in this study.

For the BindingDB-Kd dataset, the Random Forest Classifier (RFC) is among the most efficient models, requiring 11.697 seconds for training and 0.206 seconds for prediction, for a total execution time of 11.903 seconds. This is significantly faster than models such as DeepLPI (82.640 seconds) and BarlowDTI (130.469 seconds), which exhibit considerably longer training and prediction times. The Multi-layer Perceptron (MLP) and Fully Connected Neural Network (FCNN) also require substantial training time, at 57.502 seconds and 56.124 seconds, respectively.

For the BindingDB-Ki dataset, RFC continues to show its efficiency, requiring 76.017 seconds for training and 2.423 seconds for prediction, yielding a total time of 78.440 seconds. This is again far less than DL models such as BarlowDTI (736.850 seconds) and Komet (470.849 seconds). As with the BindingDB-Kd dataset, MLP exhibits high computational complexity, requiring 553.865 seconds for training, while MHA-FCNN and DeepLPI also show considerable training times of around 330.942 seconds and 454.559 seconds, respectively.

For the BindingDB-IC50 dataset, the computational cost increases for all models, particularly the deep learning-based ones. RFC remains highly time-efficient, with a training time of 244.949 seconds and a prediction time of 7.019 seconds, for a total of 251.967 seconds. In comparison, BarlowDTI exhibits the longest execution time, with a total of 2129.139 seconds, while MLP takes 741.861 seconds for training and 0.550 seconds for prediction, for a total of 742.411 seconds.

From the complexity analysis, it is evident that traditional machine learning models such as DTC and RFC are significantly more efficient in training and prediction time than deep learning models like MLP, FCNN, MHA-FCNN, and others. Among the strongest-performing models, RFC is the most computationally efficient across all three datasets, making it a strong candidate for scenarios where rapid execution is critical. In contrast, deep learning models, while often competitive in predictive accuracy, require considerably more computational resources and time.
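The wall-clock figures in Table 11 can be collected with a simple timing harness like the sketch below; the toy data, feature dimensionality, and model settings are assumptions made only to illustrate the measurement procedure.

```python
# Hedged sketch of measuring training and prediction wall-clock time (cf. Table 11).
# Toy data and model settings are assumptions, not the study's configuration.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(20_000, 200)), rng.integers(0, 2, 20_000)
X_test = rng.normal(size=(5_000, 200))

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)

t0 = time.perf_counter()
model.fit(X_train, y_train)
train_time = time.perf_counter() - t0

t0 = time.perf_counter()
model.predict(X_test)
predict_time = time.perf_counter() - t0

print(f"training={train_time:.3f}s, prediction={predict_time:.3f}s, "
      f"total={train_time + predict_time:.3f}s")
```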

Table 11.

Complexity analysis of ML and DL models on DTI datasets.

Dataset Model name Training time (s) Prediction time (s) Total time (s)
BindingDB_Kd DTC 2.808 0.007 2.814
MLP 57.502 0.047 57.548
RFC 11.697 0.206 11.903
FCNN 56.124 1.114 57.239
MHA-FCNN 60.584 1.482 62.066
DeepLPI 81.352 1.288 82.640
BarlowDTI 129.176 1.293 130.469
Komet 83.035 1.266 84.301
BindingDB_Ki DTC 13.721 0.057 13.778
MLP 553.865 0.194 554.059
RFC 76.017 2.423 78.440
FCNN 309.454 5.399 314.854
MHA-FCNN 330.942 5.125 336.067
DeepLPI 454.559 5.170 459.729
BarlowDTI 731.753 5.097 736.850
Komet 465.687 5.162 470.849
BindingDB_IC50 DTC 52.293 0.163 52.456
MLP 741.861 0.550 742.411
RFC 244.949 7.019 251.967
FCNN 872.200 14.611 886.811
MHA-FCNN 913.661 14.015 927.676
DeepLPI 1285.881 13.784 1299.665
BarlowDTI 2115.441 13.698 2129.139
Komet 1336.914 13.869 1350.783

Discussion

We have presented an analytical comparison of the performance of our proposed GAN+RFC model, which combines advanced feature extraction (MACCS keys and ACC) with GAN-based data balancing and an RFC classifier, against state-of-the-art (SOTA) approaches for drug-target interaction (DTI) prediction. The evaluation, conducted across multiple benchmark datasets and metrics, highlights the efficacy of our method, as summarized in Table 12. Competing methods include DeepLPI9, BarlowDTI12, Komet13, Ada-kNN-DTA10, and MDCT-DTA11.

Our GAN+RFC model consistently outperforms SOTA models across the three key datasets: BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50. Notably, on the BindingDB-Kd dataset, the model achieves a sensitivity of 97.46%, specificity of 98.82%, and AUC-ROC of 99.42%, surpassing methods such as DeepLPI (AUC-ROC: 79.00) and Komet (AUC-ROC: 70.00). Although BarlowDTI shows an improvement with an AUC-ROC of 93.64, it still falls short of our method. Similarly, on the BindingDB-Ki dataset, our approach achieves a sensitivity of 91.69%, specificity of 93.40%, and AUC-ROC of 97.32%, outperforming Ada-kNN-DTA and MDCT-DTA, which reported higher RMSE values (73.50 and 47.50, respectively) and weaker overall metric alignment. For the BindingDB-IC50 dataset, our GAN+RFC model demonstrates a sensitivity of 95.40%, specificity of 96.42%, and AUC-ROC of 98.97%, coupled with a low MSE (4.60) and RMSE (21.46), significantly exceeding the benchmarks set by competing models such as Ada-kNN-DTA.

The observed performance gains stem from the synergy of advanced feature extraction and GAN-enabled data balancing, which together capture complex molecular and target patterns. Moreover, the Random Forest Classifier contributes robust decision-making capabilities, effectively handling high-dimensional data and forming precise predictive boundaries. These features collectively establish a versatile and scalable framework, enabling accurate identification of drug-target interactions.

In conclusion, our GAN+RFC model emerges as a highly effective tool for DTI prediction, outperforming existing methods on both classification and error metrics across diverse datasets. Its demonstrated generalizability and robustness offer promising implications for computational drug discovery, paving the way for improved therapeutic development and pharmaceutical research.

Table 12.

Comparison of drug-target interaction prediction models with state-of-the-art (SOTA) works.

SI. No. Author Dataset Model Sensitivity Specificity AUC-ROC MSE RMSE
1 Wei et al.9 BindingDB-Kd DeepLPI 68.40 77.30 79.00
2 Schuh et al.12 BindingDB-Kd BarlowDTI 93.64
3 Guichaoua et al.13 BindingDB-Kd Komet 70.00
4 Our Proposed BindingDB-Kd GAN+RFC 97.46 98.82 99.42 2.54 15.95
5 Pei et al.10 BindingDB-Ki Ada-kNN-DTA 73.50
6 Zhu et al.11 BindingDB-Ki MDCT-DTA - 47.50
7 Our Proposed BindingDB-Ki GAN+RFC 91.69 93.40 97.32 8.31 28.82
8 Pei et al.10 BindingDB-IC50 Ada-kNN-DTA 67.50
9 Pei et al.14 BindingDB-IC50 SMT-DTA 73.50
10 Our Proposed BindingDB-IC50 GAN+RFC 95.40 96.42 98.97 4.60 21.46

Problem statements validation

  • VP1: Integration of Chemical and Biological Information. Our paper introduces a dual feature extraction method, utilizing MACCS keys for drug structural features and amino acid/dipeptide compositions for target biomolecular properties, addressing this issue (a feature-extraction sketch follows at the end of this subsection).

  • VP2: Data Imbalance in Experimental Datasets. Our research employs Generative Adversarial Networks (GANs) to generate synthetic data for the minority class, significantly reducing false negatives and improving model sensitivity.

  • VP3: Limitations of Traditional Drug Discovery Approaches. Our study highlights the inefficiencies of traditional methods and proposes a hybrid ML framework, incorporating the Random Forest Classifier (RFC) for scalable, high-dimensional data analysis.

  • VP4: Threshold Optimization for Drug-Target Interaction. This research systematically evaluates threshold selection, ensuring reliable classification and improved accuracy. The GAN+RFC model demonstrates high performance metrics, including ROC-AUC scores exceeding 97%, validating its effectiveness.

These validated aspects of the problem statement emphasize the necessity for innovative computational frameworks that address integration, imbalance, and scalability.
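As referenced under VP1, the following hedged sketch illustrates the dual feature representation, concatenating RDKit MACCS keys for the drug with the target’s amino acid composition; the aspirin SMILES and toy sequence are illustrative inputs only.

```python
# Hedged sketch of the VP1 dual representation: MACCS keys for the drug
# (via RDKit) concatenated with the target's amino acid composition.
# The aspirin SMILES and toy sequence are illustrative inputs only.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def drug_features(smiles: str) -> np.ndarray:
    mol = Chem.MolFromSmiles(smiles)
    return np.array(list(MACCSkeys.GenMACCSKeys(mol)), dtype=float)   # 167 bits

def target_features(seq: str) -> np.ndarray:
    seq = seq.upper()
    return np.array([seq.count(a) / len(seq) for a in AMINO_ACIDS])   # 20 dims

pair_vector = np.concatenate([
    drug_features("CC(=O)OC1=CC=CC=C1C(=O)O"),            # aspirin
    target_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"),  # toy target sequence
])
print(pair_vector.shape)   # (187,)
```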

Validation of research questions

  • VRQ1: How can the proposed ML model improve the prediction accuracy of drug-target interactions compared to traditional ML/DL models?

    To address this research question, the proposed ML models, including RFC and the hybrid GAN+RFC model, were evaluated against traditional ML and DL models, such as DTC, RFC, MLP, FCNN, MHA-FCNN, DeepLPI, BarlowDTI and Komet. The comparison revealed that the GAN+RFC hybrid model outperforms traditional models, demonstrating superior prediction accuracy and robustness in DTI classification. Specifically, the integration of GAN for data augmentation effectively addresses the issue of class imbalance, leading to more accurate and reliable predictions. The GAN+RFC model consistently achieved higher accuracy, precision, and recall, surpassing the performance of DTC and MLP models, thereby validating the effectiveness of the proposed models in improving prediction accuracy for DTI (Tables 4, 5 and 6).

  • VRQ2: What are the benefits of GAN data balancing technique for DTI prediction?

    The application of GAN for data balancing offers several significant benefits in DTI prediction. One of the primary challenges in DTI prediction is the class imbalance, where positive interactions (drug-target pairs) are often underrepresented compared to negative interactions. GAN-based data augmentation generates synthetic samples of underrepresented positive interactions, thereby balancing the class distribution and reducing bias in the model. This leads to more accurate predictions for positive interactions, as the model is trained on a more balanced and representative dataset (Table 9). The use of GAN has shown substantial improvements in classification performance, particularly in precision and recall for identifying drug-target interactions, confirming its value as a data balancing technique for DTI prediction.

  • VRQ3: Can the proposed method enhance predictive performance and scalability in drug discovery?

    The proposed method, combining GAN-based data augmentation with the RFC model, demonstrates significant improvements in both predictive performance and scalability in drug discovery. The hybrid approach achieves higher accuracy and robustness, especially when dealing with large and complex datasets (Tables 4, 5 and 6). By leveraging GAN to augment the training data, the method can handle class imbalances and improve generalization, leading to more reliable DTI predictions. Additionally, the use of RFC, a scalable ML algorithm, ensures that the model can be effectively applied to large-scale drug discovery efforts, where high volumes of data need to be processed efficiently. The results validate that the proposed method enhances not only the predictive performance but also the scalability, making it a promising solution for accelerating the drug discovery process.

Hypothesis validation

To validate the hypothesis, extensive experiments were conducted on the BindingDB_Kd, BindingDB_Ki, and BindingDB_IC50 datasets, evaluating the performance of the proposed model against traditional ML and DL models.

  • Accuracy Improvement: The proposed model (GAN+RFC) achieved significantly higher accuracy (e.g., ROC-AUC scores of 99.42% for BindingDB_Kd, 97.32% for BindingDB_Ki and 98.97% for BindingDB_IC50) compared to traditional ML/DL models, demonstrating its superior capability in learning complex relationships between drug-target pairs.

  • Data Balancing via GANs: Using GANs to address class imbalance improved the model’s performance on minority class predictions, as evidenced by increased sensitivity and reduced false-negative rates.

  • Feature Representation: The integration of MACCS fingerprints and amino acid composition captured diverse and complementary information, leading to improved feature representation and classification performance.

  • Scalability: The hybrid model demonstrated consistent performance across datasets of varying sizes, validating its scalability and robustness.

These results confirm the hypothesis, highlighting the effectiveness of integrating advanced data balancing techniques, feature engineering, and hybrid ML approaches in predicting DTIs with high precision and reliability.

Clinical applicability for drug repositioning

The proposed GAN+RFC model demonstrates not only its robustness and accuracy in predicting drug-target interactions (DTIs) but also its potential to contribute significantly to drug repositioning efforts, a crucial area in drug discovery. Drug repositioning, the process of identifying new therapeutic uses for existing drugs, is of particular importance for accelerating drug development timelines and reducing associated costs. The following discussion highlights the clinical applicability of our model in this context, along with practical examples to underscore its relevance to healthcare and pharmacology.

Guiding preclinical studies

The model’s ability to predict DTIs with high sensitivity and specificity provides a powerful tool for identifying potential off-target effects or new therapeutic targets of existing drugs. For instance:

  • Example 1: Anti-viral drugs such as Remdesivir, initially developed for hepatitis C, were repositioned for treating COVID-19. Using our model, such potential off-target interactions could be identified earlier by analyzing drug-target relationships across different disease pathways.

  • Example 2: Statins, commonly used to lower cholesterol, have been found to exhibit anti-inflammatory properties. Our model could aid in predicting similar novel interactions, guiding preclinical studies to validate these findings.

By accurately predicting interactions, the model minimizes the experimental burden associated with screening vast chemical libraries, focusing preclinical studies on the most promising candidates.

Enhancing clinical studies

The predictions generated by the GAN+RFC model can also support clinical study design by prioritizing drug candidates with higher probabilities of success. For example:

  • Example 1: Drugs identified as potential repositioning candidates for rare diseases can be prioritized, addressing the unmet need for orphan disease treatments.

  • Example 2: Anti-cancer drugs can be repurposed to target specific mutations or pathways, allowing for personalized treatment strategies based on the predicted interactions with mutated proteins in cancer patients.

These insights could expedite the clinical validation phase, leading to faster approval processes for repurposed drugs.

In summary, the proposed GAN+RFC model offers substantial value not only in advancing DTI prediction but also in fostering translational research that bridges computational predictions with practical applications in preclinical and clinical drug discovery processes. This reinforces the significance for the healthcare and pharmacology communities, showcasing its potential to drive innovation in drug repositioning and precision medicine.

Conclusion

In this study, we have proposed a hybrid framework that effectively integrates advanced feature engineering, data balancing, and ML techniques to address the challenges in DTI prediction. The core contribution of our work is the combination of drug fingerprints (MACCS keys) and target compositions (amino acid sequences) into a unified feature representation, allowing for a more comprehensive capture of the biochemical and structural information of both drugs and targets. By employing GANs for data balancing, we successfully mitigated the issues arising from class imbalance, enhancing the model’s sensitivity and reducing false negatives. Our experimental results, evaluated on the BindingDB-Kd, BindingDB-Ki, and BindingDB-IC50 datasets, demonstrate that the proposed GAN+RFC model outperforms existing SOTA methods in terms of sensitivity, specificity, and AUC-ROC. Specifically, our model achieves an AUC-ROC of 99.42% on the BindingDB-Kd dataset, 97.32% on the BindingDB-Ki dataset, and 98.97% on the BindingDB-IC50 dataset, outperforming other prominent DTI prediction models on these critical evaluation metrics. This approach not only enhances prediction accuracy but also provides a robust solution for DTI prediction that generalizes well across different datasets. Furthermore, our method is computationally efficient, making it suitable for large-scale applications in drug discovery.

While the proposed framework demonstrates significant improvements in DTI prediction, it has certain limitations. It does not incorporate transformer-based DL models, which are known for capturing long-range dependencies effectively. Additionally, advanced feature fusion techniques to combine diverse data representations and few-shot learning methods to address limited data scenarios were not included in this study.

In future research, we plan to integrate transformer-based DL models to enhance the ability to capture intricate relationships within drug and target data. Incorporating advanced feature fusion techniques will enable the model to better leverage complementary information from diverse data sources. Furthermore, the use of few-shot learning methods will allow the framework to perform effectively in scenarios with limited data, making it more robust and generalizable across various drug discovery applications.

Acknowledgements

The authors would like to extend their sincere appreciation to the Ongoing Research Funding Program (ORF-2025-301), King Saud University, Riyadh, Saudi Arabia.

Author contributions

Md. Alamin Talukder: Conceptualization, Data curation, Methodology, Software, Resource, Visualization, Investigation, Formal Analysis, Supervision, Writing-original draft and review & editing. Mohsin Kazi: Methodology, Resource, Formal Analysis, Visualization, Investigation, Validation, Supervision, Writing-review & editing. Ammar Alazab: Methodology, Resource, Formal Analysis, Visualization, Investigation, Validation, Writing-review & editing.

Funding

This research project was supported by the Ongoing Research Funding Program (ORF-2025-301), King Saud University, Riyadh, Saudi Arabia.

Data availability

The selected datasets are sourced from free and open-access sources, such as DTI Data: https://tdcommons.ai/multi_pred_tasks/dti/#bindingdb.

Declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Ethical approval

Not applicable

Consent to participate

Not applicable

Consent to Publish

Not applicable

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Change history

7/19/2025

The original online version of this Article was revised: The original version of this Article contained an error in Affiliation 2, which was incorrectly given as ‘Department of Pharmacognosy, College of Pharmacy, King Saud University, P.O. BOX-2457, Riyadh, 11451, Saudi Arabia’. The correct affiliation is ‘Department of Pharmaceutics, College of Pharmacy, King Saud University, P.O. Box-2457, Riyadh 11451, Saudi Arabia’. In addition, the grant number in the Acknowledgements and Funding section was incorrect. The correct Acknowledgements section now reads: “The authors would like to extend their sincere appreciation to the Ongoing Research Funding Program (ORF-2025-301), King Saud University, Riyadh, Saudi Arabia”. The correct Funding section now reads: “This research project was supported by the Ongoing Research Funding Program (ORF-2025-301), King Saud University, Riyadh, Saudi Arabia”.

Contributor Information

Md. Alamin Talukder, Email: alamin.cse@iubat.edu.

Mohsin Kazi, Email: mkazi@ksu.edu.sa.

Ammar Alazab, Email: ammar.alazab@torrens.edu.au.

References

  • 1. Siddiqui, B., Yadav, C. S., Akil, M., Faiyyaz, M., Khan, A. R., Ahmad, N., Hassan, F., Azad, M. I., Owais, M., Nasibullah, M., et al. Artificial intelligence in computer-aided drug design (CADD) tools for the finding of potent biologically active small molecules: Traditional to modern approach. Combinatorial Chemistry & High Throughput Screening (2025).
  • 2. Marques, L. et al. Advancing precision medicine: A review of innovative in silico approaches for drug development, clinical pharmacology and personalized healthcare. Pharmaceutics 16(3), 332 (2024).
  • 3. Xing, Q., Cheng, W., Wang, W., Jin, C. & Wang, H. Drivers of innovation value: simulation for new drug pricing evaluation based on system dynamics modelling. Frontiers in Pharmacology 16 (2025).
  • 4. Qahwaji, R. et al. Pharmacogenomics: a genetic approach to drug development and therapy. Pharmaceuticals 17(7), 940 (2024).
  • 5. Uma, E., Mala, T., Geetha, A. & Priyanka, D. A comprehensive survey of drug-target interaction analysis in allopathy and siddha medicine. Artificial Intelligence in Medicine, 102986 (2024).
  • 6. Abdul Raheem, A. K. & Dhannoon, B. N. Comprehensive review on drug-target interaction prediction: latest developments and overview. Curr. Drug Discovery Technol. 21(2), 56–67 (2024).
  • 7. Yang, Y. & Cheng, F. Artificial intelligence streamlines scientific discovery of drug–target interactions. British Journal of Pharmacology (2025).
  • 8. Kuo, D.-P. et al. Estimating the volume of penumbra in rodents using DTI and stack-based ensemble machine learning framework. Eur. Radiol. Experim. 8(1), 59 (2024).
  • 9. Wei, B., Zhang, Y. & Gong, X. DeepLPI: a novel deep learning-based model for protein-ligand interaction prediction for drug repurposing. Sci. Rep. 12(1), 18200 (2022).
  • 10. Pei, Q., Wu, L., He, Z., Zhu, J., Xia, Y., Xie, S. & Yan, R. Exploiting pre-trained models for drug target affinity prediction with nearest neighbors. In: Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 1856–1866 (2024).
  • 11. Zhu, Z. et al. Drug-target binding affinity prediction model based on multi-scale diffusion and interactive learning. Expert Syst. Appl. 255, 124647 (2024).
  • 12. Schuh, M. G., Boldini, D., Bohne, A. I. & Sieber, S. A. Barlow twins deep neural network for advanced 1D drug-target interaction prediction. arXiv preprint arXiv:2408.00040 (2024).
  • 13. Guichaoua, G., Pinel, P., Hoffmann, B., Azencott, C.-A. & Stoven, V. Advancing drug-target interactions prediction: Leveraging a large-scale dataset with a rapid and robust chemogenomic algorithm. bioRxiv, 2024-02 (2024).
  • 14. Pei, Q., Wu, L., Zhu, J., Xia, Y., Xie, S., Qin, T., Liu, H. & Liu, T.-Y. SMT-DTA: Improving drug-target affinity prediction with semi-supervised multi-task training. arXiv preprint arXiv:2206.09818 (2022).
  • 15. Liu, T., Lin, Y., Wen, X., Jorissen, R. N. & Gilson, M. K. BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities. Nucleic Acids Research 35(suppl_1), 198–201 (2007).
  • 16. Talukder, M. A., Talaat, A. S. & Kazi, M. HXAI-ML: a hybrid explainable artificial intelligence based machine learning model for cardiovascular heart disease detection. Res. Eng. 25, 104370 (2025).
  • 17. Talukder, M. A., Hossen, R., Uddin, M. A., Uddin, M. N. & Acharjee, U. K. Securing transactions: A hybrid dependable ensemble machine learning model using IHT-LR and grid search. Cybersecurity 7(1), 32 (2024).
  • 18. Talukder, M. A., Khalid, M. & Sultana, N. A hybrid machine learning model for intrusion detection in wireless sensor networks leveraging data balancing and dimensionality reduction. Sci. Rep. 15(1), 4617 (2025).
