Cell Reports Methods. 2025 Sep 24;5(10):101184. doi: 10.1016/j.crmeth.2025.101184

In silico methods for drug-target interaction prediction

Xiaoqing Ru 1,2, Lifeng Xu 1, Wu Han 3,∗, Quan Zou 2,4,∗∗
PMCID: PMC12570315  PMID: 40997795

Summary

Drug-target interaction (DTI) prediction is a crucial component of drug discovery. In recent years, in silico approaches have attracted attention for DTI prediction, primarily because of their potential to mitigate the high costs, low success rates, and extensive timelines of traditional drug development, while efficiently using the growing amount of available data. This review identifies four major factors that influence DTI predictions, highlights persistent challenges, and proposes insights and strategies from the perspectives of data, features, and experimental setups to address these challenges. Furthermore, it emphasizes the importance of refining established approaches—such as the “guilt-by-association” concept—to manage data sparsity, and integrating emerging technologies, including large language models and AlphaFold, to advance feature engineering. We hope that this work will provide valuable guidance and novel perspectives for advancing future research on DTI predictions.

Keywords: drug-target interaction prediction, in silico approaches, DTI prediction strategies


In this review, Ru et al. present actionable strategies for in silico DTI/DTA prediction, covering dataset curation, true-negative construction, and rigorous cold-start evaluation. By integrating multimodal data, heterogeneous networks, LLMs, and AlphaFold, this work outlines a pathway toward more reliable, interpretable, and translational drug discovery pipelines.

Introduction

Drugs are essential for improving human health, extending life expectancy, and enhancing the quality of life.1,2 The development of new drugs is fundamental to medical progress and has driven the discovery of innovative therapies, facilitating more effective disease management. Many once-untreatable conditions, such as specific cancers and infectious diseases, have become manageable through advances in drug discovery. Likewise, although chronic diseases such as diabetes, hypertension, and cardiovascular disorders remain incurable, drug research has substantially improved patient survival. Thus, drug development remains a cornerstone of medical advancement and a vital contributor to global health outcomes.3

Recent breakthroughs in technologies such as artificial intelligence (AI), gene editing, and high-throughput screening have accelerated the drug development process.4,5,6,7 In early 2020, BenevolentAI leveraged its AI platform to identify baricitinib, a JAK inhibitor, as a potential therapeutic agent capable of inhibiting SARS-CoV-2 entry into host cells. This discovery led to the inclusion of baricitinib in multiple clinical trials for COVID-19 treatment and ultimately resulted in its emergency use authorization by the U.S. Food and Drug Administration (FDA).8,9 Similarly, in 2019, Insilico Medicine leveraged its generative AI platform to design and optimize a novel drug candidate for idiopathic pulmonary fibrosis within just 46 days. The compound entered clinical trials in 2022, making it the world’s first AI-designed drug candidate to reach the clinical stage.10 A recent FDA report noted the approval of 29 new drugs in the first 9 months of 2024, reflecting sustained momentum in U.S. drug development and the agency’s regulatory efficiency. Despite the increased number of approvals, long-standing challenges persist. The development of a new drug, from initial research to the market, typically requires approximately $2.3 billion and spans 10–15 years.11,12,13 These high costs often pose entry barriers for small- and medium-sized pharmaceutical companies, while prolonged timelines compromise the industry’s capacity to respond rapidly to public health emergencies. Moreover, such extensive investments in time and cost do not necessarily correlate with improved success rates in drug development.14 Recent data indicate that the overall success rate fell to 6.3% by 2022, meaning that over 90% of drug candidates ultimately fail to reach the market.15 These unsuccessful projects further increase research and development costs, underscoring the financial challenges inherent to drug discovery.

Drug development typically comprises five stages: discovery, preclinical research, clinical trials, regulatory approval, and post-marketing surveillance. Each stage has distinct objectives and challenges. Given the high costs, extended timelines, and considerable risks associated with this process, researchers have strived to enhance the efficiency and cost effectiveness across all stages. Drug-target interaction (DTI) prediction is a pivotal component of the discovery phase and is integral to advancing new drug development.16 Accurate target prediction and drug molecule optimization help mitigate the risk of clinical trial failures. Precise target identification minimizes the validation of ineffective drug-target pairs, allowing for more focused experimentation and efficient resource utilization. DTI prediction aids in identifying potential off-target effects, facilitating early detection of safety risks, and thereby improving drug safety.17 It is also valuable for identifying multi-target drugs that are promising for complex disease treatment. Consequently, DTI predictions have attracted considerable attention in drug research.

With the extensive growth in bioactivity data, compound libraries, and protein sequence data, in silico methods have become powerful tools for predicting DTIs.18,19,20,21 These computational approaches enable the preliminary screening of thousands of compounds, notably reducing the reliance on labor-intensive experimental validations and accelerating the drug development pipeline. Recent work by Tanoli et al.,22 a large-scale literature analysis of 3,286 DTI prediction studies, emphasized the importance of combining computational strategies with experimental assays to ensure the biological and translational relevance of DTI models. Their review highlighted the lack of rigorous validation practices across the field. Building on this foundation, this review further summarizes four key factors that influence DTI prediction and systematically outlines the persistent challenges faced by current computational approaches. We propose targeted strategies and insights to address critical issues such as data sparsity and feature representation. We hope this work contributes to narrowing the significant gap between computational prediction and experimental validation and promotes the integration of DTI research into practical drug discovery pipelines.

Application of in silico methods in DTI prediction

Early in silico-based approaches

Early in silico methods for DTI prediction primarily focused on molecular docking and ligand-based virtual screening techniques.23 Molecular docking, the earliest computational approach in this field, was introduced by Kuntz et al. in 1982.24 This technique uses the three-dimensional (3D) structure of target proteins to position candidate drug molecules within the active sites, thereby simulating potential binding interactions.25 Docking algorithms estimate binding free energies to predict the most favorable binding configuration between drugs and their targets.

Ligand-based virtual screening methods,26 such as quantitative structure-activity relationship (QSAR)27 and pharmacophore models,28 predict new drug candidates by leveraging known bioactivity data. QSAR models establish mathematical correlations between the molecular structure and bioactivity. Pharmacophore models identify the spatial arrangements of functional groups that are essential for bioactivity. By capturing the shared characteristics among bioactive compounds, pharmacophore models facilitate the efficient virtual screening of compound libraries for structurally similar candidates.
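To make the notion of a QSAR "mathematical correlation" concrete, the following minimal sketch fits a single-descriptor linear model by ordinary least squares. It is an illustration only: the helper names `fit_linear_qsar` and `predict_activity` are hypothetical, the descriptor and activity values are synthetic placeholders rather than measured data, and practical QSAR models generally use many descriptors and more flexible regressors.

```python
from statistics import mean


def fit_linear_qsar(descriptor, activity):
    """Ordinary least squares fit of activity = slope * descriptor + intercept."""
    if len(descriptor) != len(activity) or len(descriptor) < 2:
        raise ValueError("need two or more paired observations")
    x_bar, y_bar = mean(descriptor), mean(activity)
    sxx = sum((x - x_bar) ** 2 for x in descriptor)
    if sxx == 0:
        raise ValueError("descriptor has no variance")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict_activity(model, descriptor_value):
    """Apply a fitted (slope, intercept) pair to a new descriptor value."""
    slope, intercept = model
    return slope * descriptor_value + intercept


# Synthetic toy data (NOT real measurements): activities lie on y = 2x + 1.
toy_descriptor = [0.0, 1.0, 2.0, 3.0]
toy_activity = [1.0, 3.0, 5.0, 7.0]
toy_model = fit_linear_qsar(toy_descriptor, toy_activity)
```

A fitted model of this kind can only extrapolate along a straight line, which is one way to see why strictly linear structure-activity assumptions struggle with more complex binding behavior.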

Early in silico methods of DTI prediction have notable limitations. Molecular docking is highly dependent on the availability of protein 3D structures, which were scarce or difficult to obtain during the early years of its application.29,30 Although homology modeling can approximate unknown structures, its accuracy declines significantly when sequence similarity between the template and target protein is low.31 Ligand-based methods also assume linear relationships between the chemical structure and biological activity, yet real-world molecular interactions are often complex and nonlinear, making such methods insufficient for capturing dynamic binding behaviors.

Moreover, these techniques rely heavily on known active compounds, limiting their potential to explore novel chemical spaces and reducing their general applicability in early-stage drug discovery.32

In summary, the limitations of early in silico methods for DTI prediction, such as their dependency on 3D structural data, inadequacy in capturing complex structure-activity relationships, and difficulties in addressing data scarcity, have catalyzed the adoption and development of machine learning techniques in DTI prediction.

Machine learning-based methods

The advent of machine learning has led to substantial breakthroughs in DTI prediction. Machine learning has become a powerful tool for drug screening and target identification because it enables computational models to autonomously learn patterns and relationships from data. Yamanishi et al.33 pioneered machine learning-based DTI prediction by constructing a dual-layer model that integrates chemical and genomic information.

Various machine learning algorithms and methods have been applied to DTI prediction, each contributing unique strategies and frameworks. Table 1 summarizes the milestones and representative studies in the field, and the following section highlights several influential studies and applications.

  • KronRLS: this approach integrates drug chemical structure similarity with the Smith-Waterman similarity scores of target sequences within a Kronecker regularized least-squares framework. This is the first study to formally define the DTI prediction problem as a regression task, laying the foundation for quantitative DTI prediction.34

  • SimBoost: the first nonlinear approach for continuous DTI prediction, SimBoost introduces prediction intervals as a confidence measure and interpretable features derived from drug similarity matrices, protein similarity matrices, drug-target affinity matrices, and neighboring relationships.35

  • DGraphDTA: DGraphDTA was the first method to construct protein graphs from protein contact maps, leveraging the spatial information inherent in protein structures. A protein contact map is a two-dimensional (2D) matrix that captures residue-residue interactions within the 3D structure of a protein, which is essential for accurately predicting binding affinities.36

  • MT-DTI: this model was the first to apply attention mechanisms to drug representation, addressing the limitations of convolutional neural network (CNN)-based methods in capturing associations between distant atoms, thereby improving the interpretability and predictive power of DTI models.37

  • MVGCN: unlike most supervised learning approaches for DTI prediction, this method introduces a multiview graph convolutional network (MVGCN) framework for link prediction within biomedical bipartite networks. By integrating similarity networks with bipartite networks, MVGCN constructs a multiview heterogeneous network and uses self-supervised learning for the initial node embeddings.38

  • DrugVQA: adapting concepts from visual question answering (VQA), DrugVQA frames the drug-protein interaction (DPI) task as a VQA problem. The protein’s distance map is used as the “image,” the drug’s SMILES string as the “question,” and the interaction prediction as the “answer,” providing an innovative perspective on DPI tasks.39

  • DeepAffinity: this model captures nonlinear dependencies between protein residues and compound atoms through unsupervised pretraining. These “long-distance” dependencies are crucial for compound-protein interactions, as residues or atoms that are far apart in the sequence but in proximity within the 3D space may participate jointly in molecular interactions.40

  • BridgeDPI: while learning-based methods focus on individual DPIs, BridgeDPI introduces “guilt-by-association” principles to enhance network-level information, effectively combining network- and learning-based approaches to improve DTI prediction.41

  • DTINet: DTINet integrates data from diverse sources (e.g., drugs, proteins, diseases, and side effects) and learns low-dimensional representations of drugs and proteins to manage noise, incompleteness, and high-dimensional characteristics of large-scale biological data.42

  • DeepICL: this model characterizes four protein-ligand interaction patterns (hydrophobic interactions, hydrogen bonds, salt bridges, and π–π stacking) and proposes a 3D molecular generation framework that incorporates interaction-aware features to advance structure-based drug design.43

  • MMDG-DTI: leveraging pre-trained large language models (LLMs), MMDG-DTI captures generalized text features across biological vocabulary, demonstrating strong capabilities in handling unseen samples and extracting robust, discriminative features.44

  • NerLTR-DTA: framing DTI prediction as a ranking task, NerLTR-DTA uses learning-to-rank (LTR) principles, creating unique queries for applications across multiple scenarios, including the discovery of novel drugs and targets.45

  • BHCNS: this method improves prediction accuracy by identifying reliable negative samples and applying an inverse hypothesis in which proteins dissimilar to a known target are unlikely candidates for a compound’s interaction. This approach enhances the sample reliability and distinguishes BHCNS from conventional compound-protein interaction (CPI) prediction models.46
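The Kronecker regularized least-squares idea behind KronRLS in the list above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published implementation: it assumes a fully observed affinity matrix, builds the pairwise kernel explicitly as the Kronecker product of a drug similarity matrix and a target similarity matrix, and solves the regularized system with a plain Gaussian elimination. The function names (`kronecker`, `solve`, `kron_rls_fit`) and the regularization parameter `lam` are hypothetical, and the explicit product is only practical for very small matrices.

```python
def kronecker(a, b):
    """Kronecker product of two square matrices given as lists of lists."""
    nb = len(b)
    size = len(a) * nb
    return [[a[r // nb][c // nb] * b[r % nb][c % nb] for c in range(size)]
            for r in range(size)]


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def kron_rls_fit(drug_sim, target_sim, affinity, lam=1.0):
    """Return fitted affinities (same shape as `affinity`) for all drug-target pairs."""
    n_t = len(target_sim)
    pair_kernel = kronecker(drug_sim, target_sim)
    # Row-major flattening: pair index = drug_index * n_t + target_index.
    y = [value for row in affinity for value in row]
    regularized = [[pair_kernel[r][c] + (lam if r == c else 0.0)
                    for c in range(len(y))] for r in range(len(y))]
    alpha = solve(regularized, y)  # dual coefficients: (K + lam * I) alpha = y
    fitted = [sum(k * a for k, a in zip(row, alpha)) for row in pair_kernel]
    return [fitted[i * n_t:(i + 1) * n_t] for i in range(len(drug_sim))]
```

For instance, `kron_rls_fit([[1.0, 0.5], [0.5, 1.0]], [[1.0]], [[1.0], [2.0]], lam=0.5)` fits two drugs against a single target using synthetic similarity values; larger `lam` shrinks the fitted affinities more strongly toward zero.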

Table 1. Milestones and representative studies in DTI and DTA prediction

Method | Drug feature | Protein feature | Explanation
AGraphDTA47 | drug molecule graph features based on GNN | amino acid sequence features based on CNN; protein contact map features based on GCN | the protein graph features and amino acid sequence features, extracted via the neural network and having the same dimensions, are fused using element-wise matrix addition, thereby providing a more comprehensive representation of the complex information of the protein
BINDTI48 | drug molecule graph features based on GCN | amino acid sequence features based on self-attention mechanism and convolution | the features of the drug and the target are fused through a bidirectional intention network, which combines the intention mechanism and the multi-head attention mechanism
BindingSiteDTI49 | GNN-based fixed-scale substructure features of drug molecules | GNN-based multiscale substructure features of protein maps | explicitly extracts the multiscale substructure of the target and the fixed-scale substructure of the drug to facilitate the identification of structurally similar substructure markers and models hidden relationships at the substructure level to construct interactive features
BiComp-DTA50 | extracting features from drug sequences based on separable CNN layers | extracting features from encoded protein sequences using fully connected neural networks | captures informative features from protein sequences, a unified measure, BiComp, is proposed that builds on alignment-free (i.e., Lempel-Ziv-Markov chain algorithm [LZMA]) and alignment-based (i.e., Smith-Waterman) similarity measures
CPInformer51 | functional class fingerprint (FCFP) is fused with structural GCN features as the final representation of the compound | multi-scale protein features with different receptive fields extracted from different network layers | ProbSparse self-attention is applied to protein features under the guidance of compound features to eliminate information redundancy and improve the accuracy of CPInformer
CPGL52 | representing drug molecule information based on graph attention network (GAT) | representing protein information based on long short-term memory neural network (LSTM) | LSTM is able to learn the relationship between words that are far apart in a sequence. This unique use of LSTM can better extract the spatial features of protein structure
DeepDTA53 | learning representations from drug SMILES sequences based on CNN | learning representations from raw protein sequences based on CNN | first model for predicting DTA based on deep learning
DeepConv-DTI54 | extracting drug fingerprint information based on fully connected layers | extracting local residue patterns of target protein sequences based on CNN | method for capturing local residue patterns using CNN successfully enriched the protein features in the original sequence
DataDTA55 | fingerprint characterization of compound structure information based on algebraic graphs, extracting SMILES sequence features based on CNN | protein binding pocket descriptor, extracting protein sequence features based on CNN | a dual-interaction aggregation neural network strategy is developed to ensure effective learning of multi-scale interaction features
DeepGLSTM56 | a graph convolutional network (GCN) module is introduced, using the power graph representation to process drug compounds | processing protein sequences using a bidirectional LSTM layer | method based on GCN and LSTM, capable of predicting the binding affinity values between FDA-approved drugs and SARS-CoV-2 viral proteins
DrugVQA39 | extract SMILES features based on multi-head self-attention BiLSTM | feature extraction based on dynamic CNN protein distance map | interpretable model inspired by the visual question answering (VQA) paradigm, which can directly predict DPI based on protein distance maps and molecular SMILES
DeepAffinity40 | extracting SMILES sequence features based on bidirectional recurrent neural network (RNN) | represent protein sequences with a new alphabet of structural and physicochemical properties and then extract features based on a bidirectional RNN | the performance on new protein categories with limited labeled data can be further improved through transfer learning. Additionally, we developed independent and joint attention mechanisms and embedded them into our model to enhance its interpretability
DGraphDTA36 | extracting drug molecule graph features based on GNN | extracting protein contact graph features based on GNN | the proposed method is the first attempt to construct a protein graph based on the protein contact graph
DeepCDA57 | encoding drug SMILES based on CNN | encoding protein sequences based on LSTM | the adversarial domain adaptation method is used to learn the feature encoder network for the test domain to handle the different distributions of training and test domains
DeepEmbedding-DTI58 | extracting drug molecule graph features using GNN with attention mechanism | the bidirectional encoder representation from Transformer (BERT) model learns embedding vectors from the text composed of these protein words | to improve the training efficiency, a Bidirectional Encoder Representation from the Transformer pre-training method is introduced to extract substructure features from protein sequences, and a local breadth-first search is introduced to learn subgraph information from molecular graphs
FusionDTA59 | BiLSTM is used as a feature encoder for drug SMILES | BiLSTM is used as a feature encoder for protein sequences | to address the loss of implicit information, a novel multi-head linear attention mechanism is employed to replace the coarse pooling methods. FusionDTA is able to aggregate global information based on attention weights, rather than selecting the maximum information like max pooling
GraphCPI60 | using GNNs to learn graph representations of compounds building blocks | using CNNs to learn low-dimensional vector representations of protein sequences | a framework that combines advanced graph neural representations of compounds with pre-trained embedding techniques for protein sequences is proposed. To the best of our knowledge, this study is the first to integrate local chemical context and topological structures to learn the interactions between compound-protein pairs
GraphDTA61 | drug molecule graph representation learning based on GCN/GAT/GIN/GAT-GCN | protein sequence learning based on CNN | the first study to use GNN for DTA prediction
HyperAttentionDTI62 | using CNN to learn the feature matrix of drugs | using CNN to learn the feature matrix of proteins | unlike previous attention-based models, our model infers an attention vector for each amino acid-atom pair. These attention vectors not only capture the interactions between amino acids and atoms but also control the representation of features across channels
IIFDTI63 | GAT is used to extract independent features of drugs, which can capture the topological information between atoms | convolutional structures with different kernel scales are used to extract independent features of proteins | fuses the interaction and independent features between drugs and targets to predict DTI
MIDTI64 | GCN is used as an encoder to learn drug embeddings from integrated drug similarity networks | GCN is also used as an encoder to learn target embeddings from integrated target similarity networks | a novel multi-view similarity network fusion strategy is proposed, which leverages a multi-view attention mechanism to integrate different similarity networks in an unsupervised manner, as long as the nodes and sizes of these networks are consistent
MGraphDTA65 | extracting drug features based on multi-scale graph neural network | multi-scale convolutional neural network to extract target features | a multi-scale graph neural network and a novel visual explanation method called gradient-weighted affinity activation mapping are proposed for DTA prediction and interpretation
MCANet66 | learning low-dimensional representation features from drug sequences based on CNN | learning low-dimensional representation features from protein sequences based on CNN | using a cross-attention mechanism to extract interaction features between drugs and proteins, the feature representation capability of drugs and proteins is enhanced. Additionally, the PolyLoss function is employed to mitigate the overfitting and class imbalance issues in drug-target datasets
MATT-DTI67 | extracting sequence features based on CNN | extracting sequence features based on CNN | we propose a relation-aware self-attention module to model drugs from SMILES data while considering the correlations between atoms. The relative self-attention module enhances the relative positional information between atoms in the compound, while accounting for the relationships between all elements
MMDG-DTI44 | a graph convolutional network is applied to extract the relationship between atoms | the local features of the protein are represented by a 3D-CNN to represent the spatial features of the binding site. The global features of the protein are represented by a one-dimensional convolutional neural network to represent the features of the amino acid sequence | we propose a multi-scale convolutional network that utilizes different types of convolutional networks to extract local and global features of proteins as well as topological features of compounds
MFR-DTA68 | processing FCFP information based on BioMLP and extracting molecular structure features based on GNN | based on BioCNN processing sequence, amino acid embedding and word embedding information | BioMLP/CNN is the first module designed to extract individual features of biological sequence elements, simultaneously extracting both individual and relational features of elements in a sequence
Tsubaki et al.69 | extracting compound subgraph features using graph neural networks | extract sequence features using convolutional neural networks | using neural attention mechanisms alleviates the issue of poor interpretability in the black-box nature of deep learning, allowing us to identify which subsequences in the protein are more important when predicting the interactions of drug compounds
TC-DTA70 | extracting sequence features based on CNN | use the encoder module of Transformer to extract amino acid sequence features | the results of this study demonstrate the effectiveness of the Transformer encoder and CNN in extracting meaningful representations from sequences
TEFDTA71 | extracting molecular features based on MACCS fingerprints and encoders in Transformer | extracting protein sequence features based on CNN | most methods have been primarily developed for predicting non-covalent binding affinity, and currently, there are no deep learning methods specifically designed for predicting covalent binding affinity. In this paper, we propose a new model for predicting both covalent (bonded) and non-covalent (non-bonded) binding affinities in drug-protein interactions
TransformerCPI72 | solving the molecular representation problem based on GCN | the word2vec of the protein is obtained based on its sequence, and then the sequence feature vector of the protein is passed to the encoder to learn a more abstract representation of the protein | learns the required interaction features and reduces the risk of hidden ligand bias. By mapping attention weights onto protein sequences and compound atoms, we can explore the interpretability of the model, which helps us determine whether the predictions are reliable and physically meaningful
TransformerCPI2.073 | extracting compound molecular graph features based on GCN | calculate protein sequence representation using the pre-trained protein language model TAPE-BERT | demonstrates that sequence-to-drug models can achieve virtual screening performance close to structure-based approaches (without relying on any prior knowledge of protein 3D structures) and also validates the feasibility of applying this concept to drug discovery

Representative machine learning and deep learning methods are listed together with their drug-side features, protein-side features, and explanatory notes. The “explanation” column highlights the distinctive aspects and meaningful contributions of each study, summarizing what makes the corresponding method noteworthy.

Major factors influencing DTI predictions

Four key factors influencing DTI prediction have been identified: problem formulation (binary classification or regression), data quality and quantity, feature engineering, and the experimental setup.

Problem formulation

Most studies have formulated DTI prediction as a binary classification task to determine whether a drug (typically, a small-molecule compound) interacts with a biological target (such as a protein, enzyme, or receptor).74 The affinity between a drug and its target, which reflects the binding strength of the interaction, is a critical indicator of drug efficacy. Therefore, affinity prediction constitutes a specialized and precise subtask within DTI prediction.75

Affinity prediction is generally formulated as a regression task and requires sophisticated models to accurately quantify the binding strength between drugs and targets. Although DTI and drug-target affinity (DTA) predictions focus on different aspects of drug development, they are complementary and critical. DTI prediction provides a foundation for drug discovery by identifying potential interactions, whereas DTA prediction provides detailed insights into optimizing drug properties. Both DTI and DTA prediction tasks have attracted increasing interest and considerable research investments. Figure 1 outlines the steps involved in DTI and DTA prediction using machine learning approaches (A) and illustrates the number of studies on DTI and DTA conducted in recent years (B).

Figure 1. Data inputs, modeling workflow, and research trends in in silico DTI/DTA prediction

(A) Drug attributes, target attributes, and drug-target relationships provide the fundamental data basis for in silico prediction. In different studies, these data are represented in forms such as feature vectors, interaction networks, or affinity matrices and then transformed into inputs for model construction. The subsequent machine learning and deep learning workflow typically includes preprocessing, feature encoding, model building, training, and evaluation for two main tasks: drug-target interaction (DTI) classification and drug-target affinity (DTA) regression.

(B) Annual number of published studies on DTI and DTA prediction in recent years, showing the steady growth of research activity in both fields.

Data quality and quantity

Data quality and quantity are critical factors that influence the performance and generalization capabilities of a model. Models trained on unrepresentative data may perform adequately in specific contexts but struggle to generalize to novel situations. Therefore, high-quality datasets are essential for promoting model generalizability, reducing bias and error, enhancing robustness, and mitigating overfitting risks. For DTI prediction, various datasets have been developed to meet specific experimental requirements and align with algorithmic characteristics, such as the enzyme, ion channel, G protein-coupled receptor (GPCR), and nuclear receptor datasets for classification tasks33 and the Davis,76 KIBA,77 and Metz78 datasets for regression tasks.

Table 2 provides a summary of the commonly used datasets for DTI prediction. Based on these datasets, several key characteristics can be identified.

Table 2. Commonly used datasets in DTI/DTA prediction

Task type | Dataset | Compounds | Proteins | Interactions
Classification | enzyme33,64,66,79 | 445 | 664 | 2,926
Classification | ion channel33,64,66,79 | 210 | 204 | 1,476
Classification | GPCR33,63,64,66,79 | 223 | 95 | 635
Classification | nuclear receptor33,64,66,79 | 54 | 26 | 90
Classification | DUD-E39,49,58,69,80 | 22,886 | 102 | 22,645
Classification | human39,44,48,49,51,52,58,60,63,65,69,72 | 1,052 | 852 | 3,369
Classification | C. elegans44,51,52,60,63,65,69,72 | 1,434 | 2,504 | 4,000
Classification | BindingDB39,44,48,49,51,52,63,72,81 | 49,745 | 812 | 33,772
Classification | DrugBank44,48,62,63,66 | 6,645 | 4,254 | 17,511
Classification | BIOSNAP48,81 | 4,510 | 2,128 | 27,464
Classification | ChEMBL73 | 69,616 | 3,348 | 117,513
Classification | Luo et al.38,42,64 | 708 | 1,512 | 1,923
Regression | Davis44,47,53,56,57,65,66,68,70,71,81,82 | 68 | 442 | 30,056
Regression | KIBA44,47,53,56,57,65,66,68,70,71,82 | 2,111 | 229 | 118,254
Regression | Metz56,65,82 | 1,423 | 170 | 35,259
Regression | BindingDB57,71 | 80,324 | 5,561 | 1,254,402
Regression | ToxCast56,65 | 7,657 | 328 | 342,869
Regression | STITCH56 | 724,471 | 15,258 | 1,244,420
Regression | Sc-PDB68 | 6,326 | 4,782 | 16,034
Regression | DTC56 | 5,983 | 118 | 67,894

Datasets are grouped by task type. For each, the numbers of unique compounds, proteins, and interaction pairs are provided.
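As a worked illustration of how sparsely populated these interaction matrices can be, the sketch below computes the fraction of all possible compound-protein pairs that carry a recorded interaction, using counts copied from Table 2. The helper name `interaction_density` is hypothetical, and the calculation assumes that each listed interaction corresponds to a distinct compound-protein pair.

```python
def interaction_density(n_compounds, n_proteins, n_interactions):
    """Fraction of all possible compound-protein pairs with a recorded interaction."""
    possible_pairs = n_compounds * n_proteins
    if possible_pairs == 0:
        raise ValueError("dataset must contain at least one compound and one protein")
    return n_interactions / possible_pairs


# (compounds, proteins, interactions) as listed in Table 2.
table2_counts = {
    "enzyme": (445, 664, 2926),
    "Davis": (68, 442, 30056),
    "KIBA": (2111, 229, 118254),
}

densities = {name: interaction_density(*counts)
             for name, counts in table2_counts.items()}

for name, value in densities.items():
    print(f"{name}: {value:.4f}")
```

Under this assumption, the Davis matrix is fully populated (68 × 442 = 30,056), whereas the enzyme dataset records interactions for roughly 1% of its possible pairs.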

(1) Clinical relevance and functional richness of target proteins: most datasets focus on proteins such as enzymes, ion channels, GPCRs, nuclear receptors, and protein kinases, which are among the most important and well-validated therapeutic targets in clinical pharmacology.79 These target families are extensively annotated in public databases with respect to their structure, function, and mechanism of action. (2) Data integration from multiple trusted sources: many datasets are constructed by integrating records from publicly available databases such as BindingDB, DrugBank, ChEMBL, ToxCast, STITCH, and curated datasets like Human and C. elegans. In contrast, datasets like Davis and Metz are derived from high-throughput screening (HTS) assays conducted under consistent experimental protocols, which provide more accurate and quantitative interaction measures. (3) Species specificity and cross-species extension: the majority of datasets involve human protein targets, reflecting the primary focus of drug development. However, datasets such as C. elegans include protein targets from the model organism Caenorhabditis elegans, providing valuable opportunities to evaluate model generalizability across species. (4) Distinct naming with shared origins: some datasets have unique names but are still built upon publicly available resources. For instance, the Human and C. elegans datasets were curated by Liu et al.46 based on data from DrugBank, Matador, and STITCH. Additionally, the DUD-E dataset includes a broader range of target classes, including GPCRs and ion channels, and is derived from sources such as ChEMBL and ZINC. (5) Regression datasets adapted for classification tasks: although originally designed for regression, several datasets can be converted into binary classification tasks by applying threshold-based labeling strategies. For example, in the BindingDB dataset, drug-target pairs are labeled as positive if the reported IC50 value is below 100 nM and as negative if the IC50 value exceeds 10,000 nM.83 Thresholds of 5.0 (pKd) for the Davis dataset and 12.1 (KIBA score) for the KIBA dataset are commonly used to binarize interaction strength.35
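The threshold-based labeling described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited study: the function names are hypothetical, pairs falling between the two IC50 cutoffs are assumed to be discarded, and the direction of the Davis and KIBA thresholds (values at or above the cutoff count as positive) is an assumption, since only the cutoff values are given above.

```python
import math

def label_from_ic50(ic50_nm, pos_below=100.0, neg_above=10_000.0):
    """Return 1 (positive), 0 (negative), or None (between cutoffs, assumed dropped)."""
    if ic50_nm < pos_below:
        return 1
    if ic50_nm > neg_above:
        return 0
    return None

def kd_nm_to_pkd(kd_nm):
    """pKd = -log10(Kd in mol/L); with Kd in nM this is 9 - log10(Kd_nM)."""
    return 9.0 - math.log10(kd_nm)

def label_from_score(score, threshold):
    """Assumption: a higher score means stronger binding, so >= threshold is positive."""
    return 1 if score >= threshold else 0

# Hypothetical measurements, for illustration only.
ic50_labels = [label_from_ic50(v) for v in (12.0, 850.0, 25_000.0)]
davis_label = label_from_score(kd_nm_to_pkd(100.0), threshold=5.0)
kiba_label = label_from_score(11.3, threshold=12.1)
```

Returning None for the intermediate IC50 range keeps ambiguous pairs out of both classes, which is one common way to widen the margin between positives and negatives.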

Feature engineering

Feature engineering is the process of transforming raw data into a format suitable for machine learning models. This process encompasses feature extraction, optimization, and interaction, which collectively aid models in interpreting data structures, enhancing prediction accuracy, and reducing computational complexity.

A critical step in machine learning-based DTI prediction is to numerically encode compound and protein information.84,85 Numerous studies have sought to boost model accuracy by characterizing compound and protein features from multiple perspectives. For example, compound properties can be represented using molecular fingerprints, general descriptors, molecular structures, and functional groups. Protein information can be described through sequence composition, amino acid physicochemical properties, and secondary structural features. These descriptors form handcrafted features that serve as inputs for classification or regression algorithms. For instance, iDTI-ESBoost86 combines structural and evolutionary features with an AdaBoost classifier for DTI prediction. RFDT,87 a predictor based on the rotation forest algorithm, encodes protein sequences into position-specific scoring matrices and represents drugs using fingerprint feature vectors. Several specialized toolkits have been developed to process compound and protein features; Table 3 lists some widely used options, each of which is summarized briefly here.

  • BioPython88 is an open-source Python toolkit widely used for the processing and analysis of biological sequence data. It provides modules for reading and writing various sequence file formats, multiple sequence alignment, sequence transcription and translation, sequence comparison, structure parsing, and access to online biological databases.

  • PyMOL89 is a powerful open-source molecular visualization tool supporting multiple structure file formats. It is widely applied in structural biology, drug design, and molecular modeling and is indispensable for protein structural feature extraction and mechanistic studies.

  • DSSP90 is a classic tool for protein secondary structure analysis, automatically identifying and annotating secondary structure types for each amino acid residue based on 3D protein structures (usually from PDB files). It also computes hydrogen bonds, solvent accessibility, backbone torsion angles, and other structural features.

  • iFeature91 is a versatile Python-based tool for biological sequence feature extraction, covering 18 encoding schemes and capable of computing 53 types of feature descriptors. It offers both command-line and graphical interfaces and is commonly used for standardized preprocessing of high-throughput sequence data in machine learning tasks such as sequence classification, function prediction, and interaction analysis.

  • Pfeature92 provides six main modules (composition, binary profiles, evolutionary information, structural features, patterns, and model building), enabling the calculation of over 200,000 features for protein-level and residue-level annotations, as well as for predicting the functions of chemically modified peptides.

  • ProtDCal93 is a Java-based program that calculates general numerical descriptors for protein sequences and 3D structures, covering features such as electronic interactions, van der Waals forces, torsion potentials, and topological indices related to folding rates.

  • modlAMP94 is a Python package designed for antimicrobial peptide data analysis, providing tools for descriptor calculation, sequence retrieval from public or local databases, peptide design, classification, and visualization.

  • ProtParam95 is an online protein sequence analysis tool developed by ExPASy (Swiss Institute of Bioinformatics), capable of computing physical and chemical properties such as molecular weight, theoretical isoelectric point (pI), amino acid composition, extinction coefficients, estimated half-life, instability index, aliphatic index, and hydropathicity.

  • RDKit96 is an open-source cheminformatics toolkit developed in C++ with a complete Python interface, offering efficient and flexible functionalities for molecular modeling, structure parsing, fingerprint generation, and descriptor computation.

  • Open Babel97 is an open-source chemical toolbox supporting multiple chemical data languages, offering functionalities such as file format conversion, conformer search, 2D depiction, filtering, and substructure/similarity searching.

  • OpenChem98 is a deep learning toolkit based on PyTorch for computational chemistry and drug design, providing a flexible, modular framework that integrates molecular representations (e.g., SMILES and molecular graphs) with various neural network models for molecule-level machine learning tasks.

  • ChemPy99 is an open-source chemistry calculation library written in Python, primarily used for handling idealized chemical reaction systems, quantitative chemical calculations, and solving basic chemical equations, including reaction kinetics and concentration modeling.

  • ChemAxon Marvin100 is a professional molecular structure drawing and chemical information processing tool developed by ChemAxon, supporting molecule visualization, standardized input, reaction scheme editing, molecular property evaluation, and initial modeling preparations.

  • PaDEL-Descriptor101 is a Java-based chemical toolkit integrating the Chemistry Development Kit (CDK), capable of calculating 797 molecular descriptors (1D, 2D, and 3D) and 10 fingerprint types, including atom-type E-state descriptors, McGowan volume, molecular linear free energy relationship descriptors, and various binary fingerprints.

  • ChemAxon JChem102 is an enterprise-level cheminformatics suite developed by ChemAxon, designed for processing and analyzing large-scale molecular structure data, offering functionalities such as structure parsing, standardization, searching, fingerprint generation, and property calculations.

  • Pybel103 is the Python wrapper for Open Babel, offering a simplified programming interface for reading, writing, converting, and analyzing molecular structures within the Python environment.

  • ChemDes104 is an integrated platform for molecular descriptor and fingerprint calculation, combining tools such as Pybel, CDK, RDKit, BlueDesc, Chemopy, PaDEL, and jCompoundMapper. It can compute 3,679 molecular descriptors and 59 types of fingerprints and provides utilities for format conversion, MOPAC optimization, and fingerprint similarity calculations.

  • CDK105 is a Java-based open-source cheminformatics library designed for small-molecule modeling and computation, offering functionalities such as molecular structure parsing, descriptor calculation, structure standardization, and cleaning. It serves as the core engine behind platforms like PaDEL-Descriptor, ChemDes, KNIME, and Weka.

  • DeepChem106 is a Python library designed for machine learning and deep learning on molecular and quantum datasets, offering standardized models, datasets, and workflows for tasks in drug discovery, molecular modeling, bioactivity prediction, material science, and computational physics.

Table 3.

Toolkits for feature extraction of proteins and compounds

Commonly used software libraries and platforms for computing descriptors are listed. Official URLs are provided for each toolkit.
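As a minimal illustration of a handcrafted sequence descriptor, the sketch below computes the amino acid composition of a protein, that is, the fraction of each of the 20 standard residues. It uses only the Python standard library rather than any of the toolkits listed above; the function name and the example sequence are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies."""
    seq = sequence.upper()
    # Non-standard symbols (e.g., X or gaps) are ignored.
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

# Hypothetical 10-residue sequence, for illustration only.
features = amino_acid_composition("MKTAYIAKQR")
```

The resulting fixed-length vector can be passed directly to a classifier or regressor, which is what makes composition-type descriptors convenient despite their loss of residue order.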

Unlike handcrafted features, deep learning-based automatic features do not rely on extensive domain knowledge. They are adept at processing unstructured data and perform exceptionally well in complex tasks. Various deep learning methods, each with unique characteristics and advantages, have been applied to capture information from diverse perspectives. For example, graph attention networks107 were designed to process graph-structured data, allowing models to learn the connection strengths (e.g., chemical bonds) between different nodes (e.g., atoms). Graph convolutional neural networks108 effectively extract information from simplified graphical representations of protein structures, offering interpretability for the learned features. Long short-term memory109 networks, specialized in sequential data, capture temporal information and long-term dependencies within input sequences.
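The idea of learning connection strengths between atoms can be sketched with one round of attention-weighted neighbor aggregation on a toy molecular graph. This is a deliberate simplification: graph attention networks use learned projections and a learned scoring function, whereas the sketch below scores neighbors with a plain dot product of fixed, hypothetical atom embeddings.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attention_aggregate(node_features, neighbors):
    """Update each node as an attention-weighted sum over itself and its neighbors."""
    updated, weights = {}, {}
    for node, nbrs in neighbors.items():
        candidates = [node] + list(nbrs)  # include a self-loop
        scores = [dot(node_features[node], node_features[j]) for j in candidates]
        alphas = softmax(scores)
        dim = len(node_features[node])
        updated[node] = [
            sum(a * node_features[j][k] for a, j in zip(alphas, candidates))
            for k in range(dim)
        ]
        weights[node] = dict(zip(candidates, alphas))
    return updated, weights

# Toy three-atom chain (0-1-2); the 2D features are hypothetical embeddings.
node_features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
updated, weights = attention_aggregate(node_features, neighbors)
```

In this toy case the middle atom assigns more weight to the neighbor whose embedding resembles its own, which is the behavior that attention coefficients are meant to expose.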

In DTI prediction, many studies have attempted to probe deeper into data by employing feature interactions across multiple perspectives. ProtDec-LTR3.0110 applied feature mapping to integrate information from ACC and top-gram methods, whereas Zhang et al.111 used cross-term feature mapping to handle input features in ligand-based virtual screening studies. PKRank112 utilizes pairwise kernels for feature processing. Recently, attention mechanisms have been incorporated to enhance the feature interactions. DeepCDA57 introduced a dual-attention mechanism that encodes interactions between protein sub-sequences and compound substructures, calculating the attention coefficients between each compound and protein substructure pair to represent the binding strength. BINDTI48 integrates a bidirectional intent network with multihead attention to combine drug and target features, and IIFDTI63 employs a bidirectional encoder-decoder framework to capture interaction features between drug and target substructures. In summary, intrinsic associations between features remain an important direction for further exploration in DTI research.
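To make the notion of pairwise attention between compound substructures and protein sub-sequences concrete, the following sketch computes a softmax-normalized score for every substructure pair and uses these coefficients to pool the pair representations. It is not the architecture of DeepCDA, BINDTI, or IIFDTI; the embeddings, the dot-product scoring, and the function names are assumptions chosen for brevity.

```python
import math

def pairwise_attention(compound_parts, protein_parts):
    """Softmax-normalized dot-product scores over all substructure pairs."""
    scores = [[sum(c * p for c, p in zip(cv, pv)) for pv in protein_parts]
              for cv in compound_parts]
    m = max(s for row in scores for s in row)
    z = sum(math.exp(s - m) for row in scores for s in row)
    return [[math.exp(s - m) / z for s in row] for row in scores]

def interaction_vector(compound_parts, protein_parts, attn):
    """Attention-weighted sum of concatenated [compound, protein] pair vectors."""
    dim = len(compound_parts[0]) + len(protein_parts[0])
    out = [0.0] * dim
    for i, cv in enumerate(compound_parts):
        for j, pv in enumerate(protein_parts):
            pair = cv + pv  # list concatenation
            for k in range(dim):
                out[k] += attn[i][j] * pair[k]
    return out

# Hypothetical embeddings: 2 compound substructures, 3 protein sub-sequences.
compound_parts = [[1.0, 0.0], [0.0, 1.0]]
protein_parts = [[2.0, 0.0], [0.0, 0.5], [1.0, 1.0]]
attn = pairwise_attention(compound_parts, protein_parts)
vec = interaction_vector(compound_parts, protein_parts, attn)
```

The matrix attn can be inspected directly, so the pairs that dominate the pooled representation are visible, which is one reason attention-based feature interaction is often described as more interpretable.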

Experimental setup

In DTI prediction, two primary entities are considered: drugs and proteins. Their interactions extend beyond simple one-to-one relationships, making the distribution of these entities across the training and test sets a critical aspect in DTI research. Based on data distribution patterns and data availability, DTI prediction tasks can be categorized into warm- and cold-start scenarios.34,45 Figure 2 shows the interactions between certain compounds and proteins. The left panel highlights the overlap of protein targets among five selected drugs (CHEMBL10903, CHEMBL17657, CHEMBL16882, CHEMBL17881, and CHEMBL10874). The Venn diagram indicates how many targets are either uniquely associated with or shared among these compounds. It is evident that some targets are commonly targeted by multiple drugs. The right panel provides a global view of the compound-protein interaction network. The network clearly demonstrates that most drugs are linked to multiple protein targets, and some targets are concurrently associated with several drugs. This further supports the complexity of DTI prediction tasks and underscores the necessity of modeling such multi-relational interactions.

Figure 2.

Overlap and global network of drug-target interactions

The left panel shows a Venn diagram of protein targets among five drugs (CHEMBL10903, CHEMBL17657, CHEMBL16882, CHEMBL17881, and CHEMBL10874). It highlights how many targets are unique to each drug and how many are shared (for example, CHEMBL10903 and CHEMBL17657 share several common targets, while CHEMBL17881 also overlaps with CHEMBL16882), illustrating that certain proteins are commonly targeted by multiple drugs. The right panel presents a global compound-protein interaction network, where nodes represent drugs or proteins and edges represent interactions. The network demonstrates that most drugs are linked to multiple targets, and some targets are concurrently associated with several drugs, underscoring the complexity of DTI prediction and the need to model multi-relational interactions.

In the warm-start scenario, both the training and test sets contain the same or highly similar drugs and targets, which generally allows models to achieve higher predictive accuracy. Warm-start prediction is closely associated with drug repurposing, as it allows researchers to maximize the utility of existing drugs and identify potential new indications. Consequently, warm-start settings are prevalent in DTI research.40,45,53,61

Cold-start scenarios involve novel drugs, targets, or drug-target pairs in the test set that are absent during training. This lack of prior knowledge significantly increases the difficulty of prediction. Research on cold-start scenarios is crucial for new drug development, as it supports the screening and prediction of new drug candidates in data-limited settings, offering innovative treatment possibilities for diseases with currently limited therapeutic options. Cold-start scenarios can be further divided into three categories:

  • Drug cold-start: new drugs appear in the test set, with no prior interaction data available for these drugs during training.

  • Target cold-start: new targets appear in the test set, with no interaction data for these targets present in the training set.

  • Drug-target cold-start: both drugs and targets in the test set are entirely absent from the training set, representing the most stringent cold-start setting.
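The three cold-start settings listed above can be expressed as a single splitting routine. The sketch below is a minimal illustration under stated assumptions: interactions are given as (drug_id, target_id, label) tuples, held-out entities are chosen at random, and in the drug-target setting pairs that involve exactly one held-out entity are discarded so that neither set leaks into the other. The function name and parameters are hypothetical.

```python
import random

def cold_start_split(pairs, mode, test_fraction=0.2, seed=0):
    """Split (drug_id, target_id, label) tuples; mode is 'drug', 'target', or 'pair'."""
    if mode not in ("drug", "target", "pair"):
        raise ValueError("mode must be 'drug', 'target', or 'pair'")
    rng = random.Random(seed)
    drugs = sorted({d for d, _, _ in pairs})
    targets = sorted({t for _, t, _ in pairs})

    def hold_out(ids):
        k = max(1, round(len(ids) * test_fraction))
        return set(rng.sample(ids, k))

    test_drugs = hold_out(drugs) if mode in ("drug", "pair") else set()
    test_targets = hold_out(targets) if mode in ("target", "pair") else set()

    train, test = [], []
    for d, t, y in pairs:
        d_new, t_new = d in test_drugs, t in test_targets
        if mode == "drug":
            (test if d_new else train).append((d, t, y))
        elif mode == "target":
            (test if t_new else train).append((d, t, y))
        elif d_new and t_new:          # 'pair': both entities unseen in training
            test.append((d, t, y))
        elif not d_new and not t_new:  # 'pair': both entities available for training
            train.append((d, t, y))
        # 'pair': mixed cases are dropped to avoid leakage
    return train, test

# Hypothetical 5 x 5 interaction grid with arbitrary labels.
pairs = [(f"D{i}", f"T{j}", (i + j) % 2) for i in range(5) for j in range(5)]
train_d, test_d = cold_start_split(pairs, "drug")
train_p, test_p = cold_start_split(pairs, "pair")
```

Discarding mixed pairs shrinks the usable data considerably, which is part of why the drug-target setting is the most stringent of the three.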

To investigate the cold-start problem in a more rigorous manner, it is essential to adopt stricter and more refined data partitioning strategies. One effective approach is to construct scenarios where the similarity between drugs or targets in the training and testing sets is minimized, thereby simulating more realistic and challenging settings. This can be achieved by clustering drugs or proteins and ensuring that samples from the same cluster appear exclusively in either the training or testing set.

  • Protein clustering: proteins can be clustered based on sequence similarity using tools such as BLAST113 and MAFFT114 or based on structural similarity using tools like Foldseek.115

  • Drug clustering: drugs can be clustered based on molecular scaffolds,116,117 shape similarity,118 or chemical fingerprint similarity.119
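A cluster-based split of this kind can be sketched as follows. The example assumes that each drug is represented by a fingerprint stored as a set of on-bit indices, uses Tanimoto similarity with single-linkage clustering at an arbitrary threshold, and assigns whole clusters to one side of the split. In practice the clustering would typically be delegated to dedicated tools such as those named above; the names and the threshold here are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def single_linkage_clusters(fingerprints, threshold=0.5):
    """Group items connected by any chain of pairs with similarity >= threshold."""
    ids = list(fingerprints)
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for idx, a in enumerate(ids):
        for b in ids[idx + 1:]:
            if tanimoto(fingerprints[a], fingerprints[b]) >= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

def cluster_split(clusters, test_fraction=0.2):
    """Assign whole clusters to the test set, smallest first, up to the quota."""
    total = sum(len(c) for c in clusters)
    train, test = [], []
    for cluster in sorted(clusters, key=len):
        if len(test) + len(cluster) <= test_fraction * total:
            test.extend(cluster)
        else:
            train.extend(cluster)
    return train, test

# Hypothetical fingerprints, for illustration only.
fingerprints = {
    "d1": {1, 2, 3, 4}, "d2": {1, 2, 3, 5},
    "d3": {10, 11, 12, 13}, "d4": {10, 11, 12, 14},
    "d5": {20, 21},
}
clusters = single_linkage_clusters(fingerprints)
train_ids, test_ids = cluster_split(clusters)
```

Because no cluster is divided, every test drug is guaranteed to fall below the similarity threshold with respect to all training drugs, at least under the chosen fingerprint.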

In drug development, both the optimization of existing data through warm-start scenarios and the exploration of new drug candidates under data-scarce conditions via cold-start scenarios are essential for a comprehensive and effective DTI prediction strategy.

Conclusions and future perspectives

This review identified several key challenges in DTI prediction, including issues related to data sourcing, integration, and representation. Several targeted strategies have been proposed to address these challenges.

Challenges

Data sourcing issues

Various databases offering extensive data resources are currently available for DTI prediction. However, these databases present several limitations. The primary issue is data redundancy or duplication across databases, often compounded by inconsistencies in the data formats and standards. Differences in data sources, measurement methods, experimental conditions, and standardization protocols can lead to conflicting affinity values or interaction statuses for the same drug-target pair across databases. Moreover, many databases are not regularly updated, leading to the omission of recent experimental findings or newly identified DTIs. Furthermore, although effective DTI prediction involves a comprehensive range of data types (chemical, biological, genomic, transcriptomic, and clinical), most existing databases focus on a single data type, limiting their utility for multimodal analysis. This fragmented data landscape underscores the need for more integrated and comprehensive datasets to enhance the robustness and generalizability of DTI models.

Data integration challenges

Data sparsity remains a critical challenge in machine learning-based DTI prediction because confirmed DTIs are vastly outnumbered by unknown interactions, particularly for novel targets or rare compounds. Existing databases are often biased toward positive samples representing validated interactions, with a scarcity of clearly annotated negative samples. A common strategy for DTI prediction is to treat unknown interactions as negative samples. However, some unknown pairs may represent unvalidated positive interactions, complicating the ability of a model to distinguish between positive and negative samples accurately. Additionally, the considerable imbalance between known and unknown interactions exacerbates class imbalance in binary classification tasks. Consequently, the incorporation of true negative interactions has emerged as a pivotal area for enhancing model accuracy and robustness in DTI prediction.
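The scale of the problem and the common workaround of treating unknown pairs as negatives can be illustrated with a short sketch. The function names, the 1:1 sampling ratio, and the toy data are assumptions; as noted above, negatives sampled in this way may include unvalidated positives, so they are presumed rather than true negatives.

```python
import random

def sparsity(drugs, targets, positives):
    """Fraction of all possible drug-target pairs with no known interaction."""
    return 1.0 - len(set(positives)) / (len(drugs) * len(targets))

def sample_negatives(drugs, targets, positives, ratio=1.0, seed=0):
    """Draw unlabeled (drug, target) pairs to serve as presumed negatives."""
    positive_set = set(positives)
    unlabeled = [(d, t) for d in drugs for t in targets
                 if (d, t) not in positive_set]
    k = min(len(unlabeled), int(round(ratio * len(positive_set))))
    return random.Random(seed).sample(unlabeled, k)

# Hypothetical data: 4 drugs, 5 targets, 3 known interactions.
drugs = ["D1", "D2", "D3", "D4"]
targets = ["T1", "T2", "T3", "T4", "T5"]
positives = [("D1", "T1"), ("D2", "T3"), ("D4", "T5")]

matrix_sparsity = sparsity(drugs, targets, positives)
negatives = sample_negatives(drugs, targets, positives)
```

Sampling at a fixed ratio addresses class imbalance during training but does nothing about label noise, which is why reliable negative construction is treated as a separate problem.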

Data representation issues

For DTI prediction, traditional machine learning approaches rely on handcrafted features that require deep expertise in drug chemistry, protein structure, and biological networks. Furthermore, handcrafted features often fail to capture the complex, nonlinear relationships intrinsic to DTIs, thereby limiting their applicability to novel molecular structures or less-characterized targets. These features are generally tailored to specific tasks or datasets, which restricts their generalizability and renders their optimization a demanding and time-consuming process.

Despite their proficiency in capturing complex nonlinear patterns and high-dimensional data, automatically learned features are frequently perceived as “black boxes” because of the opacity of their internal decision-making processes, which complicates their interpretation. Moreover, models that use these features typically require extensive hyperparameter tuning, which is an intricate and resource-intensive process requiring substantial experimentation and validation. Neural network architectures, such as CNNs, recurrent neural networks (RNNs), and graph neural networks (GNNs), have shown potential in feature representation for DTI prediction; however, each architecture has inherent limitations. CNNs are constrained in capturing global structural features and are typically unsuitable for molecular graphs that lack a defined Euclidean structure. Although RNNs are effective in processing sequential data, they struggle to retain long-term dependencies. Despite their effectiveness in representing 2D atomic structures and their use in graph-based encoder-decoder frameworks, graph convolutional networks (GCNs) face challenges in assigning appropriate weights to critical neighboring nodes, exhibit reduced flexibility, and often converge at a slower rate.

Strategies (insights)

Ensuring high-quality data sources

To construct datasets that are comprehensive, representative, and enriched with samples, it is essential to collect data from databases that include diverse sources, measurement techniques, experimental conditions, and standardization protocols.120,121 The following guidelines are recommended to ensure data reliability:

  • (1) Prioritize high-quality databases: select databases with well-documented experimental conditions, precise measurement methodologies, and transparent data sources to ensure consistency and dependability. Table 4 provides a list of authoritative databases containing information on proteins, compounds, drug-target pairs, and associated biological entities.

  • A concise overview of the databases listed in Table 4 is presented here for reference.
    • UniProt (Universal Protein Resource)122 is one of the world’s most comprehensive and authoritative protein databases, encompassing over 120 million protein sequences with detailed functional annotations across all branches of life. It provides essential sequence, function, structure, evolutionary relationships, and literature information, serving as a foundational resource in proteomics, bioinformatics, and drug discovery.
    • PDB (Protein Data Bank)123 is the first open-access digital resource in biology and medicine and the most authoritative global repository for 3D structures of proteins and other biomacromolecules.
    • NCBI Protein124 is maintained by the U.S. National Center for Biotechnology Information (NCBI), integrating predicted and experimentally validated protein sequences from multiple sources, along with comprehensive sequence and functional annotations.
    • InterPro125 integrates predictive models from multiple databases (e.g., Gene3D, PANTHER, Pfam, and PROSITE) to annotate protein domains, families, and functional sites, supporting functional annotation and classification research.
    • CDD (Conserved Domain Database)126 maintained by NCBI focuses on the identification and annotation of functional conserved domains within protein sequences.
    • BRENDA127 is the most comprehensive enzyme information system, covering data on enzyme functions, catalytic reactions, physiological roles, substrates, inhibitors, tissue specificity, and disease associations.
    • GO (Gene Ontology)128 is the international standard for gene and protein functional annotation, organized into three core categories: biological processes, molecular functions, and cellular components.
    • PubChem129 is the world’s largest open-access chemical structure and bioactivity database, hosting detailed structure, property, and activity information on small molecules, drugs, intermediates, and natural products.
    • ChEMBL130 is a large open-access drug discovery database integrating bioactivity data of small molecules, approved drugs, and clinical candidates, widely used in pharmaceutical research.
    • ChemDB131 is a small-molecule database constructed from multiple vendors and public resources, containing millions of real and virtual molecules for structure retrieval, property calculation, and target analysis.
    • DrugBank132 is a comprehensive drug and drug-target knowledge base covering basic drug information, mechanisms of action, metabolic pathways, indications, and drug-drug and drug-food interactions, regarded as a “gold standard” resource in drug information.
    • ZINC133 is a freely accessible database for virtual screening, containing tens of billions of enumerated small molecules with atomically precise structures.
    • ChemSpider134 is a chemical database integrating hundreds of independent sources, offering detailed physical and chemical properties, molecular structures, spectra, synthesis routes, and safety information.
    • DrugCentral135 is an integrated platform compiling ingredient information for FDA-approved and other regulatory agency-approved drugs, including structure, bioactivity, pharmacology, indications, and regulatory status.
    • Drugs@FDA136 is the U.S. FDA’s official platform for accessing regulatory information on approved drug and biological products, covering listing information, indications, manufacturers, labeling, and clinical trial data.
    • BindingDB137 is a publicly accessible protein-small molecule interaction database, containing over one million experimentally measured binding data entries and links to resources like ZINC.
    • KEGG138 is a comprehensive database for genome and pathway information, systematically integrating genes, proteins, compounds, diseases, and drugs to assist in understanding biological functions and disease mechanisms.
    • DGIdb139 aggregates drug-gene interaction information from publications, databases, and online resources, standardizing and merging them into unified concept groups.
    • ToxCast140 is a vital database for toxicology research and drug safety evaluation, predicting the potential toxicity of chemicals through high-throughput in vitro screening and computational modeling.
    • STITCH141 is a database of protein-chemical interactions, integrating experimental data, curated information, text-mined results, and predictions to support functional studies and drug discovery.
    • SuperTarget142 is a comprehensive drug database integrating indications, adverse effects, metabolic pathways, target annotations, and GO terms, with high-quality manual curation for part of the content.
    • CovalentInDB143 is a specialized database focusing on covalent inhibitors and their targets, systematically curating known covalent drugs, small-molecule inhibitors, binding mechanisms, and clinical statuses.
    • PGxDB144 is a professional pharmacogenomics database supporting the analysis of drug-gene interaction data and promoting individualized therapy and translational medicine research.
    • DTP (DrugTargetProfiler)145 is an interactive network platform designed to enhance bioactivity modeling for multitarget anticancer compounds, particularly suited for precision oncology applications.
    • GLASS146 is a specialized database focused on GPCR-ligand interactions, annotating interactions involving natural ligands, small-molecule drugs, and endogenous signaling molecules.
    • CTD (Comparative Toxicogenomics Database)147 integrates information on interactions among chemicals, genes, and diseases, aiming to uncover the health impacts of environmental exposures.
    • DTC148 is a community-curated platform integrating and standardizing drug-target interaction data, including clinical development compounds, target-disease associations, and mutant protein targets.
    • TTD (Therapeutic Target Database)149 systematically catalogs descriptions of FDA-approved drugs, clinical drugs, investigational drugs, and their therapeutic targets to facilitate target-based drug research.
    • PharmGKB150 is the world’s leading pharmacogenomics knowledge base, systematically collecting information on gene variants, drug responses, and clinical phenotypes to support precision medicine.
    • Reactome151 is a high-quality, manually curated database of human biological pathways, covering processes such as metabolism, signal transduction, the cell cycle, and immune responses.
    • STRING152 is a high-quality platform for integrated protein-protein interaction data, combining experimental, predicted, text-mined, and curated database sources to build cross-species PPI networks.
    • ConsensusPathDB153 integrates multiple public resources to provide interaction data across human, mouse, and yeast, including protein-protein interactions, metabolic reactions, signaling pathways, and gene regulation networks.
    • HPRD (Human Protein Reference Database)154 compiles high-confidence human proteomic information, including protein phosphorylation modifications, subcellular localization, and functional annotations.
    • BioCyc155 is a portal integrating thousands of microbial genomes and their inferred metabolic pathways, supporting genome browsing, metabolic network modeling, and enrichment analysis.
    • IntAct156 is an open-access molecular interaction database focusing on high-quality experimental data integration, with all entries manually reviewed and standardized.
    • BioGRID157 is a comprehensive repository of molecular interaction data, providing protein and gene interaction information across multiple species, widely used in functional genomics and systems biology research.
    • TDR Targets158 is a target prioritization platform for neglected tropical diseases, systematically integrating pathogen genome data and druggability information to facilitate therapeutic target discovery.
  • (2) Data preprocessing: outliers, duplicates, and noise should be removed to maintain data integrity. Unique identifiers (e.g., drug names, target names, and structural descriptors, such as SMILES and InChI) should be used to identify and eliminate redundant drug-target pairs across different databases.

  • (3) Harmonizing inconsistent data: in cases where multiple databases report different affinity values for the same drug-target pair, researchers may adopt strategies such as calculating a consistency score, selecting data from the most credible source, or averaging the values to ensure a balanced representation.

  • (4) Supplementing insufficient data: when existing data quality or coverage is inadequate, conducting additional experiments on critical drug-target pairs can provide valuable data points, thereby enhancing the overall quality and robustness of the dataset.
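The deduplication and harmonization steps in guidelines (2) and (3) can be sketched as a single pass over merged records. This is one possible reading rather than a prescribed procedure: records are assumed to carry a normalized drug key, a target key, a source name, and an affinity in nM; affinities are averaged on the negative log10 scale; and the "consistency" flag is a hypothetical stand-in for a consistency score, set to true when sources agree within one log unit.

```python
import math
from collections import defaultdict
from statistics import mean

def harmonize(records, max_spread=1.0):
    """Merge records per drug-target pair.

    records: iterable of (drug_key, target_key, source, affinity_nm).
    Returns {(drug_key, target_key): (mean_p_affinity, n_unique, consistent)}.
    """
    grouped = defaultdict(set)
    for drug, target, source, nm in records:
        # A set drops exact repeats of the same value from the same source.
        grouped[(drug, target)].add((source, nm))
    result = {}
    for key, entries in grouped.items():
        # p-scale: -log10 of the molar value, i.e., 9 - log10(nM).
        p_values = [9.0 - math.log10(nm) for _, nm in entries]
        spread = max(p_values) - min(p_values)
        result[key] = (mean(p_values), len(entries), spread <= max_spread)
    return result

# Hypothetical records; keys and values are placeholders, not real data.
records = [
    ("DRUG_A", "TARGET_1", "source_x", 10.0),
    ("DRUG_A", "TARGET_1", "source_x", 10.0),   # exact duplicate
    ("DRUG_A", "TARGET_1", "source_y", 100.0),
    ("DRUG_B", "TARGET_1", "source_x", 1.0),
    ("DRUG_B", "TARGET_1", "source_y", 1000.0),  # strongly conflicting
]
merged = harmonize(records)
```

Pairs flagged as inconsistent can then be resolved by preferring the most credible source or set aside for experimental follow-up, in line with guidelines (3) and (4).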

Table 4.

Databases of proteins, compounds, drug-target pairs, and various associations

Object Database Address
Protein UniProt122 https://www.uniprot.org/
Protein Data Bank123 https://www.rcsb.org/
NCBI Protein Database124 https://www.ncbi.nlm.nih.gov/protein/
InterPro125 https://www.ebi.ac.uk/interpro/
CDD126 https://www.ncbi.nlm.nih.gov/cdd/
BRENDA127 https://www.brenda-enzymes.org/
GO128 https://geneontology.org/
Compound PubChem129 https://pubchem.ncbi.nlm.nih.gov/
ChEMBL130 https://www.ebi.ac.uk/chembl/
ChemDB131 https://cdb.ics.uci.edu/
DrugBank132 https://go.drugbank.com/
ZINC133 https://zinc.docking.org/
ChemSpider134 https://www.chemspider.com/
DrugCentral135 https://drugcentral.org/
Drugs@FDA136 https://www.accessdata.fda.gov/scripts/cder/daf/index.cfm
Drug-target BindingDB137 https://www.bindingdb.org/
KEGG138 https://www.kegg.jp/
DGIdb139 https://www.dgidb.org/
ToxCast140 https://www.epa.gov/comptox-tools/exploring-toxcast-data
STITCH141 http://stitch.embl.de/
SuperTarget142 http://insilico.charite.de/supertarget
CovalentInDB143 http://cadd.zju.edu.cn/cidb/
PGxDB144 https://pgx-db.org/
DTP145 https://drugtargetprofiler.fimm.fi/
GLASS146 https://zhanggroup.org/GLASS/
CTD147 https://ctdbase.org/
DTC148 https://drugtargetcommons.fimm.fi/
Other associations TTD149 http://db.idrblab.net/ttd/
PharmGKB150 https://www.pharmgkb.org/
Reactome151 https://reactome.org/
STRING152 https://string-db.org/
ConsensusPathDB153 http://cpdb.molgen.mpg.de/
HPRD154 http://www.hprd.org/
BioCyc155 https://biocyc.org/
IntAct156 https://www.ebi.ac.uk/intact/home
BioGRID157 https://thebiogrid.org/
TDR Targets158 https://tdrtargets.org/

By adhering to these guidelines, researchers can ensure that their datasets support robust, reliable, and generalizable DTI predictions.

Mitigating data sparsity

Several strategies based on current research, algorithmic innovations, and theoretical frameworks have been proposed to address the challenges posed by data sparsity in DTI prediction.

  • (1) Guilt-by-association principle159: this principle posits that structurally similar compounds are likely to interact with the same target and that target proteins with high homology (e.g., sequence or structural similarity) may share interactions with the same compounds. This approach has been integrated into DTI prediction tasks. For instance, NerLTRDTA45 incorporates the properties of closely related neighbors into the profile of the target entity. BHCNS46 generates reliable negative samples by assuming that proteins that differ from known targets of a compound are unlikely to interact with the compound.

  • (2) Transfer learning160: transfer learning addresses data scarcity by leveraging knowledge from related domains or tasks and enhancing model performance under data-limited conditions. In DTI prediction, related interaction networks (e.g., drug-disease, target-disease, or protein-protein interactions) offer valuable knowledge that can be transferred to improve prediction accuracy, especially for poorly studied drugs or targets.

  • (3) Multitask learning161: when data for a specific target is sparse but the target belongs to a well-studied family, multitask learning can improve prediction by sharing data across related family members. This approach not only enhances DTI prediction but also improves generalizability by leveraging additional information from drug-disease associations.

  • (4) Few-shot learning162: few-shot learning enables models to generalize from minimally labeled data, achieving robust performance on new, unseen samples. Meta-learning,163 a prominent few-shot approach, optimizes the learning process to allow rapid adaptation to novel tasks with limited data.

  • (5)

    Data augmentation and active learning164: active learning iteratively refines model training by selecting and labeling the most informative samples. Starting with a small labeled dataset, the model selects high-value samples from an unlabeled pool using a specific sampling strategy. These samples are then labeled and added to the training set, and the model is retrained in successive cycles until performance goals or budget constraints are met. This iterative process effectively mitigates data sparsity by maximizing the utility of available data.
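
As an illustration of how two of the strategies above can fit together, the following minimal sketch uses a guilt-by-association score (a similarity-weighted vote over labeled neighbors) as the predictor inside an active learning loop with uncertainty sampling. It is a toy example under stated assumptions, not the method of any cited work: compounds are represented as sets of "on" fingerprint bits, a single target is considered, and the names `gba_score`, `active_learning`, and `oracle` (a stand-in for an experimental assay) are hypothetical.

```python
from typing import Callable, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]          # assumed representation: set of "on" bits
Labeled = Tuple[Fingerprint, int]     # (fingerprint, 1 = interacts, 0 = does not)


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def gba_score(query: Fingerprint, labeled: List[Labeled]) -> float:
    """Guilt-by-association: similarity-weighted vote of labeled compounds."""
    num = den = 0.0
    for fp, label in labeled:
        s = tanimoto(query, fp)
        num += s * label
        den += s
    # With no similar labeled neighbor, return 0.5 (maximally uncertain).
    return num / den if den > 0 else 0.5


def active_learning(
    labeled: List[Labeled],
    pool: List[Fingerprint],
    oracle: Callable[[Fingerprint], int],
    rounds: int,
    batch_size: int,
) -> Tuple[List[Labeled], List[Fingerprint]]:
    """Repeatedly label the pool compounds whose score is closest to 0.5."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        # Uncertainty sampling: most ambiguous predictions first.
        pool.sort(key=lambda fp: abs(gba_score(fp, labeled) - 0.5))
        picked, pool = pool[:batch_size], pool[batch_size:]
        # "Retraining" a neighbor-based model is just extending the labeled set.
        labeled.extend((fp, oracle(fp)) for fp in picked)
    return labeled, pool
```

In practice the stopping rule would be a performance target or labeling budget rather than a fixed number of rounds, and the neighbor vote would be replaced by whichever DTI model is being trained.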

Comprehensive and effective representation of drug and target data

Enhancing the representation of drug and target data is essential to advance DTI prediction accuracy and robustness. Figure 3 illustrates hierarchical relationships between data sources, data representation types, and handcrafted and automatically learned features, providing a framework for streamlining the effective representation of drug and protein information. Based on these insights, the following strategies are proposed.

  • (1)

    Multi-view, multi-modality representation: drugs and targets can be represented using diverse data modalities, such as molecular sequences and graph structures.165,166 Multi-modality data encapsulates diverse aspects of an entity and offers holistic representations. Each modality reflects distinct signal types, formats, and sources that may demonstrate complementary, correlated, or unique characteristics. Consequently, the integration of multi-modality and multi-view data substantially improves the accuracy, robustness, and reliability of the predictions.

  • (2)

    Incorporation of heterogeneous network information: recent studies have emphasized the critical role of inter-entity relationships in elucidating biological functions. For example, analyzing drug-induced effects on microRNA expression, particularly in cancer progression, can inform drug mechanisms and guide novel therapeutic strategies.167 Integrating such heterogeneous network information into DTI models can enhance biological interpretability and predictive performance.

  • (3)

    Application of LLMs168: unlabeled data serve as core resources for training LLMs. By leveraging semantic and structural information from massive unlabeled datasets, LLMs achieve cross-task generalization and extract sophisticated protein representations. Pre-training on large protein sequence datasets enables LLMs to capture complex biological relationships, offering promising avenues for improving DTI predictions.

Figure 3.

Hierarchical relationships between data sources, data representations, and features

Drug and target data, such as molecular structures, protein sequences, and interaction networks, can be represented in different ways and then processed to obtain features. Two complementary strategies are illustrated: handcrafted descriptors (e.g., fingerprints, physicochemical properties, and sequence composition) and automatically learned embeddings (e.g., CNN, RNN, GNN, and Transformer). Importantly, the same type of data may support both handcrafted and learned representations, highlighting the layered relationships between data sources, representation types, and feature extraction approaches.

Collectively, these strategies aim to improve data representation, thereby contributing to more accurate and interpretable DTI predictions.
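
To make the idea of combining views concrete, the sketch below concatenates two handcrafted representations into a single feature vector for a drug-target pair: the amino acid composition of the protein sequence and a hashed character n-gram vector of the drug's SMILES string. The n-gram vector is only a toy stand-in for a real chemical fingerprint, and the function names, n-gram length, and vector size are assumptions chosen for illustration.

```python
import zlib
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence: str) -> List[float]:
    """Fraction of each standard amino acid in the sequence (20 values)."""
    counts = Counter(c for c in sequence.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    return [counts[a] / total if total else 0.0 for a in AMINO_ACIDS]


def hashed_smiles_ngrams(smiles: str, n: int = 3, size: int = 64) -> List[float]:
    """Toy binary vector: each character n-gram of the SMILES sets one hashed bit.

    This is NOT a chemical fingerprint; it only mimics the shape of one.
    """
    vec = [0.0] * size
    for i in range(len(smiles) - n + 1):
        gram = smiles[i:i + n]
        vec[zlib.crc32(gram.encode("utf-8")) % size] = 1.0
    return vec


def pair_features(smiles: str, sequence: str) -> List[float]:
    """Early fusion: concatenate the drug view and the target view."""
    return hashed_smiles_ngrams(smiles) + amino_acid_composition(sequence)


if __name__ == "__main__":
    features = pair_features("CC(=O)O", "MKTAYIAKQR")
    print(len(features))  # drug view (64) + target view (20)
```

Simple concatenation is only one fusion choice; learned embeddings from sequence or graph encoders could be appended to the same vector in the same way.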

Enhancing experimental data settings

Given that DTI prediction involves both drug and target entities, the risk of data overlap between training and test sets is substantial, potentially leading to overestimation of model performance. Rigorous experimental protocols are thus necessary.

  • (1)

    Elimination of highly similar samples: prior to analysis, drugs or targets that are highly similar to one another (assessed using metrics such as Smith-Waterman alignment scores169 for protein sequences or Tanimoto similarity170 for compounds) should be removed to reduce redundancy and bias, ensuring a more representative dataset and improving experimental validity.

  • (2)

    Cluster-based partitioning for enhanced data independence171: data partitioning based on the clustering of drugs or targets using molecular structures or protein sequences allows the creation of independent training and test sets with no overlap. This approach increases inter-cluster variance and simulates real-world scenarios involving novel drugs or targets more accurately. Consequently, this enhances the robustness and generalizability of the model when applied to unseen samples.
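
The two steps above can be sketched for the drug side as follows. The example groups compounds by Tanimoto similarity using a simple greedy leader clustering and then assigns whole clusters to either the training or the test set. The clustering scheme, the 0.6 threshold, the 20% test fraction, and the function names are all assumptions made for illustration; compounds are again represented as sets of "on" fingerprint bits.

```python
from typing import FrozenSet, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # assumed representation: set of "on" bits


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leader_clusters(fps: Sequence[Fingerprint], threshold: float) -> List[List[int]]:
    """Greedy clustering: an item joins the first cluster whose leader
    (its first member) is at least `threshold` similar; otherwise it
    starts a new cluster."""
    clusters: List[List[int]] = []
    for i, fp in enumerate(fps):
        for members in clusters:
            if tanimoto(fp, fps[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def cluster_split(
    fps: Sequence[Fingerprint], threshold: float = 0.6, test_fraction: float = 0.2
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to train or test so no cluster is shared."""
    clusters = sorted(leader_clusters(fps, threshold), key=len)
    target = test_fraction * len(fps)
    train: List[int] = []
    test: List[int] = []
    for members in clusters:
        # Fill the test set with the smallest clusters first.
        (test if len(test) < target else train).extend(members)
    return train, test
```

Because similarity is checked only against cluster leaders, a pair of compounds in different clusters can still exceed the threshold, so a final pairwise check between the training and test sets is advisable. For targets, the same procedure applies with a sequence-based similarity (for example, a normalized Smith-Waterman score) in place of `tanimoto`.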

The influence of AlphaFold on DTI prediction

The 3D structure of target proteins is critical for drug design, particularly for evaluating binding sites and interactions. AlphaFold has revolutionized 3D structure prediction by offering reliable and high-precision structural data for DTI research.172 Using AlphaFold-predicted structures, researchers can identify binding sites with greater accuracy, significantly improving virtual screening processes. For instance, Johansson et al.173 applied AlphaFold-based methodologies to refine peptide-protein docking, demonstrating their potential to enhance predictive performance. Integrating AlphaFold-generated structural data into DTI workflows holds great promise for enabling more precise interaction predictions and accelerating drug discovery.
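
One practical consideration when using predicted structures is that AlphaFold reports a per-residue confidence score (pLDDT), which in its PDB-format output is stored in the B-factor column. The minimal sketch below reads these scores from the C-alpha atoms and keeps only residues above a confidence cutoff, which may be a useful first filter before inspecting candidate binding sites. The function names and the cutoff of 70 are assumptions for illustration, and the parser handles only standard fixed-width ATOM records.

```python
from typing import Dict, List, Tuple

ResidueKey = Tuple[str, int]  # (chain ID, residue number)


def residue_plddt(pdb_text: str) -> Dict[ResidueKey, float]:
    """Map each residue to the B-factor value of its C-alpha atom.

    In AlphaFold PDB files this column holds the pLDDT score (0-100).
    """
    scores: Dict[ResidueKey, float] = {}
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name 13-16, chain 22,
        # residue number 23-26, B-factor 61-66 (1-based).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            res_num = int(line[22:26])
            scores[(chain, res_num)] = float(line[60:66])
    return scores


def confident_residues(pdb_text: str, min_plddt: float = 70.0) -> List[ResidueKey]:
    """Residues whose pLDDT is at least `min_plddt` (assumed cutoff)."""
    scores = residue_plddt(pdb_text)
    return sorted(key for key, value in scores.items() if value >= min_plddt)


# Usage with a hypothetical file name:
# with open("AF-model.pdb") as handle:
#     print(confident_residues(handle.read()))
```

Regions with low confidence are often flexible or disordered, so binding sites that fall mostly in such regions may warrant additional validation.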

Future perspectives

Contemporary computational approaches that encompass machine learning and deep learning methodologies have substantially enhanced the efficiency and accuracy of DTI predictions. These innovations have not only expedited drug discovery but also reduced development costs and contributed to multiple stages of the drug development pipeline. However, significant challenges remain, particularly in developing models that balance high predictive accuracy with interpretability and broad applicability.

Future research should prioritize the acquisition and integration of high-quality DTI datasets coupled with the development of advanced methods for representing drug and target features. Addressing persistent challenges, such as data scarcity, cold-start scenarios, and limitations in predictive precision, will be critical for further progress.

In summary, contemporary computational methods play a pivotal role in DTI prediction and hold immense potential for advancing drug discovery and development.

Acknowledgments

The work was supported by the National Natural Science Foundation of China (nos. 62450002, 32470693, and 62425107), Zhejiang Provincial Natural Science Foundation of China (no. LD24F020004), and the Municipal Government of Quzhou (no. 2024D001).

Author contributions

X.R. conceived the review framework, conducted the literature survey, synthesized and organized the content, drafted the manuscript, and prepared the figures and tables. Q.Z. provided overall supervision, conceptual guidance, and critical revisions to improve the clarity and rigor of the manuscript. W.H. and L.X. assisted in literature collection and provided helpful comments during manuscript preparation. All authors read and approved the final manuscript.

Declaration of interests

The authors declare no competing interests.

Contributor Information

Wu Han, Email: kevinwh@stanford.edu.

Quan Zou, Email: zouquan@nclab.net.

References

  • 1. O'Neill J. Antimicrobial resistance: tackling a crisis for the health and wealth of nations. The Review on Antimicrobial Resistance; 2014. https://amr-review.org/sites/default/files/AMR%20Review%20Paper%20-%20Tackling%20a%20crisis%20for%20the%20health%20and%20wealth%20of%20nations_1.pdf
  • 2. Wong F., Zheng E.J., Valeri J.A., Donghia N.M., Anahtar M.N., Omori S., Li A., Cubillos-Ruiz A., Krishnan A., Jin W., et al. Discovery of a structural class of antibiotics with explainable deep learning. Nature. 2024;626:177–185. doi: 10.1038/s41586-023-06887-8.
  • 3. Mattiuzzi C., Lippi G. Current cancer epidemiology. J. Epidemiol. Glob. Health. 2019;9:217–222. doi: 10.2991/jegh.k.191008.001.
  • 4. Dong J., Wu Z., Xu H., Ouyang D. FormulationAI: a novel web-based platform for drug formulation design driven by artificial intelligence. Brief. Bioinform. 2023;25:bbad419. doi: 10.1093/bib/bbad419.
  • 5. Hartl D., de Luca V., Kostikova A., Laramie J., Kennedy S., Ferrero E., Siegel R., Fink M., Ahmed S., Millholland J., et al. Translational precision medicine: an industry perspective. J. Transl. Med. 2021;19:245. doi: 10.1186/s12967-021-02910-6.
  • 6. Petzschner F.H. Practical challenges for precision medicine. Science. 2024;383:149–150. doi: 10.1126/science.adm9218.
  • 7. Wang X., Duan M., Li J., Ma A., Xin G., Xu D., Li Z., Liu B., Ma Q. MarsGT: Multi-omics analysis for rare population inference using single-cell graph transformer. Nat. Commun. 2024;15:338. doi: 10.1038/s41467-023-44570-8.
  • 8. Richardson P., Griffin I., Tucker C., Smith D., Oechsle O., Phelan A., Rawling M., Savory E., Stebbing J. Baricitinib as potential treatment for 2019-nCoV acute respiratory disease. Lancet. 2020;395:e30–e31. doi: 10.1016/S0140-6736(20)30304-4.
  • 9. Stebbing J., Krishnan V., de Bono S., Ottaviani S., Casalini G., Richardson P.J., Monteil V., Lauschke V.M., Mirazimi A., Youhanna S., et al. Mechanism of baricitinib supports artificial intelligence-predicted testing in COVID-19 patients. EMBO Mol. Med. 2020;12 doi: 10.15252/emmm.202012697.
  • 10. From Start to Phase 1 in 30 Months: AI-Discovered and AI-Designed Anti-fibrotic Drug Enters Phase I Clinical Trial. Insilico; 2021. https://insilico.com/phase1?utm_source=chatgpt.com.
  • 11. Chong C.R., Sullivan D.J., Jr. New uses for old drugs. Nature. 2007;448:645–646. doi: 10.1038/448645a.
  • 12. DiMasi J.A., Grabowski H.G., Hansen R.W. Innovation in the pharmaceutical industry: new estimates of R&D costs. J. Health Econ. 2016;47:20–33. doi: 10.1016/j.jhealeco.2016.01.012.
  • 13. Lamanna G., Delre P., Marcou G., Saviano M., Varnek A., Horvath D., Mangiatordi G.F. GENERA: a combined genetic/deep-learning algorithm for multiobjective target-oriented de novo design. J. Chem. Inf. Model. 2023;63:5107–5119. doi: 10.1021/acs.jcim.3c00963.
  • 14. Bai F., Li S., Li H. AI enhances drug discovery and development. Natl. Sci. Rev. 2024;11 doi: 10.1093/nsr/nwad303.
  • 15. Hay M., Thomas D.W., Craighead J.L., Economides C., Rosenthal J. Clinical development success rates for investigational drugs. Nat. Biotechnol. 2014;32:40–51. doi: 10.1038/nbt.2786.
  • 16. Mak K.-K., Pichika M.R. Artificial intelligence in drug development: present status and future prospects. Drug Discov. Today. 2019;24:773–780. doi: 10.1016/j.drudis.2018.11.014.
  • 17. Smajić A., Grandits M., Ecker G.F. Privacy-preserving techniques for decentralized and secure machine learning in drug discovery. Drug Discov. Today. 2023;28 doi: 10.1016/j.drudis.2023.103820.
  • 18. Mullowney M.W., Duncan K.R., Elsayed S.S., Garg N., van der Hooft J.J.J., Martin N.I., Meijer D., Terlouw B.R., Biermann F., Blin K., et al. Artificial intelligence for natural product drug discovery. Nat. Rev. Drug Discov. 2023;22:895–916. doi: 10.1038/s41573-023-00774-7.
  • 19. Savage N. Tapping into the drug discovery potential of AI. Biopharm. Deal. 2021 doi: 10.1038/d43747-021-00045-7.
  • 20. Tropsha A., Isayev O., Varnek A., Schneider G., Cherkasov A. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nat. Rev. Drug Discov. 2024;23:141–155. doi: 10.1038/s41573-023-00832-0.
  • 21. Wang Y., Wang C., Liu T., Qi H., Chen S., Cai X., Zhang M., Aliper A., Ren F., Ding X., Zhavoronkov A. Discovery of tetrahydropyrazolopyrazine derivatives as potent and selective MYT1 inhibitors for the treatment of cancer. J. Med. Chem. 2024;67:420–432. doi: 10.1021/acs.jmedchem.3c01476.
  • 22. Tanoli Z., Schulman A., Aittokallio T. Validation guidelines for drug–target prediction methods. Expert Opin. Drug Discov. 2025;20:31–45. doi: 10.1080/17460441.2024.2430955.
  • 23. Li S., Hu C., Ke S., Yang C., Chen J., Xiong Y., Liu H., Hong L. LS-MolGen: ligand- and structure dual-driven deep reinforcement learning for target-specific molecular generation improves binding affinity and novelty. J. Chem. Inf. Model. 2023;63:4207–4215. doi: 10.1021/acs.jcim.3c00587.
  • 24. Kuntz I.D., Blaney J.M., Oatley S.J., Langridge R., Ferrin T.E. A geometric approach to macromolecule–ligand interactions. J. Mol. Biol. 1982;161:269–288. doi: 10.1016/0022-2836(82)90153-X.
  • 25. Gschwend D.A., Good A.C., Kuntz I.D. Molecular docking towards drug discovery. J. Mol. Recognit. 1996;9:175–186. doi: 10.1002/(SICI)1099-1352(199603)9:2<175::AID-JMR260>3.0.CO;2-D.
  • 26. Ripphausen P., Nisius B., Bajorath J. State-of-the-art in ligand-based virtual screening. Drug Discov. Today. 2011;16:372–376. doi: 10.1016/j.drudis.2011.02.011.
  • 27. Nantasenamat C. A practical overview of quantitative structure–activity relationship. EXCLI J. 2009;8:74–88. doi: 10.17877/DE290R-690.
  • 28. Lu S.-H., Wu J.-W., Liu H.-L., Zhao J.-H., Liu K.-T., Chuang C.-K., Lin H.-Y., Tsai W.-B., Ho Y. The discovery of potential acetylcholinesterase inhibitors: a combination of pharmacophore modeling, virtual screening, and molecular docking studies. J. Biomed. Sci. 2011;18:8. doi: 10.1186/1423-0127-18-8.
  • 29. Bowie J.U., Lüthy R., Eisenberg D. A method to identify protein sequences that fold into a known three-dimensional structure. Science. 1991;253:164–170. doi: 10.1126/science.1853201.
  • 30. Kuhlman B., Bradley P. Advances in protein structure prediction and design. Nat. Rev. Mol. Cell Biol. 2019;20:681–697. doi: 10.1038/s41580-019-0163-x.
  • 31. Muhammed M.T., Aki-Yalcin E. Homology modeling in drug discovery: overview, current applications, and future perspectives. Chem. Biol. Drug Des. 2019;93:12–20. doi: 10.1111/cbdd.13388.
  • 32. Batool M., Ahmad B., Choi S. A structure-based drug discovery paradigm. Int. J. Mol. Sci. 2019;20:2783. doi: 10.3390/ijms20112783.
  • 33. Yamanishi Y., Kotera M., Kanehisa M., Goto S. Drug–target interaction prediction from chemical, genomic and pharmacological data in an integrated framework. Bioinformatics. 2010;26:i246–i254. doi: 10.1093/bioinformatics/btq176.
  • 34. Pahikkala T., Airola A., Pietilä S., Shakyawar S., Szwajda A., Tang J., Aittokallio T. Toward more realistic drug–target interaction predictions. Brief. Bioinform. 2015;16:325–337. doi: 10.1093/bib/bbu010.
  • 35. He T., Heidemeyer M., Ban F., Cherkasov A., Ester M. SimBoost: a read-across approach for predicting drug–target binding affinities using gradient boosting machines. J. Cheminform. 2017;9 doi: 10.1186/s13321-017-0209-z.
  • 36. Jiang M., Li Z., Zhang S., Wang S., Wang X., Yuan Q., Wei Z. Drug–target affinity prediction using graph neural network and contact maps. RSC Adv. 2020;10:20701–20712. doi: 10.1039/D0RA02297G.
  • 37. Shin B., Park S., Kang K., Ho J.C. Self-attention based molecule representation for predicting drug–target interaction. Machine Learning for Healthcare Conference. 2019.
  • 38. Fu H., Huang F., Liu X., Qiu Y., Zhang W. MVGCN: data integration through multi-view graph convolutional network for predicting links in biomedical bipartite networks. Bioinformatics. 2022;38:426–434. doi: 10.1093/bioinformatics/btab651.
  • 39. Zheng S., Li Y., Chen S., Xu J., Yang Y. Predicting drug–protein interaction using quasi-visual question answering system. Nat. Mach. Intell. 2020;2:134–140. doi: 10.1038/s42256-020-0152-y.
  • 40. Karimi M., Wu D., Wang Z., Shen Y. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics. 2019;35:3329–3338. doi: 10.1093/bioinformatics/btz111.
  • 41. Wu Y., Gao M., Zeng M., Zhang J., Li M. BridgeDPI: a novel graph neural network for predicting drug–protein interactions. Bioinformatics. 2022;38:2571–2578. doi: 10.1093/bioinformatics/btac155.
  • 42. Luo Y., Zhao X., Zhou J., Yang J., Zhang Y., Kuang W., Peng J., Chen L., Zeng J. A network integration approach for drug–target interaction prediction and computational drug repositioning from heterogeneous information. Nat. Commun. 2017;8:573. doi: 10.1038/s41467-017-00680-8.
  • 43. Qu X., Liang Y., Wang Y., Zheng T., Yue T., Ma L., Huang S.-W., Zhang J., Shi Y., Lin C. DEEP-ICL: Definition-Enriched Experts for Language Model In-Context Learning. Preprint at arXiv. 2024. https://www.arxiv.org/abs/2403.04233v1
  • 44. Hua Y., Feng Z., Song X., Wu X.-J., Kittler J. MMDG-DTI: drug–target interaction prediction via multimodal feature fusion and domain generalization. Pattern Recognit. 2025;157 doi: 10.1016/j.patcog.2024.110887.
  • 45. Ru X., Ye X., Sakurai T., Zou Q. NerLTR-DTA: drug–target binding affinity prediction based on neighbor relationship and learning to rank. Bioinformatics. 2022;38:1964–1971. doi: 10.1093/bioinformatics/btac048.
  • 46. Liu H., Sun J., Guan J., Zheng J., Zhou S. Improving compound–protein interaction prediction by building up highly credible negative samples. Bioinformatics. 2015;31:i221–i229. doi: 10.1093/bioinformatics/btv695.
  • 47. Wang D., Chen X., Bao X., Zhou K. AGraphDTA: an efficient model for drug–target affinity prediction with feature fusion. 2023 International Conference on New Trends in Computational Intelligence (NTCI). 2023. doi: 10.1109/NTCI60157.2023.10403671.
  • 48. Peng L., Liu X., Yang L., Liu L., Bai Z., Chen M., Lu X., Nie L. BINDTI: a bi-directional intention network for drug–target interaction identification based on attention mechanisms. IEEE J. Biomed. Health Inform. 2025;29:1602–1612. doi: 10.1109/JBHI.2024.3375025.
  • 49. Pan F., Yin C., Liu S.-Q., Huang T., Bian Z., Yuen P.C. BindingSiteDTI: differential-scale binding site modelling for drug–target interaction prediction. Bioinformatics. 2024;40 doi: 10.1093/bioinformatics/btae308.
  • 50. Kalemati M., Zamani Emani M., Koohi S. BiComp-DTA: drug–target binding affinity prediction through complementary biological-related and compression-based featurization approach. PLoS Comput. Biol. 2023;19 doi: 10.1371/journal.pcbi.1011036.
  • 51. Hua Y., Song X., Feng Z., Wu X.-J., Kittler J., Yu D.-J. CPInformer for efficient and robust compound–protein interaction prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023;20:285–296. doi: 10.1109/TCBB.2022.3144008.
  • 52. Zhao M., Yuan M., Yang Y., Xu S.X. CPGL: prediction of compound–protein interaction by integrating graph attention network with long short-term memory neural network. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023;20:1935–1942. doi: 10.1109/TCBB.2022.3225296.
  • 53. Öztürk H., Özgür A., Ozkirimli E. DeepDTA: deep drug–target binding affinity prediction. Bioinformatics. 2018;34:i821–i829. doi: 10.1093/bioinformatics/bty593.
  • 54. Lee I., Keum J., Nam H. DeepConv-DTI: prediction of drug–target interactions via deep learning with convolution on protein sequences. PLoS Comput. Biol. 2019;15 doi: 10.1371/journal.pcbi.1007129.
  • 55. Zhu Y., Zhao L., Wen N., Wang J., Wang C. DataDTA: a multi-feature and dual-interaction aggregation framework for drug–target binding affinity prediction. Bioinformatics. 2023;39 doi: 10.1093/bioinformatics/btad560.
  • 56. Mukherjee S., Ghosh M., Basuchowdhuri P. DeepGLSTM: deep graph convolutional network and LSTM based approach for predicting drug–target binding affinity. Proceedings of the 2022 SIAM International Conference on Data Mining (SDM). 2022.
  • 57. Abbasi K., Razzaghi P., Poso A., Amanlou M., Ghasemi J.B., Masoudi-Nejad A. DeepCDA: deep cross-domain compound–protein affinity prediction through LSTM and convolutional neural networks. Bioinformatics. 2020;36:4633–4642. doi: 10.1093/bioinformatics/btaa544.
  • 58. Chen W., Chen G., Zhao L., Chen C.-Y.-C. Predicting drug–target interactions with deep-embedding learning of graphs and sequences. J. Phys. Chem. A. 2021;125:5633–5642. doi: 10.1021/acs.jpca.1c02419.
  • 59. Yuan W., Chen G., Chen C.-Y.-C. FusionDTA: attention-based feature polymerizer and knowledge distillation for drug–target binding affinity prediction. Brief. Bioinform. 2022;23 doi: 10.1093/bib/bbab506.
  • 60. Quan Z., Guo Y., Lin X., Wang Z.-J., Zeng X. GraphCPI: graph neural representation learning for compound–protein interaction. 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). 2019; pp. 717–722.
  • 61. Nguyen T., Le H., Quinn T.P., Nguyen T., Le T.D., Venkatesh S. GraphDTA: predicting drug–target binding affinity with graph neural networks. Bioinformatics. 2021;37:1140–1147. doi: 10.1093/bioinformatics/btaa921.
  • 62. Zhao Q., Zhao H., Zheng K., Wang J. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics. 2022;38:655–662. doi: 10.1093/bioinformatics/btab715.
  • 63. Cheng Z., Zhao Q., Li Y., Wang J. IIFDTI: predicting drug–target interactions through interactive and independent features based on attention mechanism. Bioinformatics. 2022;38:4153–4161. doi: 10.1093/bioinformatics/btac485.
  • 64. Song W., Xu L., Han C., Tian Z., Zou Q. Drug–target interaction predictions with multi-view similarity network fusion strategy and deep interactive attention mechanism. Bioinformatics. 2024;40 doi: 10.1093/bioinformatics/btae346.
  • 65. Yang Z., Zhong W., Zhao L., Yu-Chian Chen C. MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction. Chem. Sci. 2022;13:816–833. doi: 10.1039/D1SC05180F.
  • 66. Li X., Zhang G., Cui H., Hou S., Wang S., Li X., Chen Y., Li Z., Zhang L. MCANet: a joint semantic segmentation framework of optical and SAR images for land use classification. Int. J. Appl. Earth Obs. Geoinf. 2022;106 doi: 10.1016/j.jag.2022.102638.
  • 67. Zeng Y., Chen X., Luo Y., Li X., Peng D. Deep drug–target binding affinity prediction with multiple attention blocks. Brief. Bioinform. 2021;22 doi: 10.1093/bib/bbab117.
  • 68. Hua Y., Song X., Feng Z., Wu X. MFR-DTA: a multi-functional and robust model for predicting drug–target binding affinity and region. Bioinformatics. 2023;39 doi: 10.1093/bioinformatics/btad056.
  • 69. Tsubaki M., Tomii K., Sese J. Compound–protein interaction prediction with end-to-end learning of neural networks for graphs and sequences. Bioinformatics. 2019;35:309–318. doi: 10.1093/bioinformatics/bty535.
  • 70. Tang X., Zhou Y., Yang M., Li W. TC-DTA: predicting drug–target binding affinity with transformer and convolutional neural networks. IEEE Trans. NanoBioscience. 2024;23:572–578. doi: 10.1109/TNB.2024.3441590.
  • 71. Li Z., Ren P., Yang H., Zheng J., Bai F. TEFDTA: a transformer encoder and fingerprint representation combined prediction method for bonded and non-bonded drug–target affinities. Bioinformatics. 2024;40 doi: 10.1093/bioinformatics/btad778.
  • 72. Chen L., Tan X., Wang D., Zhong F., Liu X., Yang T., Luo X., Chen K., Jiang H., Zheng M. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics. 2020;36:4406–4414. doi: 10.1093/bioinformatics/btaa524.
  • 73. Chen L., Fan Z., Chang J., Yang R., Hou H., Guo H., Zhang Y., Yang T., Zhou C., Sui Q., et al. Sequence-based drug design as a concept in computational drug design. Nat. Commun. 2023;14:4217. doi: 10.1038/s41467-023-39856-w.
  • 74. Suruliandi A., Idhaya T., Raja S. Drug target interaction prediction using machine learning techniques – a review. Revista IJIMAI. 2024;8:n6. doi: 10.9781/ijimai.2022.11.002.
  • 75. Yu J., Li Z., Chen G., Kong X., Hu J., Wang D., Cao D., Li Y., Huo R., Wang G., et al. Computing the relative binding affinity of ligands based on a pairwise binding comparison network. Nat. Comput. Sci. 2023;3:860–872. doi: 10.1038/s43588-023-00529-9.
  • 76. Davis M.I., Hunt J.P., Herrgard S., Ciceri P., Wodicka L.M., Pallares G., Hocker M., Treiber D.K., Zarrinkar P.P. Comprehensive analysis of kinase inhibitor selectivity. Nat. Biotechnol. 2011;29:1046–1051. doi: 10.1038/nbt.1990.
  • 77. Tang J., Szwajda A., Shakyawar S., Xu T., Hintsanen P., Wennerberg K., Aittokallio T. Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J. Chem. Inf. Model. 2014;54:735–743. doi: 10.1021/ci400709d.
  • 78. Metz J.T., Johnson E.F., Soni N.B., Merta P.J., Kifle L., Hajduk P.J. Navigating the kinome. Nat. Chem. Biol. 2011;7:200–202. doi: 10.1038/nchembio.530.
  • 79. Schulman A., Rousu J., Aittokallio T., Tanoli Z. Attention-based approach to predict drug–target interactions across seven target superfamilies. Bioinformatics. 2024;40 doi: 10.1093/bioinformatics/btae496.
  • 80. Wang S., Jiang M., Zhang S., Wang X., Yuan Q., Wei Z., Li Z. MCN-CPI: multiscale convolutional network for compound–protein interaction prediction. Biomolecules. 2021;11:1119. doi: 10.3390/biom11081119.
  • 81. Huang K., Xiao C., Glass L.M., Sun J. MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics. 2021;37:830–836. doi: 10.1093/bioinformatics/btaa880.
  • 82. Yang X., Niu Z., Liu Y., Song B., Lu W., Zeng L., Zeng X. Modality-DTA: multimodality fusion strategy for drug–target affinity prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 2023;20:1200–1210. doi: 10.1109/TCBB.2022.3205282.
  • 83. Tayebi A., Yousefi N., Yazdani-Jahromi M., Kolanthai E., Neal C.J., Seal S., Garibay O.O. UnbiasedDTI: mitigating real-world bias of drug–target interaction prediction by using deep ensemble-balanced learning. Molecules. 2022;27:2980. doi: 10.3390/molecules27092980.
  • 84. McGibbon M., Shave S., Dong J., Gao Y., Houston D.R., Xie J., Yang Y., Schwaller P., Blay V. From intuition to AI: evolution of small molecule representations in drug discovery. Brief. Bioinform. 2023;25:bbad422. doi: 10.1093/bib/bbad422.
  • 85. Wang Y., Zhai Y., Ding Y., Zou Q. SBSM-Pro: support bio-sequence machine for proteins. Sci. China Inf. Sci. 2024;67 doi: 10.1007/s11432-024-4171-9.
  • 86. Rayhan F., Ahmed S., Shatabda S., Farid D.M., Mousavian Z., Dehzangi A., Rahman M.S. iDTI-ESBoost: identification of drug–target interaction using evolutionary and structural features with boosting. Sci. Rep. 2017;7 doi: 10.1038/s41598-017-18025-2.
  • 87. Wang L., You Z.-H., Chen X., Yan X., Liu G., Zhang W. RFDT: a rotation forest-based predictor for predicting drug–target interactions using drug structure and protein sequence information. Curr. Protein Pept. Sci. 2018;19:445–454. doi: 10.2174/1389203718666161114111656.
  • 88. Cock P.J.A., Antao T., Chang J.T., Chapman B.A., Cox C.J., Dalke A., Friedberg I., Hamelryck T., Kauff F., Wilczynski B., de Hoon M.J.L. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics. 2009;25:1422–1423. doi: 10.1093/bioinformatics/btp163.
  • 89. DeLano W.L. PyMOL: an open-source molecular graphics tool. CCP4 Newsl. Protein Crystallogr. 2002;40:82–92.
  • 90. Kabsch W., Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983;22:2577–2637. doi: 10.1002/bip.360221211.
  • 91. Chen Z., Zhao P., Li F., Leier A., Marquez-Lago T.T., Wang Y., Webb G.I., Smith A.I., Daly R.J., Chou K.-C., Song J. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics. 2018;34:2499–2502. doi: 10.1093/bioinformatics/bty140.
  • 92. Pande A., Patiyal S., Lathwal A., Arora C., Kaur D., Dhall A., Mishra G., Kaur H., Sharma N., Jain S., et al. Pfeature: a tool for computing wide range of protein features and building prediction models. J. Comput. Biol. 2023;30:204–222. doi: 10.1089/cmb.2022.0241.
  • 93. Ruiz-Blanco Y.B., Paz W., Green J., Marrero-Ponce Y. ProtDCal: a program to compute general-purpose numerical descriptors for sequences and 3D-structures of proteins. BMC Bioinf. 2015;16 doi: 10.1186/s12859-015-0586-0.
  • 94. Müller A.T., Gabernet G., Hiss J.A., Schneider G. modlAMP: Python for antimicrobial peptides. Bioinformatics. 2017;33:2753–2755. doi: 10.1093/bioinformatics/btx285.
  • 95. Azimi R., Ozgul M., Kenney M.C., Kuppermann B.D. Bioinformatic analysis of small humanin-like peptides using AlphaFold-2 and Expasy ProtParam. Investig. Ophthalmol. Vis. Sci. 2024;65:1320.
  • 96. Bento A.P., Hersey A., Félix E., Landrum G., Gaulton A., Atkinson F., Bellis L.J., De Veij M., Leach A.R. An open source chemical structure curation pipeline using RDKit. J. Cheminform. 2020;12:1–16. doi: 10.1186/s13321-020-00456-1.
  • 97. O'Boyle N.M., Banck M., James C.A., Morley C., Vandermeersch T., Hutchison G.R. Open Babel: an open chemical toolbox. J. Cheminform. 2011;3:1–14. doi: 10.1186/1758-2946-3-33.
  • 98. Korshunova M., Ginsburg B., Tropsha A., Isayev O. OpenChem: a deep learning toolkit for computational chemistry and drug design. J. Chem. Inf. Model. 2021;61:7–13. doi: 10.1021/acs.jcim.0c00971.
  • 99. Dahlgren B. ChemPy: a package useful for chemistry written in Python. J. Open Source Softw. 2018;3:565. doi: 10.21105/joss.00565.
  • 100. ten Brink T., Exner T.E. pKa based protonation states and microspecies for protein–ligand docking. J. Comput. Aided Mol. Des. 2010;24:935–942. doi: 10.1007/s10822-010-9385-x.
  • 101. Yap C.W. PaDEL-Descriptor: an open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011;32:1466–1474. doi: 10.1002/jcc.21707.
  • 102. Warr W.A. Scientific workflow systems: Pipeline Pilot and KNIME. J. Comput. Aided Mol. Des. 2012;26:801–804. doi: 10.1007/s10822-012-9577-7.
  • 103. O'Boyle N.M., Morley C., Hutchison G.R. Pybel: a Python wrapper for the Open Babel cheminformatics toolkit. Chem. Cent. J. 2008;2:1–7. doi: 10.1186/1752-153X-2-5.
  • 104. Dong J., Cao D.-S., Miao H.-Y., Liu S., Deng B.-C., Yun Y.-H., Wang N.-N., Lu A.-P., Zeng W.-B., Chen A.F. ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. J. Cheminform. 2015;7 doi: 10.1186/s13321-015-0109-z.
  • 105. Steinbeck C., Han Y., Kuhn S., Horlacher O., Luttmann E., Willighagen E. The Chemistry Development Kit (CDK): an open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci. 2003;43:493–500. doi: 10.1021/ci025584y.
  • 106.Ramsundar B. Molecular Machine Learning with DeepChem. PhD thesis. Stanford University; 2018. https://www.proquest.com/dissertations-theses/molecular-machine-learning-with-deepchem/docview/2437211533/se-2?accountid=17077
  • 107.Veličković P., Cucurull G., Casanova A., Romero A., Liò P., Bengio Y. Graph Attention Networks. Preprint at arXiv. 2017. doi: 10.48550/arXiv.1710.10903.
  • 108.Zhao T., Hu Y., Valsdottir L.R., Zang T., Peng J. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief. Bioinform. 2021;22:2141–2150. doi: 10.1093/bib/bbaa044.
  • 109.Egan S., Fedorko W., Lister A., Pearkes J., Gay C. Long Short-Term Memory (LSTM) networks with jet constituents for boosted top tagging at the LHC. Preprint at arXiv. 2017. doi: 10.48550/arXiv.1711.09059.
  • 110.Liu B., Zhu Y. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank. IEEE Access. 2019;7:102499–102507. doi: 10.1109/ACCESS.2019.2929363.
  • 111.Zhang W., Ji L., Chen Y., Tang K., Wang H., Zhu R., Jia W., Cao Z., Liu Q. When drug discovery meets web search: learning to rank for ligand-based virtual screening. J. Cheminform. 2015;7:5–13. doi: 10.1186/s13321-015-0052-z.
  • 112.Suzuki S.D., Ohue M., Akiyama Y. PKRank: a novel learning-to-rank method for ligand-based virtual screening using pairwise kernel and RankSVM. Artif. Life Robot. 2018;23:205–212. doi: 10.1007/s10015-017-0416-8.
  • 113.Ye J., McGinnis S., Madden T.L. BLAST: improvements for better sequence analysis. Nucleic Acids Res. 2006;34:W6–W9. doi: 10.1093/nar/gkl164.
  • 114.Katoh K., Misawa K., Kuma K.i., Miyata T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 2002;30:3059–3066. doi: 10.1093/nar/30.14.3059.
  • 115.Kim W., Mirdita M., Levy K.E., Gilchrist C.L., Schweke H., Söding J., Levy E.D., Steinegger M. Rapid and sensitive protein complex alignment with Foldseek-Multimer. Nat. Methods. 2025;20:1–4. doi: 10.1038/s41592-025-02593-7.
  • 116.Bemis G.W., Murcko M.A. The properties of known drugs. 1. Molecular frameworks. J. Med. Chem. 1996;39:2887–2893. doi: 10.1021/jm9602928.
  • 117.Ru X., Zhao S., Zou Q., Xu L. Identify potential drug candidates within a high-quality compound search space. Brief. Bioinform. 2024;26:bbaf024. doi: 10.1093/bib/bbaf024.
  • 118.Schreyer A.M., Blundell T. USRCAT: real-time ultrafast shape recognition with pharmacophoric constraints. J. Cheminform. 2012;4 doi: 10.1186/1758-2946-4-27.
  • 119.Cereto-Massagué A., Ojeda M.J., Valls C., Mulero M., Garcia-Vallvé S., Pujadas G. Molecular fingerprint similarity search in virtual screening. Methods. 2015;71:58–63. doi: 10.1016/j.ymeth.2015.01.002.
  • 120.Aldahdooh J., Tanoli Z., Tang J. Mining drug–target interactions from biomedical literature using chemical and gene descriptions-based ensemble transformer model. Bioinform. Adv. 2024;4 doi: 10.1093/bioadv/vbae106.
  • 121.Aldahdooh J., Vähä-Koskela M., Tang J., Tanoli Z. Using BERT to identify drug–target interactions from whole PubMed. BMC Bioinf. 2022;23:245. doi: 10.1186/s12859-022-04768-x.
  • 122.UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 2019;47:D506–D515. doi: 10.1093/nar/gky1049.
  • 123.Protein Data Bank. Nature New Biol. 1971;233:223. doi: 10.1038/newbio233223b0.
  • 124.Geer L.Y., Marchler-Bauer A., Geer R.C., Han L., He J., He S., Liu C., Shi W., Bryant S.H. The NCBI BioSystems database. Nucleic Acids Res. 2010;38:D492–D496. doi: 10.1093/nar/gkp858.
  • 125.Hunter S., Apweiler R., Attwood T.K., Bairoch A., Bateman A., Binns D., Bork P., Das U., Daugherty L., Duquenne L., et al. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009;37:D211–D215. doi: 10.1093/nar/gkn785.
  • 126.Marchler-Bauer A., Lu S., Anderson J.B., Chitsaz F., Derbyshire M.K., DeWeese-Scott C., Fong J.H., Geer L.Y., Geer R.C., Gonzales N.R., et al. CDD: a Conserved Domain Database for the functional annotation of proteins. Nucleic Acids Res. 2011;39:D225–D229. doi: 10.1093/nar/gkq1189.
  • 127.Placzek S., Schomburg I., Chang A., Jeske L., Ulbrich M., Tillack J., Schomburg D. BRENDA in 2017: new perspectives and new tools in BRENDA. Nucleic Acids Res. 2017;45:D380–D388. doi: 10.1093/nar/gkw952.
  • 128.Gene Ontology Consortium, Aleksander S.A., Balhoff J., Carbon S., Cherry J.M., Drabkin H.J., Ebert D., Feuermann M., Gaudet P., Harris N.L., et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224 doi: 10.1093/genetics/iyad031.
  • 129.Kim S., Chen J., Cheng T., Gindulyte A., He J., He S., Li Q., Shoemaker B.A., Thiessen P.A., Yu B., et al. PubChem 2023 update. Nucleic Acids Res. 2023;51:D1373–D1380. doi: 10.1093/nar/gkac956.
  • 130.Mendez D., Gaulton A., Bento A.P., Chambers J., De Veij M., Félix E., Magariños M.P., Mosquera J.F., Mutowo P., Nowotka M., et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47:D930–D940. doi: 10.1093/nar/gky1075.
  • 131.Chen J., Swamidass S.J., Dou Y., Bruand J., Baldi P. ChemDB: a public database of small molecules and related chemoinformatics resources. Bioinformatics. 2005;21:4133–4139. doi: 10.1093/bioinformatics/bti683.
  • 132.Knox C., Wilson M., Klinger C.M., Franklin M., Oler E., Wilson A., Pon A., Cox J., Chin N.E.L., Strawbridge S.A., et al. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res. 2024;52:D1265–D1275. doi: 10.1093/nar/gkad976.
  • 133.Irwin J.J., Tang K.G., Young J., Dandarchuluun C., Wong B.R., Khurelbaatar M., Moroz Y.S., Mayfield J., Sayle R.A. ZINC20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 2020;60:6065–6073. doi: 10.1021/acs.jcim.0c00675.
  • 134.Pence H.E., Williams A. ChemSpider: an online chemical information resource. J Chem Edu. 2010;87:1123–1124. doi: 10.1021/ed100697w.
  • 135.Ursu O., Holmes J., Knockel J., Bologa C.G., Yang J.J., Mathias S.L., Nelson S.J., Oprea T.I. DrugCentral: online drug compendium. Nucleic Acids Res. 2017;45:D932–D939. doi: 10.1093/nar/gkw993.
  • 136.Schwartz L.M., Woloshin S., Zheng E., Tse T., Zarin D.A. ClinicalTrials.gov and Drugs@FDA: a comparison of results reporting for new drug approval trials. Ann. Intern. Med. 2016;165:421–430. doi: 10.7326/M15-2658.
  • 137.Gilson M.K., Liu T., Baitaluk M., Nicola G., Hwang L., Chong J. BindingDB in 2015: a public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 2016;44:D1045–D1053. doi: 10.1093/nar/gkv1072.
  • 138.Kanehisa M., Furumichi M., Tanabe M., Sato Y., Morishima K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 2017;45:D353–D361. doi: 10.1093/nar/gkw1092.
  • 139.Freshour S.L., Kiwala S., Cotto K.C., Coffman A.C., McMichael J.F., Song J.J., Griffith M., Griffith O.L., Wagner A.H. Integration of the Drug–Gene Interaction Database (DGIdb 4.0) with open crowdsource efforts. Nucleic Acids Res. 2021;49:D1144–D1151. doi: 10.1093/nar/gkaa1084.
  • 140.Judson R.S., Magpantay F.M., Chickarmane V., Haskell C., Tania N., Taylor J., Xia M., Huang R., Rotroff D.M., Filer D.L., et al. Integrated model of chemical perturbations of a biological pathway using 18 in vitro high-throughput screening assays for the estrogen receptor. Toxicol. Sci. 2015;148:137–154. doi: 10.1093/toxsci/kfv168.
  • 141.Kuhn M., Szklarczyk D., Pletscher-Frankild S., Blicher T.H., von Mering C., Jensen L.J., Bork P. STITCH 4: integration of protein–chemical interactions with user data. Nucleic Acids Res. 2014;42:D401–D407. doi: 10.1093/nar/gkt1207.
  • 142.Günther S., Kuhn M., Dunkel M., Campillos M., Senger C., Petsalaki E., Ahmed J., Urdiales E.G., Gewiess A., Jensen L.J. SuperTarget and Matador: resources for exploring drug–target relationships. Nucleic Acids Res. 2007;36:D919–D922. doi: 10.1093/nar/gkm862.
  • 143.Du H., Zhang X., Wu Z., Zhang O., Gu S., Wang M., Zhu F., Li D., Hou T., Pan P. CovalentInDB 2.0: an updated comprehensive database for structure-based and ligand-based covalent inhibitor design and screening. Nucleic Acids Res. 2025;53:D1322–D1327. doi: 10.1093/nar/gkae946.
  • 144.Duong Nguyen T.T., Tanoli Z., Hassan S., Özcan U.O., Caroli J., Kooistra A.J., Gloriam D.E., Hauser A.S. PGxDB: an interactive web-platform for pharmacogenomics research. Nucleic Acids Res. 2025;53:D1486–D1497. doi: 10.1093/nar/gkae1127.
  • 145.Tanoli Z., Alam Z., Ianevski A., Wennerberg K., Vähä-Koskela M., Aittokallio T. Interactive visual analysis of drug–target interaction networks using Drug Target Profiler, with applications to precision medicine and drug repurposing. Brief. Bioinform. 2020;21:211–220. doi: 10.1093/bib/bby119.
  • 146.Chan W.K.B., Zhang H., Yang J., Brender J.R., Hur J., Özgür A., Zhang Y. GLASS: a comprehensive database for experimentally validated GPCR–ligand associations. Bioinformatics. 2015;31:3035–3042. doi: 10.1093/bioinformatics/btv302.
  • 147.Davis A.P., Grondin C.J., Johnson R.J., Sciaky D., Wiegers J., Wiegers T.C., Mattingly C.J. Comparative Toxicogenomics Database (CTD): update 2021. Nucleic Acids Res. 2021;49:D1138–D1143. doi: 10.1093/nar/gkaa891.
  • 148.Tanoli Z., Alam Z., Vähä-Koskela M., Ravikumar B., Malyutina A., Jaiswal A., Tang J., Wennerberg K., Aittokallio T. Drug Target Commons 2.0: a community platform for systematic analysis of drug–target interaction profiles. Database. 2018;2018 doi: 10.1093/database/bay083.
  • 149.Zhou Y., Zhang Y., Zhao D., Yu X., Shen X., Zhou Y., Wang S., Qiu Y., Chen Y., Zhu F. TTD: Therapeutic Target Database describing target druggability information. Nucleic Acids Res. 2024;52:D1465–D1477. doi: 10.1093/nar/gkad751.
  • 150.Barbarino J.M., Whirl-Carrillo M., Altman R.B., Klein T.E. PharmGKB: a worldwide resource for pharmacogenomic information. Wiley Interdiscip. Rev. Syst. Biol. Med. 2018;10 doi: 10.1002/wsbm.1417.
  • 151.Gillespie M., Jassal B., Stephan R., Milacic M., Rothfels K., Senff-Ribeiro A., Griss J., Sevilla C., Matthews L., Gong C., et al. The Reactome pathway knowledgebase 2022. Nucleic Acids Res. 2022;50:D687–D692. doi: 10.1093/nar/gkab1028.
  • 152.Szklarczyk D., Gable A.L., Nastou K.C., Lyon D., Kirsch R., Pyysalo S., Doncheva N.T., Legeay M., Fang T., Bork P., et al. The STRING database in 2021: customizable protein–protein networks, and functional characterization of user-uploaded gene/measurement sets. Nucleic Acids Res. 2021;49:D605–D612. doi: 10.1093/nar/gkaa1074.
  • 153.Kamburov A., Herwig R. ConsensusPathDB 2022: molecular interactions update as a resource for network biology. Nucleic Acids Res. 2022;50:D587–D595. doi: 10.1093/nar/gkab1128.
  • 154.Keshava Prasad T.S., Goel R., Kandasamy K., Keerthikumar S., Kumar S., Mathivanan S., Telikicherla D., Raju R., Shafreen B., Venugopal A., et al. Human Protein Reference Database—2009 update. Nucleic Acids Res. 2009;37:D767–D772. doi: 10.1093/nar/gkn892.
  • 155.Karp P.D., Billington R., Caspi R., Fulcher C.A., Latendresse M., Kothari A., Keseler I.M., Krummenacker M., Midford P.E., Ong Q., et al. The BioCyc collection of microbial genomes and metabolic pathways. Brief. Bioinform. 2019;20:1085–1093. doi: 10.1093/bib/bbx085.
  • 156.Kerrien S., Aranda B., Breuza L., Bridge A., Broackes-Carter F., Chen C., Duesbury M., Dumousseau M., Feuermann M., Hinz U., et al. The IntAct molecular interaction database in 2012. Nucleic Acids Res. 2012;40:D841–D846. doi: 10.1093/nar/gkr1088.
  • 157.Oughtred R., Rust J., Chang C., Breitkreutz B.J., Stark C., Willems A., Boucher L., Leung G., Kolas N., Zhang F., et al. The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci. 2021;30:187–200. doi: 10.1002/pro.3978.
  • 158.Urán Landaburu L., Berenstein A.J., Videla S., Maru P., Shanmugam D., Chernomoretz A., Agüero F. TDR Targets 6: driving drug discovery for human pathogens through intensive chemogenomic data integration. Nucleic Acids Res. 2020;48:D992–D1005. doi: 10.1093/nar/gkz999.
  • 159.Li Z.-C., Huang M.-H., Zhong W.-Q., Liu Z.-Q., Xie Y., Dai Z., Zou X.-Y. Identification of drug–target interaction from interactome network with ‘guilt-by-association’ principle and topology features. Bioinformatics. 2016;32:1057–1064. doi: 10.1093/bioinformatics/btv695.
  • 160.Weiss K., Khoshgoftaar T.M., Wang D. A survey of transfer learning. J. Big Data. 2016;3:9. doi: 10.1186/s40537-016-0043-6.
  • 161.Zhang Y., Yang Q. An overview of multi-task learning. Natl. Sci. Rev. 2018;5:30–43. doi: 10.1093/nsr/nwx105.
  • 162.Wang Y., Yao Q., Kwok J.T., Ni L.M. Generalizing from a few examples: a survey on few-shot learning. ACM Comput. Surv. 2020;53:1–34. doi: 10.1145/3386252.
  • 163.Hospedales T., Antoniou A., Micaelli P., Storkey A. Meta-learning in neural networks: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 2022;44:5149–5169. doi: 10.1109/TPAMI.2021.3079209.
  • 164.Fonseca J., Bacao F. Improving active learning performance through the use of data augmentation. Int. J. Intell. Syst. 2023;2023 doi: 10.1155/2023/7941878.
  • 165.Cheng Y., Gong Y., Liu Y., Song B., Zou Q. Molecular design in drug discovery: a comprehensive review of deep generative models. Brief. Bioinform. 2021;22 doi: 10.1093/bib/bbab344.
  • 166.Kuang T., Liu P., Ren Z. Impact of domain knowledge and multi-modality on intelligent molecular property prediction: a systematic survey. Big Data Min. Anal. 2024;7:858–888. doi: 10.26599/BDMA.2024.9020028.
  • 167.Ma J., Dong C., Ji C. MicroRNA and drug resistance. Cancer Gene Ther. 2010;17:523–531. doi: 10.1038/cgt.2010.18.
  • 168.Raiaan M.A.K., Mukta M.S.H., Fatema K., Fahad N.M., Sakib S., Mim M.M.J., Ahmad J., Ali M.E., Azam S. A review on large language models: architectures, applications, taxonomies, open issues and challenges. IEEE Access. 2024;12:26839–26874. doi: 10.1109/ACCESS.2024.3365742.
  • 169.Comet J.-P., Aude J.-C., Glémet E., Risler J.-L., Hénaut A., Slonimski P.P., Codani J.-J. Significance of Z-value statistics of Smith–Waterman scores for protein alignments. Comput. Chem. 1999;23:317–331. doi: 10.1016/S0097-8485(99)00008-X.
  • 170.Bajusz D., Rácz A., Héberger K. Why is Tanimoto index an appropriate choice for fingerprint-based similarity calculations? J. Cheminform. 2015;7:20. doi: 10.1186/s13321-015-0069-3.
  • 171.Swarndeep S.J., Pandya S. An overview of partitioning algorithms in clustering techniques. Int J Adv Res Comput Eng Technol. 2016;5:1943–1946.
  • 172.Kolluri S., Lin J., Liu R., Zhang Y., Zhang W. Machine learning and artificial intelligence in pharmaceutical research and development: a review. AAPS J. 2022;24:19. doi: 10.1208/s12248-021-00644-3.
  • 173.Johansson-Åkhe I., Wallner B. Improving peptide–protein docking with AlphaFold-Multimer using forced sampling. Front. Bioinform. 2022;2 doi: 10.3389/fbinf.2022.959160.

Articles from Cell Reports Methods are provided here courtesy of Elsevier
