Briefings in Bioinformatics
2026 Feb 5;27(1):bbag034. doi: 10.1093/bib/bbag034

TriDTI: tri-modal representation learning with cross-modal alignment for drug–target interaction prediction

Gwang-Hyeon Yun 1, Jong-Hoon Park 2, Young-Rae Cho 3,4
PMCID: PMC12874921  PMID: 41642192

Abstract

The rapid advancement of artificial intelligence has positioned drug–target interaction (DTI) prediction as a promising approach in drug screening and drug discovery. Recent research has attempted to use pharmacological multimodal information to increase prediction accuracy. However, existing approaches fall short of fully utilizing three or more modalities, primarily due to information loss during the modality integration process. To overcome this challenge, we propose TriDTI, a novel framework that incorporates three modalities for both drugs and proteins. Specifically, TriDTI integrates structural, sequential, and relational modalities from both entities. To mitigate information loss during integration, we employ projection and cross-modal contrastive learning for modality alignment. Furthermore, we design a fusion strategy that combines soft attention and cross-attention to effectively integrate multimodal representations. Extensive experiments on three benchmark datasets demonstrate that TriDTI consistently achieves superior performance to existing state-of-the-art approaches in DTI prediction. Moreover, TriDTI exhibits a robust generalization ability across three challenging cold-start scenarios, effectively predicting interactions involving novel drugs, targets, and bindings. These results highlight the potential of TriDTI as a robust and practical framework for facilitating drug discovery. The source codes and datasets are publicly accessible at https://github.com/knhc1234/TriDTI.

Keywords: drug–target interaction prediction, tri-modal representation learning, modality alignment

Introduction

Predicting drug–target interactions (DTIs) is a fundamental challenge in drug screening and drug discovery [1, 2]. Traditional drug discovery pipelines are often constrained by high costs and long development cycles [3–5]. To overcome these limitations, diverse computational methods have been proposed, enabling both deeper analytical insights and more efficient prediction in DTI studies [6, 7]. These approaches can be broadly categorized into ligand-based, docking-based, and chemogenomic methods [8].

Ligand-based methods exploit structural similarities between ligands to infer DTIs, while docking-based methods estimate binding affinity by simulating the interactions between drug molecules and the 3D conformations of target proteins [9]. However, both methods are inherently restricted by the scarcity of experimentally verified ligands and reliable 3D structural data [10–12]. In contrast, chemogenomic methods address these limitations by directly leveraging molecular representations of drugs (e.g. SMILES) and protein sequences, thereby eliminating the reliance on 3D structural data or extensive ligand libraries. By enabling predictions for uncharacterized targets, this strategy greatly expands the applicability of computational drug discovery. Building on this foundation, deep learning models have emerged, offering diverse solutions for modeling DTIs. These models are commonly categorized by their treatment of drug embeddings into sequence- and structure-based methods [13].

Sequence-based methods predict DTIs directly from raw sequence data, typically encoding a drug’s SMILES code and a protein’s amino acid sequence into vector representations. For example, TransformerCPI [14] employs a Transformer architecture to jointly encode SMILES and protein sequences, generating predictions through a fully connected layer. HyperAttentionDTI [15] constructs feature matrices from each sequence using a convolutional neural network (CNN) block and captures complex noncovalent interactions between atoms and amino acids through an attention mechanism. More recently, DLM-DTI [16] leverages pretrained language models, specifically ChemBERTa [17] and ProtBERT [18], combined with a lightweight teacher–student learning strategy to enhance prediction efficiency. DrugKANs [19] proposes a novel paradigm that integrates Kolmogorov–Arnold Networks with sequential representations, demonstrating improved expressiveness and interpretability in modeling complex drug–target relationships.

In contrast, structure-based methods represent drugs as molecular graphs, capturing structural information that sequence-based embeddings may overlook. For instance, MGraphDTA [20] utilizes a multiscale graph neural network (GNN) for molecular graphs alongside a multiscale CNN for protein structural features. Similarly, MGMA-DTI [21] applies a 2-layer graph convolutional network (GCN) to molecular graphs and a multi-order gated convolution to protein sequences, integrating these features through an attention-based fusion module. Furthermore, GPS-DTI [22] uniquely enhances drug representation by employing a GPS layer [23], though it relies on ESM2 sequence embeddings refined by CNNs for protein feature extraction. However, DTI involves complex interactions situated within a wider biological context, leading some studies [24–26] to explore leveraging graph representation learning over heterogeneous biological information networks to capture global dependency patterns. Despite these attempts to utilize relational information, existing sequence-based and structure-based methods primarily rely on single-representation paradigms. Although computationally efficient due to their reliance on a single representation, these approaches are limited in capturing the full spectrum of multimodal information inherent to both drugs and proteins.

To overcome these limitations, recent studies have explored multimodal integration to enhance predictive performance. MCL-DTI [27] extracts features from both drug molecule images and chemical text information, which are then combined to form a multimodal drug representation fused with the target sequence for DTI prediction. In addition, MMDG-DTI [28] incorporates two complementary features: textual embeddings from pretrained language models, and structural embeddings derived from molecular graphs and protein sequence encoders. Despite the potential of multimodal integration, effectively optimizing these methods remains challenging, and they do not always outperform single-modality approaches in predictive accuracy.

Motivated by these challenges, we propose TriDTI, a novel framework that simultaneously leverages three distinct modalities for both drugs and proteins. Unlike prior approaches that rely on a single or dual representation, TriDTI incorporates structural, sequential, and relational features within a unified learning paradigm. Furthermore, cross-modal contrastive learning is employed to strengthen semantic alignment, and a dynamic fusion strategy adaptively balances modality contributions, enabling the capture of intricate DTI patterns often overlooked by previous models. Our contributions are summarized as follows:

  • Novel tri-modal framework: TriDTI is a novel DTI prediction model that jointly utilizes structural, sequential, and relational modalities for both drugs and proteins, expanding beyond the limitations of single- or dual-modality designs.

  • Enhanced modality alignment: We design a projection layer combined with cross-modal contrastive learning to enforce semantic consistency both across instances and between modalities, addressing the challenges of joint optimization in multimodal learning.

  • Adaptive fusion: We introduce a two-stage fusion mechanism in which soft attention dynamically weights modality-specific contributions and cross-attention models DTIs through interaction-aware representations, yielding more accurate DTI predictions.

Materials and methods

TriDTI consists of four main stages: (i) feature extraction, (ii) modality alignment, (iii) feature fusion, and (iv) classification. The overall architecture is shown in Fig. 1, and the details of each component are described in the following sections.

Figure 1.


The overall architecture of TriDTI. The model first extracts modality-specific features, including structural embeddings (molecular graphs and CNNs), sequence-based representations (SMILES and amino acid sequences), and network-derived relational features (subgraphs encoded with GATv2), and projects them into a unified latent space for cross-modal contrastive alignment. This is followed by a two-stage fusion mechanism in which soft attention adaptively balances each modality’s contribution and cross-attention models interaction-aware representations; the fused output is finally passed to a prediction layer to determine interaction probabilities.

Feature extraction

Structural feature

We explicitly encode the structural characteristics of drugs and proteins using graph and convolution architectures. Drug molecules are represented as graphs derived from their SMILES codes using RDKit, where atoms are nodes and bonds are edges. Each atom is encoded into a 79-dimensional feature vector encompassing properties such as atom type, bond degree, hydrogen count, implicit valence, and aromaticity. A 2-layer graph isomorphism network (GIN) is applied to capture the molecular topology:

$$h_v^{(l)} = \mathrm{MLP}^{(l)}\Big( (1 + \epsilon^{(l)})\, h_v^{(l-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l-1)} \Big) \tag{1}$$

where $h_v^{(l)}$ denotes the embedding of atom $v$ at layer $l$, $\mathcal{N}(v)$ is the set of neighbors of node $v$, and $\epsilon^{(l)}$ is a learnable scalar. The final molecular representations are obtained by averaging the embeddings of all atoms in the last layer, forming the drug-level embedding matrix $Z_d^{\mathrm{str}}$.
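The GIN update above can be illustrated with a minimal NumPy sketch on a toy three-atom molecule. The single-matrix "MLP", the path graph, and all weights are illustrative stand-ins for brevity, not the authors' implementation:

```python
import numpy as np

def gin_layer(H, adj, eps, W):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum over neighbors of h_u).
    The MLP is simplified here to a single linear map W followed by ReLU."""
    agg = (1.0 + eps) * H + adj @ H      # (1 + eps) * self + neighbor sum
    return np.maximum(agg @ W, 0.0)      # ReLU(linear(agg))

# Toy molecule: 3 atoms in a path graph 0-1-2, 4-dimensional atom features.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
H = np.eye(3, 4)                         # one-hot-style atom features
W = np.full((4, 4), 0.1)                 # toy weight matrix

H1 = gin_layer(H, adj, eps=0.1, W=W)     # first GIN layer
H2 = gin_layer(H1, adj, eps=0.1, W=W)    # second GIN layer
drug_embedding = H2.mean(axis=0)         # mean-pool atoms -> drug-level vector
```

Mean pooling over the last layer's atom embeddings yields one vector per drug, matching the construction of the drug-level embedding matrix described above.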

For proteins, we employ a multi-scale CNN to capture motifs of varying lengths from their amino acid sequences. The input sequences are first mapped to learnable embeddings and passed through three parallel convolutional branches with kernel sizes of 1, 3, and 5, respectively. Each branch consists of three convolutional layers that refine local features. The outputs are then aggregated by AdaptiveMaxPooling to produce the protein embedding matrix $Z_p^{\mathrm{str}}$, encoding functional motifs and multi-scale dependencies.
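The multi-scale protein encoder can likewise be sketched in NumPy for a single embedding channel. The random kernel weights, the sequence length, and the reduction of each branch to a scalar via max pooling are simplifying assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, kernel):
    """'Same'-padded 1D convolution over the sequence axis for one channel."""
    k = len(kernel)
    pad = k // 2
    xp = np.pad(x, pad)
    return np.array([xp[i:i + k] @ kernel for i in range(len(x))])

seq_len = 30
emb = rng.normal(size=(seq_len,))        # one channel of a learned AA embedding

branches = []
for k in (1, 3, 5):                      # three parallel kernel sizes
    h = emb
    for _ in range(3):                   # three conv layers per branch
        h = np.maximum(conv1d(h, rng.normal(size=k)), 0.0)  # conv + ReLU
    branches.append(h.max())             # adaptive max pooling per branch
protein_feat = np.array(branches)        # concatenated multi-scale features
```

Each kernel size captures motifs at a different scale; concatenating the pooled branch outputs gives the multi-scale protein feature described in the text.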

Sequential feature

Sequence-based embeddings provide semantic and contextual features that complement explicit structures. Token-level embeddings from pretrained large language models (LLMs) are mean-pooled to obtain sequence-level representations. For drugs, we adopt ChemBERTa, trained on large SMILES corpora, which captures chemical grammar and higher-order molecular patterns. This produces a sequence embedding matrix $Z_d^{\mathrm{seq}}$.

For proteins, we use ESM2-t33-650M-UR50D [29], a transformer model with 650M parameters trained on protein sequences. Its pooled embeddings form a matrix $Z_p^{\mathrm{seq}}$. These representations encode long-range dependencies relevant to folding and function. By anchoring on large-scale pretraining, these sequence-based representations offer stable and semantically rich priors for downstream modeling.

Relational feature

TriDTI captures relational information beyond individual entities by modeling dependencies within global interaction networks. This is achieved through relational subgraph sampling, a method that extracts relevant neighborhood topologies from drug–drug similarities and protein–protein interaction (PPI) networks to create localized representations.

The process for each entity is as follows. We first obtain node features for drugs from a pretrained LLM, ChemBERTa, denoted as $Z_d^{\mathrm{seq}}$, and for proteins from ESM2, denoted as $Z_p^{\mathrm{seq}}$. For drug entities, we construct a similarity network based on the cosine similarity of these ChemBERTa embeddings. We then perform subgraph sampling by reducing the network density to retain only the top-$k$ edges and extracting 2-hop subgraphs. Similarly, for protein entities, we leverage the STRING PPI network [30], whose nodes are initialized with the ESM2 embeddings. We sample subgraphs by applying top-$k$ sparsification based on confidence scores and deriving 2-hop subgraphs.
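The subgraph-sampling procedure (cosine-similarity network, top-$k$ sparsification, 2-hop extraction) can be sketched as follows. The per-node top-$k$ rule and the toy embeddings are assumptions for illustration; the paper does not specify the exact sparsification routine:

```python
import numpy as np

def topk_sparsify(sim, k):
    """Keep only the k strongest edges per node (excluding self-loops)."""
    n = sim.shape[0]
    adj = np.zeros_like(sim, dtype=bool)
    s = sim.copy()
    np.fill_diagonal(s, -np.inf)         # never select a self-loop
    for i in range(n):
        adj[i, np.argsort(s[i])[-k:]] = True
    return adj | adj.T                   # symmetrize the edge set

def two_hop_subgraph(adj, v):
    """Node set of the 2-hop neighborhood around node v."""
    hop1 = set(np.flatnonzero(adj[v]))
    hop2 = set()
    for u in hop1:
        hop2 |= set(np.flatnonzero(adj[u]))
    return sorted({v} | hop1 | hop2)

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 4))              # stand-in for ChemBERTa drug embeddings
norm = Z / np.linalg.norm(Z, axis=1, keepdims=True)
sim = norm @ norm.T                      # cosine-similarity network
adj = topk_sparsify(sim, k=2)
nodes = two_hop_subgraph(adj, v=0)       # localized relational subgraph for drug 0
```

The same routine applies to the STRING PPI network, with confidence scores taking the place of cosine similarities.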

Next, a 2-layer graph attention network version-2 (GATv2) [31] is applied to these subgraphs to aggregate relational information. The node update rule for the GATv2 is defined as:

$$h_i^{(l)} = \sigma\Big( \sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)} h_j^{(l-1)} \Big) \tag{2}$$

where $h_i^{(l)}$ is the embedding of node $i$ at layer $l$, and the attention weights $\alpha_{ij}$ are computed as:

$$\alpha_{ij} = \mathrm{softmax}_j\Big( a^{\top}\, \mathrm{LeakyReLU}\big( W^{(l)} [\, h_i^{(l-1)} \,\|\, h_j^{(l-1)} \,] \big) \Big) \tag{3}$$

The final relation embeddings for drugs and proteins are obtained by averaging the node embeddings within their respective subgraphs, which follows the same formulation:

$$z = \frac{1}{|\mathcal{V}_s|} \sum_{v \in \mathcal{V}_s} h_v^{(L)} \tag{4}$$

where $\mathcal{V}_s$ is the set of nodes in the sampled subgraph and $L$ is the final layer. Collecting these subgraph-level representations across all drugs and proteins yields the final relational embedding matrices $Z_d^{\mathrm{rel}}$ and $Z_p^{\mathrm{rel}}$. This formulation integrates local interaction patterns with global biological context, thereby complementing both structural and sequence-based features.
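A single-head NumPy sketch of the GATv2 update and the subgraph mean pooling described above. Applying the shared weight matrix to each endpoint before concatenation is an equivalent reparameterization used here for brevity; the complete subgraph and random weights are illustrative:

```python
import numpy as np

def gatv2_layer(H, adj, W, a):
    """GATv2 update (single head): score each neighbor with
    a^T LeakyReLU(W h_i || W h_j), softmax over neighbors, then aggregate."""
    n = H.shape[0]
    Wh = H @ W.T
    H_new = np.zeros_like(Wh)
    for i in range(n):
        nbrs = np.flatnonzero(adj[i])
        scores = []
        for j in nbrs:
            z = np.concatenate([Wh[i], Wh[j]])
            scores.append(a @ np.where(z > 0, z, 0.2 * z))  # LeakyReLU then a^T
        e = np.array(scores)
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                                # softmax over neighbors
        H_new[i] = (alpha[:, None] * Wh[nbrs]).sum(axis=0)  # weighted aggregation
    return H_new

rng = np.random.default_rng(2)
n, d = 5, 4
H = rng.normal(size=(n, d))                                 # sampled-subgraph node features
adj = np.ones((n, n), dtype=bool) ^ np.eye(n, dtype=bool)   # toy complete subgraph
W = rng.normal(size=(d, d))
a = rng.normal(size=(2 * d,))

H1 = gatv2_layer(H, adj, W, a)           # first GATv2 layer
H2 = gatv2_layer(H1, adj, W, a)          # second GATv2 layer
rel_embedding = H2.mean(axis=0)          # subgraph mean pooling -> relational vector
```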

Unlike prior graph-based DTI frameworks [32, 33] that construct a unified heterogeneous biological network and perform end-to-end message passing across multiple entity types, TriDTI instead adopts a modular relational representation strategy. Relational information is encoded independently through localized subgraph representations derived from drug–drug similarity and PPI networks, rather than through joint propagation over a single heterogeneous graph. This design enables relational features to complement sequential and structural modalities without entangling heterogeneous propagation paths, facilitating more flexible multimodal fusion while reducing reliance on large, densely connected biological networks.

Modality alignment

Effective integration of heterogeneous features from multiple modalities in TriDTI requires aligning embeddings in a shared latent space. Modality-specific projection networks are employed to map embeddings of varying dimensions into a unified space, ensuring both dimensional consistency and the ability to capture non-linear relationships. Formally, for a set of modality embeddings $\{z^m\}_{m \in \{\mathrm{str},\, \mathrm{seq},\, \mathrm{rel}\}}$, each embedding vector is transformed through a 2-layer feed-forward network with GELU activation:

$$\tilde{z}^m = W_2\, \mathrm{GELU}\big( W_1 z^m + b_1 \big) + b_2 \tag{5}$$

To further ensure that embeddings from different modalities are semantically aligned, a bidirectional cross-modality contrastive learning objective is applied. In this framework, the projected embeddings $\tilde{z}_i^a$ and $\tilde{z}_i^b$ form positive pairs for each entity $i$, while embeddings of different entities within the same modality serve as negatives. The directional loss from modality $a$ to modality $b$ is defined as:

$$\mathcal{L}_{a \to b} = -\frac{1}{MB} \sum_{m=1}^{M} \sum_{i=1}^{B} \log \frac{\exp\big( \mathrm{sim}(\tilde{z}_i^a, \tilde{z}_i^b) / \tau \big)}{\sum_{j=1}^{B} \exp\big( \mathrm{sim}(\tilde{z}_i^a, \tilde{z}_j^b) / \tau \big)} \tag{6}$$

where $M$ is the number of mini-batches, $B$ is the mini-batch size, $\mathrm{sim}(\cdot, \cdot)$ denotes cosine similarity, and $\tau$ is a temperature hyperparameter. The bidirectional loss

$$\mathcal{L}_{\mathrm{bi}}(a, b) = \mathcal{L}_{a \to b} + \mathcal{L}_{b \to a} \tag{7}$$

ensures symmetric alignment between modalities. The final contrastive loss is computed over selected modality pairs for both drugs and targets:

$$\mathcal{L}_{\mathrm{CL}} = \sum_{e \in \{d,\, p\}} \Big( \mathcal{L}_{\mathrm{bi}}\big(\mathrm{str}_e, \mathrm{seq}_e\big) + \mathcal{L}_{\mathrm{bi}}\big(\mathrm{rel}_e, \mathrm{seq}_e\big) \Big) \tag{8}$$

focusing on aligning other modalities to the pretrained sequential representations. By encouraging closeness among embeddings of the same entity across modalities while separating embeddings of different entities within each modality, this modality alignment step promotes consistent, discriminative, and semantically coherent representations across the tri-modal feature space, enhancing the predictive capability of TriDTI.
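The bidirectional contrastive objective can be illustrated with a NumPy InfoNCE sketch, where matched rows across two modalities are positives and the remaining rows in the batch act as negatives. The batch size, dimensionality, and noise level are illustrative choices:

```python
import numpy as np

def info_nce(Za, Zb, tau=0.1):
    """Directional contrastive loss from modality a to modality b:
    row i of Za is pulled toward row i of Zb and pushed from all other rows."""
    Za = Za / np.linalg.norm(Za, axis=1, keepdims=True)
    Zb = Zb / np.linalg.norm(Zb, axis=1, keepdims=True)
    sim = Za @ Zb.T / tau                       # cosine similarity / temperature
    sim -= sim.max(axis=1, keepdims=True)       # numerical stability
    logp = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(logp))              # -log p(positive) averaged over batch

def bidirectional_loss(Za, Zb, tau=0.1):
    """Symmetric alignment: sum of both directional losses."""
    return info_nce(Za, Zb, tau) + info_nce(Zb, Za, tau)

rng = np.random.default_rng(3)
Z_seq = rng.normal(size=(8, 16))                # anchor: sequential embeddings
Z_struct = Z_seq + 0.05 * rng.normal(size=(8, 16))  # well-aligned second modality
Z_rand = rng.normal(size=(8, 16))               # unaligned embeddings

aligned = bidirectional_loss(Z_struct, Z_seq)
random_ = bidirectional_loss(Z_rand, Z_seq)
```

As expected, the loss is much smaller for the well-aligned pair than for the unaligned one, which is exactly the gradient signal that pulls the structural and relational embeddings toward the sequential anchor.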

Feature fusion

TriDTI employs a two-stage attention-based fusion strategy to integrate heterogeneous modality embeddings of drugs and proteins. This approach balances modality-specific strengths while mitigating redundancy and noise, yielding interaction-specific representations that capture both entity-level and pair-level dependencies.

First, a soft attention module adaptively weighs the contribution of each modality. Given the projected modality features $\{\tilde{z}^m\}$ for an entity, the attention scores $\alpha_m$ are computed using a two-layer multi-layer perceptron (MLP) with Tanh activation and normalized across modalities via a softmax function. The fused entity representation is then obtained as a weighted sum of modality embeddings:

$$z = \sum_{m} \alpha_m\, \tilde{z}^m, \qquad \alpha_m = \mathrm{softmax}_m\big( \mathrm{MLP}(\tilde{z}^m) \big) \tag{9}$$

Second, the fused drug and protein embeddings are refined through a bidirectional cross-attention module. In this design, the query $Q$ originates from one entity, while the key $K$ and value $V$ are projected from the other, enabling each entity to selectively attend to features of its counterpart. Formally, the cross-attention from drug to protein is defined as

$$\mathrm{CA}_{d \to p} = \mathrm{softmax}\Big( \frac{Q_d K_p^{\top}}{\sqrt{d_k}} \Big) V_p \tag{10}$$

with a symmetric formulation for $\mathrm{CA}_{p \to d}$. Residual connections are then applied to preserve entity-specific information while incorporating complementary interaction cues, leading to the final embeddings:

$$z_d^{\ast} = z_d + \mathrm{CA}_{d \to p} \tag{11}$$
$$z_p^{\ast} = z_p + \mathrm{CA}_{p \to d} \tag{12}$$

Here, $z_d^{\ast}$ and $z_p^{\ast}$ serve as the final drug and protein representations, simultaneously retaining modality-integrated features and cross-entity contextual information, which form the basis for downstream interaction prediction.
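The two-stage fusion can be sketched end-to-end for a single drug–protein pair. With one token per entity the cross-attention weight degenerates to 1, so the sketch mainly illustrates the data flow; all weights are random stand-ins rather than trained parameters:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(4)
d = 8

# Stage 1: soft attention over the three modality embeddings of one entity.
Z = rng.normal(size=(3, d))                     # [structural, sequential, relational]
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, 1))
scores = (np.tanh(Z @ W1) @ W2).ravel()         # two-layer MLP with Tanh
alpha = softmax(scores)                         # modality weights, sum to 1
z_drug = alpha @ Z                              # weighted sum -> fused drug vector
z_prot = rng.normal(size=(d,))                  # fused protein vector (placeholder)

# Stage 2: cross-attention (drug query attends to protein key/value),
# followed by a residual connection.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
q, k, v = z_drug @ Wq, z_prot @ Wk, z_prot @ Wv
attn = softmax(np.array([q @ k / np.sqrt(d)]))  # single-token case -> weight 1
z_drug_final = z_drug + attn[0] * v             # residual + attended context
```

In the full model both entities carry multiple tokens, so the attention distribution is nontrivial and each side selectively emphasizes its counterpart's features.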

Classification

The final representations of drugs and proteins, enhanced by the bidirectional cross-attention module, are combined to predict the probability of interaction. Specifically, the two vectors $z_d^{\ast}$ and $z_p^{\ast}$ are concatenated to form a unified representation $z = [z_d^{\ast} \,\|\, z_p^{\ast}]$, which is then fed into an MLP-based classifier. The classifier consists of multiple fully connected layers interleaved with GELU activation functions and dropout regularization, enabling it to capture complex nonlinear dependencies between drugs and proteins. Formally, the prediction is obtained as

$$\hat{y} = \sigma\big( \mathrm{MLP}([\, z_d^{\ast} \,\|\, z_p^{\ast} \,]) \big) \tag{13}$$

where $\hat{y} \in [0, 1]$ denotes the predicted interaction probability, and $\sigma$ is the sigmoid activation function.

Overall loss function

To optimize both prediction accuracy and modality consistency, the model is trained with a composite loss function that combines binary cross-entropy (BCE) loss and cross-modality contrastive loss. The BCE loss directly supervises DTI prediction by minimizing the discrepancy between the predicted probability $\hat{y}$ and the ground-truth label $y$:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \Big( y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \Big) \tag{14}$$

The total loss is defined as a weighted sum of BCE loss and the previously defined contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda\, \mathcal{L}_{\mathrm{CL}} \tag{15}$$

where $\lambda$ is a hyperparameter that balances prediction accuracy and modality alignment. In our experiments, we set $\lambda = 0.0001$ (see Table 2) to provide a small but effective regularization from the contrastive objective. This joint optimization encourages the model not only to maximize predictive performance but also to maintain semantic consistency across heterogeneous modalities, thereby enhancing both generalization and representation quality.
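The composite objective reduces to a few lines of NumPy; the default weight of 1e-4 mirrors the contrastive weight reported in Table 2, and the labels, probabilities, and contrastive value below are illustrative:

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy between labels and predicted probabilities."""
    y_prob = np.clip(y_prob, eps, 1 - eps)      # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def total_loss(y_true, y_prob, contrastive, lam=1e-4):
    """Composite objective: L = L_BCE + lambda * L_CL."""
    return bce_loss(y_true, y_prob) + lam * contrastive

y = np.array([1.0, 0.0, 1.0, 0.0])              # toy ground-truth labels
p = np.array([0.9, 0.2, 0.8, 0.1])              # toy predicted probabilities
loss = total_loss(y, p, contrastive=2.0)
```

Because lambda is small, the contrastive term acts as a mild regularizer that shapes the embedding space without overwhelming the supervised BCE signal.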

Results

Datasets

We employed three publicly available benchmark datasets for evaluation: DAVIS [34], BioSNAP [35], and DrugBank [36]. The DAVIS dataset consists of 68 drugs and 379 target proteins, providing experimental measurements of drug–target binding affinities. Following prior work, we binarized the affinity values by treating drug–target pairs with dissociation constant ($K_d$) values below 30 as positive interactions and all others as negative, thus reformulating the task into a binary classification problem. For BioSNAP and DrugBank, we used the preprocessed versions from MolTrans [37] and HyperAttentionDTI [15], respectively. In these versions, drug–target pairs were extracted from the original datasets, and negative sampling was applied to ensure an approximately 1:1 ratio of positive to negative interactions. To maintain data integrity, we further removed drug samples with invalid SMILES strings that could not be converted into molecular graphs.
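The DAVIS binarization rule can be stated directly; the affinity values below are illustrative rather than drawn from the dataset:

```python
import numpy as np

# Binarize DAVIS-style affinities: pairs with Kd below the threshold of 30
# (as in the text) become positive interactions, all others negative.
kd = np.array([5.0, 30.0, 120.0, 12.5, 3000.0])
labels = (kd < 30).astype(int)
```

Note that the threshold is strict: an affinity of exactly 30 is labeled negative under this rule.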

To incorporate relational knowledge, we leveraged the PPI dataset from STRING [30] that provides probabilistic confidence scores for functional associations between proteins. Using STRING PPIs, we constructed separate PPI networks for each benchmark by including only the proteins present in the corresponding DTI dataset. This approach ensures that the relational information is specific to each benchmark while capturing the functional associations relevant to the modeled proteins. These networks were subsequently integrated as an additional modality input to our model. The statistics of the resulting experimental datasets are summarized in Table 1.

Table 1.

Statistics of the benchmark datasets for our experiments.

Dataset    Drugs   Targets   Positive DTIs   Negative DTIs   PPIs
DAVIS         68       379            1506            9597    15,734
BioSNAP     4502      2181          13,811          13,622   193,212
DrugBank    6645      4254          17,511          17,511   237,405

Experimental settings

For a robust assessment, we adopted five-fold cross-validation. Each dataset was split into training, validation, and test sets in a 7:1:2 ratio. Model performance was evaluated using four standard metrics: area under the receiver operating characteristic curve (AUROC), area under the precision–recall curve (AUPRC), F1 score, and accuracy. Training was conducted using the AdamW optimizer with a learning rate of 5e-4, a batch size of 16, and a dropout rate of 0.1 for up to 100 epochs. Analysis of the training dynamics (see Supplementary Section S4) confirmed that the model consistently converged within this epoch limit, demonstrating stable optimization. The model parameters achieving the highest AUROC on the validation set were selected for reporting final test results. Detailed hyperparameter configurations for TriDTI are provided in Table 2, and a sensitivity analysis of the modality alignment hyperparameters ($\tau$ and $\lambda$) is presented in Supplementary Table S1. To ensure a fair and reproducible comparison, all baseline models were rigorously trained, validated, and tested using the identical data splits employed for TriDTI. For model implementations, we adhered to the hyperparameters and configurations explicitly reported in the original work. Where details were unavailable or incompatible with our datasets, hyperparameters were empirically tuned to reflect the scale and characteristics of each dataset.

Table 2.

Hyperparameter configurations for TriDTI across the DAVIS, BioSNAP, and DrugBank datasets.

Hyperparameter                    DAVIS       BioSNAP    DrugBank
Structural feature
 GIN input dim                    79          79         79
 GIN output dim                   128         64         64
 CNN input dim                    128         64         64
 CNN output dim                   128         64         64
Sequential feature
 ChemBERTa input dim              510         510        510
 ChemBERTa output dim             768         768        768
 ESM2 input dim                   1024        1024       1024
 ESM2 output dim                  1280        1280       1280
Relational feature
 Drug GATv2 hidden dim            128         64         64
 Target GATv2 hidden dim          128         64         64
Modality alignment
 Projection dim                   (128, 128)  (128, 64)  (256, 64)
 Contrastive temperature (τ)      0.1         0.1        0.1
 Contrastive weight (λ)           0.0001      0.0001     0.0001
Modality fusion
 Soft attention hidden dim        (128, 3)    (64, 3)    (64, 3)
 Cross-attention output dim       128         64         64
 Cross-attention num heads        8           8          8

Performance evaluation

TriDTI consistently achieved the best performance among all existing state-of-the-art models across all three benchmark datasets, as summarized in Table 3. On the DAVIS dataset, TriDTI recorded an AUROC of 0.9391 and an AUPRC of 0.7605, corresponding to relative improvements of 0.24% and 0.88% over the previous best-performing model, GPS-DTI. While MGMA-DTI reported a higher F1 score, its performance across the other metrics did not generalize as well. In contrast, TriDTI demonstrated a uniformly strong and balanced predictive capability across all other major evaluation metrics, recording a high accuracy of 0.9234.

Table 3.

DTI prediction performance on DAVIS, BioSNAP, and DrugBank datasets, where values indicate the mean and standard deviation over five-fold cross-validation.

Dataset Methods AUROC AUPRC F1 Accuracy
DAVIS TransformerCPI 0.8399 ± 0.0125 0.5329 ± 0.0066 0.5141 ± 0.0394 0.8723 ± 0.0073
MGraphDTA 0.9211 ± 0.0118 0.7064 ± 0.0163 0.6843 ± 0.0160 0.9087 ± 0.0053
HyperAttentionDTI 0.9221 ± 0.0108 0.7214 ± 0.0133 0.6911 ± 0.0168 0.9184 ± 0.0024
MCL-DTI 0.8967 ± 0.0114 0.7050 ± 0.0241 0.6660 ± 0.0225 0.9180 ± 0.0057
DLM-DTI 0.9290 ± 0.0114 0.7436 ± 0.0249 0.7083 ± 0.0203 0.9194 ± 0.0058
MMDG-DTI 0.9166 ± 0.0058 0.7155 ± 0.0242 0.6848 ± 0.0134 0.9094 ± 0.0068
MGMA-DTI 0.8937 ± 0.0072 0.6735 ± 0.0252 0.8311 ± 0.0102 0.8212 ± 0.0354
GPS-DTI 0.9368 ± 0.0069 0.7538 ± 0.0138 0.7245 ± 0.0129 0.9244 ± 0.0048
TriDTI 0.9391 ± 0.0031 0.7605 ± 0.0114 0.7186 ± 0.0100 0.9234 ± 0.0014
BioSNAP TransformerCPI 0.8714 ± 0.0040 0.8773 ± 0.0050 0.7977 ± 0.0038 0.7877 ± 0.0097
MGraphDTA 0.9049 ± 0.0026 0.9117 ± 0.0030 0.8316 ± 0.0029 0.8263 ± 0.0035
HyperAttentionDTI 0.9122 ± 0.0035 0.9181 ± 0.0041 0.8410 ± 0.0053 0.8391 ± 0.0072
MCL-DTI 0.8773 ± 0.0025 0.8788 ± 0.0037 0.8079 ± 0.0049 0.8060 ± 0.0043
DLM-DTI 0.9115 ± 0.0031 0.9158 ± 0.0025 0.8420 ± 0.0068 0.8418 ± 0.0051
MMDG-DTI 0.9093 ± 0.0022 0.9149 ± 0.0035 0.8393 ± 0.0021 0.8345 ± 0.0023
MGMA-DTI 0.8905 ± 0.0040 0.8946 ± 0.0069 0.8180 ± 0.0052 0.8131 ± 0.0083
GPS-DTI 0.9256 ± 0.0039 0.9259 ± 0.0056 0.8594 ± 0.0057 0.8555 ± 0.0068
TriDTI 0.9274 ± 0.0030 0.9280 ± 0.0029 0.8605 ± 0.0039 0.8567 ± 0.0067
DrugBank TransformerCPI 0.8451 ± 0.0051 0.8480 ± 0.0071 0.7729 ± 0.0035 0.7679 ± 0.0031
MGraphDTA 0.8780 ± 0.0042 0.8823 ± 0.0063 0.8032 ± 0.0039 0.7948 ± 0.0073
HyperAttentionDTI 0.8878 ± 0.0035 0.8922 ± 0.0046 0.8112 ± 0.0036 0.8066 ± 0.0052
MCL-DTI 0.8450 ± 0.0032 0.8435 ± 0.0051 0.7762 ± 0.0038 0.7733 ± 0.0037
DLM-DTI 0.8990 ± 0.0051 0.9008 ± 0.0034 0.8238 ± 0.0074 0.8181 ± 0.0132
MMDG-DTI 0.8768 ± 0.0179 0.8760 ± 0.0225 0.8064 ± 0.0133 0.7934 ± 0.0171
MGMA-DTI 0.8676 ± 0.0036 0.8693 ± 0.0107 0.7944 ± 0.0033 0.7826 ± 0.0075
GPS-DTI 0.9120 ± 0.0019 0.9101 ± 0.0029 0.8431 ± 0.0039 0.8395 ± 0.0049
TriDTI 0.9182 ± 0.0042 0.9180 ± 0.0068 0.8477 ± 0.0036 0.8458 ± 0.0037

Note: The best and second-best results are shown in bold and underline, respectively.

The advantage of TriDTI is further substantiated on the BioSNAP and DrugBank datasets, where its overall superiority is more pronounced. For BioSNAP, TriDTI achieved the highest results across all four metrics: AUROC (0.9274), AUPRC (0.9280), F1 score (0.8605), and accuracy (0.8567). Similarly, TriDTI obtained the best performance on DrugBank, recording an AUROC of 0.9182, AUPRC of 0.9180, F1 score of 0.8477, and accuracy of 0.8458. When compared against the average performance of all other baseline models, these results demonstrate a more substantial margin of improvement. For instance, TriDTI surpasses the average AUROC and AUPRC of all competing models by 2.92% and 2.52% on BioSNAP, and by 4.55% and 4.38% on DrugBank, respectively. These results highlight the effectiveness of TriDTI’s modality-integrated representation learning, achieving superior and consistent performance across diverse datasets.

Ablation study

We further analyzed the contribution of individual modalities and the importance of key components in TriDTI. By systematically removing specific modalities or architectural modules, we evaluated how each element influenced the overall predictive performance. The experimental results are summarized in Fig. 2.

Figure 2.


Ablation study results of TriDTI on the DAVIS, BioSNAP, and DrugBank datasets. The figure presents two comparative analyses: (a) Modality contribution analysis assesses the contribution of individual feature sources by comparing the full model against variants where a single or dual input modality is excluded. (b) Module ablation study validates the functional necessity of core architectural units by comparing the full model against variants excluding each modular component. Bars represent the mean and standard deviation over five-fold cross-validation, reported by AUROC.

Modality contribution analysis

The contribution of each modality was analyzed by comparing single-, dual-, and tri-modality configurations. Among single-modality settings, the sequence-only model consistently achieved the best performance across all datasets, whereas relational and structural modalities exhibited relatively lower accuracy. This finding highlights sequence-based semantic information from pretrained language models as the most informative signal for DTI prediction.

Models that included the sequence modality generally maintained strong performance, indicating its robustness across different datasets. However, performance gains were not always guaranteed when two modalities were combined. In several cases, dual-modality models underperformed the sequence-only baseline, suggesting that naive feature fusion does not necessarily lead to improved predictions. Notable exceptions were observed for BioSNAP and DrugBank, where integrating sequence and relational modalities yielded performance improvements, implying complementary contributions from relational information. In contrast, the joint utilization of all three modalities consistently improved performance across all datasets. This outcome demonstrates that full multimodal integration enables TriDTI to capture complementary information beyond what is accessible through single or limited dual-modality configurations. In addition, the soft attention weights offered insight into how the model adaptively emphasizes different modalities based on dataset characteristics (see Supplementary Section S2).

Module ablation study

To validate the necessity of the proposed architecture, we assessed the functional role of TriDTI’s core modules by comparing the full framework against various ablated variants. Across all datasets, the complete model consistently outperformed its ablated variants, confirming the effectiveness of the proposed design. Removing the contrastive learning module resulted in a performance degradation of 2.09% on average. This degradation shows that explicit cross-modal alignment is crucial for learning robust multimodal embeddings, as its absence hinders the model’s ability to fully exploit the complementary nature of heterogeneous features. Furthermore, as shown in Supplementary Section S4, analysis of the training dynamics confirmed that the contrastive objective led to enhanced convergence stability and superior validation AUROC.
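Cross-modal contrastive alignment of the kind ablated here is commonly implemented as a symmetric InfoNCE loss. The numpy sketch below is an assumed, generic formulation for illustration (the function name `info_nce` and temperature `tau` are not taken from the paper):

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """Symmetric InfoNCE loss aligning two modality views of the same entities.

    Rows of `za` and `zb` are projected embeddings of the same entities in two
    modalities; row i of `za` is the positive for row i of `zb`, and every
    other row serves as a negative.
    """
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / tau                       # (n, n) scaled similarities

    def ce_diag(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))             # cross-entropy on positives

    return 0.5 * (ce_diag(logits) + ce_diag(logits.T))   # both directions

aligned = info_nce(np.eye(4), np.eye(4))                  # matched views
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))  # mismatched
```

A well-aligned pair of views yields a much lower loss than a shuffled pairing, which is the pressure that pulls matching modality embeddings together.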

The attention-based fusion mechanism was also validated through its components. Excluding the soft attention module reduced performance by 1.30% on average, suggesting that selectively emphasizing informative features within each modality contributes to improving prediction accuracy. A comparable performance drop of 1.31% on average resulted from the removal of the cross-attention module. This result emphasizes the benefit of modeling pairwise interactions at the drug–target level. Overall, the ablation results confirmed that each architectural component meaningfully contributes to the final performance, and that combining contrastive alignment with attention-based fusion is crucial for effective multimodal integration in TriDTI.
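The pairwise drug–target modeling attributed to the cross-attention module can be sketched with plain scaled dot-product attention. This is a simplified illustration without learned query/key/value projections; token counts and dimensions are arbitrary:

```python
import numpy as np

def cross_attend(queries, keys_values):
    """Single-head scaled dot-product cross-attention (no learned projections):
    each query token attends over the other entity's tokens."""
    d = queries.shape[1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (n_q, n_kv)
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)   # each row sums to 1
    return attn @ keys_values                       # context vector per query

rng = np.random.default_rng(0)
drug_tokens = rng.standard_normal((5, 16))    # e.g. atom-level drug embeddings
prot_tokens = rng.standard_normal((9, 16))    # e.g. residue-level embeddings
drug_ctx = cross_attend(drug_tokens, prot_tokens)   # drug attends to protein
prot_ctx = cross_attend(prot_tokens, drug_tokens)   # protein attends to drug
```

Running attention in both directions, as above, is what makes the interaction representation mutual rather than one-sided.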

Model interpretability

Contrastive learning plays an important role in shaping the quality of the representation space. Figure 3 presents t-SNE visualizations of the joint drug–target embeddings produced by TriDTI on the BioSNAP dataset, comparing models trained with and without the contrastive learning objective. As illustrated in the figure, embeddings generated with contrastive learning form more clearly separated and structured clusters corresponding to interaction and non-interaction labels. In contrast, embeddings obtained without contrastive learning show substantial overlap between classes, indicating reduced discriminative capability. These observations suggest that contrastive learning guides the model to organize the representation space in a way that better captures underlying DTI patterns. Furthermore, detailed analysis and visualization of the bidirectional cross-attention mechanism (see Supplementary Section S3) confirmed that the model learns robust, mutual interaction representations exhibiting complementary attention patterns.
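The figure-based comparison is qualitative. As a purely illustrative complement, cluster separation in a labeled embedding set can be quantified with a crude between/within ratio; the function `separation_ratio` below is an assumption for exposition, not a metric used in the paper:

```python
import numpy as np

def separation_ratio(emb, labels):
    """Between-class centroid distance divided by mean within-class spread
    for a binary-labeled embedding set; higher means better-separated clusters."""
    emb, labels = np.asarray(emb, float), np.asarray(labels)
    c0, c1 = emb[labels == 0], emb[labels == 1]
    between = np.linalg.norm(c0.mean(axis=0) - c1.mean(axis=0))
    within = 0.5 * (c0.std() + c1.std())
    return between / within

rng = np.random.default_rng(0)
# Two synthetic embedding sets: tight, well-separated clusters vs heavy overlap
tight = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
loose = np.vstack([rng.normal(0, 2.0, (50, 2)), rng.normal(1, 2.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
```

On such synthetic data, the well-separated set scores far higher than the overlapping one, mirroring the qualitative contrast seen in the with/without-contrastive panels.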

Figure 3.

t-SNE visualization of joint drug–target embeddings on the BioSNAP dataset. The left panel shows embeddings from the model trained with contrastive learning; the right panel shows embeddings learned without the contrastive objective. Embeddings learned with contrastive learning form clearly separated and structured clusters of interaction and non-interaction samples, whereas the non-contrastive counterpart shows substantial overlap between classes, indicating weaker discriminative representation learning.

Cold-start settings

Table 5.

DTI prediction performance comparison on BioSNAP under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.

Model Unseen Drug Unseen Target Unseen Binding
AUROC AUPRC AUROC AUPRC AUROC AUPRC
TransformerCPI 0.8661 ± 0.0094 0.8768 ± 0.0071 0.7267 ± 0.0366 0.7477 ± 0.0510 0.7040 ± 0.0618 0.7262 ± 0.0809
MGraphDTA 0.8571 ± 0.0089 0.8735 ± 0.0067 0.7652 ± 0.0268 0.7907 ± 0.0381 0.6866 ± 0.0608 0.7244 ± 0.0702
HyperAttentionDTI 0.8694 ± 0.0104 0.8838 ± 0.0092 0.7868 ± 0.0219 0.8214 ± 0.0254 0.7065 ± 0.0596 0.7473 ± 0.0721
MCL-DTI 0.8150 ± 0.0164 0.8321 ± 0.0097 0.7168 ± 0.0271 0.7447 ± 0.0437 0.6399 ± 0.0417 0.6749 ± 0.0700
DLM-DTI 0.8266 ± 0.0538 0.8492 ± 0.0431 0.8388 ± 0.0138 0.8552 ± 0.0209 0.7213 ± 0.0655 0.7550 ± 0.0818
MMDG-DTI 0.8691 ± 0.0105 0.8856 ± 0.0081 0.8104 ± 0.0171 0.8339 ± 0.0125 0.7503 ± 0.0554 0.7852 ± 0.0661
MGMA-DTI 0.8660 ± 0.0079 0.8745 ± 0.0083 0.6689 ± 0.0292 0.6904 ± 0.0435 0.6388 ± 0.0445 0.6651 ± 0.0684
GPS-DTI 0.8735 ± 0.0156 0.8825 ± 0.0166 0.8684 ± 0.0122 0.8804 ± 0.0198 0.7882 ± 0.0446 0.8110 ± 0.0581
TriDTI 0.8834 ± 0.0108 0.8899 ± 0.0135 0.8670 ± 0.0073 0.8750 ± 0.0220 0.7983 ± 0.0305 0.8080 ± 0.0395

Note: The best and second-best results are shown in bold and underline, respectively.

A cold-start scenario, where a model encounters previously unseen drugs, targets, or binding pairs, constitutes one of the most challenging settings in DTI prediction. Under these settings, TriDTI demonstrated strong performance across the DAVIS, BioSNAP, and DrugBank datasets, as summarized in Tables 4–6. In the Unseen Drug setting, TriDTI showed comparatively lower performance on the DAVIS dataset than some baseline methods. However, it achieved the best results on both BioSNAP and DrugBank in terms of AUROC and AUPRC, suggesting effective generalization to previously unseen compounds in larger and more diverse chemical spaces. In the Unseen Target and Unseen Binding settings, TriDTI consistently ranked among the top two methods across all datasets, demonstrating robust generalization under diverse cold-start conditions. In particular, GPS-DTI exhibited notably strong performance in the Unseen Target scenario, which is likely attributable to its reliance on large-scale pretrained protein representations from ESM2. Overall, these results indicate that TriDTI is well suited for real-world DTI prediction scenarios, where new compounds and targets are continuously introduced.
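The three cold-start protocols can be sketched as follows. This is a simple stdlib illustration of the split logic (unseen drug, unseen target, or both unseen); the paper's exact fold construction may differ:

```python
import random

def cold_start_split(pairs, mode, test_frac=0.2, seed=0):
    """Split DTI pairs so that test-set entities are unseen during training.

    `pairs` is a list of (drug_id, target_id) tuples. `mode` selects the
    protocol: 'drug' (unseen drugs), 'target' (unseen targets), or
    'binding' (both drug and target unseen).
    """
    rng = random.Random(seed)
    drugs = sorted({d for d, _ in pairs})
    targets = sorted({t for _, t in pairs})
    held_d = set(rng.sample(drugs, max(1, int(len(drugs) * test_frac))))
    held_t = set(rng.sample(targets, max(1, int(len(targets) * test_frac))))
    if mode == 'drug':
        test = [p for p in pairs if p[0] in held_d]
        train = [p for p in pairs if p[0] not in held_d]
    elif mode == 'target':
        test = [p for p in pairs if p[1] in held_t]
        train = [p for p in pairs if p[1] not in held_t]
    else:  # 'binding': neither the drug nor the target appears in training
        test = [p for p in pairs if p[0] in held_d and p[1] in held_t]
        train = [p for p in pairs if p[0] not in held_d and p[1] not in held_t]
    return train, test

pairs = [(f"d{i}", f"t{j}") for i in range(10) for j in range(5)]
train, test = cold_start_split(pairs, 'drug')
train_b, test_b = cold_start_split(pairs, 'binding')
```

The Unseen Binding protocol is the strictest: it discards pairs mixing held-out and retained entities, so both sides of every test pair are novel.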

Table 4.

DTI prediction performance comparison on DAVIS under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.

Model Unseen Drug Unseen Target Unseen Binding
AUROC AUPRC AUROC AUPRC AUROC AUPRC
TransformerCPI 0.7483 ± 0.0304 0.3470 ± 0.0789 0.7972 ± 0.0342 0.4546 ± 0.0961 0.7212 ± 0.0575 0.3086 ± 0.1324
MGraphDTA 0.7230 ± 0.0450 0.3554 ± 0.0966 0.8492 ± 0.0432 0.5314 ± 0.1315 0.6314 ± 0.0775 0.2028 ± 0.0691
HyperAttentionDTI 0.7400 ± 0.0297 0.3676 ± 0.1100 0.8714 ± 0.0292 0.5955 ± 0.0986 0.6525 ± 0.0777 0.2656 ± 0.1165
MCL-DTI 0.7260 ± 0.0376 0.3446 ± 0.0865 0.7871 ± 0.0391 0.4477 ± 0.0979 0.6674 ± 0.0776 0.2530 ± 0.1316
DLM-DTI 0.7313 ± 0.0414 0.3861 ± 0.0918 0.8247 ± 0.0605 0.5334 ± 0.1519 0.7016 ± 0.0808 0.2902 ± 0.0452
MMDG-DTI 0.7409 ± 0.0852 0.3748 ± 0.1159 0.8529 ± 0.0383 0.5474 ± 0.1067 0.6490 ± 0.1232 0.2665 ± 0.1509
MGMA-DTI 0.7420 ± 0.0489 0.3746 ± 0.0738 0.7260 ± 0.0545 0.3883 ± 0.0876 0.5729 ± 0.1287 0.1977 ± 0.0944
GPS-DTI 0.6904 ± 0.0503 0.3318 ± 0.0430 0.8870 ± 0.0255 0.6280 ± 0.1019 0.6931 ± 0.0432 0.2597 ± 0.0239
TriDTI 0.7302 ± 0.0357 0.3345 ± 0.0717 0.8923 ± 0.0297 0.6328 ± 0.0896 0.7909 ± 0.0488 0.4202 ± 0.0712

Note: The best and second-best results are shown in bold and underline, respectively.

Table 6.

DTI prediction performance comparison on DrugBank under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.

Model Unseen Drug Unseen Target Unseen Binding
AUROC AUPRC AUROC AUPRC AUROC AUPRC
TransformerCPI 0.7674 ± 0.0322 0.3572 ± 0.0761 0.7240 ± 0.0159 0.7295 ± 0.0158 0.6892 ± 0.0098 0.6860 ± 0.0251
MGraphDTA 0.8316 ± 0.0095 0.8407 ± 0.0108 0.7573 ± 0.0053 0.7839 ± 0.0025 0.6911 ± 0.0062 0.7030 ± 0.0165
HyperAttentionDTI 0.8335 ± 0.0052 0.8426 ± 0.0049 0.7814 ± 0.0202 0.8091 ± 0.0164 0.6970 ± 0.0331 0.6950 ± 0.0463
MCL-DTI 0.7596 ± 0.0172 0.7729 ± 0.0107 0.6619 ± 0.0164 0.6796 ± 0.0134 0.5646 ± 0.0173 0.5585 ± 0.0314
DLM-DTI 0.8478 ± 0.0117 0.8514 ± 0.0117 0.8372 ± 0.0107 0.8461 ± 0.0107 0.7579 ± 0.0056 0.7615 ± 0.0104
MMDG-DTI 0.8332 ± 0.0194 0.8397 ± 0.0196 0.7780 ± 0.0374 0.7953 ± 0.0334 0.7071 ± 0.0154 0.7219 ± 0.0281
MGMA-DTI 0.8284 ± 0.0103 0.8346 ± 0.0124 0.6919 ± 0.0244 0.7011 ± 0.0229 0.6318 ± 0.0174 0.6183 ± 0.0197
GPS-DTI 0.8487 ± 0.0074 0.8572 ± 0.0040 0.8681 ± 0.0155 0.8776 ± 0.0155 0.7774 ± 0.0226 0.7841 ± 0.0248
TriDTI 0.8688 ± 0.0086 0.8717 ± 0.0058 0.8664 ± 0.0106 0.8725 ± 0.0102 0.7943 ± 0.0161 0.7913 ± 0.0239

Note: The best and second-best results are shown in bold and underline, respectively.

Case study

The cold-start analysis demonstrated TriDTI’s strong generalization to unseen data; this case study further highlights the model’s practical utility for real-world drug discovery. To validate predictions of unknown DTIs, we used the DrugBank dataset. We first filtered out all known drug–target pairs and used the remaining candidate pool as input to our model, which yielded a list of the 10 most promising novel candidates. After excluding one pair that lacked a 3D PDB structure, we subjected the remaining nine candidates to molecular docking simulations for validation.
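The candidate-pool construction described above can be sketched in a few lines. The helper names (`novel_candidate_pool`, `top_k`, `score_fn`) are illustrative assumptions; the actual scoring is TriDTI's prediction head:

```python
def novel_candidate_pool(drugs, targets, known_pairs):
    """All drug–target combinations minus the known interactions; the
    remaining pairs form the pool the model scores."""
    known = set(known_pairs)
    return [(d, t) for d in drugs for t in targets if (d, t) not in known]

def top_k(candidates, score_fn, k=10):
    """Rank candidates by model score (higher = more likely to interact)."""
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Toy example with stand-in scores
drugs, targets = ["DB_a", "DB_b"], ["P_x", "P_y"]
pool = novel_candidate_pool(drugs, targets, known_pairs=[("DB_a", "P_x")])
scores = {("DB_a", "P_y"): 0.9, ("DB_b", "P_x"): 0.4, ("DB_b", "P_y"): 0.7}
ranked = top_k(pool, scores.get, k=2)
```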

To further substantiate our predictions, we used the CB-Dock2 [38] docking server to compute Vina scores for the nine candidates. The detailed docking results, including the Vina score, cavity volume, center coordinates, and docking size for each pair, are presented in Table 7. Every pair yielded a binding affinity score below −5 kcal/mol; in docking analysis, a Vina score below −5 kcal/mol is generally considered a strong indicator of a potential DTI, with more negative values suggesting a more robust binding ability. The docking outcomes for the top two candidates are further visualized in Fig. 4, which shows their binding poses and key interactions with the target proteins.
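The −5 kcal/mol screening criterion can be applied programmatically. The sketch below uses the Vina scores reported in Table 7; the function `strong_binders` is an illustrative helper, not part of the published pipeline:

```python
# Vina scores (kcal/mol) from Table 7; more negative = stronger predicted binding
vina = {
    ("DB11638", "P08235"): -7.6,  ("DB00753", "P08235"): -5.3,
    ("DB00637", "P08913"): -10.7, ("DB07973", "P08913"): -9.6,
    ("DB06144", "P08913"): -9.9,  ("DB01043", "P08235"): -7.0,
    ("DB05422", "P08913"): -9.4,  ("DB08685", "P34903"): -5.8,
    ("DB05316", "P08913"): -9.7,
}

def strong_binders(scores, threshold=-5.0):
    """Pairs below the threshold, ordered from most to least negative score."""
    hits = [pair for pair, s in scores.items() if s < threshold]
    return sorted(hits, key=scores.get)

ranked = strong_binders(vina)
```

All nine pairs clear the threshold, with DB00637–P08913 (−10.7 kcal/mol) ranked first, consistent with Table 7.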

Table 7.

Docking results for the top nine drug–protein pairs selected by TriDTI

Drug ID Protein ID Vina score (kcal/mol) Cavity volume (Å³) Center (x, y, z) Docking size (x, y, z)
DB11638 P08235 −7.6 458 64, 58, −2 18, 18, 18
DB00753 P08235 −5.3 436 122, 24, 22 16, 16, 16
DB00637 P08913 −10.7 6 −5, −12, 10 26, 26, 26
DB07973 P08913 −9.6 6 −5, −12, 10 23, 23, 23
DB06144 P08913 −9.9 6 −5, −12, 10 25, 25, 25
DB01043 P08235 −7.0 436 122, 24, 22 16, 16, 16
DB05422 P08913 −9.4 6 −5, −12, 10 24, 24, 24
DB08685 P34903 −5.8 3772 142, 102, 133 28, 29, 35
DB05316 P08913 −9.7 6 −5, −12, −10 25, 25, 25

Figure 4.

3D visualization of molecular docking poses for the top-ranked drug–target pairs predicted by TriDTI. (a) Highest-ranked prediction: DB11638 interacting with P08235. (b) Second-ranked prediction: DB00753 interacting with P08235. Both panels illustrate the binding orientations and spatial interactions within the target protein’s active site.

It should be noted that docking scores alone do not constitute experimental validation of DTIs. Rather, these results provide supportive, structure-based evidence that the model-predicted pairs are physically plausible and merit further investigation. Taken together, this case study demonstrates that TriDTI can effectively prioritize candidate drug–target pairs that are favorable for downstream structure-based analysis, thereby serving as a useful computational screening tool in the early stages of drug discovery.

Conclusion

In this study, we present TriDTI, a novel deep learning framework designed to address the limitations of traditional DTI prediction models. The model simultaneously integrates three complementary modalities for both drugs and proteins: sequential representations from LLMs, structural features from molecular graphs and amino acid sequences, and relational information from biological networks. To balance the contributions of these heterogeneous modalities, we adopt a cross-modal contrastive learning strategy that enhances semantic alignment across feature spaces. In addition, a dynamic attention-based fusion mechanism is introduced to maximize predictive accuracy by adaptively weighting modality-specific contributions and modeling DTI patterns. Extensive experiments demonstrate that TriDTI consistently achieves the best performance across three benchmark datasets. Moreover, validation under cold-start scenarios and molecular docking case studies highlights its strong generalization capacity and practical utility in discovering novel drug–target pairs.

Although TriDTI is a useful tool for DTI prediction, several avenues remain for future exploration. First, while our current design incorporates pretrained LLM-based features, pretraining the molecular graph modality on large-scale datasets [39, 40] could further alleviate the imbalance among heterogeneous modalities and enhance structural representations. Second, although TriDTI effectively utilizes relational features through drug–drug similarities and PPIs, it currently does not rely on a comprehensive heterogeneous biological information network containing multiple entity types (e.g. diseases or side-effects). A promising direction involves augmenting the relational modality by incorporating such comprehensive networks and leveraging advanced heterogeneous graph representation learning methods [41, 42]. Third, TriDTI does not yet incorporate explicit 3D structural data, despite employing CNNs to model 1D protein sequences [43, 44]. Therefore, integrating 3D conformational information, potentially through geometric deep learning or structure-informed representations, would allow us to capture spatial interaction patterns more effectively. Fourth, the framework can be extended to integrate additional complementary modalities for both drugs and proteins, such as molecular images or textual descriptions, to achieve an even richer multimodal representation. Such extensions will further strengthen TriDTI’s capability and establish it as an even more versatile tool for advancing computational drug discovery.

Key Points.

  • We propose TriDTI, a novel tri-modal framework that integrates structural, sequential, and relational modalities to learn comprehensive representations by capturing diverse features of both drugs and proteins.

  • The model employs a cross-modal contrastive learning strategy to enforce semantic alignment across disparate embedding spaces, effectively minimizing information loss during the integration of heterogeneous features.

  • A two-stage adaptive fusion mechanism, combining soft attention and cross-attention, is designed to dynamically balance modality contributions and precisely model interaction-aware representations.

Supplementary Material

bbag034_Supplemental_File

Contributor Information

Gwang-Hyeon Yun, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.

Jong-Hoon Park, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.

Young-Rae Cho, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea; Department of Digital Healthcare, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.

Author contributions

Gwang-Hyeon Yun (Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review & editing), Jong-Hoon Park (Methodology, Software, Formal analysis, Investigation, Writing—review & editing, Visualization), Young-Rae Cho (Conceptualization, Writing—original draft, Writing—review & editing, Resources, Supervision, Project administration, Funding acquisition)

Conflict of interest

None declared.

Funding

This research was supported by a National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (grant no. RS-2025-16067916), the Basic Science Research Program through the NRF funded by the Ministry of Education (grant no. RS-2025-25432868), and the Regional Innovation System & Education (RISE) program through the Gangwon RISE Center funded by the Ministry of Education and the Gangwon State, Republic of Korea (grant no. 2025-RISE-10-006).

Data availability

The codes and datasets are available online at https://github.com/knhc1234/TriDTI.

References

  • 1. Zhangli L, Song G, Zhu H. et al. DTIAM: a unified framework for predicting drug-target interactions, binding affinities and drug mechanisms. Nat Commun 2025; 16:2548.
  • 2. Hua Y, Song X, Feng Z. et al. CPInformer for efficient and robust compound-protein interaction prediction. IEEE/ACM Trans Comput Biol Bioinform 2022; 20:285–96. 10.1109/TCBB.2022.3144008
  • 3. Talukder MA, Kazi M, Alazab A. Predicting drug-target interactions using machine learning with improved data balancing and feature engineering. Sci Rep 2025; 15:19495. 10.1038/s41598-025-03932-6
  • 4. Yun G-H, Park J-H, Cho Y-R. FACT: feature aggregation and convolution with transformers for predicting drug classification code. Bioinformatics 2025; 41:i77–85.
  • 5. Noor F, Junaid M, Almalki AH. et al. Deep learning pipeline for accelerating virtual screening in drug discovery. Sci Rep 2024; 14:28321. 10.1038/s41598-024-79799-w
  • 6. Qian Liao Y, Zhang YC, Ding Y. et al. Application of artificial intelligence in drug-target interactions prediction: a review. npj Biomed Innov 2025; 2:1.
  • 7. Wei J, Zhu Y, Zhuo L. et al. Efficient deep model ensemble framework for drug-target interaction prediction. J Phys Chem Lett 2024; 15:7681–93. 10.1021/acs.jpclett.4c01509
  • 8. Donghua Y, Liu H, Yao S. Drug–target interaction prediction based on improved heterogeneous graph representation learning and feature projection classification. Expert Syst Appl 2024; 252:124289. 10.1016/j.eswa.2024.124289
  • 9. Dong W, Yang Q, Wang J. et al. Multi-modality attribute learning-based method for drug–protein interaction prediction based on deep neural network. Brief Bioinform 2023; 24:bbad161. 10.1093/bib/bbad161
  • 10. Shan J, Sun J, Zheng H. MIF–DTI: a multimodal information fusion method for drug–target interaction prediction. Brief Bioinform 2025; 26:bbaf474. 10.1093/bib/bbaf474
  • 11. Zitnik M, Nguyen F, Wang B. et al. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Information Fusion 2019; 50:71–91. 10.1016/j.inffus.2018.09.012
  • 12. Li Y, Huang Y-A, You Z-H. et al. Drug-target interaction prediction based on drug fingerprint information and protein sequence. Molecules 2019; 24:2999. 10.3390/molecules24162999
  • 13. Shi W, Yang H, Xie L. et al. A review of machine learning-based methods for predicting drug–target interactions. Health Inf Sci Syst 2024; 12:30. 10.1007/s13755-024-00287-6
  • 14. Chen L, Tan X, Wang D. et al. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2020; 36:4406–14. 10.1093/bioinformatics/btaa524
  • 15. Zhao Q, Zhao H, Zheng K. et al. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 2021; 38:655–62. 10.1093/bioinformatics/btab715
  • 16. Lee J, Jun DW, Song I. et al. DLM-DTI: a dual language model for the prediction of drug-target interaction with hint-based learning. J Cheminform 2024; 16:14. 10.1186/s13321-024-00808-1
  • 17. Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020. 10.48550/arXiv.2010.09885
  • 18. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: towards cracking the language of life’s code through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2021; 44:7112–27.
  • 19. Xiangzheng F, Zhenya D, Chen Y. et al. DrugKANs: a paradigm to enhance drug-target interaction prediction with KANs. IEEE J Biomed Health Inform 2025; PP:1–12. 10.1109/JBHI.2025.3566931
  • 20. Yang Z, Zhong W, Zhao L. et al. MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction. Chem Sci 2022; 13:816–33. 10.1039/d1sc05180f
  • 21. Li C, Mi J, Wang H. et al. MGMA-DTI: drug target interaction prediction using multi-order gated convolution and multi-attention fusion. Comput Biol Chem 2025; 118:108449.
  • 22. Xiong A, Luo Z, Xia Y. et al. An interpretable geometric graph neural network for enhancing the generalizability of drug–target interaction prediction. BMC Biol 2025; 23:350. 10.1186/s12915-025-02456-9
  • 23. Rampášek L, Galkin M, Dwivedi VP. et al. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems 2022; 35:14501–15.
  • 24. Zhao T, Yang H, Valsdottir LR. et al. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief Bioinform 2020; 22:2141–50. 10.1093/bib/bbaa044
  • 25. Peng J, Wang Y, Guan J. et al. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform 2021; 22:bbaa430. 10.1093/bib/bbaa430
  • 26. Xiaorui S, Pengwei H, Yi H. et al. Predicting drug-target interactions over heterogeneous information network. IEEE J Biomed Health Inform 2023; 27:562–72. 10.1109/JBHI.2022.3219213
  • 27. Qian Y, Li X, Jian W. et al. MCL-DTI: using drug multimodal information and bi-directional cross-attention learning method for predicting drug–target interaction. BMC Bioinformatics 2023; 24:323.
  • 28. Hua Y, Feng Z, Song X. et al. MMDG-DTI: drug–target interaction prediction via multimodal feature fusion and domain generalization. Pattern Recogn 2025; 157:110887.
  • 29. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023; 379:1123–30. 10.1126/science.ade2574
  • 30. Szklarczyk D, Nastou K, Koutrouli M. et al. The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res 2025; 53:D730–7.
  • 31. Brody S, Alon U, Yahav E. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021. https://arxiv.org/abs/2105.14491
  • 32. Zhao B-W, Xiao-Rui S, Peng-Wei H. et al. iGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics 2023; 39:btad451. 10.1093/bioinformatics/btad451
  • 33. Zhao B-W, Xiao-Rui S, Yang Y. et al. Regulation-aware graph learning for drug repositioning over heterogeneous biological network. Inform Sci 2025; 686:121360. 10.1016/j.ins.2024.121360
  • 34. Davis MI, Hunt JP, Herrgard S. et al. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 2011; 29:1046–51. 10.1038/nbt.1990
  • 35. Zitnik M, Sosič R, Maheshwari S. et al. BioSNAP datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata.
  • 36. Knox C, Wilson M, Klinger CM. et al. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res 2024; 52:D1265–75. 10.1093/nar/gkad976
  • 37. Huang K, Xiao C, Glass LM. et al. MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 2021; 37:830–6.
  • 38. Liu Y, Yang X, Gan J. et al. CB-Dock2: improved protein–ligand blind docking by integrating cavity detection, docking and homologous template fitting. Nucleic Acids Res 2022; 50:W159–64. 10.1093/nar/gkac394
  • 39. Wang Y, Wang J, Cao Z. et al. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 2022; 4:279–87. 10.1038/s42256-022-00447-x
  • 40. Rong Y, Bian Y, Tingyang X. et al. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems 2020; 33:12559–71.
  • 41. Xiaorui S, Pengwei H, Li D. et al. Interpretable identification of cancer genes across biological networks via transformer-powered graph representation learning. Nat Biomed Eng 2025; 9:371–89. 10.1038/s41551-024-01312-5
  • 42. Xiangzheng F, Peng L, Chen H. et al. GRAPE: graph-regularized protein language modeling unlocks TCR-epitope binding specificity. Brief Bioinform 2025; 26:10. 10.1093/bib/bbaf522
  • 43. Zhao L, Wang H, Shi S. PocketDTA: an advanced multimodal architecture for enhanced prediction of drug-target affinity from 3D structural data of target binding pockets. Bioinformatics 2024; 40:btae594.
  • 44. Stärk H, Beaini D, Corso G. et al. 3D Infomax improves GNNs for molecular property prediction. In: International Conference on Machine Learning (Baltimore, MD, USA), PMLR 2022; 162:20479–502.
