Abstract
The rapid advancement of artificial intelligence has positioned drug–target interaction (DTI) prediction as a promising approach in drug screening and drug discovery. Recent research has attempted to use pharmacological multimodal information to increase prediction accuracy. However, existing approaches are limited in fully utilizing more than two modalities, primarily due to information loss during the modality integration process. To overcome this challenge, we propose TriDTI, a novel framework that incorporates three modalities for both drugs and proteins. Specifically, TriDTI integrates structural, sequential, and relational modalities from both entities. To mitigate information loss during integration, we employ projection and cross-modal contrastive learning for modality alignment. Furthermore, we design a fusion strategy that combines soft attention and cross-attention to effectively integrate multimodal representations. Extensive experiments on three benchmark datasets demonstrate that TriDTI consistently achieves performance superior to existing state-of-the-art approaches in DTI prediction. Moreover, TriDTI exhibits a robust generalization ability across three challenging cold-start scenarios, effectively predicting interactions involving novel drugs, targets, and bindings. These results highlight the potential of TriDTI as a robust and practical framework for facilitating drug discovery. The source code and datasets are publicly accessible at https://github.com/knhc1234/TriDTI.
Keywords: drug–target interaction prediction, tri-modal representation learning, modality alignment
Introduction
Predicting drug–target interactions (DTIs) is a fundamental challenge in drug screening and drug discovery [1, 2]. Traditional drug discovery pipelines are often constrained by high costs and long development cycles [3–5]. To overcome these limitations, diverse computational methods have been proposed, enabling both deeper analytical insights and more efficient prediction in DTI studies [6, 7]. These approaches can be broadly categorized into ligand-based, docking-based, and chemogenomic methods [8].
Ligand-based methods exploit structural similarities between ligands to infer DTIs, while docking-based methods estimate binding affinity by simulating the interactions between drug molecules and the 3D conformations of target proteins [9]. However, both methods are inherently restricted by the scarcity of experimentally verified ligands and reliable 3D structural data [10–12]. In contrast, chemogenomic methods address these limitations by directly leveraging molecular representations of drugs (e.g. SMILES) and protein sequences, thereby eliminating the reliance on 3D structural data or extensive ligand libraries. By enabling predictions for uncharacterized targets, this strategy greatly expands the applicability of computational drug discovery. Building on this foundation, deep learning models have emerged, offering diverse solutions for modeling DTIs. These models are commonly categorized by their treatment of drug embeddings into sequence- and structure-based methods [13].
Sequence-based methods predict DTIs directly from raw sequence data, typically encoding a drug’s SMILES code and a protein’s amino acid sequence into vector representations. For example, TransformerCPI [14] employs a Transformer architecture to jointly encode SMILES and protein sequences, generating predictions through a fully connected layer. HyperAttentionDTI [15] constructs feature matrices from each sequence using a convolutional neural network (CNN) block and captures complex noncovalent interactions between atoms and amino acids through an attention mechanism. More recently, DLM-DTI [16] leverages pretrained language models, specifically ChemBERTa [17] and ProtBERT [18], combined with a lightweight teacher–student learning strategy to enhance prediction efficiency. DrugKANs [19] proposes a novel paradigm that integrates Kolmogorov–Arnold Networks with sequential representations, demonstrating improved expressiveness and interpretability in modeling complex drug–target relationships.
In contrast, structure-based methods represent drugs as molecular graphs, capturing structural information that sequence-based embeddings may overlook. For instance, MGraphDTA [20] utilizes a multiscale graph neural network (GNN) for molecular graphs alongside a multiscale CNN for protein structural features. Similarly, MGMA-DTI [21] applies a 2-layer graph convolutional network (GCN) to molecular graphs and a multi-order gated convolution to protein sequences, integrating these features through an attention-based fusion module. Furthermore, GPS-DTI [22] uniquely enhances drug representation by employing a GPS layer [23], though it relies on ESM2 sequence embeddings refined by CNNs for protein feature extraction. However, DTI involves complex interactions situated within a wider biological context, leading some studies [24–26] to explore leveraging graph representation learning over heterogeneous biological information networks to capture global dependency patterns. Despite these attempts to utilize relational information, existing sequence-based and structure-based methods primarily rely on single-representation paradigms. Although computationally efficient due to their reliance on a single representation, these approaches are limited in capturing the full spectrum of multimodal information inherent to both drugs and proteins.
To overcome these limitations, recent studies have explored multimodal integration to enhance predictive performance. MCL-DTI [27] extracts features from both drug molecule images and chemical text information that are then combined to form a multimodal drug representation fused with the target sequence for DTI prediction. In addition, MMDG-DTI [28] incorporates two complementary features: textual embeddings from pretrained language models, and structural embeddings derived from molecular graphs and protein sequence encoders. Despite the potential of multimodal integration, effectively optimizing these methods remains challenging, and they do not always outperform single-modality approaches in predictive accuracy.
Motivated by these challenges, we propose TriDTI, a novel framework that simultaneously leverages three distinct modalities for both drugs and proteins. Unlike prior approaches that rely on a single or dual representation, TriDTI incorporates structural, sequential, and relational features within a unified learning paradigm. Furthermore, cross-modal contrastive learning is employed to strengthen semantic alignment, and a dynamic fusion strategy adaptively balances modality contributions, enabling the capture of intricate DTI patterns often overlooked by previous models. Our contributions are summarized as follows:
Novel tri-modal framework: TriDTI is a novel DTI prediction model to jointly utilize structural, sequential, and relational modalities for both drugs and proteins, expanding beyond the limitations of single- or dual-modality designs.
Enhanced modality alignment: We design a projection layer combined with cross-modal contrastive learning to enforce semantic consistency both across instances and between modalities, addressing the challenges of joint optimization in multimodal learning.
Adaptive fusion: We introduce a two-stage fusion mechanism in which soft attention dynamically weights modality-specific contributions and cross-attention models DTIs through interaction-aware representations, yielding more accurate DTI predictions.
Materials and methods
TriDTI consists of four main stages: (i) feature extraction, (ii) modality alignment, (iii) feature fusion, and (iv) classification. The overall architecture is shown in Fig. 1, and the details of each component are described in the following sections.
Figure 1.
The overall architecture of TriDTI. The model first extracts modality-specific features: structural embeddings (molecular graphs and CNNs), sequence-based representations (SMILES and amino acid sequences), and network-derived relational features (subgraphs encoded with GATv2). These features are projected into a unified latent space for cross-modal contrastive alignment, followed by a two-stage fusion mechanism in which soft attention adaptively balances each modality’s contribution and cross-attention models interaction-aware representations. The fused output is passed to a prediction layer to determine interaction probabilities.
Feature extraction
Structural feature
We explicitly encode the structural characteristics of drugs and proteins using graph and convolution architectures. Drug molecules are represented as graphs derived from their SMILES codes using RDKit, where atoms are nodes and bonds are edges. Each atom is encoded into a 79-dimensional feature vector encompassing properties such as atom type, bond degree, hydrogen count, implicit valence, and aromaticity. A 2-layer graph isomorphism network (GIN) is applied to capture the molecular topology:
$$\mathbf{h}_i^{(l)} = \mathrm{MLP}^{(l)}\!\Big(\big(1+\epsilon^{(l)}\big)\,\mathbf{h}_i^{(l-1)} + \sum_{j \in \mathcal{N}(i)} \mathbf{h}_j^{(l-1)}\Big), \tag{1}$$

where $\mathbf{h}_i^{(l)}$ denotes the embedding of atom $i$ at layer $l$, $\mathcal{N}(i)$ is the neighbors of node $i$, and $\epsilon^{(l)}$ is a learnable scalar. The final molecular representations are obtained by averaging the embeddings of all atoms in the last layer, forming the drug-level embedding matrix $X_d^{\mathrm{str}}$.
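The GIN update in Eq. (1) can be sketched in a few lines. Below is a minimal NumPy illustration on a toy 3-atom "molecule"; the weights, dimensions, and the one-layer ReLU MLP stand-in are illustrative assumptions, not the trained model's configuration.

```python
import numpy as np

def gin_layer(h, neighbors, eps, mlp):
    # GIN update (Eq. 1): h_i <- MLP((1 + eps) * h_i + sum_{j in N(i)} h_j)
    agg = np.stack([
        (1.0 + eps) * h[i] + sum((h[j] for j in neighbors[i]), np.zeros(h.shape[1]))
        for i in range(h.shape[0])
    ])
    return mlp(agg)

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 8))                 # 3 atoms, 8-dim toy features
neighbors = {0: [1], 1: [0, 2], 2: [1]}     # bonds of a linear 3-atom molecule
W = rng.normal(size=(8, 8)) * 0.1
mlp = lambda x: np.maximum(x @ W, 0.0)      # one-layer ReLU MLP stand-in

h1 = gin_layer(h, neighbors, eps=0.1, mlp=mlp)
drug_embedding = h1.mean(axis=0)            # mean-pool atoms -> drug-level vector
```

In practice this layer would be stacked twice (the paper uses a 2-layer GIN) and the per-atom features would come from RDKit-derived 79-dimensional descriptors.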
For proteins, we employ a multi-scale CNN to capture motifs of varying lengths from their amino acid sequences. The input sequences are first mapped to learnable embeddings and passed through three parallel convolutional branches with kernel sizes of 1, 3, and 5, respectively. Each branch consists of three convolutional layers that refine local features. The outputs are then aggregated by AdaptiveMaxPooling to produce the protein embedding matrix $X_p^{\mathrm{str}}$, encoding functional motifs and multi-scale dependencies.
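The multi-scale branching can be illustrated with a single-layer NumPy sketch (the paper uses three stacked convolutional layers per branch; dimensions and weights here are toy assumptions):

```python
import numpy as np

def conv1d(x, kernel):
    # 'Same'-padded 1D convolution; x: (L, C_in), kernel: (k, C_in, C_out)
    L, k = x.shape[0], kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], kernel, axes=([0, 1], [0, 1]))
                     for t in range(L)])

def multiscale_encode(x, kernels):
    # Parallel branches (kernel sizes 1/3/5), ReLU, global max-pool, concatenate
    return np.concatenate([np.maximum(conv1d(x, k), 0.0).max(axis=0) for k in kernels])

rng = np.random.default_rng(1)
seq = rng.normal(size=(40, 16))                            # 40 residues, 16-dim embeddings
kernels = [rng.normal(size=(ks, 16, 8)) * 0.1 for ks in (1, 3, 5)]
protein_vec = multiscale_encode(seq, kernels)              # 3 branches x 8 channels = 24-dim
```

The global max over sequence positions plays the role of AdaptiveMaxPooling to an output length of one.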
Sequential feature
Sequence-based embeddings provide semantic and contextual features that complement explicit structures. Token-level embeddings from pretrained large language models (LLMs) are mean-pooled to obtain sequence-level representations. For drugs, we adopt ChemBERTa, trained on large SMILES corpora, which captures chemical grammar and higher-order molecular patterns. This produces a sequence embedding matrix $X_d^{\mathrm{seq}}$.
For proteins, we use ESM2-t33-650M-UR50D [29], a transformer model with 650 M parameters trained on protein sequences. Its pooled embeddings form a matrix $X_p^{\mathrm{seq}}$. These representations encode long-range dependencies relevant to folding and function. By anchoring on large-scale pretraining, these sequence-based representations offer stable and semantically rich priors for downstream modeling.
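Mean pooling over token embeddings, masking out padding, is the only operation needed here. A minimal sketch with toy values (the 768-dim width mimics ChemBERTa-style outputs; the actual tokenizer and model calls are omitted):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # Sequence-level vector: average token embeddings, ignoring padded positions
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / mask.sum()

rng = np.random.default_rng(2)
tokens = rng.normal(size=(10, 768))          # toy stand-in for LLM token outputs
mask = np.array([1] * 7 + [0] * 3)           # last 3 positions are padding
drug_seq_vec = mean_pool(tokens, mask)
```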
Relational feature
TriDTI captures relational information beyond individual entities by modeling dependencies within global interaction networks. This is achieved through relational subgraph sampling, a method that extracts relevant neighborhood topologies from drug–drug similarities and protein–protein interaction (PPI) networks to create localized representations.
The process for each entity is as follows. We first obtain node features for drugs from a pretrained LLM, ChemBERTa, denoted as $E_d$, and for proteins from ESM2, denoted as $E_p$. For drug entities, we construct a similarity network based on the cosine similarity of these $E_d$ embeddings. We then perform subgraph sampling by reducing the network density to retain only the top-$k$ edges and extracting 2-hop subgraphs. Similarly, for protein entities, we leverage the STRING PPI network [30], whose nodes are initialized with the $E_p$ embeddings. We sample subgraphs by applying top-$k$ sparsification based on confidence scores and deriving 2-hop subgraphs.
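The sampling pipeline for the drug side can be sketched as cosine similarity, top-$k$ sparsification, then 2-hop extraction. All sizes below are toy assumptions:

```python
import numpy as np

def topk_edges(sim, k):
    # Keep each node's k strongest links (excluding self) as undirected edges
    edges = set()
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])
        for j in [j for j in order if j != i][:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

def two_hop_subgraph(edges, center):
    # Nodes reachable from `center` within 2 hops
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    hop1 = adj.get(center, set())
    hop2 = set().union(*(adj.get(u, set()) for u in hop1)) if hop1 else set()
    return {center} | hop1 | hop2

rng = np.random.default_rng(3)
emb = rng.normal(size=(6, 32))                           # toy ChemBERTa-style drug vectors
norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
sim = norm @ norm.T                                      # cosine-similarity network
edges = topk_edges(sim, k=2)
sub = two_hop_subgraph(edges, center=0)
```

For proteins the same subgraph extraction would be applied to STRING edges ranked by confidence score instead of cosine similarity.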
Next, a 2-layer graph attention network version-2 (GATv2) [31] is applied to these subgraphs to aggregate relational information. The node update rule for the GATv2 is defined as:
$$\mathbf{h}_i^{(l)} = \sigma\!\Big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}\, W^{(l)}\, \mathbf{h}_j^{(l-1)}\Big), \tag{2}$$

where $\mathbf{h}_i^{(l)}$ is the embedding of node $i$ at layer $l$, and attention weights $\alpha_{ij}$ are computed as:

$$\alpha_{ij} = \frac{\exp\!\big(\mathbf{a}^{\top}\,\mathrm{LeakyReLU}\big(W\,[\mathbf{h}_i \,\Vert\, \mathbf{h}_j]\big)\big)}{\sum_{k \in \mathcal{N}(i)} \exp\!\big(\mathbf{a}^{\top}\,\mathrm{LeakyReLU}\big(W\,[\mathbf{h}_i \,\Vert\, \mathbf{h}_k]\big)\big)}. \tag{3}$$
The final relation embeddings for drugs and proteins are obtained by averaging the node embeddings within their respective subgraphs, which follows the same formulation:
$$\mathbf{z} = \frac{1}{|\mathcal{V}|} \sum_{i \in \mathcal{V}} \mathbf{h}_i^{(L)}, \tag{4}$$

where $\mathcal{V}$ is the set of nodes in the sampled subgraph. Collecting these subgraph-level representations across all drugs and proteins yields the final relational embedding matrices $X_d^{\mathrm{rel}}$ and $X_p^{\mathrm{rel}}$. This formulation integrates local interaction patterns with global biological context, thereby complementing both structural and sequence-based features.
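A single-head GATv2 step followed by the subgraph mean pooling of Eq. (4) can be sketched as follows. Separate scoring and value transforms (`Ws`, `Wv`) are an illustrative simplification of the layer's parameterization, and all weights are toy values:

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gatv2_layer(h, neighbors, Ws, a, Wv):
    # score_ij = a^T LeakyReLU(Ws [h_i || h_j]); alpha = softmax over j in N(i);
    # then h_i' = sum_j alpha_ij * (Wv h_j), matching Eqs. (2)-(3)
    out = np.zeros((h.shape[0], Wv.shape[1]))
    for i in range(h.shape[0]):
        nbrs = neighbors[i]
        scores = np.array([a @ leaky_relu(np.concatenate([h[i], h[j]]) @ Ws)
                           for j in nbrs])
        alpha = np.exp(scores - scores.max())
        alpha /= alpha.sum()
        out[i] = sum(al * (h[j] @ Wv) for al, j in zip(alpha, nbrs))
    return out

rng = np.random.default_rng(4)
h = rng.normal(size=(4, 8))                          # 4-node sampled subgraph
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
Ws = rng.normal(size=(16, 8)) * 0.1                  # scoring transform on [h_i || h_j]
a = rng.normal(size=8) * 0.1                         # attention vector
Wv = rng.normal(size=(8, 8)) * 0.1                   # value transform
h1 = gatv2_layer(h, neighbors, Ws, a, Wv)
relation_vec = h1.mean(axis=0)                       # subgraph mean pooling, Eq. (4)
```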
Unlike prior graph-based DTI frameworks [32, 33] that construct a unified heterogeneous biological network and perform end-to-end message passing across multiple entity types, TriDTI instead adopts a modular relational representation strategy. Relational information is encoded independently through localized subgraph representations derived from drug–drug similarity and PPI networks, rather than through joint propagation over a single heterogeneous graph. This design enables relational features to complement sequential and structural modalities without entangling heterogeneous propagation paths, facilitating more flexible multimodal fusion while reducing reliance on large, densely connected biological networks.
Modality alignment
Effective integration of heterogeneous features from multiple modalities in TriDTI requires aligning embeddings in a shared latent space. Modality-specific projection networks are employed to map embeddings of varying dimensions into a unified space, ensuring both dimensional consistency and the ability to capture non-linear relationships. Formally, for a set of modality embeddings $\{\mathbf{x}_m\}$, each embedding vector is transformed through a 2-layer feed-forward network with GELU activation:

$$\mathbf{z}_m = W_2\,\mathrm{GELU}\big(W_1 \mathbf{x}_m + \mathbf{b}_1\big) + \mathbf{b}_2. \tag{5}$$
To further ensure that embeddings from different modalities are semantically aligned, a bidirectional cross-modality contrastive learning objective is applied. In this framework, projected embeddings $\mathbf{z}_i^{m_1}$ and $\mathbf{z}_i^{m_2}$ form positive pairs for each entity $i$, while embeddings of different entities serve as negatives. The directional loss from $m_1$ to $m_2$ is defined as:

$$\mathcal{L}_{m_1 \to m_2} = -\frac{1}{NB} \sum_{b=1}^{N} \sum_{i=1}^{B} \log \frac{\exp\!\big(\mathrm{sim}(\mathbf{z}_i^{m_1}, \mathbf{z}_i^{m_2})/\tau\big)}{\sum_{j=1}^{B} \exp\!\big(\mathrm{sim}(\mathbf{z}_i^{m_1}, \mathbf{z}_j^{m_2})/\tau\big)}, \tag{6}$$

where $N$ is the number of mini-batches, $B$ is the mini-batch size, $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, and $\tau$ is a temperature hyperparameter. The bidirectional loss

$$\mathcal{L}_{m_1 \leftrightarrow m_2} = \tfrac{1}{2}\big(\mathcal{L}_{m_1 \to m_2} + \mathcal{L}_{m_2 \to m_1}\big) \tag{7}$$

ensures symmetric alignment between modalities. The final contrastive loss is computed over selected modality pairs for both drugs and targets:

$$\mathcal{L}_{\mathrm{con}} = \sum_{e \in \{d,\, p\}} \big(\mathcal{L}_{\mathrm{seq} \leftrightarrow \mathrm{str}}^{(e)} + \mathcal{L}_{\mathrm{seq} \leftrightarrow \mathrm{rel}}^{(e)}\big), \tag{8}$$

focusing on aligning other modalities to the pretrained sequential representations. By encouraging closeness among embeddings of the same entity across modalities while separating embeddings of different entities within each modality, this modality alignment step promotes consistent, discriminative, and semantically coherent representations across the tri-modal feature space, enhancing the predictive capability of TriDTI.
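The projection head of Eq. (5) and the contrastive objective of Eqs. (6)–(7) can be sketched in NumPy. The example below uses toy embeddings; `z_str` is deliberately constructed near `z_seq` so the aligned loss is small, while unrelated embeddings give a large loss:

```python
import numpy as np

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))

def project(x, W1, b1, W2, b2):
    # 2-layer projection head with GELU (Eq. 5)
    return gelu(x @ W1 + b1) @ W2 + b2

def info_nce(za, zb, tau=0.1):
    # Directional contrastive loss (Eq. 6): positives on the diagonal
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = (za @ zb.T) / tau
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(logp)))

rng = np.random.default_rng(5)
B, d = 8, 16
z_seq = rng.normal(size=(B, d))                      # sequential anchor embeddings
z_str = z_seq + 0.05 * rng.normal(size=(B, d))       # nearly aligned structural embeddings
z_rand = rng.normal(size=(B, d))                     # unaligned embeddings, for contrast

W1, b1 = rng.normal(size=(d, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.normal(size=(32, d)) * 0.1, np.zeros(d)
z_proj = project(z_str, W1, b1, W2, b2)              # projection to the shared space

loss_aligned = 0.5 * (info_nce(z_seq, z_str) + info_nce(z_str, z_seq))   # Eq. (7)
loss_random = info_nce(z_seq, z_rand)
```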
Feature fusion
TriDTI employs a two-stage attention-based fusion strategy to integrate heterogeneous modality embeddings of drugs and proteins. This approach balances modality-specific strengths while mitigating redundancy and noise, yielding interaction-specific representations that capture both entity-level and pair-level dependencies.
First, a soft attention module adaptively weighs the contribution of each modality. Given modality features $\{\mathbf{z}_m\}_{m=1}^{M}$ for an entity, the attention scores $\alpha_m$ are computed using a two-layer multi-layer perceptron (MLP) with Tanh activation and normalized across modalities via a softmax function. The fused entity representation is then obtained as a weighted sum of modality embeddings:

$$\mathbf{z} = \sum_{m=1}^{M} \alpha_m\, \mathbf{z}_m, \qquad \alpha_m = \mathrm{softmax}_m\big(\mathbf{w}^{\top} \tanh(W \mathbf{z}_m + \mathbf{b})\big). \tag{9}$$
Second, the fused drug and protein embeddings are refined through a bidirectional cross-attention module. In this design, the query $Q$ originates from one entity, while the key $K$ and value $V$ are projected from the other, enabling each entity to selectively attend to features of its counterpart. Formally, the cross-attention from drug to protein is defined as

$$\mathrm{Attn}_{d \to p} = \mathrm{softmax}\!\Big(\frac{Q_d K_p^{\top}}{\sqrt{d_k}}\Big)\, V_p, \tag{10}$$

with a symmetric formulation for $\mathrm{Attn}_{p \to d}$. Residual connections are then applied to preserve entity-specific information while incorporating complementary interaction cues, leading to the final embeddings:

$$\mathbf{z}_d' = \mathbf{z}_d + \mathrm{Attn}_{d \to p}, \tag{11}$$

$$\mathbf{z}_p' = \mathbf{z}_p + \mathrm{Attn}_{p \to d}. \tag{12}$$

Here, $\mathbf{z}_d'$ and $\mathbf{z}_p'$ serve as the final drug and protein representations, simultaneously retaining modality-integrated features and cross-entity contextual information, which form the basis for downstream interaction prediction.
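The two fusion stages can be sketched end to end. This is a single-head, per-entity simplification (the paper uses 8-head cross-attention with learned Q/K/V projections); with one key per entity the attention weight collapses to 1, so the residual update reduces to adding the counterpart's vector, which keeps the toy example easy to check:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_attention_fuse(z_mods, W, w):
    # Eq. (9): tanh-MLP scores per modality, softmax across modalities, weighted sum
    scores = np.array([w @ np.tanh(W @ z) for z in z_mods])
    alpha = softmax(scores)
    return sum(a * z for a, z in zip(alpha, z_mods)), alpha

def cross_attention(q, k, v):
    # Eq. (10): scaled dot-product attention
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(6)
d = 16
drug_mods = [rng.normal(size=d) for _ in range(3)]    # structural / sequential / relational
prot_mods = [rng.normal(size=d) for _ in range(3)]
W, w = rng.normal(size=(8, d)) * 0.1, rng.normal(size=8) * 0.1
z_drug, alpha = soft_attention_fuse(drug_mods, W, w)
z_prot, _ = soft_attention_fuse(prot_mods, W, w)

# bidirectional cross-attention with residual connections, Eqs. (11)-(12)
qd, qp = z_drug[None, :], z_prot[None, :]
z_drug_final = z_drug + cross_attention(qd, qp, qp)[0]
z_prot_final = z_prot + cross_attention(qp, qd, qd)[0]
```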
Classification
The final representations of drugs and proteins, enhanced by the bidirectional cross-attention module, are combined to predict the probability of interaction. Specifically, the two vectors $\mathbf{z}_d'$ and $\mathbf{z}_p'$ are concatenated to form a unified representation $\mathbf{z}_{dp} = [\mathbf{z}_d' \,\Vert\, \mathbf{z}_p']$, which is then fed into an MLP-based classifier. The classifier consists of multiple fully connected layers interleaved with GELU activation functions and dropout regularization, enabling it to capture complex nonlinear dependencies between drugs and proteins. Formally, the prediction is obtained as

$$\hat{y} = \sigma\big(\mathrm{MLP}(\mathbf{z}_{dp})\big), \tag{13}$$

where $\hat{y}$ denotes the predicted interaction probability, and $\sigma$ is the sigmoid activation function.
Overall loss function
To optimize both prediction accuracy and modality consistency, the model is trained with a composite loss function that combines binary cross-entropy (BCE) loss and cross-modality contrastive loss. The BCE loss directly supervises DTI prediction by minimizing the discrepancy between the predicted probability $\hat{y}_i$ and the ground-truth label $y_i$:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\big]. \tag{14}$$

The total loss is defined as a weighted sum of BCE loss and the previously defined contrastive loss:

$$\mathcal{L} = \mathcal{L}_{\mathrm{BCE}} + \lambda\, \mathcal{L}_{\mathrm{con}}, \tag{15}$$

where $\lambda$ is a hyperparameter that balances prediction accuracy and modality alignment. In our experiments, we set $\lambda = 0.0001$ (see Table 2) to provide a small but effective regularization from the contrastive objective. This joint optimization encourages the model not only to maximize predictive performance but also to maintain semantic consistency across heterogeneous modalities, thereby enhancing both generalization and representation quality.
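Eqs. (13)–(15) compose the training objective. A toy NumPy sketch of the classifier head and composite loss (the weights, the single GELU hidden layer, and the placeholder contrastive-loss value are illustrative assumptions; dropout is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def mlp_classifier(z_pair, W1, W2):
    # Toy stand-in for the MLP head of Eq. (13): one GELU hidden layer, scalar logit
    gelu = lambda z: 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z ** 3)))
    return float(sigmoid(gelu(z_pair @ W1) @ W2))

def bce(y_true, y_pred, eps=1e-12):
    # Eq. (14), with clipping for numerical safety
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

rng = np.random.default_rng(7)
z_d, z_p = rng.normal(size=16), rng.normal(size=16)
z_pair = np.concatenate([z_d, z_p])                    # [z_d' || z_p']
W1, W2 = rng.normal(size=(32, 8)) * 0.1, rng.normal(size=8) * 0.1
y_hat = mlp_classifier(z_pair, W1, W2)

lam = 1e-4                                             # contrastive weight, as in Table 2
l_con = 2.0                                            # placeholder contrastive-loss value
total = bce(np.array([1.0]), np.array([y_hat])) + lam * l_con   # Eq. (15)
```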
Results
Datasets
We employed three publicly available benchmark datasets for evaluation: DAVIS [34], BioSNAP [35], and DrugBank [36]. The DAVIS dataset consists of 68 drugs and 379 target proteins, providing experimental measurements of drug–target binding affinities. Following prior work, we binarized the affinity values by treating drug–target pairs with dissociation constant ($K_d$) values below 30 nM as positive interactions and all others as negative, thus reformulating the task into a binary classification problem. For BioSNAP and DrugBank, we used the preprocessed versions from MolTrans [37] and HyperAttentionDTI [15], respectively. In these versions, drug–target pairs were extracted from the original datasets, and negative sampling was applied to ensure a 1:1 ratio of positive to negative interactions. To maintain data integrity, we further removed drug samples with invalid SMILES strings that could not be converted into molecular graphs.
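The DAVIS binarization is a one-line thresholding step; a minimal sketch with hypothetical affinity values:

```python
import numpy as np

# Toy dissociation constants (nM); pairs with Kd < 30 are treated as positives
kd = np.array([5.2, 30.0, 1200.0, 12.0, 85.0])
labels = (kd < 30).astype(int)
```

Note that pairs at exactly the threshold (e.g. 30.0) fall in the negative class under a strict inequality.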
To incorporate relational knowledge, we leveraged the PPI dataset from STRING [30] that provides probabilistic confidence scores for functional associations between proteins. Using STRING PPIs, we constructed separate PPI networks for each benchmark by including only the proteins present in the corresponding DTI dataset. This approach ensures that the relational information is specific to each benchmark while capturing the functional associations relevant to the modeled proteins. These networks were subsequently integrated as an additional modality input to our model. The statistics of the resulting experimental datasets are summarized in Table 1.
Table 1.
Statistics of the benchmark datasets for our experiments.
| Dataset | Drugs | Targets | Positive DTIs | Negative DTIs | PPIs |
|---|---|---|---|---|---|
| DAVIS | 68 | 379 | 1506 | 9597 | 15,734 |
| BioSNAP | 4502 | 2181 | 13,811 | 13,622 | 193,212 |
| DrugBank | 6645 | 4254 | 17,511 | 17,511 | 237,405 |
Experimental settings
For a robust assessment, we adopted five-fold cross-validation. Each dataset was split into training, validation, and test sets in a 7:1:2 ratio. Model performance was evaluated using four standard metrics: area under the receiver operating characteristic curve (AUROC), area under the precision–recall curve (AUPRC), F1 score, and accuracy. Training was conducted using the AdamW optimizer with a learning rate of 5e-4, a batch size of 16, and a dropout rate of 0.1 for up to 100 epochs. Analysis of the training dynamics (see Supplementary Section S4) confirmed that the model consistently converged within this epoch limit, demonstrating stable optimization. The model parameters achieving the highest AUROC on the validation set were selected for reporting final test results. Detailed hyperparameter configurations for TriDTI are provided in Table 2, and a sensitivity analysis of the modality alignment hyperparameters ($\tau$ and $\lambda$) is presented in Supplementary Table S1. To ensure a fair and reproducible comparison, all baseline models were rigorously trained, validated, and tested using the identical data splits employed for TriDTI. For model implementations, we adhered to the hyperparameters and configurations explicitly reported in the original works. Where details were unavailable or incompatible with our datasets, hyperparameters were empirically tuned to reflect the scale and characteristics of each dataset.
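AUROC, the model-selection metric above, equals the probability that a randomly chosen positive pair is scored above a randomly chosen negative pair. A dependency-free sketch of that rank formulation (toy labels and scores):

```python
def auroc(y_true, scores):
    # AUROC = P(score of random positive > score of random negative), ties count 0.5
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y = [1, 1, 0, 0, 1]          # toy interaction labels
s = [0.9, 0.8, 0.3, 0.65, 0.6]  # toy predicted probabilities
score = auroc(y, s)          # one misranked pair out of six -> 5/6
```

In practice library implementations (e.g. from scikit-learn) would be used for all four metrics; this sketch only makes the ranking interpretation concrete.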
Table 2.
Hyperparameter configurations for TriDTI across the DAVIS, BioSNAP, and DrugBank datasets.

| Hyperparameter | DAVIS | BioSNAP | DrugBank |
|---|---|---|---|
| **Structural feature** | | | |
| GIN input dim | 79 | 79 | 79 |
| GIN output dim | 128 | 64 | 64 |
| CNN input dim | 128 | 64 | 64 |
| CNN output dim | 128 | 64 | 64 |
| **Sequential feature** | | | |
| ChemBERTa input dim | 510 | 510 | 510 |
| ChemBERTa output dim | 768 | 768 | 768 |
| ESM2 input dim | 1024 | 1024 | 1024 |
| ESM2 output dim | 1280 | 1280 | 1280 |
| **Relational feature** | | | |
| Drug GATv2 hidden dim | 128 | 64 | 64 |
| Target GATv2 hidden dim | 128 | 64 | 64 |
| **Modality alignment** | | | |
| Projection dim | (128, 128) | (128, 64) | (256, 64) |
| Contrastive temperature ($\tau$) | 0.1 | 0.1 | 0.1 |
| Contrastive weight ($\lambda$) | 0.0001 | 0.0001 | 0.0001 |
| **Modality fusion** | | | |
| Soft attention hidden dim | (128, 3) | (64, 3) | (64, 3) |
| Cross-attention output dim | 128 | 64 | 64 |
| Cross-attention num heads | 8 | 8 | 8 |
Performance evaluation
TriDTI achieved the strongest overall performance among state-of-the-art models across all three benchmark datasets, as summarized in Table 3. On the DAVIS dataset, TriDTI recorded an AUROC of 0.9391 and an AUPRC of 0.7605, corresponding to relative improvements of 0.24% and 0.88% over the previous best-performing model, GPS-DTI. While MGMA-DTI reported a higher F1 score, its performance on the other metrics did not generalize as well. In contrast, TriDTI demonstrated a uniformly strong and balanced predictive capability across the remaining evaluation metrics, recording a high accuracy of 0.9234.
Table 3.
DTI prediction performance on DAVIS, BioSNAP, and DrugBank datasets, where values indicate the mean and standard deviation over five-fold cross-validation.
| Dataset | Methods | AUROC | AUPRC | F1 | Accuracy |
|---|---|---|---|---|---|
| DAVIS | TransformerCPI | 0.8399 ± 0.0125 | 0.5329 ± 0.0066 | 0.5141 ± 0.0394 | 0.8723 ± 0.0073 |
| | MGraphDTA | 0.9211 ± 0.0118 | 0.7064 ± 0.0163 | 0.6843 ± 0.0160 | 0.9087 ± 0.0053 |
| | HyperAttentionDTI | 0.9221 ± 0.0108 | 0.7214 ± 0.0133 | 0.6911 ± 0.0168 | 0.9184 ± 0.0024 |
| | MCL-DTI | 0.8967 ± 0.0114 | 0.7050 ± 0.0241 | 0.6660 ± 0.0225 | 0.9180 ± 0.0057 |
| | DLM-DTI | 0.9290 ± 0.0114 | 0.7436 ± 0.0249 | 0.7083 ± 0.0203 | 0.9194 ± 0.0058 |
| | MMDG-DTI | 0.9166 ± 0.0058 | 0.7155 ± 0.0242 | 0.6848 ± 0.0134 | 0.9094 ± 0.0068 |
| | MGMA-DTI | 0.8937 ± 0.0072 | 0.6735 ± 0.0252 | **0.8311 ± 0.0102** | 0.8212 ± 0.0354 |
| | GPS-DTI | *0.9368 ± 0.0069* | *0.7538 ± 0.0138* | *0.7245 ± 0.0129* | **0.9244 ± 0.0048** |
| | TriDTI | **0.9391 ± 0.0031** | **0.7605 ± 0.0114** | 0.7186 ± 0.0100 | *0.9234 ± 0.0014* |
| BioSNAP | TransformerCPI | 0.8714 ± 0.0040 | 0.8773 ± 0.0050 | 0.7977 ± 0.0038 | 0.7877 ± 0.0097 |
| | MGraphDTA | 0.9049 ± 0.0026 | 0.9117 ± 0.0030 | 0.8316 ± 0.0029 | 0.8263 ± 0.0035 |
| | HyperAttentionDTI | 0.9122 ± 0.0035 | 0.9181 ± 0.0041 | 0.8410 ± 0.0053 | 0.8391 ± 0.0072 |
| | MCL-DTI | 0.8773 ± 0.0025 | 0.8788 ± 0.0037 | 0.8079 ± 0.0049 | 0.8060 ± 0.0043 |
| | DLM-DTI | 0.9115 ± 0.0031 | 0.9158 ± 0.0025 | 0.8420 ± 0.0068 | 0.8418 ± 0.0051 |
| | MMDG-DTI | 0.9093 ± 0.0022 | 0.9149 ± 0.0035 | 0.8393 ± 0.0021 | 0.8345 ± 0.0023 |
| | MGMA-DTI | 0.8905 ± 0.0040 | 0.8946 ± 0.0069 | 0.8180 ± 0.0052 | 0.8131 ± 0.0083 |
| | GPS-DTI | *0.9256 ± 0.0039* | *0.9259 ± 0.0056* | *0.8594 ± 0.0057* | *0.8555 ± 0.0068* |
| | TriDTI | **0.9274 ± 0.0030** | **0.9280 ± 0.0029** | **0.8605 ± 0.0039** | **0.8567 ± 0.0067** |
| DrugBank | TransformerCPI | 0.8451 ± 0.0051 | 0.8480 ± 0.0071 | 0.7729 ± 0.0035 | 0.7679 ± 0.0031 |
| | MGraphDTA | 0.8780 ± 0.0042 | 0.8823 ± 0.0063 | 0.8032 ± 0.0039 | 0.7948 ± 0.0073 |
| | HyperAttentionDTI | 0.8878 ± 0.0035 | 0.8922 ± 0.0046 | 0.8112 ± 0.0036 | 0.8066 ± 0.0052 |
| | MCL-DTI | 0.8450 ± 0.0032 | 0.8435 ± 0.0051 | 0.7762 ± 0.0038 | 0.7733 ± 0.0037 |
| | DLM-DTI | 0.8990 ± 0.0051 | 0.9008 ± 0.0034 | 0.8238 ± 0.0074 | 0.8181 ± 0.0132 |
| | MMDG-DTI | 0.8768 ± 0.0179 | 0.8760 ± 0.0225 | 0.8064 ± 0.0133 | 0.7934 ± 0.0171 |
| | MGMA-DTI | 0.8676 ± 0.0036 | 0.8693 ± 0.0107 | 0.7944 ± 0.0033 | 0.7826 ± 0.0075 |
| | GPS-DTI | *0.9120 ± 0.0019* | *0.9101 ± 0.0029* | *0.8431 ± 0.0039* | *0.8395 ± 0.0049* |
| | TriDTI | **0.9182 ± 0.0042** | **0.9180 ± 0.0068** | **0.8477 ± 0.0036** | **0.8458 ± 0.0037** |

Note: The best and second-best results are shown in **bold** and *italics*, respectively.
The advantage of TriDTI is further substantiated on the BioSNAP and DrugBank datasets, where its overall superiority is more pronounced. For BioSNAP, TriDTI achieved the highest results across all four metrics: AUROC (0.9274), AUPRC (0.9280), F1 score (0.8605), and accuracy (0.8567). Similarly, TriDTI obtained the best performance on DrugBank, recording an AUROC of 0.9182, AUPRC of 0.9180, F1 score of 0.8477, and accuracy of 0.8458. When compared against the average performance of all other baseline models, these results demonstrate a more substantial margin of improvement. For instance, TriDTI surpasses the average AUROC and AUPRC of all competing models by 2.92% and 2.52% on BioSNAP, and by 4.55% and 4.38% on DrugBank, respectively. These results highlight the effectiveness of TriDTI’s modality-integrated representation learning, achieving superior and consistent performance across diverse datasets.
Ablation study
We further analyzed the contribution of individual modalities and the importance of key components of TriDTI. By systematically removing specific modalities or architectural modules, we evaluated how each element influenced the overall predictive performance. The experimental results are summarized in Fig. 2.
Figure 2.

Ablation study results of TriDTI on the DAVIS, BioSNAP, and DrugBank datasets. The figure presents two comparative analyses: (a) Modality contribution analysis assesses the contribution of individual feature sources by comparing the full model against variants where a single or dual input modality is excluded. (b) Module ablation study validates the functional necessity of core architectural units by comparing the full model against variants excluding each modular component. Bars represent the mean and standard deviation over five-fold cross-validation, reported by AUROC.
Modality contribution analysis
The contribution of each modality was analyzed by comparing single-, dual-, and tri-modality configurations. Among single-modality settings, the sequence-only model consistently achieved the best performance across all datasets, whereas relational and structural modalities exhibited relatively lower accuracy. This finding highlights sequence-based semantic information from pretrained language models as the most informative signal for DTI prediction.
Models that included the sequence modality generally maintained strong performance, indicating its robustness across different datasets. However, performance gains were not always guaranteed when two modalities were combined. In several cases, dual-modality models underperformed the sequence-only baseline, suggesting that naive feature fusion does not necessarily lead to improved predictions. Notable exceptions were observed for BioSNAP and DrugBank, where integrating sequence and relational modalities yielded performance improvements, implying complementary contributions from relational information. In contrast, the joint utilization of all three modalities consistently improved performance across all datasets. This outcome demonstrates that full multimodal integration enables TriDTI to capture complementary information beyond what is accessible through single or limited dual-modality configurations. In addition, the soft attention weights offered insight into how the model adaptively emphasizes different modalities based on dataset characteristics (see Supplementary Section S2).
Module ablation study
To validate the necessity of the proposed architecture, we assessed the functional role of TriDTI’s core modules by comparing the full framework against various ablated variants. Across all datasets, the complete model consistently outperformed its ablated variants, confirming the effectiveness of the proposed design. Removing the contrastive learning module resulted in a performance degradation of 2.09% on average. This degradation shows that explicit cross-modal alignment is crucial for learning robust multimodal embeddings, as its absence hinders the model’s ability to fully exploit the complementary nature of heterogeneous features. Furthermore, as shown in Supplementary Section S4, analysis of the training dynamics confirmed that the contrastive objective led to enhanced convergence stability and superior validation AUROC.
The attention-based fusion mechanism was also validated through its components. Excluding the soft attention module reduced performance by 1.30% on average, suggesting that selectively emphasizing informative features within each modality contributes to improving prediction accuracy. A comparable performance drop of 1.31% on average resulted from the removal of the cross-attention module. This result emphasizes the benefit of modeling pairwise interactions at the drug–target level. Overall, the ablation results confirmed that each architectural component meaningfully contributes to the final performance, and that combining contrastive alignment with attention-based fusion is crucial for effective multimodal integration in TriDTI.
Model interpretability
Contrastive learning plays an important role in shaping the quality of the representation space. Figure 3 presents t-SNE visualizations of the joint drug–target embeddings produced by TriDTI on the BioSNAP dataset, comparing models trained with and without the contrastive learning objective. As illustrated in the figure, embeddings generated with contrastive learning form more clearly separated and structured clusters corresponding to interaction and noninteraction labels. In contrast, embeddings obtained without contrastive learning show substantial overlap between classes, indicating reduced discriminative capability. These observations suggest that contrastive learning guides the model to organize the representation space in a way that better captures underlying DTI patterns. Furthermore, detailed analysis and visualization of the bidirectional cross-attention mechanism (see Supplementary Section S3) confirmed that the model learns robust and mutual interaction representations by exhibiting complementary attention patterns.
Figure 3.

t-SNE visualization of joint drug–target embeddings. The left panel shows embeddings obtained from the model trained with contrastive learning, while the right panel corresponds to embeddings learned without the contrastive objective. Embeddings learned with contrastive learning exhibit more clearly separated and structured clusters between interaction and non-interaction samples, indicating enhanced discriminative representation learning compared with the non-contrastive counterpart.
Cold-start settings
Table 5.
DTI prediction performance comparison on BioSNAP under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.
| Model | Unseen Drug | | Unseen Target | | Unseen Binding | |
|---|---|---|---|---|---|---|
| | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC |
| TransformerCPI | 0.8661 ± 0.0094 | 0.8768 ± 0.0071 | 0.7267 ± 0.0366 | 0.7477 ± 0.0510 | 0.7040 ± 0.0618 | 0.7262 ± 0.0809 |
| MGraphDTA | 0.8571 ± 0.0089 | 0.8735 ± 0.0067 | 0.7652 ± 0.0268 | 0.7907 ± 0.0381 | 0.6866 ± 0.0608 | 0.7244 ± 0.0702 |
| HyperAttentionDTI | 0.8694 ± 0.0104 | 0.8838 ± 0.0092 | 0.7868 ± 0.0219 | 0.8214 ± 0.0254 | 0.7065 ± 0.0596 | 0.7473 ± 0.0721 |
| MCL-DTI | 0.8150 ± 0.0164 | 0.8321 ± 0.0097 | 0.7168 ± 0.0271 | 0.7447 ± 0.0437 | 0.6399 ± 0.0417 | 0.6749 ± 0.0700 |
| DLM-DTI | 0.8266 ± 0.0538 | 0.8492 ± 0.0431 | 0.8388 ± 0.0138 | 0.8552 ± 0.0209 | 0.7213 ± 0.0655 | 0.7550 ± 0.0818 |
| MMDG-DTI | 0.8691 ± 0.0105 | <u>0.8856 ± 0.0081</u> | 0.8104 ± 0.0171 | 0.8339 ± 0.0125 | 0.7503 ± 0.0554 | 0.7852 ± 0.0661 |
| MGMA-DTI | 0.8660 ± 0.0079 | 0.8745 ± 0.0083 | 0.6689 ± 0.0292 | 0.6904 ± 0.0435 | 0.6388 ± 0.0445 | 0.6651 ± 0.0684 |
| GPS-DTI | <u>0.8735 ± 0.0156</u> | 0.8825 ± 0.0166 | **0.8684 ± 0.0122** | **0.8804 ± 0.0198** | <u>0.7882 ± 0.0446</u> | **0.8110 ± 0.0581** |
| TriDTI | **0.8834 ± 0.0108** | **0.8899 ± 0.0135** | <u>0.8670 ± 0.0073</u> | <u>0.8750 ± 0.0220</u> | **0.7983 ± 0.0305** | <u>0.8080 ± 0.0395</u> |
Note: The best and second-best results are shown in bold and underline, respectively.
A cold-start scenario, in which a model encounters previously unseen drugs, targets, or binding pairs, is one of the most challenging settings in DTI prediction. Under these conditions, TriDTI demonstrated strong performance across the DAVIS, BioSNAP, and DrugBank datasets, as summarized in Tables 4–6. In the Unseen Drug setting, TriDTI showed comparatively lower performance on the DAVIS dataset than some baseline methods. However, it achieved the best results on both BioSNAP and DrugBank in terms of AUROC and AUPRC, suggesting effective generalization to previously unseen compounds in larger and more diverse chemical spaces. In the Unseen Target and Unseen Binding settings, TriDTI consistently ranked among the top two methods across all datasets, demonstrating robust generalization under diverse cold-start conditions. In particular, GPS-DTI exhibited notably strong performance in the Unseen Target scenario, which is likely attributable to its reliance on large-scale pretrained protein representations from ESM2. Overall, these results indicate that TriDTI is well suited for real-world DTI prediction scenarios, where new compounds and targets are continuously introduced.
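The three cold-start settings differ only in which entities are held out of training. A common way to construct such splits is entity-level holdout, sketched below; the function, split fraction, and discard rule for the Unseen Binding case are illustrative assumptions and may differ from the paper's exact five-fold protocol.

```python
import random

def cold_start_split(pairs, mode="drug", test_frac=0.2, seed=0):
    """Split (drug, target, label) triples so that test entities never appear
    in training. mode: 'drug' (unseen drugs), 'target' (unseen targets), or
    'binding' (both the drug and the target are unseen)."""
    rng = random.Random(seed)
    drugs = sorted({d for d, t, y in pairs})
    targets = sorted({t for d, t, y in pairs})
    test_drugs = set(rng.sample(drugs, int(len(drugs) * test_frac)))
    test_targets = set(rng.sample(targets, int(len(targets) * test_frac)))

    train, test = [], []
    for d, t, y in pairs:
        if mode == "drug":
            (test if d in test_drugs else train).append((d, t, y))
        elif mode == "target":
            (test if t in test_targets else train).append((d, t, y))
        else:  # 'binding': a test pair must have BOTH entities unseen;
               # pairs with exactly one held-out entity are discarded
               # so no test drug or target leaks into training
            if d in test_drugs and t in test_targets:
                test.append((d, t, y))
            elif d not in test_drugs and t not in test_targets:
                train.append((d, t, y))
    return train, test

# toy usage: 5 drugs x 7 targets, alternating labels
pairs = [(f"d{i % 5}", f"t{i % 7}", i % 2) for i in range(35)]
train, test = cold_start_split(pairs, mode="drug")
```

The Unseen Binding split is the strictest, which is consistent with it yielding the lowest absolute scores for every model in Tables 4–6.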
Table 4.
DTI prediction performance comparison on DAVIS under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.
| Model | Unseen Drug | | Unseen Target | | Unseen Binding | |
|---|---|---|---|---|---|---|
| | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC |
| TransformerCPI | **0.7483 ± 0.0304** | 0.3470 ± 0.0789 | 0.7972 ± 0.0342 | 0.4546 ± 0.0961 | <u>0.7212 ± 0.0575</u> | <u>0.3086 ± 0.1324</u> |
| MGraphDTA | 0.7230 ± 0.0450 | 0.3554 ± 0.0966 | 0.8492 ± 0.0432 | 0.5314 ± 0.1315 | 0.6314 ± 0.0775 | 0.2028 ± 0.0691 |
| HyperAttentionDTI | 0.7400 ± 0.0297 | 0.3676 ± 0.1100 | 0.8714 ± 0.0292 | 0.5955 ± 0.0986 | 0.6525 ± 0.0777 | 0.2656 ± 0.1165 |
| MCL-DTI | 0.7260 ± 0.0376 | 0.3446 ± 0.0865 | 0.7871 ± 0.0391 | 0.4477 ± 0.0979 | 0.6674 ± 0.0776 | 0.2530 ± 0.1316 |
| DLM-DTI | 0.7313 ± 0.0414 | **0.3861 ± 0.0918** | 0.8247 ± 0.0605 | 0.5334 ± 0.1519 | 0.7016 ± 0.0808 | 0.2902 ± 0.0452 |
| MMDG-DTI | 0.7409 ± 0.0852 | <u>0.3748 ± 0.1159</u> | 0.8529 ± 0.0383 | 0.5474 ± 0.1067 | 0.6490 ± 0.1232 | 0.2665 ± 0.1509 |
| MGMA-DTI | <u>0.7420 ± 0.0489</u> | 0.3746 ± 0.0738 | 0.7260 ± 0.0545 | 0.3883 ± 0.0876 | 0.5729 ± 0.1287 | 0.1977 ± 0.0944 |
| GPS-DTI | 0.6904 ± 0.0503 | 0.3318 ± 0.0430 | <u>0.8870 ± 0.0255</u> | <u>0.6280 ± 0.1019</u> | 0.6931 ± 0.0432 | 0.2597 ± 0.0239 |
| TriDTI | 0.7302 ± 0.0357 | 0.3345 ± 0.0717 | **0.8923 ± 0.0297** | **0.6328 ± 0.0896** | **0.7909 ± 0.0488** | **0.4202 ± 0.0712** |
Note: The best and second-best results are shown in bold and underline, respectively.
Table 6.
DTI prediction performance comparison on DrugBank under three cold-start scenarios: Unseen Drug, Unseen Target, and Unseen Binding, where values represent the mean and standard deviation over five-fold cross-validation.
| Model | Unseen Drug | | Unseen Target | | Unseen Binding | |
|---|---|---|---|---|---|---|
| | AUROC | AUPRC | AUROC | AUPRC | AUROC | AUPRC |
| TransformerCPI | 0.7674 ± 0.0322 | 0.3572 ± 0.0761 | 0.7240 ± 0.0159 | 0.7295 ± 0.0158 | 0.6892 ± 0.0098 | 0.6860 ± 0.0251 |
| MGraphDTA | 0.8316 ± 0.0095 | 0.8407 ± 0.0108 | 0.7573 ± 0.0053 | 0.7839 ± 0.0025 | 0.6911 ± 0.0062 | 0.7030 ± 0.0165 |
| HyperAttentionDTI | 0.8335 ± 0.0052 | 0.8426 ± 0.0049 | 0.7814 ± 0.0202 | 0.8091 ± 0.0164 | 0.6970 ± 0.0331 | 0.6950 ± 0.0463 |
| MCL-DTI | 0.7596 ± 0.0172 | 0.7729 ± 0.0107 | 0.6619 ± 0.0164 | 0.6796 ± 0.0134 | 0.5646 ± 0.0173 | 0.5585 ± 0.0314 |
| DLM-DTI | 0.8478 ± 0.0117 | 0.8514 ± 0.0117 | 0.8372 ± 0.0107 | 0.8461 ± 0.0107 | 0.7579 ± 0.0056 | 0.7615 ± 0.0104 |
| MMDG-DTI | 0.8332 ± 0.0194 | 0.8397 ± 0.0196 | 0.7780 ± 0.0374 | 0.7953 ± 0.0334 | 0.7071 ± 0.0154 | 0.7219 ± 0.0281 |
| MGMA-DTI | 0.8284 ± 0.0103 | 0.8346 ± 0.0124 | 0.6919 ± 0.0244 | 0.7011 ± 0.0229 | 0.6318 ± 0.0174 | 0.6183 ± 0.0197 |
| GPS-DTI | <u>0.8487 ± 0.0074</u> | <u>0.8572 ± 0.0040</u> | **0.8681 ± 0.0155** | **0.8776 ± 0.0155** | <u>0.7774 ± 0.0226</u> | <u>0.7841 ± 0.0248</u> |
| TriDTI | **0.8688 ± 0.0086** | **0.8717 ± 0.0058** | <u>0.8664 ± 0.0106</u> | <u>0.8725 ± 0.0102</u> | **0.7943 ± 0.0161** | **0.7913 ± 0.0239** |
Note: The best and second-best results are shown in bold and underline, respectively.
Case study
The cold-start analysis demonstrated TriDTI’s strong generalization ability to unseen data. Building on this, the present case study highlights the practical utility of the model for real-world drug discovery. To validate predictions of unknown DTIs, we used the DrugBank dataset: we first filtered out all known drug–target pairs and then scored the remaining candidate pool with our model, yielding a list of the 10 most promising novel candidates. After excluding one pair that lacked a 3D PDB structure, we subjected the remaining nine candidates to molecular docking simulations for validation.
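The screening procedure described above (mask known interactions, score every remaining pair, keep the top candidates) can be sketched generically. This is an illustrative sketch: `score_fn` stands in for the trained TriDTI predictor, and the names and toy inputs are assumptions made for the example.

```python
def prioritize_candidates(drugs, targets, known_pairs, score_fn, top_k=10):
    """Score every drug-target pair absent from the known-interaction set
    and return the top_k highest-scoring novel candidates."""
    known = set(known_pairs)
    # candidate pool: the full cross product minus already-known interactions
    candidates = [(d, t) for d in drugs for t in targets if (d, t) not in known]
    # rank by model score, highest (most promising) first
    candidates.sort(key=lambda pair: score_fn(*pair), reverse=True)
    return candidates[:top_k]

# toy usage with a hypothetical scoring function
drugs = ["DB_A", "DB_B", "DB_C"]
targets = ["P_X", "P_Y"]
known = [("DB_A", "P_X")]
score_fn = lambda d, t: (d == "DB_B") + (t == "P_Y") * 0.5  # stand-in predictor
top = prioritize_candidates(drugs, targets, known, score_fn, top_k=2)
```

Known pairs are excluded before ranking, so the output contains only novel candidates, matching the filtering step used to produce the nine docking candidates.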
To further substantiate our predictions, we used the CB-Dock2 [38] docking server to compute Vina scores for the nine candidates. The detailed docking results, including the Vina score, cavity volume, center coordinates, and docking size for each pair, are presented in Table 7. Every pair yielded a binding score below −5 kcal/mol. In docking analysis, a Vina score below −5 kcal/mol is generally considered a strong indicator of potential DTI, with more negative values suggesting a more robust binding ability. The docking outcomes for the top two candidates are further visualized in Fig. 4, which shows their binding poses and key interactions with the target proteins.
Table 7.
Top nine docking results of drug–protein pairs selected by TriDTI
| Drug ID | Protein ID | Vina score (kcal/mol) | Cavity volume (Å³) | Center (x, y, z) | Docking size (x, y, z) |
|---|---|---|---|---|---|
| DB11638 | P08235 | −7.6 | 458 | 64, 58, −2 | 18, 18, 18 |
| DB00753 | P08235 | −5.3 | 436 | 122, 24, 22 | 16, 16, 16 |
| DB00637 | P08913 | −10.7 | 6 | −5, −12, 10 | 26, 26, 26 |
| DB07973 | P08913 | −9.6 | 6 | −5, −12, 10 | 23, 23, 23 |
| DB06144 | P08913 | −9.9 | 6 | −5, −12, 10 | 25, 25, 25 |
| DB01043 | P08235 | −7.0 | 436 | 122, 24, 22 | 16, 16, 16 |
| DB05422 | P08913 | −9.4 | 6 | −5, −12, 10 | 24, 24, 24 |
| DB08685 | P34903 | −5.8 | 3772 | 142, 102, 133 | 28, 29, 35 |
| DB05316 | P08913 | −9.7 | 6 | −5, −12, −10 | 25, 25, 25 |
Figure 4.

Molecular docking analysis of top-ranked pairs predicted by TriDTI. (a) Highest-ranked binding prediction: DB11638 interacting with P08235. (b) Second-ranked binding prediction: DB00753 interacting with P08235.
It should be noted that docking scores alone do not constitute experimental validation of DTIs. Rather, these results provide supportive, structure-based evidence that the model-predicted pairs are physically plausible and merit further investigation. Taken together, this case study demonstrates that TriDTI can effectively prioritize candidate drug–target pairs that are favorable for downstream structure-based analysis, thereby serving as a useful computational screening tool in the early stages of drug discovery.
Conclusion
In this study, we present TriDTI, a novel deep learning framework designed to address the limitations of traditional DTI prediction models. The model simultaneously integrates three complementary modalities for both drugs and proteins: sequential representations from LLMs, structural features from molecular graphs and amino acid sequences, and relational information from biological networks. To balance the contributions of these heterogeneous modalities, we adopt a cross-modal contrastive learning strategy that enhances semantic alignment across feature spaces. In addition, a dynamic attention-based fusion mechanism is introduced to maximize predictive accuracy by adaptively weighting modality-specific contributions and modeling DTI patterns. Extensive experiments demonstrate that TriDTI consistently achieves the best performance across three benchmark datasets. Moreover, validation under cold-start scenarios and molecular docking case studies highlights its strong generalization capacity and practical utility in discovering novel drug–target pairs.
Although TriDTI is a useful tool for DTI prediction, several avenues remain for future exploration. First, while our current design incorporates pretrained LLM-based features, pretraining the molecular graph modality on large-scale datasets [39, 40] could further alleviate the imbalance among heterogeneous modalities and enhance structural representations. Second, although TriDTI effectively utilizes relational features through drug–drug similarities and PPIs, it currently does not rely on a comprehensive heterogeneous biological information network containing multiple entity types (e.g. diseases or side-effects). A promising direction involves augmenting the relational modality by incorporating such comprehensive networks and leveraging advanced heterogeneous graph representation learning methods [41, 42]. Third, TriDTI does not yet incorporate explicit 3D structural data, despite employing CNNs to model 1D protein sequences [43, 44]. Therefore, integrating 3D conformational information, potentially through geometric deep learning or structure-informed representations, would allow us to capture spatial interaction patterns more effectively. Fourth, the framework can be extended to integrate additional complementary modalities for both drugs and proteins, such as molecular images or textual descriptions, to achieve an even richer multimodal representation. Such extensions will further strengthen TriDTI’s capability and establish it as an even more versatile tool for advancing computational drug discovery.
Key Points.
We propose TriDTI, a novel tri-modal framework that integrates structural, sequential, and relational modalities to learn comprehensive representations by capturing diverse features of both drugs and proteins.
The model employs a cross-modal contrastive learning strategy to enforce semantic alignment across disparate embedding spaces, effectively minimizing information loss during the integration of heterogeneous features.
A two-stage adaptive fusion mechanism, combining soft attention and cross-attention, is designed to dynamically balance modality contributions and precisely model interaction-aware representations.
Supplementary Material
Contributor Information
Gwang-Hyeon Yun, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.
Jong-Hoon Park, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.
Young-Rae Cho, Department of Software, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea; Department of Digital Healthcare, Yonsei University Mirae Campus, 1 Yeonsedae-gil, Wonju-si, Gangwon-do, 26493, Republic of Korea.
Author contributions
Gwang-Hyeon Yun (Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing—original draft, Writing—review & editing), Jong-Hoon Park (Methodology, Software, Formal analysis, Investigation, Writing-review & editing, Visualization), Young-Rae Cho (Conceptualization, Writing—original draft, Writing—review & editing, Resources, Supervision, Project administration, Funding acquisition)
Conflict of interest
None declared.
Funding
This research was supported by National Research Foundation of Korea (NRF) grant funded by the Ministry of Science and ICT (grant no. RS-2025-16067916), Basic Science Research Program through the NRF funded by the Ministry of Education (grant no. RS-2025-25432868), and the Regional Innovation System & Education (RISE) program through the Gangwon RISE Center funded by the Ministry of Education and the Gangwon State, Republic of Korea (grant no. 2025-RISE-10-006).
Data availability
The codes and datasets are available online at https://github.com/knhc1234/TriDTI.
References
- 1. Zhangli L, Song G, Zhu H. et al. DTIAM: a unified framework for predicting drug–target interactions, binding affinities and drug mechanisms. Nat Commun 2025; 16:2548.
- 2. Hua Y, Song X, Feng Z. et al. CPInformer for efficient and robust compound–protein interaction prediction. IEEE/ACM Trans Comput Biol Bioinform 2022; 20:285–96. 10.1109/TCBB.2022.3144008
- 3. Talukder MA, Kazi M, Alazab A. Predicting drug–target interactions using machine learning with improved data balancing and feature engineering. Sci Rep 2025; 15:19495. 10.1038/s41598-025-03932-6
- 4. Yun G-H, Park J-H, Cho Y-R. FACT: feature aggregation and convolution with transformers for predicting drug classification code. Bioinformatics 2025; 41:i77–85.
- 5. Noor F, Junaid M, Almalki AH. et al. Deep learning pipeline for accelerating virtual screening in drug discovery. Sci Rep 2024; 14:28321. 10.1038/s41598-024-79799-w
- 6. Qian Liao Y, Zhang YC, Ding Y. et al. Application of artificial intelligence in drug–target interactions prediction: a review. npj Biomed Innov 2025; 2:1.
- 7. Wei J, Zhu Y, Zhuo L. et al. Efficient deep model ensemble framework for drug–target interaction prediction. J Phys Chem Lett 2024; 15:7681–93. 10.1021/acs.jpclett.4c01509
- 8. Donghua Y, Liu H, Yao S. Drug–target interaction prediction based on improved heterogeneous graph representation learning and feature projection classification. Expert Syst Appl 2024; 252:124289. 10.1016/j.eswa.2024.124289
- 9. Dong W, Yang Q, Wang J. et al. Multi-modality attribute learning-based method for drug–protein interaction prediction based on deep neural network. Brief Bioinform 2023; 24:bbad161. 10.1093/bib/bbad161
- 10. Shan J, Sun J, Zheng H. MIF–DTI: a multimodal information fusion method for drug–target interaction prediction. Brief Bioinform 2025; 26:bbaf474. 10.1093/bib/bbaf474
- 11. Zitnik M, Nguyen F, Wang B. et al. Machine learning for integrating data in biology and medicine: principles, practice, and opportunities. Information Fusion 2019; 50:71–91. 10.1016/j.inffus.2018.09.012
- 12. Li Y, Huang Y-A, You Z-H. et al. Drug–target interaction prediction based on drug fingerprint information and protein sequence. Molecules 2019; 24:2999. 10.3390/molecules24162999
- 13. Shi W, Yang H, Xie L. et al. A review of machine learning-based methods for predicting drug–target interactions. Health Inf Sci Syst 2024; 12:30. 10.1007/s13755-024-00287-6
- 14. Chen L, Tan X, Wang D. et al. TransformerCPI: improving compound–protein interaction prediction by sequence-based deep learning with self-attention mechanism and label reversal experiments. Bioinformatics 2020; 36:4406–14. 10.1093/bioinformatics/btaa524
- 15. Zhao Q, Zhao H, Zheng K. et al. HyperAttentionDTI: improving drug–protein interaction prediction by sequence-based deep learning with attention mechanism. Bioinformatics 2021; 38:655–62. 10.1093/bioinformatics/btab715
- 16. Lee J, Jun DW, Song I. et al. DLM-DTI: a dual language model for the prediction of drug–target interaction with hint-based learning. J Cheminform 2024; 16:14. 10.1186/s13321-024-00808-1
- 17. Chithrananda S, Grand G, Ramsundar B. ChemBERTa: large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885, 2020. 10.48550/arXiv.2010.09885
- 18. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: towards cracking the language of life's code through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2021; 44:7112–27.
- 19. Xiangzheng F, Zhenya D, Chen Y. et al. DrugKANs: a paradigm to enhance drug–target interaction prediction with KANs. IEEE J Biomed Health Inform 2025; PP:1–12. 10.1109/JBHI.2025.3566931
- 20. Yang Z, Zhong W, Zhao L. et al. MGraphDTA: deep multiscale graph neural network for explainable drug–target binding affinity prediction. Chem Sci 2022; 13:816–33. 10.1039/d1sc05180f
- 21. Li C, Mi J, Wang H. et al. MGMA-DTI: drug–target interaction prediction using multi-order gated convolution and multi-attention fusion. Comput Biol Chem 2025; 118:108449.
- 22. Xiong A, Luo Z, Xia Y. et al. An interpretable geometric graph neural network for enhancing the generalizability of drug–target interaction prediction. BMC Biol 2025; 23:350. 10.1186/s12915-025-02456-9
- 23. Rampášek L, Galkin M, Dwivedi VP. et al. Recipe for a general, powerful, scalable graph transformer. Advances in Neural Information Processing Systems 2022; 35:14501–15.
- 24. Zhao T, Yang H, Valsdottir LR. et al. Identifying drug–target interactions based on graph convolutional network and deep neural network. Brief Bioinform 2020; 22:2141–50. 10.1093/bib/bbaa044
- 25. Peng J, Wang Y, Guan J. et al. An end-to-end heterogeneous graph representation learning-based framework for drug–target interaction prediction. Brief Bioinform 2021; 22:bbaa430. 10.1093/bib/bbaa430
- 26. Xiaorui S, Pengwei H, Yi H. et al. Predicting drug–target interactions over heterogeneous information network. IEEE J Biomed Health Inform 2023; 27:562–72. 10.1109/JBHI.2022.3219213
- 27. Qian Y, Li X, Jian W. et al. MCL-DTI: using drug multimodal information and bi-directional cross-attention learning method for predicting drug–target interaction. BMC Bioinformatics 2023; 24:323.
- 28. Hua Y, Feng Z, Song X. et al. MMDG-DTI: drug–target interaction prediction via multimodal feature fusion and domain generalization. Pattern Recogn 2025; 157:110887.
- 29. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023; 379:1123–30. 10.1126/science.ade2574
- 30. Szklarczyk D, Nastou K, Koutrouli M. et al. The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res 2025; 53:D730–7.
- 31. Brody S, Alon U, Yahav E. How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021. https://arxiv.org/abs/2105.14491
- 32. Zhao B-W, Xiao-Rui S, Peng-Wei H. et al. iGRLDTI: an improved graph representation learning method for predicting drug–target interactions over heterogeneous biological information network. Bioinformatics 2023; 39:btad451. 10.1093/bioinformatics/btad451
- 33. Zhao B-W, Xiao-Rui S, Yang Y. et al. Regulation-aware graph learning for drug repositioning over heterogeneous biological network. Inform Sci 2025; 686:121360. 10.1016/j.ins.2024.121360
- 34. Davis MI, Hunt JP, Herrgard S. et al. Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 2011; 29:1046–51. 10.1038/nbt.1990
- 35. Zitnik M, Sosič R, Maheshwari S. et al. BioSNAP datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata.
- 36. Knox C, Wilson M, Klinger CM. et al. DrugBank 6.0: the DrugBank knowledgebase for 2024. Nucleic Acids Res 2024; 52:D1265–75. 10.1093/nar/gkad976
- 37. Huang K, Xiao C, Glass LM. et al. MolTrans: molecular interaction transformer for drug–target interaction prediction. Bioinformatics 2021; 37:830–6.
- 38. Liu Y, Yang X, Gan J. et al. CB-Dock2: improved protein–ligand blind docking by integrating cavity detection, docking and homologous template fitting. Nucleic Acids Res 2022; 50:W159–64. 10.1093/nar/gkac394
- 39. Wang Y, Wang J, Cao Z. et al. Molecular contrastive learning of representations via graph neural networks. Nat Mach Intell 2022; 4:279–87. 10.1038/s42256-022-00447-x
- 40. Rong Y, Bian Y, Tingyang X. et al. Self-supervised graph transformer on large-scale molecular data. Advances in Neural Information Processing Systems 2020; 33:12559–71.
- 41. Xiaorui S, Pengwei H, Li D. et al. Interpretable identification of cancer genes across biological networks via transformer-powered graph representation learning. Nat Biomed Eng 2025; 9:371–89. 10.1038/s41551-024-01312-5
- 42. Xiangzheng F, Peng L, Chen H. et al. GRAPE: graph-regularized protein language modeling unlocks TCR-epitope binding specificity. Brief Bioinform 2025; 26:10. 10.1093/bib/bbaf522
- 43. Zhao L, Wang H, Shi S. PocketDTA: an advanced multimodal architecture for enhanced prediction of drug–target affinity from 3D structural data of target binding pockets. Bioinformatics 2024; 40:btae594.
- 44. Stärk H, Beaini D, Corso G. et al. 3D Infomax improves GNNs for molecular property prediction. In: International Conference on Machine Learning, PMLR 2022; 162:20479–502.
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
The codes and datasets are available online at https://github.com/knhc1234/TriDTI.