Abstract
Protein–nucleic acid binding sites play a crucial role in biological processes such as gene expression, signal transduction, replication, and transcription. In recent years, with the development of artificial intelligence, protein language models, graph neural networks, and transformer architectures have been adopted to build both structure-based and sequence-based predictive models. Structure-based methods benefit from the spatial relationships between residues and have shown promising performance. However, they require 3D protein structures, which are unavailable for most of the protein sequence space. To address this limitation, researchers have attempted to use predicted protein structures to guide binding site prediction. While this strategy has improved accuracy, it still depends on the quality of the structure predictions. Thus, some studies have returned to methods based solely on protein sequences, particularly those using protein language models, which have greatly enhanced prediction accuracy. This paper proposes a novel protein–nucleic acid binding site prediction framework, ATtention Maps and Graph convolutional neural networks to predict nucleic acid–protein Binding sites (ATMGBs), which first fuses protein language embeddings with physicochemical properties to obtain multiview information, then leverages the attention map of a protein language model to simulate the relationships between residues, and finally applies graph convolutional networks to enhance the feature representations for prediction. ATMGBs was evaluated on several independent test sets. The results indicate that the proposed approach significantly improves sequence-based prediction performance, even achieving accuracy comparable to structure-based frameworks. The dataset and code used in this study are available at https://github.com/lixiangli01/ATMGBs.
Keywords: nucleic acid binding sites, GCN, attention map, protein language model, physicochemical properties
Introduction
Protein–nucleic acid interactions play key roles in biological processes such as gene expression, signal transduction, replication, repair, transcription, and translation [1]. The binding sites of protein–nucleic acid complexes, which are the stable contact regions between protein and nucleic acid molecule surfaces, determine the specificity and functionality of their interactions [2, 3]. Numerous methods have been developed to identify protein–nucleic acid binding sites, such as X-ray crystallography and nuclear magnetic resonance spectroscopy [4, 5]. However, determining binding sites through wet-lab experiments is costly and time-consuming, which undoubtedly slows down related drug development and disease diagnosis. With the advancement of big data analytics, particularly deep learning, and the establishment of related binding site databases, a number of broadly applicable predictive tools have been developed on existing nucleic acid binding site datasets to accelerate or guide biological experiments.
Effective extraction of protein intrinsic information is the critical step for constructing protein–nucleic acid binding site prediction models. Sequential information has been extensively used in previous studies. For instance, Wang et al. used PSI-BLAST to obtain evolutionary information and built the BindN model based on support vector machines, achieving high predictive accuracy [6]. Su et al. combined a support vector machine-based ab initio method, SVMnuc, and a template-based method, COACH-D, to build an ensemble prediction model, NucBind, which enhanced the complementarity of the two models [7]. Yan et al. constructed a two-layer logistic regression model using physicochemical properties and predicted secondary structure information of residues, attaining high prediction efficiency [8]. Patiyal et al. harnessed residue physicochemical properties and position specific scoring matrix (PSSM) evolutionary information to construct the DBPred model based on a one-dimensional convolutional neural network to improve the prediction performance [9]. Wang et al. integrated functional features into residue representations using inductive and transfer frameworks to construct a predictive model called iDRNA-ITF, which performed well on multiple test datasets [10]. Zhang et al. introduced HybridDBRpred [11], which used deep transformer networks to combine multiple prediction results, providing new insights for nucleic acid binding site prediction tasks.
Besides, structural information of proteins has also been extracted to predict binding sites on protein–nucleic acid interfaces. For example, 3D structures of proteins can be compared with known structures in structure databases to infer possible binding sites [12]. Moreover, protein surface features, such as surface charge, hydrophobicity, and polarity, can be analyzed and combined with classification algorithms for prediction. Lam et al. proposed NucleicNet [13], which utilizes local protein surface topology and residue physicochemical properties with the ResNet architecture to enhance the model’s generalization performance. Xia et al. employed hierarchical graph neural networks (HGNNs) to embed structural and biophysical features for binding residue recognition, significantly improving the prediction accuracy [14]. Additionally, methods such as EGPDI [15] and CrossBind [16] have fully applied the geometric information of protein structures or employed different graph neural network aggregation strategies to innovate prediction methods and achieve excellent prediction performance.
However, structure-based prediction methods require real 3D protein structure information. For the vast protein sequence space, the majority of sequences lack corresponding real 3D structure data, which hinders the application of structure-based methods. With the advancement of AI-based protein structure prediction, such as AlphaFold [17] and ESMFold [18], sequence-based protein structure prediction has made great progress. Correspondingly, some researchers have utilized predicted structural information to predict protein–nucleic acid binding sites, achieving performance comparable to predictions based on real protein structure data. For example, Yuan et al. used AlphaFold-derived structural information combined with a graph transformer model, achieving prediction accuracy equivalent to that of real protein structure-based binding site prediction on multiple test datasets [19]. This progress has greatly expanded the applicability boundary of protein–nucleic acid binding site prediction. Additionally, methods such as GLMSite [20], GPSite [21], and DeepProSite [22] based on predicted structures have further improved the prediction accuracy. Nevertheless, the strategy of using predicted structures has some drawbacks. First, obtaining predicted structures is itself nontrivial: deploying and running inference with large deep learning models like AlphaFold incurs high computational costs. Second, the accuracy of binding site predictions strongly depends on the quality of the predicted protein structures; any errors in structure prediction directly affect the accuracy of binding site predictions. As a result, some researchers have refocused on sequence-based prediction methods, with protein language models playing a key role. By using pretrained protein language models, high-dimensional embedding representations of protein sequences learned from the entire protein sequence space can be obtained. For example, Villegas-Morcillo et al. significantly improved sequence-based protein–nucleic acid binding site predictions by utilizing pretrained ELMo embeddings [23], demonstrating the feasibility of using protein language models for nucleic acid–protein binding site prediction. Zhu et al. employed three transformer-based language models combined with a Long Short-Term Memory (LSTM) attention framework to effectively improve DNA binding site prediction accuracy [24]. Zhang et al. proposed a prediction framework called MucLiPred [25], which improved nucleic acid binding residue identification by utilizing pretrained BERT models with a dual contrastive learning mechanism. Liu et al. also used pretrained protein language models and contrastive learning methods to predict DNA binding residues and evaluated predictions for multiple ligand-binding tasks [26]. Wu et al. applied protein language models to generate robust sequence embeddings and combined them with multiscale learning and scale-based self-attention mechanisms for the recognition of specific nucleotide-binding residues [27].
Leveraging protein language embeddings for model training has effectively improved the performance of sequence-based nucleic acid–protein binding site prediction, but these embeddings lack an explicit representation of interresidue correlations within the protein sequence. In structure-based binding site prediction methods, the distance between residues is often calculated to provide this information. For sequence-only prediction methods, however, such information cannot be directly obtained, which limits prediction performance. Incorporating interresidue relationships into sequence models therefore becomes a challenging task. Most protein language models are designed based on the transformer architecture [28], whose most notable feature is the self-attention mechanism, which dynamically focuses on key parts of the sequence and captures long-range dependencies and contextual information. This interdependent representation may help establish relationships between residues. In this study, we extracted the attention map from the ProtT5 [29] protein language model for residue relationship modeling and combined embeddings of the T5 and ESM models [30] for model construction. In addition, physicochemical properties of amino acids were employed to improve the prediction performance and enhance model stability. With these three kinds of features, we built ATMGBs, a model using ATtention Maps and Graph convolutional neural networks to predict nucleic acid–protein Binding sites. The results on several test sets indicate that the proposed model significantly improves sequence-based prediction performance, achieving results comparable to structure-based prediction frameworks.
Materials and methods
Datasets
To evaluate the performance of ATMGBs and compare it fairly with other sequence- or structure-based models, we used widely adopted protein–nucleic acid binding site datasets from previous studies. Specifically, for the DNA–protein binding site prediction task, we used the DNA_Train_573 and DNA_Test_129 datasets obtained from GraphBind [14] for model training and testing. Another test set, DNA_Test_181 [19], was employed to further evaluate generalization performance. For the RNA binding site prediction task, we similarly used RNA_Train_495 from GraphBind [14], which was processed in the same way as DNA_Train_573. We trained ATMGBs on RNA_Train_495 for RNA binding site prediction and evaluated its generalization performance on RNA_Test_117. In both the DNA and RNA binding site training and testing datasets, a target residue was defined as a DNA/RNA-binding residue if the minimum atomic distance between the residue and the DNA/RNA molecule is less than the sum of the van der Waals radii of the two closest atoms plus 0.5 Å. Detailed information about the collection of each dataset can be found in the online supplementary material, and the number of protein complexes and the ratio of binding to nonbinding sites are shown in Table S1.
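For clarity, this distance rule can be written as a short NumPy check. The sketch below is illustrative only (the function and array layouts are ours, not from the ATMGBs code base):

```python
import numpy as np

def is_binding_residue(res_coords, res_radii, na_coords, na_radii, cutoff=0.5):
    """Label a residue as DNA/RNA-binding if any of its atoms lies within
    (sum of van der Waals radii + cutoff) of any nucleic acid atom.

    res_coords: (m, 3) atom coordinates of the residue
    res_radii:  (m,)   van der Waals radii of the residue atoms
    na_coords:  (n, 3) atom coordinates of the nucleic acid molecule
    na_radii:   (n,)   van der Waals radii of the nucleic acid atoms
    """
    # Pairwise atom-atom distances between residue and nucleic acid atoms
    dists = np.linalg.norm(res_coords[:, None, :] - na_coords[None, :, :], axis=-1)
    # Pairwise sums of van der Waals radii
    radii_sums = res_radii[:, None] + na_radii[None, :]
    # Binding if any atom pair is closer than the vdW contact distance + 0.5 A
    return bool(np.any(dists < radii_sums + cutoff))
```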
Overall architecture of ATMGBs
In this study, we propose a novel framework, ATMGBs, for nucleic acid binding site prediction. As shown in Fig. 1, the framework mainly consists of four parts. First, in the sequence information extraction part, we use pretrained protein language models to represent protein sequences. To maximize the utilization of sequence information, we employ two pretrained models, ESM1b and ProtT5, which are widely used in protein prediction tasks. Furthermore, considering the critical role of the physicochemical properties of amino acids in protein-binding prediction tasks [31], we retrieve 531 physicochemical properties of amino acids, including electrostatic potential, hydrophobicity, and others, from the AAindex database. Additionally, to capture the interresidue relationships within nucleic acid binding protein sequences, we fuse the attention maps of the 24th layer of ProtT5 and use them to construct a relational graph. Second, in the physicochemical property processing part, we use the transformer architecture to capture and integrate the physicochemical information across the entire protein sequence. Third, the three kinds of features are concatenated to obtain a more comprehensive representation of each residue. Finally, we employ a multilayer graph convolutional network (GCN) to perform residue-level feature aggregation over the static attention relation graph constructed earlier, and a multilayer perceptron (MLP) performs binary classification to determine whether each residue is a binding site.
Figure 1.
Framework of ATMGBs.
Protein language embedding extraction
Two protein language models, ProtT5 and ESM1b, were adopted to obtain protein embeddings in this work. The ProtT5-XL-U50 model is a transformer-based self-supervised protein language model pretrained on UniRef50 [32]. It predicts masked amino acids based on the contextual information of protein sequences, enabling it to capture complex contextual dependencies. In this study, we obtained L × 1024-dimensional embedding representations for each protein sequence from ProtT5. Additionally, we employed the ESM1b model, which offers advantages in computational efficiency and parameter count. ESM1b was pretrained on a dataset of 250 million sequences. Through its deep transformer architecture, it effectively learns residue physicochemical properties as well as long-range homology information across sequences. We obtained L × 1280-dimensional embedding representations for each protein sequence from ESM1b.
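For reference, per-residue ProtT5 embeddings can be obtained roughly as follows with the Hugging Face transformers library, following the usage pattern documented for ProtTrans; the example sequence is arbitrary, and ESM1b embeddings can be extracted analogously with the fair-esm package:

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Load the ProtT5-XL-UniRef50 encoder
tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50",
                                        do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtT5 expects space-separated residues; rare amino acids map to X
prepped = " ".join(re.sub(r"[UZOB]", "X", seq))
inputs = tokenizer(prepped, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# Drop the trailing special token to obtain an L x 1024 per-residue embedding
emb_t5 = out.last_hidden_state[0, :len(seq)]
print(emb_t5.shape)  # torch.Size([33, 1024])
```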
Amino acid physicochemical properties
AAindex (Amino Acid Index Database) [33] is a database used to describe the physicochemical properties of amino acids and their related characteristics. It provides quantifiable numerical indicators for bioinformatics, structural biology, and protein engineering research, and is commonly used in protein structure prediction, function analysis, and sequence feature extraction, among other fields. In this section, we used iFeature [34] to retrieve 531 physicochemical property indices for each amino acid in the sequence (provided in the code repository), including hydrophobicity, polarity, volume, charge, rigidity, solubility, and other information.
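As a sketch, assuming the 531 indices have been exported to a plain 20 × 531 text table (for instance via iFeature), per-residue feature matrices can be assembled as follows; the file layout and helper names are illustrative assumptions:

```python
import numpy as np

AA_ORDER = "ACDEFGHIKLMNPQRSTVWY"

def load_aaindex_table(path):
    """Return a dict mapping each amino acid to its 531-dim property vector.
    Assumes a tab-separated file with one row per amino acid in AA_ORDER."""
    table = np.loadtxt(path, delimiter="\t")   # expected shape (20, 531)
    return {aa: table[i] for i, aa in enumerate(AA_ORDER)}

def aaindex_features(seq, table):
    """Stack per-residue property vectors into an L x 531 feature matrix.
    Unknown residues fall back to the column-wise mean."""
    default = np.mean(list(table.values()), axis=0)
    return np.stack([table.get(aa, default) for aa in seq])
```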
Attention map extraction
According to the interface provided in the ProtTrans code repository [29], we extracted the attention map matrix from the final layer of the ProtT5-XL-U50 language model. To comprehensively utilize the information from each attention head in the final layer, and considering the sparsity of different attention maps and their impact on computational efficiency, we stacked the attention matrices from the different heads and averaged them. This averaged map was then used to construct the interresidue relationships of the protein sequence.
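A minimal sketch of this extraction with the Hugging Face interface is shown below; the head-averaging step follows the description above, while the token bookkeeping is an illustrative assumption that may differ from the released ATMGBs code:

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("Rostlab/prot_t5_xl_uniref50",
                                        do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50").eval()

seq = "MKTAYIAKQR"
inputs = tokenizer(" ".join(re.sub(r"[UZOB]", "X", seq)), return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one tensor per layer, each of shape
# (batch, n_heads, L+1, L+1), including the trailing special token.
last_layer = out.attentions[-1][0]          # (n_heads, L+1, L+1)
attn_map = last_layer.mean(dim=0)           # average over attention heads
attn_map = attn_map[:len(seq), :len(seq)]   # drop the special-token row/column
```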
Transformer architecture
After obtaining the physicochemical information for the entire protein sequence, we input the AAindex features as raw embeddings into the encoder module of the transformer architecture to further perceive and compress them. In the transformer encoder, the multihead self-attention mechanism is the key to this dynamic capturing. The specific principle is as follows: let the input sequence be $X \in \mathbb{R}^{L \times d}$, where $L$ is the sequence length and $d$ is the feature dimension of each residue. First, the input $X$ is mapped to three sets of feature matrices through linear transformations, and the attention output is computed as shown in the following formulas:

$$Q = XW_Q, \quad K = XW_K, \quad V = XW_V \tag{1}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

where $W_Q$, $W_K$, and $W_V$ are the trainable weight matrices, and $d_k$ is the feature dimension after the transformation.

To enhance the model's representability and efficiency, a multihead self-attention mechanism is used. The input is split into multiple heads, each of which independently performs the attention computation described above. The outputs of all heads are concatenated ($\mathrm{Concat}$) and then merged through a linear transformation. The specific formula is as follows:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W_O \tag{3}$$

where $W_O$ is a trainable weight matrix. Finally, we obtain the transformed AAindex-based representation, which is fused with the protein language embeddings.
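This step can be sketched in a few lines of PyTorch; the dimensions and layer counts below are illustrative choices, not the exact settings of ATMGBs:

```python
import torch
import torch.nn as nn

class AAindexEncoder(nn.Module):
    """Project 531-dim AAindex features and refine them with a transformer
    encoder (a sketch; all sizes are illustrative)."""
    def __init__(self, in_dim=531, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                     # x: (batch, L, 531)
        return self.encoder(self.proj(x))     # (batch, L, d_model)

feats = torch.randn(1, 120, 531)              # one protein of length 120
refined = AAindexEncoder()(feats)             # refined AAindex representation
```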
Multiple information fusion with GCN architecture
To integrate the high-dimensional embeddings extracted from ProtT5 and ESM1b, along with the physicochemical information of amino acid residues, a GCN [35] was employed based on the residues’ relationship derived from the attention map extracted from the protein language model. Specifically, each residue in the protein sequence is treated as a node in the graph, and the attention weights between different residues in the attention map are abstracted as edges in the graph. Through a message-passing mechanism, information is transmitted between the nodes of the graph, and the three types of features for each residue are aggregated layer by layer in the graph convolution process. This allows the model to learn high-dimensional information from different amino acid nodes, which is then used for final classification. The detailed process is as follows:
$$M^{(l)} = \hat{A} H^{(l)} \tag{4}$$

$$H^{(l+1)} = \sigma\big(M^{(l)} W^{(l)}\big) \tag{5}$$

where $H^{(0)} \in \mathbb{R}^{L \times d_0}$ is the concatenated input feature matrix, $A \in \mathbb{R}^{L \times L}$ is the adjacency matrix, $W^{(l)}$ is the trainable weight matrix of the $l$th layer, and $\sigma(\cdot)$ is the activation function. In standard GCNs, the adjacency matrix is typically symmetrically normalized, i.e. $\hat{A} = D^{-1/2}(A + I)D^{-1/2}$ with degree matrix $D$, to ensure balanced and stable information flow. However, in ATMGBs, we use the attention map extracted from the protein language model as the relational graph for modeling. Since the attention map $A$ is a weighted directed graph, we do not apply symmetric normalization but instead directly use $A$ for information aggregation. The specific formula is as follows:

$$H^{(l+1)} = \sigma\big(A H^{(l)} W^{(l)}\big) \tag{6}$$

Additionally, as mentioned in Yuan et al. [36], using a GCN with residual connections can effectively mitigate the oversmoothing caused by an increase in the number of network layers. We also adopted this strategy in ATMGBs. The formula is as follows:

$$H^{(l+1)} = \sigma\Big(\big((1-\alpha)\, A H^{(l)} + \alpha H^{(0)}\big)\big((1-\beta)\, I + \beta\, W^{(l)}\big)\Big) \tag{7}$$

where $\alpha$ and $\beta$ are hyperparameters and $I$ is an identity matrix.
Finally, we used an MLP to classify whether a residue is a binding site based on the output of the last GCN layer. Specifically:

$$\hat{Y} = \mathrm{Sigmoid}\big(H^{(K)} W + b\big) \tag{8}$$

where $H^{(K)}$ is the output of the last GCN layer, $W$ is the weight matrix, $b$ is the bias term, and $\hat{Y}$ refers to the prediction results for the $L$ residues in a protein sequence. Additionally, the hyperparameter settings and training environments for ATMGBs are provided in Tables S2 and S3.
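A PyTorch sketch of Equations (6)–(8) is given below, with the attention map used directly as a dense, unnormalized adjacency matrix; layer sizes and hyperparameter values are illustrative rather than the trained model's settings:

```python
import torch
import torch.nn as nn

class ResidualAttnGCN(nn.Module):
    """Attention-map GCN with initial-residual connections (Eqs 6-7).
    A sketch: layer count and alpha/beta values are illustrative."""
    def __init__(self, dim, n_layers=8, alpha=0.1, beta=0.5):
        super().__init__()
        self.weights = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(n_layers))
        self.alpha, self.beta = alpha, beta
        self.act = nn.ELU()

    def forward(self, h0, attn_map):     # h0: (L, dim), attn_map: (L, L)
        h = h0
        for w in self.weights:
            # Message passing over the weighted, directed attention graph,
            # deliberately without symmetric normalization (Eq. 6)
            agg = attn_map @ h
            # Initial residual plus identity mapping to limit over-smoothing
            support = (1 - self.alpha) * agg + self.alpha * h0
            h = self.act((1 - self.beta) * support + self.beta * w(support))
        return h

class BindingHead(nn.Module):
    """Per-residue binary classification head (Eq. 8)."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, h):                 # h: (L, dim)
        return self.mlp(h).squeeze(-1)    # per-residue binding probability
```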
Evaluation metrics
To evaluate the model's prediction performance, we use precision (PRE), recall (REC), specificity (SPE), accuracy (ACC), F1 score (F1), and Matthews correlation coefficient (MCC) as the evaluation metrics during both the training and testing phases. Their definitions are as follows:
$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{9}$$

$$\mathrm{REC} = \frac{TP}{TP + FN} \tag{10}$$

$$\mathrm{SPE} = \frac{TN}{TN + FP} \tag{11}$$

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$F1 = \frac{2 \times \mathrm{PRE} \times \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}} \tag{13}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{14}$$
where TP represents true positives, FP represents false positives, TN represents true negatives, and FN represents false negatives. Additionally, we also used area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) to assess the classification performance of the model in the nucleic acid–protein binding site prediction task, where AUPRC is widely used for performance evaluation on imbalanced datasets.
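These metrics can be computed from per-residue predictions with scikit-learn, as in the sketch below; the 0.5 decision threshold is illustrative, and average_precision_score serves as the usual estimator of AUPRC:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score, f1_score,
                             matthews_corrcoef, precision_score, recall_score,
                             roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    """Compute PRE/REC/SPE/ACC/F1/MCC at a threshold, plus AUROC and AUPRC."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    return {
        "PRE": precision_score(y_true, y_pred, zero_division=0),
        "REC": recall_score(y_true, y_pred),
        "SPE": tn / (tn + fp) if (tn + fp) else 0.0,
        "ACC": accuracy_score(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "AUROC": roc_auc_score(y_true, y_prob),
        "AUPRC": average_precision_score(y_true, y_prob),  # AUPRC estimate
    }
```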
Results and discussion
Ablation of ATMGBs modules and features
We used five-fold cross-validation on the training datasets to select our model architecture and validate the effectiveness of the model design. Note that both feature ablation and module ablation were conducted together since some network modules are deeply tied to the corresponding feature information. In total, eight model architectures were compared, including (i) ATMGBs, (ii) ATMGBs without ProtT5 embeddings, (iii) ATMGBs without ESM1b embeddings, (iv) ATMGBs without AAindex embeddings and its related transformer, (v) ATMGBs with AAindex embeddings which were not processed with the transformer, (vi) ATMGBs without ESM1b and AAindex embeddings, (vii) ATMGBs without ProtT5 and AAindex embeddings, and (viii) ATMGBs without ProtT5 and ESM1b embeddings.
As shown in Table 1, for the DNA–protein binding site prediction task, the first four ablation experiments indicated that removing each part of ATMGBs affected the final results, causing a decline in performance metrics. Although removing the AAindex features led to only a slight decrease in AUROC and MCC, it had a pronounced effect on the stability of model convergence (see Fig. S1). Accordingly, the AAindex features are deemed essential to the model. Additionally, we found that removing the transformer module used to enhance the representation of the AAindex features led to a performance drop (AUROC drops from 0.917 to 0.910), illustrating the necessity of this module. Moreover, the results of the three single-feature models, corresponding to ESM, T5, and physicochemical features, indicated that the T5-based model exhibited the best performance (AUROC: 0.912), outperforming the models using ESM (AUROC: 0.894) and AAindex (AUROC: 0.707).
Table 1.
Ablation and characteristic ablation results of the ATMGBs model
| Task type | Model type | ACC | PRE | REC | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|---|
| Protein–DNA binding sites | ATMGBs | 0.927 | 0.653 | 0.442 | 0.519 | 0.496 | 0.918 | 0.600 |
| | ATMGBs (w/o ProtT5) | 0.924 | 0.662 | 0.355 | 0.453 | 0.444 | 0.899 | 0.549 |
| | ATMGBs (w/o ESM) | 0.927 | 0.686 | 0.388 | 0.490 | 0.479 | 0.914 | 0.589 |
| | ATMGBs (w/o AAindex) | 0.929 | 0.689 | 0.406 | 0.509 | 0.494 | 0.917 | 0.600 |
| | ATMGBs (w/o transformer)^a | 0.922 | 0.617 | 0.440 | 0.499 | 0.474 | 0.910 | 0.571 |
| | ProtT5 + GCN | 0.927 | 0.660 | 0.421 | 0.513 | 0.491 | 0.912 | 0.588 |
| | ESM + GCN | 0.922 | 0.637 | 0.365 | 0.454 | 0.440 | 0.894 | 0.541 |
| | AAindex + transformer | 0.910 | 0.075 | 0.0002 | 0.0004 | 0.003 | 0.707 | 0.210 |
| Protein–RNA binding sites | ATMGBs | 0.908 | 0.674 | 0.315 | 0.408 | 0.405 | 0.877 | 0.533 |
| | ATMGBs (w/o ProtT5) | 0.906 | 0.678 | 0.255 | 0.367 | 0.375 | 0.857 | 0.499 |
| | ATMGBs (w/o ESM) | 0.902 | 0.590 | 0.403 | 0.455 | 0.424 | 0.873 | 0.524 |
| | ATMGBs (w/o AAindex) | 0.907 | 0.661 | 0.312 | 0.420 | 0.411 | 0.874 | 0.531 |
| | ATMGBs (w/o transformer)^a | 0.905 | 0.664 | 0.288 | 0.385 | 0.385 | 0.869 | 0.513 |
| | ProtT5 + GCN | 0.907 | 0.613 | 0.376 | 0.465 | 0.433 | 0.871 | 0.526 |
| | ESM + GCN | 0.907 | 0.642 | 0.311 | 0.415 | 0.402 | 0.856 | 0.502 |
| | AAindex + transformer | 0.894 | 0.651 | 0.031 | 0.058 | 0.121 | 0.744 | 0.315 |

^a "w/o transformer" means that the AAindex features are not further processed with the transformer but are concatenated directly with the other two embeddings; the concatenated features are then input to the GCN.
Similarly, in the protein–RNA binding site prediction task, we conducted corresponding ablation experiments. The ATMGBs model achieved the best result in five-fold cross-validation on the training dataset, with an AUROC of 0.877 and AUPRC of 0.533, both outperforming the other seven ablation models.
Based on the ablation experiment results of the two tasks, we demonstrated the contribution of each module to the proposed architecture. In addition, to validate the effectiveness of the attention map used in our model, we conducted two additional experiments. First, we constructed three model variants using different kinds of attention maps: one in which the attention maps generated by ProtT5 were replaced with randomly generated matrices of the same size and similar value distributions, one using the attention maps generated by the ESM2 model, and one using the average of the ProtT5 and ESM2 attention maps. The cross-validation results of the three variants on the two training datasets are shown in Table S4, which indicate that the AUROC and AUPRC values decrease in five of the six cases compared to ATMGBs. Second, we replaced the attention map-based GCN module in step 4 of our model with four other modules, namely transformer, CNN + transformer, BiGRU, and CNN + BiGRU, to obtain four model variants. The corresponding results are shown in Table S5, which illustrate that ATMGBs outperformed all four variants in terms of AUROC and AUPRC. These two experiments confirm the critical role of attention maps in modeling interresidue relationships in our framework. Moreover, to verify the efficacy of the protein language model embeddings, we replaced them with one-hot encodings while keeping all other settings the same. As shown in Table S6, the one-hot models achieved AUROC and AUPRC of 0.698 and 0.195 on DNA_Train_573, and 0.740 and 0.305 on RNA_Train_495, respectively, which are significantly worse than the corresponding values of ATMGBs.
Hyperparameter sensitivity analysis
The performance of our model could be affected by different hyperparameters, so we performed a grid search to select the optimal hyperparameters. The ranges of the three key hyperparameters, dropout rate, learning rate, and number of GCN layers are [0, 0.1, 0.3, 0.5], [1e-2, 1e-3, 1e-4, 1e-5], and [4, 6, 8, 10], respectively. Tables S7–S9 show the cross-validation results based on different hyperparameters on DNA_Train_573 and RNA_Train_495, which indicate that the best dropout rate, the best learning rate, and the optimal number of GCN layers are 0.1, 1e-3, and 8, respectively.
Comparison with other sequence-based models on the protein–DNA independent test sets
As shown in Table 2, when compared with seven sequence-based models, DRNApred [8], DNAPred [37], SVMnuc [7], NCBRPred [38], ULDNA [24], PDNAPred [39], and CLAPE-DB [26], on DNA_Test_129, ATMGBs outperformed all the other models across all evaluation metrics except recall. Specifically, compared to CLAPE-DB, ATMGBs showed significant improvements in all metrics, with AUROC and AUPRC increased by 5% and 11%, respectively. Compared to PDNAPred [39], which uses ESM and ProtT5 features similar to those of ATMGBs, ATMGBs still showed substantial improvements, with AUROC and AUPRC increased by 0.7% and 1.6%, respectively.
Table 2.
Comparison with other sequence methods on two independent test sets, DNA_Test_129 and DNA_Test_181
| Test sets | Model type | REC | PRE | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| DNA_Test_129 | DRNApred | 0.233 | 0.190 | 0.210 | 0.155 | 0.693 | – |
| | DNAPred | 0.396 | 0.353 | 0.373 | 0.332 | 0.845 | 0.367 |
| | SVMnuc | 0.316 | 0.371 | 0.341 | 0.304 | 0.812 | 0.302 |
| | NCBRPred | 0.312 | 0.392 | 0.347 | 0.313 | 0.823 | 0.310 |
| | ULDNA | **0.725** | 0.340 | 0.463 | 0.452 | 0.893 | – |
| | PDNAPred | 0.595 | 0.466 | 0.523 | **0.494** | 0.923 | 0.509 |
| | CLAPE-DB | 0.464 | 0.396 | 0.427 | 0.389 | 0.881 | 0.411 |
| | ATMGBs | 0.566 | **0.490** | **0.525** | **0.494** | **0.930** | **0.525** |
| DNA_Test_181 | DNAPred | 0.334 | 0.223 | 0.267 | 0.233 | 0.802 | 0.230 |
| | SVMnuc | 0.289 | 0.242 | 0.263 | 0.229 | 0.803 | 0.193 |
| | NCBRPred | 0.259 | 0.241 | 0.250 | 0.215 | 0.771 | 0.183 |
| | ULDNA | **0.585** | 0.238 | 0.339 | 0.331 | 0.851 | – |
| | PDNAPred | 0.512 | 0.309 | **0.386** | **0.364** | 0.896 | 0.350 |
| | CLAPE-DB | 0.413 | 0.212 | 0.280 | 0.252 | 0.824 | – |
| | ATMGBs | 0.455 | **0.327** | 0.380 | 0.353 | **0.899** | **0.354** |

Note that the values of the evaluation metrics were collected from the literature; values not reported in the corresponding literature are indicated by '–'. To facilitate understanding, the highest value in each column is shown in bold.
Similarly, as shown in Table 2, when compared with six sequence-based models on DNA_Test_181, ATMGBs outperformed all the other models in terms of AUROC and AUPRC. Compared to CLAPE, ATMGBs led by a wide margin in all metrics, with AUROC and MCC increased by 7% and 10%, respectively. Compared to PDNAPred, ATMGBs still showed its superiority, with AUROC and AUPRC increased by 0.3% and 0.4%, respectively.
All the results in Table 2 demonstrate the excellent performance of ATMGBs in the protein–DNA binding site prediction task, and the comparison with PDNAPred fully validates the effectiveness of using an attention map-based relational graph neural network and amino acid physicochemical information.
Comparison with other sequence-based models on the protein–RNA independent test sets
Similar to the protein–DNA binding site prediction task, we evaluated the generalization performance of ATMGBs for RNA–protein binding site prediction on the independent test sets RNA_Test_117 and RNA_Test_161. Table 3 presents the results of state-of-the-art sequence-based methods on RNA_Test_117. When compared with RNABindPlus [40], SVMnuc, CLAPE-RB, and PDNAPred, ATMGBs outperformed all the other models, achieving the best predictive performance. Specifically, compared to PDNAPred, ATMGBs achieved improvements of 7%, 2%, 4%, 4%, and 2.5% in REC, PRE, F1, MCC, and AUROC, respectively.
Table 3.
Comparison with other sequence-based methods on RNA_Test_117
| Test sets | Model type | REC | PRE | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| RNA_Test_117 | RNABindPlus | 0.273 | 0.227 | 0.248 | 0.202 | 0.717 | – |
| | SVMnuc | 0.231 | 0.240 | 0.235 | 0.192 | 0.729 | – |
| | CLAPE-RB | 0.467 | 0.201 | 0.281 | 0.240 | 0.800 | – |
| | PDNAPred | 0.335 | 0.298 | 0.315 | 0.274 | 0.829 | – |
| | ATMGBs | 0.409 | 0.316 | 0.356 | 0.317 | 0.854 | 0.284 |

Note that the values of the evaluation metrics were collected from the literature; values not reported in the corresponding literature are indicated by '–'.
To further explore the generalization ability of ATMGBs, we conducted a test on another larger independent test set, RNA_Test_161 (see Table S1). This test set was collected in MucLiPred [25] for generalization performance comparison. For fair comparison, we trained the ATMGBs model using the RNA_Train_545 dataset, as provided by MucLiPred. The results, shown in Table 4, demonstrate that ATMGBs achieved the best results when compared with six other methods. Specifically, it outperformed the second-best method, PDNAPred, by 2% and 6% for AUROC and MCC, respectively. This fully demonstrates the wide applicability and robustness of ATMGBs in the protein–RNA binding site prediction task.
Table 4.
Comparison with other sequence-based methods on RNA_Test_161
| Test sets | Model type | ACC | REC | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|
| RNA_Test_161 | DRNApred | 0.516 | 0.454 | 0.010 | 0.520 | – |
| | NCBRPred | 0.840 | 0.379 | 0.140 | 0.690 | – |
| | RNABindR-Plus | 0.683 | 0.672 | 0.150 | 0.730 | – |
| | iDRNA-ITF | 0.718 | 0.716 | 0.190 | 0.770 | – |
| | MucLiPred | 0.851 | 0.669 | 0.430 | 0.840 | – |
| | PDNAPred | 0.886 | – | 0.410 | 0.840 | – |
| | ATMGBs | 0.867 | 0.590 | 0.470 | 0.860 | 0.570 |

Note that the values of the evaluation metrics were collected from the literature; values not reported in the corresponding literature are indicated by '–'.
Comparison with other structure-based methods
In the nucleic acid–protein binding site prediction task, many excellent prediction methods currently utilize either the true or predicted protein structures for model construction. It is necessary to explore the performance differences between structure-based prediction methods and sequence-based methods. To this end, we specifically compared ATMGBs with several widely used structure-based methods, particularly focusing on their prediction performance across independent test sets for protein–DNA binding sites and protein–RNA binding sites.
As shown in Tables 5 and 6, we compared ATMGBs with structure-based methods on three independent datasets: DNA_Test_129, DNA_Test_181, and RNA_Test_117. On DNA_Test_129, ATMGBs outperformed five models: COACH-D [12], NucBind [7], DNABind [41], GLMsite [20], and GraphBind [14]. On DNA_Test_181, it outperformed COACH-D, NucBind, DNABind, and GLMsite. On RNA_Test_117, it outperformed COACH-D, NucBind, aaRNA [42], and NucleicNet. Among the models used for comparison, GraphBind is a well-recognized protein–nucleic acid binding site prediction model based on protein structure and graph neural networks, and it was one of the first models tested on the DNA_Test_129 and RNA_Test_117 datasets. On DNA_Test_129, ATMGBs outperformed GraphBind, with improvements of 0.3% and 0.6% for AUROC and AUPRC, respectively. On RNA_Test_117, ATMGBs achieved the same AUROC value (0.854) as GraphBind. These results demonstrate that our sequence-based model, ATMGBs, can achieve comparable prediction performance to those structure-based methods.
Table 5.
Comparison with other structure-based methods on DNA_Test_129 and DNA_Test_181

| Dataset | Models | REC | PRE | F1 | MCC | AUROC | AUPRC |
|---|---|---|---|---|---|---|---|
| DNA_Test_129 | COACH-D | 0.367 | 0.357 | 0.362 | 0.321 | 0.710 | 0.269 |
| | NucBind | 0.330 | 0.381 | 0.354 | 0.317 | 0.811 | 0.284 |
| | DNABind | 0.601 | 0.346 | 0.440 | 0.411 | 0.858 | 0.402 |
| | GraphBind | 0.676 | 0.425 | 0.522 | 0.499 | 0.927 | 0.519 |
| | Graphsite | 0.665 | 0.460 | 0.543 | 0.519 | 0.934 | 0.544 |
| | GLMsite | 0.848 | 0.287 | 0.405 | 0.412 | 0.918 | – |
| | EGPDI | 0.612 | 0.503 | 0.549 | 0.522 | 0.941 | – |
| | ATMGBs | 0.566 | 0.490 | 0.525 | 0.494 | 0.930 | 0.525 |
| DNA_Test_181 | COACH-D | 0.254 | 0.280 | 0.266 | 0.235 | 0.655 | 0.172 |
| | NucBind | 0.293 | 0.248 | 0.269 | 0.234 | 0.796 | 0.191 |
| | DNABind | 0.535 | 0.199 | 0.290 | 0.279 | 0.825 | 0.219 |
| | GraphBind | 0.624 | 0.293 | 0.399 | 0.392 | 0.904 | 0.339 |
| | Graphsite | 0.517 | 0.354 | 0.420 | 0.397 | 0.917 | 0.369 |
| | GLMsite | 0.829 | 0.209 | 0.311 | 0.334 | 0.899 | – |
| | EGPDI | 0.558 | 0.346 | 0.424 | 0.407 | 0.914 | – |
| | ATMGBs | 0.455 | 0.327 | 0.380 | 0.353 | 0.899 | 0.354 |

Note that the values of the evaluation metrics were collected from the literature; values not reported in the corresponding literature are indicated by '–'.
Table 6.
Comparison with other structure-based methods on RNA_Test_117
| Dataset | Models | REC | PRE | F1 | MCC | AUROC |
|---|---|---|---|---|---|---|
| RNA_Test_117 | COACH-D | 0.221 | 0.252 | 0.235 | 0.195 | 0.663 |
| | NucBind | 0.231 | 0.235 | 0.233 | 0.189 | 0.715 |
| | aaRNA | 0.484 | 0.166 | 0.247 | 0.214 | 0.771 |
| | NucleicNet | 0.371 | 0.201 | 0.261 | 0.216 | 0.788 |
| | GraphBind | 0.463 | 0.294 | 0.358 | 0.322 | 0.854 |
| | ATMGBs | 0.409 | 0.316 | 0.356 | 0.317 | 0.854 |
Moreover, we also compared ATMGBs with some of the latest structure-based methods, such as Graphsite and EGPDI, on the aforementioned independent test sets. For example, EGPDI employs a multiview fusion strategy based on equivariant graph neural networks and graph convolutional networks, effectively leveraging structural information to significantly enhance its prediction accuracy. Compared to these sophisticated methods, the proposed sequence-based ATMGBs shows slightly worse performance. However, as we mentioned earlier, obtaining detailed protein structures is difficult in practical application scenarios. Therefore, from a practical standpoint, our method is more accessible and user-friendly for researchers.
Case study
To visually demonstrate the results of ATMGBs in protein–nucleic acid binding site prediction, we selected a protein–DNA complex (PDB ID: 4G92) [43] for a case study. This complex is not included in the training or test datasets used previously. PyMOL was used to visually display the prediction results. Figure 2A shows the distribution of the actual binding sites of 4G92A, and Fig. 2B–D shows the binding sites predicted by ATMGBs and two other methods.
Figure 2.
The actual labels (A) and the predicted labels (B–D) of three different prediction methods on the 4G92 complex. The prediction results for the different residues are presented in four colors: TN (true negative): blue, TP (true positive): red, FN (false negative): purple, FP (false positive): yellow.
As shown in Fig. 2B, ATMGBs successfully predicted 10 of the 12 binding sites on the interface of 4G92A. In contrast, DBPred [9] (Fig. 2C) predicted only two binding sites. Figure 2D presents the prediction results from GPSite [21], an advanced method based on structures predicted by ESMFold that has achieved excellent prediction performance on multiple protein–ligand binding site datasets. GPSite predicted 6 of the 12 known binding sites in 4G92A. It is important to note that GPSite utilizes predicted structural information, whereas our model relies solely on protein sequence modeling and still achieves superior prediction results, highlighting the strong performance of ATMGBs. In Figs S2 and S3, we present the ROC and PRC curves for the predictions of 4G92A by ATMGBs and GPSite. Because DBPred and GPSite were trained and tested on datasets different from those used in this study, it would not be fair to compare our model with these two models on DNA_Test_129, DNA_Test_181, and RNA_Test_117. To conduct a comprehensive and fair comparison, we retrained and evaluated our models on the datasets of GPSite and DBPred, respectively. The results, shown in Tables S10 and S11, illustrate that our models achieve the second-best performance on the independent test sets of GPSite and the best performance on the independent test set of DBPred.
Interpretability analysis
Through the comprehensive comparison of ATMGBs with other state-of-the-art models, we conclude that ATMGBs is an outstanding tool for nucleic acid binding site recognition. To further explore the potential reasons for its superior predictive capability, we conducted model interpretability analysis from two perspectives: feature representation and attention map-derived residue relationships.
First, we utilized t-SNE (t-distributed stochastic neighbor embedding) to visualize feature representations of the three input features and the features processed with the ATMGBs framework. Specifically, in the protein–DNA binding site prediction task, we first merged the two independent test sets used in this study, DNA_Test_129 and DNA_Test_181, to create a new dataset with 310 DNA-binding proteins. Then, two kinds of embeddings of the residues from the 310 proteins were extracted from ProtT5 and ESM1b, respectively. Moreover, the physicochemical properties for each residue were extracted from AAindex. In addition, using the ATMGBs model, we obtained the hidden layer information from the last layer of the GCN, which corresponds to the high-level feature representations of the residues. These features were visualized in Fig. 3 by t-SNE.
Figure 3.
Visualization of four kinds of feature representations for positive and negative examples: Physicochemical properties (A, E), ProtT5 (B, F), ESM (C, G), and high-level representation processed by ATMGBs (D, H). The upper panel is for protein–DNA binding site prediction and the lower panel is for protein–RNA binding site prediction.
As shown in Fig. 3A–C, the positive and negative samples represented by the original physicochemical properties, T5 embeddings, and ESM embeddings are generally mixed. However, Fig. 3D shows a clear boundary between the positive and negative samples represented by the features processed through ATMGBs. These results demonstrate that ATMGBs has learned the difference between binding and nonbinding residues, significantly enhancing prediction performance.
For the protein–RNA binding site prediction task, we observed a similar outcome. By similar procedures, we merged RNA_Test_117 and RNA_Test_161 into a new dataset with 278 RNA-binding proteins and then t-SNE was employed to visualize the four kinds of feature representations. As shown in Fig. 3E–H, it is difficult to distinguish the regions where positive and negative samples are located based on the three input feature representations. However, after processing through ATMGBs, the distribution of positive and negative samples exhibits a certain regularity, which is consistent with the protein–DNA binding protein case. This further proves that ATMGBs is capable of effectively distinguishing binding and nonbinding residues.
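A sketch of this visualization with scikit-learn's t-SNE is given below; the input feature matrix and variable names are illustrative:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    """Project per-residue features to 2D and color by binding label.
    `features` is (N, d); `labels` is (N,) with 1 = binding residue."""
    labels = np.asarray(labels)
    coords = TSNE(n_components=2, random_state=0).fit_transform(features)
    for value, name, color in [(0, "non-binding", "tab:blue"),
                               (1, "binding", "tab:red")]:
        mask = labels == value
        plt.scatter(coords[mask, 0], coords[mask, 1], s=2, c=color, label=name)
    plt.title(title)
    plt.legend()
    plt.show()

# e.g. plot_tsne(gcn_hidden, y, "ATMGBs high-level representation")
```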
Furthermore, in the ATMGBs framework, the attention map is used to simulate the relationships between protein residues, and its effectiveness is supported by the superior predictive performance reported above. It is therefore worthwhile to analyze how it functions in our model. We selected a protein–DNA complex, PDB ID 5GZB [44], from the DNA_Test_129 dataset as an example. Our model achieved good prediction performance on this complex, with an AUROC of 0.930 and an AUPRC of 0.766. The attention map for this protein is visualized as a heatmap and compared with the heatmap of the real contact matrix. In Fig. 4A–C, we present the real structure of 5GZB_A, the adjacency matrix heatmap, and the attention map heatmap. A certain similarity can be observed between the two heatmaps; thus, the attention map can enhance the information representation of the proteins. On the other hand, the heatmap of attention weights is not as perfectly symmetric as the contact map. This is because the attention map is generated through the multihead attention mechanism of the protein language model, and the attention scores between two residues are not identical and can even vary significantly. This provides additional information for protein–nucleic acid binding site prediction by enabling message passing between residues.
Figure 4.
Interpretability analysis of the complex 5GZB_A. (A) The real structure of 5GZB_A. (B) The true distance matrix between the residues of the complex 5GZB_A. (C) The attention weight heatmap between the residues of the complex 5GZB_A. (D) The attention weight heatmap between the binding residues of the complex 5GZB_A.
In addition, the attention map can, to some extent, allow binding residues to perceive one another. For example, for the 5GZB_A complex, we present the attention weight heatmap for the 18 binding sites in Fig. 4D. Some binding sites exhibit relatively large attention weights between them, which allows these residues to receive more useful information during the graph convolution over the attention map. This, in turn, improves the predictive performance of the ATMGBs model.
Performance on other ligand binding site prediction tasks
To validate the transferability of the ATMGBs framework to a broader range of protein–ligand binding site prediction tasks, we applied it to five tasks involving adenosine triphosphate (ATP), heme (protoporphyrin IX containing Fe; HEM), Ca2+, Mg2+, and Mn2+. The performance of our model was compared with GraphBind [14], DELIA [45], IonCom [46], COACH [47], and S-SITE [47], where GraphBind and DELIA are based on protein structural information and COACH uses both protein sequence and structural information. Note that we did not compare with another state-of-the-art ligand binding site prediction model, LMetalSite [48], because it used different training and testing datasets. For the five tasks, we trained the model on the corresponding training datasets and evaluated generalization performance on independent test sets. Figure 5 shows the AUROC values of different models on the independent test sets for the ATP, Ca2+, and Mg2+ binding site prediction tasks, with additional evaluation metrics and the results for the other two tasks provided in Tables S12–S18.
Figure 5.
Comparison of AUROC between ATMGBs and other methods on ATP (A), Ca2+ (B), and Mg2+ (C) binding site prediction tasks.
The results indicate that ATMGBs performs notably well, achieving AUROC values that are comparable to or exceed those of other methods for the five tasks. For example, in the protein–Mg2+ binding site prediction task, ATMGBs achieved an AUROC of 0.848, outperforming GraphBind (0.827), DELIA (0.780), IonCom (0.685), and the other methods. These results demonstrate the broad applicability of the ATMGBs model and its potential as a general tool for protein–ligand binding site prediction based solely on protein sequence information.
Conclusion
In this study, we developed a sequence-based framework, ATMGBs, for predicting protein–nucleic acid binding sites. The relationships between residues have been shown to be important information for effective prediction. Based only on protein sequences, the attention map obtained from ProtT5 was tailored to simulate these residue relationships, and ablation experiments and interpretability analyses demonstrated the effectiveness of this strategy. Further results on several independent test sets demonstrated that our model is superior to existing state-of-the-art sequence-based models and comparable to structure-based models. Additionally, ATMGBs performs well in other protein–ligand binding site prediction tasks, matching the performance of models tailored to specific tasks and showing its potential as a universal binding site prediction model.
Although the results are superior compared with sequence-based methods, ATMGBs has not fully surpassed structure-based methods. In future work, we will further optimize ATMGBs, such as fine-tuning the residue relationship graph based on attention maps using real protein residue contact map information or employing contrastive learning strategies to align protein language embeddings with physicochemical information of residues. The improved ATMGBs model will then be applied to a broader range of protein prediction tasks, contributing to related biological and medical research.
Key Points
Three kinds of features were extracted to comprehensively represent the sequence information of proteins: embeddings from the ProtT5 and ESM1b protein language models and physicochemical properties from AAindex.
The attention map obtained from ProtT5 was used to simulate the residue relationships of the proteins and was fed to a graph convolutional network (GCN) to enhance the representations.
The proposed ATMGBs model, which is based only on protein sequences, achieves superior performance across multiple independent test sets compared with other state-of-the-art models.
Contributor Information
Xiang Li, School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.
Wei Peng, School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.
Xiaolei Zhu, School of Information and Artificial Intelligence, Anhui Agricultural University, 130 Changjiang Road, Shushan District, Hefei, Anhui 230036, China.
Author contributions
X.Z.: Study conception. X.L.: Study design. X.L., W.P.: Data analysis. X.L., W.P., and X.Z.: Writing the paper. All the authors read and approved the manuscript.
Funding
This work was supported by the University Natural Science Research Project of Anhui Province (grant no. 2023AH050998).
Conflict of interest: The authors declare that they have no competing interests.
Data availability
The code and datasets used in this study are available at: https://github.com/lixiangli01/ATMGBs. A web server is also deployed for our model at: http://zhulab.org.cn/ATMGBs.
References
1. Dillon SC, Dorman CJ. Bacterial nucleoid-associated proteins, nucleoid structure and gene expression. Nat Rev Microbiol 2010;8:185–95. 10.1038/nrmicro2261
2. Said N, Finazzo M, Hilal T. et al. Sm-like protein Rof inhibits transcription termination factor ρ by binding site obstruction and conformational insulation. Nat Commun 2024;15:3186. 10.1038/s41467-024-47439-6
3. Steitz TA. Structural studies of protein–nucleic acid interaction: the sources of sequence-specific binding. Q Rev Biophys 1990;23:205–80. 10.1017/S0033583500005552
4. Ge P, Zhou ZH. Hydrogen-bonding networks and RNA bases revealed by cryo electron microscopy suggest a triggering mechanism for calcium switches. Proc Natl Acad Sci 2011;108:9637–42. 10.1073/pnas.1018104108
5. Hubbard SR, Bishop WR, Kirschmeier P. et al. Identification and characterization of zinc binding sites in protein kinase C. Science 1991;254:1776–9. 10.1126/science.1763327
6. Wang L, Brown SJ. BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res 2006;34:W243–8. 10.1093/nar/gkl298
7. Su H, Liu M, Sun S. et al. Improving the prediction of protein–nucleic acids binding residues via multiple sequence profiles and the consensus of complementary methods. Bioinformatics 2019;35:930–6. 10.1093/bioinformatics/bty756
8. Yan J, Kurgan L. DRNApred, fast sequence-based method that accurately predicts and discriminates DNA- and RNA-binding residues. Nucleic Acids Res 2017;45:e84. 10.1093/nar/gkx059
9. Patiyal S, Dhall A, Raghava GP. A deep learning-based method for the prediction of DNA interacting residues in a protein. Brief Bioinform 2022;23:bbac322. 10.1093/bib/bbac322
10. Wang N, Yan K, Zhang J. et al. iDRNA-ITF: identifying DNA- and RNA-binding residues in proteins based on induction and transfer framework. Brief Bioinform 2022;23:bbac236.
11. Zhang J, Basu S, Kurgan L. HybridDBRpred: improved sequence-based prediction of DNA-binding amino acids using annotations from structured complexes and disordered proteins. Nucleic Acids Res 2024;52:e10. 10.1093/nar/gkad1131
12. Wu Q, Peng Z, Zhang Y. et al. COACH-D: improved protein–ligand binding sites prediction with refined ligand-binding poses through molecular docking. Nucleic Acids Res 2018;46:W438–42. 10.1093/nar/gky439
13. Lam JH, Li Y, Zhu L. et al. A deep learning framework to predict binding preference of RNA constituents on protein surface. Nat Commun 2019;10:4941. 10.1038/s41467-019-12920-0
14. Xia Y, Xia C-Q, Pan X. et al. GraphBind: protein structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 2021;49:e51. 10.1093/nar/gkab044
15. Zheng M, Sun G, Li X. et al. EGPDI: identifying protein–DNA binding sites based on multi-view graph embedding fusion. Brief Bioinform 2024;25:bbae330. 10.1093/bib/bbae330
16. Jing L, Xu S, Wang Y. et al. CrossBind: collaborative cross-modal identification of protein nucleic-acid-binding residues. In: Proceedings of the AAAI Conference on Artificial Intelligence. Washington, DC, USA: AAAI Press, 2024, 2661–9.
17. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2
18. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574
19. Yuan Q, Chen S, Rao J. et al. AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Brief Bioinform 2022;23:bbab564. 10.1093/bib/bbab564
20. Song Y, Yuan Q, Zhao H. et al. Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures. Brief Bioinform 2023;24:bbad360. 10.1093/bib/bbad360
21. Yuan Q, Tian C, Yang Y. Genome-scale annotation of protein binding sites via language model and geometric deep learning. Elife 2024;13:RP93695. 10.7554/eLife.93695
22. Fang Y, Jiang Y, Wei L. et al. DeepProSite: structure-aware protein binding site prediction using ESMFold and pretrained language model. Bioinformatics 2023;39:btad718. 10.1093/bioinformatics/btad718
23. Villegas-Morcillo A, Makrodimitris S, van Ham RC. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 2021;37:162–70. 10.1093/bioinformatics/btaa701
24. Zhu Y-H, Liu Z, Liu Y. et al. ULDNA: integrating unsupervised multi-source language models with LSTM-attention network for high-accuracy protein–DNA binding site prediction. Brief Bioinform 2024;25:bbae040. 10.1093/bib/bbae040
25. Zhang J, Wang R, Wei L. MucLiPred: multi-level contrastive learning for predicting nucleic acid binding residues of proteins. J Chem Inf Model 2024;64:1050–65. 10.1021/acs.jcim.3c01471
26. Liu Y, Tian B. Protein–DNA binding sites prediction based on pre-trained protein language model and contrastive learning. Brief Bioinform 2024;25:bbad488.
27. Wu J, Liu Y, Zhang Y. et al. Identifying protein-nucleotide binding residues via grouped multi-task learning and pre-trained protein language models. J Chem Inf Model 2025;65:1040–52. 10.1021/acs.jcim.5c00837
28. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30:5998–6008.
29. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2021;44:7112–27. 10.1109/TPAMI.2021.3095381
30. Rives A, Meier J, Sercu T. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci 2021;118:e2016239118. 10.1073/pnas.2016239118
31. Stebliankin V, Shirali A, Baral P. et al. Evaluating protein binding interfaces with transformer networks. Nat Mach Intell 2023;5:1042–53. 10.1038/s42256-023-00715-4
32. Suzek BE, Huang H, McGarvey P. et al. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 2007;23:1282–8. 10.1093/bioinformatics/btm098
33. Kawashima S, Pokarowski P, Pokarowska M. et al. AAindex: amino acid index database, progress report 2008. Nucleic Acids Res 2007;36:D202–5. 10.1093/nar/gkm998
34. Chen Z, Zhao P, Li F. et al. iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502. 10.1093/bioinformatics/bty140
35. Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. 10.48550/arXiv.1609.02907
36. Yuan Q, Chen J, Zhao H. et al. Structure-aware protein–protein interaction site prediction using deep graph convolutional network. Bioinformatics 2022;38:125–32. 10.1093/bioinformatics/btab643
37. Zhu Y-H, Hu J, Song X-N. et al. DNAPred: accurate identification of DNA-binding sites from protein sequence by ensembled hyperplane-distance-based support vector machines. J Chem Inf Model 2019;59:3057–71. 10.1021/acs.jcim.8b00749
38. Zhang J, Chen Q, Liu B. NCBRPred: predicting nucleic acid binding residues in proteins based on multilabel learning. Brief Bioinform 2021;22:bbaa397. 10.1093/bib/bbaa397
39. Zhang L, Liu T. PDNAPred: interpretable prediction of protein-DNA binding sites based on pre-trained protein language models. Int J Biol Macromol 2024;281:136147. 10.1016/j.ijbiomac.2024.136432
40. Walia RR, Xue LC, Wilkins K. et al. RNABindRPlus: a predictor that combines machine learning and sequence homology-based methods to improve the reliability of predicted RNA-binding residues in proteins. PLoS One 2014;9:e97725. 10.1371/journal.pone.0097725
41. Liu R, Hu J. DNABind: a hybrid algorithm for structure-based prediction of DNA-binding residues by combining machine learning- and template-based approaches. Proteins 2013;81:1885–99. 10.1002/prot.24330
42. Li S, Yamashita K, Amada KM. et al. Quantifying sequence and structural features of protein–RNA interactions. Nucleic Acids Res 2014;42:10086–98. 10.1093/nar/gku681
43. Huber EM, Scharf DH, Hortschansky P. et al. DNA minor groove sensing and widening by the CCAAT-binding complex. Structure 2012;20:1757–68. 10.1016/j.str.2012.07.012
44. Shi Z, He F, Chen M. et al. DNA-binding mechanism of the Hippo pathway transcription factor TEAD4. Oncogene 2017;36:4362–9. 10.1038/onc.2017.24
45. Xia C-Q, Pan X, Shen H-B. Protein–ligand binding residue prediction enhancement through hybrid deep heterogeneous learning of sequence and structure data. Bioinformatics 2020;36:3018–27. 10.1093/bioinformatics/btaa110
46. Hu X, Dong Q, Yang J. et al. Recognizing metal and acid radical ion-binding sites by integrating ab initio modeling with template-based transferals. Bioinformatics 2016;32:3260–9. 10.1093/bioinformatics/btw396
47. Yang J, Roy A, Zhang Y. Protein–ligand binding site recognition using complementary binding-specific substructure comparison and sequence profile alignment. Bioinformatics 2013;29:2588–95. 10.1093/bioinformatics/btt447
48. Yuan Q, Chen S, Wang Y. et al. Alignment-free metal ion-binding site prediction from protein sequence through pretrained language model and multi-task learning. Brief Bioinform 2022;23:bbac444. 10.1093/bib/bbac444