Skip to main content
Briefings in Bioinformatics logoLink to Briefings in Bioinformatics
. 2025 May 29;26(3):bbaf245. doi: 10.1093/bib/bbaf245

PLM-DBPs: enhancing plant DNA-binding protein prediction by integrating sequence-based and structure-aware protein language models

Suresh Pokharel 1, Kepha Barasa 2, Pawel Pratyush 3, Dukka B KC 4,
PMCID: PMC12121366  PMID: 40439671

Abstract

DNA-binding proteins (DBPs) play a crucial role in gene regulation, development, and environmental responses across plants, animals, and microorganisms. Existing DBP prediction methods are largely limited to sequence information, whether through handcrafted features or sequence-based protein language models (PLMs), overlooking structural cues critical to protein function. In addition, most existing tools are trained for general DBP predictions, which are often not accurate for plant-specific DBPs due to the unique structural and functional properties of plant proteins. Our work introduces PLM-DBPs, a deep learning framework that integrates both sequence-based and structure-aware representations to enhance DBP prediction in plants. We evaluated several state-of-the-art PLMs to extract high-dimensional protein representations and experimented with various fusion strategies to validate the complementary information between the various representations. Our final model, a fusion of sequence-based and structure-aware ANN models, achieves a notable improvement in predicting DBPs in plants outperforming previous state-of-the-art models. Although sequence-based PLMs already demonstrate strong performance in DBP prediction, our findings show that the integration of structural information further enhances predictive accuracy. This underscores the complementary nature of structural representations and establishes PLM-DBPs as a robust tool for advancing plant research and agricultural innovation. The proposed model and other resources are publicly available at https://github.com/suresh-pokharel/PLM-DBPs

Keywords: plant DNA-binding protein, structure-aware PLMs, protein language models, protein classification, DBP prediction

Introduction

DNA-binding refers to the interaction between proteins and DNA, where specific proteins recognize and attach to particular sequences or structures within the DNA molecule in response to cellular signals such as gene expression changes, environmental cues, or molecular interactions. These proteins bind through mechanisms like hydrogen bonding, electrostatic attraction, hydrophobic interactions, and van der Waals forces. Proteins that can bind to DNA are called DNA-binding proteins (DBPs) [1–3]. In humans and other species, this interaction is crucial for various biological processes that include regulating gene expression, DNA replication, repair, and recombination. Understanding the nature of these proteins is crucial for understanding the basis of many diseases and the development of gene-based therapies [1, 4].

Similarly, in plants, DBPs are essential for the regulation of nearly all aspects of plant life, from growth and development to environmental adaptation and defense responses [5]. Since experimental methods are often time-consuming and resource-intensive, numerous computational tools have been developed, with ongoing research focused on advancing these tools to enhance the efficiency and accuracy of their predictions.

Although several computational tools have been developed to predict DBPs in general DBP datasets, there are only a few tools specifically trained to predict DBPs in plants. Plant DBP prediction is crucial not only for basic biological understanding but also has practical applications in agriculture, environmental sustainability, and biodiversity conservation. DBPs in plants differ significantly compared to DBPs in other species and general DNA-binding predictor often struggle to accurately generalize the plant-specific DBPs [5, 6].

Recent advances in computational methods, particularly machine learning and deep learning, have greatly improved the accuracy of DBP (DBP) prediction from protein sequences. However, many of the existing approaches rely on multiple handcrafted features, which require substantial time and effort in the data processing, feature engineering, and model development process. For example, IDRBP-PPCT [7] uses a random forest (RF) model trained on a combined combination of features of position-specific scoring matrix (PSSM), position-specific frequency matrix (PSFM), and cross-transformation (PPCT) features. Similarly, StackDPPred [8] employs a combination of PSSM features, evolutionary distance transformation (EDT) features, residue probing transformation (RPT) features, and a stacked ensemble of support vector machines (SVM), logistic regression, and K-Nearest Neighbor (KNN) models to develop a DBP predictor. These methods incorporate a variety of machine learning techniques, such as SVM, RF, adaptive boosting (ADB), and light gradient boosting (LGB).

Additionally, iDRBP_MMC [9] is a deep learning model that predicts both DBP and RNA-binding protein (RBP) using a multi-label classification approach. It employs PSSM features to train a convolutional neural network (CNN) for predicting four classes: DBPs, RBPs, non-DBPs, and dual RNA–DNA binding proteins (DRBPs). Similarly, DeepDRBP-2L [10] combines a CNN with a bidirectional long short-term memory (biLSTM) network, also trained on PSSM features, to predict the same classes. PDBP-Fusion [11] takes a similar approach by integrating CNN and biLSTM networks, but instead uses one-hot encoded features for training.

Recently, Wu et al. [4] proposed two models: a hierarchical CNN-LSTM and a multi-class CNN-LSTM, both trained on PSSM features, achieving balanced prediction accuracy for DBPs and RBPs within a single model. More recently, Qi et al. [12] introduced PreDBP-PLMs, a CNN model that combines PLM-based representations with evolutionary (PSSM-based) features to identify general DBPs. Similarly, Zeng et al. [13] developed ESM-DBP, which fine-tunes the ESM-2 protein language model (PLM) to create a domain-adaptive language model for DBP prediction.

Despite the development of numerous tools for DBP prediction, only a limited number of these tools are specifically tailored to plant DBPs [5, 6]. To our knowledge, PlDBPred [5] is the most recent tool developed specifically for plant DBP prediction, which is a SVM model trained on the integrated PSSM features. Since existing methods predominantly rely on sequence-based representations and handcrafted features, including PSSM and physicochemical properties, we propose an effective application of both sequence-based and structure-aware PLM-based representations for plant DNA-binding prediction. Sequence-based PLMs are trained on a large corpus of protein sequences that have demonstrated an extraordinary ability to capture meaningful representations of protein sequences. Notable examples include Ankh [14], ESM versions-1/2/3 (evolutionary scale modeling) [15], and ProtT5 [14], which have shown exceptional performance across various protein-related tasks, such as predicting post-translational modifications [16, 17], binding residues in disordered regions [18], and protein structure [19].

On the other hand, with the AlphaFold breakthrough making protein structures more accessible, structure-aware PLMs [20–22] are trained using both sequences as well as 3D structural information. Integrating 3D structural information as a complementary feature enriches protein representations by capturing spatial and functional details not accessible from sequence alone, thereby enhancing predictive performance across a range of tasks.

Building on the success of these sequence-based and structure-aware PLMs, this study explores their potential for predicting DBPs in plants. We investigated several sequence-based PLMs and integrated the optimal model with a structure-aware PLM to develop PLM-DBPs, a robust deep learning framework for plant DBP prediction. Our results demonstrate that combining sequence and structural representations significantly enhances predictive performance, suggesting that such integration captures latent features within proteins that are not apparent from sequence modality alone.

Materials and methods

Dataset

The main training and test sets were adopted from PlDBPred [5], which was curated from the UniProt database (accessed on 14 June 2021) [23]. The dataset includes DBPs (positive set, GO: 0003677) and non-DBPs (negative set) derived from plants across 35 species. Sequences containing non-standard amino acids (AA; B, J, O, U, X, Z) and those shorter than 50 AAs were removed. The remaining sequences underwent homology reduction using CD-HIT with a 0.4 sequence identity cutoff. After pre-processing, the final PlDBPred dataset includes 849 DBPs and 1848 non-DBPs. The negative set was balanced by randomly selecting an equal number of non-DBPs. An independent test set with 997 sequences (500 non-DBP and 497 DBPs) is also adopted from PlDBPred. The distribution of DBP and non-DBPs in the training and test sets are presented in Table 1.

Table 1.

Summary of datasets used in this study

Dataset name Split/Class Type Count
PlDBPred’s Plant Dataset [5] Train DBPs 849
NDBPs 849
Test-1 DBPs 500
NDBPs 497
Our Independent Plant Test Set Test-2 DBPs 20
NDBPs 20
Multiclass General DBPs Dataset [4] Train/Val/Test (81:9:10) SSBs 561
DSBs 4520
RBPs 5836
non-NABPs 12 899

To further assess the generalization capability of the proposed model, an independent test set was curated using a set based on this rule: we selected proteins annotated with the Gene Ontology term GO:0003677 (DNA binding), belonging to the taxonomy group taxonomy ID:33090 (plant species), with reviewed annotations to ensure data quality. We found 20 DBPs which were created between 14 June 2021 and 22 March 2025, to provide an unbiased validation on recent, high-confidence plant protein sequences. The details of the data set provided for this data set are described in Supplementary information (Supplementary Table T2).

Furthermore, to validate our methodology on a more general dataset, we also trained and tested our model using the multiclass DNA-binding dataset developed by Wu and Guo [4]. Details of the dataset and the corresponding results are provided in Supplementary information (Section S2).

Model workflow architecture

An overview of the proposed PLM-DBPs framework is presented in Fig. 1, illustrating the integration of a sequence-based PLM (ProtT5) and a structure-aware PLM (SaProt) for DBP prediction in plants. The AA sequence of a protein is first input into ProtT5, from which an average-pooled, fixed-length protein-level representation of dimension Inline graphic, where Inline graphic, is extracted. This representation is then passed through a multi-layer perceptron (MLP) classification head to produce a prediction score.

Figure 1.

Figure 1

Overview of the proposed workflow for plant DBP prediction, integrating ProtT5 (sequence-based) and SaProt (structure-aware) PLMs. Inline graphic and Inline graphic represent the probabilities predicted by the two models, which are combined through score-level fusion for more reliable predictions. In this study, equal weights (Inline graphic) are applied, with a cutoff threshold probability of Inline graphic.

Concurrently, the corresponding 3Di structural sequence is provided as input to SaProt, which generates residue-level embeddings that are similarly aggregated to form a protein-level representation of dimension Inline graphic, where Inline graphic. This structural embedding is also passed through an independent MLP classification head to generate a second prediction score.

The outputs from both MLP heads are fused at the score level using a late fusion strategy—specifically, by computing the weighted average of the individual prediction probabilities. In this study, equal weights are applied to both modalities (Inline graphic), and the final classification is based on a threshold probability of Inline graphic. This fusion strategy leverages the complementary strengths of sequence- and structure-based embeddings, resulting in more robust and accurate DBP predictions. Detailed descriptions of each component depicted in Fig. 1 are provided in the subsequent subsections.

Protein language models

In the rapidly evolving field of computational biology and bioinformatics, the advent of large PLMs has significantly influenced research and applications across a wide range of bioinformatics tasks. Built on the seminal transformer architecture, these models leverage the attention mechanism to capture long-range dependencies and contextual relationships between AAs within protein sequences. Importantly, these PLMs are trained on extensive corpora of unlabeled protein sequences, enabling them to generate rich representations that encode structural, functional, and evolutionary information. In this study, we utilize several prominent sequence-based PLMs, including Ankh, three versions of ESM-2, and ProtT5. With the widespread availability of high-quality 3D protein structures from AlphaFold, there is a growing interest in bilingual PLMs that are pretrained on both AA sequence and structure data. These models benefit from the complementary information encoded in structural space. To leverage this advantage, we incorporate a leading structure-aware PLM, SaProt, alongside the aforementioned sequence-based models. A comprehensive summary of all the PLMs used in this study is provided in Supplementary information (Section S1). Key details such as model architecture, number of parameters, and training data are also highlighted in Table 2.

Table 2.

Overview of PLMs evaluated in this study summarizing their architectures, number of layers, number of parameters, training datasets, embedding dimensions, and other details

PLM Dataset Architecture # Layers # Parameters (Billions) Embedding Dimension
Ankh-Base UniRef50 T5 (Hyperparameter Tuning on ProtT5 Framework) 24 Inline graphic B 768
Ankh-Large 24 Inline graphic B 1536
ProtT5 UniRef50 T5-3B 24 Inline graphic B 1024
ESM2-650M UniRef50 RoBERTa 33 0.65B 1280
ESM2-3B 36 3B 2560
ESM2-15B 48 15B 5120
SaProt Sequence, AlphaFold DB, PDB RoBERTa 33 650M 1280

Ankh, ProtT5 models are taken from Hugging Face (https://huggingface.co) ESM models are downloaded from their GitHub repository (https://github.com/evolutionaryscale/esm) SaProt model is downloaded from their GitHub repository (https://github.com/westlake-repl/SaProt).

Extracting protein representation using sequence-based protein language models

The last hidden state of the encoder component from each sequence-based PLM is utilized as the protein representation for the sequence modality. As illustrated in Fig. 2, a protein sequence of length Inline graphic is fed to the PLM, resulting in vector embeddings of dimension Inline graphic for each AA. Subsequently, these residue-level vector embeddings are consolidated into a protein-level representation using average pooling. For instance, in the case of ProtT5, the model generates a sequence of residue-level embeddings, each of dimension Inline graphic, for a protein sequence of length Inline graphic. These Inline graphic embeddings are then averaged across the sequence length to produce a single protein-level feature vector of dimension Inline graphic.

Figure 2.

Figure 2

Process of extracting protein representation using sequence-based and structure-aware PLMs. Ankh/ESM-2/ProtT5 takes an AA sequence as input, whereas SaProt takes a combination of the AA sequence and 3Di sequence.

Extracting protein representation using structure-aware protein language model: SaProt

The 3Di sequence is a structural representation of a protein derived from its 3D conformation. It encodes the local geometric interactions of AAs, allowing the 3D structure of a protein to be expressed in a sequence-like format. As illustrated in Fig. 2, SaProt generates structure-aware embeddings by combining both AA sequences and their corresponding 3D interaction (3Di) sequences.

SaProt takes as input a combined sequence, where each token is formed by pairing a character from the AA sequence with its corresponding 3Di character. It outputs a 1280-dimensional embedding for each token. These token-level embeddings are then average pooled to obtain a fixed-length, protein-level structure-aware representation.

Getting 3Di sequence from protein sequence and structure

The concept of 3Di sequence representation is developed by FoldSeek [24] to encode protein structures into sequence-like representations of local conformations, enabling fast structure-based homology searching against large databases.

As presented in Fig. 3, while Foldseek (Method-II) is the recommended method for deriving 3Di sequences, some protein structures in our dataset were missing from AlphaFold DB or PDB. Due to limited computational resources, we were unable to run AlphaFold locally to obtain the missing 3D structures. Instead, we used ProstT5’s CNN based 3Di [20] prediction model (Method-I) which was trained on millions of AlphaFold-predicted 3D structures and it is helpful for resource-constrained scenarios.

Figure 3.

Figure 3

Conversion of AA sequences to 3Di sequences using Method-I: Predictive approach (no 3D structure needed) using the ProstT5’s pre-trained CNN model for 3Di prediction and Method-II: fold-seek (3D structure needed).

In our small-scale experiments, we compared the performance of SaProt representations generated using Foldseek-derived 3Di sequences versus those generated using ProstT5-predicted 3Di sequences. The difference in the performance of the downstream task was found to be negligible. Given the high-computational cost of generating 3D structures and running Foldseek, we recommend using ProstT5’s 3Di prediction method to use our model.

Model selection

Model evaluation performance metrics

In this study, we used a range of evaluation metrics to assess model performance, consistent with established methods in the field. Metrics such as area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPR), sensitivity (Sn), specificity (Sp), Matthew’s correlation coefficient (MCC), and accuracy (ACC) were used. Using standard classification terminology, TP (true positive) refers to correctly predicted DNA-binding sequences, FP (false positive) refers to incorrectly predicted DNA-binding sequences, TN (true negative) refers to correctly identified non-DNA-binding sequences, and FN (false negative) refers to misclassified non-DNA-binding sequences.

Sensitivity: Inline graphic

Specificity: Inline graphic

Accuracy: Inline graphic

Precision: Inline graphic

F1-score: Inline graphic

MCC: Inline graphic

Cross-validation and hyperparameter-tuning

In this study, a five-fold cross-validation is performed to assess the performance of various machine learning models trained on our datasets. The dataset was randomly split into five equal folds, with one-fold serving as the validation set in each iteration, while the remaining four-folds form the training set. The performance on each validation set is presented in Table 4, with final results representing the averages across all folds. To select the best model and hyperparameters, we performed a grid search training over a wide range of configurations for SVM, RF, feed-forward artificial neural network (ANN), and CNN-1D (1D-CNN) models. We used KerasTuner [25] to optimize the hyperparameters of the neural network and CNN-1D models. The search space for the grid search and KerasTuner, as well as the best-performing hyperparameters, are provided in Supplementary information (Section S2). The best-performing model was selected based on its balanced performance across multiple metrics, including MCC, AUROC, and AUPR, obtained through cross-validation.

Table 4.

The comparison of five-fold cross-validation results based on embeddings from ProtT5, Ankh, ESM2-650M, ESM2-3B, ESM2-15B, and SaProt

PLM Model Sn Sp ACC Pr F1-score AUC–ROC AUC–PR MCC
ProtT5 SVM 94.2Inline graphic2.51 91.3Inline graphic1.34 92.8Inline graphic1.01 91.5Inline graphic0.82 92.9Inline graphic1.55 92.8Inline graphic2.32 89.1Inline graphic1.42 85.6Inline graphic1.12
RF 87.9Inline graphic2.84 91.1Inline graphic1.29 89.5Inline graphic1.16 90.8Inline graphic1.82 89.3Inline graphic2.74 89.5Inline graphic0.95 85.8Inline graphic1.22 79.0Inline graphic1.08
ANN 93.3Inline graphic0.02 91.5Inline graphic0.04 92.4Inline graphic0.02 91.8Inline graphic0.03 92.5Inline graphic0.01 97.3Inline graphic0.01 97.2Inline graphic0.01 84.9Inline graphic0.03
CNN1D 92.6Inline graphic0.06 91.4Inline graphic0.03 92.1Inline graphic0.02 91.7Inline graphic0.02 92.1Inline graphic0.01 96.1Inline graphic0.01 96.6Inline graphic0.01 84.2Inline graphic0.03
ESM2-650M SVM 92.6Inline graphic1.68 92.1Inline graphic3.97 92.3Inline graphic2.74 92.2Inline graphic3.51 92.4Inline graphic2.54 92.3Inline graphic2.74 89.1Inline graphic3.81 84.7Inline graphic5.46
RF 86.1Inline graphic2.41 88.1Inline graphic5.45 87.2Inline graphic3.65 88.2Inline graphic4.76 87.1Inline graphic3.27 87.1Inline graphic3.60 82.9Inline graphic4.51 74.3Inline graphic7.34
ANN 92.9Inline graphic0.02 90.2Inline graphic0.02 91.6Inline graphic0.01 90.5Inline graphic0.02 91.7Inline graphic0.01 96.9Inline graphic0.00 97.0Inline graphic0.01 83.2Inline graphic0.02
CNN-1D 92.6Inline graphic0.02 90.4Inline graphic0.04 91.5Inline graphic0.02 90.8Inline graphic0.02 91.7Inline graphic0.02 96.4Inline graphic0.01 96.1Inline graphic0.02 83.1Inline graphic0.04
ESM2-3B SVM 91.1Inline graphic1.85 92.1Inline graphic2.25 91.6Inline graphic1.23 92.0Inline graphic2.50 91.6Inline graphic1.51 91.6Inline graphic1.22 88.3Inline graphic2.54 83.3Inline graphic2.47
RF 83.1Inline graphic1.49 91.9Inline graphic1.13 87.5Inline graphic0.66 91.1Inline graphic1.35 86.9Inline graphic1.16 87.5Inline graphic0.80 84.1Inline graphic1.82 75.3Inline graphic1.42
ANN 91.7Inline graphic0.02 92.4Inline graphic0.02 92.0Inline graphic0.01 92.3Inline graphic0.02 92.0Inline graphic0.02 96.9Inline graphic0.01 97.1Inline graphic0.01 84.0Inline graphic0.01
CNN-1D 92.8Inline graphic0.03 91.9Inline graphic0.02 92.1Inline graphic0.01 91.8Inline graphic0.02 92.1Inline graphic0.02 97.2Inline graphic0.01 97.3Inline graphic0.02 83.2Inline graphic0.02
ESM2-15B SVM 92.2Inline graphic1.42 93.4Inline graphic2.62 92.8Inline graphic1.62 93.4Inline graphic2.21 92.8Inline graphic1.39 92.8Inline graphic1.62 90.0Inline graphic2.11 85.6Inline graphic3.25
RF 84.3Inline graphic3.37 90.8Inline graphic4.10 87.6Inline graphic2.84 90.5Inline graphic3.64 87.2Inline graphic2.53 87.5Inline graphic2.68 84.1Inline graphic3.14 75.4Inline graphic5.66
ANN 90.3Inline graphic0.02 92.5Inline graphic0.02 91.5Inline graphic0.01 92.4Inline graphic0.01 91.4Inline graphic0.01 96.9Inline graphic0.01 96.9Inline graphic0.02 82.9Inline graphic0.02
CNN-1D 92.4Inline graphic0.03 89.0Inline graphic0.03 90.6Inline graphic0.01 89.3Inline graphic0.02 90.8Inline graphic0.02 96.7Inline graphic0.01 96.6Inline graphic0.02 81.4Inline graphic0.02
Ankh SVM 82.9Inline graphic1.59 80.7Inline graphic2.01 81.8Inline graphic1.06 81.1Inline graphic2.44 82.0Inline graphic1.29 81.8Inline graphic1.06 75.8Inline graphic2.24 63.7Inline graphic2.12
RF 77.3Inline graphic2.83 77.4Inline graphic1.31 77.4Inline graphic1.34 77.3Inline graphic2.05 77.3Inline graphic2.12 77.4Inline graphic1.35 71.1Inline graphic2.47 54.7Inline graphic2.68
ANN 83.2Inline graphic0.02 81.3Inline graphic0.01 79.7Inline graphic0.03 82.8Inline graphic0.01 83.0Inline graphic0.02 80.1Inline graphic0.02 76.1Inline graphic0.02 64.9Inline graphic0.03
CNN-1D 77.6Inline graphic0.06 79.9Inline graphic0.03 78.8Inline graphic0.03 79.5Inline graphic0.14 78.5Inline graphic0.04 86.8Inline graphic0.02 86.4Inline graphic0.03 57.7Inline graphic0.06
SaProt SVM 92.5Inline graphic2.28 89.6Inline graphic1.73 91.6Inline graphic1.23 90.9Inline graphic1.78 91.7Inline graphic1.24 96.6Inline graphic0.58 96.9Inline graphic0.68 83.2Inline graphic2.50
RF 86.5Inline graphic2.19 86.8Inline graphic3.11 87.0Inline graphic1.57 87.4Inline graphic2.19 86.9Inline graphic1.58 93.7Inline graphic0.89 93.7Inline graphic1.24 74.0Inline graphic3.14
ANN 92.8Inline graphic3.11 91.6Inline graphic1.91 92.2Inline graphic1.57 91.6Inline graphic1.98 92.1Inline graphic1.85 96.4Inline graphic1.05 96.5Inline graphic0.96 84.4Inline graphic3.09
CNN-1D 90.9Inline graphic2.77 90.9Inline graphic3.81 90.9Inline graphic2.27 90.8Inline graphic4.06 90.9Inline graphic2.13 96.5Inline graphic0.86 96.7Inline graphic1.01 81.9Inline graphic4.75

Bold values indicate the maximum values for that column. All the metrics are converted to percentages for better comparison. The SaProt results are highlighted to distinguish the outcomes of the structure-aware PLM from those of the sequence-based PLMs.

Cross-modality fusion strategies

Based on evaluating individual models using different representations, we explored various fusion strategies to integrate the knowledge of sequence-based and structure-aware representations. We performed five-fold cross-validation of early fusion (feature-level fusion), cross-attention fusion, and late-fusion (score-level fusion) between top-performing sequence-based representation with structure-aware representations. This integrated model leverages the complementary strengths captured from both sequence and structural modalities, allowing them to enhance prediction accuracy mutually.

  • Feature-level fusion (Early fusion): Combines sequence-based and structure-aware embeddings by concatenating them into a single feature vector before feeding them into the model.

  • Cross-attention fusion: Applies cross-attention to capture dependencies between sequence-based and structure-aware representations by first projecting both embeddings to a similar dimension latent space. The sequence-based embeddings act as the query Inline graphic, while the structure-aware embeddings serve as the key Inline graphic and value Inline graphic. Multihead cross-attention is performed, followed by concatenation of the attended representations, which are then passed through flattened and dense layers for final prediction.

  • Score-level fusion (Late fusion): Integrates the predicted probabilities from independently trained sequence-based and structure-aware models using a weighted average approach.

Final model training

Protein-level representations extracted from both sequence-based and structure-aware PLMs are individually trained using custom prediction heads. Based on the cross-validation results from experiments on various individual and integrated training using representations from sequence-based and structure-aware PLMs, we developed the final model integrating ProtT5 and SaProt-based representations. The final model, dubbed PLM-DBPs (PLMs for DBP prediction), predicts DBPs by averaging the outputs from two independently trained neural networks: one leveraging sequence-based embeddings from ProtT5, and the other utilizing structure-aware embeddings from SaProt. Detailed architectures of these two base models are presented in Table 3. The training statistics of models are provided in Supplementary information (Section S3)

Table 3.

Detailed model summary of ProtT5-ANN and SaProt-ANN used in the plant DBP prediction workflow

Description ProtT5-ANN SaProt-ANN
Input shape (None, 1024) (None, 1280)
Hidden layers 4 3
Neurons per layer 512, 512, 128, 16 512, 256, 16
Dropout 0.2–0.4 0.2–0.4
Activation function ReLU ReLU
Output layer 1 neuron (Sigmoid) 1 neuron (Sigmoid)
Total parameters 855 201 791 329

Fusion strategy: As illustrated in Fig. 1, the score level fusion (also known as late fusion) with a cut-off threshold of Inline graphic is used to integrate two models described in Table 3. The integrated probability Inline graphic is calculated as a weighted average of the predicted probabilities from two models, where Inline graphic. Here, Inline graphic and Inline graphic represent the predicted probabilities from models that were independently trained using sequence-based (ProtT5) and structure-aware (SaProt) PLMs. In this case, the weights are set as Inline graphic, indicating an average score level fusion approach.

Results and discussion

Ablation study on individual PLM representation

The five-fold cross-validation results for hyperparameter optimization and model selection for different prediction heads, including SVM, RF, feed-forward artificial neural networks (ANN), and (1D CNNs) CNN-1D, were trained with a wide range of hyperparameters to evaluate the effectiveness of different feature types.

Table 4 shows that multiple models yield comparable results, indicating high-quality protein representations from both sequence-based and structure-aware PLMs. While other sequence-based PLM’s SVM and CNN-1D models perform similarly to the ANN model, we selected the ProtT5-based ANN model for further experiments due to its superior AUROC and competitive performance across other metrics. In addition, our results are consistent with findings from previous studies [18, 26, 27], which emphasize that the rich information content in PLM-based embeddings allows the downstream model to utilize a relatively simple architecture. We observed that most classification algorithms exhibited consistently strong performance with minimal variation. This enabled us to concentrate on simpler yet effective models, eliminating the need for additional complexity. Consequently, our findings highlight that straightforward ANN models or even shallow architectures can achieve results comparable to deeper neural networks, such as the CNN-1D utilized in our experiments, for downstream tasks.

Similarly, it is noteworthy that models trained with structure-aware protein sequence representations achieve performance nearly equivalent to that of the top-performing models utilizing sequence-based representations.

Ablation study on sequence-structure representation fusion strategies

Based on the cross-validation results of individually trained models, we explore various model fusion strategies, including feature-level fusion, cross-attention fusion, and score-level fusion, by integrating the best-performing sequence-based and structure-aware PLMs. The results of the five-fold cross-validation, presented in Table 5 and Fig. 4, indicate that score-level fusion gives on-par results or outperforms other strategies across most of the evaluation metrics. In contrast, feature-level fusion and cross-attention fusion did not yield competitive performance, which we attribute to the increased feature dimensionality and model complexity, potentially leading to suboptimal learning and generalization. The

Table 5.

The comparison of various fusion strategies of ProtT5 (sequence-based) and SaProt (structure-aware) representation

Fusion technique Sn Sp ACC Pr F1-score AUC–ROC AUC–PR MCC
Feature-level fusion 92.6Inline graphic0.02 86.1Inline graphic0.04 89.2Inline graphic0.02 86.8Inline graphic0.04 89.3Inline graphic0.02 92.1Inline graphic0.03 91.1Inline graphic0.02 79.4Inline graphic0.03
Cross-attention fusion 89.5Inline graphic0.02 93.4Inline graphic0.05 91.5Inline graphic0.02 93.60Inline graphic0.04 91.5Inline graphic0.02 97.1Inline graphic0.01 96.9Inline graphic0.01 83.3Inline graphic0.04
Score-level fusion (PLM-DBPs) 92.4Inline graphic0.01 94.2Inline graphic0.01 93.4Inline graphic0.01 94.2Inline graphic0.01 93.4Inline graphic0.01 98.0Inline graphic0.01 97.9Inline graphic0.01 86.8Inline graphic0.02

Bold values indicate the maximum values for that column.

Figure 4.

Figure 4

Receiver operating characteristic–area under the curve (ROCAUC) and AUPR comparing the performance of individual PLMs and fusion-based models tested on the first-fold of training set.

Analysis of features

We used UMAP (uniform manifold approximation and projection) [28] to visualize high-dimensional representations in a lower-dimensional space. Specifically, we generated 2D UMAP projections for both the original ProtT5 and SaProt embeddings. We also visualized their refined representations learned by the respective PLM-DBPs base models, illustrating the distinction between DBPs and non-DBPs.

As illustrated in Fig. 5a and c, the raw ProtT5 and SaProt embeddings exhibit noticeable separation between DBP and NDBP proteins in our experiment. However, after training the PLM-DBPs models using these features, the resulting embeddings (Fig. 5b and d) demonstrate even clearer distinctions. While these visualizations are lower-dimensional approximations, the observed separations highlight the effectiveness of our proposed approach in capturing the hidden patterns in DBP and non-DBPNDBP.

Figure 5.

Figure 5

Visualization of 2D UMAP projections of learned feature representations. (a) Raw ProtT5 embeddings, (b) ProtT5-based ANN model embeddings from the penultimate layer, (c) raw SaProt embeddings, and (d) SaProt-based ANN model embeddings from the penultimate layer.

Comparison with existing DNA-binding prediction tools

As presented in Table 6, we compared the results of our proposed model with existing tools DNAbinder [29], DPP_PseAAC [30], IDRBP_MMC [9], IDRBP-PPCT [7], PDBP-fusion [11], DeepDRBP [10], StackDPPred [8], PlDBPred[5], and ESM-DBP [13] on the independent test set described in Table 6. The published results from PlDBPred [5] are included for comparison, as it employs the same training and test datasets as our study.

Table 6.

Comparison of performance between PLM-DBPs and existing tools on independent test set

Method Sn Sp ACC Pre F1-score MCC
DNAbinder 50.9 51.5 51.0 67.4 57.9 2.1
DPP_PseAAC 56.1 54.6 55.4 49.8 52.8 10.9
IDRBP_MMC 72.7 67.7 69.9 63.8 67.9 40.1
IDRBP-PPCT 63.9 84.9 69.9 91.4 75.2 44.1
PDBP-fusion 86.4 70.0 75.8 61.2 71.7 54.0
DeepDRBP 61.4 70.6 64.7 79.0 69.1 30.7
StackDPPred 65.8 61.1 63.0 54.2 59.4 26.4
PlDBPred 82.6 93.4 88.0 92.6 87.3 76.4
ESM-DBPa 95.5 87.1 90.69 84.5 89.71 81.7
PLM-DBPsa 88.2 95.4 91.9 94.6 91.3 84.0
PLM-DBPs 89.8 95.1 92.3 95.4 92.5 84.7

aFor a fair comparison, we removed 41 positive sequences from our test dataset when comparing with ESM-DBP, as they were part of its training set (UniDB40). Bold values indicate the maximum values for that column.

Our comparative analysis evaluated the proposed model against existing approaches using several key metrics: sensitivity, specificity, accuracy, precision, F1-Score, and MCC. Although AUROC and AUPR are not included in Table 6 as some of the existing tools that we are comparing against have not reported these metrics. However, PLM-DBPs achieved impressive AUROC and AUPR values of 97.5 and 97.3, respectively on the independent test set.

Additionally, to further evaluate the effectiveness of our proposed method, we conducted experiments on a multiclass general DBP dataset recently analyzed by Wu and Guo [4]. This dataset comprises four distinct classes: RBP, single-stranded DBPs (SSB), double-stranded DBPs (DSB), and non-nucleic acid-binding proteins (NABP). The results of this evaluation are presented in Supplementary information (Section S2), which demonstrates that our method performs well in general DBP prediction, effectively distinguishing between related protein classes.

Comparison with the plant specific DNA-binding prediction tool (PlDBPred)

For further validation of our proposed model, we curated a new test set with the following criteria: proteins associated with the go_manual:0003677 term (DNA binding), from the taxonomy group taxonomy_id:33090 (representing plant species), with reviewed annotations (ensuring high-quality data), and created between 14 June 2021 and 22 March 2025. For the negative class (non-DBPs), we followed the same criteria but excluded proteins associated with go_manual:0003677. This resulted in 951 entries, from which we selected the first 20 sequences for use in our validation.

This ensures that all proteins in this test set are created after our training and main test dataset, preventing any overlap. This independent dataset was specifically designed to assess the generalization and robustness of our model on recent, high-quality protein data, ensuring an unbiased evaluation. Our model achieved competitive performance on this dataset, with an MCC of 85.9%. Detailed results of this experiment are provided in Supplementary information, Supplementary Table III.

We compared the performance of PLM-DBPs against PlDBPred (state of the art model for plant-specific DNA-binding) and found that our model identifies two additional DBPs with high confidence in nearly all cases, outperforming PlDBPred’s results. Detailed results are presented in Supplementary information (Supplementary Table T4).

Computational efficiency and runtime analysis of model inference

For our experiments, we utilized a single A100 80GB GPU for model training. However, inference and evaluation were conducted on an IntelInline graphic XeonInline graphic W-2145 processor (8 cores, 16 threads) @ 3.70GHz with 32GB DDR4 ECC RAM (2 Inline graphic 16 GB, 2666 MHz, Samsung). We observed that prediction time increases with sequence length. To systematically estimate computation time, we conducted experiments on 1000 randomly generated protein sequences of varying lengths (200–1000, in intervals of 200). Figure 6 illustrates the time required for generating representations and the overall prediction time using our proposed tool. On average, the time taken to predict 10 sequences of lengths 200, 400, 600, 800, and 1000 was approximately 89 Inline graphic18, 116 Inline graphic9, 145 Inline graphic25, 173 Inline graphic12, and 218Inline graphic13 s, respectively. Further details of computational time analysis are provided in Supplementary information (Supplementary Table T1). The prediction time can be significantly reduced by using a GPU to generate representations from PLMs.

Figure 6.

Figure 6

PLM-DBPs model inference time analysis for predicting protein sequences of varying lengths using IntelInline graphic XeonInline graphic (8 cores, 16 threads) @ 3.70 GHz with 32 GB DDR4 RAM.

Conclusion and future work

In this study, we introduced PLM-DBPs, a robust prediction model specifically designed for DBPs in plants, integrating sequence-based and structure-aware PLM representations. After exploring various state-of-the-art PLMs for protein representation, we developed an integrated model leveraging a score-level fusion of models trained using ProtT5 (sequence-based PLM) and SaProt (structure-aware PLM) representations.

Our final model demonstrated a notable improvement in most of the measured metrics that highlight the complementary benefits of sequence-based and structure-aware PLMs. By moving beyond reliance on purely sequence-based handcrafted features or embeddings, our proposed methodology integrates structural information, leading to more accurate identification of DBPs in plants. Further validation was conducted through the curation of an additional independent test set comprising plant DBPs annotated after our original dataset’s compilation. The results in all experimented datasets demonstrated that PLM-DBPs have better generalization and robustness as compared to existing approaches.

Future work will involve benchmarking advanced sequence-based PLMs such as ESM3 [31], xTrimoPGLM [32] and explore PLMs more to further enhance prediction accuracy. Additionally, we plan to explore other more efficient fusion techniques and weighting strategies to optimally integrate sequence-based and structure-aware PLMs, e.g. dynamically assigning their weights based on their contributions in making decisions (currently, as illustrated in Fig. 1, we assign equal weights Inline graphic to each prediction). Furthermore, we aim to explore other structure-aware models and incorporate diverse structural databases to improve the model’s generalizability and robustness. A key direction will be to explore the techniques to incorporate structural information more effectively, as the current approach leverages only limited structural knowledge through 3Di sequences.

Key Points

  • PLM-DBPs is a novel method that predicts plant DNA-binding proteins (DBPs) by integrating representations from both sequence-based and structure-aware protein language models (PLMs).

  • We systematically evaluated five different sequence-based PLMs and combined the best-performing one (ProtT5) with a structure-aware model (SaProt), exploring multiple integration strategies, including score-level, feature-level, and cross-attention fusion.

  • PLM-DBPs is validated using two independent plant-based test sets, including newly curated data from UniProt, and further assessed its generalizability using a multiclass general DBP dataset.

  • PLM-DBPs that use cross-modality fusion consistently outperformed existing plant-specific and general DBP predictors in predicting plant DBPs, achieving superior performance across key evaluation metrics.

Supplementary Material

suppl_materials_bbaf245

Acknowledgments

We thank Dr Stefan Schulze for the helpful discussion during the development of this project. We acknowledge the use of the DeepBlizzard HPC Cluster at Michigan Tech and RC Cluster at RIT [33].

Contributor Information

Suresh Pokharel, Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States.

Kepha Barasa, College of Computing, Michigan Technological University, Houghton 49931, MI, United States.

Pawel Pratyush, Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States.

Dukka B KC, Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States.

Author contributions

Dukka KC and Suresh Pokharel conceived the experiment(s). Suresh Pokharel and Kepha Barasa conducted the experiment(s). Kepha Barasa and Pawel Pratyush performed the feature extraction. Suresh Pokharel and Kepha Barasa wrote the manuscript. Suresh Pokharel, Pawel Pratyush, and Dukka KC revised the manuscript. Dukka KC oversaw the project.

 

Conflict of interest: None declared.

Funding

This work is supported by funds from the National Science Foundation (NSF: #2210356, #260359B, and MI-SAPPHIRE-GRANT). Computational resources used were made available by NSF grant #2215734. Additionally, some aspects of the work are supported by a start-up grant from Rochester Institute of Technology (RIT) to D.K.

Data availability

The stand-alone version of the model is publicly available at https://github.com/suresh-pokharel/PLM-DBPs.

References

  • 1. Kaiyang  Q, Wei  L, Zou  Q. A review of DNA-binding proteins prediction methods. Curr Bioinform  2019;14:246–54. 10.2174/1574893614666181212102030 [DOI] [Google Scholar]
  • 2. Hudson  WH, Ortlund  EA. The structure, function and evolution of proteins that bind DNA and RNA. Nat Rev Mol Cell Biol  2014;15:749–60. 10.1038/nrm3884 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Schleif  R. DNA binding by proteins. Science  1988;241:1182–7. 10.1126/science.2842864 [DOI] [PubMed] [Google Scholar]
  • 4. Siwen  W, Guo  J-t. Improved prediction of DNA and RNA binding proteins with deep learning models. Brief Bioinform  2024;25:bbae285. 10.1093/bib/bbae285 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Pradhan  UK, Meher  PK, Naha  S. et al.  Pldbpred: a novel computational model for discovery of DNA binding proteins in plants. Brief Bioinform  2023;24:bbac483. 10.1093/bib/bbac483 [DOI] [PubMed] [Google Scholar]
  • 6. Motion  GB, Howden  AJM, Huitema  E. et al.  DNA-binding protein prediction using plant specific support vector machines: validation and application of a new genome annotation tool. Nucleic Acids Res  2015;43:e158–8. 10.1093/nar/gkv805 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Wang  N, Zhang  J, Liu  B. IDRBP-PPCT: identifying nucleic acid-binding proteins based on position-specific score matrix and position-specific frequency matrix cross transformation. IEEE/ACM Trans Comput Biol Bioinform  2021;19:2284–93. [DOI] [PubMed] [Google Scholar]
  • 8. Mishra  A, Pokhrel  P, Hoque  MT. Stackdppred: A stacking based prediction of DNA-binding protein from sequence. Bioinformatics  2019;35:433–41. 10.1093/bioinformatics/bty653 [DOI] [PubMed] [Google Scholar]
  • 9. Zhang  J, Chen  Q, Liu  B. IDRBP_MMC: identifying DNA-binding proteins and RNA-binding proteins based on multi-label learning model and motif-based convolutional neural network. J Mol Biol  2020;432:5860–75. 10.1016/j.jmb.2020.09.008 [DOI] [PubMed] [Google Scholar]
  • 10. Zhang  J, Chen  Q, Liu  B. Deepdrbp-2l: a new genome annotation predictor for identifying DNA-binding proteins and RNA-binding proteins using convolutional neural network and long short-term memory. IEEE/ACM Trans Comput Biol Bioinform  2019;18:1451–63. [DOI] [PubMed] [Google Scholar]
  • 11. Li  G, Xiuquan  D, Li  X. et al.  Prediction of DNA binding proteins using local features and long-term dependencies with primary sequences based on deep learning. PeerJ  2021;9:e11262. 10.7717/peerj.11262 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Qi  D, Song  C, Liu  T. PREDBP-PLMS: prediction of DNA-binding proteins based on pre-trained protein language models and convolutional neural networks. Anal Biochem  2024;694:115603. 10.1016/j.ab.2024.115603 [DOI] [PubMed] [Google Scholar]
  • 13. Zeng  W, Dou  Y, Pan  L. et al.  Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein. Nat Commun  2024;15:7838. 10.1038/s41467-024-52293-7 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Elnaggar  A, Essam  H, Salah-Eldin  W. et al.  Ankh: optimized protein language model unlocks general-purpose modelling.  arXiv preprint, arXiv:2301.06568. 2023. Available from: https://arxiv.org/abs/2301.06568
  • 15. Lin  Z, Akin  H, Rao  R. et al.  Language models of protein sequences at the scale of evolution enable accurate structure prediction. BioRxiv  2022;2022:500902. [Google Scholar]
  • 16. Pokharel  S, Pratyush  P, Heinzinger  M. et al.  Improving protein succinylation sites prediction using embeddings from protein language model. Sci Rep  2022;12:16933. 10.1038/s41598-022-21366-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Pratyush  P, Bahmani  S, Pokharel  S. et al.  LMCROT: an enhanced protein crotonylation site predictor by leveraging an interpretable window-level embedding from a transformer-based protein language model. Bioinformatics  2024;40:btae290. 10.1093/bioinformatics/btae290 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Jahn  LR, Marquet  C, Heinzinger  M. et al.  Protein embeddings predict binding residues in disordered regions. Sci Rep  2024;14:13566. 10.1038/s41598-024-64211-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Chowdhury  R, Bouatta  N, Biswas  S. et al.  Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol  2022;40:1617–23. 10.1038/s41587-022-01432-w [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Heinzinger  M, Weissenow  K, Sanchez  JG. et al.  Bilingual language model for protein sequence and structure. NAR Genom Bioinform  2024;6:4. 10.1093/nargab/lqae150 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Wang  D, Pourmirzaei  M, Abbas  UL. et al.  S-PLM: structure-aware protein language model via contrastive learning between sequence and structure. Adv  Sci  2025;12:2404212. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Jin  S, Han  C, Zhou  Y. et al.  Saprot: protein language modeling with structure-aware vocabulary. bioRxiv  2023, 2023;–10. 10.1101/2023.10.01.560349 [DOI] [Google Scholar]
  • 23. The UniProt Consortium . Uniprot: the universal protein knowledgebase in 2021. Nucleic Acids Res  2021;49:D480–9. 10.1093/nar/gkaa1100 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Van Kempen  M, Kim  SS, Tumescheit  C. et al.  Fast and accurate protein structure search with foldseek. Nat Biotechnol  2024;42:243–6. 10.1038/s41587-023-01773-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. O’Malley  T, Bursztein  E, Long  J. et al.  Kerastuner.  https://github.com/keras-team/keras-tuner 2019. Available from: https://github.com/keras-team/keras-tuner
  • 26. Villegas-Morcillo  A, Makrodimitris  S, van Ham  RCHJ. et al.  Unsupervised protein embeddings outperform[hand-crafted sequence and structure features at predicting molecular function. Bioinformatics  2021;37:162–70. 10.1093/bioinformatics/btaa701 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Weissenow  K, Heinzinger  M, Rost  B. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure  2022;30:1169–1177.e4. 10.1016/j.str.2022.05.001 [DOI] [PubMed] [Google Scholar]
  • 28. McInnes  L, Healy  J, Melville  J. Umap: uniform manifold approximation and projection for dimension reduction.  arXiv preprint, arXiv:1802.03426. 2018. Available from: https://arxiv.org/abs/1802.03426
  • 29. Kumar  M, Gromiha  MM, Raghava  GPS. Identification of DNA-binding proteins using support vector machines and evolutionary profiles. BMC Bioinform  2007;8:463. 10.1186/1471-2105-8-463 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Saifur Rahman  M, Shatabda  S, Saha  S. et al.  DPP-PSEAAC: a DNA-binding protein prediction model using Chou’s general PSEAAC. J Theor Biol  2018;452:22–34. 10.1016/j.jtbi.2018.05.006 [DOI] [PubMed] [Google Scholar]
  • 31. Hayes  T, Rao  R, Akin  H. et al.  Simulating 500 million years of evolution with a language model. Science  2025;387:eads0018. [DOI] [PubMed] [Google Scholar]
  • 32. Chen  B, Cheng  X, Li  P. et al.  Xtrimopglm: unified 100-billion-parameter pretrained transformer for deciphering the language of proteins. Nat Methods  2025;22:1–12. 10.1038/s41592-025-02636-z [DOI] [PubMed] [Google Scholar]
  • 33. Rochester Institute of Technology. Research Computing Services. Rochester, NY: Rochester Institute of Technology; 2019. https://www.rit.edu/researchcomputing/ [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

suppl_materials_bbaf245

Data Availability Statement

The stand-alone version of the model is publicly available at https://github.com/suresh-pokharel/PLM-DBPs.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press

RESOURCES