DRBpred: A Sequence-based Machine Learning Method to effectively predict DNA- and RNA-Binding Residues

Md Wasi Ul Kabir; Duaa Mohammad Alawad; Pujan Pokhrel; Md Tamjidul Hoque

doi:10.1016/j.compbiomed.2024.108081

Author manuscript; available in PMC: 2025 Mar 1.

Published in final edited form as: Comput Biol Med. 2024 Jan 29;170:108081. doi: 10.1016/j.compbiomed.2024.108081


PMCID: PMC10922697 NIHMSID: NIHMS1966654 PMID: 38295475

Abstract

DNA-binding and RNA-binding proteins are essential to an organism’s normal life cycle. These proteins have diverse functions in various biological processes. DNA-binding proteins are crucial for DNA replication, transcription, repair, packaging, and gene expression. Likewise, RNA-binding proteins are essential for the post-transcriptional control of RNAs and RNA metabolism. Identifying DNA- and RNA-binding residues is essential for biological research and for understanding the pathogenesis of many diseases. However, most DNA-binding and RNA-binding proteins remain undiscovered. This research explored various properties of protein sequences, such as amino acid composition type, Position-Specific Scoring Matrix (PSSM) values of amino acids, Hidden Markov Model (HMM) profiles, physicochemical properties, structural properties, torsion angles, and disorder regions. We utilized a sliding window technique to extract more information from a target residue’s neighbors. We propose an optimized Light Gradient Boosting Machine (LightGBM) method, named DRBpred, to predict DNA-binding and RNA-binding residues from the protein sequence. Compared to the state-of-the-art method, DRBpred improves Sensitivity, Matthews Correlation Coefficient (MCC), and AUC by 112.00%, 33.33%, and 6.49%, respectively, on the DNA-binding test set, and by 112.50%, 16.67%, and 7.46%, respectively, on the RNA-binding test set.

Keywords: DNA-binding proteins, RNA-binding proteins, machine learning


1. Introduction

Protein-DNA and protein-RNA interactions are important in various biological processes, including DNA replication and repair, gene regulation, transcription, post-transcriptional control of RNAs and RNA metabolism, and other DNA-related and RNA-related biological activities [1–4]. Understanding how and why proteins interact with DNA and RNA requires the identification of DNA-binding and RNA-binding proteins. Many experimental techniques, such as nuclear magnetic resonance, X-ray crystallography, and chromatin immunoprecipitation on microarrays, can identify DNA-binding and RNA-binding proteins [5]. However, these experimental techniques are time-consuming and labor-intensive [6]. Given the limitations of wet-lab experiments, computational methods for identifying putative DNA-binding and RNA-binding proteins have become increasingly important in recent years. Recent breakthroughs in genomic and proteomic techniques have generated numerous DNA-binding and RNA-binding protein sequences [7]. For example, in 2014, there were more than ten times as many DNA-binding proteins in the UniProt database as in 2000 [8]. These massive amounts of data lay the groundwork for research into computational approaches for identifying DNA-binding and RNA-binding proteins.

In the existing literature, many recent methods [9, 10] rely not just on the protein sequence but also on the protein’s experimentally derived or predicted 3D structure. However, the number of experimentally derived structures (DNA and RNA complexes) is limited. Addressing this gap, our work introduces DRBpred, a novel approach for predicting DNA-binding and RNA-binding residues using only protein sequences. We studied several properties of protein sequences, including amino acid composition, evolutionary profiles (PSSM and HMM values of amino acids), physicochemical properties, structural properties, torsion angles, and disorder values. We ranked the features to determine which features contribute most to our trained model. We employed a recursive feature elimination (RFE) technique combined with SHAP (Shapley Additive exPlanations) values to select a subset of important features. The sliding window technique was used to obtain as much information as possible about the target and context residues. The features were concatenated to achieve superior predictive performance. In addition, an optimized LightGBM (Light Gradient Boosting Machine) classifier-based predictor was trained as the machine-learning method for the classification task. We found that the proposed method outperformed the existing state-of-the-art methods. Finally, we used Local Interpretable Model-Agnostic Explanations (LIME) to explain the trained model and analyze the effects of features.

2. Related Work

Several methods have been proposed in the literature to identify DNA-binding and RNA-binding sites in proteins. These prediction methods use three types of features: sequence, structure, and evolutionary. Evolutionary features were hard to compute in earlier work because of limited computing power, so structural and sequence-based features were mostly used for prediction. Ahmad et al. utilized only sequence features to predict protein-DNA binding [9]. Cai and Lin employed the support vector machine (SVM) algorithm to predict DNA-binding proteins, utilizing a protein’s amino acid composition, hydrophobicity, and solvent-accessible surface area correlations as input features [11]. In more recent work, Zou et al. introduced a sequence-based protocol that integrates informative features from different scales to train an SVM model for the prediction of DNA-binding proteins [12]. The random forest (RF) algorithm, which is a useful machine learning classifier, was also used to predict DNA-binding proteins. Lou et al. applied the RF algorithm for the prediction of DNA-binding proteins, with predicted relative solvent accessibility, predicted secondary structure, and position-specific scoring matrix serving as the primary sequence features [13]. Zhang et al. proposed DNA-Prot, a predictor for DNA-binding proteins. It employs an SVM classifier and a comprehensive set of features categorized into six groups: primary sequence-based, evolutionary profile-based, predicted relative solvent accessibility-based, predicted secondary structure-based, physicochemical property-based, and biological function-based features [14]. In addition, Yan et al. presented the DRNApred tool [15], which can distinguish between DNA-binding and RNA-binding residues and proteins. It employs features drawn from diverse sources of sequence-derived information and is built on a dataset containing both DNA-binding and RNA-binding proteins. This information includes amino acid types, amino acid physicochemical properties, evolutionary profiles, potential intrinsic disorder, secondary structure, and solvent accessibility. DRNApred reduces cross-predictions between DNA-binding and RNA-binding residues, and its false positives are potentially of higher quality because they lie near native binding residues. Moreover, Seungwoo et al. introduced DP-Bind, a method for predicting DNA-binding sites in a DNA-binding protein based on the protein’s amino acid sequence. DP-Bind implements three machine learning methods: support vector machine (DP-Bind (svm)), kernel logistic regression (DP-Bind (klr)), and penalized logistic regression (DP-Bind (plr)) [16]. In DP-Bind, predictions can be made using either the input sequence alone or an automatically generated profile of the input sequence’s evolutionary conservation in the form of a PSI-BLAST position-specific scoring matrix (PSSM). Wang et al. proposed the BindN+ method, which employs two SVM models to predict RNA-binding and DNA-binding sites; each model performs better on its respective type of proteins [17].

Several studies have indicated the significance of evolutionary features in the detection of DNA-binding proteins [18–20]. Methods lacking these evolutionary features tend to exhibit lower accuracy, and their classifiers are often biased by imbalanced sample numbers. Thus, the inclusion of evolutionary information to predict DNA-binding residues can improve accuracy. Computing power has increased dramatically in the last decade, which makes it much easier to compute evolutionary features, which are often time-consuming. The Position-Specific Scoring Matrix (PSSM) is used to represent evolutionary features. PSSM-based encodings are usually built in one of two ways: (a) concatenation methods, which encode the residues by concatenating PSSM scores in a sliding window, and (b) combination methods, which encode residues by combining PSSM scores with other physicochemical properties such as hydrophobicity, molecular mass, torsion angles, and other frequency profiles in a sliding window. Zhou et al. [21] introduced a residue-encoding technique called Position Specific Score Matrix Relation Transformation (PSSM-RT), which encodes residues by considering their evolutionary relationships. Deng et al. proposed the PDRLGB method, which predicts binding residues in protein-DNA complexes using a light gradient-boosting machine (LightGBM) [9]. The authors used incremental feature selection with the random forest algorithm to find the best subset of features and then trained a light gradient boosting machine. However, their method depends not only on the protein sequence but also on the experimentally derived 3D structure of the protein. They extracted structural features from the three-dimensional protein structure using DSSP [22]. Zhang et al. [23] proposed the StackPDB method for predicting DNA-binding proteins. The StackPDB method extracts pseudo amino acid composition (PseAAC), pseudo-position-specific scoring matrix (PsePSSM), position-specific scoring matrix-transition probability composition (PSSM-TPC), evolutionary distance transformation (EDT), and residue probing transformation (RPT) features from protein sequences. The authors selected a subset of the features using extreme gradient boosting-recursive feature elimination (XGB-RFE) and employed a stacked ensemble classifier consisting of XGBoost, LightGBM, and SVM for DNA-binding protein prediction. Ali et al. introduced DP-BINDER, a computational method for identifying DNA-binding proteins (DBPs) based on physicochemical and evolutionary information. It extracts key features from protein sequences using normalized Moreau-Broto autocorrelation (NMBAC), position-specific scoring matrix-transition probability composition (PSSM-TPC), and pseudo position-specific scoring matrix (PsePSSM). These features are refined using support vector machine recursive feature elimination and correlation bias reduction (SVM-RFE + CBR) and classified using random forest (RF) and support vector machine (SVM). DP-BINDER demonstrated an accuracy of 92.46% with the jackknife method.

Many research works also apply deep learning methods to predict DNA-binding and RNA-binding residues. Hendrix et al. [24] constructed and evaluated a deep-learning model to estimate the likelihood that a voxel on the protein surface is a DNA-binding site. Based on three distinct evaluation datasets, the results indicate that the model outperforms a number of earlier methods on two widely used datasets. In [25], the authors presented EL_LSTM, an approach for DNA-binding residue prediction that consists of two main components: Long Short-Term Memory (LSTM) and an ensemble learning-based classifier. The LSTM uses a bi-gram model to learn pairwise relationships between residues before learning feature vectors for all residues. Then, an ensemble learning-based classifier is developed to address the data imbalance problem in binding residue predictions. To achieve balanced samples, they used a variant of the bagging strategy in ensemble learning. Despite the existence of numerous methods, the classification scores remain low, indicating room for improvement. Additionally, some of these methods rely on the three-dimensional structure of proteins, and certain methods only offer protein-level predictions rather than residue-level predictions. This motivates us to explore this problem further and develop a machine-learning method capable of accurately predicting DNA-binding and RNA-binding residues.

3. Proposed Method

This section formally discusses the data collection methods, feature extraction, machine learning methods, feature selection, and performance evaluation metrics for predicting DNA-binding and RNA-binding residues.

3.1. Dataset

Throughout this study, we used the processed dataset from [15]. Figure 1 summarizes the steps the authors of [15] used to prepare the dataset. The original dataset was collected from 564 protein–DNA, 72 protein–RNA, and 16 protein–DNA–RNA high-resolution (better than 2.5 Å) complexes in the PDB. Then, an extra 892 DNA-binding and 145 RNA-binding chains were combined with the previous dataset, yielding 2827 DNA-binding and 1125 RNA-binding chains.

Fig 1. — Illustrates the steps of creating the Training and Test datasets. The dataset was created with both benchmark datasets and data collected from the Protein Data Bank (PDB).

Next, the dataset was clustered to group proteins that share ≥80% sequence similarity and TM-scores ≥0.5. Annotations were transferred between proteins in the same cluster [15]: the DNA-binding and RNA-binding residues of all chains in a cluster were transferred to a representative chain, the one with the largest number of binding residues. To reduce the sequence similarity between the training and test datasets, the test proteins were filtered by removing every sequence that shares >30% sequence similarity with any training sequence, based on pairwise sequence similarity [15]. Finally, long proteins were removed from the training and test datasets because existing predictors of DNA- and RNA-binding residues could not complete predictions for proteins longer than 1000 residues [15]. A version of the test dataset was also created without transferring annotations of binding residues. Table 1 summarizes the number of proteins and the DNA-binding and RNA-binding residue annotations.

Table 1.

The number of DNA- and RNA- binding residues in the Training and Test Dataset.

Dataset	No. of proteins	No. of Non-binding residues	No. of DNA-binding residues	No. of RNA-binding residues
Training	488	95161	7823 (7.6%)	4699 (4.6%)
Test	82	17925	968 (5.1%)	808 (4.2%)


3.2. Feature Extraction

We extracted a variety of features to represent proteins, encompassing important properties such as sequence information, predicted structural details, and evolutionary information. These features offer relevant insights into the characteristics of the residues. Previous studies in the literature have indicated that information concerning the correct folding of a protein is embedded within its amino acid sequence and the disorder contents [26]. Furthermore, details pertaining to the binding affinity of proteins are encoded within the evolutionary information, along with other structural and physicochemical properties [27–30]. Consequently, these features were integrated with the evolutionary attributes to enhance prediction accuracy.

We collected a total of 119 features using various feature encoding techniques, as depicted in Figure 2. The following subsections briefly describe the collected features.

Fig 2. — Illustration of encoding the protein residues into a feature vector of 119 features utilizing various feature encoding techniques. The feature vector includes amino acid composition type, evolutionary features, physicochemical, structural properties, torsion angles, and disorder probabilities.

Physicochemical properties:

The physicochemical characteristics of a protein are the inherent properties of its constituent amino acids. Previous research studies [31, 32] have shown the influence that amino acid physicochemical properties have on the activity of transcription factors and how they regulate their interactions with other proteins. In this study, we extracted seven concise numerical patterns from the work of [31] to represent key aspects of amino acid properties. These include polarity, secondary structure propensity, molecular volume, codon diversity, and electrostatic charge. These patterns serve as features to capture the distinctive attributes of each amino acid.

Residue properties:

We represented each of the 20 standard amino acid (AA) types with a unique feature to effectively capture the amino acid composition of individual residues within a protein sequence. Previous research studies [33–35] have highlighted the significance of this feature in addressing bioinformatics problems. We also encoded the terminal residues, specifically the five residues nearest the N-terminus and the five residues nearest the C-terminus. Their values ranged from −1.0 to −0.2 and from +0.2 to +1.0, creating a distinctive feature for each residue [33].
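The terminal-residue encoding can be illustrated with a short sketch. The source states only the value ranges, so the assignment below is an assumption: negative values go to the N-terminal side, positive values go to the C-terminal side, steps are 0.2, and all other residues receive 0.0. The function name `terminal_feature` is hypothetical.

```python
from typing import List


def terminal_feature(seq_len: int, n_terminal: int = 5) -> List[float]:
    """Return one terminal-indicator value per residue.

    Assumption (not confirmed by the source): the first five residues get
    -1.0 ... -0.2, the last five get +0.2 ... +1.0, and all others get 0.0.
    """
    values = [0.0] * seq_len
    count = min(n_terminal, seq_len)
    for i in range(count):
        # N-terminal side: -1.0, -0.8, -0.6, -0.4, -0.2
        values[i] = (i - n_terminal) / n_terminal
    for i in range(count):
        # C-terminal side, counted from the end: +1.0 at the last residue.
        # For sequences shorter than 10 residues this overwrites N-side values.
        values[seq_len - 1 - i] = (n_terminal - i) / n_terminal
    return values


encoded = terminal_feature(12)
```

For a 12-residue sequence, positions 0 to 4 carry the negative values, positions 7 to 11 carry the positive values, and the two middle residues stay at 0.0.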

Evolutionary properties:

As demonstrated in previous research, the evolutionary profile is a crucial factor in post-translational modifications (PTMs) as well as in DNA-binding and RNA-binding activity [27–30]. In our study, we acquired the evolutionary profile of the protein sequence through a normalized position-specific scoring matrix (PSSM) obtained from PSI-BLAST [36]. This PSSM is represented as a matrix with 20 columns, capturing evolutionary patterns in multiple alignments and storing scores for each position in the alignment. High scores indicate highly conserved positions, while scores near zero or negative values indicate weakly conserved positions. We extended the PSSM scores to calculate monogram (MG) and bi-gram (BG) features. MG and BG features can be used to describe a segment of a protein sequence that exhibits conservation in terms of transition probabilities from one amino acid to another [37]. We extracted 1-dimensional MG features and 20-dimensional BG features from the DisPredict2 program and incorporated them into our analysis. We also calculated the close neighbor correlation coefficient based on the PSSM scores. In addition, we obtained 30 Hidden Markov Model (HMM) profile-based evolutionary features for the protein sequence. To identify distantly related sequences, profile HMMs transform a multiple sequence alignment into a specialized scoring system tailored for searching databases [38]. Numerous studies have emphasized the importance of evolutionary features in characterizing protein properties [27–30, 39, 40]. Methods that do not consider evolutionary features typically exhibit lower accuracy, and their classifiers can be biased by imbalanced sample numbers. Therefore, the inclusion of evolutionary information in predicting DNA-binding and RNA-binding residues can enhance accuracy.

Structural properties:

Local structural characteristics, such as the predicted secondary structure (SS) and accessible surface area (ASA) of amino acids, have been widely employed in addressing various biological challenges, including DNA- and RNA-binding residue prediction. In our study, we utilized the DisPredict2 [35] and SPOT-Disorder2 [41] programs to acquire predicted ASA values and SS probabilities for helix (H), coil (C), and beta-sheet (E) at the individual residue level. Additionally, we obtained a separate set of SS probabilities for H, C, and E at the residue level from the DisPredict2 and SPOT-Disorder2 programs.

Flexibility properties:

Protein molecules exhibit varying levels of flexibility within their 3D structures, often expressed as fluctuations in the Cartesian coordinates of the protein backbone and defined by two torsion angles, Φ and Ψ. The fluctuations in backbone torsion angles have proven valuable in developing several computational methods [40, 41]. We acquired two features related to backbone angle fluctuations, specifically dphi (ΔΦ) and dpsi (ΔΨ), using the DisPredict2 and SPOT-Disorder2 programs [35, 41]. Previous research has established that intrinsically disordered regions (IDRs) contain post-translational modification (PTM) site-sorting signals and play a crucial role in regulating protein structures and functions, including those of DNA- and RNA-binding proteins [42–44]. In our study, we represented each amino acid in a protein with a disorder probability obtained from a disorder predictor. We also included Molecular Recognition Features (MoRFs). These are short, interaction-prone segments of protein disorder that transition from disorder to order upon specific binding, representing a specific class of intrinsically disordered regions with molecular recognition and binding functions.

Energy profile:

Iqbal et al. [35] developed a method for estimating the position-specific estimated energy (PSEE) of amino acid residues based solely on sequence information. The authors combine the contact energy and the predicted relative solvent accessibility (RSA) to determine the PSEE. Their work showed that PSEE can effectively distinguish between structured and unstructured regions within a protein, including intrinsically disordered regions. Additionally, PSEE can be employed to identify functional binding regions within a protein. We incorporated the PSEE score per amino acid as a feature in our study.

3.3. Machine Learning Algorithms

In this study, we have explored the following seven Machine Learning Methods.

K-nearest Neighbors Classifier (KNN): KNN learns from the K number of training samples in the feature space that are the closest distance to the target point. The classification decision is based on the neighbors’ majority votes. K was set to 5 as a default value, and all neighbors were equally weighted [45].
Random Forest Classifier (RF): Random forest [46] is a supervised learning algorithm that employs ensemble learning techniques for classification tasks. It is a meta-estimator that aggregates many decision trees (bagging). The random forest creates trees in parallel, and these trees have no interaction. At training time, the algorithm creates a large number of decision trees, and at prediction time it combines the predictions of the individual trees.
Logistic Regression (LG): Logistic regression [47] is a statistical method employed in binary classification tasks to model the probability of a particular outcome based on the relationships with independent variables. It calculates the estimated probability of a categorical dependent variable’s relationship with one or more independent variables.
Extra Tree Classifier (ET): Extra Tree (ET), or extremely randomized trees, is an ensemble machine learning method [48]. It fits several randomized decision trees on the original learning sample and averages their predictions, which improves predictive accuracy and controls over-fitting.
Support Vector Machine (SVM): The Support Vector Machine classifier determines how much error in the model is acceptable and selects a line or hyperplane that best fits the data [49]. We optimized the epsilon and cost parameter C using a Bayesian optimization algorithm.
Light Gradient Boosting Machine (LGBM): LightGBM is a tree-based learning algorithm built on the gradient boosting framework [50]. It grows each tree vertically (leaf-wise), choosing the leaf to split based on the loss. LGBM is a fast algorithm with a small memory footprint that can handle large datasets.
Categorical Gradient Boosting Classifier (CAT): CatBoost handles categorical features and outperforms existing publicly available gradient boosting implementations in terms of quality [51]. The library has a GPU implementation of the learning algorithm and a CPU implementation of the scoring algorithm, which are significantly faster than other gradient-boosting libraries on ensembles of similar size.

3.4. Feature Selection

Feature selection is the process of selecting a subset of variables from a large feature set and assessing the accuracy of models built on that subset. It is used for various reasons, including simplifying models to make them easier for researchers to interpret, reducing training times, avoiding the curse of dimensionality, and improving data compatibility with a learning model class.

To identify which features are important, we have used SHAP (Shapley Additive exPlanations) importance scores [52]. SHAP is a state-of-the-art method used in machine learning to interpret the output of complex machine learning models. These scores are based on game theory, specifically the concept of Shapley values, which were developed to allocate the payout of a cooperative game fairly to its players based on their individual contributions [52]. SHAP importance scores provide a detailed and fair explanation of how each feature in a dataset influences the prediction of a machine learning model, enhancing transparency and interpretability in complex models. Figures 3 and 4 show the SHAP importance scores for the DNA and RNA datasets. We found that the most important feature in both datasets is the Accessible Surface Area. A larger Accessible Surface Area likely provides more binding space with DNA and RNA. AA index, Monogram and Bigram, and HMM profile are some of the other features that contribute to the prediction of the proposed method.

Fig 3. — Importance scores from SHAP (Shapley Additive exPlanations) for DNA-binding proteins. The Accessible Surface Area feature holds the highest feature importance score, followed by AA index. The evolutionary-based features, Monogram and Bigram, calculated from PSSM scores, have the third highest importance scores.

Fig 4. — SHAP (Shapley Additive exPlanations) Importance scores for RNA-binding proteins. Similar to DNA-binding proteins, the Accessible Surface Area feature possesses the highest feature importance score, succeeded by the AA index. Following these, the evolutionary-based features Monogram and Bigram, derived from PSSM scores, rank as the third most significant in terms of importance scores.

When applying a feature selection technique, the fundamental assumption is that the dataset includes features that might be redundant or irrelevant and can be safely eliminated without substantial loss of information. Features that are not relevant or only partially relevant have the potential to degrade the performance of a model; hence, feature selection becomes a crucial step in the model creation process. We used a Recursive Feature Elimination (RFE) technique, which reduces the number of features in the dataset while maintaining the model’s predictive power. It removes the features with the lowest importance based on the SHAP importance scores. RFE offers several benefits. It can use tree-based or linear models to detect complex relations between the features and the target. It can be implemented with SHAP importance scores, one of the most reliable ways to estimate the importance of features. Unlike many other techniques, it works with missing values and categorical variables. It also allows specifying a list of features that should not be eliminated, e.g., when prior knowledge is available. After applying RFE, 96 of the 119 features were selected for both the DNA and RNA datasets. Figures 5 and 6 show the selected features for each dataset (DNA and RNA).
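The elimination loop described above can be sketched in a few lines. This is an illustrative outline rather than the implementation used in the study: `importance_fn` is a hypothetical callback standing in for "retrain the model on the current subset and return a SHAP-based importance per feature", and the step size of 5 is an arbitrary choice. The toy importance function exists only to make the example runnable.

```python
from typing import Callable, Dict, List


def recursive_feature_elimination(
    features: List[str],
    importance_fn: Callable[[List[str]], Dict[str, float]],
    n_keep: int,
    step: int = 1,
) -> List[str]:
    """Repeatedly drop the least important features until n_keep remain."""
    selected = list(features)
    while len(selected) > n_keep:
        scores = importance_fn(selected)  # re-score on the current subset
        n_drop = min(step, len(selected) - n_keep)
        # Sort ascending by importance and discard the weakest n_drop features.
        weakest = set(sorted(selected, key=lambda name: scores[name])[:n_drop])
        selected = [name for name in selected if name not in weakest]
    return selected


def toy_importance(names: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in: importance equals the numeric suffix of the name."""
    return {name: float(name.split("_")[1]) for name in names}


all_features = [f"f_{i}" for i in range(119)]
kept = recursive_feature_elimination(all_features, toy_importance, n_keep=96, step=5)
```

With 119 input features and a target of 96, the loop removes 23 features over five rounds. Importance is recomputed in every round because removing a feature can change the relative importance of the features that remain.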

Fig 5. — Illustration of the number of selected features for the DNA dataset. The orange bar represents the selected features, and the blue bar represents the total number of features. The lower number of features are selected from PSSM and HMM profiles as they both represent the evolutionary features.

Fig 6. — Illustration of the number of selected features for the RNA dataset. The orange bar represents the selected features, and the blue bar represents the total number of features. The lower number of features are selected from PSSM and physical properties.

3.5. Performance Evaluation Metrics

In our study, the dataset is highly imbalanced, so we need to choose the evaluation metrics carefully. We selected AUC, Recall, and MCC to evaluate our method. The area under the Receiver Operating Characteristic curve (AUC) is a widely used metric for assessing the performance of a machine learning method. AUC is not threshold-dependent, making it a robust metric for evaluating model performance. We chose the Recall and MCC metrics because they are also used for imbalanced datasets, and existing methods use them for comparison. Recall and MCC are calculated as follows.

$$\mathrm{Recall/Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{MCC} = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN) \times (TP + FP) \times (TN + FP) \times (TN + FN)}}$$

Where $TP$ is the number of correctly predicted binding residues (true positives)

$TN$ is the number of correctly predicted non-binding residues (true negatives)

$FP$ is the number of non-binding residues incorrectly predicted as binding (false positives)

$FN$ is the number of binding residues incorrectly predicted as non-binding (false negatives)
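As a worked illustration of the two formulas, the following sketch computes Recall and MCC from binary labels. The function names and the toy labels are illustrative only, and returning 0.0 when a denominator is zero is a convention assumed here, not something the source specifies.

```python
import math
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, TN, FP, FN) for binary labels where 1 means binding."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def recall(tp: int, fn: int) -> float:
    """Recall (sensitivity) = TP / (TP + FN); 0.0 if there are no positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 if the denominator is zero."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


# Toy example: 3 binding and 5 non-binding residues.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
toy_recall = recall(tp, fn)
toy_mcc = mcc(tp, tn, fp, fn)

# A predictor that labels every residue as non-binding.
all_negative = confusion_counts(y_true, [0] * len(y_true))
```

In the toy example, TP = 2, TN = 4, FP = 1, and FN = 1, so Recall is 2/3 and MCC is 7/15. The all-negative predictor gets 5 of 8 residues right but has a Recall of 0 and, under the zero-denominator convention used here, an MCC of 0, which is why these metrics are preferred over plain accuracy on imbalanced data.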

4. Results

In this section, we first discuss the performance of the machine learning methods and then the optimization of the hyperparameters and the window size. Finally, we compare the performance of DRBpred with the existing state-of-the-art methods.

4.1. Performance of Machine learning methods on the training dataset

As discussed before, we have selected seven machine learning methods to find the best method suitable for this problem. Figures 7 and 8 show the 10-fold cross-validation results for the DNA and RNA datasets, respectively. The Light Gradient Boosting Machine performs better for each dataset than the other methods in terms of AUC, Recall, and MCC, so we selected this method for the rest of the experiments.

Fig 7. — 10-fold cross-validation results on the DNA Training dataset on different Machine learning methods. The Light Gradient Boosted Machine outperforms all the other methods in terms of AUC, MCC, and Sensitivity metrics. (CAT: Categorical Gradient Boosting Classifier, ET: Extra Tree Classifier, KNN: K-nearest Neighbors Classifier, LG: Logistic Regression, LGBM: Light Gradient Boosted Machine, RF: Random Forest Classifier, SVM: Support Vector Machine).

Fig 8. — 10-fold cross-validation results on the RNA Training dataset on different Machine learning methods. The Light Gradient Boosted Machine outperforms all the other methods in terms of AUC, MCC, and Sensitivity metrics. (CAT: Categorical Gradient Boosting Classifier, ET: Extra Tree Classifier, KNN: K-nearest Neighbors Classifier, LG: Logistic Regression, LGBM: Light Gradient Boosted Machine, RF: Random Forest Classifier, SVM: Support Vector Machine).

Additionally, to assess the robustness and consistency of our model, we conducted a 5-fold cross-validation on the training dataset and measured the ROC-AUC. This approach helps us understand how the model’s performance varies between the training and testing phases. In cross-validation, the dataset is divided into five equal parts. In each fold, a different part of the dataset is held out for testing while the remaining four parts are used for training. This process is repeated five times, each time with a different part being used as the test set, ensuring a comprehensive evaluation. Figures 9 and 10 display the Receiver Operating Characteristic-Area Under Curve (ROC-AUC) for the DNA and RNA training and test sets. The ROC-AUC metric is a reliable indicator of the model’s ability to distinguish between classes, with a value closer to 1 indicating better discrimination. For our model, the training ROC-AUC score is high, approximately 0.91, which indicates that the model is effective in the training phase. This score is also consistent across all five folds, as indicated by the low variation in performance. This consistency is important as it implies that the model is not overly fitted to a specific part of the training data and can generalize well across the entire dataset. Performance on the test sets is lower than on the training set, which is commonly observed because models tend to perform worse on unseen data. Nevertheless, the model retains a significant degree of its predictive power on new, unseen data, which is critical for its reliability and usefulness in practical applications.
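The fold construction described above can be sketched as follows. This is a generic outline under stated assumptions: the helper name `k_fold_indices`, the shuffling seed, and splitting by sample index are all illustrative, and the source does not say whether folds were formed at the residue level or the protein level.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int = 5, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering every sample once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding yields k parts whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        splits.append((train_idx, test_idx))
    return splits


splits = k_fold_indices(23, k=5)
```

Each sample appears in exactly one held-out part, so the five scores together cover the whole training set. When residues from the same protein are highly correlated, grouping the split by protein is a common precaution, although the source does not state which choice was made.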

Fig 9. — The ROC curves for the DNA training and test datasets. The training ROC-AUC score is approximately 0.90 across the five folds and shows low variation in performance. For the unseen test set, the ROC-AUC score is 0.82.

Fig 10. — The performance on the RNA training and test datasets is depicted by the ROC-AUC curve. The training phase achieves a consistent ROC-AUC score of around 0.91 across all five folds, indicating stable performance. On the unseen test set, the ROC-AUC score reaches 0.72.

4.2. Optimizing Hyperparameters

The performance of a machine learning method depends heavily on the selected hyperparameters. To improve the results of our method, we optimized the hyperparameters of the Light Gradient Boosting Machine (LightGBM). The parameters n_estimators, learning_rate, num_leaves, max_depth, min_child_samples, max_bin, subsample, subsample_freq, and colsample_bytree were optimized using the Optuna hyperparameter optimization framework [53], which uses the Tree-structured Parzen Estimator algorithm to search the hyperparameter space. Table 2 shows the selected hyperparameters for the DNA and RNA datasets.

Table 2.

Best LightGBM parameters selected for the DNA and RNA datasets. The parameters were selected with the Tree-structured Parzen Estimator algorithm.

Parameter Name	DNA	RNA
n_estimators	1000	1000
learning_rate	0.151	0.159
num_leaves	7	10
max_depth	3	4
min_child_samples	95	89
max_bin	102	118
subsample	0.72	0.83
subsample_freq	1	1
colsample_bytree	0.93	0.95
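For readers who want to reuse these settings, the values in Table 2 can be written as plain Python dictionaries. This is a sketch under assumptions: the dictionary names and the `build_classifier` helper are hypothetical, and passing the values to `lightgbm.LGBMClassifier` through its scikit-learn interface is one plausible way to use them, not necessarily how the authors trained their models.

```python
# Hyperparameter values copied from Table 2.
DNA_PARAMS = {
    "n_estimators": 1000,
    "learning_rate": 0.151,
    "num_leaves": 7,
    "max_depth": 3,
    "min_child_samples": 95,
    "max_bin": 102,
    "subsample": 0.72,
    "subsample_freq": 1,
    "colsample_bytree": 0.93,
}

RNA_PARAMS = {
    "n_estimators": 1000,
    "learning_rate": 0.159,
    "num_leaves": 10,
    "max_depth": 4,
    "min_child_samples": 89,
    "max_bin": 118,
    "subsample": 0.83,
    "subsample_freq": 1,
    "colsample_bytree": 0.95,
}


def build_classifier(params):
    """Return an untrained LightGBM classifier configured with `params`.

    The import is done lazily so the dictionaries above can be inspected
    without LightGBM installed (assumption: the scikit-learn wrapper is used).
    """
    from lightgbm import LGBMClassifier

    return LGBMClassifier(**params)
```

In both configurations num_leaves stays below 2^max_depth (7 < 8 and 10 < 16), so the depth limit and the leaf limit are compatible.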

4.3. Selection of best window size

The residues (amino acids) within a protein are interconnected, so each residue’s characteristics are influenced by its adjacent residues. This motivates us to represent a residue not only by its own features but also by the features of its neighboring residues. We collected 96 features to represent each residue. Figure 11 shows that the residue glycine (G) can be represented by concatenating its own features with those of its two neighboring residues, lysine (K) and leucine (L). For window size 3, the glycine (G) feature vector therefore has 96 × 3 = 288 features. Because the feature dimension grows with the window size, we investigated the optimal window size for each model.

Fig 11. — Illustration of the sliding window technique used to incorporate information from neighboring residues. After feature selection, each residue is represented by 96 features. For sliding window size 3, the residue glycine (G) is represented by concatenating its own features with those of its two neighboring residues, lysine (K) and leucine (L), giving a feature vector of 96 × 3 = 288 features.
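The concatenation illustrated in Figure 11 can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the function name `window_features` is hypothetical, and padding with zeros at the ends of the sequence is an assumption, since the text does not say how terminal residues are handled.

```python
def window_features(per_residue, window=3):
    """Concatenate each residue's features with those of its neighbors.

    `per_residue` is a list with one feature list per residue (96 values each
    in the paper). Positions outside the sequence are zero-padded, which is an
    assumption made for this sketch.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    n_features = len(per_residue[0])
    padding = [0.0] * n_features
    windowed = []
    for i in range(len(per_residue)):
        vector = []
        for j in range(i - half, i + half + 1):
            inside = 0 <= j < len(per_residue)
            vector.extend(per_residue[j] if inside else padding)
        windowed.append(vector)
    return windowed
```

With 96 features per residue, a window of 3 yields 288 values per residue and a window of 11 yields 96 × 11 = 1056 values.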

We investigated window sizes from 1 to 19 to find the optimal size for both the DNA and RNA models. Figures 12 and 13 show that the optimal window size is 11 for both models.

Fig 12. — Selection of the sliding window size for the DNA dataset. To maximize the objective function (Sensitivity + MCC), the model performance was evaluated for window sizes 1 to 19. The model performs best for window size 11.

Fig 13. — Selection of the sliding window size for the RNA dataset. To maximize the objective function (Sensitivity + MCC), the model performance was evaluated for window sizes 1 to 19. The model performs best for window size 11.
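The window-size search behind Figures 12 and 13 amounts to maximizing Sensitivity + MCC over the candidate sizes. The sketch below shows that loop under assumptions: `select_window` and the `evaluate` callback are hypothetical names, only odd sizes are tried by default (the text gives the range 1 to 19 without stating the step), and both metrics are assumed to be on a 0 to 1 scale.

```python
def select_window(evaluate, sizes=range(1, 20, 2)):
    """Return (best_size, best_objective), maximizing Sensitivity + MCC.

    `evaluate(size)` is a hypothetical callback that trains and validates a
    model with the given window size and returns (sensitivity, mcc).
    """
    best_size, best_value = None, float("-inf")
    for size in sizes:
        sensitivity, mcc = evaluate(size)
        value = sensitivity + mcc
        if value > best_value:
            best_size, best_value = size, value
    return best_size, best_value
```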

4.4. Performance on the test dataset

The model’s performance was evaluated on the test dataset. Table 3 reports the performance of DRBpred in terms of Sensitivity, Specificity, Balanced Accuracy (BACC), Matthews Correlation Coefficient (MCC), Accuracy (ACC), False Positive Rate (FPR), False Negative Rate (FNR), Precision, F1-score, and Receiver Operating Characteristic Area Under the Curve (ROCAUC). The results indicate that DRBpred performs strongly, particularly in terms of BACC, ACC, and ROCAUC.

Table 3.

Classification scores for the DNA and RNA models.

Dataset	Sensitivity (%)	Specificity (%)	BACC (%)	MCC	ACC (%)	FPR	FNR	Precision (%)	F1-score	ROCAUC (%)
DNA	52.58	89.53	71.06	0.28	87.64	0.11	0.47	21.33	0.30	82.00
RNA	34.03	88.82	61.43	0.14	86.47	0.11	0.66	11.98	0.18	72.36
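The threshold-based metrics in Table 3 all derive from the four confusion-matrix counts. The following sketch uses the standard textbook definitions; the function name `classification_metrics` is hypothetical, and the counts in any call are whatever the reader supplies, since the paper reports only the resulting scores.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics as fractions in [0, 1]
    (MCC lies in [-1, 1])."""
    sensitivity = tp / (tp + fn)           # true positive rate (recall)
    specificity = tn / (tn + fp)           # true negative rate
    precision = tp / (tp + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "bacc": (sensitivity + specificity) / 2,
        "mcc": mcc,
        "acc": (tp + tn) / (tp + fp + tn + fn),
        "fpr": 1 - specificity,
        "fnr": 1 - sensitivity,
        "precision": precision,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }
```

These relations can be checked against the table: for the DNA row, (52.58 + 89.53) / 2 ≈ 71.06 matches BACC, and 2 × 0.2133 × 0.5258 / (0.2133 + 0.5258) ≈ 0.30 matches the F1-score.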

4.5. Comparison with existing methods

We compared our method against recent state-of-the-art methods, namely DRNApred, Pprint, RNABindR, and BindN+. The results for these methods were taken from the DRNApred paper [15]. The performance of the RNA model is detailed in Table 4 and Figure 14. Compared with the best method, DRNApred, our proposed method improves Sensitivity, MCC, and AUC by 112.50%, 16.67%, and 7.46%, respectively.

Table 4.

Performance comparison of DRBpred with existing methods in the RNA Test dataset. DRBpred method shows promising results compared to the existing methods.

Methods	Sensitivity	MCC	AUC
DRNApred	0.16	0.12	0.67
Pprint	0.15	0.11	0.66
RNABindR	0.14	0.10	0.73
BindN+	0.12	0.08	0.67
DRBpred	0.34	0.14	0.72
(Imp%)	112.50%	16.67%	7.46%

The best score values are bold-faced. (Imp%) shows improvement compared to the best method (DRNApred).

Fig 14. — Performance comparison of DRBpred with existing methods in the RNA Test dataset. The red bar shows the performance of the DRBpred method in terms of AUC, MCC, and sensitivity metrics.

Similarly, we tested our method for DNA-binding prediction. Table 5 and Figure 15 show the performance of the DNA model. Compared with the best method, DRNApred, our proposed method improves Sensitivity, MCC, and AUC by 112.00%, 33.33%, and 6.49%, respectively.

Table 5.

Performance comparison of DRBpred with existing methods in the DNA Test dataset. DRBpred performs better compared to the existing methods.

Methods	Sensitivity	MCC	AUC
DRNApred	0.25	0.21	0.77
BindN+	0.22	0.18	0.79
DP-Bind(svm)	0.24	0.20	0.75
DP-Bind(klr)	0.24	0.20	0.76
DP-Bind(plr)	0.22	0.18	0.74
DBS-PSSM	0.21	0.17	0.77
DRBpred	0.53	0.28	0.82
(Imp%)	112.00%	33.33%	6.49%

The best score values are bold-faced. (Imp%) shows improvement compared to the best method (DRNApred).
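The (Imp%) rows in Tables 4 and 5 are consistent with a relative improvement over DRNApred, computed as (DRBpred − DRNApred) / DRNApred × 100. The short sketch below reproduces them from the tabulated scores; the function and variable names are hypothetical.

```python
def improvement_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (Sensitivity, MCC, AUC) copied from Tables 4 and 5.
RNA_SCORES = {"DRNApred": (0.16, 0.12, 0.67), "DRBpred": (0.34, 0.14, 0.72)}
DNA_SCORES = {"DRNApred": (0.25, 0.21, 0.77), "DRBpred": (0.53, 0.28, 0.82)}


def improvements(scores):
    """Return the rounded improvement for each metric, in table order."""
    pairs = zip(scores["DRBpred"], scores["DRNApred"])
    return [round(improvement_percent(new, base), 2) for new, base in pairs]
```

Because the tabulated scores are rounded to two decimals, the percentages are sensitive to that rounding, which matters most for the small MCC values.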

Fig 15. — Performance comparison of DRBpred with existing methods in the DNA Test dataset. The red bar shows the performance of the DRBpred method in terms of AUC, MCC, and sensitivity metrics.

We plotted the Receiver Operating Characteristic (ROC) curves, with the corresponding area under the curve (ROC-AUC), in Figures 16 and 17. The ROC curve is a graphical representation of the model’s ability to discriminate between positive and negative samples, where a larger area under the curve indicates better performance. These curves were constructed using data reported in [15], as some existing methods were not publicly available. Figures 16 and 17 show that the DRBpred method surpasses the performance of currently established state-of-the-art techniques.
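As a reminder of what the area under the ROC curve measures, it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. It is an illustrative quadratic-time implementation with a hypothetical function name, not the routine used to produce the figures.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    `labels` holds 1 for binding (positive) and 0 for non-binding (negative)
    residues; `scores` holds the predicted probabilities.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of binding from non-binding residues.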

Fig 16. — The ROC curve for the DNA test dataset. DRBpred, shown in blue, achieves an AUC of 0.82. Among the evaluated methods, the second-best performance is demonstrated by DRNApred, with an AUC score of 0.77. DRBpred surpasses the performance of the currently established state-of-the-art methods, indicating its superior accuracy in classifying DNA-binding residues.

Fig 17. — The ROC curve for the RNA test dataset. DRBpred, shown in blue, achieves an AUC of 0.72. DRBpred outperforms the state-of-the-art methods, demonstrating its effectiveness in the classification of RNA-binding residues.

5. Case Study

We conducted LIME analysis [54] on the independent test samples. LIME, an acronym for Local Interpretable Model-Agnostic Explanations, fits local, interpretable models that explain individual predictions of black-box machine learning models [54]. Explainability is crucial for machine learning models to gain the trust of users. LIME allows users to understand what happens within these black-box models and aids in identifying possible concerns, including information leakage, model bias, robustness, and causality [52, 54]. LIME introduces perturbations to the original data point, feeds the perturbed points into the black-box model, and observes the resulting outputs [54]. The method then assigns weights to these new data points based on their proximity to the original point. Subsequently, it fits a surrogate model on this weighted, perturbed dataset [54]. This surrogate model is then used to explain each original data point individually.
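The perturb, weight, and fit steps described above can be sketched from scratch in pure Python. This is a simplified illustration of the idea and not the LIME library or the authors' analysis: the function names, the Gaussian perturbation, the exponential proximity kernel, and the plain weighted least-squares surrogate are all assumptions made for the sketch.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def explain_locally(predict, instance, n_samples=500, scale=1.0,
                    kernel_width=1.0, seed=0):
    """Fit a weighted linear surrogate around `instance`.

    `predict` is the black-box model (a function from a feature list to a
    number). Returns (intercept, coefficients); each coefficient is the local
    contribution of one feature.
    """
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        # 1. Perturb the original data point.
        z = [v + rng.gauss(0.0, scale) for v in instance]
        # 2. Query the black-box model on the perturbed point.
        targets.append(predict(z))
        # 3. Weight the perturbed point by its proximity to the original.
        dist2 = sum((zi - vi) ** 2 for zi, vi in zip(z, instance))
        weights.append(math.exp(-dist2 / kernel_width ** 2))
        rows.append([1.0] + z)  # leading 1.0 models the intercept
    # 4. Fit the surrogate by weighted least squares (normal equations).
    p = len(instance) + 1
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows))
             for j in range(p)] for i in range(p)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets))
            for i in range(p)]
    beta = solve_linear_system(xtwx, xtwy)
    return beta[0], beta[1:]
```

The signs of the returned coefficients play the role of the green and red bars in LIME plots: a positive coefficient pushes the prediction toward the positive class in the neighborhood of the instance, and a negative one pushes it away.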

We randomly selected two residues from the test dataset for LIME analysis, one for the DNA-binding model and one for the RNA-binding model. Figure 18 provides insights into the top five features influencing the prediction of Valine (V) as a DNA-binding residue for protein 3POV0. The predicted probability for the DNA-binding class is 0.91, whereas the non-DNA-binding class has a probability of 0.09. The model correctly predicts the label for this test sample. The feature importance scores for this particular sample reveal that the HMM profile (L) has a 5% importance score, followed by Bigram (R) with 4%, Accessible Surface Area with 3%, and HMM profile (E) with 3%, all toward the DNA-binding class. On the other hand, the Secondary Structure (P(8-T)) has a 3% importance score toward the non-DNA-binding class.

Fig 18. — The features influencing the prediction of the amino acid valine (V) as a DNA-binding residue in protein ID 3POV0. (a) Displays the prediction probabilities of the model for the DNA-binding (orange) and non-DNA-binding (blue) classes. (b) Highlights the top five significant features. (c) Lists the top five features and their corresponding values. (d) Shows the same features as (b) together with each feature’s contribution to the prediction of the selected amino acid, with the relative importance denoted by floating-point numbers on the x-axis. Features contributing to DNA-binding are shown in green, and those contributing to non-DNA-binding in red.

Figures 18(b) and 18(d) visualize the local explanation for the Valine (V) sample. They indicate that, for this specific instance, the HMM profile (L) exceeds 5658, the Accessible Surface Area is greater than 0.5, Bigram (R) falls within the range of 0.52 to 1.54, and the HMM profile (L) falls within the range of 4805 to 6758, all contributing to the DNA-binding class. It is difficult to relate most of these features directly to DNA binding. However, one feature importance score aligns with our hypothesis: residues tend to be DNA-binding when they have a higher accessible surface area. In this instance, the Accessible Surface Area is greater than 0.5, contributing to this sample being identified as a DNA-binding residue. Furthermore, during our analysis of feature importance scores, we observed that the Accessible Surface Area is the most important feature in our trained models for both the DNA and RNA datasets.

We further investigated the contribution of features to RNA-binding residue prediction. Figure 19 provides insights into the top five features influencing the prediction for Serine (S) at position 261 of protein 3ZH22. The predicted label for this instance is non-RNA-binding with a probability of 0.99, and the true label is also non-RNA-binding. The feature importance scores for this particular sample reveal that HMM profile (E) has an importance score of less than 1%, followed by Bigram (V), Bigram (F), and Bigram (P), all with less than 1%. In contrast, the Accessible Surface Area is less than 0.11 for this particular sample and contributed to the non-RNA-binding prediction. These feature importance scores align with our hypothesis that residues tend to be non-RNA-binding when they have a lower accessible surface area.

Fig 19. — The key features influencing the identification of Serine (S) in protein 3ZH22 as an RNA-binding residue. (a) Shows the model’s predicted probabilities: RNA-binding in orange and non-RNA-binding in blue. (b) Displays the five most important features. (c) Shows the top five features with their values. (d) Illustrates each feature’s role in predicting the specific amino acid, with its significance quantified by floating-point values on the x-axis. Features contributing to RNA-binding are shown in green and those contributing to non-RNA-binding in red.

6. Conclusions

In this study, we developed a new method, DRBpred, to predict DNA-binding and RNA-binding residues from protein sequences. The method gathers relevant features and employs recursive feature elimination (RFE) along with SHAP values to select a subset of features. Additionally, a sliding window technique was used to extract information from neighboring residues, and an optimized LightGBM classifier was trained to predict the binding residues. DRBpred demonstrated significant improvements across various evaluation metrics compared to the state-of-the-art method. Specifically, for the DNA-binding test dataset, DRBpred exhibited improvements of 112.00% in sensitivity, 33.33% in Matthews correlation coefficient (MCC), and 6.49% in the area under the curve (AUC). Similarly, improvements of 112.50% in sensitivity, 16.67% in MCC, and 7.46% in AUC were observed for the RNA-binding test dataset. These results indicate that the optimized LightGBM method surpasses the performance of the existing state-of-the-art approach. The limitation of the proposed approach is that feature extraction is computationally expensive: DRBpred depends on other existing methods for feature extraction, some of which are time-intensive. To mitigate this, we plan to employ parallel processing strategies involving multiple CPUs in future developments. Additionally, we aim to incorporate predicted three-dimensional structural information in the future. Moreover, large language models (LLMs) could be employed to extract important features, potentially improving prediction accuracy. We believe that the DRBpred method will assist researchers in predicting DNA- and RNA-binding residues, enabling a better understanding of the roles played by DNA- and RNA-binding proteins in the life cycle of organisms.

Highlights

Highly accurate prediction of DNA and RNA binding residues from protein sequences.
The study includes feature extraction and selecting the most suitable features for prediction.
Enhances our understanding of features that determine DNA and RNA-binding proteins.
An optimized LightGBM classifier to predict binding residues.
Outperforms the current state-of-the-art approach.

Acknowledgments:

The authors would like to thank Christopher David Moore for a thorough review of the manuscript.

Footnotes

Conflict of Interest

The authors declare no conflict of interest.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Data Availability

The DRBpred webserver is available at https://bmll.cs.uno.edu. The data and code related to the development of DRBpred can be found at https://github.com/wasicse/DRBpred

References

1. Zhou J, et al., EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation. 2017. 18.
2. Dai X, Zhang S, and Zaleta-Rivera K, RNA: interactions drive functionalities. Mol Biol Rep, 2020. 47(2): p. 1413–1434.
3. Licatalosi DD, Roles of RNA-binding Proteins and Post-transcriptional Regulation in Driving Male Germ Cell Development in the Mouse. Adv Exp Med Biol, 2016. 907: p. 123–51.
4. Cozzolino F, et al., Protein-DNA/RNA Interactions: An Overview of Investigation Methods in the -Omics Era. J Proteome Res, 2021. 20(6): p. 3018–3030.
5. Kobras CM, Fenton AK, and Sheppard SK, Next-generation microbiology: from comparative genomics to gene function. Genome Biol, 2021. 22(1): p. 123.
6. Li K, et al., Prediction of hot spots in protein-DNA binding interfaces based on supervised isometric feature mapping and extreme gradient boosting. BMC Bioinformatics, 2020. 21(Suppl 13): p. 381.
7. Mesri M, Advances in Proteomic Technologies and Its Contribution to the Field of Cancer. Advances in medicine, 2014. 2014: p. 238045–238045.
8. Faezov B and Dunbrack RL Jr, PDBrenum: A webserver and program providing Protein Data Bank files renumbered according to their UniProt sequences. Plos one, 2021. 16(7): p. e0253411.
9. Deng L, et al., PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine. BMC Bioinformatics, 2018. 19(Suppl 19): p. 522.
10. Yuan Q, et al., AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Briefings in Bioinformatics, 2022. 23(2): p. bbab564.
11. Cai YD and Lin SL, Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta, 2003. 1648(1–2): p. 127–33.
12. Zou C, Gong J, and Li H, An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics, 2013. 14(1): p. 90.
13. Lou W, et al., Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One, 2014. 9(1): p. e86703.
14. Zhang Y, et al., newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation. Comput Biol Chem, 2014. 52: p. 51–9.
15. Yan J and Kurgan L, DRNApred, fast sequence-based method that accurately predicts and discriminates DNA-and RNA-binding residues. Nucleic acids research, 2017. 45(10): p. e84–e84.
16. Hwang S, Gou Z, and Kuznetsov IB, DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics, 2007. 23(5): p. 634–6.
17. Wang L, et al., BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Systems Biology, 2010. 4: p. 1–9.
18. Wang L and Brown SJ, BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res, 2006. 34(Web Server issue): p. W243–8.
19. Wang L, et al., BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst Biol, 2010. 4 Suppl 1(Suppl 1): p. S3.
20. Ahmad S and Sarai A, PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 2005. 6(1): p. 33.
21. Zhou J, et al., EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation. BMC Bioinformatics, 2017. 18(1): p. 379.
22. Kabsch W and Sander C, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577–637.
23. Zhang Q, et al., StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Applied Soft Computing, 2021. 99: p. 106921.
24. Hendrix SG, et al., DeepDISE: DNA Binding Site Prediction Using a Deep Learning Method. Int J Mol Sci, 2021. 22(11): p. 5510.
25. Zhou J, et al., EL_LSTM: Prediction of DNA-Binding Residue from Protein Sequence by Combining Long Short-Term Memory and Ensemble Learning. IEEE/ACM Trans Comput Biol Bioinform, 2020. 17(1): p. 124–135.
26. Jones DT and Ward JJ, Prediction of disordered regions in proteins from position specific score matrices. Proteins, 2003. 53 Suppl 6(6): p. 573–8.
27. Ma X, et al., A SVM-based approach for predicting DNA-binding residues in proteins from amino acid sequences. IEEE Xplore, 2009.
28. Liu B, et al., Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One, 2012. 7(9): p. e46633.
29. Huang HL, et al., Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics, 2011: p. S47.
30. Liu B, et al., Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. Plos One, 2012. 7.
31. Zhu L, et al., Improving the accuracy of predicting disulfide connectivity by feature selection. J Comput Chem, 2010. 31(7): p. 1478–85.
32. Niu S, et al., Prediction of tyrosine sulfation with mRMR feature selection and analysis. J Proteome Res, 2010. 9(12): p. 6490–7.
33. Iqbal S, Mishra A, and Hoque MT, Improved prediction of accessible surface area results in efficient energy function application. J Theor Biol, 2015. 380: p. 380–91.
34. Iqbal S and Hoque MT, PBRpredict-Suite: a suite of models to predict peptide-recognition domain residues from protein sequence. Bioinformatics, 2018. 34(19): p. 3289–3299.
35. Iqbal S and Hoque MT, Estimation of position specific energy as a feature of protein residues from sequence alone for structural classification. PLOS ONE, 2016. 11(9): p. e0161452.
36. Altschul SF, et al., Basic Local Alignment Search Tool. J. Mol. Biol, 1990. 215: p. 403–410.
37. Sharma A, et al., Evaluation of sequence features from intrinsically disordered regions for the estimation of protein function. PLoS One, 2014. 9(2): p. e89890.
38. Eddy SR, Profile hidden Markov models. Bioinformatics, 1998. 14(9): p. 755–63.
39. Mishra A, Kabir MWU, and Hoque MT, diSBPred: A machine learning based approach for disulfide bond prediction. Comput Biol Chem, 2021. 91: p. 107436.
40. Kabir MW, et al., TAFPred: Torsion Angle Fluctuations Prediction from Protein Sequences. Biology, 2023. 12, DOI: 10.3390/biology12071020.
41. Hanson J, et al., SPOT-Disorder2: Improved Protein Intrinsic Disorder Prediction by Ensembled Deep Learning. Genomics Proteomics Bioinformatics, 2019. 17(6): p. 645–656.
42. Wright PE and Dyson HJ, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, 1999. 293(2): p. 321–31.
43. Liu J, Tan H, and Rost B, Loopy proteins appear conserved in evolution. J Mol Biol, 2002. 322(1): p. 53–64.
44. Tompa P, Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527–33.
45. Gattani S, Mishra A, and Hoque MT, StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence. Carbohydr Res, 2019. 486: p. 107857.
46. Vigneau E, et al., Random forests: A machine learning methodology to highlight the volatile organic compounds involved in olfactory perception. Food Quality, 2018. 68: p. 135–145.
47. Ranganathan P, Pramesh CS, and Aggarwal R, Common pitfalls in statistical analysis: Logistic regression. Perspect Clin Res, 2017. 8(3): p. 148–151.
48. Geurts P, Ernst D, and Wehenkel L, Extremely randomized trees. Machine learning, 2006. 63(1): p. 3–42.
49. Alawad DM, Mishra A, and Hoque MT, AIBH: accurate identification of brain hemorrhage using genetic algorithm based feature selection and stacking. Machine Learning Knowledge Extraction, 2020. 2(2): p. 56–77.
50. Ke G, et al., Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 2017. 30.
51. Dorogush AV, Ershov V, and Gulin A, CatBoost: gradient boosting with categorical features support. arXiv preprint; 2018.
52. Lundberg S and Lee S-I, A Unified Approach to Interpreting Model Predictions, in Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 4768–4777.
53. Akiba T, et al., Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv [cs.LG], 2019.
54. Ribeiro MT, Singh S, and Guestrin C, “Why Should I Trust You?”, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, Association for Computing Machinery: San Francisco, California, USA. p. 1135–1144.

[R1] 1.Zhou J, et al. , EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation. 2017. 18. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R2] 2.Dai X, Zhang S, and Zaleta-Rivera K, RNA: interactions drive functionalities. Mol Biol Rep, 2020. 47(2): p. 1413–1434. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] 3.Licatalosi DD, Roles of RNA-binding Proteins and Post-transcriptional Regulation in Driving Male Germ Cell Development in the Mouse. Adv Exp Med Biol, 2016. 907: p. 123–51. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] 4.Cozzolino F, et al. , Protein-DNA/RNA Interactions: An Overview of Investigation Methods in the - Omics Era. J Proteome Res, 2021. 20(6): p. 3018–3030. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R5] 5.Kobras CM, Fenton AK, and Sheppard SK, Next-generation microbiology: from comparative genomics to gene function. Genome Biol, 2021. 22(1): p. 123. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R6] 6.Li K, et al. , Prediction of hot spots in protein-DNA binding interfaces based on supervised isometric feature mapping and extreme gradient boosting. BMC Bioinformatics, 2020. 21(Suppl 13): p. 381. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] 7.Mesri M, Advances in Proteomic Technologies and Its Contribution to the Field of Cancer. Advances in medicine, 2014. 2014: p. 238045–238045. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] 8.Faezov B and Dunbrack RL Jr, PDBrenum: A webserver and program providing Protein Data Bank files renumbered according to their UniProt sequences. Plos one, 2021. 16(7): p. e0253411. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] 9.Deng L, et al. , PDRLGB: precise DNA-binding residue prediction using a light gradient boosting machine. BMC Bioinformatics, 2018. 19(Suppl 19): p. 522. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] 10.Yuan Q, et al. , AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Briefings in Bioinformatics, 2022. 23(2): p. bbab564. [DOI] [PubMed] [Google Scholar]

[R11] 11.Cai YD and Lin SL, Support vector machines for predicting rRNA-, RNA-, and DNA-binding proteins from amino acid sequence. Biochim Biophys Acta, 2003. 1648(1–2): p. 127–33. [DOI] [PubMed] [Google Scholar]

[R12] 12.Zou C, Gong J, and Li H, An improved sequence based prediction protocol for DNA-binding proteins using SVM and comprehensive feature analysis. BMC Bioinformatics, 2013. 14(1): p. 90. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R13] 13.Lou W, et al. , Sequence based prediction of DNA-binding proteins based on hybrid feature selection using random forest and Gaussian naive Bayes. PLoS One, 2014. 9(1): p. e86703. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R14] 14.Zhang Y, et al. , newDNA-Prot: Prediction of DNA-binding proteins by employing support vector machine and a comprehensive sequence representation. Comput Biol Chem, 2014. 52: p. 51–9. [DOI] [PubMed] [Google Scholar]

[R15] 15.Yan J and Kurgan L, DRNApred, fast sequence-based method that accurately predicts and discriminates DNA-and RNA-binding residues. Nucleic acids research, 2017. 45(10): p. e84–e84. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] 16.Hwang S, Gou Z, and Kuznetsov IB, DP-Bind: a web server for sequence-based prediction of DNA-binding residues in DNA-binding proteins. Bioinformatics, 2007. 23(5): p. 634–6. [DOI] [PubMed] [Google Scholar]

[R17] 17.Wang L, et al. , BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Systems Biology, 2010. 4: p. 1–9. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Wang L and Brown SJ, BindN: a web-based tool for efficient prediction of DNA and RNA binding sites in amino acid sequences. Nucleic Acids Res, 2006. 34(Web Server issue): p. W243–8. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Wang L, et al. , BindN+ for accurate prediction of DNA and RNA-binding residues from protein sequence features. BMC Syst Biol, 2010. 4 Suppl 1(Suppl 1): p. S3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] 20.Ahmad S and Sarai A, PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics, 2005. 6(1): p. 33. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R21] 21.Zhou J, et al. , EL_PSSM-RT: DNA-binding residue prediction by integrating ensemble learning with PSSM Relation Transformation. BMC Bioinformatics, 2017. 18(1): p. 379. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Kabsch W and Sander C, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 1983. 22(12): p. 2577–637. [DOI] [PubMed] [Google Scholar]

[R23] 23.Zhang Q, et al. , StackPDB: Predicting DNA-binding proteins based on XGB-RFE feature optimization and stacked ensemble classifier. Applied Soft Computing, 2021. 99: p. 106921. [Google Scholar]

[R24] 24.Hendrix SG, et al. , DeepDISE: DNA Binding Site Prediction Using a Deep Learning Method. Int J Mol Sci, 2021. 22(11): p. 5510. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] 25.Zhou J, et al. , EL_LSTM: Prediction of DNA-Binding Residue from Protein Sequence by Combining Long Short-Term Memory and Ensemble Learning. IEEE/ACM Trans Comput Biol Bioinform, 2020. 17(1): p. 124–135. [DOI] [PubMed] [Google Scholar]

[R26] 26. Jones DT and Ward JJ, Prediction of disordered regions in proteins from position specific score matrices. Proteins, 2003. 53 Suppl 6(6): p. 573–8.

[R27] 27. Ma X, et al., A SVM-based approach for predicting DNA-binding residues in proteins from amino acid sequences. IEEE Xplore, 2009.

[R28] 28. Liu B, et al., Using amino acid physicochemical distance transformation for fast protein remote homology detection. PLoS One, 2012. 7(9): p. e46633.

[R29] 29. Huang HL, et al., Predicting and analyzing DNA-binding domains using a systematic approach to identifying a set of informative physicochemical and biochemical properties. BMC Bioinformatics, 2011: p. S47.

[R30] 30. Liu B, et al., Using Amino Acid Physicochemical Distance Transformation for Fast Protein Remote Homology Detection. PLoS One, 2012. 7.

[R31] 31. Zhu L, et al., Improving the accuracy of predicting disulfide connectivity by feature selection. J Comput Chem, 2010. 31(7): p. 1478–85.

[R32] 32. Niu S, et al., Prediction of tyrosine sulfation with mRMR feature selection and analysis. J Proteome Res, 2010. 9(12): p. 6490–7.

[R33] 33. Iqbal S, Mishra A, and Hoque MT, Improved prediction of accessible surface area results in efficient energy function application. J Theor Biol, 2015. 380: p. 380–91.

[R34] 34. Iqbal S and Hoque MT, PBRpredict-Suite: a suite of models to predict peptide-recognition domain residues from protein sequence. Bioinformatics, 2018. 34(19): p. 3289–3299.

[R35] 35. Iqbal S and Hoque MT, Estimation of position specific energy as a feature of protein residues from sequence alone for structural classification. PLoS One, 2016. 11(9): p. e0161452.

[R36] 36. Altschul SF, et al., Basic Local Alignment Search Tool. J Mol Biol, 1990. 215: p. 403–410.

[R37] 37. Sharma A, et al., Evaluation of sequence features from intrinsically disordered regions for the estimation of protein function. PLoS One, 2014. 9(2): p. e89890.

[R38] 38. Eddy SR, Profile hidden Markov models. Bioinformatics, 1998. 14(9): p. 755–63.

[R39] 39. Mishra A, Kabir MWU, and Hoque MT, diSBPred: A machine learning based approach for disulfide bond prediction. Comput Biol Chem, 2021. 91: p. 107436.

[R40] 40. Kabir MW, et al., TAFPred: Torsion Angle Fluctuations Prediction from Protein Sequences. Biology, 2023. 12, DOI: 10.3390/biology12071020.

[R41] 41. Hanson J, et al., SPOT-Disorder2: Improved Protein Intrinsic Disorder Prediction by Ensembled Deep Learning. Genomics Proteomics Bioinformatics, 2019. 17(6): p. 645–656.

[R42] 42. Wright PE and Dyson HJ, Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J Mol Biol, 1999. 293(2): p. 321–31.

[R43] 43. Liu J, Tan H, and Rost B, Loopy proteins appear conserved in evolution. J Mol Biol, 2002. 322(1): p. 53–64.

[R44] 44. Tompa P, Intrinsically unstructured proteins. Trends Biochem Sci, 2002. 27(10): p. 527–33.

[R45] 45. Gattani S, Mishra A, and Hoque MT, StackCBPred: A stacking based prediction of protein-carbohydrate binding sites from sequence. Carbohydr Res, 2019. 486: p. 107857.

[R46] 46. Vigneau E, et al., Random forests: A machine learning methodology to highlight the volatile organic compounds involved in olfactory perception. Food Quality, 2018. 68: p. 135–145.

[R47] 47. Ranganathan P, Pramesh CS, and Aggarwal R, Common pitfalls in statistical analysis: Logistic regression. Perspect Clin Res, 2017. 8(3): p. 148–151.

[R48] 48. Geurts P, Ernst D, and Wehenkel L, Extremely randomized trees. Machine Learning, 2006. 63(1): p. 3–42.

[R49] 49. Alawad DM, Mishra A, and Hoque MT, AIBH: accurate identification of brain hemorrhage using genetic algorithm based feature selection and stacking. Machine Learning Knowledge Extraction, 2020. 2(2): p. 56–77.

[R50] 50. Ke G, et al., LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 2017. 30.

[R51] 51. Dorogush AV, Ershov V, and Gulin A, CatBoost: gradient boosting with categorical features support. arXiv preprint; 2018.

[R52] 52. Lundberg S and Lee S-I, A Unified Approach to Interpreting Model Predictions, in Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 4768–4777.

[R53] 53. Akiba T, et al., Optuna: A Next-generation Hyperparameter Optimization Framework. arXiv [cs.LG], 2019.

[R54] 54. Ribeiro MT, Singh S, and Guestrin C, “Why Should I Trust You?”, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016, Association for Computing Machinery: San Francisco, California, USA. p. 1135–1144.
