Nucleic Acids Research. 2013 Jun 20;41(16):7606–7614. doi: 10.1093/nar/gkt544

Novel approach for selecting the best predictor for identifying the binding sites in DNA binding proteins

R Nagarajan 1, Shandar Ahmad 2, M Michael Gromiha 1,*
PMCID: PMC3763535  PMID: 23788679

Abstract

Protein–DNA complexes play vital roles in many cellular processes through the interactions of amino acids with DNA. Several computational methods have been developed for predicting the interacting residues in DNA-binding proteins using sequence and/or structural information. These methods showed different levels of accuracy, which may depend on the choice of data sets used in training, the feature sets selected for developing a predictive model, the ability of the models to capture information useful for prediction or a combination of these factors. In many cases, different methods are likely to produce similar results, whereas in others, the predictors may return contradictory predictions. In this situation, a priori estimates of prediction performance applicable to the system being investigated would help biologists to choose the best method for designing their experiments. In this work, we have constructed unbiased, stringent and diverse data sets for DNA-binding proteins based on various biologically relevant considerations: (i) seven structural classes, (ii) 86 folds, (iii) 106 superfamilies, (iv) 194 families, (v) 15 binding motifs, (vi) single/double-stranded DNA, (vii) DNA conformation (A, B, Z, etc.), (viii) three functions and (ix) disordered regions. These data sets were culled as non-redundant at sequence identities of 25 and 40% and used to evaluate the performance of 11 different methods for which online services or standalone programs are available. We observed that the best performing methods for each of the data sets showed significant biases toward the data sets selected for their benchmark. Our analysis revealed important data set features, which could be used to estimate these context-specific biases and hence suggest the best method to be used for a given problem. We have developed a web server, which considers these features on demand and displays the best method that the investigator should use.
The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/. Further, we have grouped the methods based on their complexity and analyzed the performance. The information gained in this work could be effectively used to select the best method for designing experiments.

INTRODUCTION

Protein–DNA interactions play vital roles in several biological processes, including gene regulation, DNA repair, DNA replication and DNA packaging. The knowledge about DNA-binding residues and binding specificity would help to understand the recognition mechanism of protein–DNA complexes. The availability of experimental data on binding specificity (1) and 3D structures of protein–DNA complexes (2) encouraged researchers to reveal important factors for understanding protein–DNA recognition. The analysis has been focused on different directions such as amino acid properties, conservation of residues, contribution of non-covalent interactions and conformational changes of DNA (3–26). The importance of hydrogen bonds, electrostatic, hydrophobic and van der Waals interactions along with weak interactions including cation-π has been stressed by several investigators in the field (12,14,19,24,27–32). The contributions of energetic terms along with physical and chemical features have been used to understand the recognition mechanism of protein–DNA complexes. Furthermore, knowledge-based statistical potentials have been derived using atomic contacts between protein and DNA, and these potentials have been used to predict the binding specificity of protein–DNA complexes (33,34). Gromiha et al. (2004) combined both inter and intra-molecular interactions for understanding the recognition mechanism.

On the other hand, owing to the exponential increase in the gap between the available sequences and structures of DNA-binding proteins in Uniprot (35) and Protein Data Bank (2), several methods have been proposed to identify the binding site residues from amino acid sequence alone. These methods are based on amino acid frequency, evolutionary profile, sequence conservation, predicted secondary structure and solvent accessibility, electrostatic potential, hydrophobicity and position-specific scoring matrix, and use various machine learning methods such as support vector machine, neural network, Naïve Bayes classifier and random forest (33,36–51). Careful inspection of these methods revealed that they are applicable to specific types of proteins, and the performance of each method varies drastically in the range of 20–90%. This situation makes it difficult for biologists to select the best method to identify the binding sites when designing their experiments. Hence, it is essential to establish the applicability and predictive ability of existing predictors on specific data sets based on various properties of the query protein.

In this work, we have systematically categorized the protein–DNA complexes into several groups based on the structure of the protein, structure of the DNA, binding motif and function. The complexes in each category have been divided into several sub-categories using known annotations in structural and functional databases. In parallel, we have collected all the prediction methods that have either online services or available standalone programs. We have developed the necessary in-house programs to analyze the results obtained with each method using nine types of data sets. We noticed that no method uniformly predicts the binding sites with high accuracy in all the data sets. This holds for the most recently developed methods with tuned parameters, efficient techniques and large data sets as well as for the earliest methods reported in the literature. We have related the performance of each method to the different data sets and revealed the correspondence between them. These results would help biologists to select the best method to design their experiments rather than choosing a specific method arbitrarily or using a combination of methods. In addition, the present study highlights the need to refine or develop bioinformatics tools to improve the performance in specific categories of DNA-binding proteins. Specific examples of the best and worst performance of methods in selected categories of data sets are discussed.

MATERIALS AND METHODS

Data sets

We have collected all the protein–DNA complexes (2317 entries) deposited in Protein Data Bank (last accessed on 16 May 2012). These complexes were classified into four broad categories based on (i) protein structure, (ii) DNA structure, (iii) binding motif and (iv) protein function, as described below. All the data sets have been culled at sequence identities of <25% and <40%. We obtained similar results with both cutoffs, and the data with the cutoff of <25% sequence identity are presented in this article.
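The culling step can be illustrated with a minimal sketch. It assumes a greedy selection in which a chain is kept only if its sequence identity to every chain already kept is below the cutoff. The article does not state which culling tool or algorithm was used, so the function `cull` and the alignment-free `toy_identity` helper are hypothetical stand-ins, not the authors' procedure.

```python
def toy_identity(a, b):
    """Hypothetical stand-in for a real alignment-based identity:
    percent of matching positions over the shorter sequence."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / n


def cull(sequences, identity=toy_identity, cutoff=25.0):
    """Greedy non-redundant selection (an assumption, not the article's method).

    sequences: list of (name, sequence) pairs, processed in order.
    A sequence is kept only if its identity to every kept sequence is < cutoff.
    """
    kept = []
    for name, seq in sequences:
        if all(identity(seq, other) < cutoff for _, other in kept):
            kept.append((name, seq))
    return kept


chains = [
    ("p1", "ACDEFGHIKL"),
    ("p2", "ACDEFGHIKV"),  # 90% identical to p1 under toy_identity
    ("p3", "WWWWWWWWWW"),  # shares no residues with p1
]
print([name for name, _ in cull(chains, cutoff=25.0)])
```

In practice the identity function would come from a pairwise alignment, and the order in which chains are processed affects which representatives survive.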

Classification based on protein structure

We have used the SCOP database (52) for structural classification of proteins based on their structural classes, folding types, superfamilies and families. Our final data set contains 260 protein chains from seven classes, 86 folds, 106 superfamilies and 194 families with the sequence identity of <25%.

Further, we have identified the disordered regions by comparing the structures of proteins in free and complex forms and analyzed the performance of different methods in disordered regions.

Classification based on DNA structure

We have classified the protein–DNA complexes based on DNA structure in two respects: (i) DNA conformation such as A, B, Z, RH and U and (ii) type of DNA (single stranded, double stranded and palindrome, and double stranded and non-palindrome). The conformation of DNA has been obtained from the Nucleic Acid Database (NDB) (53). The databases PDB, NDB and PDIdb (54) have been used to obtain the information on double/single-stranded DNA and palindrome/non-palindrome DNA. The final data sets contain 283 and 301 protein chains based on DNA conformation and type, respectively.

Motif-based classification

The binding motif is considered to be an important factor for identifying the binding sites (55). Hence, we classified the protein–DNA complexes based on their binding motifs, and the major ones are helix-turn-helix, β-barrel and β-ribbon. We obtained the motif information from different databases such as ProNuc (56), PDIdb (54) and Biomolecules gallery (http://gibk26.bio.kyutech.ac.jp/jouhou/image/dna-protein/all/all.html). The final data set contains 69 chains from 15 motifs. We noticed that several complexes are listed under enzymes, which are considered in the classification based on functions.

Functional classification of protein–DNA complexes

We have classified the protein–DNA complexes based on their functions such as enzymes, regulatory proteins and structural proteins. The functional information has been obtained from NDB. The final data set contains 126 enzymes, 149 regulatory proteins and 19 structural proteins with the sequence identity of <25%.

Methods for predicting the binding sites in DNA-binding proteins

We have collected all the available methods for predicting the binding sites in DNA-binding proteins from amino acid sequence that have either online services or available standalone programs (57). The methods are BindN (39), BindN+ (47), BindN-RF (46), DBS-Pred (37), DBS-PSSM (38), DNABindR (49), DP-Bind with three categories (binary, BLOSUM and PSSM encoding) (48), metaDBSite (51) and NAPS (50). The name, features, technique, reference and link for each method used in the present work are listed in Supplementary Table S1. These methods were developed with different data sets, and the accuracies reported by the authors are in the range of 70–80%.

Identification of DNA-binding residues

Several criteria have been proposed to identify the DNA-binding sites, such as the distance between contacting atoms in protein and DNA (37), reduction in solvent accessibility on binding (58) and interaction energy between protein and DNA (21). Most of the prediction methods analyzed in this work used the distance-based criterion for identifying the binding sites. In this approach, a residue in a DNA-binding protein is identified as binding if the distance between any of its heavy atoms and a heavy atom in DNA is ≤3.5 Å. We have identified the binding sites using the same condition in all the considered protein–DNA complexes.
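The distance-based criterion can be sketched in a few lines. The example below assumes that heavy-atom coordinates have already been extracted from a complex structure; the input layout and the function name `binding_residues` are illustrative choices, not part of the in-house programs mentioned in this work.

```python
import math

CUTOFF = 3.5  # distance cutoff in Å, as stated in the text


def binding_residues(protein_atoms, dna_atoms, cutoff=CUTOFF):
    """Return residue ids having any heavy atom within `cutoff` of a DNA heavy atom.

    protein_atoms: iterable of (residue_id, (x, y, z)) for protein heavy atoms
    dna_atoms: iterable of (x, y, z) for DNA heavy atoms
    """
    dna = list(dna_atoms)
    binding = set()
    for res_id, coord in protein_atoms:
        if res_id in binding:
            continue  # residue already marked as binding by another atom
        if any(math.dist(coord, d) <= cutoff for d in dna):
            binding.add(res_id)
    return binding


# Toy coordinates (hypothetical, for illustration only)
protein = [
    ("A10", (0.0, 0.0, 0.0)),   # 3.0 Å from the DNA atom
    ("A10", (9.0, 0.0, 0.0)),
    ("A11", (10.0, 0.0, 0.0)),  # far from the DNA atom
]
dna = [(0.0, 0.0, 3.0)]
print(sorted(binding_residues(protein, dna)))
```

This all-pairs scan is quadratic in the number of atoms; for full structures a spatial index (for example a k-d tree) would be the usual choice.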

Assessing the performance of prediction methods

We have assessed the performance of different methods using the measures, sensitivity, specificity, accuracy and Matthews correlation coefficient (MCC). Sensitivity shows the correct prediction of DNA-binding residues, specificity reveals the ability of excluding non-binding residues and accuracy provides the overall performance (59).

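$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \qquad (1)$$
$$\mathrm{Specificity} = \frac{TN}{TN + FP} \qquad (2)$$
$$\mathrm{Accuracy1} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (3)$$
$$\mathrm{Accuracy2} = \frac{\mathrm{Sensitivity} + \mathrm{Specificity}}{2} \qquad (4)$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \qquad (5)$$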

In these equations, TP (binding residues predicted as binding), TN (non-binding residues predicted as non-binding), FP (non-binding residues predicted as binding) and FN (binding residues predicted as non-binding) represent true positives, true negatives, false positives and false negatives, respectively.
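As a minimal sketch, the measures can be computed from the four counts as follows. It assumes that Accuracy1 is the overall fraction of correctly predicted residues and that Accuracy2 is the average of sensitivity and specificity, which is how the tabulated values in this article behave. The function name `performance`, the percent scaling and the handling of a zero MCC denominator are illustrative assumptions.

```python
import math


def performance(tp, tn, fp, fn):
    """Compute sensitivity, specificity, two accuracies (all in %) and MCC.

    Assumes at least one binding and one non-binding residue
    (tp + fn > 0 and tn + fp > 0).
    """
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy1 = 100.0 * (tp + tn) / (tp + tn + fp + fn)  # overall accuracy
    accuracy2 = (sensitivity + specificity) / 2.0        # balanced accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0  # assumption: 0 if undefined
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy1": accuracy1,
        "accuracy2": accuracy2,
        "mcc": mcc,
    }


# Hypothetical counts for a 100-residue protein with 10 binding residues
print(performance(tp=8, tn=80, fp=10, fn=2))
```

Because binding residues are usually a small minority, Accuracy1 is dominated by the non-binding class, whereas Accuracy2 weights the two classes equally; this is why the two values can differ substantially for the same predictions.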

RESULTS AND DISCUSSIONS

We have assessed the performance of all the available methods in different sets of data as described in the ‘Materials and Methods’ section.

Structural classes

Protein–DNA complexes have been classified into seven structural classes: all-α, all-β, α + β, α/β, multi-domain, coiled coil and small proteins. The accuracies obtained with all the 11 considered methods in these data sets are presented in Table 1. From Table 1, we noticed that the performance of a method depends on the structural class. Most of the methods predict well in all-α proteins, whereas the performance is poor in the all-β class of proteins. This trend is similar to that in protein secondary structure prediction, where all-α class proteins are better predicted than all-β class proteins (60). The binding sites in coiled coil proteins are predicted well by most of the methods. The comparison of different methods showed that BindN-RF has the best performance in all-α and all-β proteins. The sensitivity of BindN-RF and DP-Bind_BLOSUM is <60%, although the overall accuracies of these methods are higher than that of metaDBSite. Further, none of the methods showed a sensitivity of more than 59% in multi-domain proteins. This might be due to the large size of these proteins, in which the binding site residues account for <2% of the residues. Hence, we have separated the domains in these proteins using SBASE (61) and predicted the binding sites in the DNA-binding domain. We observed that the methods DP-Bind_PSSM and NAPS could predict the binding sites with >60% sensitivity, specificity and accuracy. Further, we have evaluated the performance of different methods using MCC, and the results are presented in Supplementary Table S2. We noticed that the trend is similar to that observed with accuracy.

Table 1.

Prediction accuracy of binding sites in different classes

| Methods | Average accuracy | all-α | all-β | α + β | α/β | Coiled coil | Multidomain | Small proteins |
|---|---|---|---|---|---|---|---|---|
| BindN | 64.2 (74.9) | 66.2 (76.3) | 62.1 (74.6) | 60.3 (74.8) | 62.2 (79.6) | 75.1 (73.2) | 61.3 (79.7) | 62.4 (66.5) |
| BindN+ | 71.1 (82.8) | 76.2 (83.8) | 66.0 (83.7) | 67.9 (81.9) | 66.5 (85.8) | 88.4 (86.9) | 65.5 (87.3) | 66.9 (70.2) |
| BindN-RF | 71.9 (82.3) | 76.4 (83.7) | 68.0 (82.7) | 68.4 (82.8) | 67.8 (84.8) | 88.1 (86.5) | 65.8 (86.6) | 68.5 (69.2) |
| DBS-Pred | 64.3 (72.6) | 64.2 (73.0) | 62.6 (71.6) | 62.0 (71.7) | 63.1 (75.4) | 74.4 (73.6) | 59.8 (76.4) | 63.6 (66.5) |
| DBS-PSSM | 70.2 (78.5) | 73.2 (80.2) | 65.5 (78.2) | 65.8 (76.5) | 67.1 (83.3) | 87.4 (81.6) | 65.3 (87.0) | 67.3 (62.3) |
| DP-Bind_Binary | 66.9 (68.0) | 68.1 (68.6) | 63.5 (65.8) | 63.1 (67.2) | 66.3 (70.6) | 79.8 (70.4) | 62.0 (70.7) | 65.4 (62.9) |
| DP-Bind_BLOSUM | 66.1 (67.8) | 69.2 (69.5) | 63.2 (66.3) | 62.8 (67.8) | 66.3 (71.5) | 75.6 (66.6) | 61.1 (70.5) | 65.0 (62.4) |
| DP-Bind_PSSM | 72.1 (76.4) | 73.7 (78.4) | 67.9 (75.3) | 69.6 (76.6) | 70.4 (80.5) | 88.1 (84.6) | 69.9 (79.4) | 64.8 (56.8) |
| DNABindR | 68.0 (71.9) | 70.1 (72.9) | 62.6 (68.1) | 65.2 (71.0) | 66.2 (75.2) | 82.9 (77.3) | 64.2 (77.1) | 64.4 (61.7) |
| metaDBSite | 69.9 (72.3) | 72.0 (74.1) | 66.9 (70.2) | 67.2 (71.5) | 69.2 (76.6) | 82.0 (74.0) | 65.4 (76.6) | 66.5 (62.9) |
| NAPS | 63.6 (65.1) | 64.6 (64.8) | 58.8 (61.3) | 59.4 (62.5) | 57.6 (66.9) | 80.6 (75.0) | 62.5 (67.9) | 61.6 (57.6) |

Accuracies obtained with Equation (3) are given in parentheses. The highest accuracy in each class is shown in bold.

Folds, superfamilies and families

The classification of protein–DNA complexes based on their structures showed that they are distributed in 86 different folding types, 106 superfamilies and 194 families. We have analyzed the performance of all the 11 prediction methods in all folds, superfamilies and families, and the summarized results are presented in Figure 1 and Supplementary Table S3. DP-Bind_PSSM showed the best performance in >20% of the folds/superfamilies. However, the accuracy of this method is <60% in 13 of the 86 considered folds. BindN-RF ranked highest in the classification of families. Methods such as DBS-PSSM and BindN+ predicted the binding sites with the highest accuracy in 10–20% of the considered 186 DNA-binding proteins. Interestingly, one of the earliest prediction methods, DBS-Pred (37), also showed the best performance in four folds, four superfamilies and three families. These results showed that the prediction methods complement each other in different types of DNA-binding proteins. For practical applications, it is essential to identify the best method for each specific type of protein.

Figure 1.

Performance of DNA-binding site prediction methods in various folds, superfamilies and families.

We have systematically analyzed the correspondence between the structure of the complex and prediction performance, and the methods showing the highest and lowest accuracies for identifying the binding sites in 86 folds, 106 superfamilies and 194 families are listed in Supplementary Table S4. A few typical examples of the best and worst predicted folds along with their performances are presented in Table 2. BindN+, DP-Bind_PSSM and DBS-Pred showed the best performance in profilin-like, tetracyclin repressor-like and transcription factor IIA types of folds, respectively. The predicted accuracies are >90% based on the average of sensitivity and specificity. On the other hand, other methods showed a poor performance in these folds, with accuracies in the range of 50–70%. Further, the accuracies for several folds are <70%, and three typical examples are listed in Table 2. The best method showed an accuracy of 57% in the Retrovirus zinc finger-like domain fold; the sensitivity and specificity are 70.2 and 43.0%, respectively. In addition, PUA domain-like and HLH-like folds showed accuracies of 61.1 and 61.2%, respectively. These results indicate the need for methods applicable to folds in which the binding sites are poorly predicted.

Table 2.

Typical examples of best and worst predicted folds, superfamilies and families

| Fold/Superfamily/Family | Method | Sensitivity | Specificity | Accuracy1 | Accuracy2 | MCC | Lowest accuracy (method) | MCC (lowest) |
|---|---|---|---|---|---|---|---|---|
| Fold | | | | | | | | |
| Profilin-like (1) | BindN+ | 100.0 | 96.4 | 96.6 | 98.2 | 0.32 | 64.5 (DP-Bind_BLOSUM) | 0.20 |
| Tetracyclin repressor-like, C terminal domain (2) | DP-Bind_PSSM | 96.2 | 89.6 | 89.8 | 92.9 | 0.28 | 51.1 (DP-Bind_BLOSUM) | 0.20 |
| Transcription factor IIA (TFIIA), beta-barrel domain (2) | DBS-Pred | 100.0 | 80.4 | 82.0 | 90.2 | 0.16 | 67.9 (NAPS) | 0.13 |
| *HLH-like (1)* | BindN+ | 40.0 | 82.4 | 72.7 | 61.2 | 0.38 | 49.6 (NAPS) | 0.16 |
| *PUA domain-like (1)* | DNABindR | 75.0 | 47.3 | 51.0 | 61.1 | 0.21 | 54.4 (NAPS) | 0.13 |
| *Retrovirus zinc finger-like domains (2)* | DP-Bind_PSSM | 70.2 | 43.0 | 54.7 | 56.6 | 0.29 | 46.5 (DNABindR) | 0.22 |
| Superfamily | | | | | | | | |
| Pheromone-binding, quorum-sensing transcription factors (1) | BindN+ | 100.0 | 96.4 | 96.6 | 98.2 | 0.31 | 64.5 (DP-Bind_BLOSUM) | 0.20 |
| Dimeric alpha + beta barrel (1) | BindN-RF | 87.5 | 96.4 | 95.9 | 92.0 | 0.34 | 47.3 (DBS-Pred) | 0.17 |
| DNA-binding domain- eukaryotic transcription factors (1) | DBS-PSSM | 100.0 | 88.5 | 90.5 | 94.3 | 0.28 | 72.3 (DBS-Pred) | 0.20 |
| *Chromo domain-like (1)* | DBS-Pred | 27.8 | 73.9 | 60.9 | 50.8 | 0.20 | 42.2 (NAPS) | 0.14 |
| *Immunoglobulin (3)* | DBS-PSSM | 77.8 | 60.0 | 60.4 | 68.9 | 0.29 | 37.4 (NAPS) | 0.13 |
| *RNase A-like (1)* | DP-Bind_PSSM | 71.4 | 47.0 | 48.4 | 59.2 | 0.29 | 34.9 (BindN) | 0.18 |
| Family | | | | | | | | |
| AraC type transcriptional activator (1) | BindN-RF | 100.0 | 99.0 | 99.1 | 99.5 | 0.32 | 65.9 (DBS-Pred) | 0.19 |
| CopG-like (1) | BindN | 100.0 | 81.1 | 83.7 | 90.5 | 0.22 | 78.1 (BindN-RF) | 0.20 |
| Z-DNA binding domain (1) | DBS-PSSM | 100.0 | 81.1 | 82.5 | 90.6 | 0.26 | 47.4 (DP-Bind_Binary) | 0.19 |
| *T7 RNA polymerase (1)* | DP-Bind_PSSM | 50.0 | 88.0 | 86.2 | 69.0 | 0.28 | 58.5 (NAPS) | 0.13 |
| *RecA protein-like (ATPase-domain) (1)* | BindN-RF | 33.3 | 87.7 | 86.5 | 60.5 | 0.33 | 44.1 (DNABindR) | 0.23 |
| *SRA domain-like (1)* | DNABindR | 75.0 | 47.3 | 51.0 | 61.1 | 0.23 | 54.4 (NAPS) | 0.13 |

The worst predicted folds/superfamilies/families are shown in italics.

The best predicted superfamilies and their performance are included in Table 2. We observed that the pheromone-binding and dimeric α + β barrel superfamilies are predicted with an accuracy of >90%, whereas the lowest accuracies are 65 and 47%, respectively. The binding sites in eukaryotic transcription factors are predicted well by all the methods, and the highest and lowest accuracies are 94 and 72%, respectively. The worst predicted superfamilies are chromo domain-like, immunoglobulin and RNase A-like, with the highest accuracy of ∼60% (Table 2). Interestingly, the binding site residues in the chromo domain superfamily are predicted with high specificity, whereas in the other two superfamilies the binding residues are identified with high sensitivity. This suggests that the interface residues in these domains may consist of a small number of residues with a strong binding signal, which remain unchanged across the family, whereas other residues show diversity, and their binding cannot be predicted directly from sequence features alone.

We observed similar tendency in the classification of families. BindN-RF predicted the binding sites in AraC type transcriptional activator with the accuracy of 99.5%; the sensitivity and specificity are 100 and 99%, respectively. The binding sites in CopG and Z-DNA-binding domain are predicted with >90% accuracy by BindN and DBS-PSSM, respectively.

This analysis revealed that, although newly developed methods incorporate several features, finely tuned parameters and large data sets and show excellent performance over other methods, simpler methods reported earlier may outperform the more complex methods on some systems. Hence, the earlier methods should also be considered when making predictions.

Disordered regions

We have analyzed the performance of different methods in disordered regions of 73 protein chains. The results are presented in Table 3. We observed that the methods BindN-RF and DP-Bind_PSSM, which showed high accuracy in different structural classes (Table 1), have lower sensitivity and specificity, respectively, in disordered regions. Their overall accuracy is also reduced to 62%. On the other hand, DBS-Pred maintained an accuracy of 61% for disordered regions. The accuracies obtained with different methods given in Table 3 show the need to develop new methods for predicting the binding sites in disordered regions.

Table 3.

Prediction performance of binding sites in disordered regions

| Method | Sensitivity | Specificity | Accuracy1 | Accuracy2 | MCC |
|---|---|---|---|---|---|
| DBS-Pred | 61.3 | 60.7 | 60.8 | 61.0 | 0.17 |
| BindN | 55.5 | 67.5 | 65.2 | 61.5 | 0.19 |
| BindN+ | 61.3 | 64.6 | 64.0 | 63.0 | 0.21 |
| BindN-RF | 55.5 | 68.3 | 65.9 | 61.9 | 0.19 |
| DP-Bind_Binary | 78.1 | 48.4 | 54.0 | 63.3 | 0.21 |
| DP-Bind_BLOSUM | 73.0 | 50.3 | 54.5 | 61.6 | 0.18 |
| DP-Bind_PSSM | 65.7 | 56.4 | 60.6 | 61.0 | 0.20 |
| NAPS | 59.1 | 58.9 | 58.9 | 59.0 | 0.14 |
| DNABindR | 75.9 | 51.9 | 56.4 | 63.9 | 0.22 |
| metaDBSite | 73.0 | 56.0 | 59.2 | 64.5 | 0.23 |
| DBS-PSSM | 65.0 | 61.1 | 61.8 | 63.0 | 0.20 |

Motifs

We have grouped the protein–DNA complexes into 15 different motifs, which have the representation of 1–30 complexes. The best performance of each method in all the motifs is shown in Supplementary Table S5. In this Table, we have also included the number of motifs, sensitivity, specificity and accuracy. We noticed that BindN+ performed the best in alpha/beta, beta sheet and helix-loop-helix motifs. On the other hand, the performance is poor in Zalpha motif. BindN-RF showed the best performance in 9 of the 15 considered motifs. DBS-PSSM is ranked as the first in the ribbon-helix-helix and Zalpha motifs.

We have analyzed the performance of each method in all these motifs with the condition that the sensitivity and specificity are both >60%, and the results are shown in Figure 2. We observed that all the methods performed well in at least 2 of the 15 considered motifs. BindN-RF met this condition in 12 of 15 motifs, followed by BindN+ (10/15). DBS-PSSM, DNABindR and metaDBSite showed the sensitivity and specificity of >70% in 5–8 motifs.

Figure 2.

Performance of prediction methods in 15 different types of DNA-binding motifs. For each method, the number of motifs predicted with sensitivity and specificity of >60% each is shown.

Type of DNA

We have classified the protein–DNA complexes based on three types of DNA: single-stranded, double-stranded and palindrome, and double-stranded and non-palindrome DNA. We observed that all the methods perform poorly in predicting the binding sites when the DNA is single stranded. The highest accuracy is 61.5%, with the sensitivity and specificity of 49.9 and 73%, respectively, obtained with the method DP-Bind_PSSM. The binding sites with double-stranded DNA are predicted with >70% accuracy in both palindrome and non-palindrome cases. Further, the performance with double-stranded palindrome DNA–protein complexes is better than that with non-palindrome DNA. The accuracies are 71 and 76%, respectively. This result is understandable because many more structures of double-stranded DNA-binding proteins have been solved, and hence included in training sets, than structures of proteins binding to single-stranded DNA. For example, the first published method for predicting DNA-binding sites (DBS-Pred) used only dsDNA-binding proteins for training the model.

DNA conformation

We have collected the DNA conformation details from NDB and classified the considered protein–DNA complexes accordingly. The majority of the DNA structures have the B-type conformation, for which the prediction method DP-Bind_PSSM showed the highest accuracy of 71%. The binding sites in complexes with RH and Z-DNA types are also predicted with an accuracy of 71%. Supplementary Table S6 shows the performance in the complexes with different types of DNA.

Functional classification of protein–DNA complexes

We have classified the protein–DNA complexes based on their functions, and they fall mainly under three categories, namely, enzymes, regulatory proteins and structural proteins.

The enzymes are classified into 17 groups, which have one to 52 protein–DNA complexes. The sensitivity, specificity and accuracy of the best methods in each group of enzymes are presented in Supplementary Table S7. We noticed that none of the prediction methods worked well in 13 of the 17 groups. Only four groups of enzymes, kinase, phosphatase, recombinase invertase and recombinase resolvase are predicted well with the accuracy of >80%. DNA endonuclease is a major group of enzymes with 52 complexes, and the prediction accuracy is 71% with the sensitivity of 63% and specificity of 78%. For the class of rare enzymes with only one complex, the accuracy varies from poor to good. The excellent performance of several methods in these enzymes might be due to the presence of these proteins in the training set of their respective methods. In contrast, DNA reverse transcriptase has two proteins, and the performance is poor in all the methods; the highest accuracy is 59.4% with the sensitivity of 39.4%. DNA polymerase with 17 samples is predicted poorly with the accuracy of 66%.

Regulatory proteins are classified into 13 groups with 149 chains, and the accuracy of different methods lies in the range of 60–80% (Supplementary Table S7). Further inspection of Supplementary Table S7 shows that, for a few classes of regulatory proteins such as DNA repair repressor, transcription factor co-activator and transcription factor termination, the binding site residues are identified with low sensitivity (33–46%).

Considering the structural proteins, the average accuracy is in the range of 65–78% for the 19 DNA-binding proteins in this data set. In this group of proteins, we noticed a balance between sensitivity and specificity in most of the methods. Further, one of the poorly performing methods, NAPS, showed the best performance in viral coat protein.

General trends on different prediction methods

In addition, we have evaluated the performance of different prediction methods using two independent sets of test data: (i) the protein–DNA complex structures that were not used in developing the respective methods and (ii) the protein–DNA complex structures deposited recently (since June 2011). The results obtained with these two sets of data are presented in Table 4. We observed that the average of sensitivity and specificity lies in the range of 60–70% in most of the methods for both the data sets. However, the accuracy is >75% in several methods when it is evaluated using Equation (3), which reflects the ability of a method to either correctly predict the binding sites or exclude non-binding sites. The data presented in this work based on different categories of data sets would be a valuable resource for biologists to select the best method for their target DNA-binding protein.

Table 4.

Prediction performance of different methods in two independent data sets

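| Method | Data set 1: Accuracy1 | Data set 1: Accuracy2 | Data set 1: MCC | Data set 2: Accuracy1 | Data set 2: Accuracy2 | Data set 2: MCC |
|---|---|---|---|---|---|---|
| BindN | 76.1 | 63.1 | 0.17 | 76.4 | 61.4 | 0.14 |
| BindN+ | 80.2 | 69.2 | 0.28 | 79.6 | 68.7 | 0.26 |
| BindN-RF | 78.0 | 69.5 | 0.28 | 75.3 | 68.7 | 0.24 |
| DBS-Pred | 72.6 | 62.4 | 0.16 | 72.8 | 62.2 | 0.14 |
| DBS-PSSM | 78.3 | 66.5 | 0.25 | 78.4 | 69.7 | 0.23 |
| NAPS | 63.5 | 60.2 | 0.13 | 64.8 | 60.3 | 0.12 |
| DNABindR | 71.6 | 66.3 | 0.21 | 72.1 | 66.7 | 0.20 |
| metaDBSite | 74.7 | 68.7 | 0.24 | 78.2 | 66.2 | 0.22 |
| DP-Bind_Binary | 67.9 | 65.9 | 0.19 | 68.6 | 67.7 | 0.19 |
| DP-Bind_BLOSUM | 68.4 | 66.1 | 0.19 | 67.3 | 65.4 | 0.17 |
| DP-Bind_PSSM | 75.9 | 70.3 | 0.27 | 77.7 | 70.0 | 0.25 |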

Data set 1: List of DNA–protein complexes analyzed in this work and not used in the respective methods.

Data set 2: List of DNA–protein complexes published from June 2011, after the publication of all the analyzed methods.

Comparison between the best predicted method and combination of methods

The method metaDBSite combines six different methods into a prediction system for identifying the binding sites in DNA-binding proteins. We have compared the performance of metaDBSite with the best predicted method in different groups of DNA-binding proteins, and the results are presented in Supplementary Table S8. We noticed that among 86 folds, metaDBSite performed the best in only six folds. A similar trend is observed in all nine classifications of data sets. This analysis emphasizes the advantage of the present approach over a combination of different methods. In addition, we have estimated the difference in accuracy between the best method and metaDBSite, and we noticed an improvement in accuracy of up to 54% across the DNA-binding proteins, with an average improvement of 9.6%. We have also carried out an ensemble-based prediction based on majority voting among the 11 methods used in this work, and we observed a trend similar to that obtained with the metaDBSite predictor.
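The majority-voting ensemble mentioned above can be sketched as follows. This is an illustrative version under stated assumptions: each method is taken to return one binary label per residue (1 for binding, 0 for non-binding), a residue is called binding when more than half of the methods say so, and ties (possible only with an even number of methods) are resolved as non-binding. The function name `majority_vote` and the input layout are hypothetical.

```python
def majority_vote(predictions):
    """Combine per-residue binary predictions from several methods.

    predictions: dict mapping method name -> list of 0/1 labels, one per residue.
    Returns a list of 0/1 consensus labels; ties count as non-binding (assumption).
    """
    label_lists = list(predictions.values())
    if not label_lists:
        return []
    n_residues = len(label_lists[0])
    if any(len(labels) != n_residues for labels in label_lists):
        raise ValueError("all methods must predict the same number of residues")
    threshold = len(label_lists) / 2.0
    consensus = []
    for i in range(n_residues):
        votes = sum(labels[i] for labels in label_lists)
        consensus.append(1 if votes > threshold else 0)
    return consensus


# Toy predictions for a five-residue fragment (hypothetical values)
toy = {
    "BindN": [1, 0, 0, 1, 0],
    "DBS-Pred": [1, 1, 0, 0, 0],
    "NAPS": [0, 1, 0, 1, 0],
}
print(majority_vote(toy))
```

With 11 methods, as in this work, a tie cannot occur, so the tie rule only matters when the sketch is reused with an even number of predictors.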

Grouping of methods based on their complexities

We have combined the methods into three groups based on their complexities such as (i) additive feature models (models which treat each input feature independent of the other), (ii) complex feature models (which use non-additive combination of features) without using PSSM and (iii) complex feature models using PSSM. The performance of these three groups of models was analyzed in all the considered data sets, and the results are presented in Supplementary Table S9. The results showed that the performance of additive feature models is similar to complex feature models without using PSSM. The complex feature models, which use PSSM, showed better performance for identifying the binding sites in most of the classes. However, the performance of these models to identify the binding sites of disordered regions was poor.

Applications

The insights obtained in the present work have several applications, and some of them are discussed below. (i) For a protein with known structure but without information on the complex, one can obtain all the structural information such as class, family, superfamily and so forth. In this case, depending on the type of the protein, a specific method can be used to identify the binding sites, and the results will be reliable for designing experiments. (ii) Currently, protein secondary structure prediction is reported to reach an accuracy of close to 85%, and structural class can be predicted with an accuracy of >95%. In a large-scale analysis, it is possible to predict the structural class and apply a suitable method to identify the binding sites. For example, with a correctly predicted structural class, the binding sites could be predicted with a higher accuracy than the average accuracy of the best methods reported in the literature. (iii) For a specific protein, it is possible to obtain the structural information using homology modeling or ab initio structure-prediction methods with reasonable accuracy. For selecting the best prediction method, the modeled structure would be sufficient to obtain the necessary structural information. The binding sites can then be predicted by selecting the respective method based on the structural information, and the results will be reliable for designing experiments. In addition, other information reported in this work can also be combined to obtain the desired information.

The data presented in Table 1 suggested that BindN+, BindN-RF and DP-Bind_PSSM are the best methods for identifying the binding sites in DNA-binding proteins. However, inspection of these methods on individual proteins showed a wide range of accuracies. For example, BindN+ showed the worst performance in predicting the binding sites in HMG-D protein (1QRV), with an average accuracy of 35%. On the other hand, BindN-RF showed the best performance for this protein, with an accuracy of 87%. BindN-RF showed an accuracy of 67% in T4 phage beta-glucosyltransferase (1M5R), whereas DNABindR performed well with an accuracy of 85%. In centromere-binding protein, the accuracy of DP-Bind_PSSM is 31%, and even BindN-RF, the most accurate method for this protein, reached only 55%, which requires further improvement. These data demonstrate the necessity of selecting the method appropriate to each protein for efficient prediction and the need for improvement on specific proteins.
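
The protein-by-protein comparison above can be illustrated with a short Python sketch. It is only an illustration: the dictionary holds just the accuracies quoted in the preceding paragraph (method/protein pairs that are not quoted are left out), and the names `REPORTED_ACCURACY`, `best_method` and `accuracy_spread` are introduced here for illustration and are not part of any of the prediction tools.

```python
# Accuracies (%) quoted in the text above. Pairs that the text does not
# quote are simply absent; nothing here is measured or interpolated.
REPORTED_ACCURACY = {
    "1QRV": {"BindN+": 35, "BindN-RF": 87},
    "1M5R": {"BindN-RF": 67, "DNABindR": 85},
    "centromere-binding protein": {"DP-Bind_PSSM": 31, "BindN-RF": 55},
}


def best_method(scores):
    """Return the (method, accuracy) pair with the highest quoted accuracy."""
    return max(scores.items(), key=lambda item: item[1])


def accuracy_spread(scores):
    """Return the gap, in percentage points, between the best and worst quoted accuracy."""
    return max(scores.values()) - min(scores.values())


if __name__ == "__main__":
    for protein, scores in REPORTED_ACCURACY.items():
        method, accuracy = best_method(scores)
        spread = accuracy_spread(scores)
        print(f"{protein}: best quoted method {method} ({accuracy}%), spread {spread} points")
```

Even with these few quoted values, the method that is best for one protein is not the best for another, which is the motivation for selecting a method per protein type rather than relying on a single overall ranking.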

Online tool for the correspondence between protein/DNA type and the best method

We have developed a web server to provide the best method for any type of protein/DNA based on its class, fold, family, superfamily, motif, function, single/double-stranded DNA and DNA conformation. It takes the structural/functional information of the protein/DNA as input and displays the best method in the output. The web server is freely available at http://www.biotech.iitm.ac.in/DNA-protein/.
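
The kind of lookup such a server performs can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the server's actual implementation: the table `BEST_METHOD`, the function `recommend`, the feature keys and every entry in the table are hypothetical placeholders, and the real correspondences are those reported in the tables of this work.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup table: (feature type, feature value) -> recommended method.
# All entries are placeholders for illustration, NOT results from this work.
BEST_METHOD: Dict[Tuple[str, str], str] = {
    ("class", "example-class"): "method-A",
    ("motif", "example-motif"): "method-B",
    ("dna_strand", "double"): "method-A",
}


def recommend(query: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map each supplied feature to its tabulated method, or None if not tabulated."""
    return {
        feature: BEST_METHOD.get((feature, value))
        for feature, value in query.items()
    }


if __name__ == "__main__":
    # A query may supply any subset of the features the user knows.
    print(recommend({"class": "example-class", "fold": "unlisted-fold"}))
```

Returning one recommendation per supplied feature, rather than a single answer, lets the user see where different features point to different methods.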

CONCLUSIONS

Selecting the best method for identifying the binding sites in DNA-binding proteins is one of the immediate requirements for biologists to design experiments. We have addressed this problem by carefully analyzing the available prediction methods using nine different types of data sets based on structural information, motifs, DNA types and functional information. The one-to-one correspondence between each subclass of DNA-binding proteins and the best/worst prediction method is given for all the studied data sets. This information would be highly valuable for selecting the best method, both for understanding the recognition mechanism of specific proteins and for large-scale analysis of large data sets.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online: Supplementary Tables 1–9.

ACKNOWLEDGEMENTS

The authors thank the reviewers for their constructive comments. The authors wish to thank Professor Sandor Pongor for helpful discussions. R.N. and M.M.G. thank the Bioinformatics facility of Department of Biotechnology and Indian Institute of Technology Madras for computational facilities. We thank Oxford University Press for partially waiving the publication charges.

FUNDING

DST grant, Government of India [SR/SO/BB-0036/2011] and an ICGEB short-term visiting fellowship (to M.M.G.). Funding for open access charge: Department of Science and Technology, Government of India research grant (partial); Oxford University Press (partial waiver).

Conflict of interest statement. None declared.
