Abstract
DNA-binding proteins play critical roles in biological processes including gene expression, DNA packaging and DNA repair. They bind to DNA target sequences with different degrees of binding specificity, ranging from highly specific to non-specific. Alterations of DNA-binding specificity, due to either genetic variation or somatic mutations, can lead to various diseases. In this study, a comparative analysis of protein-DNA complex structures was carried out to investigate the structural features that contribute to binding specificity. Protein-DNA complexes were grouped into three general classes based on degrees of binding specificity: highly specific (HS), multi-specific (MS), and non-specific (NS). Our results show a clear trend of structural features among the three classes, including amino acid binding propensities, simple and complex hydrogen bonds, major/minor groove and base contacts, and DNA shape. We found that aspartate is enriched in highly specific DNA-binding proteins and predominantly binds to a cytosine through a single hydrogen bond or to two consecutive cytosines through bidentate hydrogen bonds. Two aromatic residues, histidine and tyrosine, are highly enriched in the HS and MS groups and may contribute to specific binding through different mechanisms. To further investigate the role of protein flexibility in specific protein-DNA recognition, we analysed the conformational changes between the bound and unbound states of DNA-binding proteins and their structural variations. The results indicate that highly specific and multi-specific DNA-binding domains undergo larger conformational changes upon DNA binding and have a larger degree of flexibility in both the bound and unbound states.
Keywords: transcription factor, flexibility, hydrogen bond, structural variations, conformational changes, DNA shape, aromatic residue, aspartate, π-interaction
Introduction
Specific interactions between proteins and their DNA target sequences are essential in many fundamental biological processes and aberrant changes in binding specificity can cause serious consequences 1–4. It has been demonstrated that altered binding specificity between mutated transcription factors and their DNA target sequences plays a role in a broad variety of cancers 2,5–7. On the other side of the specificity spectrum, many DNA-binding proteins can bind to a wide range of DNA sequences. These non-specific DNA-binding proteins are also critical for fundamental cellular functions, including processing and packaging of DNA 8.
DNA-binding specificity generally refers to two interrelated terms: “sequence specificity” and “degree of specificity” 9. For example, type II restriction endonucleases EcoRI and BamHI specifically recognize their DNA target sequences GAATTC and GGATCC, respectively. Both enzymes show very high degrees of specificity towards different DNA sequences. Some transcription factors, such as homeodomains Ubx (from Drosophila melanogaster) and Nkx3-1 (from Homo sapiens), bind to different DNA sequence patterns, but with similarly high sequence conservation 10. On the other hand, homeodomain Dbx1 (from Mus musculus) has a similar binding sequence pattern to Ubx, but most positions allow more variation and are less conserved 9,10. Most experimental and computational studies have focused on identifying sequence specificity or sequence patterns. No simple recognition rules between particular amino acids and specific DNA bases have been found, although some preferred pairings were observed 11–16. In this study, we focus on analysing structural determinants for different degrees of protein-DNA binding specificity.
Current structural studies range from individual cases to comparative analyses. Homing endonucleases 17–19 and zinc fingers 20–22 are two widely studied protein families. Ashworth et al. developed a computational model and applied it to redesign the specificity of a homing endonuclease, I-MsoI 23. In their model, the specificity is described by packing, hydrogen bonding, solvation and electrostatic interactions. Several comparative studies have also been conducted to examine DNA-binding specificity. Luscombe and Thornton investigated the effects of individual mutations on binding specificity using small datasets, due to the limited availability of protein-DNA complex structures at that time. They carried out a comparative analysis on two groups of transcription factors (highly specific and multi-specific) and non-specific DNA-binding proteins 3. Ashworth et al. predicted the contribution of each interface residue to the binding affinity and binding specificity of four types of DNA-binding proteins: a) helical-motif transcription factors, b) restriction endonucleases, c) homing endonucleases, and d) non-specific DNA-binding enzymes 24. Another comparative analysis was performed on nine SCOP superfamilies, including homing nucleases, ribbon-helix-helix, glucocorticoid receptor-like, zinc fingers, homeodomain-like, winged helix, P53-like, lambda repressor-like, and restriction endonuclease-like 25. By comparing the ratio of indirect/direct readout and the frequency of atomic interactions, Contreras-Moreira et al. concluded that these specificity features are generally conserved and superfamily-specific 25.
Two readout mechanisms are considered to contribute to the binding specificity between proteins and DNA, base readout and shape readout (also called direct and indirect readout, respectively) 3,26–29. The base readout describes contributions from direct interaction of protein side-chains with DNA bases. The shape readout, on the other hand, describes the role of DNA shape and indirect contacts between proteins and DNA 27,28,30. The combination of base and shape readouts provides a general picture for specific protein-DNA interactions. However, what controls the degree of binding specificity, or why some proteins are highly selective on binding sequences while others are less stringent, is still not clear.
Protein-DNA recognition is by nature a dynamic process that involves delicate structural fitting between proteins and DNA 31,32. However, the exact role of flexibility and intrinsic disorder in binding specificity is not well understood. As the specific interactions are mainly contributed by hydrogen bonding between proteins and DNA, high specificity between proteins and their cognate binding sequences is considered an optimized result of shape fit and binding thermodynamics. We have demonstrated previously that a point mutation F10V in P22 Arc repressor, which does not make direct DNA base contact, affects the degree of binding specificity by altering the flexibility of residues involved in direct base contacts 9. Therefore, a more complete description in terms of both static and dynamic features is needed to fully understand the specificity in protein-DNA recognition. With the advancement of structure determination techniques, the number of protein-DNA complex structures in the Protein Data Bank (PDB) is increasing at a higher rate 33. Currently there are over 3000 protein-DNA complex structures in PDB. The availability of a large number of protein-DNA complexes and their corresponding unbound protein structures makes it feasible to conduct a more comprehensive study of protein-DNA binding specificity. In this paper, we carried out a comparative analysis to investigate the static and dynamic structural features for protein-DNA binding specificity.
We first constructed datasets of protein-DNA complex structures and grouped these DNA-binding proteins into three general classes based on decreasing degrees of DNA-binding specificity: highly specific (HS, most type II restriction endonucleases with highly conserved sequence patterns), multi-specific (MS, transcription factors that bind to conserved sequences but allow a number of variations at various sites), and non-specific (NS) DNA-binding proteins that bind DNA sequences promiscuously. It should be noted that there are no distinct groups with respect to DNA-binding specificity; rather, we consider that DNA-binding proteins run a gamut of specificities from very specific (recognizing exact sequences) to non-specific. For example, type II restriction enzyme MvaI recognizes CCWGG (W can be either A or T). On the other hand, some transcription factors, such as some nuclear receptors, exhibit high specificity 34–36. In this study, type II restriction enzymes with lower binding specificity, such as BglI (recognition sequence GCCNNNN^NGCC, where N represents any base), are not included in the HS dataset to minimize the potential specificity overlap between the HS and MS groups. Even though these same three terms, non-specific, multi-specific and highly specific, have been used in a previous study by Luscombe and Thornton 3, the meanings of multi-specific and highly specific are different. In their work, the specificity levels are defined based on DNA-binding protein families that include sixteen transcription factors and one gamma delta resolvase. Highly specific represents the cases in which all family members bind to the same DNA sequence. In the multi-specific group, individual family members bind to different, but specific DNA sequences 3. Our classification, on the other hand, is based on individual proteins and includes the highly specific type II restriction endonucleases.
In addition to the three-class design, we used bound-unbound (or holo-apo) pairs for identifying dynamic structural features that contribute to binding specificity, such as the range of conformational change upon DNA-binding 32. Furthermore, to assess the relationship between protein flexibility and binding specificity, we compared the structural diversity of DNA-binding proteins, by comparing multiple apo and holo structures of the same DNA-binding protein.
Our results demonstrated a trend in several static structural features: amino acid propensities, interface size, number of residue-base contacts, backbone to base contact ratios, major to minor groove contact ratios, number of protein-DNA hydrogen bonds, and DNA shape parameters, among the three groups. We found that negatively charged aspartate is highly enriched in base interactions in highly specific DNA-binding proteins while it is depleted in multi-specific and non-specific DNA-binding proteins. Our data revealed a tight connection between aspartate and the cytosine base. We also showed the importance of two aromatic residues, tyrosine and histidine, in conferring specific protein-DNA binding. To our knowledge, this is the first large-scale comparative study to demonstrate the critical role of aspartate, tyrosine and histidine in specific protein-DNA recognition. In terms of dynamic features, we analysed the protein conformational changes upon DNA-binding and their structural variations in both free form and bound state. We found that highly specific DNA-binding proteins show larger conformational changes upon DNA-binding while the non-specific DNA-binding proteins have smaller structural variations and conformational changes.
Materials and Methods
Datasets
Three different datasets were generated in this study for different comparative analyses: (i) pdNR30, a non-redundant protein-DNA complex dataset, for investigation of static structural features related to protein-DNA interactions; (ii) pairNR30, a non-redundant set of bound-unbound pairs of DNA-binding domains, for comparing conformational changes upon DNA-binding; and (iii) svSet, a dataset for comparison of structural variations of DNA-binding domains.
A total of 3,098 protein-DNA complexes were selected from PDB 33. Of these complexes, some contain only DNA-binding domains while others represent full-length DNA-binding proteins, including signal-sensing domains or trans-activating domains besides DNA-binding domains. In this work, we used DNA-binding domains in protein-DNA complexes as comparison units to maintain consistency. For structural domain annotation, we combined the two most widely used structural classification databases, CATH 37 and SCOPe 38, with manual inspection if an annotation is not available in either database. A DNA-binding domain was selected if there are at least 4 protein-DNA contacts with a distance cutoff of 3.9 Å, and the domain has 40 or more amino acids.
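The domain selection rule above can be illustrated with a minimal sketch. It assumes that a "contact" is counted per residue (a residue with any heavy atom within the cutoff of any DNA heavy atom) and that the cutoff is inclusive; the text does not state either detail, and the function and argument names are hypothetical.

```python
import math

def contacting_residues(protein_atoms, dna_atoms, cutoff=3.9):
    """Return the ids of residues with at least one atom within `cutoff` Å of DNA.

    protein_atoms: list of (residue_id, (x, y, z)) heavy-atom records.
    dna_atoms: list of (x, y, z) heavy-atom coordinates.
    """
    hits = set()
    for res_id, xyz in protein_atoms:
        if res_id in hits:
            continue  # residue already counted as a contact
        if any(math.dist(xyz, d) <= cutoff for d in dna_atoms):
            hits.add(res_id)
    return hits

def is_dna_binding_domain(protein_atoms, dna_atoms, n_residues,
                          min_contacts=4, min_length=40):
    """Apply the two thresholds from the text: >= 4 contacts and >= 40 residues."""
    n_contacts = len(contacting_residues(protein_atoms, dna_atoms))
    return n_residues >= min_length and n_contacts >= min_contacts
```

A brute-force double loop is sufficient for a sketch; for thousands of structures a spatial index would be the more practical choice.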
Figure 1 shows how pdNR30 was generated. First, all the X-ray crystal structures of protein-DNA complexes were selected from PDB. A series of quality filtering steps were then carried out. X-ray structures with a resolution value greater than 3 Å and an R-factor greater than 0.3 were removed. Protein-DNA complexes with single-stranded DNA (ssDNA) were also filtered out. For the false ssDNA complexes, in which coordinates are provided for only one DNA chain of a double-stranded DNA, we used our in-house program PDA (Protein-DNA complex structure Analyzer) to reconstruct these protein-DNA complexes by calculating the positions of the missing complementary DNA chain 39. Since the main goal of this analysis is to study the structural features that contribute to the degree of protein-DNA binding specificity, removing mutant protein structures and non-cognate protein-DNA complexes is essential because such structures would add noise to our analysis. For example, researchers often use protein and/or DNA mutants to study the effects of mutations on protein-DNA binding specificity 40.
The DNA-binding domains that interact with double-stranded DNA in the complex structures were then annotated as HS (highly specific), MS (multi-specific), or NS (non-specific) DNA-binding proteins 3 based on their DNA-binding specificity and function. Type II restriction enzymes generally belong to the highly specific group and were selected based on enzyme classification number 3.1.21.4 and keywords in PDB, combined with manual inspection of the recognition sequences to ensure that the binding is highly specific. Transcription factors belong to the multi-specific group, since they generally recognize multiple conserved sequences. Transcription factors were selected using TFinDit, a data repository for known transcription factor-DNA complex structures 41. Except for histones, DNA polymerases and RNA polymerases, the annotation of non-specific DNA-binding proteins is not trivial; it was done by manual inspection of the PDB entry and related references. After clustering at a sequence identity of 30% using CD-HIT 42, the non-redundant set, pdNR30, was generated by selecting one representative from each cluster, based on resolution and the number of missing residues. The pdNR30 dataset has 28 HS, 115 MS and 52 NS DNA-binding domains in complex with DNA (Supplementary Table S1).
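The representative selection step can be sketched as follows. The text names the two criteria (resolution and number of missing residues) but not how they are combined, so the ordering used here (best resolution first, fewest missing residues as the tie-breaker) is an assumption, and the `Entry` record and its identifiers are hypothetical.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    pdb_id: str            # hypothetical identifier, for illustration only
    resolution: float      # in Å; a smaller value means better resolution
    missing_residues: int  # residues without coordinates in the structure

def pick_representative(cluster):
    """Pick one entry per cluster: lowest resolution value, then fewest missing residues."""
    return min(cluster, key=lambda e: (e.resolution, e.missing_residues))
```

Applying `pick_representative` to each CD-HIT cluster would yield one structure per cluster under these assumptions.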
The second dataset, pairNR30, was generated in a similar way except that we started with a list of DNA-binding domains with both bound and unbound structures in PDB. The DNA-binding domains in free, unbound state were selected if they have 100% sequence identity and at least 80% coverage with their corresponding structure in the dataset of bound structures. The pairNR30 dataset consists of 11 HS, 41 MS and 16 NS bound-unbound DNA-binding domain pairs (Supplementary Table S2).
The third dataset, svSet, has three components: (i) multiHolo, DNA-binding domains with at least 6 PDB structures in complex with cognate DNA; (ii) multiApo, DNA-binding domains with at least 6 structures in the unbound state; and (iii) multiApoHolo, DNA-binding domains with at least 4 structures in both the unbound state and the bound state with cognate DNA. This dataset was used to study the structural variations of DNA-binding domains in the free state and in complex with DNA. There are 6 HS, 32 MS, and 24 NS DNA-binding domains in the multiHolo dataset (Supplementary Table S3). Since the number of HS cases is small in the multiApo and multiApoHolo sets, we combined the HS and MS cases and compared specific (HS+MS) against non-specific (NS) DNA-binding domains. The multiApo set consists of 9 specific (HS+MS) and 6 non-specific (NS) DNA-binding domains (Supplementary Table S4) while the multiApoHolo set has 10 specific (HS+MS) and 4 non-specific (NS) DNA-binding domains (Supplementary Table S5).
Comparison of structural features of protein-DNA interactions
A comparative analysis of structural features that contribute to DNA-binding specificity was first carried out with the pdNR30 dataset that consists of a non-redundant dataset of DNA-binding domains in complex with DNA (28 HS, 115 MS and 52 NS DNA-binding domains). The structural features for protein-DNA interactions include: 1) protein side-chain/DNA-base binding propensities, 2) protein-DNA contact area (PDCA), 3) number of residue-base contacts (NRBC) 43, 4) the number and geometry of hydrogen bonds, 5) backbone to base contact ratio, 6) minor to major groove contact ratio, and 7) DNA shape.
The DNA binding propensity (p) for an amino acid is calculated as the ratio of the percentage of the amino acid in protein side-chain/DNA base contacts to the percentage of the amino acid in the specific dataset 43. Jackknife resampling was used to estimate the variances and potential bias of the data. PDCA is determined by calculating the difference in solvent accessible surface area (SASA) between the individual protein and DNA structures and the corresponding protein-DNA complexes 43. The solvent accessible surface areas were measured by Naccess with default parameters 44. Protein-DNA contacts were identified using a distance cutoff of 3.9 Å between side-chain heavy atoms and all DNA heavy atoms. These residue-DNA interactions were divided into two non-overlapping sets: (i) residues that are in contact with DNA bases (NRBC: number of residue-base contacts) and (ii) residues that are in contact with the DNA backbone only. We also calculated the NRBC density, the ratio of NRBC over the PDCA, which represents the number of residue-base contacts per Å².
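As an illustration of the propensity definition, the following minimal sketch computes p for one amino acid and a jackknife variance. The resampling unit is not stated in the text; leaving out one domain at a time is an assumption made here, as are all function names and the input layout.

```python
from collections import Counter

def propensity(contact_residues, all_residues, aa):
    """Share of `aa` among base-contacting residues divided by its share in the dataset."""
    contact_frac = Counter(contact_residues)[aa] / len(contact_residues)
    dataset_frac = Counter(all_residues)[aa] / len(all_residues)
    return contact_frac / dataset_frac

def jackknife_propensity(domains, aa):
    """Leave-one-domain-out jackknife; returns (full-sample estimate, variance).

    domains: list of (contact_residues, all_residues) pairs, one per domain.
    Assumes every leave-one-out subset still contains `aa` and at least one contact.
    """
    def pooled(subset):
        contacts = [r for c, _ in subset for r in c]
        residues = [r for _, a in subset for r in a]
        return propensity(contacts, residues, aa)

    n = len(domains)
    full = pooled(domains)
    loo = [pooled(domains[:i] + domains[i + 1:]) for i in range(n)]
    mean_loo = sum(loo) / n
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in loo)
    return full, variance
```

A value of p above 1 means the amino acid is over-represented among base contacts relative to its overall frequency in the dataset.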
Hydrogen bonds in protein-DNA complexes were identified with HBPLUS 45. In addition to simple hydrogen bonds, we also analysed the differences among the three specificity groups in terms of other types of hydrogen bond geometry, e.g., bidentate hydrogen bonds, which are defined as cases in which a residue forms more than one hydrogen bond with different acceptor and/or donor atoms (Supplementary Figure S1).
The DNA shape features, such as shear, stretch, stagger, shift, slide, rise, buckle, propeller, opening, tilt, roll, and twist, were measured using 3DNA 46. We selected nucleotides that are in contact with the protein, plus two more flanking nucleotides on each side, and compared the distributions of the DNA shape features among the three groups of DNA-binding domains. Major and minor groove width were also calculated using 3DNA, which reports the refined P-P distances 46.
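The nucleotide selection can be sketched as below. It assumes that the selected region is the contiguous span from the first to the last contacted nucleotide, extended by two positions on each side and clipped at the DNA ends; the text does not say how gaps between contacted nucleotides are handled. The per-position values would come from 3DNA output, which is not parsed here, and the function names are hypothetical.

```python
import statistics

def shape_window(contact_positions, dna_length, flank=2):
    """0-based positions of the contacted span plus `flank` nucleotides on each side."""
    if not contact_positions:
        return []
    start = max(0, min(contact_positions) - flank)
    end = min(dna_length - 1, max(contact_positions) + flank)
    return list(range(start, end + 1))

def median_shape_value(values_by_position, contact_positions, flank=2):
    """Median of one shape parameter over the selected window (needs >= 1 contact)."""
    window = shape_window(contact_positions, len(values_by_position), flank)
    return statistics.median(values_by_position[i] for i in window)
```

Base-pair step parameters such as roll and twist are defined between adjacent base pairs, so their position indexing would need a separate convention.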
The conformational change upon DNA-binding was calculated with two approaches using the pairNR30 dataset. The first approach is to calculate the Cα RMSD (root mean square deviation) between the unbound and bound conformations for a given DNA-binding protein. The RMSD is calculated by minimizing the Cα RMSD when superimposing two DNA-binding domain structures. In addition to calculating the Cα RMSD for all the residues in the DNA-binding domain, which is a useful measure to assess the overall conformational change, we also calculated the Cα RMSD in the DNA-binding pocket, by selecting the binding residues in the bound conformation, using a heavy atom distance cutoff of 3.9 Å. The Cα RMSD of the binding residues can provide more detailed information on the conformational adjustment of the pocket residues upon binding to DNA. The second approach is to compare Δχ1, the change of the side-chain torsion angle χ1 (the torsion angle about the Cα-Cβ bond) between the bound and unbound conformations. We compared the median Δχ1 and the median absolute deviation (MAD) of Δχ1 for each domain among the three groups.
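Both measures can be illustrated with a short sketch. The superposition method is not named in the text; the Kabsch algorithm is used here as one standard way to obtain the minimum RMSD. Treating Δχ1 as the smallest absolute angular difference (0 to 180 degrees) is also an assumption. The coordinate arrays are assumed to hold matched Cα atoms in the same residue order.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """Minimum RMSD between two (N, 3) coordinate sets after optimal superposition."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    P = P - P.mean(axis=0)              # remove translation
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q                         # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so the result is a proper rotation.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    diff = (R @ P.T).T - Q
    return float(np.sqrt((diff ** 2).sum() / len(P)))

def delta_chi1(chi_bound, chi_unbound):
    """Smallest absolute difference between two torsion angles, in degrees."""
    d = abs(chi_bound - chi_unbound) % 360.0
    return min(d, 360.0 - d)
```

The same `kabsch_rmsd` call can be applied either to all Cα atoms of the domain or only to those of the binding residues, depending on which subset of coordinates is passed in.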
The structural variations of DNA-binding domains were compared in the multiHolo, multiApo and multiApoHolo datasets based on RMSD differences. We calculated the median RMSD and MAD RMSD per DNA-binding domain, and compared the distributions among the three groups of DNA-binding proteins.
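A minimal sketch of the per-domain summary follows. It assumes that the RMSD values for a domain are taken over all pairs of its available structures, which the text does not state explicitly; `rmsd_fn` is a placeholder for any function that returns the RMSD between two structures.

```python
import statistics
from itertools import combinations

def median_and_mad(values):
    """Return (median, median absolute deviation from the median)."""
    values = list(values)
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return med, mad

def structural_variation(structures, rmsd_fn):
    """Median and MAD of the RMSD over all pairs of structures of one domain."""
    pairwise = [rmsd_fn(a, b) for a, b in combinations(structures, 2)]
    return median_and_mad(pairwise)
```

The median and MAD are used rather than the mean and standard deviation because they are less sensitive to a few outlying structures.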
Statistical tests
The Kruskal-Wallis test, a multi-sample non-parametric method, was employed to test whether there are significant differences of each of the features among the three specificity groups, HS, MS and NS. If the p-value of the Kruskal-Wallis test is lower than 0.05, we would carry out a one-sided Mann–Whitney U test, to identify the significant differences between any two of the HS, MS and NS distributions.
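The two-stage procedure can be sketched with SciPy as follows. Which direction was tested for each pair is not stated in the text, so this sketch reports both directions; no multiple-testing correction is applied because none is described. The function name and the dictionary layout are hypothetical.

```python
from itertools import permutations
from scipy.stats import kruskal, mannwhitneyu

def compare_groups(groups, alpha=0.05):
    """Kruskal-Wallis test, then one-sided Mann-Whitney U tests if significant.

    groups: dict mapping a group name (e.g. "HS") to a list of feature values.
    Returns (kruskal_p, pairwise), where pairwise[(a, b)] is the p-value for
    the alternative that values in group a tend to be greater than in group b.
    """
    kw_p = kruskal(*groups.values()).pvalue
    pairwise = {}
    if kw_p < alpha:
        for a, b in permutations(groups, 2):
            result = mannwhitneyu(groups[a], groups[b], alternative="greater")
            pairwise[(a, b)] = result.pvalue
    return kw_p, pairwise
```

When the Kruskal-Wallis p-value is not below `alpha`, the pairwise dictionary is left empty, mirroring the conditional follow-up described above.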
Results
Amino acid propensity for DNA-binding
Arginine and lysine are the two dominant residues in overall protein-DNA contacts (18.4% and 14.9% respectively) as both are positively charged and can bind to the negatively charged DNA backbone through electrostatic interactions (Supplementary Figure S2). Distributions of amino acids that are in contact with DNA, including both backbone contacts and base contacts, are similar among the three groups except for a relatively higher number of aspartate residues in the HS group (Supplementary Figure S2). The catalytic sites in type II restriction endonucleases usually contain aspartate, which may result in the high prevalence of aspartate in the HS group (the percentage changed from 8.4% to 5.9% after removing catalytic residues, which is still higher than those in the MS and NS groups with 1.2% and 3.4%, respectively). Even though amino acid distributions are similar, the majority of the residues in the NS group are involved in DNA backbone contacts, while residues in the HS and MS groups participate in more direct residue-base interactions (Supplementary Figure S3).
To study which residues are preferred in specific protein-DNA binding, we compared the residue propensities for interacting with DNA bases among the three groups (see Methods). A binding propensity larger than 1 suggests that the amino acid is enriched in protein-DNA base interactions. Figure 2A shows that arginine is enriched in all three groups (pARG is 3.3, 3.2 and 4.8 for the HS, MS and NS groups respectively) while lysine is only highly enriched in the NS group (pLYS is 1.2, 0.9 and 2.5 for the HS, MS and NS groups respectively). Both residues have higher base interacting propensities in the NS group than those in the HS and MS groups. The high propensities of DNA base contact for arginine and lysine in the NS group are rather counterintuitive. A closer look at the data suggests that we need to be careful when interpreting the high propensities of arginine and lysine in the NS group in terms of their contributions to specific protein-DNA interactions. First of all, there are only 65 total residue-base contacts in the whole NS dataset. Among those contacts, 19 (~30%) are arginine-base contacts and 12 (~18%) are lysine-base contacts. Secondly, unlike the HS and MS groups, in which arginine and lysine bind predominantly in the major groove and form hydrogen bonds with guanine, arginine and lysine in the NS group are mainly involved in minor groove contacts (10 out of 19 for arginine and 10 out of 12 for lysine) (Figure 2B and Supplementary Figure S3). It is generally accepted that minor groove contacts do not confer much specificity because the minor groove lacks a discriminative pattern of hydrogen bonds, either direct or water-mediated 28,47, although minor groove interactions with residues may contribute to binding specificity in individual cases (more discussion later) 48.
Asparagine, glutamine, serine, and threonine, which can form hydrogen bonds with DNA bases, are enriched in the HS and MS groups, but not in the NS group, suggesting their important roles in specific protein-DNA interactions. The hydrophobic residues such as alanine, valine, proline, leucine, and isoleucine, are depleted in all cases.
The two negatively charged residues, aspartate and glutamate, have low propensities in protein-DNA base interactions except for aspartate in the HS group (pASP is 1.37, 0.38 and 0.27 for HS, MS, and NS, respectively) (Figure 2A). In general, negatively charged residues are not favourable in protein-DNA interactions due to the negatively charged DNA backbone and electronegative groups on all the bases except for cytosine 49. In addition, unlike asparagine and glutamine, which can act as both hydrogen bond acceptor and donor, aspartate and glutamate can only serve as hydrogen bond acceptors. Therefore, it is not surprising to see that they are depleted in protein-DNA base interactions in general. One interesting exception is the high enrichment of aspartate in the HS group (Figure 2A). Further analysis revealed a striking pattern as shown in Table 1. All the aspartate residues that contact DNA bases are involved in hydrogen bonding with major groove atoms in the highly specific DNA-binding domains. Out of the 19 hydrogen bonds, 18 involve a cytosine. Though aspartate and glutamate have very low propensities in the MS and NS groups, their major groove contacts are primarily with a cytosine as well. While both cytosine and adenine have one hydrogen bond donor in the major groove, adenine has an electronegative surface, making it unfavourable for interacting with aspartate when compared to cytosine. In the minor groove, except for one case, all other aspartates and glutamates form hydrogen bonds with a guanine, which is not surprising since only guanine can serve as a hydrogen bond donor in the minor groove (Table 1 and Supplementary Tables S6 and S7). More importantly, for aspartate-cytosine specific interactions in the HS group, aspartate forms bidentate hydrogen bonds in 5 cases (accounting for 10 of the 19 total atom-level hydrogen bonds) with two consecutive cytosines (Figure 3 and Supplementary Table S6).
The stereochemical properties and hydrogen bond patterns of DNA bases and aspartate make the aspartate-cytosine interaction very specific (Figure 3). There are no bidentate hydrogen bonds for glutamate found in our non-redundant dataset. However, Ecl18kI (PDBID: 2FQZ with a recognition sequence ^CCNGG), not included in the dataset due to similarity with other enzymes, has a bidentate hydrogen bond between residue Glu187 and two consecutive cytosines 50. In general, aspartate is preferred over glutamate, probably due to the shorter side-chain of aspartate. The observation of specific hydrogen bonding of aspartate and glutamate with cytosine may explain why both amino acids are rarely seen in the MS and NS groups, as most transcription factors allow variations at different sites and non-specific binding proteins are not sequence-specific.
Table 1.
Dataset | Asp (major groove) | Asp (minor groove) | Glu (major groove) | Glu (minor groove)
---|---|---|---|---
HS | 19 (18C, 1G) | 5 (5G) | 2 (2C) | 1 (1G)
MS | 4 (4C) | 0 | 10 (7C, 3A) | 0
NS | 0 | 1 (1G) | 0 | 1 (1T)
Another interesting observation is the high enrichment of two aromatic residues, histidine and tyrosine, in the HS (pHIS = 1.9, pTYR = 1.8) and MS groups (pHIS = 1.3, pTYR = 2.6), but not in the NS group (pHIS = 0.6, pTYR = 0.9). However, histidine and tyrosine may contribute to specific DNA-binding through different mechanisms. Histidine residues in the HS and MS groups primarily form hydrogen bonds with guanine (Supplementary Table S8). The difference between these two groups is that 9 of the 10 histidine-base contacts in the HS group form hydrogen bonds while only half of the histidine-base contacts in the MS group are involved in hydrogen bonding. As for tyrosine, only a small percentage of the base contacts participate in hydrogen bonding (data not shown), suggesting that, unlike for histidine, hydrogen bonding does not play a major role in specific protein-DNA binding for tyrosine. Previous studies have shown the importance of aromatic residues and π-π interactions in protein-DNA complexes 51,52. π interactions occur when the negatively charged electron cloud of an aromatic compound interacts with positively charged atoms or cations 53. While π interactions are generally thought to add stability and affinity to macromolecule interactions 51,52, more recent studies have suggested that aromatic residues may play a major role in determining binding specificity in molecular recognition, such as the interaction between carbohydrates and proteins 54. Wilson et al. recently investigated the abundance, structure and strength of π interactions between aromatic residues and DNA bases and demonstrated that protein-DNA π interactions are more prevalent than previously thought 51,55,56. Yet, very little is known about the role of aromatic-base π interactions in protein-DNA binding specificity 51,55,56.
Our results suggest that tyrosine may play a more important role in conferring specific protein-DNA interactions through π interactions, given its high propensities in the HS and MS groups, its low propensity in the NS group, and the scarcity of its hydrogen bonds with bases. Tryptophan has low occurrences, with two residues in each of the three groups. Therefore, the high propensity of tryptophan in the NS group is not conclusive due to the small sample size. Moreover, both tryptophan residues in the NS group interact with the minor groove of the DNA (Figure 2B). As for phenylalanine, about 50% of the base contacts are in the minor groove in both the HS and MS groups; therefore, it is not clear how much it contributes to specific protein-DNA interactions.
Interaction interface
Comparison of the interaction surface among the three groups shows a trend similar to their degree of binding specificity. The HS group has the largest protein-DNA contact area (PDCA) while the NS group has the smallest contact area (one-sided two-sample p-values < 0.0002) (Figure 4A). Since the interaction surface represents the total contact area between protein and DNA, a combination of both non-specific and specific interactions, we also compared the number of residue-base contacts (NRBC) 43, which captures more of the specific interactions. As with PDCA, NRBC decreases as the protein-DNA binding specificity decreases (one-sided two-sample p-values < 0.0005) (Figure 4B). In terms of the number of residue-base contacts per Å² (NRBC density), we found that the HS and MS groups have similar NRBC density, while the NS group has much lower NRBC density (one-sided two-sample p-values < 2×10⁻⁹) (Figure 4C).
The percentage of DNA base contacts is much higher in the HS and MS groups than in the NS group, where most contacts are non-specific contacts between amino acids and DNA-backbone atoms (Figure 5A). We also compared the major and minor groove contacts, as major groove contacts represent the primary contribution to binding specificity due to the sequence-specific patterns for hydrogen bonds in the major groove. The percentage of major groove contacts in the HS and MS groups (81.1% and 82.3% respectively) is more than twice that in the NS group (35.4%) (Figure 5B). In terms of the number of major groove contacts, we observed a clear trend similar to the binding specificity. The HS and MS groups have a significantly higher number of major groove contacts than the NS group (one-sided two-sample p-values < 9×10⁻¹²) (Figure 5C). The difference between the HS and MS groups is also significant (one-sided two-sample p-values < 0.005).
The number of minor groove contacts does not follow the trend seen for the major groove contacts. Interestingly, there is a statistically significant difference between the HS group and the MS/NS groups, with the HS group having more minor groove contacts (Figure 5D). Even though minor groove contacts are generally considered non-specific, it has been demonstrated that minor groove contacts can contribute to protein-DNA binding specificity. Joshi et al. previously reported that the functional specificity of a Hox protein is mediated by minor groove contacts 48. More specifically, the minor groove contacts are a result of sequence-dependent DNA shape recognition. It has been reported that the minor groove shape, which deviates from the canonical B-type DNA structure, also plays a role in sequence-specific recognition for BsoBI endonuclease 57. Therefore, the relatively large number of minor groove contacts in the HS group may be the result of DNA shape (discussed in the next section). Taken together, the HS and MS groups have similar ratios of residue-base contacts and similar percentages of DNA base and major groove contacts, which are significantly larger than those in the NS group. Between the HS and MS groups, HS has larger contact areas and higher numbers of DNA base and major groove contacts than the MS group, which is consistent with a previous study that shows a larger interface in the restriction endonuclease superfamily than in the transcription factor superfamilies 25.
Hydrogen bonds have been considered a major factor in protein-DNA binding specificity 13. Our analysis shows the decreasing pattern from the HS group to the NS group in terms of the total number of hydrogen bonds between protein and DNA (one-sided two-sample p-values<0.0003) (Figure 6A) as well as between protein and DNA bases (one-sided two-sample p-values<8×10−9) (Figure 6B). While the formation of hydrogen bonds is important for specific protein-DNA binding, the geometry of the hydrogen bonds can also help discern specific and non-specific interactions. The number of bidentate hydrogen bonds between protein and DNA also shows the same trend as the degree of binding specificity (one-sided two-sample p-values<2×10−3) (Figure 6C).
DNA shape
To compare the DNA shape in protein-DNA complexes, we used the program 3DNA to derive a number of structural features, including shear, stretch, stagger, buckle, propeller, opening, shift, slide, rise, tilt, roll and twist 46. We computed the median value of each feature for each domain, using only the nucleotides in contact with the protein plus two flanking bases on each side, and compared the distributions among the three groups. The results show that the per-DNA median values of propeller, opening, rise and roll differ significantly among the HS, MS, and NS groups (Kruskal-Wallis test p-values < 0.02) (Figure 7). Further analysis using the Mann-Whitney U test shows that the DNA bound by domains in the HS group has larger propeller (one-sided two-sample p-values < 0.02) and rise (one-sided two-sample p-values < 0.002) median values, and lower opening (one-sided two-sample p-values < 0.05) and roll (one-sided two-sample p-values < 0.004) median values than in the MS and NS groups. We also compared the distributions of these four features by pooling all the data within each of the HS, MS, and NS groups and found similar significant differences (data not shown). These results indicate that the HS group has distinct shape features when compared with the other two groups, suggesting a key role for these shape features in high binding specificity. These shape differences may also explain the larger number of minor groove contacts in the HS group. The high propeller and rise may make the minor groove more accessible to residues and offer more distinctive patterns for different DNA sequences, thus contributing more to binding specificity.
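The two-stage comparison described above (an omnibus Kruskal-Wallis test across the three groups, followed by one-sided Mann-Whitney U tests) can be illustrated with a minimal standard-library sketch. This is not the authors' implementation: the function names and the example numbers are hypothetical, ties receive average ranks but no tie correction is applied, and the Mann-Whitney p-value uses a normal approximation without continuity correction.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_wallis(*groups):
    """H statistic (no tie correction) and chi-square p-value for 3 groups."""
    if len(groups) != 3 or any(len(g) == 0 for g in groups):
        raise ValueError("this sketch expects exactly three non-empty groups")
    pooled = list(chain.from_iterable(groups))
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # With 2 degrees of freedom the chi-square survival function is exp(-h/2).
    return h, math.exp(-h / 2)


def mann_whitney_greater(x, y):
    """One-sided test that x tends to be larger than y (normal approximation)."""
    ranks = average_ranks(list(x) + list(y))
    n1, n2 = len(x), len(y)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return u, 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)


# Hypothetical per-domain median rise values (illustrative numbers only).
hs = [9.0, 10.0, 11.0, 12.0]
ms = [5.0, 6.0, 7.0, 8.0]
ns = [1.0, 2.0, 3.0, 4.0]
h_stat, p_kw = kruskal_wallis(hs, ms, ns)
u_stat, p_mw = mann_whitney_greater(hs, ns)
```

For real data, an established statistics package such as SciPy (`scipy.stats.kruskal`, and `scipy.stats.mannwhitneyu` with `alternative="greater"`) would be the more appropriate choice, since it handles ties and small samples properly.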
We also examined the major and minor groove widths of the nucleotides in contact with the protein (plus two flanking bases on each side) using 3DNA, comparing the minimum, average, and maximum width for each DNA structure in the pdNR30 dataset. Our analysis shows that the major groove width follows a pattern similar to that of binding specificity, with the HS group having the widest major groove regardless of the metric used (Supplementary Figure S4 A–C). As for the minor groove width, the DNA structures in complex with highly specific DNA-binding domains have wider minor grooves than those in complex with multi-specific and non-specific DNA-binding domains (one-sided two-sample p-values < 0.05), with the MS group having the smallest minor groove width (Supplementary Figure S4 D–F). This is in part consistent with the observation by Contreras-Moreira et al. that restriction endonucleases have a larger proportion of indirectly read-out bases 25. Our data confirm the importance of DNA shape in specific protein-DNA interactions 27,28.
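The windowing and summary step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the helper names are hypothetical, the window is assumed to be one contiguous span from the first to the last contacted nucleotide extended by two positions on each side, and positions where no groove width is available are assumed to be represented as `None`.

```python
from statistics import mean


def binding_window(contact_idx, n_nucleotides, flank=2):
    """Indices from the first to the last contacted nucleotide, plus flanks."""
    if not contact_idx:
        return []
    lo = max(0, min(contact_idx) - flank)
    hi = min(n_nucleotides - 1, max(contact_idx) + flank)
    return list(range(lo, hi + 1))


def summarize_widths(widths, window):
    """Minimum, mean and maximum groove width over the window (None skipped)."""
    vals = [widths[i] for i in window if widths[i] is not None]
    if not vals:
        raise ValueError("no groove widths available in the window")
    return {"min": min(vals), "mean": mean(vals), "max": max(vals)}


# Hypothetical per-position groove widths in angstroms (illustrative only).
major_widths = [None, 11.0, 12.0, 13.0, None]
window = binding_window([2], n_nucleotides=5)
summary = summarize_widths(major_widths, window)
```

The window is clipped at the ends of the DNA, so fewer than two flanking positions are used when the contacts lie close to a terminus.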
Conformational changes upon DNA-binding
We calculated the conformational changes in terms of the Cα RMSD of all residues (Figure 8A) and of the residues in contact with DNA bases (Figure 8B) using pairNR30, a non-redundant dataset of bound/unbound DNA-binding domains. Besides Cα RMSD, which reflects backbone conformational changes, we also examined side-chain conformational changes of the binding residues based on changes in the χ1 dihedral angle, Δχ1, including the distribution of the median Δχ1 per domain (Figure 8C) and the MAD of Δχ1, which captures the variability of Δχ1 (Figure 8D). The RMSD-based conformational changes are larger in the highly specific group (Figure 8A and 8B). Statistical analysis revealed that the domains in the HS group have significantly higher Cα RMSD for all residues (p-values < 0.02) and for DNA base-contacting residues (p-values < 0.004). There is no significant difference between the MS group and the NS group. As for the χ1 dihedral angle changes, although the median values for the HS group are larger than those for the MS and NS groups, the differences are not statistically significant (Figure 8C and 8D). The Δχ1 distributions for all DNA-binding residues among the three groups were also compared, but no statistically significant differences were found (data not shown).
Our results suggest that DNA-binding proteins with a higher degree of binding specificity tend to undergo larger conformational changes than non-specific DNA-binding proteins and transcription factors. Since the protein-DNA interaction interface of the highly specific proteins is larger, these proteins require backbone flexibility in order to achieve a precise interface fit for high specificity 58.
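The per-domain side-chain summary (median and MAD of Δχ1) can be sketched as below. The sketch makes two assumptions that are not stated in this section: Δχ1 is taken as the absolute angular difference wrapped into the range 0 to 180 degrees, and MAD is read as the median absolute deviation. The function names and angle values are hypothetical.

```python
from statistics import median


def delta_chi1(bound_deg, unbound_deg):
    """Absolute chi1 change in degrees, wrapped into [0, 180]."""
    d = abs(bound_deg - unbound_deg) % 360.0
    return 360.0 - d if d > 180.0 else d


def median_and_mad(values):
    """Median and median absolute deviation of a list of numbers."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad


# Hypothetical (bound, unbound) chi1 angles for the binding residues of one domain.
pairs = [(-170.0, 170.0), (60.0, -60.0), (65.0, 55.0)]
changes = [delta_chi1(b, u) for b, u in pairs]
domain_median, domain_mad = median_and_mad(changes)
```

Wrapping matters because dihedral angles are periodic: a change from 170 to -170 degrees is a 20 degree rotation, not 340 degrees.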
Structural variations of DNA-binding domains
In addition to studying structural differences between the bound and unbound structures, another way to explore the role of protein flexibility and dynamics in DNA-binding specificity is to compare the conformational diversity of DNA-binding proteins in the free state and in the bound form. Our analysis of three datasets, multiHolo, multiApo and multiApoHolo, revealed that highly specific and multi-specific DNA-binding domains have a larger range of structural variations in both the bound (Figure 9A–B) and free forms (Figure 9C–D) than the non-specific DNA-binding domains. The plots based on the multiApoHolo set also show that the NS group has smaller structural variations in terms of median and MAD RMSD than the HS and MS groups (Figure 9E–F). These results suggest that the flexibility of DNA-binding proteins may contribute to their higher degree of binding specificity, which is consistent with previous findings obtained using different metrics 59.
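One way to quantify the structural variation of a domain with multiple available structures is to summarize all pairwise RMSDs by their median and MAD. The sketch below is an assumption-laden illustration rather than the procedure used in the study: it assumes the structures have already been superposed, that each is given as an equal-length list of Cα coordinates, and that MAD denotes the median absolute deviation. All names and coordinates are hypothetical.

```python
import math
from itertools import combinations
from statistics import median


def rmsd(a, b):
    """RMSD between two equal-length lists of (x, y, z) points, pre-superposed."""
    if len(a) != len(b) or not a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def variation_summary(structures):
    """Median and MAD of the pairwise RMSDs among a set of structures."""
    values = [rmsd(a, b) for a, b in combinations(structures, 2)]
    if not values:
        raise ValueError("need at least two structures")
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med, mad


# Three hypothetical two-atom "structures" that differ only by a z shift.
s1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
s2 = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
s3 = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]
med_rmsd, mad_rmsd = variation_summary([s1, s2, s3])
```

In practice the superposition step (for example, a least-squares fit of the Cα atoms) would precede the RMSD calculation; it is omitted here to keep the sketch short.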
Discussion
Knowledge of the structural basis of binding specificity is central to our understanding of protein-DNA interactions, and the evolution and divergence of protein-DNA binding specificity 60. Such knowledge is also essential to practical applications in rational design of new proteins with novel binding specificity in biotechnology and medicine 23,61–63. Our comparative analyses show a clear trend in terms of both static and dynamic structural features with the degree of protein-DNA binding specificity.
Arginine and lysine are known to be abundant in protein-DNA interfaces (Supplementary Figure S2). Both arginine and lysine can form multiple types of hydrogen bonds with DNA, primarily with guanine in the major groove 13, which is a key factor in specific protein-DNA interactions in the HS and MS groups. They are also the two major residues for non-specific interactions between their positively charged side chains and the negatively charged DNA backbone. In the NS group, the majority of arginine and lysine residues interact with the DNA backbone. For the residues in the NS group that do interact with DNA bases, the contacts occur primarily in the minor groove. The non-specific and specific interactions of arginine and lysine may work together to achieve high specificity in the process of protein-DNA recognition. For specific DNA-binding proteins, the non-specific electrostatic interactions between arginine/lysine and the DNA backbone or minor groove can help locate the target sites quickly 58. Once the target sites are identified, specific residue-base hydrogen bonds in the major groove can contribute to sequence specificity.
One important finding from our analysis is the high enrichment of aspartate-base contacts in the group of highly specific (HS) DNA-binding domains. This enrichment in the HS group is not due to a high occurrence of aspartate in highly specific DNA-binding domains, as revealed by the base binding propensity data (Figure 2A) and the percentage of aspartate residues that participate in base interactions in each DNA-binding domain (Supplementary Figure S5). Aspartate is a negatively charged residue and its side-chain atoms can only serve as hydrogen bond acceptors, which makes it unfavourable for interaction with DNA, given the negatively charged backbone and electronegative surface, except with cytosine. As such, aspartate interacts with cytosine with high specificity, especially with two consecutive cytosine bases through bidentate hydrogen bonds, as aspartate has two hydrogen bond acceptors (Figure 3). This may also explain why aspartate is rarely seen in DNA base contacts in the MS and NS groups, since DNA-binding proteins in both groups allow variations to different degrees. In case studies, Jantz and Berg used designed zinc finger proteins and showed that when a residue in one of the fingers was changed from asparagine to aspartate, the overall affinity decreased but the contacting base changed from adenine to cytosine with higher specificity 49. Pingoud et al. studied SsoII and the evolutionary relationship between different subgroups related to this protein and found that the position corresponding to Glu187 in SsoII is highly conserved across several other restriction enzymes, where it is either an aspartate or a glutamate 64. To our knowledge, our comparative analysis is the first large-scale study to show the specific recognition of cytosine by aspartate.
Histidine and tyrosine appear to be enriched in highly specific and multi-specific DNA-binding proteins. In addition to their capability to form hydrogen bonds with bases, both aromatic residues can contribute to protein-DNA binding through π-interactions. Our data revealed that histidine contributes to specific DNA binding primarily through hydrogen bonding with guanines, while tyrosine uses π-interactions to achieve binding specificity. Recent studies demonstrated that π-interactions are more prevalent in protein-DNA recognition than previously thought 51,55,56. However, the role of π-interactions in specific protein-DNA recognition is still not clear. Our data suggest that these two aromatic residues play key roles in specific protein-DNA binding through hydrogen bonds and π-interactions. Based on these results, we have developed an integrative energy function that adds two atomic-level terms, π-interaction energy and hydrogen bond energy, to a knowledge-based multi-body potential for structure-based prediction of transcription factor binding sites. Our results showed that incorporating the π-interaction and hydrogen bond energies greatly improved the prediction accuracy of transcription factor binding sites 14,65.
Not surprisingly, our data show that there are significantly larger base/backbone and major/minor groove contact ratios for DNA-binding proteins in the HS and MS groups when compared to the non-specific DNA-binding proteins. While the contact ratios and density are similar between HS and MS proteins, the total contact number and interaction interface in HS proteins are larger than those in the MS group (Figures 4 and 5). This is consistent with previous results by Contreras-Moreira et al. 25. Similarly, the number of simple and complex hydrogen bonds is another key contributing factor for the degree of DNA-binding specificity (Figure 6).
Since DNA shape has been implicated in protein-DNA binding specificity 27,28,30, we also looked for shape differences among the three groups by systematic analysis. However, comparing shape features is not as straightforward as examining contact features, since there are both local and global shape features. Nevertheless, our results showed that the DNA bound by highly specific DNA-binding domains has a larger rise between bases, which can contribute to more base contacts since the bases can be more exposed 25. The results for the opening, propeller and roll parameters, as well as the major and minor groove widths, are also statistically significant. These differences may be a result of the flexibility of both protein and DNA, which helps binding specificity through fitting and fine-tuning to achieve optimal interactions.
In addition to the “static” protein-DNA contact features and the differences in DNA shape, we investigated the dynamic structural features of DNA-binding domains. Currently, there are two widely accepted models for macromolecular recognition, induced fit and conformational selection 66. We compared both the conformational changes upon DNA binding (mimicking the induced-fit model) and the structural variations of each protein (mimicking the conformational selection model). Based on a limited number of cases in the datasets, we showed that the highly specific and multi-specific DNA-binding domains have a larger degree of flexibility in the bound and unbound states, and larger conformational changes upon DNA binding. This is in accordance with the hypothesis that specific DNA-binding proteins need to explore different conformations in order to optimize their binding to the target DNA recognition sites 9,67, whereas non-specific DNA-binding proteins are not required to explore as many conformations in the process 47. The flexibility involved in the specific protein-DNA binding process could be a combination of structural variations and induced conformational changes upon binding for both protein and DNA. For example, a very recent work by Chen and Pettitt showed that the flexibility of a specific DNA sequence is about 40% intrinsic and 60% induced, while no appreciable non-specific DNA bending is induced 68.
In conclusion, protein-DNA recognition is a complex mechanism that can be dissected in terms of static and dynamic structural features that contribute to the degrees of binding specificity. Not only does the knowledge help us better understand the possible mechanisms of specific protein-DNA interactions, these features can also be used to assess the quality of protein-DNA docking predictions.
Supplementary Material
Acknowledgments
This work was supported by the National Institutes of Health [R15GM110618 to J.G]; and National Science Foundation [DBI0844749 and DBI1356459 to J.G].
References
- 1. Schott JJ, Benson DW, Basson CT, Pease W, Silberbach GM, Moak JP, Maron BJ, Seidman CE, Seidman JG. Congenital heart disease caused by mutations in the transcription factor NKX2-5. Science. 1998;281(5373):108–111. doi: 10.1126/science.281.5373.108.
- 2. Filippova GN, Qi CF, Ulmer JE, Moore JM, Ward MD, Hu YJ, Loukinov DI, Pugacheva EM, Klenova EM, Grundy PE, Feinberg AP, Cleton-Jansen AM, Moerland EW, Cornelisse CJ, Suzuki H, Komiya A, Lindblom A, Dorion-Bonnet F, Neiman PE, Morse HC, 3rd, Collins SJ, Lobanenkov VV. Tumor-associated zinc finger mutations in the CTCF transcription factor selectively alter tts DNA-binding specificity. Cancer research. 2002;62(1):48–52.
- 3. Luscombe NM, Thornton JM. Protein-DNA interactions: amino acid conservation and the effects of mutations on binding specificity. Journal of molecular biology. 2002;320(5):991–1009. doi: 10.1016/s0022-2836(02)00571-5.
- 4. Latchman DS. Transcription-factor mutations and disease. The New England journal of medicine. 1996;334(1):28–33. doi: 10.1056/NEJM199601043340108.
- 5. Gohler T, Jager S, Warnecke G, Yasuda H, Kim E, Deppert W. Mutant p53 proteins bind DNA in a DNA structure-selective mode. Nucleic acids research. 2005;33(3):1087–1100. doi: 10.1093/nar/gki252.
- 6. Thukral SK, Lu Y, Blain GC, Harvey TS, Jacobsen VL. Discrimination of DNA binding sites by mutant p53 proteins. Molecular and cellular biology. 1995;15(9):5196–5202. doi: 10.1128/mcb.15.9.5196.
- 7. Chene P. Mutations at position 277 modify the DNA-binding specificity of human p53 in vitro. Biochemical and biophysical research communications. 1999;263(1):1–5. doi: 10.1006/bbrc.1999.1294.
- 8. Agback P, Baumann H, Knapp S, Ladenstein R, Hard T. Architecture of nonspecific protein-DNA interactions in the Sso7d-DNA complex. Nature structural biology. 1998;5(7):579–584. doi: 10.1038/836.
- 9. Song W, Guo JT. Investigation of arc repressor DNA-binding specificity by comparative molecular dynamics simulations. J Biomol Struct Dyn. 2015;33(10):2083–2093. doi: 10.1080/07391102.2014.997797.
- 10. Mathelier A, Fornes O, Arenillas DJ, Chen CY, Denay G, Lee J, Shi W, Shyr C, Tan G, Worsley-Hunt R, Zhang AW, Parcy F, Lenhard B, Sandelin A, Wasserman WW. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic acids research. 2016;44(D1):D110–115. doi: 10.1093/nar/gkv1176.
- 11. Matthews BW. Protein-DNA interaction. No code for recognition. Nature. 1988;335(6188):294–295. doi: 10.1038/335294a0.
- 12. Pabo CO, Nekludova L. Geometric analysis and comparison of protein-DNA interfaces: why is there no simple code for recognition? Journal of molecular biology. 2000;301(3):597–624. doi: 10.1006/jmbi.2000.3918.
- 13. Luscombe NM, Laskowski RA, Thornton JM. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic acids research. 2001;29(13):2860–2874. doi: 10.1093/nar/29.13.2860.
- 14. Liu Z, Mao F, Guo JT, Yan B, Wang P, Qu Y, Xu Y. Quantitative evaluation of protein-DNA interactions using an optimized knowledge-based potential. Nucleic acids research. 2005;33(2):546–558. doi: 10.1093/nar/gki204.
- 15. Takeda T, Corona RI, Guo JT. A knowledge-based orientation potential for transcription factor-DNA docking. Bioinformatics. 2013;29(3):322–330. doi: 10.1093/bioinformatics/bts699.
- 16. Xu B, Yang Y, Liang H, Zhou Y. An all-atom knowledge-based energy function for protein-DNA threading, docking decoy discrimination, and prediction of transcription-factor binding profiles. Proteins. 2009;76(3):718–730. doi: 10.1002/prot.22384.
- 17. Ashworth J, Taylor GK, Havranek JJ, Quadri SA, Stoddard BL, Baker D. Computational reprogramming of homing endonuclease specificity at multiple adjacent base pairs. Nucleic acids research. 2010;38(16):5601–5608. doi: 10.1093/nar/gkq283.
- 18. Murphy PM, Bolduc JM, Gallaher JL, Stoddard BL, Baker D. Alteration of enzyme specificity by computational loop remodeling and design. Proceedings of the National Academy of Sciences of the United States of America. 2009;106(23):9215–9220. doi: 10.1073/pnas.0811070106.
- 19. Ulge UY, Baker DA, Monnat RJ., Jr Comprehensive computational design of mCreI homing endonuclease cleavage specificity for genome engineering. Nucleic acids research. 2011;39(10):4330–4339. doi: 10.1093/nar/gkr022.
- 20. Paillard G, Deremble C, Lavery R. Looking into DNA recognition: zinc finger binding specificity. Nucleic acids research. 2004;32(22):6673–6682. doi: 10.1093/nar/gkh1003.
- 21. Kaplan T, Friedman N, Margalit H. Ab initio prediction of transcription factor targets using structural knowledge. PLoS computational biology. 2005;1(1):e1. doi: 10.1371/journal.pcbi.0010001.
- 22. Siggers TW, Honig B. Structure-based prediction of C2H2 zinc-finger binding specificity: sensitivity to docking geometry. Nucleic acids research. 2007;35(4):1085–1097. doi: 10.1093/nar/gkl1155.
- 23. Ashworth J, Havranek JJ, Duarte CM, Sussman D, Monnat RJ, Jr, Stoddard BL, Baker D. Computational redesign of endonuclease DNA binding and cleavage specificity. Nature. 2006;441(7093):656–659. doi: 10.1038/nature04818.
- 24. Ashworth J, Baker D. Assessment of the optimization of affinity and specificity at protein-DNA interfaces. Nucleic acids research. 2009;37(10):e73. doi: 10.1093/nar/gkp242.
- 25. Contreras-Moreira B, Sancho J, Angarica VE. Comparison of DNA binding across protein superfamilies. Proteins. 2010;78(1):52–62. doi: 10.1002/prot.22525.
- 26. Michael Gromiha M, Siebers JG, Selvaraj S, Kono H, Sarai A. Intermolecular and intramolecular readout mechanisms in protein-DNA recognition. Journal of molecular biology. 2004;337(2):285–294. doi: 10.1016/j.jmb.2004.01.033.
- 27. Rohs R, West SM, Sosinsky A, Liu P, Mann RS, Honig B. The role of DNA shape in protein-DNA recognition. Nature. 2009;461(7268):1248–1253. doi: 10.1038/nature08473.
- 28. Rohs R, Jin X, West SM, Joshi R, Honig B, Mann RS. Origins of specificity in protein-DNA recognition. Annual review of biochemistry. 2010;79:233–269. doi: 10.1146/annurev-biochem-060408-091030.
- 29. Zhou T, Shen N, Yang L, Abe N, Horton J, Mann RS, Bussemaker HJ, Gordan R, Rohs R. Quantitative modeling of transcription factor binding specificities using DNA shape. Proceedings of the National Academy of Sciences of the United States of America. 2015;112(15):4654–4659. doi: 10.1073/pnas.1422023112.
- 30. Rohs R, West SM, Liu P, Honig B. Nuance in the double-helix and its role in protein-DNA recognition. Current opinion in structural biology. 2009;19(2):171–177. doi: 10.1016/j.sbi.2009.03.002.
- 31. Fuxreiter M, Simon I, Bondos S. Dynamic protein-DNA recognition: beyond what can be seen. Trends in biochemical sciences. 2011;36(8):415–423. doi: 10.1016/j.tibs.2011.04.006.
- 32. Janin J, Sternberg MJ. Protein flexibility, not disorder, is intrinsic to molecular recognition. F1000 biology reports. 2013;5:2. doi: 10.3410/B5-2.
- 33. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic acids research. 2000;28(1):235–242. doi: 10.1093/nar/28.1.235.
- 34. Jolma A, Yan J, Whitington T, Toivonen J, Nitta KR, Rastas P, Morgunova E, Enge M, Taipale M, Wei G, Palin K, Vaquerizas JM, Vincentelli R, Luscombe NM, Hughes TR, Lemaire P, Ukkonen E, Kivioja T, Taipale J. DNA-binding specificities of human transcription factors. Cell. 2013;152(1–2):327–339. doi: 10.1016/j.cell.2012.12.009.
- 35. Gordan R, Murphy KF, McCord RP, Zhu C, Vedenko A, Bulyk ML. Curated collection of yeast transcription factor DNA binding specificity data reveals novel structural and gene regulatory insights. Genome biology. 2011;12(12):R125. doi: 10.1186/gb-2011-12-12-r125.
- 36. Gama-Castro S, Salgado H, Peralta-Gil M, Santos-Zavaleta A, Muniz-Rascado L, Solano-Lira H, Jimenez-Jacinto V, Weiss V, Garcia-Sotelo JS, Lopez-Fuentes A, Porron-Sotelo L, Alquicira-Hernandez S, Medina-Rivera A, Martinez-Flores I, Alquicira-Hernandez K, Martinez-Adame R, Bonavides-Martinez C, Miranda-Rios J, Huerta AM, Mendoza-Vargas A, Collado-Torres L, Taboada B, Vega-Alvarado L, Olvera M, Olvera L, Grande R, Morett E, Collado-Vides J. RegulonDB version 7.0: transcriptional regulation of Escherichia coli K-12 integrated within genetic sensory response units (Gensor Units). Nucleic acids research. 2011;39(Database issue):D98–105. doi: 10.1093/nar/gkq1110.
- 37. Sillitoe I, Lewis TE, Cuff A, Das S, Ashford P, Dawson NL, Furnham N, Laskowski RA, Lee D, Lees JG, Lehtinen S, Studer RA, Thornton J, Orengo CA. CATH: comprehensive structural and functional annotations for genome sequences. Nucleic acids research. 2015;43(Database issue):D376–381. doi: 10.1093/nar/gku947.
- 38. Fox NK, Brenner SE, Chandonia JM. SCOPe: Structural Classification of Proteins--extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic acids research. 2014;42(Database issue):D304–309. doi: 10.1093/nar/gkt1240.
- 39. Kim R, Guo JT. PDA: an automatic and comprehensive analysis program for protein-DNA complex structures. BMC Genomics. 2009;10(Suppl 1):S13. doi: 10.1186/1471-2164-10-S1-S13.
- 40. Sapienza PJ, Rosenberg JM, Jen-Jacobson L. Structural and thermodynamic basis for enhanced DNA binding by a promiscuous mutant EcoRI endonuclease. Structure. 2007;15(11):1368–1382. doi: 10.1016/j.str.2007.09.014.
- 41. Turner D, Kim R, Guo JT. TFinDit: transcription factor-DNA interaction data depository. BMC Bioinformatics. 2012;13(1):220. doi: 10.1186/1471-2105-13-220.
- 42. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi: 10.1093/bioinformatics/btl158.
- 43. Kim R, Corona RI, Hong B, Guo JT. Benchmarks for flexible and rigid transcription factor-DNA docking. BMC structural biology. 2011;11:45. doi: 10.1186/1472-6807-11-45.
- 44. Hubbard SJ, Thornton JM. NACCESS. Department of Biochemistry and Molecular Biology, University College London; 1993.
- 45. McDonald IK, Thornton JM. Satisfying hydrogen bonding potential in proteins. Journal of molecular biology. 1994;238(5):777–793. doi: 10.1006/jmbi.1994.1334.
- 46. Lu XJ, Olson WK. 3DNA: a software package for the analysis, rebuilding and visualization of three-dimensional nucleic acid structures. Nucleic acids research. 2003;31(17):5108–5121. doi: 10.1093/nar/gkg680.
- 47. Murphy FVt, Churchill ME. Nonsequence-specific DNA recognition: a structural perspective. Structure. 2000;8(4):R83–89. doi: 10.1016/s0969-2126(00)00126-x.
- 48. Joshi R, Passner JM, Rohs R, Jain R, Sosinsky A, Crickmore MA, Jacob V, Aggarwal AK, Honig B, Mann RS. Functional specificity of a Hox protein mediated by the recognition of minor groove structure. Cell. 2007;131(3):530–543. doi: 10.1016/j.cell.2007.09.024.
- 49. Jantz D, Berg JM. Probing the DNA-binding affinity and specificity of designed zinc finger proteins. Biophysical journal. 2010;98(5):852–860. doi: 10.1016/j.bpj.2009.11.021.
- 50. Bochtler M, Szczepanowski RH, Tamulaitis G, Grazulis S, Czapinska H, Manakova E, Siksnys V. Nucleotide flips determine the specificity of the Ecl18kI restriction endonuclease. EMBO J. 2006;25(10):2219–2229. doi: 10.1038/sj.emboj.7601096.
- 51. Wintjens R, Lievin J, Rooman M, Buisine E. Contribution of cation-pi interactions to the stability of protein-DNA complexes. Journal of molecular biology. 2000;302(2):395–410. doi: 10.1006/jmbi.2000.4040.
- 52. Michael Gromiha MSC, Suwa M. Influence of cation-pi interactions in protein-DNA complexes. Polymer. 2004;45(2):633.
- 53. Hunter C, Sanders J. The nature of pi-pi interactions. Journal of the American Chemical Society. 1990;1121(4):5525–5534.
- 54. Asensio JL, Arda A, Canada FJ, Jimenez-Barbero J. Carbohydrate-aromatic interactions. Acc Chem Res. 2013;46(4):946–954. doi: 10.1021/ar300024d.
- 55. Wilson KA, Kellie JL, Wetmore SD. DNA-protein pi-interactions in nature: abundance, structure, composition and strength of contacts between aromatic amino acids and DNA nucleobases or deoxyribose sugar. Nucleic acids research. 2014;42(10):6726–6741. doi: 10.1093/nar/gku269.
- 56. Baker CM, Grant GH. Role of aromatic amino acids in protein-nucleic acid recognition. Biopolymers. 2007;85(5–6):456–470. doi: 10.1002/bip.20682.
- 57. van der Woerd MJ, Pelletier JJ, Xu S, Friedman AM. Restriction enzyme BsoBI-DNA complex: a tunnel for recognition of degenerate DNA sequences and potential histidine catalysis. Structure. 2001;9(2):133–144. doi: 10.1016/s0969-2126(01)00564-0.
- 58. Kalodimos CG, Biris N, Bonvin AM, Levandoski MM, Guennuegues M, Boelens R, Kaptein R. Structure and flexibility adaptation in nonspecific and specific protein-DNA complexes. Science. 2004;305(5682):386–389. doi: 10.1126/science.1097064.
- 59. Andrabi M, Mizuguchi K, Ahmad S. Conformational changes in DNA-binding proteins: relationships with precomplex features and contributions to specificity and stability. Proteins. 2014;82(5):841–857. doi: 10.1002/prot.24462.
- 60. Baker CR, Tuch BB, Johnson AD. Extensive DNA-binding specificity divergence of a conserved transcription regulator. Proceedings of the National Academy of Sciences of the United States of America. 2011;108(18):7493–7498. doi: 10.1073/pnas.1019177108.
- 61. Uil TG, Haisma HJ, Rots MG. Therapeutic modulation of endogenous gene function by agents with designed DNA-sequence specificities. Nucleic acids research. 2003;31(21):6064–6078. doi: 10.1093/nar/gkg815.
- 62. Porteus MH, Baltimore D. Chimeric nucleases stimulate gene targeting in human cells. Science. 2003;300(5620):763. doi: 10.1126/science.1078395.
- 63. Urnov FD, Miller JC, Lee YL, Beausejour CM, Rock JM, Augustus S, Jamieson AC, Porteus MH, Gregory PD, Holmes MC. Highly efficient endogenous human gene correction using designed zinc-finger nucleases. Nature. 2005;435(7042):646–651. doi: 10.1038/nature03556.
- 64. Pingoud V, Kubareva E, Stengel G, Friedhoff P, Bujnicki JM, Urbanke C, Sudina A, Pingoud A. Evolutionary relationship between different subgroups of restriction endonucleases. J Biol Chem. 2002;277(16):14306–14314. doi: 10.1074/jbc.M111625200.
- 65. Farrel A, Murphy J, Guo J-t. Structure-based prediction of transcription factor binding specificity using an integrative energy function. Bioinformatics. 2016 doi: 10.1093/bioinformatics/btw264. conditionally accepted.
- 66. Csermely P, Palotai R, Nussinov R. Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends in biochemical sciences. 2010;35(10):539–546. doi: 10.1016/j.tibs.2010.04.009.
- 67. Zhou HX. Intrinsic disorder: signaling via highly specific but short-lived association. Trends in biochemical sciences. 2012;37(2):43–48. doi: 10.1016/j.tibs.2011.11.002.
- 68. Chen C, Pettitt BM. DNA Shape versus Sequence Variations in the Protein Binding Process. Biophysical journal. 2016;110(3):534–544. doi: 10.1016/j.bpj.2015.11.3527.
- 69. Lu XJ, Olson WK. 3DNA: a versatile, integrated software system for the analysis, rebuilding and visualization of three-dimensional nucleic-acid structures. Nature protocols. 2008;3(7):1213–1227. doi: 10.1038/nprot.2008.104.