Proceedings of the National Academy of Sciences of the United States of America. 2014 Oct 13;111(48):17140–17145. doi: 10.1073/pnas.1410569111

Protein−DNA binding in the absence of specific base-pair recognition

Ariel Afek a, Joshua L Schipper b, John Horton b, Raluca Gordân b,1, David B Lukatsky a,1
PMCID: PMC4260554  PMID: 25313048

Significance

Understanding molecular mechanisms of how regulatory proteins, called transcription factors (TFs), recognize their specific binding sites encoded into genomic DNA represents one of the central, long-standing problems of molecular biophysics. Strikingly, our experiments demonstrate that DNA context characterized by certain repeat symmetries surrounding specific TF binding sites significantly influences binding specificity. We expect that our results will significantly impact the understanding of molecular, biophysical principles of transcriptional regulation, and significantly improve our ability to predict how variations in DNA sequences, i.e., mutations or polymorphisms, and protein concentrations influence gene expression programs in living cells.

Keywords: protein−DNA binding, nonspecific protein−DNA binding, transcriptional regulation

Abstract

Until now, it has been reasonably assumed that specific base-pair recognition is the only mechanism controlling the specificity of transcription factor (TF)−DNA binding. Contrary to this assumption, here we show that nonspecific DNA sequences possessing certain repeat symmetries, when present outside of specific TF binding sites (TFBSs), statistically control TF−DNA binding preferences. We used high-throughput protein−DNA binding assays to measure the binding levels and free energies of binding for several human TFs to tens of thousands of short DNA sequences with varying repeat symmetries. Based on statistical mechanics modeling, we identify a new protein−DNA binding mechanism induced by DNA sequence symmetry in the absence of specific base-pair recognition, and experimentally demonstrate that this mechanism indeed governs protein−DNA binding preferences.


Forty years ago, von Hippel et al. demonstrated that nonspecific protein−DNA binding is an important biophysical mechanism operating in a living cell (1). This seminal work makes it possible to interpret experiments that measured how transcription factors (TFs) search for their specific target sites flanked by nonconsensus sequence elements (1–10). A specific consensus motif is a short DNA sequence, typically 6–20 base pairs (bp), that possesses an enhanced binding affinity for a particular TF. For example, the sequence CACGTG represents the specific consensus motif for the human protein Max used in this study (Fig. 1). The process of establishing specific, consensus protein−DNA binding requires the formation of precise geometrical fit between the protein and its consensus DNA motif, accompanied by the formation of specific hydrogen and electrostatic contacts at the protein−DNA binding interface (6, 7) (Fig. 1). In addition to binding to their consensus DNA motifs, transcription factors can also bind, albeit with lower affinity, to DNA regions lacking any consensus motifs. The term “nonspecific protein−DNA binding” (6) is typically used to describe these weaker interactions. Von Hippel and Berg suggested classifying nonspecific protein−DNA binding into two related mechanisms (6). The first mechanism includes protein binding to its mutated specific motifs that retain some residual, reduced specificity. The second mechanism is largely DNA sequence independent, and it involves electrostatic binding modulated by the overall DNA geometry (6). Despite significant experimental progress, molecular mechanisms responsible for these two types of nonspecific binding remain poorly understood, and the free energy of nonspecific protein−DNA binding has not been systematically characterized (11–14). The interplay between consensus and nonconsensus DNA sequence elements emerges as a dominant factor that governs protein−DNA binding preferences. However, this interplay is also poorly understood (15, 16). Until now, it has been reasonably assumed that specific (consensus) base-pair recognition must control the genome-wide specificity of TF−DNA binding.

Fig. 1.


Examples of specific protein−DNA binding, involving proteins used in this study. Crystal structures of specific protein−DNA complexes formed by proteins from the two structural families explored in this work: bHLH family (Max) and E2F/DP family (E2F4:Dp2).

Contrary to this assumption, here we identify a general mechanism for protein−DNA binding in the absence of specific base-pair recognition, and show that it stems from statistical interactions between proteins and DNA sequence correlations, i.e., statistically repeated DNA sequence patterns (17). We use the term “nonconsensus protein−DNA binding” to describe such statistical interactions.

Using high-throughput protein−DNA binding assays combined with statistical mechanics analysis, we demonstrate here that nonconsensus protein−DNA binding induced by DNA sequence correlations is an entropy-dominated, statistical effect. Contrary to the case of specific protein−DNA binding, which stems from a single protein−DNA binding site, the nonconsensus effect characterized in our study is nonlocal, as it stems from multiple nonspecific interactions between protein and statistically repeated DNA sequence patterns. Here we show that this effect is quantitatively significant, and that, for natural genomic sequences, its strength is comparable to the effect of mutations in the specific TF motif.

In addition, we directly measure, for the first time, the free energy of nonconsensus protein−DNA binding. We use tens of thousands of computationally designed DNA sequences, each 36 bp long, with varying symmetry and length scale of DNA sequence correlations, but having identical GC content and an identical specific protein−DNA binding site. We demonstrate that statistically, on average, the nonconsensus effect alone (coming exclusively from the flanking DNA regions, and without contribution from the specific protein−DNA binding site) contributes a free energy on the order of 1 kBT ≈ 0.6 kcal/mol per DNA sequence (kB is the Boltzmann constant and T is the temperature), for human TF Max. We show that, even for short DNA sequences, this nonconsensus effect induces a nearly threefold difference in the amount of Max protein bound to DNA molecules containing an identical specific binding site, but different symmetries and correlation scales in the flanking regions (Fig. 2B).

Fig. 2.


Direct experimental test of nonconsensus protein−DNA binding in the absence of specific base-pair recognition, using computationally designed DNA sequences with identical specific motifs, identical nucleotide content, and flanking regions with different nonconsensus sequence elements. (A) Examples of computationally designed 36-mer DNA sequences possessing different sequence repeat symmetries. The sequences shown in the example were generated at 1/ξ = 0.3 (in dimensionless units). (B) Measured binding preferences of the Max protein toward designed DNA sequences characterized by different symmetries and length scales of DNA sequence correlations. The legend shows the symmetry types, where α represents A, T, C, or G. For each symmetry type, each point on the corresponding curve represents the measured average PBM fluorescent intensity (representing the average concentration of the Max protein bound to DNA) over a few hundred DNA sequences designed at a given value of ξ. For example, to obtain one point at 1/ξ = 0.3, for the [αNα] symmetry, we used the measured PBM intensity from 404 different DNA sequences designed at this value of ξ. To compute error bars, for each value of ξ, we divided measured fluorescent intensities into four groups, computed the average intensity for each group, and computed the SD of the average intensities. The color code of the sequences corresponds to the color code used in A. The fluorescent intensity is given in dimensionless units; higher values correspond to stronger binding events. (C) Computed nonconsensus free energy per bp, $f = \langle F \rangle_{\mathrm{TF}}/M$ (in units of kBT), shows a strong, negative linear correlation with the measured Max−DNA binding preferences. (D) Probability distribution of the measured Max−DNA binding intensities from the entire designed DNA library shows a large variability in the strength of Max binding to DNA sequences with identical specific motifs.
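The error-bar procedure described in the caption of Fig. 2B can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name `group_sd_error_bar`, the round-robin assignment of sequences to the four groups, and the use of the sample SD are all assumptions, because the caption does not state how the groups were formed.

```python
import statistics


def group_sd_error_bar(intensities, n_groups=4):
    """Return (mean intensity, SD of the per-group mean intensities).

    Assumption: sequences are assigned to groups round-robin
    (element k goes to group k % n_groups).
    """
    groups = [intensities[k::n_groups] for k in range(n_groups)]
    if any(len(group) == 0 for group in groups):
        raise ValueError("need at least one intensity per group")
    group_means = [statistics.fmean(group) for group in groups]
    # Assumption: sample SD (n - 1 in the denominator) of the group means.
    return statistics.fmean(intensities), statistics.stdev(group_means)


# Hypothetical intensities for one symmetry type at one value of xi.
mean_intensity, error_bar = group_sd_error_bar([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
```

With the toy input above the four group means are 3, 4, 5, and 6, so the error bar is the SD of those four numbers.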

The magnitude of the identified nonconsensus effect reaches as much as 66% of consensus (specific) TF−DNA binding. This demonstrates that the identified effect of nonconsensus protein−DNA binding is highly significant compared with consensus (specific) binding. In addition, we demonstrate that for DNA sequences lacking a specific motif, the magnitude of the nonconsensus effect is as significant as for DNA sequences containing a specific motif. We also demonstrate that the nonconsensus effect depends on the length of repeated DNA sequence elements in distal regions away from the location of the specific motif. We suggest therefore that for longer DNA sequences, such as open chromatin regions in the promoters of human genes, which are hundreds of bp long (18, 19), the nonconsensus effect will be even stronger and could significantly affect protein−DNA binding and transcriptional regulation genome-wide.

Results

Experimental Design.

To assess the influence of nonconsensus sequence elements on protein−DNA binding, we used a stochastic design procedure to generate sets of DNA sequences with a common consensus motif and flanks characterized by different DNA sequence correlations (Fig. 2A). In particular, we selected four different sequence symmetries, and for each symmetry type, we computationally designed sequence sets with different values of the length scale of DNA sequence correlations (see SI Appendix, Fig. S1). The larger the correlation scale, the higher the DNA sequence symmetry, and the larger the number of repeated sequence elements for each symmetry type (SI Appendix, Fig. S1). The notion of the correlation scale in our sequence design is qualitatively analogous to the correlation length of the one-dimensional Ising model (20), or analogous to the frequency of the nearest-neighbor base sequences in DNA (21). All DNA sequences in our experimental design have identical nucleotide content, and all of them have an identical specific motif in the center of the sequence, flanked by the nonconsensus background. The sequences are 36 bp long, with the 10-bp specific motif located in the center.

We initially focused on a well-characterized human transcription factor, Max (Fig. 1). We used a large, 10-bp specific DNA motif (GTCACGTGAC) instead of the conventional 6-bp specific motif for the Max protein (CACGTG) to significantly reduce the possibility that a residual specific binding might lead to the variation in the experimentally measured protein−DNA binding levels (Fig. 2A). Our design procedure leaves the specific motif intact, and rearranges only the sequence flanks. We therefore exclusively probe the influence of different nonconsensus sequence elements on the protein−DNA binding strength, and thus we expect that the measured variability in the binding signal stems from nonspecific protein−DNA binding, i.e., binding that occurs in the absence of specific base-pair recognition. We used the in vitro protein-binding microarray (PBM) technology (22–24) to simultaneously measure the strength of protein binding (i.e., the binding fluorescent intensity; see Methods) for all ∼29,000 computationally designed DNA sequences containing the constant motif GTCACGTGAC within flanks characterized by different DNA sequence correlations.
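The layout constraints described above (a 36-bp sequence, the constant 10-bp motif in the center, and flanks that are rearranged while the nucleotide content stays fixed) can be illustrated with a short sketch. It does not reproduce the stochastic design procedure of SI Appendix, which also controls the symmetry type and the correlation scale. The helper names, the two example flanks, and the assumption of two 13-bp flanks are hypothetical, and the sketch does not check whether a shuffle accidentally creates an additional motif.

```python
import random
from collections import Counter

MOTIF = "GTCACGTGAC"           # constant 10-bp specific motif from the text
TOTAL_LENGTH = 36              # sequence length from the text
FLANK_LENGTH = (TOTAL_LENGTH - len(MOTIF)) // 2  # assumed 13 bp on each side


def assemble(left_flank, right_flank, motif=MOTIF):
    """Place the motif between two flanks of FLANK_LENGTH bp each."""
    if len(left_flank) != FLANK_LENGTH or len(right_flank) != FLANK_LENGTH:
        raise ValueError("each flank must be FLANK_LENGTH bp long")
    return left_flank + motif + right_flank


def shuffle_flanks(sequence, rng):
    """Permute the 26 flank nucleotides; the motif and base counts are unchanged."""
    left = sequence[:FLANK_LENGTH]
    right = sequence[FLANK_LENGTH + len(MOTIF):]
    flank = list(left + right)
    rng.shuffle(flank)
    flank = "".join(flank)
    return assemble(flank[:FLANK_LENGTH], flank[FLANK_LENGTH:])


# Hypothetical starting flanks, chosen only for illustration.
seed_sequence = assemble("ACGTACGTACGTA", "TGCATGCATGCAT")
shuffled = shuffle_flanks(seed_sequence, random.Random(0))
```

Every sequence produced this way shares the motif position and the nucleotide content of the starting sequence, which is the property that lets differences in binding be attributed to the arrangement of the flanks.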

High-Throughput Measurements of Nonconsensus Protein−DNA Binding to Computationally Designed DNA Sequences.

The Max−DNA binding preferences measured on our custom PBM demonstrate that for each chosen symmetry type, the length scale of DNA sequence correlations is the key order parameter that controls the binding strength (Fig. 2B). In addition, the free energy of nonconsensus protein−DNA binding computed using a simple protein−DNA random-binder model without fitting parameters (17) demonstrates an excellent agreement with the experimental results (Fig. 2C) (see SI Appendix and our previous work (17, 25) for details on free-energy calculations). Given that all DNA sequences on our custom PBM contain the same consensus binding motif and differ only in the flanks, we conclude that the observed variability in the strength of Max binding (Fig. 2D) is governed by DNA sequence correlations of nonconsensus elements present in the flanking regions (Fig. 2C). To provide additional support for our results, we have tested two other human proteins, Mad2 and c-Myc (henceforth referred to as Mad and Myc, respectively), which belong to the same structural family as Max and have similar specific motifs. In addition, we performed measurements at a lower concentration of TF Max (SI Appendix, Fig. S2). In all these cases, we obtained results similar to those shown in Fig. 2, strongly supporting the generality of our findings.

Measured Free Energy of Nonconsensus Protein−DNA Binding.

In addition to measuring protein−DNA binding levels, our high-throughput protein−DNA binding assay allows us to directly measure the relative nonconsensus protein−DNA binding free energy (Fig. 3). In particular, we begin by defining the dissociation constant, Kd = [P][D]/[PD], where [P], [D], and [PD] denote the concentrations of unbound protein, unbound DNA molecules, and protein bound to DNA, respectively. The protein−DNA binding free energy, ΔG(ξ) = kBT ln(Kd(ξ)), can be defined separately for each group of DNA sequences with a given symmetry and at a given value of the correlation scale, ξ. The free energy, ΔG(ξ), contains both the contribution stemming from the specific binding motif and the contribution from nonconsensus binding. This free energy, ΔG(ξ), cannot be extracted directly from our experiment. However, we can extract the free-energy difference that characterizes exclusively the nonconsensus protein−DNA binding,

$$\Delta\Delta G(\xi) = \Delta G(\xi) - \Delta G_{\mathrm{rand}} \approx k_B T \ln\left(\frac{[PD]_{\mathrm{rand}}}{[PD]_{\xi}}\right),$$

where [PD]ξ is the average TF−DNA binding intensity (directly measured in the experiment) for DNA sequences computationally designed with a given symmetry and at a given value of the correlation scale, ξ, and [PD]rand is the measured average TF−DNA binding intensity for the random sequence set. As described above, all DNA sequences in both sets have identical GC content and identical specific motif located in the center.
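As a worked illustration of this definition, the sketch below converts a pair of average binding intensities into ΔΔG in units of kBT and then into kcal/mol. The intensity values are hypothetical, the function names are not from the study, and the temperature of 298.15 K is an assumption made only to reproduce the conversion 1 kBT ≈ 0.6 kcal/mol.

```python
import math

# Boltzmann constant expressed per mole (gas constant), in kcal/(mol*K).
K_B_KCAL_PER_MOL_K = 0.0019872041


def ddg_in_kbt(mean_intensity_xi, mean_intensity_rand):
    """Return ln([PD]_rand / [PD]_xi), i.e. the free-energy difference in units of kBT."""
    if mean_intensity_xi <= 0 or mean_intensity_rand <= 0:
        raise ValueError("average intensities must be positive")
    return math.log(mean_intensity_rand / mean_intensity_xi)


def kbt_to_kcal_per_mol(value_in_kbt, temperature_k=298.15):
    """Convert an energy in units of kBT to kcal/mol (assumed temperature)."""
    return value_in_kbt * K_B_KCAL_PER_MOL_K * temperature_k


# Hypothetical averages: the designed set binds three times more protein
# than the random set, so the free-energy difference is -ln(3) kBT.
ddg = ddg_in_kbt(mean_intensity_xi=3000.0, mean_intensity_rand=1000.0)
ddg_kcal = kbt_to_kcal_per_mol(ddg)
```

A threefold change in bound protein therefore corresponds to a free-energy difference of about 1.1 kBT in magnitude, which is why a ΔΔG of order 1 kBT is enough to produce a large change in occupancy.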

Fig. 3.


Measured free energy of nonconsensus protein−DNA binding specifies the average, statistical strength of the nonconsensus effect. Figure shows the measured nonconsensus protein−DNA binding free-energy difference, ΔΔG = ΔG(ξ) − ΔGrand, where ΔGrand is the free energy of protein binding to random DNA sequences, and ΔG represents free energy for sequences designed with different correlation scales, ξ, and different DNA sequence symmetries. All DNA sequences have identical GC content, and an identical specific binding motif in the center of each sequence.

The results of the measurements presented in Fig. 3 show that statistically, on average, the nonconsensus protein−DNA binding delivers the free-energy difference, ΔΔG, which varies within the range of 1 kBT ≈ 0.6 kcal/mol per DNA sequence. Although this value is not large, it induces a nearly threefold difference in the amount of protein bound to DNA (Fig. 2B). In particular, Fig. 2B demonstrates that the concentration of the Max protein bound to DNA sequences with the [αNα] symmetry characterized by the longest correlation scale (1/ξ ≈ 0.3) is nearly threefold larger, on average, than the concentration of the Max protein bound to DNA sequences with the [ATCG] symmetry at 1/ξ ≈ 0.7. Importantly, the measured nonconsensus effect stems exclusively from the 26-bp flanking DNA regions, and without contribution from the specific protein−DNA binding site. We suggest, therefore, that for longer DNA sequences, such as open chromatin regions across the human genome (18, 19), the nonconsensus effect will likely be much stronger and could significantly affect protein−DNA binding genome-wide.

The measured nonconsensus free energy is also in excellent agreement with the theoretically predicted free energy based on a simple random-binder model (SI Appendix, Fig. S3). This agreement provides a proof-of-principle for the entropy-dominated mechanism for nonconsensus protein−DNA binding that we discuss in Discussion.

To further validate the statistical significance of the results presented in Fig. 3, we have performed an additional experiment at a lower concentration of the Max protein (50 nM, compared with 100 nM used in the experiment described above). Next, we separated each group of DNA sequences (with a given symmetry and correlation scale) into two nonoverlapping subgroups. We computed the ΔΔG using the first subgroup of sequences and the 50-nM Max PBM data (SI Appendix, Fig. S4A), and, separately, we computed the ΔΔG using the second subgroup of sequences and the 100-nM Max PBM data (SI Appendix, Fig. S4B). The almost perfect agreement obtained using two different experiments and two nonoverlapping groups of sequences (linear correlation coefficient R ≈ 0.99, and the P value, p ≈ 10⁻²³; SI Appendix, Fig. S4C) provides a very strong validation for the high accuracy of our free-energy measurements.

Comparing the Strength of Nonconsensus and Consensus (Specific) TF−DNA Binding.

We now ask the question: How does the magnitude of nonconsensus protein−DNA binding compare with the magnitude of consensus (specific) protein−DNA binding? To answer this question, we compared the measured TF−DNA binding intensities for Max, Mad, and Myc proteins to DNA sequences containing the specific motif GTCACGTGAC with TF−DNA binding intensities to DNA sequences in the absence of such motif or mutated variants of it (i.e., to negative control DNA sequences) (SI Appendix, Fig. S5). Two important conclusions follow from this analysis. First, the presence of the specific motif leads statistically, on average, to at most 2 kBT free-energy difference compared with the negative control. Second, the magnitude of the identified nonconsensus effect constitutes as much as 45–66% of consensus (specific) TF−DNA binding (SI Appendix, Figs. S5 and S6). This result is remarkable, as it demonstrates that the identified effect of nonconsensus protein−DNA binding is highly significant compared with consensus binding.

To further validate the estimated magnitude, we used the previously published Max-DNA dissociation constants, Kd, obtained using the microfluidic platform for in vitro characterization of TF−DNA interactions (MITOMI) (26). We compared those measurements with our PBM measurements performed using 24 DNA sequences (containing specific Max-binding motifs GnnnnGTGGG) from ref. 26, and we obtained an excellent linear correlation between the free-energy differences measured using these two methods (R ≈ 0.87, with the P value, p ≈ 4×10⁻⁸⁸) (SI Appendix, Figs. S7 and S8).

We also performed an additional PBM experiment where we compared binding intensities of Max, Mad, and Myc to 36-bp-long DNA sequences designed with [αNα] symmetry and containing the consensus motif (as above), with corresponding binding intensities to 36-bp-long DNA sequences designed with [αNα] symmetry and lacking the consensus motif (SI Appendix, Fig. S9). Strikingly, for the latter group of DNA sequences lacking the consensus motif, the magnitude of the nonconsensus effect remains as significant as for DNA sequences containing the consensus motif (SI Appendix, Fig. S9).

High-Throughput Measurements of Nonconsensus Protein−DNA Binding to Genomic DNA Sequences.

Next, we investigate how nonconsensus DNA sequence elements affect TF binding to DNA sites in their native genomic sequence context, using genomic-context protein-binding microarray (gcPBM) assays (15). To generate sequence libraries for our gcPBM measurements, we scanned available high-throughput, in vivo genomic TF−DNA binding profiles for six human proteins (16, 18): Max, Mad, and Myc, which belong to the basic helix–loop–helix (bHLH) family, and E2F1, E2F4, and DP1, which belong to the E2F/DP family (see Methods). For each protein, we selected genomic 36-mer DNA sequences, such that each sequence contains a specific motif in the center of the sequence, surrounded by nonconsensus genomic background that does not contain any additional specific motifs for this protein. As shown in Fig. 4, the measured TF−DNA binding profiles for all six proteins exhibit a statistically significant correlation with the computed nonconsensus free energy, which reflects the fact that the predicted entropy-dominated mechanism leading to nonconsensus protein−DNA binding operates in natural genomic sequences, and not only in the case of our computationally designed sequences (see Discussion). These results demonstrate that nonconsensus elements characterized by certain repeated DNA sequence patterns constitute an additional layer of TF−DNA binding regulation in a living cell that operates in coordination with specific, consensus TF−DNA binding. Such coordination between consensus and nonconsensus TF−DNA binding is clearly observable for Max, Mad, and Myc proteins, which possess a well-defined specific, consensus motif, CACGTG (Fig. 4 A–C). It is remarkable that when we separate DNA sequences into two subgroups, where the first subgroup contains sequences with the exact motif, while the second subgroup contains sequences with the mutated motif, we obtain an excellent correlation for each subgroup separately, with nearly identical slopes (Fig. 4 A–C). However, the correlation for the subgroup containing the exact motif is shifted toward lower free energies (stronger binding), compared with the subgroup containing the mutated motif (Fig. 4 A–C). These results demonstrate that statistically, on average, the strength of the nonconsensus effect is comparable to the effect of mutations in the specific motif of the TF. Notably, such separation into groups cannot be done for E2F1, E2F4, and DP1 proteins (Fig. 4 D–F), as those proteins are thought to be considerably less specific than Max, Mad, and Myc (27). Among transcription factors in the E2F/DP family, DP proteins are known to bind with lower affinity to specific E2F/DP motifs compared with E2F proteins (28). Interestingly, we found that the DP1 protein, which is less specific than the other tested TFs, showed the highest correlation between predicted free energy of nonconsensus binding and experimentally measured binding strength (R = 0.97, Fig. 4F).

Fig. 4.


Genomic nonconsensus DNA sequence elements surrounding specific TF−DNA binding motifs significantly influence TF−DNA binding preferences. These examples show the results for human transcription factors belonging to the bHLH family: (A) Mad, (B) Max, and (C) Myc; and the E2F/DP family: (D) E2F1, (E) E2F4, and (F) Dp1. Each plot shows the correlation between the computed free energy of nonconsensus protein−DNA binding per bp, f (in units of kBT), and the bound protein occupancy measured experimentally (by gcPBM). The x axis represents the logarithm of the measured gcPBM signal intensity (in dimensionless units; higher values correspond to stronger binding events). The data are binned into 25 bins. In A, B, and C, we separated the 36-bp-long genomic DNA sequences used in the experiment into two groups: sequences with the exact specific motif, CACGTG (red) and with the mutated motif (black). Overall, TF−DNA binding was probed for 18,123 DNA sequences in A, 16,421 sequences in B, 15,936 sequences in C, and 5,329 sequences in D, E, and F. Below each plot are examples of sequences used in these experiments; specific motifs are marked in red.

Discussion

We now turn to provide an intuitive explanation of the experimentally observed effect of DNA sequence correlations on protein−DNA binding preferences. In summary, the nonconsensus effect stems from the fact that DNA sequence symmetry statistically shapes the spectrum of binding energy fluctuations in such a way that the amplitude of these fluctuations becomes a function of the symmetry type and the DNA correlation scale, ξ. To illustrate this effect, we consider a toy model, where DNA sequences are composed of just two nucleotide types, X and Y, and these sequences possess just two different symmetry types: (i) perfect [XX]/[YY] symmetry favoring the formation of poly(X)/poly(Y) tracts (this case is analogous to [αα] symmetry with the largest ξ used in the experiment) and (ii) random sequence (SI Appendix, Fig. S10).

To compute the average free energy of nonconsensus protein−DNA binding, we generate an ensemble of random DNA binders, where each model TF interacts with M DNA base pairs at each sequence position. The energy of binding at positions i to M + i − 1 is:

$$U(i) = -\sum_{j=i}^{M+i-1} \left[ K_X s_X(j) + K_Y s_Y(j) \right],$$

where sX(j)=1 and sY(j)=0 if the nucleotide at position j in our sequence is X, and sX(j)=0 and sY(j)=1 if the nucleotide at position j in our sequence is Y. This energy is a simplified, two-component version of the general four-component case (SI Appendix). Each random binder is uniquely defined in this model by the two energy parameters KX and KY, which represent the binding energies to the X and Y nucleotide, respectively. We assume that KX and KY are drawn from the Gaussian probability distribution with the mean, u0, and the SD, σ.

To understand the mechanism of nonconsensus protein−DNA binding induced by DNA sequence correlations, it is enough to notice that the average energy, $\langle U \rangle$, in each case is identical, $-M u_0$, but the energy fluctuation, $\Omega^2 = \langle U^2 \rangle - \langle U \rangle^2$, is different: $\Omega^2_{[XX]/[YY]} = M^2 \sigma^2$ and $\Omega^2_{\mathrm{rand}} = M(1+M)\sigma^2/2$, respectively, where the averaging is performed with respect to the ensemble of model TFs, represented by the random variables KX and KY. Two important conclusions follow from this simple result. First, the fact that the average energy, $\langle U \rangle = -M u_0$, is the same in each of the two cases means that the nonconsensus free energy is entropy dominated. Second, the [XX]/[YY] symmetry results in larger energy fluctuations compared with the case of random sequences, $\Omega^2_{[XX]/[YY]} / \Omega^2_{\mathrm{rand}} \approx 2$. The fact that this ratio is independent of both parameters of the model, M and σ, emphasizes the key point that the strength of the nonconsensus effect is quantitatively determined by the symmetry properties of DNA sequences, and only weakly depends on microscopic parameters of protein−DNA binding potentials. In particular, larger energy fluctuations, $\Omega^2$, always lead to the lower free energy associated with such fluctuations (29).
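The two fluctuation amplitudes quoted above can be checked by direct enumeration for a small binder size. The sketch below is an illustration under stated assumptions, not code from the study: it assumes that KX and KY are independent, that X and Y are equally likely at every position of a random sequence, and the names `energy_variance` and `mean_variance_random` are hypothetical. The result does not depend on the overall sign convention of the energy, because only the variance is computed.

```python
from itertools import product


def energy_variance(window, sigma):
    """Variance of the binding energy over the ensemble of random binders.

    The energy of one window is a sum of n_X copies of K_X and n_Y copies
    of K_Y; with independent K_X and K_Y of SD sigma, the variance is
    sigma^2 * (n_X^2 + n_Y^2).
    """
    n_x = window.count("X")
    n_y = window.count("Y")
    return sigma ** 2 * (n_x ** 2 + n_y ** 2)


def mean_variance_random(m, sigma):
    """Average the variance over all 2^m equally likely two-letter windows."""
    windows = ["".join(letters) for letters in product("XY", repeat=m)]
    return sum(energy_variance(w, sigma) for w in windows) / len(windows)


M, SIGMA = 8, 2.0                                  # illustrative values only
omega2_tract = energy_variance("X" * M, SIGMA)     # expected M^2 * sigma^2
omega2_rand = mean_variance_random(M, SIGMA)       # expected M(1+M) * sigma^2 / 2
ratio = omega2_tract / omega2_rand                 # equals 2M / (1 + M)
```

For M = 8 the ratio is 16/9, about 1.78; the two closed-form expressions give 2M/(1+M), which approaches 2 as M grows and never depends on σ.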

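The average partition function, $\langle Z \rangle \approx \left\langle \sum_{i=1}^{L} \exp\left(-U(i)/k_B T\right) \right\rangle$, and the resulting average free energy, $F = -k_B T \ln \langle Z \rangle$, can be computed analytically in each of the two cases shown in SI Appendix, Fig. S10: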

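$$F_{[XX]/[YY]} \approx -M u_0 - k_B T \ln L - \frac{\sigma^2 M^2}{2 k_B T},$$
$$F_{\mathrm{rand}} \approx -M u_0 - k_B T \ln L - \frac{\sigma^2 M (1+M)}{4 k_B T},$$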

where we assumed that the DNA sequence length L was significantly greater than M, and we thus neglected boundary effects due to the finite size, M, of model TFs. These simple results are in qualitative agreement with the results of experimental measurements for the Max protein presented in Fig. 2B, where we observe that at all correlation scales, ξ, the [αα] symmetry type leads to a stronger Max−DNA binding compared with the case of random sequences. As expected, our simple random-binder model cannot always quantitatively correctly predict the effect of a particular sequence symmetry type on protein−DNA binding strength. However, the model does provide a statistically correct description of the relationship between DNA sequence correlations and the experimentally measured protein−DNA binding strength.
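The form of these two expressions can be recovered with a short Gaussian-average argument. The outline below is a sketch under stated assumptions rather than the derivation given in SI Appendix: it treats the binding energy at each position as approximately Gaussian over the ensemble of model TFs, with mean $\langle U \rangle$ and variance $\Omega^2$, and it neglects boundary effects so that all L positions contribute equally.

```latex
\begin{align}
\left\langle e^{-U/k_B T} \right\rangle
  &= \exp\!\left( -\frac{\langle U \rangle}{k_B T}
     + \frac{\Omega^2}{2 (k_B T)^2} \right), \\
\langle Z \rangle
  &\approx L \left\langle e^{-U/k_B T} \right\rangle, \\
F = -k_B T \ln \langle Z \rangle
  &\approx \langle U \rangle - k_B T \ln L - \frac{\Omega^2}{2 k_B T}, \\
F_{[XX]/[YY]} - F_{\mathrm{rand}}
  &\approx -\frac{\Omega^2_{[XX]/[YY]} - \Omega^2_{\mathrm{rand}}}{2 k_B T}
   = -\frac{\sigma^2 M (M-1)}{4 k_B T}.
\end{align}
```

Because the average energy cancels in the difference, the predicted preference for poly(X)/poly(Y) tracts comes entirely from the fluctuation term, which is the sense in which the effect is entropy dominated.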

We emphasize that the entropy-dominated mechanism for nonconsensus protein−DNA binding identified here is conceptually different and complementary to the recently identified DNA sequence-specific mechanism induced by the presence of a few specific nucleotides adjacent to specific TF binding sites (TFBSs) (15, 30). The latter mechanism was shown to modulate specific protein−DNA binding for the two yeast proteins, Cbf1 and Tye7, via the modulation of the DNA shape (15). In our experimental design, we test protein−DNA interactions for thousands of different DNA molecules for each DNA symmetry type (Fig. 2). The key conclusion here is that the observed statistical effect is governed by the DNA symmetry rather than by specific sequence elements. The fact that our simple model (without any protein−DNA sequence specificity built in) provides an excellent description of the experimental data further validates this conclusion.

To further validate the predicted mechanism of the identified effect, we designed an experiment where we measured TF binding intensities to DNA sequences, varying the length of repeated sequence elements possessing the [αNα] symmetry but keeping the total length and the GC content of each DNA sequence identical (SI Appendix, Fig. S11). As above, we measured TF binding intensities for Max, Mad, and Myc proteins to DNA sequences that contain the specific motif, and, in addition, to DNA sequences that lack such motif. Strikingly, in both cases, we observe that measured TF−DNA binding intensities strongly depend on the presence of repeated DNA sequence elements in distal regions away from the location of the specific motif (SI Appendix, Fig. S11). For example, the magnitude of the Max-DNA nonconsensus binding effect is reduced by ∼46% by shortening the length of repeated sequence elements in the distal regions by half.

In addition, we used the high-throughput DNA structure prediction method (31) to compute the minor groove width (MGW) for designed 36-bp DNA sequences with [αα] and [αNα] symmetries used in our PBM measurements (SI Appendix, Fig. S12). Although measured TF binding preferences to DNA sequences possessing [αα] and [αNα] symmetries, respectively, show a similar trend as a function of the inverse correlation scale, 1/ξ (SI Appendix, Fig. S2), the computed MGW shows the opposite trend for these two symmetries (SI Appendix, Fig. S12). These results, taken together with the fact that, for DNA sequences lacking a specific motif, the nonconsensus effect is as significant as for DNA sequences containing such a motif (SI Appendix, Fig. S9), strongly indicate that the identified nonconsensus effect and the effect of DNA shape represent two conceptually different, complementary effects.

In this study, we used six proteins belonging to two different structural families. We stress that the magnitude of the nonconsensus TF−DNA binding effect will differ between proteins, and we expect it to be smaller for more specific proteins. For example, for the Max protein used in our study, the presence of a specific motif leads, statistically, on average, to a 2 kBT free-energy difference compared with a nonspecific sequence (lacking such a motif). This free-energy difference reaches 12 kBT for one of the most specific prokaryotic proteins, the lac repressor, LacI (32). The magnitude of the nonconsensus effect is expected to be significantly smaller in the latter case.

In conclusion, the novel mechanism of protein−DNA binding in the absence of specific base-pair recognition investigated in this work leads to a number of general implications and strongly suggests that currently adopted basic concepts should be reevaluated. First, nonconsensus protein−DNA binding might provide an explanation for the so-called highly occupied target regions, which are depleted in known consensus motifs yet are highly enriched in repeated DNA sequences with unknown function (18, 33). Second, nonconsensus sequence elements flanking specific consensus motifs are likely to significantly affect the properties of the kinetic search that a protein performs to find its specific target sites on DNA (7–10, 34). Our results demonstrate that nonconsensus DNA background exerts substantially nonrandom statistical potential on DNA-binding proteins, and therefore, such potential must be taken into account to predict the kinetics of protein−DNA binding. Finally, we expect that our results will significantly impact the understanding of molecular, biophysical principles of transcriptional regulation, and significantly improve our ability to predict how variations in DNA sequences, i.e., mutations or polymorphisms, and protein concentrations influence gene expression programs in living cells.

Methods

PBM experiments were performed essentially as described previously (15, 22). Briefly, custom-designed 4 × 180 K microarrays were used (Agilent Technologies; AmadIDs 041707, 049201, and 060115). Agilent arrays were double-stranded by solid-phase primer extension using a small amount of spiked-in Cy3-labeled dUTP; the Cy3 signal was used to assess DNA double-stranding at each spot, as described previously (22). After the standard blocking step (22), arrays were incubated with a PBS buffer-based protein mixture of 100–200 nM His-tagged or GST-tagged protein, 2% milk, 200 ng/µL BSA, 50 ng/µL salmon testes DNA, and 0.02% TX-100. Bound protein was tagged with 10 ng/µL anti-His or anti-GST antibody conjugated to Alexa 488 (Qiagen; 35310 or Invitrogen; A11131) in PBS with 2% milk. Data were analyzed to obtain fluorescence intensities for all sequences represented on the arrays. Each sequence was present in six replicate spots. For each sequence, we report the median intensity over the six replicates. Importantly, we note that the microarray designs used in this study are very different from the widely used “universal PBMs” (22, 23, 35–39). The latter provide general binding specificity information for all possible 8-bp sequences, whereas our custom arrays contain DNA sequences with a central specific binding motif flanked by different nonconsensus sequence elements (either extracted from the genome or designed computationally, as described below). The custom PBMs in our study were designed specifically to test the influence of nonconsensus sequence elements on TF−DNA binding.
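The per-sequence summary described above (the median fluorescence intensity over six replicate spots) can be sketched as follows. This is a minimal illustration, not the authors' analysis pipeline: the input layout (a list of `(sequence_id, intensity)` pairs), the function name `median_intensity_per_sequence`, and the example values are all hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_intensity_per_sequence(spots):
    """Group spot intensities by sequence and return the median of each group.

    `spots` is an iterable of (sequence_id, intensity) pairs; this layout
    is an assumption made for illustration only.
    """
    by_sequence = defaultdict(list)
    for sequence_id, intensity in spots:
        by_sequence[sequence_id].append(intensity)
    return {seq: median(values) for seq, values in by_sequence.items()}


# Hypothetical intensities: six replicate spots per sequence.
# SEQ_A contains one outlier spot (5000.0), e.g. a dust speck on the array.
example_spots = (
    [("SEQ_A", v) for v in (100.0, 120.0, 110.0, 5000.0, 105.0, 115.0)]
    + [("SEQ_B", v) for v in (200.0, 210.0, 190.0, 205.0, 195.0, 215.0)]
)

medians = median_intensity_per_sequence(example_spots)
print(medians)
```

The median is a natural choice here because it is robust to a single aberrant spot: in the example, the outlier in `SEQ_A` barely shifts the reported value, whereas it would dominate a mean.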

TFs were expressed as full-length proteins in bacteria and purified using affinity columns (GE Healthcare; 17-5130-01 for GST-tagged proteins; 11-0004-58 for His-tagged proteins). The following protein concentrations were used: 200 nM homodimer concentrations for GST-tagged E2F1, E2F4, and DP1 on the E2F/DP genomic context PBM (AmadID 049201); 100 nM dimer concentration for His-tagged c-Myc:Max, Max:Max, and Mad2:Max on the Myc/Max/Mad genomic context PBM (AmadID 041707); 100 nM dimer concentration for His-tagged c-Myc:Max, Max:Max, and Mad2:Max, and also 50 nM Max:Max, on the custom PBM containing a constant specific TFBS within flanks with various nonconsensus elements (AmadID 060115). The genomic context PBMs (gcPBMs) were designed as described previously (15, 16). The custom PBM (AmadID 060115) was designed as described in SI Appendix, section 1.

Supplementary Material

Supplementary File

Acknowledgments

We thank M. Bulyk, P. von Hippel, E. Shakhnovich, B. Tsukerblat, and I. Weinstock for critical reading of the manuscript and for helpful comments. This work was supported through a PhRMA Foundation Research Starter grant (to R.G.) and the Israel Science Foundation Grant 1014/09 (to D.B.L.). A.A. was supported by the Adams Fellowship program of the Israel National Academy of Science. J.L.S. was funded through a Duke Translational Medicine Quality Framework postdoctoral fellowship. R.G. is an Alfred P. Sloan Research Fellow.

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: The data reported in this paper have been deposited in the Gene Expression Omnibus (GEO) database, www.ncbi.nlm.nih.gov/geo (accession nos. GSE59845, GSE61854, and GSE61920).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1410569111/-/DCSupplemental.

References

  • 1. von Hippel PH, Revzin A, Gross CA, Wang AC. Non-specific DNA binding of genome regulating proteins as a biological control mechanism: I. The lac operon: Equilibrium aspects. Proc Natl Acad Sci USA. 1974;71(12):4808–4812. doi: 10.1073/pnas.71.12.4808.
  • 2. Riggs AD, Bourgeois S, Cohn M. The lac repressor-operator interaction. 3. Kinetic studies. J Mol Biol. 1970;53(3):401–417. doi: 10.1016/0022-2836(70)90074-4.
  • 3. Berg OG, Winter RB, von Hippel PH. Diffusion-driven mechanisms of protein translocation on nucleic acids. 1. Models and theory. Biochemistry. 1981;20(24):6929–6948. doi: 10.1021/bi00527a028.
  • 4. von Hippel PH, Berg OG. Facilitated target location in biological systems. J Biol Chem. 1989;264(2):675–678.
  • 5. Berg OG, von Hippel PH. Selection of DNA binding sites by regulatory proteins. Statistical-mechanical theory and application to operators and promoters. J Mol Biol. 1987;193(4):723–750. doi: 10.1016/0022-2836(87)90354-8.
  • 6. von Hippel PH, Berg OG. On the specificity of DNA-protein interactions. Proc Natl Acad Sci USA. 1986;83(6):1608–1612. doi: 10.1073/pnas.83.6.1608.
  • 7. von Hippel PH. From “simple” DNA-protein interactions to the macromolecular machines of gene expression. Annu Rev Biophys Biomol Struct. 2007;36:79–105. doi: 10.1146/annurev.biophys.34.040204.144521.
  • 8. Kolomeisky AB. Physics of protein-DNA interactions: Mechanisms of facilitated target search. Phys Chem Chem Phys. 2011;13(6):2088–2095. doi: 10.1039/c0cp01966f.
  • 9. Slutsky M, Kardar M, Mirny LA. Diffusion in correlated random potentials, with applications to DNA. Phys Rev E Stat Nonlin Soft Matter Phys. 2004;69(6 Pt 1):061903. doi: 10.1103/PhysRevE.69.061903.
  • 10. Slutsky M, Mirny LA. Kinetics of protein-DNA interaction: Facilitated target location in sequence-dependent potential. Biophys J. 2004;87(6):4021–4035. doi: 10.1529/biophysj.104.050765.
  • 11. Liebesny P, Goyal S, Dunlap D, Family F, Finzi L. Determination of the number of proteins bound non-specifically to DNA. J Phys Condens Matter. 2010;22(41):414104. doi: 10.1088/0953-8984/22/41/414104.
  • 12. Blainey PC, et al. Nonspecifically bound proteins spin while diffusing along DNA. Nat Struct Mol Biol. 2009;16(12):1224–1229. doi: 10.1038/nsmb.1716.
  • 13. Wang YM, Austin RH, Cox EC. Single molecule measurements of repressor protein 1D diffusion on DNA. Phys Rev Lett. 2006;97(4):048302. doi: 10.1103/PhysRevLett.97.048302.
  • 14. Tafvizi A, et al. Tumor suppressor p53 slides on DNA with low friction and high stability. Biophys J. 2008;95(1):L01–L03. doi: 10.1529/biophysj.108.134122.
  • 15. Gordân R, et al. Genomic regions flanking E-box binding sites influence DNA binding specificity of bHLH transcription factors through DNA shape. Cell Reports. 2013;3(4):1093–1104. doi: 10.1016/j.celrep.2013.03.014.
  • 16. Mordelet F, Horton J, Hartemink AJ, Engelhardt BE, Gordân R. Stability selection for regression-based models of transcription factor-DNA binding specificity. Bioinformatics. 2013;29(13):i117–i125. doi: 10.1093/bioinformatics/btt221.
  • 17. Sela I, Lukatsky DB. DNA sequence correlations shape nonspecific transcription factor-DNA binding affinity. Biophys J. 2011;101(1):160–166. doi: 10.1016/j.bpj.2011.04.037.
  • 18. Gerstein MB, et al. Architecture of the human regulatory network derived from ENCODE data. Nature. 2012;489(7414):91–100. doi: 10.1038/nature11245.
  • 19. Thurman RE, et al. The accessible chromatin landscape of the human genome. Nature. 2012;489(7414):75–82. doi: 10.1038/nature11232.
  • 20. Plischke M, Bergersen B. Equilibrium Statistical Physics. Prentice Hall; Englewood Cliffs, NJ: 1989.
  • 21. Josse J, Kaiser AD, Kornberg A. Enzymatic synthesis of deoxyribonucleic acid. VIII. Frequencies of nearest neighbor base sequences in deoxyribonucleic acid. J Biol Chem. 1961;236:864–875.
  • 22. Berger MF, Bulyk ML. Universal protein-binding microarrays for the comprehensive characterization of the DNA-binding specificities of transcription factors. Nat Protoc. 2009;4(3):393–411. doi: 10.1038/nprot.2008.195.
  • 23. Berger MF, et al. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat Biotechnol. 2006;24(11):1429–1435. doi: 10.1038/nbt1246.
  • 24. Mukherjee S, et al. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat Genet. 2004;36(12):1331–1339. doi: 10.1038/ng1473.
  • 25. Afek A, Lukatsky DB. Genome-wide organization of eukaryotic preinitiation complex is influenced by nonconsensus protein-DNA binding. Biophys J. 2013;104(5):1107–1115. doi: 10.1016/j.bpj.2013.01.038.
  • 26. Maerkl SJ, Quake SR. A systems approach to measuring the binding energy landscapes of transcription factors. Science. 2007;315(5809):233–237. doi: 10.1126/science.1131007.
  • 27. Robasky K, Bulyk ML. UniPROBE, update 2011: Expanded content and search tools in the online database of protein-binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2011;39(Database issue):D124–D128. doi: 10.1093/nar/gkq992.
  • 28. Zheng N, Fraenkel E, Pabo CO, Pavletich NP. Structural basis of DNA recognition by the heterodimeric cell cycle transcription factor E2F-DP. Genes Dev. 1999;13(6):666–674. doi: 10.1101/gad.13.6.666.
  • 29. Elkin M, Andre I, Lukatsky DB. Energy fluctuations shape free energy of nonspecific biomolecular interactions. J Stat Phys. 2012;146(4):870–877.
  • 30. Rohs R, et al. The role of DNA shape in protein-DNA recognition. Nature. 2009;461(7268):1248–1253. doi: 10.1038/nature08473.
  • 31. Zhou T, et al. DNAshape: A method for the high-throughput prediction of DNA structural features on a genomic scale. Nucleic Acids Res. 2013;41(Web Server issue):W56–W62. doi: 10.1093/nar/gkt437.
  • 32. Frank DE, et al. Thermodynamics of the interactions of lac repressor with variants of the symmetric lac operator: Effects of converting a consensus site to a non-specific site. J Mol Biol. 1997;267(5):1186–1206. doi: 10.1006/jmbi.1997.0920.
  • 33. Nègre N, et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471(7339):527–531. doi: 10.1038/nature09990.
  • 34. Hu T, Grosberg AY, Shklovskii BI. How proteins search for their specific sites on DNA: The role of DNA conformation. Biophys J. 2006;90(8):2731–2744. doi: 10.1529/biophysj.105.078162.
  • 35. Nakagawa S, Gisselbrecht SS, Rogers JM, Hartl DL, Bulyk ML. DNA-binding specificity changes in the evolution of forkhead transcription factors. Proc Natl Acad Sci USA. 2013;110(30):12349–12354. doi: 10.1073/pnas.1310430110.
  • 36. Weirauch MT, et al.; DREAM5 Consortium. Evaluation of methods for modeling transcription factor sequence specificity. Nat Biotechnol. 2013;31(2):126–134. doi: 10.1038/nbt.2486.
  • 37. Gordân R, et al. Curated collection of yeast transcription factor DNA binding specificity data reveals novel structural and gene regulatory insights. Genome Biol. 2011;12(12):R125. doi: 10.1186/gb-2011-12-12-r125.
  • 38. Wei GH, et al. Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. EMBO J. 2010;29(13):2147–2160. doi: 10.1038/emboj.2010.106.
  • 39. Badis G, et al. Diversity and complexity in DNA recognition by transcription factors. Science. 2009;324(5935):1720–1723. doi: 10.1126/science.1162327.

Articles from Proceedings of the National Academy of Sciences of the United States of America are provided here courtesy of National Academy of Sciences