A Biophysical Approach to Predicting Protein–DNA Binding Energetics

George Locke; Alexandre V Morozov

doi:10.1534/genetics.115.178384

. 2015 Jun 16;200(4):1349–1361. doi: 10.1534/genetics.115.178384

A Biophysical Approach to Predicting Protein–DNA Binding Energetics

George Locke ^*, Alexandre V Morozov ^*,^†,¹

PMCID: PMC4574261 PMID: 26081193

Abstract

Sequence-specific interactions between proteins and DNA play a central role in DNA replication, repair, recombination, and control of gene expression. These interactions can be studied in vitro using microfluidics, protein-binding microarrays (PBMs), and other high-throughput techniques. Here we develop a biophysical approach to predicting protein–DNA binding specificities from high-throughput in vitro data. Our algorithm, called BindSter, can model alternative DNA-binding modes and multiple protein species competing for access to DNA, while rigorously taking into account all sterically allowed configurations of DNA-bound factors. BindSter can be used with a hierarchy of protein–DNA interaction models of increasing complexity, including contributions of mononucleotides, dinucleotides, and longer words to the total protein–DNA binding energy. We observe that the quality of BindSter predictions does not change significantly as some of the energy parameters vary over a sizable range. To take this degeneracy into account, we have developed a graphical representation of parameter uncertainties called IntervalLogo. We find that our simplest model, in which each nucleotide in the binding site is treated independently, performs better than previous biophysical approaches. The extensions of this model, in which contributions of longer words are also considered, result in further improvements, underscoring the importance of higher-order effects in protein–DNA energetics. In contrast, we find little evidence of multiple binding modes for the transcription factors (TFs) and experimental conditions in our data set. Furthermore, there is limited consistency in predictions for the same TF based on microfluidics and PBM data.

Keywords: gene regulation, protein–DNA interactions, thermodynamic modeling, transcription factor

SEQUENCE-SPECIFIC interactions between proteins and genomic DNA control numerous cellular processes. An important class of DNA-binding proteins is transcription factors (TFs), which regulate expression of their target genes. Knowledge of in vivo binding locations of TFs is important for reconstructing regulatory networks of gene transcription and for understanding which factors control which genes. These locations can be mapped genome-wide, using chromatin immunoprecipitation followed by microarray hybridization (ChIP-chip) or sequencing (ChIP-seq) (Ren et al. 2000; Park 2009). The latest version of this technique, ChIP-exo, uses exonucleases to trim immunoprecipitated DNA fragments to precise locations of bound proteins, achieving single-base-pair precision (Rhee and Pugh 2011). Many factors influence TF binding locations in living cells, including chromatin structure, which strongly modulates DNA accessibility (Felsenfeld and Groudine 2003), and cooperative or competing interactions with other DNA-binding proteins (Siggers and Gordan 2014). However, the primary determinant of TF binding locations is intrinsic DNA sequence specificity: TFs can recognize and bind their cognate sites on the background of numerous competing sites available in the genome.

Therefore, an accurate in vitro picture of intrinsic TF–DNA binding affinity and specificity can provide insights into in vivo TF function. For example, high-affinity TF binding sites that are not occupied in vivo might indicate that these sites are covered with nucleosomes [the nucleosome is a fundamental unit of chromatin that packages 147 base pairs (bp) of genomic DNA into a tightly bent left-handed superhelix (Richmond and Davey 2003)]. Conversely, low-affinity sites bound in vivo might be a sign of indirect or cooperative binding (Gordan et al. 2009). Recently, several new technologies have been developed that enable high-throughput in vitro determination of protein–DNA binding affinities (see Stormo and Zhao 2010 for a review). Here we focus on two of these technologies: mechanically induced trapping of molecular interactions (MITOMI) (Maerkl and Quake 2007; Fordyce et al. 2010) and protein-binding microarrays (PBM) (Bulyk et al. 2001; Berger et al. 2006).

The MITOMI device uses microfluidics to simultaneously measure binding affinities of a TF to a few thousand DNA probe sequences. The second-generation device (Fordyce et al. 2010) contains 4160 unit cells; each cell has DNA probes with identical sequences, and a given sequence appears in at least 2 unit cells. All DNA probes are 70 bp long: a probe starts with CGC followed by a 52-bp variable region and a 15-bp fixed sequence at the 3′ end used for labeling and primer extension. Sequences in the variable region are designed to accommodate all 65,536 DNA 8-mers. Two fluorescent labels are used to quantify the number of surface-immobilized protein molecules (BODIPY) and protein-bound DNA probes (Cy5); the ratio of Cy5 to BODIPY fluorescence is linearly proportional to the total protein occupancy of each probe. Fordyce et al. (2010) report MITOMI measurements for 28 TFs from Saccharomyces cerevisiae, comprising 10 different families.

The PBM approach to studying protein–DNA interactions utilizes microarrays with up to tens of thousands of spots. Each spot contains double-stranded DNA probes with identical sequences; the total number of spots is high enough to allow for all possible permutations of a 10-bp-long sequence. DNA probes are fluorescently labeled to monitor the consistency of primer-directed DNA synthesis responsible for creating double-stranded DNA at each spot on the array. TF molecules are added to the array and the array is subsequently washed to remove weakly bound and unbound molecules. Finally, the number of remaining bound proteins is quantified using a fluorescent antibody (Alexa488). The antibody fluorescence intensity (normalized by the DNA fluorescence intensity at each spot; DNA is labeled with Cy3) is then proportional to the total protein occupancy of a given probe sequence.

Here we present an algorithm, called BindSter, for inferring energetics of protein–DNA interactions from high-throughput measurements such as MITOMI and PBM, which report total protein occupancies for a set of DNA probe sequences. Under the assumption of thermodynamic equilibrium, these occupancies can be computed using efficient recursive techniques (Morozov et al. 2009; Locke et al. 2010) for an arbitrary protein–DNA interaction energy model. The parameters of the model that determine binding energies can then be fitted against the data. We consider all sterically allowed configurations of proteins bound to each probe; multiple binding events, overlapping sites, and multiple DNA-bound species (including alternative binding modes of the same TF) are treated rigorously as they do not entail significant computational costs. Thus our framework is more consistent in treating steric exclusion and multiple-species competition for DNA sequence compared with previous biophysical models of protein–DNA energetics: MatrixREDUCE (Foat et al. 2006) and BEEML-PBM (Zhao and Stormo 2011; Zhao et al. 2012).

Moreover, our approach is designed to test a hierarchy of protein–DNA energy models of increasing complexity. We start with a basic mononucleotide model in which the contribution of each nucleotide in the binding site to the total interaction energy is independent of all the other nucleotides. This assumption is commonly used in constructing position-specific weight matrices (PWMs) from TF binding site data (Stormo and Fields 1998). Next, we extend the mononucleotide model in two different ways by including (a) dinucleotide contributions and (b) energies of di- and trinucleotides regardless of their position within the binding site. In addition, we check for alternative TF binding modes by fitting two models to the data either simultaneously or sequentially. Although we have focused on 28 TFs with MITOMI data in this work, we have also used PBM data to make predictions for 12 TFs, and have carried out a detailed comparison of PBM- and MITOMI-derived models. The software for our algorithm is available online at http://nucleosome.rutgers.edu/nucenergen/bindster/.

Materials and Methods

Physical model of protein–DNA interactions

Consider an ensemble of M species of particles distributed along a DNA segment of N bp in length. The particles can bind anywhere within the DNA segment subject to steric exclusion (that is, adjacent particles cannot overlap). Partially bounds states (e.g., where the protein hangs off the DNA) are also not allowed. Proteins may bind to either DNA strand. A grand-canonical partition function for this system of DNA-bound particles is given by

Z = \sum_{conf} e^{- E (conf)},

(1)

where “conf” denotes an arbitrary configuration of DNA-bound, nonoverlapping particles, and $E (conf)$ is the total DNA-binding energy of all the particles in the configuration. For simplicity we set $k_{B} T = 1$ , where $k_{B}$ is the Boltzmann constant and T is the temperature; note that chemical potentials are included implicitly into $E (conf) .$

One can compute Z efficiently, using a recursive relation (Segal et al. 2006; Morozov et al. 2009; Locke et al. 2010)

Z_{i}^{f} = \sum_{j = 0}^{M} Z_{i - L_{j}}^{f} e^{- E_{i - L_{j} + 1}^{j}} θ_{i - L_{j} + 1},

(2)

with the initial condition $Z_{0}^{f} = 1$ ( $i = 1 \dots N$ ). This procedure is similar to the forward algorithm widely employed in hidden Markov models (Durbin et al. 1998). Here, $Z_{i}^{f}$ is the forward partial partition function, $E_{i}^{j}$ is the binding energy for a species j particle at position i, $L_{j}$ is the length of the DNA footprint for a species j particle, and $θ_{i}$ is the Heavyside step function that is equal to 1 when i is positive and 0 otherwise. The zeroth species represents the background (no particle), with $L_{0} = 1$ and $E_{i}^{0} = 0$ for all i.

Likewise, we can compute the partial partition functions in the reverse direction (Segal et al. 2006; Morozov et al. 2009; Locke et al. 2010),

Z_{i}^{r} = \sum_{j = 0}^{M} Z_{i + L_{j}}^{r} e^{- E_{i}^{j}} θ_{N - i - L_{j} + 2},

(3)

with the initial condition $Z_{N + 1}^{r} = 1$ ( $i = N \dots 1$ ). This step is analogous to the backward algorithm (Durbin et al. 1998). Note that $Z_{N}^{f} = Z_{1}^{r} = Z$ by construction. Further, the probability of starting a particle of species j at position i is given by

P_{i}^{j} = \frac{Z_{i - 1}^{f} e^{- E_{i}^{j}} Z_{i + L_{j}}^{r}}{Z},

(4)

where $i = 1 \dots N - L_{j} + 1.$ Note that $P_{i}^{0}$ gives the probability of finding no particle at position i. Equation 4 has a simple interpretation: the sum of Boltzmann weights of sterically allowed configurations with a particle of species j bound at base pairs $i \dots i + L_{j} - 1$ , divided by the sum of Boltzmann weights of all possible configurations. The expected number of particles of species j covering base pairs i (particle occupancy $O_{i}^{j}$ ) is then given by $O_{i}^{j} = \sum_{k = i - L_{j} + 1}^{i} P_{k}^{j}$ ( $i = 1 \dots N$ ). Finally, the average total number of particles on the entire DNA segment $〈 N 〉$ is given by

〈 N 〉 = \sum_{j = 1}^{M} 〈 N_{j} 〉 = \sum_{j = 1}^{M} \sum_{i = 1}^{N - L_{j} + 1} P_{i}^{j},

(5)

where $〈 N_{j} 〉$ is the average total number of particles of species j. Correspondingly, $\sum_{j = 1}^{M} 〈 N_{j} 〉 L_{j}$ yields the total occupancy of the DNA segment. Taking binding to both strands into account amounts to replacing energies $E_{i}^{j}$ with “free energies” $- \log (e^{- E_{i}^{j}} + e^{- E_{i}^{j, rc}})$ , where $E_{i}^{j}$ is the binding energy of a particle of species j to the $i \dots i + L_{j} - 1$ site on the upper strand, and $E_{i}^{j, rc}$ is the binding energy of a particle of species j to the $i + L_{j} - 1 \dots i$ site on the lower strand.

We have also tested a simplified DNA-binding model that lacks steric exclusion. In this model, binding of each particle at a given position is independent of all the other particles. The probability of binding $P_{i}^{j}$ is then simply

P_{i}^{j} = \frac{1}{1 + e^{E_{i}^{j}}},

(6)

and the quantities $〈 N 〉$ and $O_{i}^{j}$ are calculated as above.

Models of protein–DNA binding energy

The mononucleotide model:

The mononucleotide model assumes independent contributions to the total protein–DNA interaction energy from each nucleotide in the binding site:

E_{i} = ε^{0} + \sum_{j = 1}^{L} ε_{α_{i + j - 1}}^{j} .

(7)

Here, $E_{i}$ is the total energy of binding at position i on a DNA segment (we suppress the species label for brevity), L is the binding site length, $α_{k} = {A, C, T, G}$ is the nucleotide at position k within the DNA segment, $ε_{α}^{j}$ is the energy contribution from the base α at position j, and $ε^{0}$ is a sequence-independent offset that subsumes the chemical potential. We constrain the $ε_{α}^{j}$ parameters at each position j so that $\sum_{α} ε_{α}^{j} = 0$ , ensuring that the set of fitted parameters is nondegenerate (Locke et al. 2010). With these constraints, the mononucleotide model has $3 L + 1$ independent parameters.

The dinucleotide model:

The dinucleotide model includes both mono- and dinucleotide energy contributions:

E_{i} = ε^{0} + \sum_{j = 1}^{L} ε_{α_{i + j - 1}}^{j} + \sum_{j = 1}^{L - 1} ε_{α_{i + j - 1} α_{i + j}}^{j} .

(8)

Here, $ε_{α β}^{j}$ represents the dinucleotide contribution from dinucleotide $α β$ at position $j, j + 1;$ all other symbols are as in Equation 7. The mononucleotide parameters are constrained as before, and the dinucleotide parameters are constrained so that $\sum_{α} ε_{α β}^{j} = 0$ and $\sum_{β} ε_{α β}^{j} = 0$ for all j. Thus the dinucleotide model has $3^{2} (L - 1) + 3 L + 1 = 12 L - 8$ independent parameters.

The k-mer model:

Another extension of the mononucleotide model adds energy contributions of longer DNA words, irrespective of where these words occur within the binding site (Locke et al. 2010; Chereji and Morozov 2011). In this model, the binding energy is given by

E_{i} = ε^{0} + \sum_{j = 1}^{L} ε_{α_{i + j - 1}}^{j} + \sum_{n = 2}^{N} \sum_{α_{1} \dots α_{n}}^{4^{n}} m_{α_{1} \dots α_{n}}^{i} ε_{α_{1} \dots α_{n}},

(9)

where N is the maximum length of DNA words (two or three in our fits), $α_{1} \dots α_{n}$ is a DNA word of length n, $m_{α_{1} \dots α_{n}}^{i}$ is the number of times that word appears within the binding site, and $ε_{α_{1} \dots α_{n}}$ is its energy contribution. The energies are constrained by

\sum_{α_{i}} ε_{α_{1} \dots α_{n}} = 0, i = 1... n .

(10)

With $N = 3 used in this work$ , there are $3^{2} + 3^{3} = 36$ independent k-mer parameters in addition to $3 L + 1$ mononucleotide parameters.

The position-independent model:

We have also constructed a model where contribution of a given DNA word is independent of its position within the binding footprint:

E_{i} = ε^{0} + \sum_{n = 1}^{N} \sum_{α_{1} \dots α_{n}}^{4^{n}} m_{α_{1} \dots α_{n}}^{i} ε_{α_{1} \dots α_{n}} .

(11)

All symbols are as in Equation 9, and the energy parameters are constrained by Equation 10. We use $N = 2$ and $N = 3$ with this model.

Alternative motif models:

These models are designed to account for protein binding in two distinct sequence-dependent modes. The binding energy for one of the modes is given by the mononucleotide model (Equation 7). In the two-motif model, the other mode is also described by the mononucleotide model, and both models are fitted simultaneously. In the secondary motif model, one of the mononucleotide motifs is fitted first and then held fixed (except for $ε^{0}$ ), while the parameters of the “secondary” mononucleotide model are allowed to vary. Finally, in the “mononucleotide + position-independent (PI)” model the alternative binding mode is described by the PI model (Equation 11), which is fitted simultaneously with the mononucleotide model.

PBM location bias

A known source of bias in PBM experiments is the increasing difficulty for the proteins to bind at positions closer to the glass slide (Weirauch et al. 2013). We account for this effect by introducing a bias “energy,” modeled as a quadratic function of the absolute position on the DNA segment:

E_{i}^{bias} = a_{j} i + b_{j} i^{2} .

(12)

Here, i is the position along the DNA probe, and $a_{j}$ and $b_{j}$ are fitting parameters. The bias energy can be added to any of the protein–DNA energy models described above, separately for each species j.

Fitting procedure

To fit a protein–DNA interaction model to the binding data, we perform 80 separate regressions and select the set of binding parameters (the “motif”) for which the linear correlation r between the predicted number of particles $〈 N 〉$ (Equation 5) for each DNA probe and the target data (MITOMI or PBM fluorescence ratios) is the highest. Each regression starts with a randomly generated motif. At each subsequent step, the motif from the previous step is modified by adding a random value $Δ ε$ , drawn from a normal distribution of unit variance, to a single randomly chosen parameter.

When one of the parameters (except $ε^{0}$ ) is modified, some of the other parameters have to be changed as well to satisfy the constraints. For example, in the mononucleotide model, if $ε_{A}^{j} \to ε_{A}^{j} + Δ ε$ , $ε_{{C, G, T}}^{j} \to ε_{{C, G, T}}^{j} - Δ ε / 3$ is used to offset the shift. In the dinucleotide model, mononucleotide parameters are treated as above, and dinucleotide parameters are modified as follows: if $ε_{A A}^{j} \to ε_{A A}^{j} + Δ ε$ , $ε_{α_{1} α_{2}}^{j} \to ε_{α_{1} α_{2}}^{j} - Δ ε / 3$ for all dinucleotides that have A at either the first or the second position, and $ε_{α_{1} α_{2}}^{j} \to ε_{α_{1} α_{2}}^{j} + Δ ε / 9$ for all other dinucleotides. In general, for words of length n we have

ε_{α_{1} \dots α_{n}}^{j} \to ε_{α_{1} \dots α_{n}}^{j} + Δ ε \frac{{(- 1)}^{n + m}}{3^{n - m}},

(13)

where m is the number of letters in common between $α_{1} \dots α_{n}$ and the word whose energy was shifted by $Δ ε$ (j is omitted for position-independent parameters).

After modifying the fitting parameters, the new r is evaluated. If it is better than the previous one, the new motif is accepted, and the algorithm proceeds to the next regression step. If the new r is not an improvement, another trial step is made. If no improvement is found after 200 trial steps, the modification that decreased the correlation least is accepted, and the algorithm proceeds to the next regression step (Tolkunov and Morozov 2012). We call this trial procedure Try2Step; see Supporting Information, File S1 for the pseudocode. After $T = 400$ regression steps have been made, r of the current step is compared to r obtained T steps ago. If this difference is <0.003, the current regression ends and its best r and the corresponding final motif are recorded. The motif that achieved the best linear correlation $r_{best}$ among 80 separate regression runs is reported as the final prediction.

Motif correlation

We find correlations between motifs produced by energy models of arbitrary complexity. First, we convert each energy parameter to probabilities:

P_{α_{1} \dots α_{n}}^{i} = \frac{e^{- ε_{α_{1} \dots α_{n}}^{i}}}{\sum_{β_{1}} \dots \sum_{β_{n}} e^{- ε_{β_{1} \dots β_{n}}^{i}}} .

(14)

Here, $P_{α_{1} \dots α_{n}}^{i}$ is the probability of the word $α_{1} \dots α_{n}$ at position i (i is omitted for position-independent parameters). The information content $R^{i}$ is then given by the difference between the maximum possible and the observed entropy (Crooks et al. 2004),

R^{i} = \log_{2} Ω + \sum_{β_{1}} \dots \sum_{β_{n}} P_{β_{1} \dots β_{n}}^{i} \log_{2} P_{β_{1} \dots β_{n}}^{i},

(15)

where Ω is the alphabet size (e.g., 4 for mono- and 16 for dinucleotides). Finally, the “height” of each word is defined as

h_{α_{1} \dots α_{n}}^{i} = R^{i} P_{α_{1} \dots α_{n}}^{i},

(16)

similarly to letter heights in standard PWM logos (Crooks et al. 2004).

We use word heights to compare two motifs with the same number of energy parameters. We allow one motif to be offset from the other by up to 4 bases in either direction. We also create a reverse complement of the other motif, testing 18 alignments in total. For each alignment, a linear correlation between the two sets of word heights is computed; the best alignment is that with the maximum correlation. Note that when the two motifs are offset from one another, only the subset of word heights that corresponds to the aligned part is taken into account. Figure S1 shows a histogram of motif correlations among randomly generated mononucleotide models with $L = 10.$ To generate energy parameters for random motifs, we start with all parameters set to zero. Then for each letter α at each position i we draw a number from a normal distribution with zero mean and standard deviation 2. We add this number to the current $ε_{α}^{i}$ and enforce the constraints using Equation 13. In the end, we compute word heights via Equation 16. Using this histogram, we approximate the P-value for a mononucleotide motif correlation of r as the fraction of correlations $\geq r .$ The P-values for dinucleotide motifs are obtained similarly.

IntervalLogo

We have devised a novel way of displaying parameter confidence intervals in sequence logos, called IntervalLogo. We define the confidence interval for each energy parameter as the range of values of that parameter within which the correlation $r \geq 0.98 r_{best}$ when the parameter is changed and all the other parameters are held fixed, except to satisfy the constraints. We demonstrate our procedure using the mononucleotide model; more complex models are treated in a similar manner. Let us say that $δ_{α}^{i}, {ε_{β}^{i}}_{β \neq α}$ are the values of the energy parameters at the confidence boundary. Here, $δ_{α}^{i}$ is the parameter modified explicitly, and ${ε_{β}^{i}}_{β \neq α}$ are the other parameters offset to satisfy the constraint (Equation 13). Given these energies, the probability of each letter at both confidence boundaries is defined by Equation 14, the information content by Equation 15, and the word heights $h_{α}^{upper, i}$ and $h_{α}^{lower, i}$ by Equation 16. By construction, we enforce $h_{α}^{upper, i} \geq h_{α}^{i} \geq h_{α}^{lower, i}$ , where $h_{α}^{i}$ is given by Equation 16 with best-fit parameters. Note that due to the nonmonotonic relationship between energy parameters and word heights, it is possible to have $h_{α}^{upper, i}$ and $h_{α}^{lower, i}$ associated with either end of the confidence interval for the energy parameter in question. Occasionally, $h_{α}^{upper, i} > h_{α}^{i}$ and $h_{α}^{lower, i} > h_{α}^{i}$ , where $h_{α}^{upper, i}$ is now associated with the high-energy boundary for the energy parameter, and $h_{α}^{lower, i}$ is associated with the low-energy boundary. In this case, we simply set $h_{α}^{lower, i} = h_{α}^{i} .$ Likewise, if $h_{α}^{lower, i} < h_{α}^{i}$ and $h_{α}^{upper, i} < h_{α}^{i}$ , we set $h_{α}^{upper, i} = h_{α}^{i} .$

In the IntervalLogo, the height of each letter is given by $h_{α}^{i};$ the letters are sorted by their heights $h_{α}^{lower, i}$ at the lower confidence boundary, so that letters with smaller $h_{α}^{i}$ and small errors occasionally appear on top of the letters with larger $h_{α}^{i}$ but also larger uncertainties. To show confidence boundaries in the logo, we draw the letter with light, medium, and dark colors (Figure 1; the choice of the overall color is arbitrary). The light color, representing the larger word height, is painted from the top of the letter down a distance $u_{α}^{i} = \min (h_{α}^{upper, i} - h_{α}^{i}, h_{α}^{i})$ (so that if $h_{α}^{upper, i} - h_{α}^{i} \geq h_{α}^{i}$ , the light color is painted over the whole letter). The dark color, representing the smaller word height, is painted from the bottom of the letter up a distance $l_{α}^{i} = h_{α}^{i} - h_{α}^{lower, i} .$ If there is no overlap, the medium color appears between the light and dark regions. Otherwise, the light and dark colors are striped in the region of overlap, and the medium color is not visible.

IntervalLogo deconstruction. The height, h, of a letter in a sequence logo represents the degree of preference for that letter (Equation 16) (Crooks *et al.* 2004). IntervalLogo shows confidence intervals for this height, using coloration instead of traditional error bars. (A) The best-fit values of model parameters lead to a height h for the letter “A” at some position. At the lower confidence boundary, the height would be $h^{lower};$ at the upper confidence boundary, the height would be $h^{upper} .$ (B) The height $l = h - h^{lower}$ shows how much shorter the letter would be at the lower confidence boundary, while the height $u = h^{upper} - h$ shows how much taller the letter would be at the upper confidence boundary. (C) The IntervalLogo color scheme. The height l of the dark band at the bottom shows how much shorter the letter would be at the lower confidence boundary, while the height u of the light band at the top shows how much taller the letter would be at the upper confidence boundary. If the light and dark bands were wide enough to overlap, the area of the overlap would be striped, and the medium color would not be visible. All symbols are as defined in *Materials and Methods*, but with subscripts and superscripts suppressed for clarity.

Mononucleotide logos for higher-order models

We can use a conventional mononucleotide sequence logo to represent higher-order motifs. To accomplish that, we predict the probability ${\tilde{P}}_{α}^{i}$ of finding base α at position i within the binding site, for all possible sites of length L. For mononucleotide motifs, ${\tilde{P}}_{α}^{i} = P_{α}^{i}$ from Equation 14; for higher-order motifs a recursive calculation is carried out.

Specifically, we need to compute

{\tilde{P}}_{α}^{i} = \frac{\sum_{s_{α_{i}}} e^{- E (s_{α_{i}})}}{\sum_{s} e^{- E (s)}},

(17)

where s is a set of all sequences of length L, and $s_{α_{i}}$ is the subset of these sequences with the nucleotide α at position i. To compute ${\tilde{P}}_{α}^{i}$ recursively, we first define $B_{α_{i} \dots α_{j}}^{k}$ , the product of all Boltzmann weights due to substrings of the word $α_{i} \dots α_{j}$ covering position k, where $i \leq k \leq j .$ $B_{α_{i} \dots α_{j}}^{k}$ is given by

B_{α_{i} \dots α_{j}}^{k} = \prod_{n = 1}^{N} \prod_{l = i}^{k} e^{- ε_{α_{l} \dots α_{l + n - 1}}^{l} θ_{(l + n - k - 1)} θ_{(j + 1 - n - l)}} .

(18)

We now introduce the forward partition function

Z_{α_{i} \dots α_{j}}^{f} = \sum_{α_{i - 1}} Z_{α_{i - 1} \dots α_{j - 1}}^{f} B_{α_{i - 1} \dots α_{j}}^{j},

(19)

where $j - i = N - 2$ , and the initial condition is that $Z_{α_{1} \dots α_{i}}^{f} = \prod_{j = 1}^{i} B_{α_{1} \dots α_{j}}^{j}$ for $i < N .$ Note that with $k = j$ and $j - i = N - 2$ , Equation 18 reduces to $B_{α_{i} \dots α_{j}}^{j} = \prod_{n = 1}^{N} e^{- ε_{α_{j - n + 1} \dots α_{j}}^{i - n + 1}} .$ The backward partition function is given by

Z_{α_{i} \dots α_{j}}^{r} = \sum_{α_{j + 1}} Z_{α_{i + 1} \dots α_{j + 1}}^{r} B_{α_{i} \dots α_{j + 1}}^{i},

(20)

where $j - i = N - 2$ , $B_{α_{i} \dots α_{j}}^{i} = \prod_{n = 1}^{N} e^{- ε_{α_{i} \dots α_{i + n - 1}}^{i}}$ , and the initial condition is that $Z_{α_{i} \dots α_{L}}^{r} = \prod_{j = L - N + 2}^{L} B_{α_{j} \dots α_{L}}^{j}$ for $i > L - N + 1.$

We then compute the full partition function from $Z^{f}$ or $Z^{r} :$

Z = \sum_{α_{L - N + 2}} \dots \sum_{α_{L}} Z_{α_{L - N + 2} \dots α_{L}}^{f} = \sum_{α_{1}} \dots \sum_{α_{N - 1}} Z_{α_{1} \dots α_{N - 1}}^{r} .

(21)

Finally, the marginalized probability is

{\tilde{P}}_{α}^{i} = {\begin{array}{l} Z^{- 1} \sum_{α_{1} \dots α_{i - 1}} \sum_{α_{i + 1} \dots α_{i + N - 1}} Z_{α_{1} \dots α_{i - 1}}^{f} B_{α_{1} \dots α_{i + N - 1}}^{i} Z_{α_{i + 1} \dots α_{i + N - 1}}^{r} \\ if i < N, \\ Z^{- 1} \sum_{α_{i - N + 1} \dots α_{i - 1}} \sum_{α_{i + 1} \dots α_{i + N - 1}} Z_{α_{i - N + 1} \dots α_{i - 1}}^{f} B_{α_{i - N + 1} \dots α_{i + N - 1}}^{i} Z_{α_{i + 1} \dots α_{i + N - 1}}^{r} \\ if N \leq i \leq L - N + 1, \\ Z^{- 1} \sum_{α_{i - N + 1} \dots α_{i - 1}} \sum_{α_{i + 1} \dots α_{L}} Z_{α_{i - N + 1} \dots α_{i - 1}}^{f} B_{α_{i - N + 1} \dots α_{L}}^{i} Z_{α_{i + 1} \dots α_{L}}^{r} \\ if L - N + 1 < i, \end{array}

(22)

where the $Z_{α_{i} \dots α_{j}}^{f}$ and $Z_{α_{i} \dots α_{j}}^{r}$ terms are set to 1 whenever $i > j .$ Specifically, $Z_{α_{1} α_{0}}^{f} = 1$ if $i = 1$ and $Z_{α_{L + 1} α_{L}}^{r} = 1$ if $i = L$ in Equation 22.

Results

Overview of the modeling procedure

We have developed an algorithm, BindSter, which predicts TF–DNA binding energetics using a biophysical approach (Figure 2). The algorithm assigns a binding energy $E_{i}$ to every site $i \dots i + L - 1$ of length L on both DNA strands. The energy is computed using a hierarchy of models of increasing complexity (Materials and Methods). The most basic of these, the mononucleotide model, assumes that the contribution of each DNA base pair to the total binding energy is independent of the other base pairs in the site (Equation 7).

Overview of BindSter, a biophysical algorithm for predicting protein–DNA binding specificity from high-throughput data. (A) MITOMI or PBM data are used as input to a series of Try2Step regressions employed to optimize model parameters. The fitted model (here, the Ace2 mononucleotide model) can be displayed as a sequence logo (*e.g.*, an IntervalLogo; see *Materials and Methods* for details) and used to compute a TF occupancy profile for each DNA probe. (B) Histogram of the Cy5 to BODIPY ratio, which quantifies the level of TF binding, from the Ace2 MITOMI assay (Fordyce *et al.* 2010). (C) Multiple regressions on the Ace2 MITOMI data. Each regression attempts to maximize the linear correlation coefficient between probe fluorescence ratios and the average number of particles bound to each probe (see *Materials and Methods*). The regression leading to the highest correlation is plotted in dark blue. (D) Predicted average number of particles *vs.* observed fluorescence ratios for each probe from the Ace2 MITOMI assay. The predictions are made using the mononucleotide model.

The mononucleotide model can be extended by including either energies of dinucleotides (adjacent base pairs; Equation 8) or k-mers of maximum length N that can occur anywhere within the binding site (Equation 9). The dinucleotide model is designed to capture sequence specificity of DNA bending, which plays a major role in protein–DNA recognition (Morozov et al. 2005; Morozov and Siggia 2007; Rohs et al. 2010) and is typically described at the level of base-stacking energies (Olson et al. 1998). The k-mer model is capable of accounting for sequence-specific experimental biases (Weirauch et al. 2013), as well as capturing true sequence preferences for longer motifs. Similar models have been previously used to describe sequence preferences of nucleosome formation (Tillo and Hughes 2009; Locke et al. 2010; Chereji and Morozov 2011). In addition, it is impractical to extend the dinucleotide model to longer words, due to an excessively large number of fitting parameters. We have also studied a PI model in which protein–DNA energetics are described purely through k-mer contributions (Equation 11).

We assume thermodynamic equilibrium between proteins and DNA probes. Under this assumption, we compute the average total number of proteins bound to each DNA probe, using efficient transfer matrix-like algorithms (Materials and Methods) (Morozov et al. 2009; Locke et al. 2010). Unlike previous approaches (Foat et al. 2006; Zhao et al. 2012) that assume low protein concentrations, our algorithm sums over all configurations of DNA-bound proteins allowed by steric exclusion. Furthermore, several species of DNA-binding factors can compete for positions on the DNA probe, allowing us to fit two models of protein–DNA interactions simultaneously to detect alternative binding modes.

Ordinarily, BindSter regressions start from a set of random points in the parameter space, so that the fits are not biased by initial conditions. In contrast, MatrixREDUCE employs motifs overrepresented in probe sequences as seeds for subsequent position-specific affinity matrix fits (Foat et al. 2006; Fordyce et al. 2010). Input data for BindSter come in the form of the ratio of Cy5 to BODIPY fluorescence intensity for MITOMI devices (Fordyce et al. 2010) and Alexa488 to Cy3 for PBMs (Berger et al. 2006; Badis et al. 2009; Zhu et al. 2009) (Figure 2B). Each regression attempts to maximize a linear correlation coefficient r between measured fluorescence ratios and predicted protein occupancy (Figure 2, C and D). Parameter optimization is carried out using a simple derivative-free algorithm that we call Try2Step (Tolkunov and Morozov 2012) (Materials and Methods). Briefly, the algorithm tries out modifications of the current parameter set and accepts the first one that increases r; if no such modifications are found after 200 steps, the parameter set that decreased r the least is accepted. This offers a possibility of surmounting barriers on the landscape defined by the dependence of r on model parameters, although most accepted steps do lead to improvements (Figure 2C). The regression is stopped once improvements to r become too small to matter. After 80 independent regressions, the model that yields the highest r is retained as the final prediction (Figure 2D). There are significant computational costs associated with our regression approach, primarily due to its global, derivative-free nature. Each prediction consists of a set of final model parameters, typically shown as a sequence logo, and of the protein occupancy profile for each DNA probe.

Sensitivity of prediction quality to model parameter values

Each BindSter run yields a set of model parameters that fit the data best. However, focusing only on best-fit models overlooks the fact that fit quality may be insensitive to the exact values of some of the fitting parameters. If the dependence of r on a given parameter is weak, different algorithms may yield apparently different motifs that would in fact have equivalent predictive power. To address this issue, we have developed a graphical representation of parameter uncertainties, called an IntervalLogo (see Materials and Methods for details). Similar to regular logos, in IntervalLogos letter heights correspond to predictions based on the best-fit model. In addition, however, color intensities and patterns are used to display the confidence interval associated with each letter (Figure 1). The intervals are determined by increasing/decreasing model parameters one at a time until r drops to 0.98 of its optimal value found in the best-model fit.

The mononucleotide model and comparison with previous work

We have fitted the mononucleotide model to 28 S. cerevisiae TFs with MITOMI data (Fordyce et al. 2010) (Figure 3A, Table S1). Our results can be compared with previous fits of a PWM-like model to the same data, which employed MatrixREDUCE (Fordyce et al. 2010). When the same length of the binding motif is used ( $L = 8$ ), our results are very similar to MatrixREDUCE predictions (P = 0.31, circles in Figure 3A). However, we find that prediction quality improves when motifs are longer (compare crosses and circles in Figure 3A) and therefore use $L = 10$ in all subsequent fits. Despite comparable overall predictive power, MatrixREDUCE and BindSter sometimes yield significantly different motifs (Table S1, Figure 3C), indicating that (a) parameter uncertainty may be significant at some positions and (b) more than one model of binding specificity may explain the same data equally well.

Protein–DNA interaction motifs obtained from MITOMI data. (A) Comparison of MatrixREDUCE and BindSter linear correlation coefficients between predicted and observed TF occupancies of each DNA probe. BindSter fits were carried out using a mononucleotide model with $L = 8$ and $L = 10$ and compared with $L = 8$ MatrixREDUCE results reported in Fordyce *et al.* (2010). The difference between BindSter and MatrixREDUCE linear correlation coefficients averaged over all TFs is denoted by μ; P-values for the two distributions of correlation coefficients are computed using a Wilcoxon signed-rank test. (B) Histogram of changes in correlation coefficients between extended (N = 3 k-mer, dinucleotide, and two-motif) and mononucleotide models. The difference between extended and mononucleotide model correlation coefficients averaged over all TFs is denoted by μ; P-values are computed as in A. (C) Motif predictions for Cup9: MatrixREDUCE mononucleotide logo, BindSter mononucleotide IntervalLogo, BindSter dinucleotide IntervalLogo, and the projected logo based on the dinucleotide model (see *Materials and Methods*). See Table S1 for all other TFs and the k-mer model.

To investigate this issue further, we have focused on mononucleotide BindSter fits for TFs Gcn4 and Cbf1, which produced non-canonical motifs (Table S1). The canonical AP-1 motif for Gcn4 is TGA(C/G)TCA (Ellenberger et al. 1992), whereas our fit finds TCA(C/G)TCA; the Cbf1 motif is CACGTG (Kuras et al. 1996), whereas our fit finds CACGTC. To investigate whether MITOMI data could be fitted equally well with canonical motifs, we performed seeded BindSter fits. Instead of starting the fits with random energy matrices as described above, we created seed motifs that prefer the canonical sites TGA(C/G)TCA (Gcn4) and CACGTG (Cbf1). At each position, energies of preferred letters were set 4 $k_{B} T$ lower than energies of nonpreferred letters. The relative position of the canonical site within the longer $L = 10$ motif was chosen to match the unseeded prediction (we have found that varying this position does not strongly affect the quality of fit). All energies outside of the canonical site were initialized to zero. For each seed motif, we performed four Try2Step regressions, reporting the result with the highest correlation to data. Our predictions are shown in Figure S2. For Gcn4, the seeded fit had nearly equivalent predictive power ( $r = 0.97$ vs. $r = 0.98$ for the randomly initialized fit). For Cbf1, the seeded fit performed marginally less well, achieving $r = 0.83$ vs. $r = 0.87$ but yielding a significantly different motif ( $r_{motif} = 0.67$ with the original prediction). Since the seeded fits did not result in a significant drop in performance, it appears that limited sequence diversity and measurement noise levels in MITOMI data sets may prevent reliable differentiation between closely related models. It is also possible that the MITOMI procedure, which involves binding TFs to the surface with antibodies, may be subtly altering the properties of the protein–DNA binding interface, perhaps enhancing monomer instead of dimer binding in some cases.

Note that in all MITOMI fits we have used only the 52-bp variable region from each DNA probe; refitting the $L = 10$ mononucleotide model with the entire 70-bp probe sequence results in a modest improvement compared to the result shown in Figure 3A: the average improvement in correlation coefficients μ is 0.013, with $P = 1.2 e - 4.$ The average motif correlation between 52- and 70-bp fits is 0.86, with motifs produced by poorly fitting models being more variable.

Steric exclusion

One advantage of BindSter is its rigorous treatment of steric exclusion, which makes our algorithm applicable to arbitrary protein concentrations (at low concentrations, multiple proteins bound to the same DNA probe are rare and steric exclusion is not important). The additional computational effort is modest since we use an efficient recursive algorithm to calculate DNA probe occupancies (Materials and Methods). To investigate the extent to which neglecting steric exclusion affects the quality of predictions, we have compared the mononucleotide model to a simple model without steric exclusion, which treats each binding position as independent of all the other positions on the DNA probe (Materials and Methods). We find that rigorous treatment of steric exclusion has minimal effect on the predictive power of the model trained on MITOMI data (Figure S3). Both the correlation coefficients (Figure S3A) and the motifs (Figure S3B) are very similar, although there are occasionally minor differences in probe occupancy profiles. For example, Rox1 profiles are somewhat different with and without steric exclusion (Figure S3C), probably due to the preference for T/C at position 2 in the latter fit (Figure S3B). Our findings are consistent with the MITOMI experimental setup in which proteins are immobilized and DNA is recruited to the surface (Fordyce et al. 2010). It appears that in MITOMI experiments the effective concentration of immobilized proteins is relatively low, making steric exclusion a minor factor.

Higher-order models of protein–DNA energetics

The mononucleotide model may not fully capture the complexity of amino acid–DNA base pair interactions at the protein binding interface (Luscombe et al. 2001), including the role of DNA shape in protein recognition (Rohs et al. 2010). To determine whether energetic contributions of longer motifs improve predictive performance, we have tested several extensions of the basic mononucleotide model (Figure 3B, Figure S4, Table S1, and Table S2; see Materials and Methods for model definitions). The extended models yielding the best improvements are the $N = 3$ k-mer model (where N is the maximum word length) and the dinucleotide model (top and middle panels in Figure 3B; Table S1). Interestingly, the $N = 2$ k-mer model has significantly less predictive power (Figure S4, top).

Either the $N = 2$ or the $N = 3$ position-independent model fitted simultaneously with the mononucleotide one does not substantially improve performance (Figure S4, two middle panels). Thus there is scant evidence in MITOMI data for nearly nonspecific binding driven by preferences for short sequence motifs, regardless of their position within the binding site. Finally, whereas fitting two mononucleotide models simultaneously is beneficial (the two-motif model in Figure 3B, bottom; Table S2), fitting one mononucleotide parameter set first and then fitting the other one on the residual signal is not (the secondary motif model in Figure S4, bottom; Table S2). Since the parameter spaces for these two models are equivalent, the issue lies with convergence: simultaneous fits achieve better optimization than sequential ones, probably because of greater flexibility afforded by a higher-dimensional optimization in the former case. Correspondingly, secondary motifs are often less specific and have little predictive power by themselves (cf. $r_{alone}$ values for the secondary model motif₂ in Table S2). In contrast, motif₂ fits in the two-motif model are more predictive and the logos are often visibly related to motif₁ and mononucleotide model fits (Table S2). The differences between motif₁ and motif₂ predictions may sometimes be interpreted as alternative binding modes. For example, in the case of Gcn4 motif₁ is close to the symmetric AP-1 site sequence (TGA(C/G)TCA) that corresponds to homodimeric binding (Ellenberger et al. 1992); motif₂, on the other hand, may reflect monomeric binding to the TCA half-site (note, however, that C at position 7 has a large uncertainty associated with it). Consistent with the original study (Fordyce et al. 2010), we do not see evidence for Gcn4 binding to longer ATF/CREB (TGACGTCA) sites (Konig and Richmond 1993) in MITOMI data.

Interestingly, motif specificity (as seen in the heights of logo letters) does not necessarily confer predictive power, as has been previously observed (Weirauch et al. 2013). For example, the mononucleotide model prediction for the TF Cup9 is significantly more specific than the MatrixREDUCE one (compare Figure 3C, top left and bottom left), but their linear correlation coefficients are fairly similar ( $r = 0.56$ and $r = 0.49$ , respectively). The dinucleotide model, which provides a substantial improvement ( $r = 0.73$ ), yields a motif that appears less specific still, although there are several prominent dinucleotide contributions (Figure 3C, top right and bottom right). Note that the dinucleotide model’s projected logo in Figure 3C is constructed using the full dinucleotide parameter set, as described in Materials and Methods.

In general, it is challenging to associate improvements in performance of higher-order models with a particular physical mechanism. For example, in the case of Cup9 the k-mer and dinucleotide models yield very similar correlation coefficients (0.71 and 0.73, respectively; Table S1), yet their physical underpinnings are completely different. The dinucleotide model is designed to capture DNA base pair stacking energies, whereas the k-mer model assigns energies to short words with lengths of 2 and 3 nucleotides independently of their position within the binding site, reflecting global preferences for these short motifs. Nonetheless, in the scatter plot of predicted vs. observed DNA probe occupancies (Figure S5A), predictions of both models colocalize for high-occupancy probes and are more accurate compared with the mononucleotide model. Furthermore, predictions of both higher-order models correlate equally well with the mononucleotide model and again colocalize for high-occupancy probes (Figure S5B). A direct comparison of k-mer and dinucleotide models reveals that they predict very similar distributions of probe occupancies ( $r = 0.90;$ Figure S5C). Similar behavior is observed across the entire MITOMI data set: k-mer and dinucleotide model predictions, and in particular their deviations from the mononucleotide model, are highly correlated (Figure S5, D and E). Thus the same signal is captured using two distinct higher-order models, which generalize the mononucleotide model in different ways.

The above observations (Figure 3B, Figure S4) lead us to conclude that modifying basic mononucleotide energetics appears to be more beneficial than introducing alternative binding modes, as far as extracting protein–DNA binding specificities from MITOMI data is concerned. Indeed, according to our fits there is relatively little evidence that TFs in the MITOMI experiments we analyzed switch between binding modes and in addition to the primary motif bind an alternative motif that could be characterized by either word counts (PI models) or another mononucleotide parameter set (two-motif and secondary-motif models). The most successful approach of this type, the two-motif fit, typically yields primary and secondary motifs that are clearly related (Table S2).

Cross-validation of BindSter fits

For every TF, the MITOMI data set provides ∼3700 DNA probe occupancy measurements; the full data set contains two complete replicates and an incomplete replicate with a subset of probes from the complete one (Fordyce et al. 2010). Thus the number of free parameters in all our models (Materials and Methods) is significantly less than the total number of available measurements. Nonetheless, we have cross-validated our mononucleotide fits by comparing the models trained on first and second complete replicates with each other and with the model trained on all available data (Figure S6, Table S3, Table S4). Models trained on single replicates produce distributions of probe occupancies that are highly correlated with the full model (Figure S6A) and have similar predictive power against the data (Table S3). The fitted model parameters also tend to be similar, as measured by motif correlation coefficients against the mononucleotide model trained on the entire data set (Figure S6B, Table S4). As a rule, predictions with high correlation coefficients against the data tend to yield nearly identical motifs (see, e.g., Rox1 and Sko1 examples in Figure S6C); motifs vary more for lower-quality predictions (see, e.g., the Cad1 example in Figure S6C).

It is also possible that higher-order models fit the data better simply because they have more fitting parameters. However, there is no clear connection between the number of additional parameters and model performance. For example, the best-performing models in Figure 3B and Figure S4 are the dinucleotide model (112 independent parameters compared with 31 for the mononucleotide model) and the k-mer model (67 independent parameters with $N = 3$ ). Despite many more parameters in the dinucleotide model, its performance is quite similar to that of the k-mer model, as discussed above.

The comparison between the k-mer and mononucleotide + PI models is particularly instructive. The parameter sets employed in these two models are nearly identical, but k-mer models typically outperform their PI counterparts. The physical mechanisms of protein–DNA interactions described by the two models are very different. In the mononucleotide + PI model, TFs are assumed to bind in two distinct modes (Materials and Methods). In the k-mer model, there is a single binding mode whose sequence preferences are described by a combination of position-specific mononucleotide parameters and position-independent contributions from longer words. Thus merely adding parameters will not significantly improve predictive performance unless the underlying physical model captures some essential aspects of protein–DNA energetics.

Since the dinucleotide model has the most fitting parameters, we have also cross-validated it directly (Table S3, Table S4). As with the mononucleotide model, predictions of DNA probe occupancies are largely consistent across the two models trained on separate replicates, with slight loss of predictive power that becomes more pronounced if the correlation coefficients from all models (replicate 1, replicate 2, and full) against the data are low overall (Table S3). However, dinucleotide motifs fitted on replicates are much more variable than their mononucleotide counterparts (Table S4). For example, Ace2 models fitted on separate replicates have the same predictive power as the full model. Nonetheless, the corresponding motifs are correlated with the full model motif only at $r = 0.51 (P < 10^{- 6})$ and $r = 0.68 (P < 10^{- 6})$ , respectively. For Met31, the dinucleotide motif fit on replicate 1 is correlated with the full motif at $r = 0.85 (P < 10^{- 6})$ , whereas the replicate 2 fit yields $r = 0.37 (P = 2.3 \times 10^{- 5}) .$ Interestingly, both motifs predict DNA probe occupancies equally well ( $r = 0.82$ and 0.80, respectively) and comparably to the full model prediction ( $r = 0.85$ ). Overall, these observations underscore the fact that multiple models, and especially multiple higher-order models, can fit MITOMI data equally well.

Comparison of PBM and MITOMI-based predictions

Predictions of protein–DNA binding specificity should not depend on the experimental platform being used to measure protein–DNA interactions. To compare the robustness of BindSter predictions across MITOMI and PBM platforms, we have carried out mononucleotide, dinucleotide, and N = 3 k-mer fits for 12 TFs for which both PBM and MITOMI measurements are available (Table S5). We have also compared our results with previously published BEEML-PBM mononucleotide fits (Zhao and Stormo 2011; Zhao et al. 2012).

One essential difference between BindSter MITOMI and PBM fits is the use of location bias in the latter. The location bias in our model is a quadratic energy contribution with two free parameters per DNA-binding species (Equation 12). It is designed to make binding sites closer to the glass slide less favorable, due to their lower accessibility for TFs (Weirauch et al. 2013). Indeed, although the coefficients of the quadratic function are not constrained in any way in our fits, for most TFs they fit to values that impose energetic penalties on the proximity of the protein to the glass slide (Figure S7A). Including location bias into PBM fits leads to sizable improvements in prediction quality for mononucleotide and k-mer models (Figure S7B). In contrast, including location bias into MITOMI fits leads to $μ = 0.013$ , several times smaller than the improvements observed with PBM data. Although including location bias increases predictive power of PBM fits, mononucleotide models fitted with and without it typically yield similar motifs (Figure S7C).

On average, our mononucleotide predictions outperform those from the BEEML-PBM algorithm (Figure 4A, Table S5). As with MITOMI data, k-mer and dinucleotide models provide statistically significant improvements compared to mononucleotide fits. Consistency between PBM- and MITOMI-based motifs (as measured by motif correlations) improves with overall prediction quality on PBM data; in several cases higher-order models recover motifs that, when projected, are closer to the MITOMI-based mononucleotide motifs than mononucleotide models inferred from PBM data (Figure 4B). We also note that BEEML-PBM motifs have the highest average correlation with BindSter MITOMI motifs on this data set. For example, the core GACACA Cup9 motif seen in both BEEML-PBM- and MITOMI-based mononucleotide fits is recovered only at the dinucleotide or k-mer level with BindSter PBM fits (Figure 4C, Table S5). This may be due to substantial k-mer bias present in PBM data (Weirauch et al. 2013), which requires use of higher-order models.

Protein–DNA interaction motifs obtained from PBM data. (A) Comparison of linear correlation coefficients predicted using the mononucleotide model with BEEML-PBM (Zhao and Stormo 2011; Zhao *et al.* 2012), dinucleotide, and k-mer models. All models are fitted on PBM data. The average difference between linear correlation coefficients in each comparison is denoted by μ; P-values for the two distributions of correlation coefficients are computed using a Wilcoxon signed-rank test. (B) Motif correlations between the mononucleotide BindSter model fit on MITOMI data and three BindSter models (mononucleotide, k-mer, and dinucleotide) plus the BEEML-PBM model fit on PBM data. Dinucleotide and k-mer motifs are projected onto the mononucleotide parameter set for this comparison (see *Materials and Methods*). All motif correlations are sorted by the linear correlation coefficient obtained by fitting the mononucleotide BindSter model to PBM data; μ indicates the average correlation coefficient for each model. Subscripts on TF names denote the source of the PBM data (see Table S5 for details). (C) Representative logos for MITOMI mononucleotide, BEEML-PBM, PBM mononucleotide, and PBM dinucleotide fits (see Table S5 for complete results). The dinucleotide model is represented by its projected logo.

Predictive power of MITOMI-trained models on PBM data is modest overall, although it does improve if location bias is included (Table S6). As a rule, PBM-based fits yield lower correlation coefficients and partial or less specific motifs. For example, for Bas1 neither BEEML-PBM nor BindSter fits recover the highly predictive motif ( $r = 0.83$ ) inferred from MITOMI data, and even the high-quality prediction of the canonical Cbf1 motif with PBM-based mono- and dinucleotide models has less information content than the corresponding MITOMI result (Figure 4C, Table S5). Note, however, that with MITOMI data the symmetric CACGTG motif is replaced with GACGTG, perhaps as a sign of an alternative binding mode; PBM fits on the other hand yield the expected symmetric motif. Interestingly, the Cbf1 motif becomes significantly more specific when the potential k-mer bias is taken into account (Table S5).

Discussion

We have developed a biophysical algorithm, BindSter, for predicting protein–DNA binding energetics from high-throughput in vitro measurements of total protein occupancies on a set of DNA probes. We have focused on MITOMI, a technique that uses a microfluidic device to measure binding interactions at thermodynamic equilibrium (Maerkl and Quake 2007; Fordyce et al. 2010). The MITOMI binding data are thus well suited for computational modeling based on equilibrium statistical mechanics, although the MITOMI approach is less high-throughput than PBMs (Berger et al. 2006; Badis et al. 2008, 2009; Zhu et al. 2009).

Our computational framework differs from previously published biophysical models of protein–DNA interactions (Foat et al. 2006; Zhao and Stormo 2011; Zhao et al. 2012) in two important aspects: (a) it employs a rigorous procedure in which all sterically allowed configurations of TFs bound to DNA probes are taken into account and (b) it is designed to accommodate a variety of protein–DNA interaction energy models. Thus, in contrast to previous work, BindSter is able to provide a quantitative treatment of overlapping binding sites and multiple DNA-binding species (as may occur if the TF switches between dimeric and monomeric binding or employs distinct binding interfaces to interact with different classes of sites). This built-in flexibility with respect to describing protein–DNA energetics has allowed us to explore a hierarchy of models of increasing complexity.

The most basic mononucleotide model, in which each nucleotide in the binding site contributes independently to the total protein–DNA binding energy, has predictive power on MITOMI data comparable to that of MatrixREDUCE (Fordyce et al. 2010). However, many BindSter motifs look substantially different (Table S1). Some of this discrepancy may be attributed to the fact that the models are insensitive to the values of a subset of fitting parameters. Such parameters could therefore change within a sizable range with little consequences for the quality of fit. To account for this possibility, we have developed a novel way of graphically displaying the uncertainty of model parameters, called IntervalLogos.

IntervalLogos are constructed as regular logos, but with letter color intensities and patterns used to indicate the confidence interval associated with each parameter as it is varied independently (Materials and Methods; Figure 1). However, visual inspection of mononucleotide and MatrixREDUCE motifs in Table S1 shows that even after accounting for parameter uncertainty the motifs remain related but distinct in many cases. Thus there is degeneracy among motifs that can explain MITOMI data equally well; moreover, there is no clear correlation between the specificity of a given motif and its predictive power. Finally, the role of steric exclusion is negligible in MITOMI experiments (Figure S3), likely due to the fact that proteins are immobilized on the surface (Fordyce et al. 2010), resulting in low effective protein concentrations.

Among seven extensions of the mononucleotide model that we have tested, the dinucleotide and k-mer models result in the most significant improvements to prediction quality (Figure 3B, Figure S4). Both of these models extend the mononucleotide description of protein–DNA energetics by including contributions of longer words. In contrast, models that test for the presence of alternative binding motifs fair less well, indicating that TFs in MITOMI experiments tend to have a single binding mode. The most successful model of the multiple-binding-mode type, in which two mononucleotide motifs are fitted simultaneously to the data, does result in a reasonable improvement (Figure 3B); some of the fitted model pairs can be interpreted as evidence of dimeric vs. monomeric binding or distinct subclasses of sequence motifs (Table S2). In general, it is challenging to associate an observed improvement in model performance with a particular physical mechanism of protein–DNA interactions. For example, k-mer and dinucleotide models tend to produce correlated predictions for high-occupancy DNA probes despite the fact that they are supposed to describe distinct aspects of protein–DNA energetics and/or experimental biases (Figure S5).

Finally, we find that BindSter fits on PBM data sometimes yield substantially different motifs compared to MITOMI-based fits, despite the fact that their predictive quality is as good as or better than previously published BEEML-PBM results (Zhao and Stormo 2011; Zhao et al. 2012) (Figure 4, Table S5). Motif correlations between MITOMI and PBM-fit models are modest overall, although they do improve with the quality of the PBM fit (Figure 4B). Correspondingly, the ability of MITOMI-trained models to predict the distribution of probe occupancies (fluorescent intensities) observed in PBM experiments also tends to be modest (Table S6). One potential explanation is limited applicability of thermodynamic models to the analysis of PBMs (Fordyce et al. 2010). Another possibility is experimental biases other than the k-mer bias and the location bias that are explicitly taken into account in the BindSter framework.

Supplementary Material

Supporting Information

supp_200_4_1349__index.html^{(3.1KB, html)}

Acknowledgments

A.V.M. acknowledges support from the National Institutes of Health (R01 HG004708) and an Alfred P. Sloan Research Fellowship.

Footnotes

Communicating editor: G. D. Stormo

Supporting information is available online at www.genetics.org/lookup/suppl/doi:10.1534/genetics.115.178384/-/DC1.

Literature Cited

Badis G., Chan E. T., van Bakel H., Pena-Castillo L., Tillo D., et al. , 2008. A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol. Cell 32: 878–887. [DOI] [PMC free article] [PubMed] [Google Scholar]
Badis G., M. F. Berger, A. A. Philippakis,S. Talukder, A. R. Gehrkeet al. , 2009. Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720–1723. [DOI] [PMC free article] [PubMed] [Google Scholar]
Berger M. F., Philippakis A. A., Qureshi A. M., He F. S., Estep P. W., et al. , 2006. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24: 1429–1435. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bulyk M. L., Huang X., Choo Y., Church G. M., 2001. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. USA 98: 7158–7163. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chereji R. V., Morozov A. V., 2011. Statistical mechanics of nucleosomes constrained by higher-order chromatin structure. J. Stat. Phys. 144: 379–404. [DOI] [PMC free article] [PubMed] [Google Scholar]
Crooks G. E., Hon G., Chandonia J. M., Brenner S. E., 2004. WebLogo: a sequence logo generator. Genome Res. 14: 1188–1190. [DOI] [PMC free article] [PubMed] [Google Scholar]
Durbin R., Eddy S., Krogh A., Mitchison G., 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK. [Google Scholar]
Ellenberger T. E., Brandl C. J., Struhl K., Harrison S. C., 1992. The GCN4 basic region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal structure of the protein-DNA complex. Cell 71: 1223–1237. [DOI] [PubMed] [Google Scholar]
Felsenfeld G., Groudine M., 2003. Controlling the double helix. Nature 421: 448–453. [DOI] [PubMed] [Google Scholar]
Foat B. C., Morozov A. V., Bussemaker H. J., 2006. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22: e141–e149. [DOI] [PubMed] [Google Scholar]
Fordyce P. M., Gerber D., Tran D., Zheng J., Li H., et al. , 2010. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat. Biotechnol. 28: 970–975. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gordan R., Hartemink A. J., Bulyk M. L., 2009. Distinguishing direct vs. indirect transcription factor-DNA interactions. Genome Res. 19: 2090–2100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Konig P., Richmond T. J., 1993. The X-ray structure of the GCN4-bZIP bound to ATF/CREB site DNA shows the complex depends on DNA flexibility. J. Mol. Biol. 233: 139–154. [DOI] [PubMed] [Google Scholar]
Kuras L., Cherest H., Surdin-Kerjan Y., Thomas D., 1996. A heteromeric complex containing the centromere binding factor 1 and two basic leucine zipper factors, Met4 and Met28, mediates the transcription activation of yeast sulfur metabolism. EMBO J. 15: 2519–2529. [PMC free article] [PubMed] [Google Scholar]
Locke G., Tolkunov D., Moqtaderi Z., Struhl K., Morozov A. V., 2010. High-throughput sequencing reveals a simple model of nucleosome energetics. Proc. Natl. Acad. Sci. USA 107: 20998–21003. [DOI] [PMC free article] [PubMed] [Google Scholar]
Luscombe N. M., Laskowski R. A., Thornton J. M., 2001. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res. 29: 2860–2874. [DOI] [PMC free article] [PubMed] [Google Scholar]
Maerkl S. J., Quake S. R., 2007. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315: 233–237. [DOI] [PubMed] [Google Scholar]
Morozov A. V., Siggia E. D., 2007. Connecting protein structure with predictions of regulatory sites. Proc. Natl. Acad. Sci. USA 104: 7068–7073. [DOI] [PMC free article] [PubMed] [Google Scholar]
Morozov A. V., Havranek J. J., Baker D., Siggia E. D., 2005. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 33: 5781–5798. [DOI] [PMC free article] [PubMed] [Google Scholar]
Morozov A. V., Fortney K., Gaykalova D. A., Studitsky V. M., Widom J., et al. , 2009. Using DNA mechanics to predict in vitro nucleosome positions and formation energies. Nucleic Acids Res. 37: 4707–4722. [DOI] [PMC free article] [PubMed] [Google Scholar]
Olson W. K., Gorin A. A., Lu X. J., Hock L. M., Zhurkin V. B., 1998. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc. Natl. Acad. Sci. USA 95: 11163–11168. [DOI] [PMC free article] [PubMed] [Google Scholar]
Park P. J., 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10: 669–680. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ren B., Robert F., Wyrick J. J., Aparicio O., Jennings E. G., et al. , 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309. [DOI] [PubMed] [Google Scholar]
Rhee H. S., Pugh B. F., 2011. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147: 1408–1419. [DOI] [PMC free article] [PubMed] [Google Scholar]
Richmond T. J., Davey C. A., 2003. The structure of DNA in the nucleosome core. Nature 423: 145–150. [DOI] [PubMed] [Google Scholar]
Rohs R., Jin X., West S. M., Joshi R., Honig B., et al. , 2010. Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem. 79: 233–269. [DOI] [PMC free article] [PubMed] [Google Scholar]
Segal E., Fondufe-Mittendorf Y., Chen L. Y., Thastrom A., Field Y., et al. , 2006. A genomic code for nucleosome positioning. Nature 442: 772–778. [DOI] [PMC free article] [PubMed] [Google Scholar]
Siggers T., Gordan R., 2014. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42: 2099–2111. [DOI] [PMC free article] [PubMed] [Google Scholar]
Stormo G. D., Fields D. S., 1998. Specificity, free energy and information content in protein-DNA interactions. Trends Biochem. Sci. 23: 109–113. [DOI] [PubMed] [Google Scholar]
Stormo G. D., Zhao Y., 2010. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 11: 751–760. [DOI] [PubMed] [Google Scholar]
Tillo D., Hughes T. R., 2009. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics 10: 442. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tolkunov D., Morozov A. V., 2012. Single temperature for Monte Carlo optimization on complex landscapes. Phys. Rev. Lett. 108: 250602. [DOI] [PMC free article] [PubMed] [Google Scholar]
Weirauch M. T., Cote A., Norel R., Annala M., Zhao Y., et al. , 2013. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31: 126–134. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao Y., Stormo G. D., 2011. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29: 480–483. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhao Y., Ruan S., Pandey M., Stormo G. D., 2012. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 191: 781–790. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhu C., K. J. Byers, R. P. McCord, Z. Shi, M. F. Berger, et al., 2009. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 19: 556–566. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supporting Information

supp_200_4_1349__index.html^{(3.1KB, html)}

10.1534_115.178384_genetics.115.178384-12.pdf^{(5.1MB, pdf)}

10.1534_115.178384_genetics.115.178384-4.pdf^{(28KB, pdf)}

10.1534_115.178384_genetics.115.178384-9.pdf^{(47KB, pdf)}

10.1534_115.178384_genetics.115.178384-1.pdf^{(102.9KB, pdf)}

10.1534_115.178384_genetics.115.178384-7.pdf^{(112.6KB, pdf)}

10.1534_115.178384_genetics.115.178384-14.pdf^{(37.2KB, pdf)}

10.1534_115.178384_genetics.115.178384-5.pdf^{(416.9KB, pdf)}

10.1534_115.178384_genetics.115.178384-11.pdf^{(146.7KB, pdf)}

10.1534_115.178384_genetics.115.178384-2.pdf^{(97.4KB, pdf)}

10.1534_115.178384_genetics.115.178384-10.pdf^{(2MB, pdf)}

10.1534_115.178384_genetics.115.178384-3.pdf^{(1.3MB, pdf)}

10.1534_115.178384_genetics.115.178384-8.pdf^{(20.3KB, pdf)}

10.1534_115.178384_genetics.115.178384-15.pdf^{(21.3KB, pdf)}

10.1534_115.178384_genetics.115.178384-6.pdf^{(977KB, pdf)}

10.1534_115.178384_genetics.115.178384-13.pdf^{(28.1KB, pdf)}

[bib1] Badis G., Chan E. T., van Bakel H., Pena-Castillo L., Tillo D., et al. , 2008. A library of yeast transcription factor motifs reveals a widespread function for Rsc3 in targeting nucleosome exclusion at promoters. Mol. Cell 32: 878–887. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] Badis G., M. F. Berger, A. A. Philippakis,S. Talukder, A. R. Gehrkeet al. , 2009. Diversity and complexity in DNA recognition by transcription factors. Science 324: 1720–1723. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] Berger M. F., Philippakis A. A., Qureshi A. M., He F. S., Estep P. W., et al. , 2006. Compact, universal DNA microarrays to comprehensively determine transcription-factor binding site specificities. Nat. Biotechnol. 24: 1429–1435. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib4] Bulyk M. L., Huang X., Choo Y., Church G. M., 2001. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. USA 98: 7158–7163. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib5] Chereji R. V., Morozov A. V., 2011. Statistical mechanics of nucleosomes constrained by higher-order chromatin structure. J. Stat. Phys. 144: 379–404. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] Crooks G. E., Hon G., Chandonia J. M., Brenner S. E., 2004. WebLogo: a sequence logo generator. Genome Res. 14: 1188–1190. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib7] Durbin R., Eddy S., Krogh A., Mitchison G., 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, UK. [Google Scholar]

[bib8] Ellenberger T. E., Brandl C. J., Struhl K., Harrison S. C., 1992. The GCN4 basic region leucine zipper binds DNA as a dimer of uninterrupted alpha helices: crystal structure of the protein-DNA complex. Cell 71: 1223–1237. [DOI] [PubMed] [Google Scholar]

[bib9] Felsenfeld G., Groudine M., 2003. Controlling the double helix. Nature 421: 448–453. [DOI] [PubMed] [Google Scholar]

[bib10] Foat B. C., Morozov A. V., Bussemaker H. J., 2006. Statistical mechanical modeling of genome-wide transcription factor occupancy data by MatrixREDUCE. Bioinformatics 22: e141–e149. [DOI] [PubMed] [Google Scholar]

[bib11] Fordyce P. M., Gerber D., Tran D., Zheng J., Li H., et al. , 2010. De novo identification and biophysical characterization of transcription-factor binding sites with microfluidic affinity analysis. Nat. Biotechnol. 28: 970–975. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] Gordan R., Hartemink A. J., Bulyk M. L., 2009. Distinguishing direct vs. indirect transcription factor-DNA interactions. Genome Res. 19: 2090–2100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib13] Konig P., Richmond T. J., 1993. The X-ray structure of the GCN4-bZIP bound to ATF/CREB site DNA shows the complex depends on DNA flexibility. J. Mol. Biol. 233: 139–154. [DOI] [PubMed] [Google Scholar]

[bib14] Kuras L., Cherest H., Surdin-Kerjan Y., Thomas D., 1996. A heteromeric complex containing the centromere binding factor 1 and two basic leucine zipper factors, Met4 and Met28, mediates the transcription activation of yeast sulfur metabolism. EMBO J. 15: 2519–2529. [PMC free article] [PubMed] [Google Scholar]

[bib15] Locke G., Tolkunov D., Moqtaderi Z., Struhl K., Morozov A. V., 2010. High-throughput sequencing reveals a simple model of nucleosome energetics. Proc. Natl. Acad. Sci. USA 107: 20998–21003. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib16] Luscombe N. M., Laskowski R. A., Thornton J. M., 2001. Amino acid-base interactions: a three-dimensional analysis of protein-DNA interactions at an atomic level. Nucleic Acids Res. 29: 2860–2874. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib17] Maerkl S. J., Quake S. R., 2007. A systems approach to measuring the binding energy landscapes of transcription factors. Science 315: 233–237. [DOI] [PubMed] [Google Scholar]

[bib18] Morozov A. V., Siggia E. D., 2007. Connecting protein structure with predictions of regulatory sites. Proc. Natl. Acad. Sci. USA 104: 7068–7073. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib19] Morozov A. V., Havranek J. J., Baker D., Siggia E. D., 2005. Protein-DNA binding specificity predictions with structural models. Nucleic Acids Res. 33: 5781–5798. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib20] Morozov A. V., Fortney K., Gaykalova D. A., Studitsky V. M., Widom J., et al. , 2009. Using DNA mechanics to predict in vitro nucleosome positions and formation energies. Nucleic Acids Res. 37: 4707–4722. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib21] Olson W. K., Gorin A. A., Lu X. J., Hock L. M., Zhurkin V. B., 1998. DNA sequence-dependent deformability deduced from protein-DNA crystal complexes. Proc. Natl. Acad. Sci. USA 95: 11163–11168. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib22] Park P. J., 2009. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10: 669–680. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib23] Ren B., Robert F., Wyrick J. J., Aparicio O., Jennings E. G., et al. , 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306–2309. [DOI] [PubMed] [Google Scholar]

[bib24] Rhee H. S., Pugh B. F., 2011. Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147: 1408–1419. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib25] Richmond T. J., Davey C. A., 2003. The structure of DNA in the nucleosome core. Nature 423: 145–150. [DOI] [PubMed] [Google Scholar]

[bib26] Rohs R., Jin X., West S. M., Joshi R., Honig B., et al. , 2010. Origins of specificity in protein-DNA recognition. Annu. Rev. Biochem. 79: 233–269. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib27] Segal E., Fondufe-Mittendorf Y., Chen L. Y., Thastrom A., Field Y., et al. , 2006. A genomic code for nucleosome positioning. Nature 442: 772–778. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib28] Siggers T., Gordan R., 2014. Protein-DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42: 2099–2111. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib29] Stormo G. D., Fields D. S., 1998. Specificity, free energy and information content in protein-DNA interactions. Trends Biochem. Sci. 23: 109–113. [DOI] [PubMed] [Google Scholar]

[bib30] Stormo G. D., Zhao Y., 2010. Determining the specificity of protein-DNA interactions. Nat. Rev. Genet. 11: 751–760. [DOI] [PubMed] [Google Scholar]

[bib31] Tillo D., Hughes T. R., 2009. G+C content dominates intrinsic nucleosome occupancy. BMC Bioinformatics 10: 442. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib32] Tolkunov D., Morozov A. V., 2012. Single temperature for Monte Carlo optimization on complex landscapes. Phys. Rev. Lett. 108: 250602. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib33] Weirauch M. T., Cote A., Norel R., Annala M., Zhao Y., et al. , 2013. Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31: 126–134. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib34] Zhao Y., Stormo G. D., 2011. Quantitative analysis demonstrates most transcription factors require only simple models of specificity. Nat. Biotechnol. 29: 480–483. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib35] Zhao Y., Ruan S., Pandey M., Stormo G. D., 2012. Improved models for transcription factor binding site identification using nonindependent interactions. Genetics 191: 781–790. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib36] Zhu C., K. J. Byers, R. P. McCord, Z. Shi, M. F. Berger, et al., 2009. High-resolution DNA-binding specificity analysis of yeast transcription factors. Genome Res. 19: 556–566. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

A Biophysical Approach to Predicting Protein–DNA Binding Energetics

George Locke

Alexandre V Morozov

Abstract

Materials and Methods

Physical model of protein–DNA interactions

Models of protein–DNA binding energy

The mononucleotide model:

The dinucleotide model:

The k-mer model:

The position-independent model:

Alternative motif models:

PBM location bias

Fitting procedure

Motif correlation

IntervalLogo

Figure 1.

Mononucleotide logos for higher-order models

Results

Overview of the modeling procedure

Figure 2.

Sensitivity of prediction quality to model parameter values

The mononucleotide model and comparison with previous work

Figure 3.

Steric exclusion

Higher-order models of protein–DNA energetics

Cross-validation of BindSter fits

Comparison of PBM and MITOMI-based predictions

Figure 4.

Discussion

Supplementary Material

Acknowledgments

Footnotes

Literature Cited

Associated Data

Supplementary Materials

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases