Nucleic Acids Research. 2018 Jan 27;46(8):e44. doi: 10.1093/nar/gky027

Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets

Jarkko Toivonen 1, Teemu Kivioja 2, Arttu Jolma 3, Yimeng Yin 3, Jussi Taipale 2,3,4, Esko Ukkonen 1,5
PMCID: PMC5934673  PMID: 29385521

Abstract

In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in modular fashion such that deviations from the expected motif would become explicit and the noise from dimeric occurrences would not corrupt monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations. Given training data and seeds for monomeric motifs, the algorithm learns in the same probabilistic framework a mixture model which represents monomeric motifs as standard position-specific probability matrices (PPMs), and dimeric motifs as pairs of monomeric PPMs, with associated orientation and spacing preferences. For dimers the model represents deviations from pure modular model of two independent monomers, thus making co-operative binding effects explicit. MODER can analyze in reasonable time tens of Mbps of training data. We validated the tool on HT-SELEX and ChIP-seq data. Our findings include some TFs whose expected model has palindromic symmetry but the observed model is directional.

INTRODUCTION

In transcriptional regulation, proteins called transcription factors (TFs) bind to specific DNA motifs, to have a regulatory effect on the transcription rate of particular genes. The regulating TFs may bind co-operatively in clusters of two or more factors which makes the regulation combinatorial by nature (1–4). Therefore, it is of interest not only to find the binding motifs for individual monomeric TFs but also motifs for dimeric and higher order co-operative binding of several TFs on the same regulatory area in DNA. With the massive training data currently available from, e.g. high-throughput SELEX (5,6) and ChIP-seq experiments (7), it is possible to learn complex binding models from quite weak signals.

In a large number of dimeric cases of TF binding, the specificity of the dimeric motif has recently been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other (4,6,8). Current automatic motif discovery tools do not learn monomeric and dimeric motifs soundly within one probabilistic framework in modular fashion such that the effects of co-operative binding on motifs could be shown and analyzed. In this paper, we propose such a learning algorithm and a software tool for modular discovery of monomeric and dimeric binding motifs for TFs.

The algorithm uses a class of probabilistic mixture models for (possibly multi-profile) monomeric binding motifs and all their dimeric combinations. Our model represents each monomeric motif as a standard position-specific probability matrix (PPM) (9,10). Each dimeric motif is represented in modular fashion as a pair of monomeric PPMs, with associated information on the relative orientation and spacing of the two monomeric components. In our model, the monomeric components need not be spatially separate but their sites may overlap; such overlaps have been reported, e.g. in (4,11). A novel feature of our model is that it includes a deviation matrix that represents explicitly how much the discovered dimeric PPM deviates from the expected PPM for independent component monomers. Another novelty is that monomeric and dimeric models are learned such that the effect of the noise from dimeric occurrences on monomeric models is minimized. Moreover, the mixing parameters of the model reveal the relative abundances of different motif combinations. In particular, the mixing parameters for the dimeric variants give precise quantitative indication of orientation and spacing preferences of the two monomers that make the dimer.

For learning our binding model we describe an expectation maximization (EM) algorithm (12), called MODER (MOtif DEtectoR). Given a data set of sequences that contain enriched motif instances, MODER learns by EM search the parameters of all model components simultaneously, as a mixture of several PPMs, by optimizing the alignment of the model with the training data using maximum likelihood estimation. The EM search is initialized with user-given seed sequences for the monomeric profiles of the model. It finds PPMs for the monomers as well as for their dimeric combinations within a given range of spacings and orientations. Higher-order combinations are not included, as they would exponentially increase the complexity and the size of the model. Monomer PPMs are learned using pruning techniques that minimize contamination from nearby motif occurrences and from background. The requirement to provide seeds is a limitation of MODER, which depends on prior knowledge (such as motif databases) or the use of other motif discovery algorithms. On the other hand, seed-based initialization makes MODER fast and capable of processing in reasonable time training data consisting of sequences that are hundreds of bps long and several Mbps in total size. MODER was designed for motif discovery from HT-SELEX reads, but other types of training data, such as ChIP-seq data sets, can be used as well.

Validation experiments of MODER show robust and fast performance both on HT-SELEX and ChIP-seq data. We applied MODER on six HT-SELEX data sets, each consisting of 10^5–10^6 reads of length 30 or 40, and found varying amounts of difference between observed and expected motifs: for example, for factors FLI1 and PKNOX2 the expected homodimeric model has palindromic symmetry but the observed model is directional, reconfirming an earlier observation in (6). From ChIP-seq data MODER finds for factor CTCF essentially the same dimeric model as reported in (13,14), and for modular receptor RXRA a dimeric model that the Tomtom tool (15) matches with a known RXRA heterodimer. For factor NRSF, MODER finds from ChIP-seq data essentially the same multi-profile model as in (16).

In previous research, a dimer model quite similar to ours but without explicit modular structure and overlaps of monomers within dimers was introduced, with an entropy minimization learning algorithm Bipad/Maskminent (17–19). Discovery of spaced dyads (pairs of relatively short motifs) was considered in (20,21). Gibbs sampling based BioProspector (22) is another early dimer search algorithm. Recent dimer prediction methods include SpaMo (23), iTFs (24), and TACO (25). All start from given monomeric PPMs and find, using thresholding, the occurrence sites of the PPMs in the training data. Then enrichment of specific spacings of pairs of occurrences is detected, with an analysis of the statistical significance but without an analysis of co-operative effects of dimer components. SpaMo was designed for finding preferred distances between the site of the primary TF and the sites of secondary TFs in ChIP-seq data. The dimer model of iTFs includes relative orientation of the components but it does not consider overlaps and uses binned distances. Finally, TACO’s model includes orientation and distance and allows the components to overlap, but does not analyze the effect of overlap on the binding profile.

Using the EM algorithm in motif discovery was initiated by Lawrence and Reilly (26) and was used for finding motifs with spacers by Cardon and Stormo (27). The mixture model and the EM learning of MODER generalize the techniques of MEME (28,29) to the multi-profile dimeric case. As compared to MEME, an important feature of MODER is that it learns all submodels simultaneously, using all training data symmetrically. coMOTIF (30) is another simultaneous multi-profile motif finder based on the EM algorithm. It does not, however, keep track of the distances between binding sites and does not allow overlaps of binding sites, nor does it have the modeling of deviation or learning of the motif in the gap positions between the dimer components. MODER can be seen as a generalization of coMOTIF. Recent EM algorithm based finders of monomer motifs include GADEM and rGADEM (31,32) which use a genetic algorithm with EM to improve starting PPMs, SEME (33) which uses importance sampling to speed up the search, EXTREME (16) which achieves speed-up by using the online version of the EM algorithm, and STEME (34) which resorts to suffix trees. Moreover, Liu et al. (35) use Gibbs sampling and Ikebata and Yoshida (36) use a repulsive MCMC version of MEME type search for simultaneous discovery of several motifs, Alipanahi et al. (37) use deep learning for motif discovery with good validation results but non-modular structure of the underlying model, and Colombo and Vlassis (38) find monomeric motifs with a fast spectral learning algorithm. Recent motif finders specially designed for large ChIP-seq data include rGADEM (32), HOMER (39), ChIP-Munk (40), and MEME-ChIP (41), evaluated in (42).

In the rest of the paper, the next section defines the mixture model of MODER, the next one gives the associated EM algorithm for estimating the model parameters, then our implementation of MODER is described, with techniques to initialize and prune the search, and finally we report some validation and comparison experiments and discuss motifs found by MODER for TFs FLI1, HOXB13, HNF4A, TFAP2A, FOXC1, PKNOX2, NRSF, CTCF and RXRA.

MATERIALS AND METHODS

Model structure

The binding affinity model learned by MODER, specified by parameters η = (θ, ψ, λ), gives a probability distribution for sequences in some alphabet Σ. We will use always the DNA alphabet Σ = {A, C, G, T} but the model works for arbitrary alphabets.

Model η is a mixture of distributions for monomeric sequences that contain one occurrence of a monomeric motif, and distributions for dimeric sequences that contain two monomeric motifs in a specific relative orientation and spacing, and a distribution for background sequences. Monomeric distributions are built from the PPMs of the monomers and the background. For all orientation and spacing alternatives between the two monomers in a dimer, dimeric distributions are built either from the PPMs of the monomers and the background or from the PPM of the entire dimer and the background. If the two monomers of a dimer do not overlap and have a long gap in between, then the dimeric distribution is just the product of the two monomer PPMs, that is, the model assumes that there is no co-operative effect affecting the independence of the two binding profiles. If the monomers overlap or the gap between them is short, then the binding profiles of two monomers do not necessarily remain independent. There can be interaction between the components of a dimer as they may physically contact each other, or the interaction can be DNA mediated (4). Therefore the model allows deviating from pure reduction to monomer PPMs and also represents, using the so-called deviation matrix, how the PPM learned from data differs from the product of monomer PPMs which would be the expected model if there are no interactions.

The three parameter groups of η = (θ, ψ, λ) and the parametrization of the dimeric structures are defined in detail in the following subsections.

Monomeric PPMs θk and background θ0

Parameter θ = (θ0, θ1, …, θp) gives the background distribution θ0 and p monomeric motifs θk. Each θk, k ≠ 0, is a 4 × ℓk PPM

$$\theta_k = \bigl(\theta_k[a, h]\bigr)_{a \in \Sigma,\; h = 1, \ldots, \ell_k}$$

where $\theta_k[a, h]$ gives the probability for an alphabet symbol (nucleotide) a to occur in position h of θk, and ℓk denotes the length of θk. The reverse complement $\theta_k^{-1}$ of θk is a PPM such that $\theta_k^{-1}[a, h] = \theta_k[\bar{a}, \ell_k - h + 1]$ for each a and h, where $\bar{a}$ is the complementary base of a (e.g., $\bar{A} = T$).
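The reverse complement operation can be illustrated with a minimal Python sketch. It is not taken from the MODER implementation; the function name `reverse_complement_ppm` and the storage layout (a list of columns, each ordered A, C, G, T) are assumptions made for this example.

```python
# Assumed layout: a PPM is a list of columns; each column is [p_A, p_C, p_G, p_T].

def reverse_complement_ppm(ppm):
    """Return the reverse complement of a PPM (hypothetical helper)."""
    # Reversing the column order reverses the motif. Reversing each column
    # swaps A<->T and C<->G, because the rows are ordered A, C, G, T.
    return [col[::-1] for col in reversed(ppm)]

theta = [[0.7, 0.1, 0.1, 0.1],   # position 1: mostly A
         [0.1, 0.1, 0.7, 0.1]]   # position 2: mostly G
rc = reverse_complement_ppm(theta)  # consensus AG becomes CT
```

Applying the function twice returns the original matrix, as expected for a reverse complement.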

The mononucleotide background model $\theta_0 = (\theta_0(a))_{a \in \Sigma}$ gives the occurrence probabilities of each alphabet symbol in a position that is outside the occurrences of monomers or dimers. The background model is position-independent.

Dimer specification k1k2od

The model uses monomeric motifs θk as building blocks of dimeric motifs. The possible dimeric motifs are indexed with quadruples (k1, k2, o, d) which we abbreviate as k1k2od (this should not be confused with the multiplication of these symbols). A dimer with index k1k2od is composed of monomers Inline graphic and Inline graphic whose orientation is o and distance (spacing) from the end of Inline graphic to the start of Inline graphic is d, where Inline graphic and Inline graphic. Because of co-operative binding effects, monomer motifs alone are not enough for building dimeric models. To model such effects we will use an additional PPM (see the next subsection) that covers the middle area of the dimer, called the bridging segment. Figure 1 illustrates our parametrization of dimeric structures; cf. (17).

Figure 1.

Parametrization of dimeric structures. On top, a non-overlapping dimer with spacing d and head-to-tail orientation. Parameter δ gives the lower bound such that monomers separated by a space ≥δ are assumed independent. Below, a dimer with reversed first monomer and overlap of length d. When d < δ, the model also includes the bridging PPM ψ that covers the bridging segment of the dimer.

The set of possible pairwise orientations o is {HT, HH, TT} if k1 = k2 (homodimer), and {HT, HH, TT, TH} otherwise (heterodimer). Table 1 describes different orientations o = (o1, o2) giving the directions of motifs $\theta_{k_1}$ and $\theta_{k_2}$. Note that for homodimers the orientations HT and TH are identical, and one can use HT to represent them both. We assume that motif $\theta_{k_1}$ always occurs before motif $\theta_{k_2}$ when moving from 5′ end to 3′ end and using motif start position as reference point. The reverse order of the two motifs transforms back to this case by considering the complementary strand.

Table 1. Relative orientation of two motif occurrences within a dimer.

Orientation o    Short-hand    Directions    o1    o2
Head-to-Tail     HT            → →           +1    +1
Head-to-Head     HH            → ←           +1    −1
Tail-to-Tail     TT            ← →           −1    +1
Tail-to-Head     TH            ← ←           −1    −1

Exponents o1 and o2 give the orientation of the first and the second PPM: for a PPM θk, $\theta_k^{+1}$ leaves the matrix intact but $\theta_k^{-1}$ takes the reverse complement.

The possible distances between the two occurrences are given as an interval Inline graphic. If Inline graphic is non-negative, it gives the number of gap positions between the two occurrences. If d < 0, then the occurrences overlap by −d positions. The smallest possible distance Inline graphic has to be Inline graphic. MODER implementation uses (optionally adjustable) default value Inline graphic, that is, overlaps only up to half of the length of the monomers are allowed. The longest distance possible for sequences of maximum length Lmax  is Inline graphic.

We use parameter δ ≥ 0 to give the minimum spacing such that if the space between the two monomers of a dimer is ≥δ then the monomer profiles are assumed independent, i.e. in this case the model ignores the possible co-operative interactions that would change the binding preferences of the two TFs or the gap between them. Parameter δ is a user-given constant (default value δ = 4 in our implementation).

In what follows, we refer to the available monomeric and dimeric motifs with index k that may belong to the following three separate sets:

M = {1, …, p}: the indices for monomeric motifs.

D+ = {k1k2od : d ≥ δ, k1, k2 ∈ M}: the indices for dimeric motifs whose monomers $\theta_{k_1}$ and $\theta_{k_2}$ have a gap of length ≥δ in between. This is called the independent case.

D = {k1k2od : d < δ, k1, k2 ∈ M}: the indices for dimeric motifs whose monomers $\theta_{k_1}$ and $\theta_{k_2}$ have a gap of length <δ in between. This is called the dependent case. Note that this case includes dimers whose monomers overlap.
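As an illustration of this indexing, the following Python sketch enumerates the dimer indices for one pair of monomers and splits them into the independent and dependent cases. The function name `dimer_indices` and the explicit `d_min`/`d_max` arguments are assumptions for this example; the orientation sets follow the homodimer/heterodimer rule stated above.

```python
def dimer_indices(k1, k2, d_min, d_max, delta=4):
    """Enumerate quadruples (k1, k2, o, d), split by the spacing threshold delta."""
    # Homodimers: TH coincides with HT, so only three orientations are needed.
    orientations = ["HT", "HH", "TT"] if k1 == k2 else ["HT", "HH", "TT", "TH"]
    independent, dependent = [], []
    for o in orientations:
        for d in range(d_min, d_max + 1):
            target = independent if d >= delta else dependent
            target.append((k1, k2, o, d))
    return independent, dependent

# Hypothetical distance range; negative d means overlap by -d positions.
ind, dep = dimer_indices(1, 1, d_min=-5, d_max=10)
```

With these example values, each of the three orientations contributes 7 independent spacings (d = 4, …, 10) and 9 dependent ones (d = −5, …, 3).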

Dimeric PPMs τk1k2od, bridging PPMs ψk1k2od and deviation matrices κk1k2od

We use $\tau_{k_1k_2od}$ to denote the PPM (which is a $4 \times (\ell_{k_1} + d + \ell_{k_2})$ matrix) for motif k1k2od ∈ D+ ∪ D. Each $\tau_{k_1k_2od}$ is a derived parameter, composed of free parameters, such that if k1k2od ∈ D+ then $\tau_{k_1k_2od}$ is built from $\theta_{k_1}$, $\theta_{k_2}$, and background θ0, and if k1k2od ∈ D then $\tau_{k_1k_2od}$ is built from $\theta_{k_1}$, $\theta_{k_2}$, and the bridging PPM $\psi_{k_1k_2od}$ to be defined below. Constructions are as follows.

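If k1k2od ∈ D+, then we put simply

$$\tau_{k_1k_2od} = \theta_{k_1}^{o_1} \bullet \underbrace{\theta_0 \bullet \cdots \bullet \theta_0}_{d\ \text{times}} \bullet \theta_{k_2}^{o_2} \qquad (1)$$

where • concatenates matrices. There are d column-matrices θ0 in the middle of $\tau_{k_1k_2od}$, that is, the middle gap is filled with the background.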

If k1k2odD, then a middle segment of Inline graphic is a free parameter learned from data: for d < 0, the columns that are on the overlap area (plus one more column on both sides) are free parameters, and for 0 ≤ d < δ, the columns that are on the area between the monomers (plus one more column on both sides) are free parameters. This area of length |d| + 2 in the middle of a dimer is called the bridging segment, and the 4 × (|d| + 2) PPM for the bridging segment is called the bridging PPM. We let Inline graphic denote the bridging PPM. Now, the columns of Inline graphic that cover the bridging segment come from Inline graphic while the columns outside this segment are supposed to reduce to the monomer motifs, i.e. they are as in the implied prefix and suffix segments of monomer matrices Inline graphic and Inline graphic. So we get

graphic file with name M69.gif (2)
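The two constructions of the dimeric PPM can be sketched in Python as follows. This is an illustrative reading of the description above rather than the MODER source: the names `orient`, `ORIENT` and `dimer_ppm` are hypothetical, PPMs are stored as lists of columns ordered A, C, G, T, and the overlap is assumed to be shorter than either monomer.

```python
# Orientation signs (o1, o2) as in Table 1.
ORIENT = {"HT": (+1, +1), "HH": (+1, -1), "TT": (-1, +1), "TH": (-1, -1)}

def orient(ppm, sign):
    """sign +1 keeps the PPM; sign -1 takes its reverse complement."""
    return ppm if sign == +1 else [col[::-1] for col in reversed(ppm)]

def dimer_ppm(theta1, theta2, o, d, delta, theta0=None, psi=None):
    o1, o2 = ORIENT[o]
    first, second = orient(theta1, o1), orient(theta2, o2)
    if d >= delta:
        # Independent case: monomers separated by d background columns.
        return first + [list(theta0) for _ in range(d)] + second
    # Dependent case: the bridging PPM psi has |d| + 2 columns and replaces
    # the gap (or overlap) plus one column on each side.
    assert len(psi) == abs(d) + 2
    overlap = max(0, -d)
    keep1 = len(first) - overlap - 1    # prefix columns kept from monomer 1
    keep2 = len(second) - overlap - 1   # suffix columns kept from monomer 2
    return first[:keep1] + psi + second[len(second) - keep2:]

uniform = [0.25, 0.25, 0.25, 0.25]
theta = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
tau_far = dimer_ppm(theta, theta, "HT", 5, delta=4, theta0=uniform)
tau_gap = dimer_ppm(theta, theta, "HH", 2, delta=4, psi=[uniform] * 4)
tau_ovl = dimer_ppm(theta, theta, "HH", -1, delta=4, psi=[uniform] * 3)
```

In every case the result has ℓk1 + d + ℓk2 columns, which is a convenient consistency check on the prefix and suffix lengths.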

Next, we make it explicit how PPM Inline graphic differs from the PPM that would be expected were the monomer motifs independent in the dimer. We denote such an expected PPM as Inline graphic. It models the situation that motifs Inline graphic and Inline graphic have independent instances at distance d from each other in sequences Inline graphic with an occurrence of θf at the left end and θr at the right end.

Let first d ≥ 0. Consider the occurrence probability P(ai) of the ith symbol ai. Obviously, if Inline graphic, then P(ai) = θf[ai, i]; if Inline graphic, then P(ai) = θ0(ai), i.e., we expect to see the background distribution between the two motifs; and if Inline graphic, then Inline graphic. This means that Inline graphic is just θf followed by d columns, each equal to θ0, followed by θr; c.f., the definition of Inline graphic in the independent case (1).

Let then d < 0, i.e., the motifs overlap by |d| symbols. Consider again the probability P(ai). If Inline graphic, then P(ai) = θf[ai, i], and hence the ith column of the expected PPM is Inline graphic. Similarly, if Inline graphic, then Inline graphic, and hence Inline graphic. In the remaining case we have Inline graphic, and the ith symbol ai belongs to the area where the two motifs overlap. Hence ai is generated by both θf and θr, under the condition that both generate the same symbol because in the overlapping area the two motifs have to coincide. Therefore P(ai) would be equal to Inline graphic, normalized by the condition that both motifs generate the same symbol. This gives

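$$P(a_i) = \frac{\theta_f[a_i, i]\;\theta_r[a_i, i - \ell_f - d]}{\sum_{a \in \Sigma} \theta_f[a, i]\;\theta_r[a, i - \ell_f - d]} \qquad (3)$$

and therefore the ith column becomes

$$\frac{\theta_f[\cdot, i] \times \theta_r[\cdot, i - \ell_f - d]}{\sum_{a \in \Sigma} \theta_f[a, i]\;\theta_r[a, i - \ell_f - d]} \qquad (4)$$

where × denotes element-wise product and ℓf is the length of θf.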
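A small Python sketch of the expected PPM may help here. It assumes that `theta_f` and `theta_r` are already oriented, that PPMs are lists of columns ordered A, C, G, T, and that overlapping columns share at least one symbol with non-zero probability; the function names are hypothetical.

```python
def overlap_column(col_f, col_r):
    """Normalized element-wise product of two overlapping columns."""
    prod = [f * r for f, r in zip(col_f, col_r)]
    total = sum(prod)  # probability that both motifs emit the same symbol
    return [x / total for x in prod]

def expected_ppm(theta_f, theta_r, theta0, d):
    """Expected dimer PPM if the two monomers were independent."""
    if d >= 0:
        # No overlap: monomers separated by d background columns.
        return theta_f + [list(theta0) for _ in range(d)] + theta_r
    ov = -d
    head = theta_f[:len(theta_f) - ov]          # columns only in theta_f
    mid = [overlap_column(f, r)                  # overlapping columns
           for f, r in zip(theta_f[len(theta_f) - ov:], theta_r[:ov])]
    tail = theta_r[ov:]                          # columns only in theta_r
    return head + mid + tail

uniform = [0.25, 0.25, 0.25, 0.25]
theta_f = [[0.7, 0.1, 0.1, 0.1], [0.5, 0.5, 0.0, 0.0]]
theta_r = [[0.5, 0.0, 0.5, 0.0], [0.1, 0.1, 0.1, 0.7]]
expected = expected_ppm(theta_f, theta_r, uniform, d=-1)
```

In this toy example the single overlapping column becomes [1, 0, 0, 0], since A is the only symbol that both motifs can generate at that position.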

Finally, the deviation matrix $\kappa_{k_1k_2od}$, defined as

graphic file with name M91.gif

gives the difference between observed and expected model. Deviation matrices will be visualized using a variant of the sequence logo in which positive values are shown above a separating line and negative values below it, see Figure 2. Note also that the expected PPM of homodimers is always palindrome symmetric for orientations HH and TT.
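The deviation matrix and the palindromic symmetry remark can be checked numerically with a short Python sketch. The helper names `deviation` and `is_palindromic` are assumptions for this example, and the observed matrix below is made-up toy data, not a result from the paper.

```python
def deviation(tau, expected):
    """Element-wise difference: observed dimer PPM minus expected PPM."""
    return [[t - e for t, e in zip(tc, ec)] for tc, ec in zip(tau, expected)]

def is_palindromic(ppm, tol=1e-9):
    """True if the PPM equals its own reverse complement (rows A, C, G, T)."""
    rc = [col[::-1] for col in reversed(ppm)]
    return all(abs(a - b) <= tol
               for c1, c2 in zip(ppm, rc) for a, b in zip(c1, c2))

theta = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
theta_rc = [col[::-1] for col in reversed(theta)]
expected_hh = theta + theta_rc            # HH homodimer, d = 0, no interaction
observed = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1],
            [0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]  # toy "observed" PPM
kappa = deviation(observed, expected_hh)
```

Because both matrices have columns summing to one, each column of the deviation matrix sums to zero, which is why the logo variant has matching mass above and below the line.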

Figure 2.

Example model η for factor FLI1. The model was learned by MODER from a HT-SELEX data set (PRJEB14550, 143 389 reads of length 40), see Section Validation results. (A) Monomeric PPM θ1 with original seed ACCGGAAGTN. (B) Dimeric PPM Inline graphic with (above) the seed and arrows indicating the orientation, and (below) horizontal bar indicating the bridging segment, and deviation matrix Inline graphic with the positive values visualized above and the negative values below the horizontal line. (C) Dimeric PPM Inline graphic and its deviation matrix. (D) Background PPM θ0. (E) Mixture break-down into monomer, all dimers, and background. For example, 0.21 is the sum of mixing parameter values λk, kD+D. Note that the proportion of background is larger than the signal, because data from an early SELEX cycle with lots of background was used. (F) Heat map of COB table (λ1, 1, o, d) for homodimers of FLI1, giving the break-down into individual dimers and indicating that Inline graphic (panel b), Inline graphic (panel c), and Inline graphic are the strongest dimers. Horizontal axis gives the distance d, and a cell with no value indicates that the corresponding dimeric case was pruned during the EM search; see Section Pruning the search. The units in the COB table are integer multiples of 0.001.

Mixing parameters λ

Mixing parameters λ = {λk : k ∈ {0} ∪ M ∪ D+ ∪ D} give the probability of each component of the mixture as follows:

λk, k ∈ M, is the probability that the sequence contains exactly one monomeric occurrence of motif θk and no other occurrences.

λk, k = k1k2od ∈ D+ ∪ D, is the probability that the sequence contains exactly one occurrence of motif $\tau_{k_1k_2od}$ and no occurrences of other motifs.

λ0 is the probability that the sequence contains no motif occurrences.

For each pair (k1, k2), the array Inline graphic of mixing parameter values is called the co-operative binding table (COB table) of motifs Inline graphic and Inline graphic. The values in a COB table indicate the orientation and spacing preferences of the dimeric structures that are composed of Inline graphic and Inline graphic.
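As a data-structure illustration, a COB table can be viewed as the slice of the mixing parameters for one monomer pair, arranged by orientation and distance. The sketch below assumes the dimeric mixing parameters are stored in a dictionary keyed by the quadruple (k1, k2, o, d); the function name and the numeric values are hypothetical.

```python
def cob_table(lam, k1, k2):
    """Collect lambda values of pair (k1, k2) into {orientation: {distance: value}}."""
    table = {}
    for (a, b, o, d), value in lam.items():
        if (a, b) == (k1, k2):
            table.setdefault(o, {})[d] = value
    return table

# Made-up mixing parameters for illustration only.
lam = {(1, 1, "HT", 2): 0.05, (1, 1, "HH", -4): 0.03,
       (1, 1, "HT", 5): 0.01, (1, 2, "TH", 0): 0.02}
cob = cob_table(lam, 1, 1)
```

Cells that are absent from the nested dictionary correspond to dimeric cases that are not part of the model, for example because they were pruned.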

Figure 2 illustrates model η for binding motifs of TF FLI1.

Learning by expectation maximization

Given a training data set X = {X1, X2, …, Xn} consisting of n DNA sequences Inline graphic, where Li is the length of the ith sequence, we use the EM algorithm (12,28) to find model parameters η which maximize the expectation of the likelihood L(η|X, Z) = P(X, Z|η), where latent variables Z give the ’missing information’ used by an EM algorithm.

Latent variables are 0–1-valued random variables that indicate how the data X is aligned to the model. To align Xi, there are latent variables Zik·, k ∈ {0} ∪ M ∪ D+ ∪ D, with exactly one of them having value 1, that code the alignment as follows.

Case Zi0 = 1: Sequence Xi has no occurrences of motifs and is generated by the background model θ0 alone.

Case Zikj = 1: If kM, then the sequence Xi has an occurrence of motif θk starting at position j. The rest of Xi is generated by the background model. If k = k1k2odD+D then the sequence Xi has an occurrence of motif τk at position j, that is, an occurrence of motif Inline graphic at position j and an occurrence of motif Inline graphic at Inline graphic such that the occurrences of Inline graphic and Inline graphic have relative orientation o.

We denote by Sik the set of positions j at which motif k may occur in Xi. For kM we have Sik = {1, …, Li − ℓk + 1}, and for k = k1k2odD+D, Inline graphic.

The probability of Xi in model η, given the missing information Zi ·, is straightforward to evaluate as follows. If sequence Xi contains no motif occurrences, i.e. Zi0 = 1, its probability is

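$$P(X_i \mid Z_{i0} = 1, \eta) = \prod_{j=1}^{L_i} \theta_0(X_i[j]) \qquad (5)$$

If the sequence Xi contains one motif occurrence, i.e. Zikj = 1 for some k ∈ M, j ∈ Sik, its probability is

$$P(X_i \mid Z_{ikj} = 1, \eta) = \prod_{h=1}^{\ell_k} \theta_k[X_i[j + h - 1], h] \prod_{j' \in B_1} \theta_0(X_i[j']) \qquad (6)$$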

where B1 = {1, …, Li}∖[j, j + ℓk).

For the dimeric binding we have two cases: independent (d ≥ δ) and dependent (d < δ). Let first k = k1k2odD+. Define the set Inline graphic. Then the probability of Xi is

graphic file with name M114.gif (7)

Let then k = k1k2od ∈ D. The probability of Xi is

graphic file with name M115.gif (8)

Recall from (2) that Inline graphic is composed of bridging PPM Inline graphic in the middle and of flanking segments taken from PPMs Inline graphic and Inline graphic.
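The conditional sequence probabilities can be sketched in Python as below. The sketch is single-stranded, like the notation of this section, and uses plain products for readability (a practical implementation would likely work with logarithms). Function names and the column-list PPM layout are assumptions. For a dimer, passing the full dimeric PPM as `ppm` covers both the independent and the dependent case, because in the independent case the gap columns of the dimeric PPM equal the background.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def prob_background(seq, theta0):
    """P(X_i | no motif): every position is drawn from the background."""
    p = 1.0
    for ch in seq:
        p *= theta0[IDX[ch]]
    return p

def prob_with_motif(seq, j, ppm, theta0):
    """P(X_i | motif `ppm` starts at 1-based position j), background elsewhere."""
    start = j - 1
    p = 1.0
    for pos, ch in enumerate(seq):
        h = pos - start                  # 0-based offset inside the motif
        if 0 <= h < len(ppm):
            p *= ppm[h][IDX[ch]]         # position covered by the motif
        else:
            p *= theta0[IDX[ch]]         # position in the background set
    return p

uniform = [0.25, 0.25, 0.25, 0.25]
p_bg = prob_background("ACGT", uniform)
p_m = prob_with_motif("ACGT", 1, [[1.0, 0.0, 0.0, 0.0]], uniform)
```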

Now the joint likelihood of the model parameters, given data X and missing information Z, is the product of mixture probabilities of each Xi:

graphic file with name M120.gif

It is important to note here that, to simplify notation, we have ignored the fact that we should consider motif occurrences appearing in the reverse DNA strand as well. For this algorithm to work in the two-stranded case, a new index should be added, which specifies the direction (+1 or –1) of a monomer or a dimer occurrence. Then in all the places where we sum over j, we should sum over the directions as well. Moreover, to make sure that the probabilities add up to one, an additional division by two should be performed where we currently divide by |Sik|.

As for each i, exactly one of the latent values Zi · equals 1 and the others are zeros, the log-likelihood has the following form:

graphic file with name M121.gif (9)

The EM algorithm repeatedly applies the following rule to update η = (θ, ψ, λ) until convergence:

graphic file with name M122.gif

One iteration of the algorithm, indexed with t, consists of an E-step and an M-step. These steps are described next.

Expectation step

E-step finds the expectation of log-likelihood (9) for current parameter values η(t). By linearity of expectation, this reduces to finding the expected values Inline graphici · of latent variables Zi ·. By noting that 0 and 1 are the only possible values of a latent variable, and by applying the Bayes rule, one can see that the expected values and hence the update rule of the E-step becomes, for k ∈ {0}∪MD+D and jSik, as follows:

graphic file with name M124.gif (10)
graphic file with name M125.gif (11)

Here, probability P(Xi|Zi0 = 1, η(t)) is given by (5) and probability P(Xi|Zikj = 1, η(t)) by (6), (7) or (8), and

graphic file with name M126.gif (12)
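A minimal Python sketch of the E-step for one sequence follows. It assumes that the prior of component k is spread uniformly over its |Sik| admissible start positions, which is consistent with the remark above about dividing by |Sik|, and that each motif is shorter than the sequence; all names are hypothetical and only the single-strand case is handled.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def seq_prob(seq, theta0, ppm=None, j=None):
    """Probability of seq with an optional motif at 1-based position j."""
    p = 1.0
    for pos, ch in enumerate(seq):
        inside = ppm is not None and 0 <= pos - (j - 1) < len(ppm)
        p *= ppm[pos - (j - 1)][IDX[ch]] if inside else theta0[IDX[ch]]
    return p

def e_step(seq, lam0, motifs, theta0):
    """motifs: list of (lambda_k, ppm_k). Returns (z0, z) with z[k][j-1]."""
    joint0 = lam0 * seq_prob(seq, theta0)
    joint = []
    for lam_k, ppm in motifs:
        n_pos = len(seq) - len(ppm) + 1           # |S_ik|
        joint.append([lam_k / n_pos * seq_prob(seq, theta0, ppm, j)
                      for j in range(1, n_pos + 1)])
    total = joint0 + sum(sum(row) for row in joint)  # mixture probability of seq
    return joint0 / total, [[x / total for x in row] for row in joint]

uniform = [0.25, 0.25, 0.25, 0.25]
motif = [[0.97, 0.01, 0.01, 0.01]]                # one-column toy motif favouring A
z0, z = e_step("ACGT", 0.5, [(0.5, motif)], uniform)
```

The returned values are posterior probabilities of the alignments, so they sum to one over the background case and all (k, j) pairs.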

Maximization step

M-step maximizes the expectation of log-likelihood for current Inline graphic(t) by updating parameters η = (θ, ψ, λ). The form of log-likelihood (9) is such that the M-step is of Baum–Welch type: parameters are updated by normalizing the expected counts of using different components of the model when X is aligned to the model according to Inline graphic(t).

The update rules for mixing parameters become:

graphic file with name M129.gif (13)
graphic file with name M130.gif (14)
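The mixing parameter update amounts to averaging the expected latent values over the n training sequences. The sketch below assumes the E-step output is stored per sequence as a background posterior `z0` and a list `z[k]` of position-wise posteriors for each component k; the function name is hypothetical.

```python
def update_mixing(z0_list, z_list):
    """z0_list[i]: background posterior of sequence i.
    z_list[i][k][j]: posterior of component k at position j in sequence i."""
    n = len(z0_list)
    lam0 = sum(z0_list) / n
    n_comp = len(z_list[0])
    lam = [sum(sum(z_i[k]) for z_i in z_list) / n for k in range(n_comp)]
    return lam0, lam

# Toy posteriors for two sequences and one motif component.
z0_list = [0.5, 0.25]
z_list = [[[0.25, 0.25]], [[0.5, 0.25]]]
lam0, lam = update_mixing(z0_list, z_list)
```

Since the posteriors of each sequence sum to one, the updated mixing parameters also sum to one.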

To update θ and ψ we first accumulate the expected counts of how many times each mixture component is used when X is aligned with η(t). For all k ∈ M, we get the 4 × ℓk matrices of expected counts of the monomer motifs as

graphic file with name M131.gif

Here Inline graphic is a 4 × ℓk matrix-valued indicator function such that Inline graphic if Xi[j + h − 1] = a, and otherwise Inline graphic. Again, Inline graphic is the reverse complement of Inline graphic. Note that the above aggregation of Wk implements the modularity of binding: a monomer model θk gets its counts from monomeric occurrences of θk as well as from occurrences of θk as an independent component of a dimer. Since the monomer models are not learned from the overlapping cases, there is no coupling between the monomers and the deviation matrices, i.e. both are uniquely defined.

For kD, the Inline graphic matrix of the expected counts is

graphic file with name M138.gif

According to our modularity constraint the columns of Inline graphic that are outside the bridging segment should be modeled with Inline graphic and Inline graphic. They should therefore be added to Inline graphic and Inline graphic as follows

graphic file with name M144.gif (15)
graphic file with name M145.gif (16)

The count vector of the background model is obtained as

graphic file with name M146.gif

where Inline graphic is the column-vector of total counts of alphabet symbols in the data set X.

When normalized column-wise, the matrices Wk (with pseudo-counts possibly added) give updated θk for k ∈ {0}∪M:

graphic file with name M148.gif (17)
graphic file with name M149.gif (18)
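The count accumulation and column-wise normalization can be sketched as follows. This simplified Python example collects counts from monomeric occurrences only, whereas the update described above also adds the contributions of dimer components; the function names and the uniform pseudo-count are assumptions.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}

def expected_counts(seqs, posteriors, length):
    """posteriors[i][j]: expected latent value for a motif of the given
    length starting at 0-based position j of seqs[i]."""
    counts = [[0.0] * 4 for _ in range(length)]
    for seq, z in zip(seqs, posteriors):
        for j, w in enumerate(z):
            for h in range(length):
                counts[h][IDX[seq[j + h]]] += w   # weighted symbol count
    return counts

def normalize(counts, pseudo=0.0):
    """Turn a count matrix into a PPM, column by column."""
    ppm = []
    for col in counts:
        total = sum(col) + 4 * pseudo
        ppm.append([(c + pseudo) / total for c in col])
    return ppm

counts = expected_counts(["ACG"], [[0.75, 0.25]], length=2)
theta_new = normalize(counts)
```

A positive `pseudo` value keeps all probabilities non-zero, which avoids zero likelihoods in later iterations.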

Similarly, the bridging segments of Wk, k = k1k2od ∈ D, give updated bridging PPMs ψk:

graphic file with name M150.gif (19)

where h = 1, …, |d| + 2.

Implementation of MODER

In this section we give practical details of our implementation of the MODER algorithm and provide some modifications to improve its efficiency.

Input

The input of MODER consists of the following items.

  1. Data set X that consists of DNA sequences X1, X2, …, Xn, with |Xi| = Li for all i = 1, …, n.

  2. The seeds s1, s2, …, sp. Each sk is an IUPAC sequence of length |sk| = ℓk. Seeds should be high-affinity representative sequences, one for each monomeric motif to be learned from data X. They will be used for constructing initial values for PPMs θk.

  3. Set R⊂{1, 2, …, p}2 of pairs that restrict the set of dimeric motifs represented in η. MODER learns only dimers k1k2od such that (k1, k2) is in R.

  4. Minimum gap length in dimers whose monomers are assumed independent, δ; maximum number of EM-iterations, maxiter; and the convergence threshold for parameter change in consecutive EM-iterations, ε.

EM iterations

As the EM algorithm converges to a local optimum, it is crucial to use good initial values for the parameters. Initial PPMs Inline graphic are obtained from input data X and seeds s1, …, sp using the multinomial method (5). Initial bridging PPMs Inline graphic are obtained from input data X and combined seedsInline graphic using the multinomial method. A combined seed is constructed by orienting seeds Inline graphic and Inline graphic according to o, spacing them by d symbols, and replacing the symbols in the bridging segment by the neutral IUPAC symbol N. This gives sequence y. Then the combined seed Inline graphic is the highest counting non-palindromic subsequence of input data X that matches with y. A non-palindromic seed makes it possible for the EM search to break the symmetry and find non-palindromic PPMs. Background model is initialized as Inline graphic where Inline graphic is the column-vector of total counts of alphabet symbols in X. The mixing parameters λ(1) are initialized as follows:

  • Inline graphic 0.5,

  • Inline graphicInline graphic

  • Inline graphic 0.2/|R|, for all (k1, k2) ∈ R. Within a COB table the value 0.2/|R| is divided evenly among the cells as Inline graphic.
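The construction of the pattern y used for the combined seed (orient the two seeds, space them by d, mask the bridging segment with N) can be sketched in Python. The names `revcomp`, `ORIENT` and `combined_pattern` are hypothetical, the bridging segment is taken to be the gap or overlap plus one symbol on each side, and the subsequent step of picking the highest-counting non-palindromic match from the data is not shown.

```python
# IUPAC complement table (e.g. R = A/G pairs with Y = C/T).
COMP = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")
ORIENT = {"HT": (+1, +1), "HH": (+1, -1), "TT": (-1, +1), "TH": (-1, -1)}

def revcomp(seed):
    """Reverse complement of an IUPAC sequence."""
    return seed.translate(COMP)[::-1]

def combined_pattern(s1, s2, o, d):
    """Pattern y: oriented seeds at distance d with the bridging segment masked."""
    o1, o2 = ORIENT[o]
    a = s1 if o1 == +1 else revcomp(s1)
    b = s2 if o2 == +1 else revcomp(s2)
    overlap = max(0, -d)
    # Keep the part of each seed outside the bridging segment of |d| + 2 symbols.
    return a[:len(a) - overlap - 1] + "N" * (abs(d) + 2) + b[overlap + 1:]

y_gap = combined_pattern("AAC", "GGT", "HH", 0)    # second seed reverse-complemented
y_ovl = combined_pattern("AAC", "GGT", "HT", -1)   # seeds overlap by one symbol
```

The pattern always has length ℓk1 + d + ℓk2, matching the width of the corresponding dimeric PPM.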

The EM iterations then proceed as follows:

graphic file with name gky027ufig1.jpg

It should be noted that the above algorithm outputs the deviation matrix κ just for completeness. As κ is a derived parameter, it could be evaluated from θ and ψ in a post-processing phase as well.

Pruning the search

The MODER implementation makes some heuristic modifications to the EM framework of Section Learning by expectation maximization in order to speed up the search and to utilize prior knowledge of data quality.

First, as the information content of well-known binding affinity PPMs is on average quite high while low information content may indicate contamination from background, MODER trims during the EM all overlapping dimeric mixture components k whose average column-wise information content in the overlapping area goes below a threshold (default 0.40 bits). This is done by setting λk ≔ 0. Similarly, any dimeric component k whose λk gets below a small threshold (default 0.001) is eliminated as k is too weak. Blank entries of COB tables indicate eliminated dimers.
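The information-content test can be illustrated with the Python sketch below. It assumes that information content is measured in bits against a uniform background (2 + Σ p log2 p per column); the text does not spell out the exact definition used by MODER, so this choice, like the function names, is an assumption.

```python
import math

def column_ic(col):
    """Information content of one PPM column in bits, uniform background assumed."""
    return 2.0 + sum(p * math.log2(p) for p in col if p > 0)

def keep_overlapping_dimer(tau, start, overlap_len, threshold=0.40):
    """False if the average information content over the overlap area
    (columns start .. start + overlap_len - 1, 0-based) is below the threshold."""
    cols = tau[start:start + overlap_len]
    return sum(column_ic(c) for c in cols) / len(cols) >= threshold

uniform = [0.25, 0.25, 0.25, 0.25]
sharp = [1.0, 0.0, 0.0, 0.0]
tau = [sharp, uniform, uniform, sharp]
keep_flat = keep_overlapping_dimer(tau, start=1, overlap_len=2)    # uninformative overlap
keep_mixed = keep_overlapping_dimer(tau, start=0, overlap_len=2)   # average of 2 and 0 bits
```

A component that fails the test would have its mixing parameter set to zero, as described above.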

Second, MODER learns new values Inline graphic of monomeric PPMs not from the full data but from dimeric occurrences of the monomer such that the distance d between the components is large enough (default d ≥ δ = 4). This is because such isolated occurrences within a dimer are supposed to give the best data for a monomer PPM, not distorted by other close-by sites such as the other component of a dimer. However, if the share of these dimeric cases in the mixture is less than 0.02, then the dimeric data is considered too small. In this case all distances d ≥ 0 are included in the analysis.

The third modification is motivated by the fact that transcription factors may have different binding motifs whose consensus sequences are only a few Hamming steps apart. To minimize disturbance from such similar motifs and from background, MODER restricts the learning of PPMs Inline graphic and Inline graphic to high-affinity training sequences. Such sequences are identified by the heuristic rule that they are in a small Hamming neighbourhood of the consensus sequences (sequences with highest probability) of the PPMs found so far. Monomer PPM Inline graphic is learned from data sites that are in the 1-Hamming neighbourhood of the seed (using the consensus sequence as the seed) of Inline graphic. Bridging PPM Inline graphic is learned from data sites that are in the 1-Hamming neighbourhood of combined seed Inline graphic. The combined seed is obtained as the initial combined seed Inline graphic (see Section EM iterations) but using the seeds of Inline graphic and Inline graphic. MODER uses this seed-guided EM search by default, with the standard search as an option.
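A minimal sketch of the seed-guided restriction is given below. It derives the consensus sequence of a PPM and collects the sites of the reads that lie within Hamming distance 1 of it. The function names are hypothetical, PPM columns are assumed to be given in the order A, C, G, T, and for brevity only the forward strand is scanned.

```python
def consensus(ppm):
    """Consensus sequence of a PPM given as a list of columns in A, C, G, T order."""
    return "".join("ACGT"[max(range(4), key=col.__getitem__)] for col in ppm)


def hamming(a, b):
    """Number of mismatching positions between two equally long strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def high_affinity_sites(reads, seed, radius=1):
    """All forward-strand sites within the given Hamming radius of the seed."""
    k = len(seed)
    return [read[i:i + k]
            for read in reads
            for i in range(len(read) - k + 1)
            if hamming(read[i:i + k], seed) <= radius]
```

The sites returned by `high_affinity_sites` would then serve as the training data for re-estimating the corresponding PPM in the next iteration.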

RESULTS

Generated data

As an initial sanity test we created a model η, generated a data set using it, and checked that MODER is able to learn η back from the generated data. We first created one monomeric PPM and deviation matrices κHH − 4 and κHT − 4. From these we constructed a model that had uniform background (λ = 0.71) and PPMs for homodimers HH 5 (λ = 0.12), HH –4 (0.08) and HT –4 (0.09). Using this model, we generated 100 000 sequences of length 40 bp. The sequences contained dimeric motifs and background only; no monomeric sequences were included. MODER accurately relearned the model from this data, as the learned parameter values deviated from the original by at most 0.036; see Supplementary Figure S1 for details.

Validation using HT-SELEX data

Next we measured the quality of PPM models produced by MODER using the correlation (R2) between occurrence counts and PPM scores of 8-mers or 10-mers of SELEX data. When counting the k-mers, all occurrences and both directions were considered. As the score of a k-mer x by a single PPM ρ we used the maximum value of Inline graphic when y and ρ′ go over all intersections of ρ and x and of ρ and reverse complement Inline graphic. As the score of x by a mixture of PPMs ρ1, …, ρt, whose mixing parameters by MODER are λ1, …, λt, we used λ1S1 + ⋅⋅⋅ + λtSt where S1, …, St are the individual scores of x by the PPMs. The scatter plots in the figures visualize the counts and scores of different 8- or 10-mers in hexagonal bins. The color of a bin reflects the number of different k-mers in that bin, with a darker color meaning a higher number of different k-mers. As the early cycles of SELEX data can contain a large proportion of nonspecific sequences (i.e. background), the counts were corrected against background using the data of the previous SELEX cycle, as described in (5).
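The mixture score λ1S1 + ⋅⋅⋅ + λtSt and its comparison with k-mer counts can be sketched as below. The individual scores S1, …, St are taken as given inputs, since their exact definition is not reproduced here. Computing R2 as the squared Pearson correlation coefficient is an assumption of this sketch, and the function names are hypothetical.

```python
def mixture_score(weights, scores):
    """Weighted sum lambda_1*S_1 + ... + lambda_t*S_t of individual PPM scores."""
    if len(weights) != len(scores):
        raise ValueError("one score per mixture component is required")
    return sum(w * s for w, s in zip(weights, scores))


def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long value lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)
```

In use, `xs` would hold the background-corrected counts of the k-mers and `ys` the corresponding mixture scores.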

We report results for the monomer and dimer PPMs of factors HOXB13, HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2 learned from HT-SELEX data. A basic correlation analysis is done for factor HOXB13. For HNF4A, TFAP2A, FLI1, FOXC1, and PKNOX2 we also analyse the differences between observed and purely modular motifs. In all validations, the SELEX data sets were randomly divided into two halves, one half used for learning the model and the other half used for validating it.

We used the following HT-SELEX data sets: HOXB13 (PRJEB14550, 164 768 reads), HNF4A (ERX169045, (6), 655 432 reads), TFAP2A (ERX1085476, (43), 168 053 reads), FLI1 (PRJEB14550, 143 389 reads), FOXC1 (ERX169015, (6), 189 009 reads), and PKNOX2 (ERX1084652, (43), 423 339 reads). Each read was 40 bp long except for FOXC1 whose reads were 30 bp long. The following seeds, selected by hand using the models published in (6), were used as input: HOXB13 (CTCGTAAAA, CCAATAAAA), HNF4A (RGGTCA, RGTCCA), TFAP2A (GGGCA), FLI1 (ACCGGAAGTN), FOXC1 (RTAAAYA), and PKNOX2 (TGACANN). Note that it was essential to use non-palindromic seeds for overlapping dimers as, for example, the observed cases HH -6 for FLI1 and HH -2 for PKNOX2 are directional; see Figure S3 in (6) and Section EM iterations.

Selecting strong components of the model

The learned total model is likely to contain useless, weak components (weak dimeric motifs) that should be removed before the model is applied, e.g. to predict new putative binding sites. One could, for example, include model components in decreasing order of weight λ until a certain fraction of the non-background sequences is covered. Here we used the fraction of 85% to select the models for validation experiments. In addition, we studied the effect of parameter δ (minimum gap length in the independent case) by experimenting with large values (up to Lmax) of δ. As the larger deviations from the expected model were observed to occur mostly in dimers with gap < 4, the default value δ = 4 was selected.
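The selection rule can be sketched in a few lines of Python. The function name and the dictionary layout are hypothetical; the weights passed in are the mixing parameters of the monomeric and dimeric components without the background weight, so that the coverage is measured relative to the non-background part of the mixture.

```python
def select_components(weights, coverage=0.85):
    """Pick components in decreasing order of weight until coverage is reached.

    weights maps a component name to its mixing parameter lambda and must not
    contain the background component.
    """
    total = sum(weights.values())
    selected, covered = [], 0.0
    for name, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= coverage * total:
            break
        selected.append(name)
        covered += weight
    return selected
```

With the illustrative weights `{"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}` the first two components cover 80% of the signal, so the third is still included and the call returns `["A", "B", "C"]`.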

Factors HOXB13, HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2

Figure 3 shows the sequence logos of the learned PPMs for factor HOXB13 and reports correlations of the scores of individual PPMs and of their mixture with counts of 8-mers. Since MODER did not find any strong dimeric motifs, the model for this factor is composed of two monomers only. The power of multi-motif modeling can be seen: the combined mixture consistently gives the highest R2.

Figure 3.

Correlation analysis of HOXB13 binding motifs. Two monomer PPMs and their mixture (last panel) with weights λ1 = 0.187, λ2 = 0.198 were used. No dimer models were included. The combined model has higher correlation than the component models.

HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2 are examples of TFs for which many dimeric PPMs deviate clearly from the purely modular PPMs. Analyses of these factors are shown in Figures 4–8, where the correlations are shown for both the expected (purely modular) and the learned models, and the deviation matrices are also visualized. The number of dimeric models included into the mixture by the 85% rule ranged from 3 to 14 for different factors; only the top three dimeric models are shown in the Figures. The full set of models, weights, and resulting correlations are available in the Supplementary File S1. Not surprisingly, the learned model always has a higher correlation, but with a varying margin. Sometimes (TFAP2A) deviating from the expected model gives a strongly improved model, while sometimes (FOXC1) the difference to the expected model is minor. As for the directionality of the motifs, sometimes both the expected and learned motifs are palindromic (Figure 5) while sometimes expected palindromes become directed in the learned motif (Figures 6 and 8).

Figure 4.

Modularity analysis of HNF4A binding motifs. (A) Monomer models 1 and 2 (λ1 = 0.073, λ2 = 0.056) and the COB tables in units of integer multiples of 0.001. Since all the mixing parameters are in the same scale, comparison of λ values is also possible between two distinct COB tables. Also shown is correlation analysis for the two monomer models. (B) The first monomer PPM and dimeric models τ1, 2, HT, 1, τ2, 2, HT, 1, τ1, 1, HT, 1, and τ1, 2, HT, 2 were included in the analysis by the 85% rule (only the best three dimeric models are shown in the Figure). Deviation matrices are depicted below the logos of the dimeric PPMs. Their mixture used corresponding weights λ = 0.073, 0.328, 0.206, 0.104, 0.062. The combined model has much higher correlation than any individual model. (C) Correlation analysis as in B but for the PPMs E1, 2, HT, 1, E2, 2, HT, 1, E1, 1, HT, 1 and E1, 2, HT, 2 that are expected under the independence assumption. All R2-values for the learned and expected PPMs differ remarkably, reflecting the large deviations between the learned and the expected PPMs. The purely modular model cannot detect the AAA sequence connecting the half-sites.

Figure 5.

Modularity analysis of the binding motifs of TFAP2A. (A) The monomer model (λ1 = 0.003) and the COB table in units of integer multiples of 0.001. The monomer model was not included in the correlation analysis of the mixture by the 85% rule. (B) Correlation analysis of the model learned for TFAP2A by MODER: three dimeric PPMs τ1, 1, TT, 2, τ1, 1, TT, 1 and τ1, 1, TT, 3, and their mixture. Deviation matrices are depicted below the logos of the dimeric PPMs. The mixture of the PPMs uses weights λ = 0.381, 0.208, 0.133. (C) Correlation analysis as in B but for the PPMs E1, 1, TT, 2, E1, 1, TT, 1, E1, 1, TT, 3 that are expected under the independence assumption. All R2-values for the learned and expected PPMs differ remarkably, reflecting the large deviations between the learned and the expected PPMs. It is obvious that the purely modular model is not able to capture the binding affinity of TFAP2A. Note that all three PPMs TT 1, TT 2 and TT 3, which are palindromic in the expected model, stay palindromic in the learned model.

Figure 6.

Modularity analysis of the binding motifs of FLI1. (A) The monomer model (λ1 = 0.08) and the COB table in units of integer multiples of 0.001. Also shown is the correlation analysis using only the monomer model. (B) Correlation analysis of the model learned for FLI1 by MODER: the monomer model and six dimeric PPMs τ1, 1, HH, 2, τ1, 1, HH, −6, τ1, 1, HH, 1, τ1, 1, HT, 2, τ1, 1, HT, 1, τ1, 1, HH, 3 (only three best are shown in the Figure) were included in the analysis by the 85% rule. Deviation matrices are depicted below the logos of the dimeric PPMs. The mixture of the PPMs uses weights λ = 0.080, 0.040, 0.040, 0.038, 0.027, 0.020, 0.019. (C) Correlation analysis as in B but for the expected PPMs E1, 1, HH, 2, E1, 1, HH, −6, E1, 1, HH, 1, E1, 1, HT, 2, E1, 1, HT, 1, E1, 1, HH, 3 under the independence assumption. The R2-values for the learned and expected PPMs differ clearly for the mixture and the dimeric case HH –6. The purely modular model cannot handle the dimeric case HH –6 properly, since expected PPMs are always palindromic for orientations HH and TT, while here the learned model HH –6 turns out to be directed. Albeit quite weak, the directionality has a clear effect on R2.

Figure 7.

Modularity analysis of the binding motifs of FOXC1. (A) The monomer model (λ1 = 0.071) and the COB table in units of integer multiples of 0.001. Also shown is the correlation analysis using only the monomer model. (B) Correlation analysis of the model learned for FOXC1 by MODER: the monomeric model and five dimeric PPMs τ1, 1, HH, −2, τ1, 1, HH, −3, τ1, 1, HT, −3, τ1, 1, TT, 10, τ1, 1, TT, 9 were included in the analysis by the 85% rule. Deviation matrices are depicted below the logos of the dimeric PPMs. The mixture of the PPMs uses weights λ = 0.071, 0.080, 0.033, 0.029, 0.012, 0.011. (C) Correlation analysis as in B but for the PPMs E1, 1, HH, −2, E1, 1, HH, −3, E1, 1, HT, −3, E1, 1, TT, 10, E1, 1, TT, 9 that are expected under the independence assumption. The R2-values for the learned and expected PPMs differ quite clearly, for HT –3 and HH –3 in particular, as also suggested by their large deviation matrices, while the difference is small for HH –2, the heaviest component of the mixture. It is obvious that the purely modular model is not able to fully capture the binding affinity of FOXC1. Note that expected palindromic PPM HH –3 becomes directed while expected PPM HH –2 stays palindromic in the learned model.

Figure 8.

Modularity analysis of the binding motifs of PKNOX2. (A) The monomer model (λ1 = 0.256) and the COB table in units of integer multiples of 0.001. Also shown is the correlation analysis using only the monomer model. (B) Correlation analysis of the model learned for PKNOX2 by MODER: 14 dimeric PPMs were included in the analysis by the 85% rule but only the best three, τ1, 1, HH, −2, τ1, 1, HT, 3, and τ1, 1, HT, 4, are described here. In the mixture the weights for the monomer and the three best dimeric models were λ = 0.256, 0.177, 0.068, 0.065. The sum of the lambdas for the last 11 dimeric models was 0.267. Deviation matrices are depicted below the logos of the dimeric PPMs. (C) Correlation analysis as in B but for the PPMs E1, 1, HH, −2, E1, 1, HT, 3, E1, 1, HT, 4 that are expected under the independence assumption. Here the 85% rule selected very many dimeric models, because the lambda values have quite an even distribution in this case. Again the R2-values for the learned and expected PPMs differ quite clearly. It can be seen that the purely modular model is already satisfactory but can be improved somewhat by allowing deviations. Note that the palindromic HH –2 motif of the expected model becomes directed in the learned model.

Validation using ChIP-seq data

We then tested for factors HOXB13, HNF4A, and TFAP2A the validity of the obtained in vitro models on in vivo data. We performed standard ROC analysis to measure the performance of the models learned from SELEX data on binary classification of ChIP-seq peaks. The following ChIP-seq data were used: HOXB13 (European Nucleotide Archive accession ERX332516, IgG: ERX332513) (44), HNF4A (Sequence Read Archive accession SRR952427, IgG: SRR952608) (45) and TFAP2A (SRR952485, IgG: SRR952608) (45). To find the peaks, the reads were aligned with BWA (46), and peak calling was done with Peakzilla (47). The genome assembly used was GRCh37 (hs37d5). From each ChIP-seq peak set, the top n = 230 peaks with the highest quality score were selected, and for each peak a sequence of length L = 190 bp flanking the peak summit was chosen for the positive set. A negative set of the same size was chosen randomly from the human genome, making sure that the positions were mappable. Sequences were scored using the SELEX models of HOXB13, HNF4A and TFAP2A shown in Figures 3, 4 and 5. The resulting ROC curves of the (very good) classification performance of PPM scores are shown in Figure 9.
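The area under a ROC curve for such a positive and negative set can be computed directly from the scores. The sketch below uses the standard rank-based formulation, in which the AUC equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. It is an illustration only and not necessarily the procedure used to produce Figure 9; the function name is hypothetical.

```python
def roc_auc(positive_scores, negative_scores):
    """Area under the ROC curve from the scores of the two classes."""
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))
```

With 230 positive and 230 negative sequences the quadratic pairwise comparison is inexpensive; for much larger sets a sort-based computation would be preferable.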

Figure 9.

Classification performance of multi-motif models for HOXB13, HNF4A, and TFAP2A. ROC curves of classification using the scores by PPMs and their mixtures on ChIP-seq data. The area under each curve is also shown. The same models were used as in the correlation analysis (Figures 3–5).

We also applied MODER to the ChIP-seq data set of factor NRSF on the GM12878 cell line produced by (48) and further analyzed by (16). To obtain the seeds, we first took all the k-mers of lengths 9–11 from the data set, applied hierarchical clustering, and selected the two best clusters and their representative k-mers (TTCAGCACC and GGACAGCTCC) by using the order given by the Inline graphic-score. MODER finds 9 out of 10 models reported by Quang and Xie; for details, see Figure 2 of (16) and Supplementary Figure S2. Note that as NRSF has two monomeric motifs, MODER discovers heterodimeric motifs whose COB table has all four orientations.

Next, we tried to detect the core and side motifs of factor CTCF. This should test MODER’s capabilities in detecting long sites, as together these motifs are known to form a site of length about 34 bp (13). We used raw ChIP-exo data from human LoVo cells targeting factor CTCF from Katainen et al. (49) (ENA accession ERX986066). Mapping and peak calling was done as in Hartonen et al. (50), briefly: alignment was done using BWA (46) against assembly GRCh37 and the peaks were called using PeakXus (50). Five thousand highest scoring peaks were selected, and around the peak summits sequences of length 60 were extracted (blacklisted regions (51), ENCODE accession ENCFF001TDO, and centromeres were removed). Results in Supplementary Figure S3 show that MODER is able to detect similar configurations of distances and orientations between the core and side motifs of CTCF as in Schmidt et al. (13) (Figure 2) and Nakahashi et al. (14) (Figure 4). The strongest dimer formed by the core and side motifs, namely τ1, 2, HT, 8, was found in 20% of the top 5000 peaks.

As nuclear receptors commonly bind as dimers (52), we chose another factor, in addition to HNF4A, from this family to display the performance of MODER. ChIP-seq data from ENCODE for factor RXRA in cell line HepG2 was used (ENCODE accession ENCFF002CKZ). Ten thousand highest scoring peaks were selected, and around the peak summits sequences of length 40 were extracted (blacklisted regions and centromeres were removed). The monomer seed GGGGTCA for the experiment was handpicked based on the Rxra mouse model in Jaspar (53). The seed finding method used with NRSF for k-mer lengths 6–15 would give AGGTCA, which could have been used as well. Supplementary Figure S4 shows that MODER detects a strong dimeric binding motif τ1, 1, HT, 0, which could either be a homodimer of RXRA or a heterodimer such as NR1H2-RXRA, as suggested by Tomtom (15).

MODER versus MEME

When comparing MODER with the popular tool MEME it should be noted that the models of motifs of the two methods are different. MEME learns separate monomer models in successive passes, deleting the found sites of a model from data before the next pass, while MODER aims at discovering the modularity of motifs and hence learns the entire modular structure of monomeric and dimeric motifs in the same probabilistic framework in one run. The difference is illustrated in Supplementary Figure S5 that compares the models learned for factors TFAP2A and FLI1 by the two methods.

MODER versus Bipad/Maskminent

We also compared MODER with Bipad/Maskminent (17,19) which among the previous tools comes closest to MODER. An example qualitative comparison using alignment of models is illustrated in Supplementary Figure S6. Similarities between motifs obtained using these two algorithms are obvious, although Maskminent seems to introduce some background noise into the motifs. Comparison using correlation analysis was not performed since Maskminent does not learn the monomer model and the two orientation classes of dimers (DR and IR) in the same commensurate model, and hence the weights for models of different types could not be decided.

In order to make a quantitative comparison to Maskminent, we used the data sets for which a bipartite Maskminent model is available from Lu et al. (19). There were 53 such data sets in ENCODE (51), and we used 40 of those (6 had been revoked from ENCODE, 7 were unidentifiable). The identification problems were due to Lu et al. not giving the accession codes, but merely describing the data sets used. For all the data sets we managed to identify, we have now included the accessions in the Supplementary Table S1. The same number of top scoring peaks were used as in (19), but only 100 bp around the peak summit was selected, and these data sets were randomly divided into learning and validation sets of equal size. We selected the initial seeds for MODER based on Jaspar (53) in the following way: for JUN-like factors (ATF3, BACH1, BATF, FOS, FOSL1, FOSL2, JUN, JUNB, JUND, NFE2) we used the seed ATGA, for EBF1 the seed TCCC, for ESR1 the seed AGGTCA, for MAFF and MAFK the seed TCAGCA, and for STAT1 the seed TTC. Then MODER was run on the learning data sets, and the best dimeric PPM (according to lambda) was chosen for each data set. For Maskminent we used their published model for each data set. The results displayed in Supplementary Table S1 and Figure S7 show that MODER achieves a better AUC value in 35 cases out of 40. Note also that MODER wins in 33 out of 35 cases when considering only the optimal IDR-thresholded data sets (the other five data sets are initial peaksets, marked with a star in the table). The selection of factors used by Lu et al. (19) was unfortunately quite repetitive, but the comparison shows consistent behaviour for both methods.

DISCUSSION

The MODER algorithm is based on the reductionist view that PPM models for dimers can be built in a modular fashion from monomer PPMs.

As noted e.g. by (4), such modularity is not always valid as in a number of dimeric cases the specificity of the dimeric motif differs notably from what could be expected from its monomeric components. The deviation matrix of MODER represents such differences explicitly. These deviations from the expected models are especially important for orientations HH and TT, for which all expected models are always symmetric (palindromes), whereas the real binding motifs might have a direction (6). This was demonstrated by several examples in our validation experiments. In addition, the deviations from expected motif commonly occur when the core segments of the motifs of two factors are closely packed and the overlapping flanks are distorted from the expected model. In TF-DNA binding, the core positions in a motif are usually recognized by direct bonds to the bases, whereas the weaker positions are recognized by contacts to DNA backbone (4) and are hence more prone to deviations.

The motif discovery algorithm of MODER considers simultaneously all possible orientation–distance pairs and finds the preferred dimeric motifs. Learning multiple motifs in a serial manner (first finding one motif, then removing its occurrences from the data, and then running the algorithm again) does not treat symmetrically the sequences that may belong to several motifs. MODER improves over the similar coMOTIF algorithm (30) by including the spacing information in the overall model, and by adding overlapping motifs and the deviations from the expected motif. Allowing overlaps of monomer motifs within a dimer turned out to be a very useful feature. In fact, for factors FLI1, FOXC1 and PKNOX2 the strongest dimer has such an overlap.

Simultaneous learning of all motif components and their mixing parameters makes direct comparison of the relative strengths of the motifs possible by using the mixing parameters. Depending on the application, it might be useful to rescale the obtained mixing parameters after the actual algorithm has finished. This was done when we chose the motifs for performance testing by the 85% rule: the mixing parameters were rescaled to exclude the background. Then motifs were included in descending order until they covered 85% of the signal. Sometimes it might also be useful to rescale the mixing parameters in each COB table separately, although this would prevent the comparison of mixing parameters between distinct COB tables.

MODER is not too sensitive to noise in the seeds. For factor HOXB13, we mutated the first initial seed in two positions and the second seed in three positions, including informative positions. Still the algorithm managed to obtain the same results as with the original seeds. MODER is reasonably fast. For example, it took 2 min 18 s wall-clock time and 15 min 30 s CPU time when run simultaneously on eight cores to learn the model for FLI1 in Figure 2 from a 2 865 880 bp long HT-SELEX data set. The seeds for MODER can be found from existing PPM databases or can be produced by seed-finding tools such as DREME (54) or by using the procedure in section Validation using ChIP-seq data to find representative k-mers.

AVAILABILITY

MODER is implemented in C++ on the Linux platform and is available from https://github.com/jttoivon/MODER. Data are available from the European Nucleotide Archive under accession code PRJEB14550.

ACKNOWLEDGEMENTS

The authors would like to thank Tuomo Hartonen for doing the peak calling from the ChIP-seq and ChIP-exo reads.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online.

FUNDING

EU FP7 project SYSCOL [UE7-SYSCOL-258236]; Leverhulme Trust [VP1-2014-044 to E.U.]. Funding for open access charge: University of Helsinki.

Conflict of interest statement. None declared.

REFERENCES

  • 1. Rodda D.J., Chew J.-L., Lim L.-H., Loh Y.-H., Wang B., Ng H.-H., Robson P.. Transcriptional regulation of nanog by OCT4 and SOX2. J. Biol. Chem. 2005; 280:24731–24737. [DOI] [PubMed] [Google Scholar]
  • 2. Panne D., Maniatis T., Harrison S.C.. An atomic model of the interferon-β enhanceosome. Cell. 2007; 129:1111–1123. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. De Val S., Chi N.C., Meadows S.M., Minovitsky S., Anderson J.P., Harris I.S., Ehlers M.L., Agarwal P., Visel A., Xu S.-M. et al. . Combinatorial regulation of endothelial gene expression by ETS and Forkhead transcription factors. Cell. 2008; 135:1053–1064. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Jolma A., Yin Y., Nitta K.R., Dave K., Popov A., Taipale M., Enge M., Kivioja T., Morgunova E., Taipale J.. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015; 527:384–388. [DOI] [PubMed] [Google Scholar]
  • 5. Jolma A., Kivioja T., Toivonen J., Cheng L., Wei G., Enge M., Taipale M., Vaquerizas J.M., Yan J., Sillanpää M.J. et al. . Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 2010; 20:861–873. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Jolma A., Yan J., Whitington T., Toivonen J., Nitta K.R., Rastas P., Morgunova E., Enge M., Taipale M., Wei G. et al. . DNA-binding specificities of human transcription factors. Cell. 2013; 152:327–339. [DOI] [PubMed] [Google Scholar]
  • 7. Valouev A., Johnson D.S., Sundquist A., Medina C., Anton E., Batzoglou S., Myers R.M., Sidow A.. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods. 2008; 5:829–834. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8. Isakova A., Berset Y., Hatzimanikatis V., Deplancke B.. Quantification of cooperativity in heterodimer-DNA binding improves the accuracy of binding specificity models. J. Biol. Chem. 2016; 291:10293–10306. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. Stormo G.D., Schneider T.D., Gold L.. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res. 1986; 14:6661–6679. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Stormo G.D. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16:16–23. [DOI] [PubMed] [Google Scholar]
  • 11. LaRonde-LeBlanc N.A., Wolberger C.. Structure of HoxA9 and Pbx1 bound to DNA: Hox hexapeptide and DNA recognition anterior to posterior. Genes Dev. 2003; 17:2060–2072. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Dempster A.P., Laird N.M., Rubin D.B.. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. 1977; 39:1–38. [Google Scholar]
  • 13. Schmidt D., Schwalie P.C., Wilson M.D., Ballester B., Gonçalves A., Kutter C., Brown G.D., Marshall A., Flicek P., Odom D.T.. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell. 2012; 148:335–348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Nakahashi H., Kieffer Kwon K.-R., Resch W., Vian L., Dose M., Stavreva D., Hakim O., Pruett N., Nelson S., Yamane A. et al. . A genome-wide map of CTCF multivalency redefines the CTCF code. Cell Rep. 2013; 3:1678–1689. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Gupta S., Stamatoyannopoulos J.A., Bailey T.L., Noble W.S.. Quantifying similarity between motifs. Genome Biol. 2007; 8:R24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Quang D., Xie X.. EXTREME: an online EM algorithm for motif discovery. Bioinformatics. 2014; 30:1667–1673. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Bi C., Rogan P.K.. Bipartite pattern discovery by entropy minimization-based multiple local alignment. Nucleic Acids Res. 2004; 32:4979–4991. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18. Bi C., Leeder J.S., Vyhlidal C.A.. A comparative study on computational two-block motif detection: algorithms and applications. Mol. Pharm. 2008; 5:3–16. [DOI] [PubMed] [Google Scholar]
  • 19. Lu R., Mucaki E.J., Rogan P.K.. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res. 2017; 45:e27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Helden J.v., Rios A., Collado-Vides J.. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000; 28:1808–1818. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Li H., Rhodius V., Gross C., Siggia E.D.. Identification of the binding sites of regulatory proteins in bacterial genomes. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:11772–11777. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Liu X., Brutlag D.L., Liu J.S.. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symp. Biocomput. 2001; 6:127–138. [PubMed] [Google Scholar]
  • 23. Whitington T., Frith M.C., Johnson J., Bailey T.L.. Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011; 39:e98. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Kazemian M., Pham H., Wolfe S.A., Brodsky M.H., Sinha S.. Widespread evidence of cooperative DNA binding by transcription factors in Drosophila development. Nucleic Acids Res. 2013; 41:8237–8252. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. Jankowski A., Prabhakar S., Tiuryn J.. TACO: a general-purpose tool for predicting cell-type–specific transcription factor dimers. BMC Genomics. 2014; 15:1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26. Lawrence C.E., Reilly A.A.. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct. Funct. Bioinformatics. 1990; 7:41–51. [DOI] [PubMed] [Google Scholar]
  • 27. Cardon L.R., Stormo G.D.. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 1992; 223:159–170. [DOI] [PubMed] [Google Scholar]
  • 28. Bailey T.L., Elkan C.. The value of prior knowledge in discovering motifs with MEME. Proc. Third Internat. Conf. on Intelligent Systems for Molecular Biology. 1995; AAAI Press; 21–29. [PubMed] [Google Scholar]
  • 29. Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W., Noble W.S.. MEME suite: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(Suppl. 2):W202–W208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Xu M., Weinberg C.R., Umbach D.M., Li L.. coMOTIF: a mixture framework for identifying transcription factor and a coregulator motif in ChIP-seq Data. Bioinformatics. 2011; 27:2625–2632. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Li L. GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J. Comput. Biol. 2009; 16:317–329. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Mercier E., Droit A., Li L., Robertson G., Zhang X., Gottardo R.. An integrated pipeline for the genome-wide analysis of transcription factor binding sites from ChIP-Seq. PLoS One. 2011; 6:e16432. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33. Zhang Z., Chang C.W., Hugo W., Cheung E., Sung W.-K.. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm. J. Comput. Biol. 2013; 20:237–248. [DOI] [PubMed] [Google Scholar]
  • 34. Reid J.E., Wernisch L.. STEME: a robust, accurate motif finder for large data sets. PLoS One. 2014; 9:e90735. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Liu J.S., Neuwald A.F., Lawrence C.E.. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 1995; 90:1156–1170. [Google Scholar]
  • 36. Ikebata H., Yoshida R.. Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics. 2015; 31:1561–1568. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Alipanahi B., Delong A., Weirauch M.T., Frey B.J.. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature Biotech. 2015; 33:831–838. [DOI] [PubMed] [Google Scholar]
  • 38. Colombo N., Vlassis N. FastMotif: spectral sequence motif discovery. Bioinformatics. 2015; 31:2623–2631.
  • 39. Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., Laslo P., Cheng J.X., Murre C., Singh H., Glass C.K. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010; 38:576–589.
  • 40. Kulakovskiy I.V., Boeva V., Favorov A.V., Makeev V.J. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010; 26:2622–2623.
  • 41. Ma W., Noble W.S., Bailey T.L. Motif-based analysis of large nucleotide data sets using MEME-ChIP. Nat. Protoc. 2014; 9:1428–1450.
  • 42. Jayaram N., Usvyat D., Martin A.C. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics. 2016; 17:1298.
  • 43. Yin Y., Morgunova E., Jolma A., Kaasinen E., Sahu B., Khund-Sayeed S., Das P.K., Kivioja T., Dave K., Zhong F. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science. 2017; 356:eaaj2239.
  • 44. Huang Q., Whitington T., Gao P., Lindberg J.F., Yang Y., Sun J., Väisänen M.-R., Szulkin R., Annala M., Yan J. et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat. Genet. 2014; 46:126–135.
  • 45. Yan J., Enge M., Whitington T., Dave K., Liu J., Sur I., Schmierer B., Jolma A., Kivioja T., Taipale M. et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell. 2013; 154:801–813.
  • 46. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760.
  • 47. Bardet A.F., Steinmann J., Bafna S., Knoblich J.A., Zeitlinger J., Stark A. Identification of transcription factor binding sites from ChIP-seq data at high resolution. Bioinformatics. 2013; 29:2705–2713.
  • 48. ENCODE Project Consortium, Birney E., Stamatoyannopoulos J.A., Dutta A., Guigó R., Gingeras T.R., Margulies E.H., Weng Z., Snyder M., Dermitzakis E.T., Thurman R.E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007; 447:799–816.
  • 49. Katainen R., Dave K., Pitkänen E., Palin K., Kivioja T., Välimäki N., Gylfe A.E., Ristolainen H., Hänninen U.A., Cajuso T. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nat. Genet. 2015; 47:818–821.
  • 50. Hartonen T., Sahu B., Dave K., Kivioja T., Taipale J. PeakXus: comprehensive transcription factor binding site discovery from ChIP-Nexus and ChIP-Exo experiments. Bioinformatics. 2016; 32:i629–i638.
  • 51. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57.
  • 52. Gronemeyer H., Gustafsson J.-A., Laudet V. Principles for modulation of the nuclear receptor superfamily. Nat. Rev. Drug Discov. 2004; 3:950–964.
  • 53. Mathelier A., Fornes O., Arenillas D.J., Chen C.-y., Denay G., Lee J., Shi W., Shyr C., Tan G., Worsley-Hunt R. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44:D110–D115.
  • 54. Bailey T.L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011; 27:1653–1659.

Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press