Abstract
In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in a modular fashion such that deviations from the expected motif become explicit and noise from dimeric occurrences does not corrupt the monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as the software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations. Given training data and seeds for monomeric motifs, the algorithm learns, within a single probabilistic framework, a mixture model that represents monomeric motifs as standard position-specific probability matrices (PPMs) and dimeric motifs as pairs of monomeric PPMs with associated orientation and spacing preferences. For dimers, the model represents deviations from the purely modular model of two independent monomers, thus making co-operative binding effects explicit. MODER can analyze tens of Mbp of training data in reasonable time. We validated the tool on HT-SELEX and ChIP-seq data. Our findings include some TFs whose expected model has palindromic symmetry but whose observed model is directional.
INTRODUCTION
In transcriptional regulation, proteins called transcription factors (TFs) bind to specific DNA motifs, to have a regulatory effect on the transcription rate of particular genes. The regulating TFs may bind co-operatively in clusters of two or more factors which makes the regulation combinatorial by nature (1–4). Therefore, it is of interest not only to find the binding motifs for individual monomeric TFs but also motifs for dimeric and higher order co-operative binding of several TFs on the same regulatory area in DNA. With the massive training data currently available from, e.g. high-throughput SELEX (5,6) and ChIP-seq experiments (7), it is possible to learn complex binding models from quite weak signals.
In a large number of dimeric cases of TF binding, the specificity of the dimeric motif has recently been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other (4,6,8). Current automatic motif discovery tools do not learn monomeric and dimeric motifs soundly within one probabilistic framework in modular fashion such that the effects of co-operative binding on motifs could be shown and analyzed. In this paper, we propose such a learning algorithm and a software tool for modular discovery of monomeric and dimeric binding motifs for TFs.
The algorithm uses a class of probabilistic mixture models for (possibly multi-profile) monomeric binding motifs and all their dimeric combinations. Our model represents each monomeric motif as a standard position-specific probability matrix (PPM) (9,10). Each dimeric motif is represented in modular fashion as a pair of monomeric PPMs, with associated information on the relative orientation and spacing of the two monomeric components. In our model, the monomeric components need not be spatially separate but their sites may overlap; such overlaps have been reported, e.g. in (4,11). A novel feature of our model is that it includes a deviation matrix that represents explicitly how much the discovered dimeric PPM deviates from the expected PPM for independent component monomers. Another novelty is that monomeric and dimeric models are learned such that the effect of the noise from dimeric occurrences on monomeric models is minimized. Moreover, the mixing parameters of the model reveal the relative abundances of different motif combinations. In particular, the mixing parameters for the dimeric variants give precise quantitative indication of orientation and spacing preferences of the two monomers that make the dimer.
For learning our binding model we describe an expectation maximization (EM) algorithm (12), called MODER (MOtif DEtectoR). Given a data set of sequences that contain enriched motif instances, MODER learns by EM search the parameters of all model components simultaneously, as a mixture of several PPMs, by optimizing the alignment of the model with the training data using maximum likelihood estimation. The EM search is initialized with user-given seed sequences for the monomeric profiles of the model. It finds PPMs for the monomers as well as for their dimeric combinations within a given range of spacings and orientations. Higher-order combinations are not included, as they would exponentially increase the complexity and the size of the model. Monomer PPMs are learned using pruning techniques that minimize contamination from nearby motif occurrences and from background. The requirement to provide seeds is a limitation of MODER, as it makes the tool depend on prior knowledge (such as motif databases) or on other motif discovery algorithms. On the other hand, seed-based initialization makes MODER fast and capable of processing in reasonable time training data consisting of sequences that are hundreds of bp long and several Mbp in total size. MODER was designed for motif discovery from HT-SELEX reads, but other types of training data, such as ChIP-seq data sets, can be used as well.
Validation experiments of MODER show robust and fast performance both on HT-SELEX and ChIP-seq data. We applied MODER on six HT-SELEX data sets, each consisting of $10^5$–$10^6$ reads of length 30 or 40, and found varying amounts of difference between observed and expected motifs: for example, for factors FLI1 and PKNOX2 the expected homodimeric model has palindromic symmetry but the observed model is directional, reconfirming an earlier observation in (6). From ChIP-seq data MODER finds for factor CTCF essentially the same dimeric model as reported in (13,14), and for modular receptor RXRA a dimeric model that the Tomtom tool (15) matches with a known RXRA heterodimer. For factor NRSF, MODER finds from ChIP-seq data essentially the same multi-profile model as in (16).
In previous research, a dimer model quite similar to ours but without explicit modular structure and overlaps of monomers within dimers was introduced, with an entropy minimization learning algorithm Bipad/Maskminent (17–19). Discovery of spaced dyads (pairs of relatively short motifs) was considered in (20,21). Gibbs sampling based BioProspector (22) is another early dimer search algorithm. Recent dimer prediction methods include SpaMo (23), iTFs (24), and TACO (25). All start from given monomeric PPMs and find, using thresholding, the occurrence sites of the PPMs in the training data. Then enrichment of specific spacings of pairs of occurrences is detected, with an analysis of the statistical significance but without an analysis of co-operative effects of dimer components. SpaMo was designed for finding preferred distances between the site of the primary TF and the sites of secondary TFs in ChIP-seq data. The dimer model of iTFs includes relative orientation of the components but it does not consider overlaps and uses binned distances. Finally, TACO’s model includes orientation and distance and allows the components to overlap, but does not analyze the effect of overlap on the binding profile.
The use of the EM algorithm in motif discovery was initiated by Lawrence and Reilly (26), and it was used for finding motifs with spacers by Cardon and Stormo (27). The mixture model and the EM learning of MODER generalize the techniques of MEME (28,29) to the multi-profile dimeric case. As compared to MEME, an important feature of MODER is that it learns all submodels simultaneously, using all training data symmetrically. coMOTIF (30) is another simultaneous multi-profile motif finder based on the EM algorithm. It does not, however, keep track of the distances between binding sites and does not allow overlaps of binding sites, nor does it model deviations or learn the motif in the gap positions between the dimer components. MODER can be seen as a generalization of coMOTIF. Recent EM-based finders of monomer motifs include GADEM and rGADEM (31,32), which use a genetic algorithm with EM to improve starting PPMs; SEME (33), which uses importance sampling to speed up the search; EXTREME (16), which achieves a speed-up by using the online version of the EM algorithm; and STEME (34), which resorts to suffix trees. Moreover, Liu et al. (35) use Gibbs sampling, and Ikebata and Yoshida (36) use a repulsive MCMC version of MEME-type search for simultaneous discovery of several motifs; Alipanahi et al. (37) use deep learning for motif discovery, with good validation results but a non-modular structure of the underlying model; and Colombo and Vlassis (38) find monomeric motifs with a fast spectral learning algorithm. Recent motif finders specially designed for large ChIP-seq data include rGADEM (32), HOMER (39), ChIP-Munk (40), and MEME-ChIP (41), evaluated in (42).
The rest of the paper is organized as follows. We first define the mixture model of MODER and then give the associated EM algorithm for estimating the model parameters. Next we describe our implementation of MODER, with techniques to initialize and prune the search. Finally, we report validation and comparison experiments and discuss motifs found by MODER for TFs FLI1, HOXB13, HNF4A, TFAP2A, FOXC1, PKNOX2, NRSF, CTCF and RXRA.
MATERIALS AND METHODS
Model structure
The binding affinity model learned by MODER, specified by parameters η = (θ, ψ, λ), gives a probability distribution for sequences over some alphabet Σ. We always use the DNA alphabet Σ = {A, C, G, T}, but the model works for arbitrary alphabets.
Model η is a mixture of distributions for monomeric sequences that contain one occurrence of a monomeric motif, and distributions for dimeric sequences that contain two monomeric motifs in a specific relative orientation and spacing, and a distribution for background sequences. Monomeric distributions are built from the PPMs of the monomers and the background. For all orientation and spacing alternatives between the two monomers in a dimer, dimeric distributions are built either from the PPMs of the monomers and the background or from the PPM of the entire dimer and the background. If the two monomers of a dimer do not overlap and have a long gap in between, then the dimeric distribution is just the product of the two monomer PPMs, that is, the model assumes that there is no co-operative effect affecting the independence of the two binding profiles. If the monomers overlap or the gap between them is short, then the binding profiles of two monomers do not necessarily remain independent. There can be interaction between the components of a dimer as they may physically contact each other, or the interaction can be DNA mediated (4). Therefore the model allows deviating from pure reduction to monomer PPMs and also represents, using the so-called deviation matrix, how the PPM learned from data differs from the product of monomer PPMs which would be the expected model if there are no interactions.
The three parameter groups of η = (θ, ψ, λ) and the parametrization of the dimeric structures are defined in detail in the following subsections.
Monomeric PPMs θk and background θ0
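Parameter θ = (θ0, θ1, …, θp) gives the background distribution θ0 and p monomeric motifs θk. Each θk, k ≠ 0, is a 4 × ℓk PPM
$$\theta_k = \big(\theta_k[a, h]\big)_{a \in \Sigma,\; h = 1, \ldots, \ell_k},$$
where $\theta_k[a, h]$ gives the probability for an alphabet symbol (nucleotide) a to occur in position h of θk, and ℓk denotes the length of θk. The reverse complement of θk is a PPM $\theta_k^{-1}$ such that $\theta_k^{-1}[a, h] = \theta_k[\bar{a}, \ell_k - h + 1]$ for each a and h, where $\bar{a}$ is the complementary base of a (e.g., $\bar{A} = T$).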
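The reverse complement operation can be illustrated with a small Python sketch. It assumes, purely for illustration, that a PPM is stored as a list of columns, each column holding the probabilities of A, C, G and T in that order; the function name `reverse_complement_ppm` is hypothetical and not part of MODER.

```python
# Assumed representation: a PPM is a list of columns; each column is
# [p(A), p(C), p(G), p(T)]. This layout is an illustrative choice.

def reverse_complement_ppm(ppm):
    """Return the reverse complement of a PPM given as a list of columns."""
    # Reversing the column order reverses the positions; reversing each
    # column swaps A<->T and C<->G because the row order is A, C, G, T.
    return [list(reversed(column)) for column in reversed(ppm)]


theta = [
    [0.7, 0.1, 0.1, 0.1],  # position 1: mostly A
    [0.1, 0.6, 0.2, 0.1],  # position 2: mostly C
    [0.2, 0.2, 0.5, 0.1],  # position 3: mostly G
]
rc = reverse_complement_ppm(theta)
# The last column of rc gives T the probability that A has in position 1.
```

Applying the function twice returns the original matrix, which is a quick consistency check for any implementation of this operation.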
The mononucleotide background model θ0 gives the occurrence probabilities of each alphabet symbol in a position that is outside the occurrences of monomers or dimers. The background model is position-independent.
Dimer specification k1k2od
The model uses monomeric motifs θk as building blocks of dimeric motifs. The possible dimeric motifs are indexed with quadruples (k1, k2, o, d), which we abbreviate as k1k2od (not to be confused with the product of these symbols). A dimer with index k1k2od is composed of monomers $\theta_{k_1}$ and $\theta_{k_2}$ whose orientation is o and whose distance (spacing) from the end of $\theta_{k_1}$ to the start of $\theta_{k_2}$ is d, where o ranges over the allowed pairwise orientations and d over the allowed distances, both specified below. Because of co-operative binding effects, monomer motifs alone are not enough for building dimeric models. To model such effects we use an additional PPM (see the next subsection) that covers the middle area of the dimer, called the bridging segment. Figure 1 illustrates our parametrization of dimeric structures; cf. (17).
The set of possible pairwise orientations o is {HT, HH, TT} if k1 = k2 (homodimer), and {HT, HH, TT, TH} otherwise (heterodimer). Table 1 describes the different orientations o = (o1, o2) giving the directions of motifs $\theta_{k_1}$ and $\theta_{k_2}$. Note that for homodimers the orientations HT and TH are identical, and one can use HT to represent them both. We assume that motif $\theta_{k_1}$ always occurs before motif $\theta_{k_2}$ when moving from the 5′ end to the 3′ end and using the motif start position as the reference point. The reverse order of the two motifs transforms back to this case by considering the complementary strand.
Table 1. Relative orientation of two motif occurrences within a dimer.
Orientation o | Short-hand | Direction | o1 | o2 |
---|---|---|---|---|
Head-to-Tail | HT | → → | +1 | +1 |
Head-to-Head | HH | → ← | +1 | −1 |
Tail-to-Tail | TT | ← → | −1 | +1 |
Tail-to-Head | TH | ← ← | −1 | −1 |
Exponents o1 and o2 give the orientation of the first and the second PPM: for a PPM θk, $\theta_k^{+1}$ leaves the matrix intact but $\theta_k^{-1}$ takes the reverse complement.
The possible distances d between the two occurrences are given as an interval $[d_{\min}, d_{\max}]$. If d is non-negative, it gives the number of gap positions between the two occurrences. If d < 0, then the occurrences overlap by −d positions. The smallest possible distance is limited by the lengths of the monomers. The MODER implementation uses an (optionally adjustable) default lower bound under which overlaps of only up to half of the length of the monomers are allowed. The longest distance possible for sequences of maximum length Lmax is $L_{\max} - \ell_{k_1} - \ell_{k_2}$.
We use parameter δ ≥ 0 to give the minimum spacing such that if the space between the two monomers of a dimer is ≥δ then the monomer profiles are assumed independent, i.e. in this case the model ignores the possible co-operative interactions that would change the binding preferences of the two TFs or the gap between them. Parameter δ is a user-given constant (default value δ = 4 in our implementation).
In what follows, we refer to the available monomeric and dimeric motifs with index k that may belong to the following three separate sets:
M = {1, …, p}: the indices for monomeric motifs.
D+ = {k1k2od: d ≥ δ, k1, k2 ∈ M}: the indices for dimeric motifs whose monomers $\theta_{k_1}$ and $\theta_{k_2}$ have a gap of length ≥δ in between. This is called the independent case.
D− = {k1k2od: d < δ, k1, k2 ∈ M}: the indices for dimeric motifs whose monomers $\theta_{k_1}$ and $\theta_{k_2}$ have a gap of length <δ in between. This is called the dependent case. Note that this case includes dimers whose monomers overlap.
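As an illustration of how the index sets D+ and D− partition the dimer space, the following Python sketch enumerates all quadruples for a given set of monomer pairs. The function name and the explicit distance bounds `d_min` and `d_max` are assumptions made for this example; only the default δ = 4 and the orientation sets come from the text.

```python
def dimer_indices(pairs, d_min, d_max, delta=4):
    """Enumerate dimer indices (k1, k2, o, d), split into D+ and D-.

    pairs  -- iterable of monomer index pairs (k1, k2)
    d_min  -- smallest allowed distance (negative values mean overlap)
    d_max  -- largest allowed distance
    delta  -- minimum gap at which the monomers are treated as independent
    """
    d_plus, d_minus = [], []
    for k1, k2 in pairs:
        # For homodimers HT and TH coincide, so TH is left out.
        if k1 == k2:
            orientations = ("HT", "HH", "TT")
        else:
            orientations = ("HT", "HH", "TT", "TH")
        for o in orientations:
            for d in range(d_min, d_max + 1):
                target = d_plus if d >= delta else d_minus
                target.append((k1, k2, o, d))
    return d_plus, d_minus


independent, dependent = dimer_indices([(1, 1)], d_min=-2, d_max=5)
```

With one homodimer pair and distances from −2 to 5, each of the three orientations contributes two independent and six dependent indices.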
Dimeric PPMs τk1k2od, bridging PPMs ψk1k2od and deviation matrices κk1k2od
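We use $\tau_{k_1k_2od}$ to denote the PPM (which is a $4 \times (\ell_{k_1} + d + \ell_{k_2})$ matrix) for motif k1k2od ∈ D+∪D−. Each $\tau_{k_1k_2od}$ is a derived parameter, composed of free parameters, such that if k1k2od ∈ D+ then it is built from $\theta_{k_1}$, $\theta_{k_2}$, and background θ0, and if k1k2od ∈ D− then it is built from $\theta_{k_1}$, $\theta_{k_2}$, and the bridging PPM $\psi_{k_1k_2od}$ to be defined below. The constructions are as follows.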
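If k1k2od ∈ D+, then we put simply
$$\tau_{k_1k_2od} = \theta_{k_1}^{o_1} \bullet \underbrace{\theta_0 \bullet \cdots \bullet \theta_0}_{d} \bullet \theta_{k_2}^{o_2}, \tag{1}$$
where • concatenates matrices. There are d column-matrices θ0 in the middle of $\tau_{k_1k_2od}$, that is, the middle gap is filled with the background.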
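If k1k2od ∈ D−, then a middle segment of $\tau_{k_1k_2od}$ is a free parameter learned from data: for d < 0, the columns that are on the overlap area (plus one more column on both sides) are free parameters, and for 0 ≤ d < δ, the columns that are on the area between the monomers (plus one more column on both sides) are free parameters. This area of length |d| + 2 in the middle of a dimer is called the bridging segment, and the 4 × (|d| + 2) PPM for the bridging segment is called the bridging PPM. We let $\psi_{k_1k_2od}$ denote the bridging PPM. Now, the columns of $\tau_{k_1k_2od}$ that cover the bridging segment come from $\psi_{k_1k_2od}$, while the columns outside this segment are supposed to reduce to the monomer motifs, i.e. they are as in the implied prefix and suffix segments of monomer matrices $\theta_{k_1}^{o_1}$ and $\theta_{k_2}^{o_2}$. So we get
$$\tau_{k_1k_2od} = \theta_{k_1}^{o_1}\big[\cdot,\, 1..\ell_{k_1} - 1 + \min(d, 0)\big] \bullet \psi_{k_1k_2od} \bullet \theta_{k_2}^{o_2}\big[\cdot,\, 2 + \max(-d, 0)..\ell_{k_2}\big], \tag{2}$$
where $\theta[\cdot, u..v]$ denotes the submatrix consisting of columns u through v.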
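The two constructions of the dimer PPM can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not MODER's code: PPMs are lists of columns ordered A, C, G, T, the function names are hypothetical, and the column ranges of the prefix and suffix are derived from the description of the bridging segment (the gap or overlap plus one column on each side).

```python
def orient(ppm, o):
    """Return a copy of ppm for o = +1, its reverse complement for o = -1."""
    if o == +1:
        return [list(col) for col in ppm]
    return [list(reversed(col)) for col in reversed(ppm)]


def dimer_ppm(theta1, theta2, o1, o2, d, theta0, psi=None):
    """Assemble a dimer PPM as a list of columns.

    psi=None : independent case, needs d >= 0; the gap is background.
    psi given: dependent case; psi must have |d| + 2 columns and replaces
               the bridging segment between a prefix of the first monomer
               and a suffix of the second.
    """
    first, second = orient(theta1, o1), orient(theta2, o2)
    if psi is None:
        if d < 0:
            raise ValueError("overlapping monomers need a bridging PPM")
        return first + [list(theta0) for _ in range(d)] + second
    if len(psi) != abs(d) + 2:
        raise ValueError("bridging PPM must have |d| + 2 columns")
    prefix_len = len(first) - 1 + min(d, 0)
    suffix_start = 1 + max(-d, 0)  # 0-based column index in `second`
    if prefix_len < 0 or suffix_start > len(second):
        raise ValueError("overlap too long for these monomers")
    return first[:prefix_len] + [list(c) for c in psi] + second[suffix_start:]


bg = [0.25, 0.25, 0.25, 0.25]
t1 = [[0.7, 0.1, 0.1, 0.1] for _ in range(3)]
t2 = [[0.1, 0.1, 0.1, 0.7] for _ in range(3)]
independent = dimer_ppm(t1, t2, +1, +1, 2, bg)
dependent = dimer_ppm(t1, t2, +1, +1, -1, bg, psi=[bg, bg, bg])
```

In both cases the result has ℓk1 + d + ℓk2 columns, which is a convenient invariant to check when the bridging segment is spliced in.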
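Next, we make explicit how PPM $\tau_{k_1k_2od}$ differs from the PPM that would be expected were the monomer motifs independent in the dimer. We denote such an expected PPM as $\varepsilon_{k_1k_2od}$. It models the situation that motifs $\theta_f = \theta_{k_1}^{o_1}$ and $\theta_r = \theta_{k_2}^{o_2}$, of lengths $\ell_f$ and $\ell_r$, have independent instances at distance d from each other in sequences with an occurrence of θf at the left end and θr at the right end.
First let d ≥ 0. Consider the occurrence probability P(ai) of the ith symbol ai. If $1 \le i \le \ell_f$, then $P(a_i) = \theta_f[a_i, i]$; if $\ell_f < i \le \ell_f + d$, then $P(a_i) = \theta_0(a_i)$, i.e., we expect to see the background distribution between the two motifs; and if $\ell_f + d < i \le \ell_f + d + \ell_r$, then $P(a_i) = \theta_r[a_i, i - \ell_f - d]$. This means that $\varepsilon_{k_1k_2od}$ is just θf followed by d columns, each equal to θ0, followed by θr; cf. the definition of $\tau_{k_1k_2od}$ in the independent case (1).
Now let d < 0, i.e., the motifs overlap by |d| symbols. Consider again the probability P(ai). If $1 \le i \le \ell_f + d$, then $P(a_i) = \theta_f[a_i, i]$, and hence the ith column of the expected PPM is $\varepsilon_{k_1k_2od}[\cdot, i] = \theta_f[\cdot, i]$. Similarly, if $\ell_f < i \le \ell_f + d + \ell_r$, then $P(a_i) = \theta_r[a_i, i - \ell_f - d]$, and hence $\varepsilon_{k_1k_2od}[\cdot, i] = \theta_r[\cdot, i - \ell_f - d]$. In the remaining case we have $\ell_f + d < i \le \ell_f$, and the ith symbol ai belongs to the area where the two motifs overlap. Hence ai is generated by both θf and θr, under the condition that both generate the same symbol, because in the overlapping area the two motifs have to coincide. Therefore P(ai) is equal to $\theta_f[a_i, i]\,\theta_r[a_i, i - \ell_f - d]$, normalized by the condition that both motifs generate the same symbol. This gives
$$P(a_i) = \frac{\theta_f[a_i, i]\;\theta_r[a_i, i - \ell_f - d]}{\sum_{b \in \Sigma} \theta_f[b, i]\;\theta_r[b, i - \ell_f - d]}, \tag{3}$$
and therefore the ith column becomes
$$\varepsilon_{k_1k_2od}[\cdot, i] = \frac{\theta_f[\cdot, i] \times \theta_r[\cdot, i - \ell_f - d]}{\theta_f[\cdot, i]^{\mathsf T}\, \theta_r[\cdot, i - \ell_f - d]}, \tag{4}$$
where × denotes element-wise product.
Finally, the deviation matrix, defined as
$$\kappa_{k_1k_2od} = \tau_{k_1k_2od} - \varepsilon_{k_1k_2od},$$
gives the difference between the observed and the expected model. Deviation matrices will be visualized using a variant of the sequence logo in which positive values are shown above a separating line and negative values below it; see Figure 2. Note also that the expected PPM of homodimers is always palindromically symmetric for orientations HH and TT.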
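The expected PPM and the deviation matrix can be computed directly from the case analysis above. The Python sketch below is an illustration only: it assumes PPMs are lists of columns (A, C, G, T), that the two monomer PPMs have already been oriented, and that all probabilities are strictly positive so the normalization in the overlap never divides by zero. The function names are hypothetical.

```python
def expected_dimer_ppm(theta_f, theta_r, d, theta0):
    """Expected PPM for two independent, already oriented monomer PPMs."""
    len_f, len_r = len(theta_f), len(theta_r)
    if -d > min(len_f, len_r):
        raise ValueError("overlap longer than the shorter monomer")
    columns = []
    for i in range(len_f + d + len_r):  # 0-based dimer position
        in_first = i < len_f
        in_second = i >= len_f + d
        if in_first and in_second:
            # Overlap: element-wise product, renormalized to sum to one.
            f, r = theta_f[i], theta_r[i - len_f - d]
            product = [x * y for x, y in zip(f, r)]
            norm = sum(product)
            columns.append([p / norm for p in product])
        elif in_first:
            columns.append(list(theta_f[i]))
        elif in_second:
            columns.append(list(theta_r[i - len_f - d]))
        else:
            columns.append(list(theta0))  # gap between the monomers
    return columns


def deviation_matrix(observed, expected):
    """Column-wise difference: observed dimer PPM minus expected PPM."""
    return [[o - e for o, e in zip(oc, ec)]
            for oc, ec in zip(observed, expected)]


bg = [0.25, 0.25, 0.25, 0.25]
tf = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]
tr = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
overlapping = expected_dimer_ppm(tf, tr, -1, bg)
gapped = expected_dimer_ppm(tf, tr, 1, bg)
```

In the overlapping example both monomers favour C at the shared position, so the expected column is sharper than either monomer column alone; a learned dimer PPM that is flatter there would show up as negative entries for C in the deviation matrix.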
Mixing parameters λ
Mixing parameters λ = {λk: k ∈ {0}∪M∪D+∪D−} give the probability of each component of the mixture as follows:
λk, k ∈ M, is the probability that the sequence contains exactly one monomeric occurrence of motif θk and no other occurrences.
λk, k = k1k2od ∈ D+∪D−, is the probability that the sequence contains exactly one occurrence of motif $\tau_k$ and no occurrences of other motifs.
λ0 is the probability that the sequence contains no motif occurrences.
For each pair (k1, k2), the array of mixing parameter values $\lambda_{k_1k_2od}$, indexed by orientation o and distance d, is called the co-operative binding table (COB table) of motifs $\theta_{k_1}$ and $\theta_{k_2}$. The values in a COB table indicate the orientation and spacing preferences of the dimeric structures that are composed of $\theta_{k_1}$ and $\theta_{k_2}$.
Figure 2 illustrates model η for binding motifs of TF FLI1.
Learning by expectation maximization
Given a training data set X = {X1, X2, …, Xn} consisting of n DNA sequences $X_i = X_i[1]X_i[2]\cdots X_i[L_i]$, where Li is the length of the ith sequence, we use the EM algorithm (12,28) to find model parameters η which maximize the expectation of the likelihood L(η|X, Z) = P(X, Z|η), where latent variables Z give the 'missing information' used by an EM algorithm.
Latent variables are 0–1-valued random variables that indicate how the data X is aligned to the model. To align Xi, there are latent variables Zik ·, k ∈ {0}∪M∪D+∪D−, with exactly one of them having value 1, that code the alignment as follows.
Case Zi0 = 1: Sequence Xi has no occurrences of motifs and is generated by the background model θ0 alone.
Case Zikj = 1: If k ∈ M, then the sequence Xi has an occurrence of motif θk starting at position j. The rest of Xi is generated by the background model. If k = k1k2od ∈ D+∪D−, then the sequence Xi has an occurrence of motif τk at position j, that is, an occurrence of motif $\theta_{k_1}^{o_1}$ at position j and an occurrence of motif $\theta_{k_2}^{o_2}$ at position $j + \ell_{k_1} + d$ such that the occurrences of $\theta_{k_1}$ and $\theta_{k_2}$ have relative orientation o.
We denote by Sik the set of positions j at which motif k may occur in Xi. For k ∈ M we have Sik = {1, …, Li − ℓk + 1}, and for k = k1k2od ∈ D+∪D−, $S_{ik} = \{1, \ldots, L_i - (\ell_{k_1} + d + \ell_{k_2}) + 1\}$.
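The probability of Xi in model η, given the missing information Zi·, is straightforward to evaluate as follows. If sequence Xi contains no motif occurrences, i.e. Zi0 = 1, its probability is
$$P(X_i \mid Z_{i0} = 1, \eta) = \prod_{j'=1}^{L_i} \theta_0(X_i[j']). \tag{5}$$
If the sequence Xi contains one monomeric motif occurrence, i.e. Zikj = 1 for some k ∈ M, j ∈ Sik, its probability is
$$P(X_i \mid Z_{ikj} = 1, \eta) = \prod_{h=1}^{\ell_k} \theta_k\big[X_i[j + h - 1], h\big] \prod_{j' \in B_1} \theta_0(X_i[j']), \tag{6}$$
where B1 = {1, …, Li}∖[j, j + ℓk).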
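The two probabilities above translate into a few lines of Python. The sketch works in log space to avoid underflow on long sequences, assumes PPMs are lists of columns ordered A, C, G, T, and uses hypothetical function names; it considers one strand only.

```python
import math

IDX = {"A": 0, "C": 1, "G": 2, "T": 3}


def log_p_background(seq, theta0):
    """Log-probability of seq when every position comes from background."""
    return sum(math.log(theta0[IDX[s]]) for s in seq)


def log_p_with_motif(seq, ppm, j, theta0):
    """Log-probability of seq with one motif occurrence starting at j.

    j is 1-based, as in the text; positions outside the occurrence are
    generated by the background model.
    """
    start = j - 1
    if start < 0 or start + len(ppm) > len(seq):
        raise ValueError("motif does not fit at this position")
    logp = 0.0
    for pos, symbol in enumerate(seq):
        if start <= pos < start + len(ppm):
            logp += math.log(ppm[pos - start][IDX[symbol]])
        else:
            logp += math.log(theta0[IDX[symbol]])
    return logp


bg = [0.25, 0.25, 0.25, 0.25]
motif = [[0.7, 0.1, 0.1, 0.1]]  # a single-column motif preferring A
lp_bg = log_p_background("ACGT", bg)
lp_motif = log_p_with_motif("ACGT", motif, 1, bg)
```

Because a dimer PPM already contains background columns in the gap or a bridging PPM in the middle, the same function can be reused for dimeric components by passing the full dimer matrix as `ppm`.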
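For the dimeric binding we have two cases: independent (d ≥ δ) and dependent (d < δ). First let k = k1k2od ∈ D+. Define the set $B_2 = \{1, \ldots, L_i\} \setminus \big([j, j + \ell_{k_1}) \cup [j + \ell_{k_1} + d,\; j + \ell_{k_1} + d + \ell_{k_2})\big)$. Then the probability of Xi is
$$P(X_i \mid Z_{ikj} = 1, \eta) = \prod_{h=1}^{\ell_{k_1}} \theta_{k_1}^{o_1}\big[X_i[j + h - 1], h\big] \prod_{h=1}^{\ell_{k_2}} \theta_{k_2}^{o_2}\big[X_i[j + \ell_{k_1} + d + h - 1], h\big] \prod_{j' \in B_2} \theta_0(X_i[j']). \tag{7}$$
Now let k = k1k2od ∈ D−. The probability of Xi is
$$P(X_i \mid Z_{ikj} = 1, \eta) = \prod_{h=1}^{\ell_{k_1} + d + \ell_{k_2}} \tau_k\big[X_i[j + h - 1], h\big] \prod_{j' \in B_3} \theta_0(X_i[j']), \tag{8}$$
where $B_3 = \{1, \ldots, L_i\} \setminus [j, j + \ell_{k_1} + d + \ell_{k_2})$.
Recall from (2) that $\tau_k$ is composed of the bridging PPM $\psi_k$ in the middle and of flanking segments taken from PPMs $\theta_{k_1}^{o_1}$ and $\theta_{k_2}^{o_2}$.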
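Now the joint likelihood of the model parameters, given data X and missing information Z, is the product of mixture probabilities of each Xi:
$$L(\eta \mid X, Z) = \prod_{i=1}^{n} \Big[\lambda_0\, P(X_i \mid Z_{i0} = 1, \eta)\Big]^{Z_{i0}} \prod_{k \in M \cup D^+ \cup D^-}\; \prod_{j \in S_{ik}} \Big[\frac{\lambda_k}{|S_{ik}|}\, P(X_i \mid Z_{ikj} = 1, \eta)\Big]^{Z_{ikj}}.$$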
It is important to note here that, to simplify notation, we have ignored the fact that we should consider motif occurrences appearing in the reverse DNA strand as well. For this algorithm to work in the two-stranded case, a new index should be added, which specifies the direction (+1 or –1) of a monomer or a dimer occurrence. Then in all the places where we sum over j, we should sum over the directions as well. Moreover, to make sure that the probabilities add up to one, an additional division by two should be performed where we currently divide by |Sik|.
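As for each i exactly one of the latent values Zi· equals 1 and the others are zeros, the log-likelihood has the following form:
$$\log L(\eta \mid X, Z) = \sum_{i=1}^{n} \Big[ Z_{i0} \log\big(\lambda_0\, P(X_i \mid Z_{i0} = 1, \eta)\big) + \sum_{k \in M \cup D^+ \cup D^-}\; \sum_{j \in S_{ik}} Z_{ikj} \log\Big(\frac{\lambda_k}{|S_{ik}|}\, P(X_i \mid Z_{ikj} = 1, \eta)\Big) \Big]. \tag{9}$$
The EM algorithm repeatedly applies the following rule to update η = (θ, ψ, λ) until convergence:
$$\eta^{(t+1)} = \arg\max_{\eta}\; \mathrm{E}_{Z \mid X, \eta^{(t)}}\big[\log L(\eta \mid X, Z)\big].$$
One iteration of the algorithm, indexed with t, consists of an E-step and an M-step. These steps are described next.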
Expectation step
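The E-step finds the expectation of log-likelihood (9) for current parameter values η(t). By linearity of expectation, this reduces to finding the expected values $\hat{Z}_{i\cdot}$ of latent variables Zi·. By noting that 0 and 1 are the only possible values of a latent variable, and by applying the Bayes rule, one can see that the expected values, and hence the update rule of the E-step, become, for the background component and for k ∈ M∪D+∪D− and j ∈ Sik, as follows:
$$\hat{Z}_{i0}^{(t)} = \frac{\lambda_0^{(t)}\, P(X_i \mid Z_{i0} = 1, \eta^{(t)})}{P(X_i \mid \eta^{(t)})}, \tag{10}$$
$$\hat{Z}_{ikj}^{(t)} = \frac{\lambda_k^{(t)}\, P(X_i \mid Z_{ikj} = 1, \eta^{(t)})}{|S_{ik}|\; P(X_i \mid \eta^{(t)})}. \tag{11}$$
Here, probability P(Xi|Zi0 = 1, η(t)) is given by (5) and probability P(Xi|Zikj = 1, η(t)) by (6), (7) or (8), and
$$P(X_i \mid \eta^{(t)}) = \lambda_0^{(t)}\, P(X_i \mid Z_{i0} = 1, \eta^{(t)}) + \sum_{k \in M \cup D^+ \cup D^-}\; \sum_{j \in S_{ik}} \frac{\lambda_k^{(t)}}{|S_{ik}|}\, P(X_i \mid Z_{ikj} = 1, \eta^{(t)}). \tag{12}$$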
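A minimal Python sketch of the E-step for a single sequence is given below. It is an illustration under simplifying assumptions rather than MODER's implementation: each mixture component is passed as a ready-made PPM (a list of columns ordered A, C, G, T; for dimers this would be the full dimer matrix), only one strand is considered, and plain probabilities are used instead of logarithms, which is adequate only for short sequences. All names are hypothetical.

```python
IDX = {"A": 0, "C": 1, "G": 2, "T": 3}


def site_probability(seq, ppm, start, theta0):
    """P(seq | one occurrence of ppm at 0-based start, background elsewhere)."""
    prob = 1.0
    for pos, symbol in enumerate(seq):
        a = IDX[symbol]
        if start <= pos < start + len(ppm):
            prob *= ppm[pos - start][a]
        else:
            prob *= theta0[a]
    return prob


def e_step(seq, theta0, components, lam0, lam):
    """Expected latent values for one sequence.

    components -- dict: component name -> PPM
    lam0, lam  -- background weight and dict of component weights
    Returns a dict whose keys are "background" or (name, 0-based start).
    """
    # An empty PPM makes every position a background position.
    joint = {"background": lam0 * site_probability(seq, [], 0, theta0)}
    for name, ppm in components.items():
        n_pos = len(seq) - len(ppm) + 1
        for j in range(n_pos):  # uniform prior over the start positions
            p = site_probability(seq, ppm, j, theta0)
            joint[(name, j)] = lam[name] / n_pos * p
    total = sum(joint.values())  # P(seq | current parameters)
    return {key: value / total for key, value in joint.items()}


bg = [0.25, 0.25, 0.25, 0.25]
motif_ac = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
resp = e_step("GGACGG", bg, {"m": motif_ac}, lam0=0.5, lam={"m": 0.5})
best = max(resp, key=resp.get)
```

In this toy example the sequence contains the motif consensus AC at the third position, so most of the probability mass is assigned to the key `("m", 2)`.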
Maximization step
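The M-step maximizes the expectation of the log-likelihood for the current $\hat{Z}^{(t)}$ by updating parameters η = (θ, ψ, λ). The form of log-likelihood (9) is such that the M-step is of Baum–Welch type: parameters are updated by normalizing the expected counts of using different components of the model when X is aligned to the model according to $\hat{Z}^{(t)}$.
The update rules for mixing parameters become:
$$\lambda_0^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \hat{Z}_{i0}^{(t)}, \tag{13}$$
$$\lambda_k^{(t+1)} = \frac{1}{n} \sum_{i=1}^{n} \sum_{j \in S_{ik}} \hat{Z}_{ikj}^{(t)}, \qquad k \in M \cup D^+ \cup D^-. \tag{14}$$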
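To update θ and ψ we first accumulate the expected counts of how many times each mixture component is used when X is aligned with η(t). For all k ∈ M, we get the 4 × ℓk matrices of expected counts of the monomer motifs as
$$W_k = \sum_{i=1}^{n} \Big( \sum_{j \in S_{ik}} \hat{Z}_{ikj}^{(t)}\, I_{\ell_k}(X_i, j) + \sum_{k' = k k_2 o d \in D^+}\; \sum_{j \in S_{ik'}} \hat{Z}_{ik'j}^{(t)}\, I_{\ell_k}(X_i, j)^{o_1} + \sum_{k' = k_1 k o d \in D^+}\; \sum_{j \in S_{ik'}} \hat{Z}_{ik'j}^{(t)}\, I_{\ell_k}(X_i, j + \ell_{k_1} + d)^{o_2} \Big).$$
Here $I_{\ell}(X_i, j)$ is a 4 × ℓ matrix-valued indicator function such that $I_{\ell}(X_i, j)[a, h] = 1$ if Xi[j + h − 1] = a, and otherwise $I_{\ell}(X_i, j)[a, h] = 0$. Again, $I^{-1}$ is the reverse complement of $I$. Note that the above aggregation of Wk implements the modularity of binding: a monomer model θk gets its counts from monomeric occurrences of θk as well as from occurrences of θk as an independent component of a dimer. Since the monomer models are not learned from the overlapping cases, there is no coupling between the monomers and the deviation matrices, i.e. both are uniquely defined.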
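For k = k1k2od ∈ D−, the $4 \times (\ell_{k_1} + d + \ell_{k_2})$ matrix of the expected counts is
$$W_k = \sum_{i=1}^{n} \sum_{j \in S_{ik}} \hat{Z}_{ikj}^{(t)}\, I_{\ell_{k_1} + d + \ell_{k_2}}(X_i, j).$$
According to our modularity constraint, the columns of $W_k$ that are outside the bridging segment should be modeled with $\theta_{k_1}$ and $\theta_{k_2}$. They should therefore be added to $W_{k_1}$ and $W_{k_2}$ as follows:
$$W_{k_1} \leftarrow W_{k_1} + \big(W_k^{\mathrm{pre}}\big)^{o_1}, \tag{15}$$
$$W_{k_2} \leftarrow W_{k_2} + \big(W_k^{\mathrm{suf}}\big)^{o_2}, \tag{16}$$
where $W_k^{\mathrm{pre}}$ is the $4 \times \ell_{k_1}$ matrix whose columns $h = 1, \ldots, \ell_{k_1} - 1 + \min(d, 0)$ equal $W_k[\cdot, h]$ and whose other columns are zero, and $W_k^{\mathrm{suf}}$ is the $4 \times \ell_{k_2}$ matrix whose columns $h = 2 + \max(-d, 0), \ldots, \ell_{k_2}$ equal $W_k[\cdot, \ell_{k_1} + d + h]$ and whose other columns are zero.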
The count vector $W_0$ of the background model is obtained from the expected counts of symbols at positions outside the motif occurrences, computed using $c_X$, the column-vector of total counts of alphabet symbols in the data set X.
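When normalized column-wise, the matrices Wk (with pseudo-counts possibly added) give updated θk for k ∈ {0}∪M:
$$\theta_k^{(t+1)}[a, h] = \frac{W_k[a, h]}{\sum_{b \in \Sigma} W_k[b, h]}, \qquad k \in M,\; h = 1, \ldots, \ell_k, \tag{17}$$
$$\theta_0^{(t+1)}(a) = \frac{W_0(a)}{\sum_{b \in \Sigma} W_0(b)}. \tag{18}$$
Similarly, the bridging segments of Wk, k = k1k2od ∈ D−, give updated bridging PPMs ψk:
$$\psi_k^{(t+1)}[a, h] = \frac{W_k[a, p + h]}{\sum_{b \in \Sigma} W_k[b, p + h]}, \qquad p = \ell_{k_1} - 1 + \min(d, 0), \tag{19}$$
where h = 1, …, |d| + 2.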
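The normalization steps of the M-step can be sketched in Python as follows. This is an illustration only: count matrices are assumed to be lists of columns ordered A, C, G, T, the function names and the `pseudocount` argument are hypothetical, and the position of the bridging segment inside a dimer count matrix is derived from its description (the gap or overlap plus one column on each side).

```python
def normalize_columns(counts, pseudocount=0.0):
    """Turn a matrix of expected counts into a PPM, column by column."""
    ppm = []
    for column in counts:
        adjusted = [c + pseudocount for c in column]
        total = sum(adjusted)
        ppm.append([c / total for c in adjusted])
    return ppm


def bridging_columns(dimer_counts, len1, d):
    """Extract the |d| + 2 bridging columns from a dimer count matrix.

    len1 is the length of the first monomer; the bridging segment starts
    right after the first len1 - 1 + min(d, 0) columns.
    """
    start = len1 - 1 + min(d, 0)
    return dimer_counts[start:start + abs(d) + 2]


def update_mixing(expected_usage, n):
    """New mixing weights: expected usage of each component divided by n."""
    return {name: usage / n for name, usage in expected_usage.items()}


ppm = normalize_columns([[2.0, 1.0, 1.0, 0.0]])
counts = [[float(i)] * 4 for i in range(7)]  # 3 + 1 + 3 columns, d = 1
bridge = bridging_columns(counts, len1=3, d=1)
weights = update_mixing({"background": 5.0, "m": 15.0}, n=20)
```

A new bridging PPM would then be obtained as `normalize_columns(bridge)`, optionally with a small pseudocount to keep all probabilities positive.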
Implementation of MODER
In this section we give practical details of our implementation of the MODER algorithm and provide some modifications to improve its efficiency.
Input
The input of MODER consists of the following items.
Data set X that consists of DNA sequences X1, X2, …, Xn, with |Xi| = Li for all i = 1, …, n.
The seeds s1, s2, …, sp. Each sk is an IUPAC sequence of length |sk| = ℓk. Seeds should be high-affinity representative sequences, one for each monomeric motif to be learned from data X. They will be used for constructing initial values for PPMs θk.
Set R⊂{1, 2, …, p}2 of pairs that restrict the set of dimeric motifs represented in η. MODER learns only dimers k1k2od such that (k1, k2) is in R.
Minimum gap length in dimers whose monomers are assumed independent, δ; maximum number of EM-iterations, maxiter; and the convergence threshold for parameter change in consecutive EM-iterations, ε.
EM iterations
As the EM algorithm converges to a local optimum, it is crucial to use good initial values for the parameters. Initial PPMs $\theta_k^{(1)}$ are obtained from input data X and seeds s1, …, sp using the multinomial method (5). Initial bridging PPMs $\psi_{k_1k_2od}^{(1)}$ are obtained from input data X and combined seeds using the multinomial method. A combined seed is constructed by orienting seeds $s_{k_1}$ and $s_{k_2}$ according to o, spacing them by d symbols, and replacing the symbols in the bridging segment by the neutral IUPAC symbol N. This gives sequence y. The combined seed is then the non-palindromic subsequence of input data X that has the highest count among those matching y. A non-palindromic seed makes it possible for the EM search to break the symmetry and find non-palindromic PPMs. The background model $\theta_0^{(1)}$ is initialized by normalizing $c_X$, the column-vector of total counts of alphabet symbols in X, to sum to one. The mixing parameters λ(1) are initialized as follows:
0.5,
0.2/|R| in total for each COB table, i.e. for each pair (k1, k2) ∈ R. Within a COB table the value 0.2/|R| is divided evenly among the cells.
The EM iterations then proceed as follows:
It should be noted that the above algorithm outputs the deviation matrix κ just for completeness. As κ is a derived parameter, it could be evaluated from θ and ψ in a post-processing phase as well.
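The control flow implied by the inputs maxiter and ε can be sketched as a generic outer loop in Python. This is an assumption-laden outline, not MODER's algorithm listing: `em_step` stands for one combined E-step and M-step, `max_change` for whatever measure of parameter change is compared with ε, and the default values of `maxiter` and `eps` are placeholders chosen for the example.

```python
def run_em(params, em_step, max_change, maxiter=100, eps=1e-6):
    """Iterate em_step until the parameter change drops below eps.

    params     -- initial parameters (any object em_step understands)
    em_step    -- function: params -> updated params (E-step + M-step)
    max_change -- function: (old, new) -> non-negative change measure
    Returns the final parameters and the number of iterations performed.
    """
    if maxiter < 1:
        raise ValueError("maxiter must be at least 1")
    for iteration in range(1, maxiter + 1):
        new_params = em_step(params)
        change = max_change(params, new_params)
        params = new_params
        if change < eps:
            break  # converged before reaching maxiter
    return params, iteration


# Toy stand-in for an EM update: a contraction with fixed point 2.0.
result, n_iter = run_em(
    0.0,
    em_step=lambda x: (x + 2.0) / 2.0,
    max_change=lambda old, new: abs(old - new),
)
```

In an actual run the parameter object would hold θ, ψ and λ, and derived quantities such as the deviation matrices could be computed once after the loop ends.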
Pruning the search
The MODER implementation makes some heuristic modifications to the EM framework described above in order to speed up the search and to utilize prior knowledge of data quality.
First, as the information content of well-known binding affinity PPMs is on average quite high while low information content may indicate contamination from background, MODER trims during the EM all overlapping dimeric mixture components k whose average column-wise information content in the overlapping area goes below a threshold (default 0.40 bits). This is done by setting λk ≔ 0. Similarly, any dimeric component k whose λk gets below a small threshold (default 0.001) is eliminated as k is too weak. Blank entries of COB tables indicate eliminated dimers.
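The information content test can be sketched in Python as below. The exact definition of information content used by MODER is not stated here, so the sketch assumes the common definition for DNA, 2 bits minus the Shannon entropy of the column (i.e. relative to a uniform background). The function names are hypothetical; the 0.40 bit threshold is the default quoted above.

```python
import math


def column_information(column):
    """Information content of one PPM column in bits (assumed: 2 - entropy)."""
    entropy = -sum(p * math.log2(p) for p in column if p > 0.0)
    return 2.0 - entropy


def should_prune(overlap_columns, threshold=0.40):
    """True if the average information content falls below the threshold."""
    average = sum(column_information(c) for c in overlap_columns) / len(overlap_columns)
    return average < threshold


flat = [[0.25, 0.25, 0.25, 0.25]]   # no preference: 0 bits
sharp = [[1.0, 0.0, 0.0, 0.0]]      # fully conserved: 2 bits
prune_flat = should_prune(flat)
prune_sharp = should_prune(sharp)
```

A component flagged this way would have its mixing weight set to zero, which removes it from all subsequent E-steps.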
Second, MODER learns new values of monomeric PPMs not from the full data but from dimeric occurrences of the monomer in which the distance d between the components is large enough (default d ≥ δ = 4). This is because such isolated occurrences within a dimer are supposed to give the best data for a monomer PPM, not distorted by other nearby sites such as the other component of a dimer. However, if the share of these dimeric cases in the mixture is less than 0.02, then the dimeric data is considered too small. In this case all distances d ≥ 0 are included in the analysis.
The third modification is motivated by the fact that transcription factors may have different binding motifs whose consensus sequences are only a few Hamming steps apart. To minimize disturbance from such similar motifs and from background, MODER restricts the learning of PPMs $\theta_k$ and $\psi_{k_1k_2od}$ to high-affinity training sequences. Such sequences are identified by the heuristic rule that they are in a small Hamming neighbourhood of the consensus sequences (sequences with the highest probability) of the PPMs found so far. Monomer PPM $\theta_k$ is learned from data sites that are in the 1-Hamming neighbourhood of the seed of $\theta_k$ (using the consensus sequence as the seed). Bridging PPM $\psi_{k_1k_2od}$ is learned from data sites that are in the 1-Hamming neighbourhood of the combined seed of the dimer. The combined seed is obtained in the same way as the initial combined seed (see Section EM iterations) but using the seeds of $\theta_{k_1}$ and $\theta_{k_2}$. MODER uses this seed-guided EM search by default, with the standard search as an option.
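The Hamming-neighbourhood restriction can be illustrated with a short Python sketch. It assumes PPMs are lists of columns ordered A, C, G, T and that candidate sites have the same length as the seed; the function names are hypothetical and IUPAC ambiguity codes are not handled.

```python
def consensus(ppm):
    """Most probable symbol in each column of a PPM."""
    return "".join("ACGT"[max(range(4), key=lambda a: col[a])] for col in ppm)


def hamming(a, b):
    """Number of mismatching positions between two equally long strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def near_seed_sites(sites, seed, radius=1):
    """Keep only the sites within the given Hamming radius of the seed."""
    return [s for s in sites if hamming(s, seed) <= radius]


ppm = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
seed = consensus(ppm)
kept = near_seed_sites(["AG", "AT", "CT"], seed)
```

Only the kept sites would then contribute expected counts to the PPM in the M-step, with the seed recomputed from the updated PPM in the next iteration.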
RESULTS
Generated data
As an initial sanity test we created a model η, generated a data set using it, and checked that MODER is able to learn η back from the generated data. We first created one monomeric PPM and deviation matrices $\kappa_{\mathrm{HH},-4}$ and $\kappa_{\mathrm{HT},-4}$. From these we constructed a model that had a uniform background (λ = 0.71) and PPMs for homodimers HH 5 (λ = 0.12), HH −4 (λ = 0.08) and HT −4 (λ = 0.09). Using this model, we generated 100 000 sequences of length 40 bp. The sequences contained dimeric motifs and background only; no monomeric sequences were included. MODER accurately relearned the model from this data, as the learned parameter values deviated from the original by at most 0.036; see Supplementary Figure S1 for details.
Validation using HT-SELEX data
Next we measured the quality of PPM models produced by MODER using the correlation (R2) between occurrence counts and PPM scores of 8-mers or 10-mers of SELEX data. When counting the k-mers, all occurrences and both directions were considered. As the score of a k-mer x by a single PPM ρ we used the maximum score of y under ρ′, where y and ρ′ go over all intersections of ρ and x and of ρ and the reverse complement of x. As the score of x by a mixture of PPMs ρ1, …, ρt, whose mixing parameters by MODER are λ1, …, λt, we used λ1S1 + ⋅⋅⋅ + λtSt where S1, …, St are the individual scores of x by the PPMs. The scatter plots in the figures visualize the counts and scores of different 8- or 10-mers in hexagonal bins. The color of a bin reflects the number of different k-mers in that bin, with a darker color meaning a higher number of different k-mers. As the early cycles of SELEX data can contain a large proportion of nonspecific sequences (i.e. background), the counts were corrected against background using the data of the previous SELEX cycle, as described in (5).
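The mixture score and the correlation measure can be sketched in Python as follows. The sketch takes the per-PPM scores S1, …, St as given, since their exact form is not spelled out here, and it assumes that R2 is the squared Pearson correlation; whether counts or scores are transformed (e.g. to log scale) before correlating is likewise left open. The function names are hypothetical.

```python
def mixture_score(scores, weights):
    """Weighted sum lambda_1*S_1 + ... + lambda_t*S_t for one k-mer."""
    if len(scores) != len(weights):
        raise ValueError("one weight per score is required")
    return sum(w * s for w, s in zip(weights, scores))


def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


combined = mixture_score([1.0, 3.0], [0.25, 0.75])
perfect = r_squared([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

In a validation run, `xs` would hold the background-corrected k-mer counts from the held-out half of the data and `ys` the corresponding model scores.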
We report results for the monomer and dimer PPMs of factors HOXB13, HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2 learned from HT-SELEX data. A basic correlation analysis is done for factor HOXB13. For HNF4A, TFAP2A, FLI1, FOXC1, and PKNOX2 we also analyse the differences between observed and purely modular motifs. In all validations, the SELEX data sets were randomly divided into two halves, one half used for learning the model and the other half for validating it.
We used the following HT-SELEX data sets: HOXB13 (PRJEB14550, 164 768 reads), HNF4A (ERX169045, (6), 655 432 reads), TFAP2A (ERX1085476, (43), 168 053 reads), FLI1 (PRJEB14550, 143 389 reads), FOXC1 (ERX169015, (6), 189 009 reads), and PKNOX2 (ERX1084652, (43), 423 339 reads). Each read was 40 bp long except for FOXC1 whose reads were 30 bp long. The following seeds, selected by hand using the models published in (6), were used as input: HOXB13 (CTCGTAAAA, CCAATAAAA), HNF4A (RGGTCA, RGTCCA), TFAP2A (GGGCA), FLI1 (ACCGGAAGTN), FOXC1 (RTAAAYA), and PKNOX2 (TGACANN). Note that it was essential to use non-palindromic seeds for overlapping dimers as, for example, the observed cases HH -6 for FLI1 and HH -2 for PKNOX2 are directional; see Figure S3 in (6) and Section EM iterations.
Selecting strong components of the model
The learned total model is likely to contain useless, weak components (weak dimeric motifs) that should be removed before the model is applied, e.g. to predict new putative binding sites. One could, for example, include model components in decreasing order of weight λ until a certain fraction of the non-background sequences is covered. Here we used the fraction of 85% to select the models for validation experiments. In addition, we also studied the effect of parameter δ (minimum gap length in the independent case) by experimenting with large values (up to Lmax) of δ. As the larger deviations from the expected model were observed to occur mostly in dimers with gap <4, the default value δ = 4 was selected.
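The selection rule can be sketched in Python as a greedy pass over the mixing weights. The function name, the dictionary layout and the key used for the background component are assumptions made for this example; the 85% fraction is the value used in the text.

```python
def select_components(lam, fraction=0.85, background_key="background"):
    """Pick components by decreasing weight until `fraction` is covered.

    lam -- dict: component name -> mixing weight (including background)
    The fraction refers to the total weight of non-background components.
    """
    motif_weights = {k: v for k, v in lam.items() if k != background_key}
    target = fraction * sum(motif_weights.values())
    chosen, covered = [], 0.0
    ranked = sorted(motif_weights.items(), key=lambda kv: kv[1], reverse=True)
    for name, weight in ranked:
        if covered >= target:
            break  # enough of the non-background mass is covered
        chosen.append(name)
        covered += weight
    return chosen


weights = {"background": 0.5, "monomer": 0.3, "HH -4": 0.15, "HT 2": 0.05}
selected = select_components(weights)
```

In this toy mixture the two heaviest motif components already cover 90% of the non-background weight, so the weakest dimer is dropped.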
Factors HOXB13, HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2
Figure 3 shows the sequence logos of the learned PPMs for factor HOXB13 and reports the correlations of the scores of the individual PPMs and of their mixture with the counts of 8-mers. Since MODER did not find any strong dimeric motifs, the model for this factor is composed of two monomers only. The power of multi-motif modeling can be seen: the combined mixture consistently gives the highest R².
HNF4A, TFAP2A, FLI1, FOXC1 and PKNOX2 are examples of TFs for which many dimeric PPMs deviate clearly from the purely modular PPMs. Analyses of these factors are shown in Figures 4–8, where the correlations are shown for both the expected (purely modular) and the learned models, and the deviation matrices are also visualized. The number of dimeric models included in the mixture by the 85% rule ranged from 3 to 14 for different factors; only the top three dimeric models are shown in the figures. The full set of models, weights, and resulting correlations is available in Supplementary File S1. Not surprisingly, the learned model always has a higher correlation, but with a varying margin. Sometimes (TFAP2A) deviating from the expected model gives a strongly improved model, while sometimes (FOXC1) the difference from the expected model is minor. As for the directionality of the motifs, sometimes both the expected and learned motifs are palindromic (Figure 5), while sometimes expected palindromes become directed in the learned motif (Figures 6 and 8).
Validation using ChIP-seq data
We then tested, for factors HOXB13, HNF4A and TFAP2A, the validity of the obtained in vitro models on in vivo data. We performed standard ROC analysis to measure the performance of the models learned from SELEX data on binary classification of ChIP-seq peaks. The following ChIP-seq data sets were used: HOXB13 (European Nucleotide Archive accession ERX332516, IgG: ERX332513) (44), HNF4A (Sequence Read Archive accession SRR952427, IgG: SRR952608) (45) and TFAP2A (SRR952485, IgG: SRR952608) (45). To find the peaks, the reads were aligned with BWA (46), and peak calling was done with Peakzilla (47). The genome assembly used was GRCh37 (hs37d5). From each ChIP-seq peak set, the top n = 230 peaks with the highest quality scores were selected, and for each peak a sequence of length L = 190 bp flanking the peak summit was chosen for the positive set. A negative set of the same size was chosen randomly from the human genome, making sure that the positions were mappable. Sequences were scored using the SELEX models of HOXB13, HNF4A and TFAP2A shown in Figures 3, 4 and 5. The resulting ROC curves, which show very good classification performance of the PPM scores, are shown in Figure 9.
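The ROC evaluation can be summarized by the area under the curve (AUC), which equals the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one. The Python sketch below computes it by direct pairwise comparison. It is illustrative only: how each sequence is reduced to a single score is not specified here, and centering the window on the summit in centered_window is an assumption.

```python
def centered_window(summit, length):
    """Half-open interval [start, end) of `length` bp around a peak summit.

    Assumption: the window is centered on the summit position.
    """
    start = summit - length // 2
    return start, start + length


def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. The quadratic loop is adequate
    for sets of a few hundred sequences, such as n = 230 per class.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

An AUC of 1.0 means that every positive sequence outscores every negative one, while 0.5 corresponds to random ranking.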
We also applied MODER to the ChIP-seq data set of factor NRSF on the GM12878 cell line produced by (48) and further analyzed by (16). To obtain the seeds, we first took all the k-mers of lengths 9–11 from the data set, applied hierarchical clustering, and selected the two best clusters and their representative k-mers (TTCAGCACC and GGACAGCTCC) by using the order given by the -score. MODER finds 9 out of the 10 models reported by Quang and Xie; for details, see Figure 2 of (16) and Supplementary Figure S2. Note that as NRSF has two monomeric motifs, MODER discovers heterodimeric motifs whose COB-table has all four orientations.
Next, we tried to detect the core and side motifs of factor CTCF. This should test MODER’s capabilities in detecting long sites, as together these motifs are known to form a site of length about 34 bp (13). We used raw ChIP-exo data from human LoVo cells targeting factor CTCF from Katainen et al. (49) (ENA accession ERX986066). Mapping and peak calling were done as in Hartonen et al. (50). Briefly, alignment was done using BWA (46) against assembly GRCh37 and the peaks were called using PeakXus (50). The five thousand highest scoring peaks were selected, and around the peak summits sequences of length 60 were extracted (blacklisted regions (51), ENCODE accession ENCFF001TDO, and centromeres were removed). The results in Supplementary Figure S3 show that MODER is able to detect similar configurations of distances and orientations between the core and side motifs of CTCF as in Schmidt et al. (13) (Figure 2) and Nakahashi et al. (14) (Figure 4). The strongest dimer formed by the core and side motifs, namely τ1, 2, HT, 8, was found in 20% of the top 5000 peaks.
As nuclear receptors commonly bind as dimers (52), we chose another factor, in addition to HNF4A, from this family to display the performance of MODER. ChIP-seq data from ENCODE for factor RXRA in cell line HepG2 was used (ENCODE accession ENCFF002CKZ). Ten thousand highest scoring peaks were selected, and around the peak summits sequences of length 40 were extracted (blacklisted regions and centromeres were removed). The monomer seed GGGGTCA for the experiment was handpicked based on the Rxra mouse model in Jaspar (53). The seed finding method used with NRSF for k-mer lengths 6–15 would give AGGTCA, which could have been used as well. Supplementary Figure S4 shows that MODER detects a strong dimeric binding motif τ1, 1, HT, 0, which could either be a homodimer of RXRA or a heterodimer such as NR1H2-RXRA, as suggested by Tomtom (15).
MODER versus MEME
When comparing MODER with the popular tool MEME, it should be noted that the motif models of the two methods are different. MEME learns separate monomer models in successive passes, deleting the found sites of a model from the data before the next pass, while MODER aims at discovering the modularity of motifs and hence learns the entire modular structure of monomeric and dimeric motifs in the same probabilistic framework in one run. The difference is illustrated in Supplementary Figure S5, which compares the models learned for factors TFAP2A and FLI1 by the two methods.
MODER versus Bipad/Maskminent
We also compared MODER with Bipad/Maskminent (17,19) which among the previous tools comes closest to MODER. An example qualitative comparison using alignment of models is illustrated in Supplementary Figure S6. Similarities between motifs obtained using these two algorithms are obvious, although Maskminent seems to introduce some background noise into the motifs. Comparison using correlation analysis was not performed since Maskminent does not learn the monomer model and the two orientation classes of dimers (DR and IR) in the same commensurate model, and hence the weights for models of different types could not be decided.
In order to make a quantitative comparison to Maskminent, we used the data sets for which a bipartite Maskminent model is available from Lu et al. (19). There were 53 such data sets in ENCODE (51), and we used 40 of those (6 had been revoked from ENCODE, 7 were unidentifiable). The identification problems were due to Lu et al. not giving the accession codes, but merely describing the data sets used. For all the data sets we managed to identify, we have included the accessions in Supplementary Table S1. The same number of top scoring peaks was used as in (19), but only 100 bp around the peak summit was selected, and these data sets were randomly divided into learning and validation sets of equal size. We selected the initial seeds for MODER based on Jaspar (53) in the following way: for JUN-like factors (ATF3, BACH1, BATF, FOS, FOSL1, FOSL2, JUN, JUNB, JUND, NFE2) we used the seed ATGA, for EBF1 the seed TCCC, for ESR1 the seed AGGTCA, for MAFF and MAFK the seed TCAGCA, and for STAT1 the seed TTC. Then MODER was run on the learning data sets, and the best dimeric PPM (according to λ) was chosen for each data set. For Maskminent we used their published model for each data set. The results displayed in Supplementary Table S1 and Figure S7 show that MODER gets a better AUC value in 35 cases out of 40. Note also that MODER wins in 33 out of 35 cases when considering only the optimal IDR-thresholded data sets (the other five data sets are initial peak sets, marked with a star in the table). The selection of factors used by Lu et al. (19) was unfortunately quite repetitive, but the comparison shows consistent behaviour for both methods.
DISCUSSION
The MODER algorithm is based on the reductionist view that PPM models for dimers can be built in a modular fashion from monomer PPMs. As noted e.g. by (4), such modularity is not always valid, as in a number of dimeric cases the specificity of the dimeric motif differs notably from what could be expected from its monomeric components. The deviation matrix of MODER represents such differences explicitly. These deviations from the expected models are especially important for orientations HH and TT, for which all expected models are always symmetric (palindromes), whereas the real binding motifs might have a direction (6). This was demonstrated by several examples in our validation experiments. In addition, deviations from the expected motif commonly occur when the core segments of the motifs of two factors are closely packed and the overlapping flanks are distorted from the expected model. In TF-DNA binding, the core positions in a motif are usually recognized by direct bonds to the bases, whereas the weaker positions are recognized by contacts to the DNA backbone (4) and are hence more prone to deviations.
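The palindromic symmetry of the expected HH and TT models can be checked with a small Python sketch for the simplest case, a non-overlapping homodimer. The sketch makes several assumptions that are not taken from MODER itself: the expected dimer is the plain concatenation of the two monomer PPMs with uniform gap columns in between, HH is read as the monomer followed by its reverse complement, and the deviation is the element-wise difference between an observed and an expected PPM. All function names are hypothetical.

```python
def revcomp_ppm(ppm):
    """Reverse complement of a PPM whose columns are ordered [A, C, G, T]."""
    return [col[::-1] for col in reversed(ppm)]


def expected_dimer(first, second, gap):
    """Concatenate two PPMs with `gap` uniform columns between them."""
    spacer = [[0.25, 0.25, 0.25, 0.25] for _ in range(gap)]
    return first + spacer + second


def deviation(observed, expected):
    """Element-wise difference observed - expected (assumed definition)."""
    return [[o - e for o, e in zip(oc, ec)] for oc, ec in zip(observed, expected)]


# Made-up three-column monomer, used only for illustration.
monomer = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.60, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]

hh = expected_dimer(monomer, revcomp_ppm(monomer), gap=2)  # head-to-head
ht = expected_dimer(monomer, monomer, gap=2)               # head-to-tail

print(hh == revcomp_ppm(hh))  # the expected HH model is its own reverse complement
print(ht == revcomp_ppm(ht))  # the expected HT model is not
```

Under these assumptions the expected HH model is always palindromic, so a directional observed motif can only be represented through non-zero entries in the deviation matrix.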
The motif discovery algorithm of MODER considers simultaneously all possible orientation–distance pairs and finds the preferred dimeric motifs. Learning multiple motifs in a serial manner (first finding one motif, then removing its occurrences from the data, and then running the algorithm again) does not treat symmetrically the sequences that may belong to several motifs. MODER improves over the similar coMOTIF algorithm (30) by including the spacing information in the overall model, and by adding overlapping motifs and the deviations from the expected motif. Allowing overlaps of monomer motifs within a dimer turned out to be a very useful feature. In fact, for factors FLI1, FOXC1 and PKNOX2 the strongest dimer has such an overlap.
Simultaneous learning of all motif components and their mixing parameters makes direct comparison of the relative strengths of the motifs possible by using the mixing parameters. Depending on the application, it might be useful to rescale the obtained mixing parameters after the actual algorithm is finished. This was done when we chose the motifs for performance testing by the 85% rule: the mixing parameters were rescaled to exclude the background, and motifs were then included in descending order until they covered 85% of the signal. Sometimes it might also be useful to rescale the mixing parameters in each COB table separately, although this would prevent the comparison of mixing parameters between distinct COB tables.
MODER is not too sensitive to noise in the seeds. For factor HOXB13, we mutated the first initial seed in two positions and the second seed in three positions, including informative positions. Still the algorithm managed to obtain the same results as with the original seeds. MODER is reasonably fast. For example, it took 2 min 18 s wall-clock time and 15 min 30 s CPU time when run simultaneously on eight cores to learn the model for FLI1 in Figure 2 from a 2 865 880 bp long HT-SELEX data set. The seeds for MODER can be found from existing PPM databases or can be produced by seed-finding tools such as DREME (54) or by using the procedure in section Validation using ChIP-seq data to find representative k-mers.
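The reported running times give a rough idea of how well the eight cores were used. The short Python calculation below treats the ratio of CPU time to wall-clock time as the average number of busy cores. Reading this ratio as a parallel speedup is an assumption, since it ignores any overhead that parallelization itself adds to the CPU time.

```python
# Timings reported for the FLI1 run on eight cores.
wall_seconds = 2 * 60 + 18    # 2 min 18 s wall-clock time
cpu_seconds = 15 * 60 + 30    # 15 min 30 s CPU time
cores = 8

# Average number of cores kept busy during the run.
speedup = cpu_seconds / wall_seconds
# Fraction of the available cores that this corresponds to.
efficiency = speedup / cores

print(f"CPU/wall ratio: {speedup:.2f}, core utilization: {efficiency:.0%}")
```

The ratio is about 6.7, i.e. roughly 84% utilization of the eight cores.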
AVAILABILITY
MODER is implemented in C++ on the Linux platform and is available from https://github.com/jttoivon/MODER. Sequence data are available from the European Nucleotide Archive under accession code PRJEB14550.
ACKNOWLEDGEMENTS
The authors would like to thank Tuomo Hartonen for doing the peak calling from the ChIP-seq and ChIP-exo reads.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR online.
FUNDING
EU FP7 project SYSCOL [UE7-SYSCOL-258236]; Leverhulme Trust [VP1-2014-044 to E.U.]. Funding for open access charge: University of Helsinki.
Conflict of interest statement. None declared.
REFERENCES
- 1. Rodda D.J., Chew J.-L., Lim L.-H., Loh Y.-H., Wang B., Ng H.-H., Robson P. Transcriptional regulation of nanog by OCT4 and SOX2. J. Biol. Chem. 2005; 280:24731–24737.
- 2. Panne D., Maniatis T., Harrison S.C. An atomic model of the interferon-β enhanceosome. Cell. 2007; 129:1111–1123.
- 3. De Val S., Chi N.C., Meadows S.M., Minovitsky S., Anderson J.P., Harris I.S., Ehlers M.L., Agarwal P., Visel A., Xu S.-M. et al. Combinatorial regulation of endothelial gene expression by ETS and Forkhead transcription factors. Cell. 2008; 135:1053–1064.
- 4. Jolma A., Yin Y., Nitta K.R., Dave K., Popov A., Taipale M., Enge M., Kivioja T., Morgunova E., Taipale J. DNA-dependent formation of transcription factor pairs alters their binding specificity. Nature. 2015; 527:384–388.
- 5. Jolma A., Kivioja T., Toivonen J., Cheng L., Wei G., Enge M., Taipale M., Vaquerizas J.M., Yan J., Sillanpää M.J. et al. Multiplexed massively parallel SELEX for characterization of human transcription factor binding specificities. Genome Res. 2010; 20:861–873.
- 6. Jolma A., Yan J., Whitington T., Toivonen J., Nitta K.R., Rastas P., Morgunova E., Enge M., Taipale M., Wei G. et al. DNA-binding specificities of human transcription factors. Cell. 2013; 152:327–339.
- 7. Valouev A., Johnson D.S., Sundquist A., Medina C., Anton E., Batzoglou S., Myers R.M., Sidow A. Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nat. Methods. 2008; 5:829–834.
- 8. Isakova A., Berset Y., Hatzimanikatis V., Deplancke B. Quantification of cooperativity in heterodimer-DNA binding improves the accuracy of binding specificity models. J. Biol. Chem. 2016; 291:10293–10306.
- 9. Stormo G.D., Schneider T.D., Gold L. Quantitative analysis of the relationship between nucleotide sequence and functional activity. Nucleic Acids Res. 1986; 14:6661–6679.
- 10. Stormo G.D. DNA binding sites: representation and discovery. Bioinformatics. 2000; 16:16–23.
- 11. LaRonde-LeBlanc N.A., Wolberger C. Structure of HoxA9 and Pbx1 bound to DNA: Hox hexapeptide and DNA recognition anterior to posterior. Genes Dev. 2003; 17:2060–2072.
- 12. Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. 1977; 39:1–38.
- 13. Schmidt D., Schwalie P.C., Wilson M.D., Ballester B., Gonçalves A., Kutter C., Brown G.D., Marshall A., Flicek P., Odom D.T. Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell. 2012; 148:335–348.
- 14. Nakahashi H., Kieffer Kwon K.-R., Resch W., Vian L., Dose M., Stavreva D., Hakim O., Pruett N., Nelson S., Yamane A. et al. A genome-wide map of CTCF multivalency redefines the CTCF code. Cell Rep. 2013; 3:1678–1689.
- 15. Gupta S., Stamatoyannopoulos J.A., Bailey T.L., Noble W.S. Quantifying similarity between motifs. Genome Biol. 2007; 8:R24.
- 16. Quang D., Xie X. EXTREME: an online EM algorithm for motif discovery. Bioinformatics. 2014; 30:1667–1673.
- 17. Bi C., Rogan P.K. Bipartite pattern discovery by entropy minimization-based multiple local alignment. Nucleic Acids Res. 2004; 32:4979–4991.
- 18. Bi C., Leeder J.S., Vyhlidal C.A. A comparative study on computational two-block motif detection: algorithms and applications. Mol. Pharm. 2008; 5:3–16.
- 19. Lu R., Mucaki E.J., Rogan P.K. Discovery and validation of information theory-based transcription factor and cofactor binding site motifs. Nucleic Acids Res. 2017; 45:e27.
- 20. Helden J.v., Rios A., Collado-Vides J. Discovering regulatory elements in non-coding sequences by analysis of spaced dyads. Nucleic Acids Res. 2000; 28:1808–1818.
- 21. Li H., Rhodius V., Gross C., Siggia E.D. Identification of the binding sites of regulatory proteins in bacterial genomes. Proc. Natl. Acad. Sci. U.S.A. 2002; 99:11772–11777.
- 22. Liu X., Brutlag D.L., Liu J.S. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pacific Symp. Biocomput. 2001; 6:127–138.
- 23. Whitington T., Frith M.C., Johnson J., Bailey T.L. Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011; 39:e98.
- 24. Kazemian M., Pham H., Wolfe S.A., Brodsky M.H., Sinha S. Widespread evidence of cooperative DNA binding by transcription factors in Drosophila development. Nucleic Acids Res. 2013; 41:8237–8252.
- 25. Jankowski A., Prabhakar S., Tiuryn J. TACO: a general-purpose tool for predicting cell-type–specific transcription factor dimers. BMC Genomics. 2014; 15:1–12.
- 26. Lawrence C.E., Reilly A.A. An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins: Struct. Funct. Bioinformatics. 1990; 7:41–51.
- 27. Cardon L.R., Stormo G.D. Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol. 1992; 223:159–170.
- 28. Bailey T.L., Elkan C. The value of prior knowledge in discovering motifs with MEME. Proc. Third Internat. Conf. on Intelligent Systems for Molecular Biology. 1995; AAAI Press; 21–29.
- 29. Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W., Noble W.S. MEME suite: tools for motif discovery and searching. Nucleic Acids Res. 2009; 37(Suppl. 2):W202–W208.
- 30. Xu M., Weinberg C.R., Umbach D.M., Li L. coMOTIF: a mixture framework for identifying transcription factor and a coregulator motif in ChIP-seq Data. Bioinformatics. 2011; 27:2625–2632.
- 31. Li L. GADEM: a genetic algorithm guided formation of spaced dyads coupled with an EM algorithm for motif discovery. J. Comput. Biol. 2009; 16:317–329.
- 32. Mercier E., Droit A., Li L., Robertson G., Zhang X., Gottardo R. An integrated pipeline for the genome-wide analysis of transcription factor binding sites from ChIP-Seq. PLoS One. 2011; 6:e16432.
- 33. Zhang Z., Chang C.W., Hugo W., Cheung E., Sung W.-K. Simultaneously learning DNA motif along with its position and sequence rank preferences through expectation maximization algorithm. J. Comput. Biol. 2013; 20:237–248.
- 34. Reid J.E., Wernisch L. STEME: a robust, accurate motif finder for large data sets. PLoS One. 2014; 9:e90735.
- 35. Liu J.S., Neuwald A.F., Lawrence C.E. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 1995; 90:1156–1170.
- 36. Ikebata H., Yoshida R. Repulsive parallel MCMC algorithm for discovering diverse motifs from large sequence sets. Bioinformatics. 2015; 31:1561–1568.
- 37. Alipanahi B., Delong A., Weirauch M.T., Frey B.J. Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning. Nature Biotech. 2015; 33:831–838.
- 38. Colombo N., Vlassis N. FastMotif: spectral sequence motif discovery. Bioinformatics. 2015; 31:2623–2631.
- 39. Heinz S., Benner C., Spann N., Bertolino E., Lin Y.C., Laslo P., Cheng J.X., Murre C., Singh H., Glass C.K. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell. 2010; 38:576–589.
- 40. Kulakovskiy I.V., Boeva V., Favorov A.V., Makeev V.J. Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics. 2010; 26:2622–2623.
- 41. Ma W., Noble W.S., Bailey T.L. Motif-based analysis of large nucleotide data sets using MEME-ChIP. Nat. Protoc. 2014; 9:1428–1450.
- 42. Jayaram N., Usvyat D., Martin A.C. Evaluating tools for transcription factor binding site prediction. BMC Bioinformatics. 2016; 17:1298.
- 43. Yin Y., Morgunova E., Jolma A., Kaasinen E., Sahu B., Khund-Sayeed S., Das P.K., Kivioja T., Dave K., Zhong F. et al. Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science. 2017; 356:eaaj2239.
- 44. Huang Q., Whitington T., Gao P., Lindberg J.F., Yang Y., Sun J., Väisänen M.-R., Szulkin R., Annala M., Yan J. et al. A prostate cancer susceptibility allele at 6q22 increases RFX6 expression by modulating HOXB13 chromatin binding. Nat. Gen. 2014; 46:126–135.
- 45. Yan J., Enge M., Whitington T., Dave K., Liu J., Sur I., Schmierer B., Jolma A., Kivioja T., Taipale M. et al. Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell. 2013; 154:801–813.
- 46. Li H., Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009; 25:1754–1760.
- 47. Bardet A.F., Steinmann J., Bafna S., Knoblich J.A., Zeitlinger J., Stark A. Identification of transcription factor binding sites from ChIP-seq data at high resolution. Bioinformatics. 2013; 29:2705–2713.
- 48. ENCODE Project Consortium, Birney E., Stamatoyannopoulos J.A., Dutta A., Guigó R., Gingeras T.R., Margulies E.H., Weng Z., Snyder M., Dermitzakis E.T., Thurman R.E. et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007; 447:799–816.
- 49. Katainen R., Dave K., Pitkänen E., Palin K., Kivioja T., Välimäki N., Gylfe A.E., Ristolainen H., Hänninen U.A., Cajuso T. et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nature Genetics. 2015; 47:818–821.
- 50. Hartonen T., Sahu B., Dave K., Kivioja T., Taipale J. PeakXus: comprehensive transcription factor binding site discovery from ChIP-Nexus and ChIP-Exo experiments. Bioinformatics. 2016; 32:i629–i638.
- 51. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57.
- 52. Gronemeyer H., Gustafsson J.-A., Laudet V. Principles for modulation of the nuclear receptor superfamily. Nat. Rev. Drug. Discov. 2004; 3:950–964.
- 53. Mathelier A., Fornes O., Arenillas D.J., Chen C.-y., Denay G., Lee J., Shi W., Shyr C., Tan G., Worsley-Hunt R. et al. JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2016; 44:D110–D115.
- 54. Bailey T.L. DREME: motif discovery in transcription factor ChIP-seq data. Bioinformatics. 2011; 27:1653–1659.