Abstract
Combinatorial interactions among transcription factors (TFs) are critical for integrating diverse intrinsic and extrinsic signals, fine-tuning regulatory output and increasing the robustness and plasticity of regulatory systems. Current knowledge about combinatorial regulation is rather limited due to the lack of suitable experimental technologies and bioinformatics tools. The rapid accumulation of ChIP-Seq data has provided genome-wide occupancy maps for a large number of TFs and chromatin modification marks for identifying enhancers without knowing individual TF binding sites. Integration of the two data types has not been researched extensively, resulting in underused data and missed opportunities. We describe a novel method for discovering frequent combinatorial occupancy patterns by multiple TFs at enhancers. Our method is based on probabilistic item set mining and takes into account uncertainty in both types of ChIP-Seq data. By joint analysis of 108 TFs in four human cell types, we found that cell–type-specific interactions among TFs are abundant and that the majority of enhancers have flexible architecture. We show that several families of transposable elements disproportionally overlap with enhancers with combinatorial patterns, suggesting that these transposable element families play an important role in the evolution of combinatorial regulation.
INTRODUCTION
In higher eukaryotes, transcription factors (TFs) rarely operate by themselves, but rather directly or indirectly interact with specific partner TFs or chromatin regulators when binding to enhancers. It has been estimated that roughly 75% of all metazoan TFs heterodimerize with other factors (1). Classical examples of combinatorial TF regulation include the paradigmatic ‘even-skipped stripe 2’ enhancer for body patterning in fly (2) (involving 7 TFs and 34 sites in a 1.7-kb region), the endo16 gene enhancer for endoderm specification in sea urchin (3) (involving 19 TFs and 56 sites in a 2.2-kb region), the Myf5 gene enhancer for muscle development in mouse (4) and the Ifng gene enhancer for the production of interferon gamma in human and mouse (5) (involving 6 TFs and 4 sites in a 55-bp region). Our current knowledge about combinatorial regulation, including rules governing the molecular architecture, evolutionary, spatial and temporal dynamics of enhancers, has relied heavily on studies of these classical enhancers (6,7).
The rapid accumulation of ChIP-Seq data has provided genome-wide occupancy maps for a large number of TFs. By clustering these TF occupancy maps, several recent studies have uncovered hundreds of genomic loci that are co-occupied by multiple TFs in various species and cell types (8–12), suggesting the abundance of combinatorial regulation. However, given the number of TFs in mammalian genomes [∼2,000, (13)], our current knowledge represents only the tip of the iceberg. On the technical side, clustering-based approaches to finding combinatorial TF occupancy patterns have a few shortcomings. First, most analyses use binary presentation of binding peaks, which makes them vulnerable to noise in ChIP-Seq data. Second, because most combinatorial regulatory events occur at enhancers, focusing enhancers will enhance the signal-to-noise ratio. So far most clustering studies have not incorporated such a constraint.
The discovery of unique chromatin signatures associated with enhancers greatly facilitates enhancer mapping without knowledge about the locations of individual TFs (14–16). In addition, such an approach is well suited for finding cell- and developmental-specific enhancers and providing information about enhancer action in the native genomic context. Given the increasing abundance of TF and chromatin modification ChIP-Seq data, a better approach to finding combinatorial patterns would be analyzing multiple TF ChIP-Seq data sets using enhancers defined by chromatin signatures as the genomic location constraint. An advantage of such an approach is the reduction of spurious clusters of TF peaks at non-enhancer sites and presumably non-functional.
We propose a novel probabilistic algorithm to discover frequent combinatorial occupancy patterns (FCOPs) involving multiple TFs at enhancers, taking into account noise in both types of ChIP-Seq data. Our method differs from previous DNA-motif-centered approaches by directly using ChIP-Seq data and thus avoiding complications associated with DNA motif analysis (e.g. motif quality, the need for binding site cutoff). To our best knowledge, this is the first principled approach to integrating TF occupancy and chromatin modification ChIP-Seq data to study combinatorial TF interactions.
By applying our algorithm to a set of 108 TFs in four human cell types, we identify a number of combinatorial TF occupancy patterns that occur frequently in the genome. Additional analyses of identified FCOPs reveal that cell–type-specific TF interactions are abundant and that the majority of enhancers have flexible architecture. In addition, we show that several families of transposable elements (TEs) play an important role in the evolution of complex enhancers occupied by multiple TFs.
MATERIALS AND METHODS
Discover FCOP of multiple TFs by probabilistic frequent itemset mining of uncertain data
Our method borrows idea from frequent itemset mining (FIM) (17). In FIM, customers’ transaction data were collected. Each transaction contains a list of products that are called items. FIM discovers customer buying habits by finding associations between different items that customers place in their ‘shopping baskets’. An itemset is frequent when it occurs in a minimal number of transactions. Here we equate enhancers to transactions and the set of TF binding sites in an enhancer to the itemset. By this analogy, the problem of identifying FCOPs becomes the problem of identifying frequent itemsets. To deal with noise in ChIP-Seq data, we use occurrence probabilities of Transcription Factor Binding Site (TFBSs) and enhancers in our framework. Traditional FIM does not take into account uncertainty associated with transactions and items. Doing so gives rise to an uncertain transaction database, in which the support of an itemset is uncertain and is defined by a discrete probability distribution function. We introduce a novel framework to mine such a probabilistic transaction database.
Problem definition
Given are a set of ‘N’ enhancers predicted by CSI-ANN (18), , each of which has a probability of being a true enhancer, , and genome-wide binding peaks of a set of ‘M’ TFs identified by ChIP-Seq, , respectively. For a TF binding peak tj in enhancer ei, we use to denote the binding probability of the TF to the enhancer. For a given set of TFs, , whose peak centers fall within the enhancer, ei, we consider the itemset as being supported by the enhancer ei. Given a minimum support threshold, minSup, the goal of the algorithm is to exhaustively search for non-redundant frequent combinatorial TF occupancy patterns , whose frequentness probabilities satisfy a given threshold ‘α’.
Probability calculation for enhancers and TF binding peaks
In traditional FIM, it is assumed that both transactions and items of a transaction occur with certainty. However, in our case, the information captured in transactions and items has some degree of uncertainty because the existence of an enhancer and/or a TF binding peak is inferred from ChIP-Seq data. To deal with such uncertainties, we assign a probability to each enhancer predicted by CSI-ANN and each TF binding peak. Because the CSI-ANN algorithm outputs probabilities of predicted enhancers, it is straightforward to use these probabilities directly. For TF ChIP-Seq data, given its better performance compared with other peak callers, we first use MACS with its default parameter setting to call binding peaks (19). To compute the probability of a TF binding peak, we use min-max normalization to transform the MACS fold enrichment score into a range of [0.5, 1]. We choose this range because peaks called using default MACS parameter are relatively strong peaks. We also tested the performance of our algorithm with the normalization ranges of [0.3, 1] and [0.7, 1]. Our result shows that the performance is not sensitive to the choice of normalization range (Supplementary Figure S3). To avoid peaks with extremely high fold enrichment that could skew the normalized probability scores, we truncate the fold enrichment score at 95 percentile and set the probability score of such peaks to 1.
Algorithm to compute the frequentness probability of a combinatorial pattern
Denote X as a combinatorial pattern observed in a set of enhancers. is the probability of the pattern having support k, k ∈ {0, … , N} and is the probability of the pattern having a support at least k. X is considered as a frequent pattern when . We consider two types of uncertainties: those associated with enhancer predictions and those associated with TF binding peak calls. If we only consider the uncertainties of TF binding peaks in an enhancer (i.e. enhancer is certain), the probability of a combinatorial pattern X supported by an enhancer ei can be calculated as follows:
. Considering both types of uncertainties, the overall probability of pattern X having exactly k support can be calculated as
(1) |
where is the probability of being a true enhancer.
Recall that we are interested in the probability that an item set occurs in at least minSup transactions. In other words, we need the probability that the support of X is at least k. Denote this probability as , which can be computed by the following equation.
(2) |
With a large number of enhancers and TF binding peaks, computation of by brute-force enumeration is inefficient. The time complexity is exponential with respect to the number of transactions. Here we adopt a dynamic programming-based algorithm first proposed by Bernecker et al. (20) to compute the frequentness probability recursively. Briefly, define as the probability that k of l enhancers contains pattern X and as the probability that at least k of l enhancers contains the pattern. Then,
(3) |
where
(4) |
Starting from Equation (4), we can recursively calculate using Equation (3) till k = minSup, l = N. By definition, the initial values of the dynamic programming matrix are and . Using this dynamic programming scheme, the computation of the frequentness probability requires at most O(N*minSup) = O(N) time and at most O(N) space.
Automatic determination of pattern-specific minimal support threshold
Because different TFs have different numbers of binding sites in the genome, it is expected that patterns involving TFs with more binding sites have more support than patterns involving TFs with fewer binding sites. Therefore, a universal minimal support is not appropriate for computing frequentness probability. Instead, we determine pattern-specific minimal support thresholds that have the same frequentness probability in randomized input data. Such a probability is analogous to a P-value. In this sense, all pattern-specific minimal support thresholds are ‘normalized’ to have a P ≤ 0.01. We generate a set of permutated background transactions based on the binding frequencies of the specific TFs involved in a pattern. We do so by randomly redistributing the peaks of a TF across the set of enhancers. In this way, the number of binding peaks for each TF is unchanged in the permutated data, but the correlation among TFs is destroyed. For a given pattern and a range of minimal support thresholds, we then compute a set of frequentness probabilities in the permutated data as the P-values. We pick the final pattern-specific minimal support threshold as the one that gives a P-value of 0.01. As an example, minimum support thresholds for pairwise patterns calculated this way are shown in Supplementary Figure S4. We also calculated minimum support thresholds for higher-order patterns.
See Supplementary Methods for the rest of method descriptions.
RESULTS
A large fraction of enhancers are occupied by multiple TFs
Using CSI-ANN (18) and cell–type-specific histone modification ChIP-Seq data, we predicted 16 128, 23 731, 23 586 and 3 1327 enhancers in GM12878, K562, HepG2 and H1 cells, respectively. This set of predictions has high quality, as roughly 70% of them overlap with at least one of three other genomic marks for enhancers: distal DHS, sequence conservation and distal p300 site (Supplementary Figure S1).
To assess the extent of combinatorial TF binding at enhancers, we overlapped TF binding peaks with the set of predicted enhancers. We obtained ChIP-Seq data from ENCODE for 62, 65, 44 and 39 TFs for GM12878, K562, HepG2 and H1 cells, respectively (Supplementary Figure S2 and Supplementary Table S1). Figure 1A shows the cumulative distributions of enhancers that contain various number of distinct TF binding peak(s) within the 1 kb enhancer window. As can be seen, 61, 69, 72 and 39% of enhancers are occupied by at least two TFs for GM12878, K562, HepG2 and H1 cells, respectively, suggesting that combinatorial binding is prevalent.
Method overview
We introduce a probabilistic algorithm for identifying frequent combinatorial occupancy patterns by multiple TFs at enhancers, termed FCOP in this study (Figure 1B). Our method is motivated by the concept of FIM (21). However, unlike traditional FIM, our method considers the probabilities of both transactions and items as a way to deal with uncertainties in ChIP-Seq data. Our method has the following three components: (i) calculation of probabilities associated with enhancers (transactions) and TF binding peaks (items); (ii) automatic determination of pattern-specific minimum support threshold; and (iii) calculation of frequentness probability of candidate FCOPs using an uncertain transaction database and dynamic programming. The input data consists of genome-wide location information for enhancers and multiple TFs and probability values that are associated with each type of sequence. The only adjustable parameter of our method is the frequentness probability threshold for frequent combinatorial patterns, α. A higher α value will give rise to a set of predictions with higher specificity but lower sensitivity. To pick a default value for the parameter, we ran our method with a range of α values. We found that α value of 0.5 gives a balanced sensitivity and fraction of supported predictions (Supplementary Figure S5). This default value was used for the following analyses. Software implementing our method is freely available to academic use on request.
Performance comparison with alternative algorithms
A couple of groups have used traditional FIM to identify frequent TF combinatorial patterns (22,23). Both approaches used TF binding sites defined by DNA motif scan instead of ChIP-Seq peaks as items. For transactions, Morgan et al. used 100-bp windows across the genome and Sun et al. used a contiguous DNA sequence that contain binding sites of all TFs under consideration for a given pattern. No confidence measure for either TF peaks or enhancers was used in either previous studies. We compared the performance of traditional FIM and our method using Receiver Operating Characteristic (ROC) Curve. To evaluate the performance of the methods, we manually curated a set of gold standard TF interactions identified using experimental protocols. To make the comparison meaningful, we used the same set of transactions and items as the input, i.e. enhancers and TF binding peaks described earlier in the text. As shown in Figure 2A, our method has significantly higher area under the curve values than the traditional FIM [all P < 2.2 × 10−16, statistical test using the method of DeLong et al. (24)]. We note that the area under the curve values in these plots are the lower bounds of the actual values because of the limited coverage of the gold standard set of interactions.
Predicted FCOPs in four cell types
Using the default frequentness probability threshold of 0.5, we discovered 320, 300, 194 and 62 FCOPs in GM12878, K562, HepG2 and H1 cells, respectively. The fraction of TFs involved in FCOPs ranges from 35.5% for GM12878 cell to 70.4% for HepG2 cell. Likewise, the fraction of enhancers that support FCOPs ranges from 30.1% for K562 cell to 67.7% for HepG2 cell (Table 1). On average, each FCOP is supported by 1025, 1086, 1062 and 991 enhancers in GM12878, K562, HepG2, and H1 cells, respectively (Supplementary Table S4). These numbers are much higher than the minimum numbers of support used during the FCOP search, suggesting that the discovered FCOPs are not due to random chance (Supplementary Figure S6).
Table 1.
Statistics | GM12878 | H1 | HepG2 | K562 |
---|---|---|---|---|
Number of predicted patterns | 320 | 62 | 194 | 300 |
Number of supported pairwise interactions (%) | 109 (5.8) | 80 (10.8) | 68 (7.2) | 166 (8.0) |
Number of supported patterns (P-value) | 183 (0) | 21 (5.9E-9) | 35 (4.0E-9) | 100 (8.2E-59) |
Number of TF pairs, triplets and quadruplets | 38, 210, 72 | 61, 1, 0 | 111, 79, 4 | 137, 154, 9 |
Number of TFs in patterns (%) | 22 (35.5) | 21 (53.9) | 31 (70.4) | 41 (63.1) |
Number of enhancers in patterns (%) | 7076 (56) | 13391 (64.6) | 13838 (67.7) | 9073 (30.1) |
A TF set is considered to be supported by known TF interactions if any of the pairwise interactions between two TFs in the TF set was supported by known TF interactions. P-values are calculated using binomial distribution.
We further corroborated our predictions using evolutionary conservation and known experimentally derived TF interactions. Interestingly, for all four cell types, a large portion of the frequent TF patterns are supported by enhancers that are significantly more conserved than the full set of enhancers in the genome (P < 0.01, hypergeometric test, Supplementary Figure S7). For instance, 20.4% of enhancers that support FCOPs in GM12878 cell are more conserved than the full set of enhancers in this cell type. These fractions are 34.8%, 35.3% and 50% for K562, HepG2 and H1 cells, respectively. This higher level of conservation is likely due to the requirement of conserving multiple TF binding to the enhancers. Besides a higher conservation level, all four sets of discovered patterns are significantly enriched for known TF protein–protein interactions (P < 0.05, one-sided binomial test) (Table 1), providing additional supporting evidence for the discovered patterns.
Cell–type-specific TF interactions are common
Cell–type-specific interactions among TFs play a critical role in differential gene expression (25). But systematic investigation of this important issue is limited. Taking advantage of our set of predicted FCOPs, we asked to what extent they are cell–type-specific. For simplicity, we focused on pairwise interactions within FCOPs. We first expanded predicted combinatorial patterns into all possible TF pairs. In total we obtained 668 frequent TF pairs in the four cell types. To avoid complication due to missing data, we focused on the set of 54 TFs that are shared by at least two cell types in the ChIP-Seq data. There are 14, 20 and 20 TFs that are shared by 4, 3 and 2 cell types, respectively (Supplementary Figure S2). Among these shared TFs, we predicted 282 interactions that occur in the FCOPs. Only 44 pairs (15.6%) are shared by two cell types and no pair is shared by >2 cell types. For the shared pairs, they are supported by strong evidence of co-occurrence frequencies in the relevant cell types. For instance, the co-occupancy by JunD and REST is observed at 11.9 and 3.8% of enhancers that support the respective FCOPs in K562 and HepG2 cells. In stark contrast, co-occupancy by the same two factors is not frequently observed at enhancers in GM12878 and H1 cells (Figure 3). By the same token, for the unique pairs, they are supported by the lack of co-occupancy in cell types in which no interactions are predicted (Supplementary Figure S8).
Even for TF pairs that occur in more than one cell types, the sets of target genes co-regulated by the TF pairs in different cell types may differ for multiple reasons, such as relative arrangement of the two TFs, additional regulators that may be involved and different chromatin context of the involved enhancers. Thus, we asked to what extent genes regulated by the same TF pairs are involved in the same biological process. We performed Gene Ontology enrichment analysis on the target genes of the enhancers supporting 44 TF pairs shared between 2 cell types. We found 21 TF pairs (47%) whose target gene sets do not have any shared GO biological process terms (Supplementary Table S5). In other words, although 47% of the TF pairs are shared by two cell types, genes controlled by them have different functions in the two cell types.
In summary, our result suggests that pairwise TF interactions are fairly dynamic across cell types. Even for TF interactions that occur in multiple cell types, the sets of co-regulated genes could differ substantially in different cell types, leading to diverse regulatory outputs.
The majority of combinatorial patterns have flexible architecture
The spacing and arrangement of TF binding sites in an enhancer affects its regulatory activity, which is known as the cis-regulatory grammar of enhancers. Two models of enhancer architecture have been proposed based on a number of studies. At one extreme, the enhanceosome model postulates that a strict pattern of TF binding site arrangement is required for proper enhancer function (26,27). A well-known example of this model is the interferon beta gene (IFN-β) in which there are fixed binding order and site spacing among the six TFs, ATF-2, c-Jun, IRF-3, IRF-7, p50 and RelA (5). On the other hand, the billboard model states that (28) binding sites in enhancers are flexibly disposed without a strictly defined overall architecture. Many developmental enhancers have such billboard features (29). A well-known example of billboard enhancer is the even-skipped stripe 2 enhancer for body patterning in fly (2).
To better understand the relative abundance of enhancers belonging to each model and the general architectural features of enhancers, we conducted a couple of systematic investigations, taking advantage of our set of predicted FCOPs.
We first asked what fraction of our combinatorial patterns involves TFs with preferred binding site spacing. To this end, we first performed de novo motif finding for each TF using its ChIP-Seq data (see Supplementary Methods). Next, using the set of motifs and the SPAMO tool (30), we identified TF pairs in combinatorial patterns that have a preferred binding site distance that is statistically overrepresented in the set of enhancers supporting the combinatorial patterns. Of the 179 frequent TF pairs in GM12878 cell, 31 of them (17.3%) have preferred binding site distance (SPAMO P < 0.05). These numbers are 101 of 279 (36.2%), 80 of 190 (42.1%) and 26 of 64 (40.6%) for K562, HepG2 and H1 cells, respectively (Figure 4A, Supplementary Table S6). Most TF pairs have only one preferred distance and 95% of the binding site distances are shorter than 126 bp (Supplementary Figure S9). The short distance between TF pairs suggests physical interactions among them. For all TF pairs with preferred binding distances, 23.3% are supported by known TF physical interactions, compared with 16.4% for TF pairs without preferred binding site distance. Taken together, our result suggests that the majority of TFs in combinatorial patterns do not have preferred binding site spacing requirement.
Besides preferred binding site distance, we also investigated preferred binding order among TFs in combinatorial patterns. For each pattern, of all possible orderings of the involved TFs, we identified preferred orderings as those that are overrepresented in the set of supporting enhancers (see Supplementary Methods for details). At a P-value cutoff of 0.01, we found that 33.4, 28.7, 21.7 and 8.1% of combinatorial patterns exhibit a preferred binding order for GM12878, K562, HepG2 and H1 cells, respectively (Figure 4B and Supplementary Table S4).
Enhancers belonging to the enhanceosome model have both strict TF binding order and binding site spacing. We next examined what fraction of combinatorial patterns has such stricter architecture. To this end, we considered a pattern belonging to the enhanceosome model if it has a preferred TF binding order and at least one TF pair in the pattern has a preferred binding site distance. In total, we found that 18.4, 17, 15.5 and 3% of the patterns fall into the enhanceosome model in GM12878, K562, HepG2 and H1 cells, respectively (Supplementary Table S4).
In conclusion, our systematic analysis suggests that enhancers with strict architecture only accounts for a small fraction. Both the binding site distance and order of arrangement among the bound TFs are flexible for the majority of complex enhancers that involve multiple bound TFs.
Many combinatorial patterns have significant overlap with TEs
TEs in the human genome are significantly associated with TF binding sites. In several cases their expansion in the genome led to a substantial rewiring of the regulatory network (31–34). However, almost all studies so far focused on binding sites of individual TFs and little is known about the extent of overlap between TEs and combinatorial TF binding sites. To address this issue, we searched for TEs that significantly overlap with FCOPs. We examined 29 families of TEs belonging to four major classes, DNA transposon, long-terminal repeat (LTR) retrotransposon, long interspersed element and short interspersed element. When compared with all enhancers in the genome, we found a remarkable enrichment of a number of TE families among enhancers that support FCOPs (P < 0.01, hypergeometric test, Supplementary Table S4). As an example, we found that the combinatorial pattern [NR2F2, STAT5A, TAL1] significantly overlaps with the ERVL family of LTR retrotransposon (Figure 5A and B). Figure 5C summarizes the fractions of combinatorial patterns that significantly overlap with the four major classes of TEs. For both K562 and H1 cells, LTR retrotransposons overlap with the largest fractions of FCOPs. For HepG2, all four classes of TEs contribute roughly equal fraction of overlap with FCOPs. Besides the most frequently overlapping TEs, it is also worth knowing which TF combinations most frequently overlap with TEs. The top three FCOPs that have most significant overlap with TEs are [JUND, TAL1, TEAD4], [CEBPB, TAL1, TEAD4], [STAT5, TAL1, TEAD4] for K562 cell, [CEBPB, FOXA1, HNF4A], [FOXA1, FOXA2, HNF4A], [HDAC2, HNF4A, HNF4G] for HepG2 cell and [BCL11A, SP1], [NANOG, SP1], [NANOG, TCF12] for H1 cell.
When broken down into individual TE families, for each cell type, the top three TE families that overlap with most FCOPs are ERVL, ERV1, L2 for K562 cell, Alu, ERVL, L1 for HepG2 cell and ERV1, ERVL-MaLR, ERVL for H1 cell (Figure 5C inset). This result suggests that ERV families overlap with a disproportionate fraction of combinatorial binding patterns. ERV (endogenous retrovirus) is a major family of LTR retrotransposons. They are thought to have played an important role in the evolution of mammalian genomes. Among the four cell types, H1 cell stands out in that all TEs that overlap with combinatorial patterns belong to ERV families, suggesting the prominent role of ERVs in shaping the transcriptional regulation network of embryonic stem cells.
DISCCUSSION
Our method integrates both histone modification and TF ChIP-Seq data in a probabilistic framework. The histone modification data are used to generate constraints in the form of enhancers. Furthermore, the histone modification signature used here has been shown to be associated with active enhancers (16,35). Thus, by focusing on combinatorial TF peaks in active enhancer regions, we reduce the chance of finding spurious clusters of TF peaks in the genome. To evaluate the added benefit of the histone modification data, we ran our algorithm by replacing the enhancers with moving windows of the same size across the genome. We found that the performance decreased significantly (Supplementary Figure S10).
An alternative to our FIM-based approach is clustering analysis. It has been used to conduct joint analysis of ChIP-Seq peaks of multiple TFs (8,9,11). The advantage of clustering analysis is its relative simplicity. However, they do not directly address the noise in the underlying ChIP-Seq data. Second, for most clustering algorithms, determining the number of clusters is a difficult problem. Finally, the statistical significance of resulting clusters is typically not assessed in clustering analysis. In comparison, our method addresses all three issues. It treats ChIP-Seq data probabilistically. It is deterministic in terms of number of discovered patterns. Finally, it determines the statistical significance of candidate frequent patterns.
By analyzing a compendium of 108 TF ChIP-Seq data sets in four cell types, we found that the majority of enhancers have flexible architecture in terms of the arrangement and spacing of constituent TF binding sites. Among the four cell types, we found that embryonic stem cell (H1), rather than the terminally differentiated cells, has the largest fraction of enhancers with flexible architecture. This is reminiscent of the observation in fly that many developmental enhancers tend to have a more flexible architecture than enhanceosome (29). In this sense, the architectural difference may reflect different functional requirements of the two classes of enhancers. Enhanceosomes may represent a type of regulatory switches that are mainly responsible for generating gene expression patterns that are relatively simple, whereas enhancers with flexible architecture may be responsible for generating complex patterns of expression such as those during development.
We demonstrate that LTR/ERV retrotransposons overlap with a disproportionate fraction of combinatorial TF binding regions, especially in H1 embryonic stem cells. This is interesting given that ERV elements have been observed to contribute to the rewiring of transcriptional regulatory networks in ESCs and placenta (36–38). The unusual high percentage of ERV-derived combinatorial patterns in H1 cells is likely a consequence of the permissive chromatin state found in ESCs (39,40). It has been suggested that the manipulations that were initially exerted by the ancestral viruses on their hosts to bypass these antiviral control mechanisms have also facilitated their co-option into enhancer elements (41). Along this line, it has been shown that stem cell potency fluctuates with endogenous retrovirus activity in mouse (42).
In summary, we introduce a powerful computational method for uncovering combinatorial interactions among TFs. As the amount of genome-wide localization data continues to accumulate for various regulatory proteins, our method will prove increasingly useful for dissecting combinatorial gene regulation by the action of TFs as well as other types of regulatory proteins.
SUPPLEMENTARY DATA
Supplementary Data are available at NAR Online, including [43–50].
FUNDING
Funding for open access charge: National Institutes of Health [HG006130 to K.T.].
Conflict of interest statement. None declared.
Supplementary Material
ACKNOWLEDGEMENT
The authors thank members of the Tan laboratory for helpful discussion. They also thank David Eichmann, Lucas Van Tol and the University of Iowa Institute for Clinical and Translational Science for providing computing support.
REFERENCES
- 1.Walhout AJ. Unraveling transcription regulatory networks by protein–DNA and protein–protein interaction mapping. Genome Res. 2006;16:1445–1454. doi: 10.1101/gr.5321506. [DOI] [PubMed] [Google Scholar]
- 2.Arnosti DN, Barolo S, Levine M, Small S. The eve stripe 2 enhancer employs multiple modes of transcriptional synergy. Development. 1996;122:205–214. doi: 10.1242/dev.122.1.205. [DOI] [PubMed] [Google Scholar]
- 3.Yuh CH, Bolouri H, Davidson EH. Cis-regulatory logic in the endo16 gene: switching from a specification to a differentiation mode of control. Development. 2001;128:617–629. doi: 10.1242/dev.128.5.617. [DOI] [PubMed] [Google Scholar]
- 4.Hadchouel J, Carvajal JJ, Daubas P, Bajard L, Chang T, Rocancourt D, Cox D, Summerbell D, Tajbakhsh S, Rigby PW, et al. Analysis of a key regulatory region upstream of the Myf5 gene reveals multiple phases of myogenesis, orchestrated at each site by a combination of elements dispersed throughout the locus. Development. 2003;130:3415–3426. doi: 10.1242/dev.00552. [DOI] [PubMed] [Google Scholar]
- 5.Panne D, Maniatis T, Harrison SC. An atomic model of the interferon-beta enhanceosome. Cell. 2007;129:1111–1123. doi: 10.1016/j.cell.2007.05.019. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6.Maston GA, Landt SG, Snyder M, Green MR. Characterization of enhancer function from genome-wide analyses. Annu. Rev. Genomics. Hum. Genet. 2012;13:29–57. doi: 10.1146/annurev-genom-090711-163723. [DOI] [PubMed] [Google Scholar]
- 7.Peter IS, Davidson EH. Evolution of gene regulatory networks controlling body plan development. Cell. 2011;144:970–985. doi: 10.1016/j.cell.2011.02.017. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 8.Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, et al. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117. doi: 10.1016/j.cell.2008.04.043. [DOI] [PubMed] [Google Scholar]
- 9.Wilson NK, Foster SD, Wang X, Knezevic K, Schutte J, Kaimakis P, Chilarska PM, Kinston S, Ouwehand WH, Dzierzak E, et al. Combinatorial transcriptional control in blood stem/progenitor cells: genome-wide analysis of ten major transcriptional regulators. Cell Stem Cell. 2010;7:532–544. doi: 10.1016/j.stem.2010.07.016. [DOI] [PubMed] [Google Scholar]
- 10.Soler E, Andrieu-Soler C, de Boer E, Bryne JC, Thongjuea S, Stadhouders R, Palstra RJ, Stevens M, Kockx C, van Ijcken W, et al. The genome-wide dynamics of the binding of Ldb1 complexes during erythroid differentiation. Genes Dev. 2010;24:277–289. doi: 10.1101/gad.551810. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 11.Negre N, Brown CD, Ma L, Bristow CA, Miller SW, Wagner U, Kheradpour P, Eaton ML, Loriaux P, Sealfon R, et al. A cis-regulatory map of the Drosophila genome. Nature. 2011;471:527–531. doi: 10.1038/nature09990. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 12.Gerstein MB, Lu ZJ, Van Nostrand EL, Cheng C, Arshinoff BI, Liu T, Yip KY, Robilotto R, Rechtsteiner A, Ikegami K, et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science. 2010;330:1775–1787. doi: 10.1126/science.1196914. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 13.Ravasi T, Suzuki H, Cannistraci CV, Katayama S, Bajic VB, Tan K, Akalin A, Schmeier S, Kanamori-Katayama M, Bertin N, et al. An atlas of combinatorial transcriptional regulation in mouse and man. Cell. 2010;140:744–752. doi: 10.1016/j.cell.2010.01.044. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet. 2007;39:311–318. doi: 10.1038/ng1966. [DOI] [PubMed] [Google Scholar]
- 15.Kim TK, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S, et al. Widespread transcription at neuronal activity-regulated enhancers. Nature. 2010;465:182–187. doi: 10.1038/nature09033. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Rada-Iglesias A, Bajpai R, Swigut T, Brugmann SA, Flynn RA, Wysocka J. A unique chromatin signature uncovers early developmental enhancers in humans. Nature. 2010;470:279–283. doi: 10.1038/nature09692. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Agrawal R, Imielinski T, Swami A. In Proc. of ACM SIGMOD Conference on Management of Data. Washington, DC: 1993. Mining association rules between sets of items in large datasets; pp. 207–216. [Google Scholar]
- 18.Firpi HA, Ucar D, Tan K. Discover regulatory DNA elements using chromatin signatures and artificial neural network. Bioinformatics. 2010;26:1579–1586. doi: 10.1093/bioinformatics/btq248. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W, et al. Model-based analysis of ChIP-Seq (MACS) Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Bernecker T, Kriegel H, Renz M, Verhein F. In Proc. of ACM SIGKDD Conference on Knowledge Discovery and Data Mining. Paris, France: 2009. Probabilistic frequent itemset mining in uncertain Databases; pp. 119–128. [Google Scholar]
- 21.Tan P, Steinbach M, Kumar V. Introduction to Data Mining. 1st edn. Boston, MA: Addison Wesley; 2005. [Google Scholar]
- 22.Morgan XC, Ni S, Miranker DP, Iyer VR. Predicting combinatorial binding of transcription factors to regulatory elements in the human genome by association rule mining. BMC bioinformatics. 2007;8:445. doi: 10.1186/1471-2105-8-445. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 23.Sun H, Guns T, Fierro AC, Thorrez L, Nijssen S, Marchal K. Unveiling combinatorial regulation through the combination of ChIP information and in silico cis-regulatory module detection. Nucleic Acids Res. 2012;40:e90. doi: 10.1093/nar/gks237. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845. [PubMed] [Google Scholar]
- 25.Maston GA, Landt SG, Snyder M, Green MR. Characterization of enhancer function from genome-wide analyses. Ann. Rev. Genomics Hum. Genet. 2012;13:29–57. doi: 10.1146/annurev-genom-090711-163723. [DOI] [PubMed] [Google Scholar]
- 26.Thanos D, Maniatis T. Virus induction of human IFN beta gene expression requires the assembly of an enhanceosome. Cell. 1995;83:1091–1100. doi: 10.1016/0092-8674(95)90136-1. [DOI] [PubMed] [Google Scholar]
- 27.Merika M, Thanos D. Enhanceosomes. Curr. Opin Genet. Dev. 2001;11:205–208. doi: 10.1016/s0959-437x(00)00180-5. [DOI] [PubMed] [Google Scholar]
- 28.Kulkarni MM, Arnosti DN. Information display by transcriptional enhancers. Development. 2003;130:6569–6575. doi: 10.1242/dev.00890. [DOI] [PubMed] [Google Scholar]
- 29.Levine M. Transcriptional enhancers in animal development and evolution. Curr. Biol. 2010;20:R754–R763. doi: 10.1016/j.cub.2010.06.070. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Whitington T, Frith MC, Johnson J, Bailey TL. Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011;39:e98. doi: 10.1093/nar/gkr341. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Bejerano G, Lowe CB, Ahituv N, King B, Siepel A, Salama SR, Rubin EM, Kent WJ, Haussler D. A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature. 2006;441:87–90. doi: 10.1038/nature04696. [DOI] [PubMed] [Google Scholar]
- 32.Santangelo AM, de Souza FS, Franchini LF, Bumaschny VF, Low MJ, Rubinstein M. Ancient exaptation of a CORE-SINE retroposon into a highly conserved mammalian neuronal enhancer of the proopiomelanocortin gene. PLoS Genet. 2007;3:1813–1826. doi: 10.1371/journal.pgen.0030166. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33.Franchini LF, Lopez-Leal R, Nasif S, Beati P, Gelman DM, Low MJ, de Souza FJ, Rubinstein M. Convergent evolution of two mammalian neuronal enhancers by sequential exaptation of unrelated retroposons. Proc. Natl Acad. Sci. USA. 2011;108:15270–15275. doi: 10.1073/pnas.1104997108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Pi W, Zhu X, Wu M, Wang Y, Fulzele S, Eroglu A, Ling J, Tuan D. Long-range function of an intergenic retrotransposon. Proc. Natl Acad. Sci. USA. 2010;107:12992–12997. doi: 10.1073/pnas.1004139107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA, et al. Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proc. Natl Acad. Sci. USA. 2010;107:21931–21936. doi: 10.1073/pnas.1016071107. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36.Kunarso G, Chia NY, Jeyakani J, Hwang C, Lu X, Chan YS, Ng HH, Bourque G. Transposable elements have rewired the core regulatory network of human embryonic stem cells. Nat. Genet. 2010;42:631–634. doi: 10.1038/ng.600. [DOI] [PubMed] [Google Scholar]
- 37.Lynch VJ, Leclerc RD, May G, Wagner GP. Transposon-mediated rewiring of gene regulatory networks contributed to the evolution of pregnancy in mammals. Nat. Genet. 2011;43:1154–1159. doi: 10.1038/ng.917. [DOI] [PubMed] [Google Scholar]
- 38.Chuong EB, Rumi MA, Soares MJ, Baker JC. Endogenous retroviruses function as species-specific enhancer elements in the placenta. Nat. Genet. 2013;45:325–329. doi: 10.1038/ng.2553. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 39.Rougier N, Bourc'his D, Gomes DM, Niveleau A, Plachot M, Paldi A, Viegas-Pequignot E. Chromosome methylation patterns during mammalian preimplantation development. Genes Dev. 1998;12:2108–2113. doi: 10.1101/gad.12.14.2108. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Meshorer E, Misteli T. Chromatin in pluripotent embryonic stem cells and differentiation. Nat. Rev. Mol. Cell Biol. 2006;7:540–546. doi: 10.1038/nrm1938. [DOI] [PubMed] [Google Scholar]
- 41.Feschotte C, Gilbert C. Endogenous viruses: insights into viral evolution and impact on host biology. Nat. Rev. Genet. 2012;13:283–296. doi: 10.1038/nrg3199. [DOI] [PubMed] [Google Scholar]
- 42.Macfarlan TS, Gifford WD, Driscoll S, Lettieri K, Rowe HM, Bonanomi D, Firth A, Singer O, Trono D, Pfaff SL. Embryonic stem cell potency fluctuates with endogenous retrovirus activity. Nature. 2012;487:57–63. doi: 10.1038/nature11244. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 43.Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007;447:799–816. doi: 10.1038/nature05874. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44.Ernst J, Kheradpour P, Mikkelsen TS, Shoresh N, Ward LD, Epstein CB, Zhang X, Wang L, Issner R, Coyne M, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature. 2011;473:43–49. doi: 10.1038/nature09906. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45.Lee I, Blom UM, Wang PI, Shim JE, Marcotte EM. Prioritizing candidate disease genes by network-based boosting of genome-wide association data. Genome Res. 2011;21:1109–1121. doi: 10.1101/gr.118992.110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 46.Razick S, Magklaras G, Donaldson IM. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinformatics. 2008;9:405. doi: 10.1186/1471-2105-9-405. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47.Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, et al. STRING v9.1: protein–protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013;41:D808–D815. doi: 10.1093/nar/gks1094. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48.Portales-Casamar E, Thongjuea S, Kwon AT, Arenillas D, Zhao X, Valen E, Yusuf D, Lenhard B, Wasserman WW, Sandelin A. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 2010;38:D105–D110. doi: 10.1093/nar/gkp950. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 49.Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996;24:238–241. doi: 10.1093/nar/24.1.238. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50.Newburger DE, Bulyk ML. UniPROBE: an online database of protein binding microarray data on protein-DNA interactions. Nucleic Acids Res. 2009;37:D77–D82. doi: 10.1093/nar/gkn660. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.