Significance
How locus-specific histone modifications are achieved is not fully understood. One contributing mechanism is that DNA-binding molecules recognize specific sequences, and their binding recruits or stabilizes histone modification enzyme complexes. Comprehensive identification of such sequence patterns is the first step toward revealing a possible regulatory grammar for establishing histone modifications. In this study, we have cataloged the DNA motifs tightly associated with six important histone modifications in human and eight in mouse. We show that mutating the found motifs at particular loci led to a significant reduction of histone modification levels. These histone-associated motifs, especially H3K4me3 motifs, overlap with expression quantitative trait loci (eQTL) SNPs in cancer patients significantly more than known motifs do, further suggesting their regulatory roles. We also found possible feedback loops mediated by these motifs, implicating them in histone modification dynamics and epigenetic priming.
Keywords: epigenomics, cis-regulatory elements, locus specificity, chromatin dynamics, CRISPR
Abstract
Histones are modified by enzymes that act in a locus, cell-type, and developmental stage-specific manner. The recruitment of enzymes to chromatin is regulated at multiple levels, including interaction with sequence-specific DNA-binding factors. However, the DNA-binding specificity of the regulatory factors that orchestrate specific histone modifications has not been broadly mapped. We have analyzed 6 histone marks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K9me3, H3K36me3) across 121 human cell types and tissues from the NIH Roadmap Epigenomics Project as well as 8 histone marks (with the addition of H3K4me2 and H3K9ac) from the mouse ENCODE Consortium. We have identified 361 and 369 DNA motifs in human and mouse, respectively, that are the most predictive of each histone mark. Interestingly, 107 human motifs are conserved between the two species. In the human embryonic stem cell line H1, we mutated only the found DNA motifs at particular loci, and the significant reduction of H3K27ac levels validated the regulatory roles of the perturbed motifs. The functionality of these motifs was also supported by the evidence that histone-associated motifs, especially H3K4me3 motifs, overlap with expression quantitative trait loci (eQTL) SNPs in cancer patients significantly more than known and random motifs do. Furthermore, we observed possible feedback loops that control chromatin dynamics, as the found motifs appear in the promoters or enhancers associated with various histone modification enzymes. These results pave the way toward revealing the molecular mechanisms of epigenetic events, such as histone modification dynamics and epigenetic priming.
Histone modifications play key roles in many biological processes. Mammalian genomes contain histone-modifying enzymes that are responsible for modifying histone tails by adding or removing chemical groups, such as methyl and acetyl groups. The placement of histone modifications is precisely regulated to ensure that specific regulatory elements and genes are correctly activated or repressed in a given cell-type, environment, or development stage. Understanding the mechanisms that regulate locus-specific modification in a cell-state–dependent manner is critical toward uncovering the grammar of epigenetic regulation.
A possible mechanism to establish or maintain locus-specific histone modification is through binding of sequence-specific proteins or noncoding RNAs, which recruit or enhance the modifying enzymes’ binding to a particular locus. Other factors can contribute to this specificity, such as DNA methylation, chromatin accessibility, and 3D chromatin contacts. Because histone modifications are wiped out and reestablished in the zygote, the information encoded in the DNA sequence is pivotal to initiate the process of locus-specific histone modifications. Despite the existence of other contributing factors, it is still critical to comprehensively catalog the sequence motifs that can provide locus-specific guidance for the enzymatic functions, which can be the first step toward fully decoding the mechanisms regulating locus specificity of histone modifications. Furthermore, if particular DNA motifs are associated with histone modifications in many and diverse cell types, they are likely important or even causally related to histone modifications. An analogy is that a transcription factor (TF) recognizes the same DNA motif but its binding sites are cell-type–dependent. However, if we identify all motifs enriched in the TF binding sites across a large and diverse set of cell types, the most common motif is likely the one recognized by the TF. Histone modifications are more complicated than a single TF binding and one histone mark can be regulated by multiple factors recognizing different motifs. Therefore, a comparative analysis across diverse cell types/tissues is critical.
Recently, machine learning approaches have proven useful in understanding epigenetic processes. For example, a support vector machine has been used to predict the impact of SNPs on DNase I sensitivity in their native genomic context (1). Histone modifications have been predicted solely from knowledge of TF binding, both at promoters and at potential distal regulatory elements, using a logistic regression-based classifier (2), and k-mer features have been used to train a logistic regression model that distinguishes peak sequences from flanking regions (3). Our previous work also demonstrated that DNA motifs are predictive of histone modifications and DNA methylation in five cell types (4). All of these works suggest that it is possible to decipher the grammar encoded in the genome that regulates epigenetic modifications, but the scope of the previous studies is still limited.
Furthermore, because the protein sequences of many histone-modifying enzymes are conserved, it would also be interesting to investigate whether the regulatory grammar that controls the placement of histone modification is conserved. However, a direct comparison between the human and mouse genome is unlikely to identify these motifs because they may be dispersed in the overall nonconserved genomic regions. A strategy to circumvent this difficulty is to uncover the DNA motifs associated with the same histone modification patterns in different species and then compare the similarities between them to assess their conservation.
Here, we present a comprehensive survey of histone modification-associated motifs in a large set of diverse cell types and tissues in both human and mouse (5, 6). Comparative analyses have revealed that 107 motifs are conserved between human and mouse. Furthermore, in the human embryonic stem cell line H1, mutating the motifs led to a significant perturbation of H3K27ac levels. We also found that the histone-associated motifs are likely to overlap with SNPs in cancer patients, which suggests their regulatory functions.
Results
Identification of DNA Motifs in 121 Human Cell Types.
To have a comprehensive catalog of the cis-regulatory elements that are involved in regulating the human epigenome, we used Epigram (4) to analyze the data of six histone modifications from 121 different cell-types and tissues generated by the NIH Roadmap Epigenomics Project (6) and ENCODE (5) (Fig. 1A).
In general, Epigram looks for enriched motifs that best differentiate the foreground from the background sequences. The program first computes an enrichment score for each k-mer based on how often it appears in the input sequences compared with the shuffled input sequences and a genomic background. k-mers are then ranked based on their final weights:
with W as the k-mer’s enrichment weight, PP as the proportion of sequences that contains the k-mer over the total number of input sequences, Ewg as the k-mer’s enrichment over the genomic background, and Esh as its enrichment over the shuffled input. Position weight matrices (PWMs) are then generated by first picking a top k-mer and enriched k-mers similar to itself to construct a “seed” PWM, which is then extended by adding more enriched k-mers that are a few base pairs shifted from the original one. The motifs are then further ranked and filtered based on how well they differentiate the foreground from the background using LASSO (least absolute shrinkage and selection operator) logistic regression. The final set of motifs is then evaluated by random forest.
Epigram was individually applied to each dataset (see Materials and Methods for details). For each histone modification in each sample, Epigram found DNA motifs that discriminate enrichment peaks of the mark under consideration from a background of regions that do not overlap with any peak of the six histone modifications. Importantly, the background has the same GC content, number of regions, and sequence lengths as the foreground to avoid inflated prediction results caused by simple features or an unbalanced dataset (4). In our previous paper (4), we performed several additional analyses to remove confounding factors, such as some histone marks preferring particular genomic regions (e.g., H3K4me3 in promoters). Those analyses showed that the identified motifs can discriminate the modified regions from different backgrounds. Given the large number of experiments analyzed in this study, we did not repeat these additional analyses for each experiment. We achieved good performance, with average areas under the curve (AUCs) ranging from 0.71 to 0.91 (Fig. 1A and Dataset S2).
In total, Epigram identified 65,361 motifs. Because some motifs are likely to be shared between different cell types or histone modifications, it is not surprising that many motifs were found multiple times. To reduce the redundancy, we used a motif distance metric to quantify the similarity between different motifs, based on which we hierarchically clustered the motifs (see Materials and Methods for details). The resulting tree was then cut using a threshold of 0.15, corresponding to a P value of ∼10^−3 that was calculated using a distribution of similarity distances for randomly shuffled motifs (SI Appendix, Fig. S1 shows the process and an example of a cluster). Motifs within the distance threshold were considered to represent the same motif; the examples shown in SI Appendix, Fig. S1 are visibly similar. The motif with the highest enrichment score within each cluster was selected as the representative, where the enrichment score was computed by comparing the occurrence of a motif in the histone modification peaks of interest and the background. As a result, we obtained motif clusters. To identify the most confident motifs, we selected the largest clusters so that together they capture roughly 50% of the original motifs. In the end, there are 361 clusters with at least 40 individual motifs (containing 52.6% of our total starting motifs); the resulting clusters are shown in Fig. 1B (see example motifs in Dataset S1).
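The cluster selection step described above (keep the largest clusters until they hold roughly half of all motifs) can be illustrated with a minimal sketch. The function name `pick_size_cutoff` and the toy cluster sizes are hypothetical and are not taken from the Epigram code; the sketch only assumes that the cutoff is a minimum cluster size and that all clusters of a given size are kept or dropped together.

```python
from collections import Counter


def pick_size_cutoff(cluster_sizes, target_fraction=0.5):
    """Find the largest minimum-size cutoff whose clusters cover the target.

    Returns (cutoff, number_of_clusters_kept, fraction_of_motifs_covered).
    Clusters are added one size class at a time, largest first, so ties in
    size are never split across the cutoff.
    """
    total = sum(cluster_sizes)
    clusters_per_size = Counter(cluster_sizes)
    covered = 0
    kept = 0
    for size in sorted(clusters_per_size, reverse=True):
        covered += size * clusters_per_size[size]
        kept += clusters_per_size[size]
        if covered / total >= target_fraction:
            return size, kept, covered / total
    raise ValueError("cluster_sizes must not be empty")


# Toy example (hypothetical sizes, not the study's data).
toy_sizes = [50, 40, 40, 10, 5, 3, 2]
cutoff, n_kept, fraction = pick_size_cutoff(toy_sizes)
print(cutoff, n_kept, round(fraction, 3))
```

In the human analysis reported here, the analogous cutoff was 40 members, which retained 361 clusters holding 52.6% of the motifs.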
To determine whether a motif cluster is mark-specific or shared between marks, we counted the number of times that its member motifs were found to be predictive of each mark in any cell or tissue. We then performed a hypergeometric test (P value cut-off of 10^−3) to identify statistically significant associations between the motif cluster and marks. The background of the hypergeometric test was the original set of 65,361 motifs. For each cluster, the hypergeometric test was based on all members of that cluster. For example, cluster H3K4me3+H3K27ac_872 had 384 motifs in total, among which 133 were identified from H3K4me3 experiments and 84 from H3K27ac experiments, while the background contained 10,936 of the total 65,361 motifs obtained from H3K4me3 experiments and 8,839 obtained from H3K27ac experiments; the P value was thus 1.01 × 10^−16 for association with H3K4me3 and 1.65 × 10^−5 for H3K27ac. Among the 361 motifs, 303 are associated with only one histone mark, indicating their high specificity to histone modification. For these mark-specific motifs, H3K36me3 and H3K9me3 contribute a large portion (117 and 89 motifs, respectively), and the motifs associated with narrow marks are inclined to be shared between marks (Fig. 1C). Because broad marks like H3K36me3 often cover whole gene bodies, identified motifs can come from intron or exon regions. These are confounding factors in predicting H3K36me3 signals. Because H3K36me3 has been shown to be important for splicing (7), the found motifs can be important for both establishing H3K36me3 and regulating splicing. Some H3K36me3 motif clusters contain some motifs associated with H3K4me1 (Fig. 1B). However, when the background was taken into account, these motif clusters did not pass the hypergeometric test for H3K4me1 enrichment and thus were not classified as such.
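As an illustration of the enrichment test on the cluster H3K4me3+H3K27ac_872 example, the following sketch computes an upper-tail hypergeometric probability with exact integer arithmetic. It assumes a one-sided test of the form P(X ≥ observed); the exact tail convention used in the study is not stated in this section, so the values it prints need not reproduce the reported P values to the last digit. The function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(observed, population, successes, draws):
    """P(X >= observed) for a hypergeometric variable X.

    population: all motifs (here 65,361)
    successes:  motifs in the population found for the mark of interest
    draws:      number of motifs in the cluster
    """
    upper = min(successes, draws)
    # Exact big-integer sum of the tail; converted to float only at the end.
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(observed, upper + 1)
    )
    return numerator / comb(population, draws)


# Counts quoted in the text for cluster H3K4me3+H3K27ac_872.
p_h3k4me3 = hypergeom_upper_tail(133, 65361, 10936, 384)
p_h3k27ac = hypergeom_upper_tail(84, 65361, 8839, 384)

cutoff = 1e-3  # P value cut-off stated in the text
print(p_h3k4me3 < cutoff, p_h3k27ac < cutoff)
```

Under this assumption both marks pass the 10^−3 cut-off, which is consistent with the cluster being labeled as shared between H3K4me3 and H3K27ac.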
Among the 58 motifs associated with more than one histone mark, a large portion are associated with H3K27ac, H3K4me3, and H3K4me1. In general, broad histone marks do not share motifs with narrow marks. Furthermore, the multimark motifs are largely associated with functional combinations of histone marks. For example, H3K27ac shares a significant number of motifs with H3K4me3 and H3K4me1, which is reasonable because H3K4me3/H3K27ac and H3K4me1/H3K27ac mark active promoters and enhancers, respectively. In contrast, H3K4me3 and H3K4me1 do not share motifs with each other. We also found that H3K27me3 and H3K4me1 share motifs, which is not surprising as they together mark poised enhancers (8). H3K27me3 and H3K4me3 also share motifs, as these motifs occur in bivalent promoters, which are important in early embryogenesis (9).
The majority of the found motifs do not match any known motif: in human, 71 of 361 motifs have a match using TomTom (14) at an e-value cut-off of 0.1 (examples of known motif matches are in Fig. 1D and SI Appendix, Fig. S2D). We have provided the complete list of the identified motifs, and whether they match any known motif, in Dataset S1. Numerous identified motifs are known to be important for histone modifications. For example, the c-JUN motif was found to be associated with H3K27ac in our analysis, which is consistent with previous studies showing the regulatory role of c-JUN on histone modifications, such as Ser10 phosphorylation of histone H3, acetylation of histones H3 and H4, and recruitment of histone deacetylase 3 (HDAC3), NF-κB subunits, and RNA polymerase II across the ccl2 locus (10). In c-JUN–deficient cells, HDAC3 binding around the ccl2 locus was low compared with nondeficient cells, leading to increased histone acetylation levels in the 5′ region of the transcription start site (TSS) (8). Other examples include SP1 and SP3 motifs that are known to recruit HDAC1 to repress transcription of various genes; HDAC inhibitors can target SP1 sites to activate transcription (11). Thus, it is reasonable to find these motifs within promoter/enhancer-specific histone marks. We also found the motif recognized by the cAMP response element-binding protein (CREB). CREB is known to recruit CBP (CREB-binding protein), which has intrinsic histone acetyltransferase (HAT) activity (12).
Experimentally Validating the Possible Regulatory Roles of DNA Motifs on Histone Modifications.
We selected H3K27ac for experimental validation as it marks both active promoters and enhancers. To validate the direct impact of the identified motifs on histone modifications, we mutated the motifs using CRISPR/Cas9 rather than deleting the entire region. The advantage of this approach is that the investigated sequence keeps the same length, which avoids effects on H3K27ac caused by sequence deletion rather than motif disruption (SI Appendix, Tables S2–S4). On the other hand, this strategy limited our choice of cell lines to embryonic stem cells, in which recombination is possible, as opposed to fully differentiated cells. Therefore, we focused on the histone-associated motifs identified from H1 embryonic stem cells and four H1-derived cells representing early developmental stages (Mes, MSC, NPC, TRO cells) (4). For the experiment, we chose motif clusters that are associated with H3K27ac and scanned two regions: one from the top-ranked locus predicted by Epigram, in chromosome 3 (chr3), and one from a middle-ranked locus in chromosome 1 (chr1). The regions are ranked by the number of trees within the random forest model in Epigram that correctly predicted the regions as containing the H3K27ac modification, which indicates the confidence of prediction. These two regions are about 300 bp long, making them suitable for motif shuffling with the genome-editing strategy. The score cut-off chosen for each motif to call an occurrence was the score that best differentiated the foreground from the background in Epigram. The chr3 site contains four motifs, including three matched to known ones (TEAD4, GATA, and JUNB), and the chr1 site contains seven motifs, including two matched to known ones (TEAD4 and ZBTB33) (SI Appendix, Table S2).
Because the strategy to introduce shuffled motifs into the genome will have residual loxP sites left after selection cassette removal (Materials and Methods and Fig. 2 A and B), we need to consider the residual loxP sequence when comparing the histone modification between unmodified and modified cells. The modified cells with both alleles that have loxP but are heterozygous at motif regions are ideal for our purpose, where the allelic WT motif regions served as a control in comparing the status of histone modification. We used CRISPR plasmid and homology-directed repair (HDR) donor plasmid to cotransfect the H1 cells, followed by Puromycin selection, clonal isolation, and genotyping PCR. By analyzing the genotyping PCR and Sanger sequencing results of the clones from the two chromosome loci, we found that about half of the clones had loxP cassette in both alleles but the shuffled motifs region in only one of the alleles (named motif-shuffled heterozygote), and the rest of clones had a loxP cassette and WT motifs region in both alleles (named loxP-control homozygote). No clones with the loxP cassette and shuffled motifs region in both alleles were found. These results were ascribed to the possibility that the spacer region may also serve as the homology arm along with the left homology arm to facilitate the integration of the loxP cassette (indicated with the light blue dashed line in Fig. 2B). For our purpose, the motif-shuffled heterozygote should be ideal for comparing the H3K27ac level between alleles within the same cell, while the loxP-control homozygote would serve as a control for comparing the potential effect of residual loxP sequence to unmodified cells. For each locus, we selected a motif-shuffled heterozygote and a loxP-control homozygote from the sequenced clones to proceed with Cre-mediated loxP cassette removal. 
Subsequent clones were identified by genotyping PCR for the loxP cassette removal from both alleles and further confirmed by Sanger sequencing.
We performed H3K27ac ChIP assays on three different genotypes for each locus, including unmodified H1 with two WT alleles, a loxP-control homozygote with two loxP-control alleles, and a motif-shuffled heterozygote with one loxP-control allele and one mutant allele. Regarding the allele-specific probes for qPCR, we designed two pairs of primers for each allele based on the sequence differences between the WT and sequence-shuffled motifs (MS) (Fig. 2 C and D). The ChIP-qPCR result was analyzed with the percent input method and further normalized to internal controls of regions with low and high H3K27ac level to minimize the variability of processing different samples. In the chr1 locus, both of the two WT probes (WT1 and WT2) showed similar H3K27ac level between the samples of H1 and loxP-control homozygote, indicating the effect caused by the residual loxP sequence was negligible in this locus (Fig. 2C). In the sample of motif-shuffled heterozygote, the allele-specific probe chr1-MS1 showed about 60% decrease of the H3K27ac level compared with its paired probe chr1-WT1 with P = 0.0072 from three biological replicates, while another pair of probes showed 33% decrease with P = 0.0371 (Fig. 2C). Similar results were found in the chr3 locus (Fig. 2D). These results validated the regulatory roles of the identified motifs on histone modifications. Note that the top-ranked chr3 locus has four motifs mutated but showed more significant alteration of the H3K27ac signals compared with the middle-ranked chr1 locus with seven motifs mutated.
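For readers unfamiliar with the percent input method mentioned above, a minimal sketch of the standard calculation follows. It assumes 100% primer efficiency (a doubling per cycle), and the Ct values and the 1% input fraction are hypothetical illustrations, not measurements from this study. The additional normalization to low- and high-H3K27ac internal control regions is not specified in this section and is therefore omitted.

```python
from math import log2


def percent_input(ct_ip, ct_input, input_fraction):
    """Percent input for one ChIP-qPCR probe.

    ct_ip:          Ct of the immunoprecipitated sample
    ct_input:       Ct of the input sample
    input_fraction: fraction of chromatin saved as input (e.g., 0.01 for 1%)
    Assumes perfect amplification efficiency (factor 2 per cycle).
    """
    # Shift the input Ct to what 100% of the chromatin would have produced.
    adjusted_input_ct = ct_input - log2(1 / input_fraction)
    return 100 * 2 ** (adjusted_input_ct - ct_ip)


def relative_decrease(signal_mutant, signal_wt):
    """Fractional drop of the mutant-allele signal relative to the WT allele."""
    return 1 - signal_mutant / signal_wt


# Hypothetical Ct values for a WT probe and its paired motif-shuffled probe.
wt = percent_input(ct_ip=24.0, ct_input=25.0, input_fraction=0.01)
ms = percent_input(ct_ip=25.0, ct_input=25.0, input_fraction=0.01)
print(round(wt, 3), round(ms, 3), round(relative_decrease(ms, wt), 3))
```

With these illustrative numbers, a one-cycle delay of the mutant probe corresponds to a halving of the signal, which is the scale on which the allele-specific comparisons above are read.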
DNA Motifs Associated with Histone Modifications in Mouse Embryonic Tissues.
To investigate how the DNA motifs associated with histone modifications have evolved, we conducted the same analysis on the mouse ENCODE dataset, which contains 8 histone modifications (H3K4me1/2/3, H3K9ac/me3, H3K27ac/me3, H3K36me3) from 12 embryonic tissues at 7 different developmental stages (SI Appendix, Table S5). To be consistent with the other analyses done by the ENCODE Consortium, we used the peaks called by the ENCODE DCC. While the performance on H3K4me3 is comparable between human and mouse, the average AUCs of Epigram for the other marks are slightly (about 0.04–0.05) lower in mouse than in human. There can be many reasons for this difference, one of which is the difference in data quality: most of the human data are from cell lines and are of higher quality than the mouse data, which were obtained from tissues composed of heterogeneous cell types. We indeed observed a lower number of broad peaks in the mouse samples than in the human samples: for example, several thousand H3K9me3 peaks in many mouse tissues compared with an average of 40,000 peaks in human, an average of 13,000 and 32,000 H3K27me3 peaks in mouse and human, respectively, and 22,000 and 47,000 H3K36me3 peaks in mouse and human, respectively. Despite that, the performance is still at a significant level (AUC of 0.7–0.95 in SI Appendix, Fig. S2A and also in Dataset S2).
We identified 48,080 motifs in mouse. After hierarchical clustering, we obtained 5,086 motif clusters. To focus on the most confident motifs and achieve a number of motifs comparable to that in human, we selected the clusters using a size cut-off so that the resulting clusters contain roughly 50% of all of the original motifs. With a size cut-off of 30, we ended up with 369 clusters, containing 50.8% of the total motifs. Among these 369 motifs, 94 are matched with known motifs (Dataset S1). Similar to the human results, a majority (263 motifs) of the 369 motifs are specific to their respective histone marks, while 89 motifs are associated with two or more marks (SI Appendix, Fig. S2B). This number is higher than that of the human analysis (58 shared motifs), likely because we included two more narrow-peak histone modifications (H3K9ac and H3K4me2) and narrow marks tend to share motifs with each other, as observed in human.
The distribution of mark-specific motifs in mouse resembles that in human (SI Appendix, Fig. S2 C and D). For example, the H3K9me3- and H3K36me3-specific motifs account for a large portion of the motifs, but another broad mark, H3K27me3, has only a few specific motifs. Furthermore, numerous DNA motifs matched to known ones (such as c-Jun, SP1, SP3, and CREB, discussed above) were also found in mouse, and some additional example motifs [such as USF1, known to recruit histone modification complexes (13)] are shown in SI Appendix, Fig. S2D.
Histone Modification-Associated Motifs Are Conserved Between Human and Mouse.
The similarities from the two independent analyses in two species indicate that the possible regulatory relationship between DNA motifs and histone modifications may be conserved. In fact, among the 361 human and 369 mouse motifs, 107 of them are conserved [TomTom (14) e-value cut-off of 0.1]. Among these 107 conserved motifs, a majority are associated with the same or similar histone marks (Fig. 3A): 24 with the same mark; 67 motifs with at least one shared mark in the multimark-associated motifs or with different marks that occur in similar regions. For example, except for one motif, the H3K4me3 human motifs are all associated with H3K4me2, H3K9ac, H3K27me3, or H3K27ac in mouse. A small portion (16 motifs) of conserved motifs has different mark associations between human and mouse (example motifs in SI Appendix, Table S1).
We next examined whether these conserved motifs appear in conserved regions. PhastCons (15) scores from the multiple alignment of human and 45 vertebrate genomes (including mouse) were plotted for these motifs. The conserved motifs appear in regions having significantly higher PhastCons scores than the nonconserved ones (Fig. 3C). Among the conserved motifs, motifs associated with the same marks show the highest overall PhastCons scores, which is not unexpected. Manual examination of example loci also confirmed this trend (Fig. 3C). The conserved motifs with different mark associations and the nonconserved motifs may be fast-evolving or species-specific ones, so their appearance in regions with relatively lower PhastCons scores is not surprising.
For the 91 motifs retaining the same or similar marks between human and mouse, the conservation patterns vary. H3K4me3 is a promoter mark and its associated motifs appear in the most conserved regions. H3K4me1 is an enhancer mark and also appears in promoters; its motifs’ PhastCons scores are also higher than the background. Notably, the PhastCons score shows a dip at the H3K4me3/1 motifs. We performed SPAMO (16) analysis on the H3K4me3 motif loci versus TSSs and found that conserved H3K4me3 motifs are frequently 6–7 bp downstream of the TSS, with a P value of 1.89 × 10^−7 (SI Appendix, Fig. S4). Because TSSs are the most conserved (Fig. 3B), this creates the dip when the PhastCons scores are plotted centered on the motifs without considering the orientation of the promoters. We suspect that the dip in the H3K4me1 plot may have a similar cause, because enhancers may have an orientation bias, as indicated by unidirectional eRNA signals (17). The motifs associated with the other four marks all show a peak in the PhastCons score at the motif location. H3K9me3 motifs are the most conserved and are surrounded by a background-level PhastCons score; note that some known TF motifs, such as ELK1, SOX2, NFYA, and NANOG, have similarly peaked conservation at the motif sites, but not every known motif shows such a pattern. For example, TEAD1/TEAD4 and GATA1 have low conservation at motif sites (Fig. 3B). Because H3K9me3 marks heterochromatin, it is not surprising that the PhastCons scores in the nearby regions are the same as the background. What is striking is that the H3K9me3 motifs are the most conserved based on PhastCons scores, and this is true for both conserved and nonconserved H3K9me3 motifs between human and mouse. 
As a comparison, the regions immediately around the H3K27ac, H3K27me3, or H3K36me3 motifs are less conserved than the background, which indicates that these regions are overall fast evolved; but the motif positions have a peaky PhastCons score that suggests their functional importance.
Histone Modification-Associated Motifs Overlap with Disease Expression Quantitative Trait Loci SNPs.
To further explore the functional roles of the found motifs, we took the SNP data in The Cancer Genome Atlas (TCGA) (18) on 32 cancers to determine whether these motifs are important in diseases. We first used Matrix eQTL (19) to identify expression quantitative trait loci (eQTL) SNPs from the mutation and RNA-sequencing (RNA-seq) data. The resulting eQTL SNPs were then overlapped with the human motifs’ loci (an example is shown in SI Appendix, Fig. S3) to determine whether the histone-associated motifs overlap with eQTL SNPs more often than random and known motifs. To compare between the motifs, we calculated the number of overlaps per kilobase of SNPs per million bases of motif binding sequences (OPKM) (Materials and Methods).
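One plausible reading of the OPKM definition, by analogy with RPKM-style normalization, is sketched below. The exact formula is given in Materials and Methods, which is not part of this section, so the expression here is an assumption: the overlap count is divided by the number of SNPs in thousands and by the total motif binding sequence length in megabases. The function name and the example counts are hypothetical.

```python
def opkm(n_overlaps, n_snps, motif_bp):
    """Overlaps per kilobase of SNPs per million bases of motif sequence.

    n_overlaps: number of eQTL SNPs falling inside motif occurrences
    n_snps:     total number of eQTL SNPs considered for the cancer type
    motif_bp:   total length (bp) of the motif's binding sequences
    Assumed form: overlaps / (n_snps / 1e3) / (motif_bp / 1e6).
    """
    if n_snps <= 0 or motif_bp <= 0:
        raise ValueError("n_snps and motif_bp must be positive")
    return n_overlaps / (n_snps / 1e3) / (motif_bp / 1e6)


# Hypothetical counts: 50 overlaps, 10,000 SNPs, 2 Mb of motif sequence.
print(opkm(50, 10_000, 2_000_000))
```

Normalizing by both quantities makes motifs with different numbers of occurrences, and cancers with different numbers of SNPs, comparable on a single scale.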
The distributions of motifs H3K4me3, H3K27ac, and H3K27me3 over gene bodies (Fig. 4A) show that these motifs are more specific to the first 10% of the gene body, which is close to their promoters and the first exons. H3K9me3, H3K36me3, and H3K4me1 motifs are more spread out over the genome and consistently they are roughly equally distributed over gene bodies. TCGA SNPs concentrate in the second half and the first 10% of the gene body. As a result, the overlaps between SNPs and histone motifs have a bias toward the gene’s 3′ end. However, we did observe that a significant number of overlaps occur at the first 10% of the gene body (Fig. 4A). The first exon is known to be important for establishing the histone modifications needed for gene transcription. H3K4me3 and H3K27ac are active promoter marks, while H3K27me3 indicates repressed or poised promoters. Their associated motifs overlapping with eQTL SNPs at the beginning of the gene body in cancer patients is thus reasonable and this observation also indicates the functionality of the found motifs.
The human motifs associated with H3K4me3 overlap significantly more often with eQTL SNPs compared with both known and random motifs (Fig. 4B). A Mann–Whitney U test was used to calculate the P value for each mark’s log(OPKM) distribution over all 32 cancers being shifted to the right of that of the known motifs. For example, the log(OPKM) distributions for H3K4me3-associated motifs and known motifs have means of 0.969 (SD of 1.726) and −0.163 (SD of 1.22), respectively. The corresponding P values for all H3K4me3, H3K27ac, and H3K27me3 motifs were 0.00, 1.045 × 10^−68, and 8.43 × 10^−5, respectively. Among the histone motifs that overlap the most with cancer SNPs, several match known motifs, such as ZNF639, NRF1, and CREB (Fig. 4D), that have been shown to be related to cancer development. For example, in liver, inactivation of the NRF1 gene can lead to hepatic neoplasia (20). ZNF639 protein has been shown to be associated with the pathogenesis of oral and esophageal squamous cell carcinomas (21). CREB protein is mutated in more than 85% of microsatellite instability colon cancer cell lines (22).
Note that these known motifs have only intermediate log(OPKM) values, and the de novo motifs (motifs that did not have a significant match with HOCOMOCOv10 when using TomTom) have much higher log(OPKM) values. This suggests that the found motifs are biologically relevant. It is also important to note that the TCGA SNPs were measured by SNP arrays, which are designed with probes focused on promoters and gene bodies. This genomic location bias may explain why Epigram motifs associated with histone marks such as H3K9me3 and H3K4me1 do not overlap more with TCGA SNPs than the known motifs. Interestingly, the conserved H3K4me3 motifs showed more overlap than the nonconserved motifs (Fig. 4B), which suggests that these conserved motifs are more relevant to gene expression. Surprisingly, compared with randomly shuffled motifs, only H3K4me3 motifs overlap more with the SNPs (P value of 8.95 × 10^−109), and the known motifs overlap even less than the random motifs (P value ∼0.0), regardless of whether the random motifs were generated by shuffling histone motifs or known motifs. A possible explanation is that the known motifs are crucial for housekeeping functions and disease SNPs avoid disrupting them to facilitate proliferation of tumor cells. This speculation needs further experimental investigation.
Different cancers have drastically different mutation rates (23) (Fig. 4C). When considering OPKM, the cancers with relatively lower mutation rates have higher OPKM values than those with relatively higher mutation rates, which indicates that the somatic mutations tend to occur in the histone-associated motifs. For example, acute myeloid leukemia (LAML) has a significantly lower somatic mutation frequency than lung adenocarcinoma (LUAD); consistently, LAML has significantly higher OPKM with all motifs than LUAD (Fig. 4C). We calculated the correlation between the average mutation rate and average OPKM per cancer for all histone mark-related motifs; the Spearman correlation coefficient is −0.635 with a P value of 9.1 × 10^−5. This observation also supports the functionality of the found motifs. In cancers with a low mutation rate, each mutation is likely more important than in cancers with higher mutation rates. The higher OPKM for these cancers suggests that histone motifs are important and among the first to be altered in cancers.
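The rank correlation used in the comparison above can be sketched in a few lines. The implementation below computes Spearman's coefficient as the Pearson correlation of average ranks (ties share the mean rank); the per-cancer values in the example are hypothetical and are not the TCGA numbers, and the significance test is not included.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation coefficient."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-cancer averages: mutation rate vs. OPKM.
mutation_rate = [0.4, 1.1, 2.5, 8.0, 9.3]
mean_opkm = [5.2, 4.8, 4.9, 1.7, 1.2]
print(round(spearman(mutation_rate, mean_opkm), 3))
```

A negative coefficient, as reported above for the 32 cancers, means that cancers with lower mutation rates tend to have higher OPKM.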
Histone Modification Can Be Regulated via Positive or Negative Feedback Loops.
To further characterize the found motifs, we collected the top 5,000 occurring loci for each motif and performed GREAT (24) analysis. The majority of the motifs did not show any enriched gene ontology (GO) terms, which is not unexpected because these motifs are associated with histone modifications generally needed for every biological process. However, 17 motifs were found to have enriched GO terms related to histone modification. For example, motif H3K4me3_3087, identified as a GCC-box motif (consensus sequence CGCCGCCGCCGC), is highly specific to H3K4me3 (for example, 63% of its occurring loci are within the H3K4me3 peaks in the H1 cells) and was found in 118 of 121 cell lines or tissues; interestingly, its enriched GO terms include histone lysine methylation (hypergeometric P value 4.7366 × 10^−8). In fact, motif H3K4me3_3087 is located at the promoter regions of several methyltransferases (examples in Fig. 5A). An example is SET Domain Containing 3 (SETD3), a protein important for development that can act as a transcription coactivator and histone methyltransferase (25). Thus, SETD3 can potentially activate the transcription of itself and other methyltransferases, further regulating the differentiation process. Another example is Lysine Demethylase 6A (KDM6A), which specifically demethylates Lys-27 of histone H3. KDM6A’s demethylation of Lys-27 is accompanied by methylation of Lys-4 of histone H3 (26), which can potentially further up-regulate KDM6A. For the H3K4me3_3087 motif, its activation is likely to exert a positive feedback through enhancing active marks, including H3K4me1/2/3, and inhibiting repressive marks, including H3K27me3 and H3K9me3.
We have investigated whether H3K4me3 motifs avoid or prefer the promoters of the histone methylation enzymes. Even though H3K4me3 marks the majority of promoters, not every H3K4me3 motif appears in all promoters; in fact, each H3K4me3 motif only appears in on average 7.8% of the gene promoters. From the GREAT database, we identified 105 genes that have terms related to histone methylation for both positive and negative regulations (methyltransferase and demethylase), such as KDM4C, KDM4A, SUZ12, KDM4D, TET2, KDM8, SETD2, and SETD3. We counted the number of H3K4me3 motif matches that are within promoters of histone methylation enzyme genes and found a significant increase compared with promoters of protein-coding genes that are not considered histone-methylation related by GREAT (P value of 1.472 × 10^−6 given by a Mann–Whitney U test) (Fig. 5E). This is consistent with the fact that GREAT analysis picked up the associations with histone methylation in these H3K4me3 motifs.
Interestingly, we observed that the H3K27ac-associated motifs seem to form negative feedbacks on acetylation. The possible feedback mechanisms are derived from the motifs’ occurrence in both the promoters and enhancers closest to the histone modification enzymes. For example, HDAC genes’ promoters all contain H3K27ac-related motifs. Motif H3K27ac_4280 (CCTCCTCCC), found in 39 cells/tissues (P value 2.72 × 10^−3), appears in the promoters of HDAC1/HDAC2 (Fig. 5B) and numerous other deacetylases. HDAC1/2 are responsible for lysine deacetylation of the core histone proteins (H2A, H2B, H3, H4) as annotated in the UniProt database and are specifically documented to deacetylate H3K9ac in the GREAT annotation (Fig. 5B). This may suggest a negative feedback loop of histone acetylation: the H3K27ac motifs are responsible for establishing/maintaining the H3K27ac signal in the promoters of HDACs, which suggests that the HDACs are transcribed; the transcribed HDACs, in turn, deacetylate H3K9ac and/or H3K27ac marks in the genome.
These observations in human were also confirmed in mouse (Fig. 5 C and D). Of the 369 mouse motifs, 91 have enriched GO terms related to histone modification. Fig. 5C shows an example motif, H3K4me3_4223, that is highly specific to the H3K4me3 mark (86.6% of the loci appear within H3K4me3 peaks in mouse forebrain E11.5; 35 of 66 tissue time points contain this motif). This motif appears in the promoter regions of several histone methyltransferases, such as MLL1, also known as KMT2A (appearing in the human example in Fig. 5A), a catalytic subunit of the MLL1/MLL complex that facilitates the methylation of H3K4 and forms a positive feedback to H3K4me3. Similar to the human motifs, acetylation-associated mouse motifs provide negative feedbacks. For example, H3K9ac_4441 (65.15% of its occurrences are located within H3K9ac peaks in mouse forebrain E11.5, and it was found in 40 of 66 tissue time points) appears in the promoter regions of HDACs. The negative feedback loop involves several genes previously seen in the human example (HDAC2-family genes, Sirt1). This illustrates that methylation/acetylation processes can be controlled by the interplay of several factors involving feedback loops.
Discussion
Just as identifying gene-coding sequences in the genome is the first step toward understanding gene expression and function, we argue that cataloguing motifs associated with histone modifications paves the way toward revealing the molecular mechanisms of how the information encoded in the genomic sequence is read to regulate histone modification in a tissue- and time-dependent manner. Taking advantage of the epigenomic data generated by the ENCODE and the Epigenomics Roadmap projects in diverse cell types and tissues, we have established the most comprehensive catalog of DNA motifs associated with histone modifications in both human and mouse. The regulatory function of some of these motifs on local histone modifications was validated by the drastic change of H3K27ac upon mutating only the relevant motifs, and supported by their significant overlap with eQTL SNPs in cancer patients. Of particular interest, cancers with lower somatic mutation frequency tend to have a larger portion of mutations overlapping with histone-associated motifs than cancers with higher somatic mutation frequency, which also supports that the found motifs are functionally important.
Furthermore, the comparison between human and mouse motifs showed that a large portion of the found motifs is conserved. Therefore, the insights obtained from mouse embryogenesis can facilitate studying human development. A surprising observation is that the conservation at the motif loci is significantly different from that of the neighboring regions, such as a dip of PhastCons score at the H3K4me3 motifs compared with the surrounding regions, which is completely different from the conservation profiles of the known TF motifs. Indeed, only a few of the found motifs are similar to the known TF motifs, which may partially explain why the interplay between DNA sequence and histone modifications remains largely mysterious.
Another interesting observation is that the histone-associated motifs appear to relate to histone modification enzymes; for example, the H3K4me3 motifs tend to be associated with methyltransferases, suggesting positive feedbacks, and H3K27ac motifs tend to be associated with deacetylases, which indicates possible negative feedbacks. Because the temporal deposition of histone modifications is particularly important in development and differentiation, the feedbacks provided by the histone-associated motifs may guide studies to reveal the mechanisms underlying processes such as histone dynamics and epigenetic priming.
Materials and Methods
Data Processing.
For human data, ChIP-seq experiments using antibodies for six different histone modifications in 121 cell types were used to assess the predictability of histone modification from DNA motifs. The six histone modifications are H3K4me1, H3K4me3, H3K27me3, H3K27ac, H3K9me3, and H3K36me3. Each of the ChIP-seq experiments had at least two replicates, and input control samples were also provided. Mapped reads were made monoclonal using HOMER. For mouse data, ChIP-seq experiments used antibodies for 8 different histone modifications in 12 embryonic cell types at 7 different developmental time points. The eight histone modifications include the six used in the human data with the addition of H3K9ac and H3K4me2 (SI Appendix, Table S5).
The human data were processed as described previously by Whitaker et al. (4). HOMER was used to call peaks for the ChIP-seq data. We used two different criteria for narrow histone peaks (H3K27ac, H3K4me1, and H3K4me3) and broad peaks (H3K27me3, H3K36me3, and H3K9me3). The mouse data were processed by the ENCODE Processing Pipeline. The pipeline calls peaks separately on each of the biological replicates using MACS2 (27). Note that the resulting replicated peaks were significantly shorter than the peaks obtained from HOMER. Therefore, to match our human data, we further merged histone peaks within 1,000 bp for the narrow peaks, and within 2,500 bp for the broad peaks.
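The merging step can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the function name `merge_peaks`, the tuple representation of peaks, and the reading of "within 1,000 bp" as a gap of at most that many base pairs between the end of one peak and the start of the next are all assumptions.

```python
from collections import defaultdict

# Merge distances taken from the text; the dictionary itself is illustrative.
MERGE_DISTANCE = {"narrow": 1000, "broad": 2500}


def merge_peaks(peaks, max_gap):
    """Merge peaks on the same chromosome whose gap is at most max_gap bp.

    peaks: iterable of (chrom, start, end) tuples with start < end.
    Returns a sorted list of merged (chrom, start, end) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in peaks:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start - current_end <= max_gap:
                # Close enough to the running peak: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        if current_start is not None:
            merged.append((chrom, current_start, current_end))
    return merged
```

Under these assumptions, narrow-mark peaks would be passed with `MERGE_DISTANCE["narrow"]` and broad-mark peaks with `MERGE_DISTANCE["broad"]`.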
Making the Sets of Sequences for the Prediction of Histone Modification by Epigram.
We ran Epigram to compare regions that are enriched with an epigenomic modification to regions that do not possess any of the modifications being considered. The enriched regions, or foreground, were the high-confidence regions that were identified as the intersect of two or more replicates (as described above). To establish a background, we took all of the continuous stretches in the genome that were 100% mappable but did not overlap with any of the histone modification peaks. Regions of the genome are not 100% mappable if the DNA sequence is replicated elsewhere in the genome. This replication of DNA sequences reduces mappability, as it is a requirement of the mapping procedure that reads map uniquely. To measure regions’ mappability, we used a precomputed dataset that considered 35-bp reads mapping uniquely within the human genome. When considering overlap between 100% mappable regions and histone modification peaks, the union of all peaks was used rather than the high-confidence regions (the intersect of two or more replicates).
Applying Epigram to Each Dataset.
Epigram was applied individually to each dataset (each corresponding to one cell type–histone mark pair). For example, Epigram identifies 100 motifs from the H1-H3K4me3 data, then 120 motifs from the H9-H3K27ac data, and so forth. We combined all of these motifs and removed redundancy among them.
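One way to picture the background construction is as an interval subtraction: the union of all peaks is removed from the fully mappable stretches. The sketch below works on a single chromosome and is only an illustration; the function name `subtract_intervals`, the half-open coordinate convention, and the choice to trim mappable stretches at peak boundaries (rather than discard any stretch that touches a peak) are assumptions.

```python
def subtract_intervals(mappable, peaks):
    """Return the parts of `mappable` intervals not covered by any peak.

    Both inputs are lists of half-open (start, end) intervals on one chromosome.
    """
    peaks = sorted(peaks)
    background = []
    for m_start, m_end in sorted(mappable):
        cursor = m_start
        for p_start, p_end in peaks:
            if p_end <= cursor:
                continue  # peak ends before the current position
            if p_start >= m_end:
                break  # remaining peaks start after this mappable stretch
            if p_start > cursor:
                # Uncovered piece between the cursor and the next peak.
                background.append((cursor, p_start))
            cursor = max(cursor, p_end)
            if cursor >= m_end:
                break
        if cursor < m_end:
            background.append((cursor, m_end))
    return background
```

Passing the union of peaks from all replicates as `peaks`, as the text specifies, keeps low-confidence peak regions out of the background.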
Motif Clustering.
We used a standard hierarchical clustering algorithm to group similar motifs. To calculate the similarity between motifs, we first aligned the motifs. Let m1, m2 be two motif PWMs: m1 and m2 were aligned together with a gap penalty that increases based on the number of overhanging positions. For each overlapping position, a Jensen–Shannon Divergence score is calculated. These scores were then averaged to get the overall score. The average score per position was calculated as:
with n being the number of overlapping positions.
This averaging method puts more weight on large differences between the PWMs at single positions than small differences over several positions.
For each alignment, a gap penalty is added to alignments that do not maximize the overlapping portion. The alignment distance was computed as:
with k being the number of gaps in the alignment.
The distances of all possible alignments, including reverse-complementary, were computed and the smallest one was the distance between m1 and m2. Then, a hierarchical tree was constructed using average linkage. Motif clustering was done in two steps. First, motifs from the same histone mark were clustered together and a motif with the highest information content was selected to represent each cluster. Then, the representative motifs of all different marks were clustered together. We used a height cut-off of 0.15 to cut the resulting tree for each clustering step.
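As an illustration of the per-position comparison and of choosing a cluster representative, the sketch below computes the Jensen–Shannon divergence for each overlapping column of two PWMs at a given ungapped offset, and the total information content of a PWM. The function names, the base-2 logarithm, the [A, C, G, T] column layout, and the uniform background used for information content are assumptions; the averaging formula, the gap penalty, and the tree construction are not reproduced here.

```python
import math


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_divergences(m1, m2, offset):
    """Per-position JSD for the overlapping columns of two PWMs.

    m1, m2: lists of columns, each column a list [A, C, G, T] summing to 1.
    offset: index in m1 at which column 0 of m2 is placed (may be negative).
    """
    scores = []
    for j, col2 in enumerate(m2):
        i = j + offset
        if 0 <= i < len(m1):
            scores.append(jensen_shannon_divergence(m1[i], col2))
    return scores


def information_content(pwm):
    """Total information content in bits, assuming a uniform background."""
    total = 0.0
    for col in pwm:
        entropy = -sum(x * math.log2(x) for x in col if x > 0)
        total += 2.0 - entropy
    return total
```

Within a cluster, the representative would then be the motif maximizing `information_content`, for example `max(cluster, key=information_content)`.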
Histone-Associated Motifs Forming Feedback Loops Analysis.
We first used GREAT to analyze the functions of the motifs. For each motif, the top 5,000 loci with the lowest P values were analyzed. When more than 5,000 top loci had the same P value, we randomly picked 5,000 loci from those. The default background was used. For GO-term enrichment cut-offs, we used a false-discovery rate of <0.05 for both the binomial and hypergeometric tests. We tested with 5,000, 4,000, and 3,000 top loci for H3K4me3_3087. In all three cases, the GO term “histone methyltransferase activity” was enriched, albeit at different P values: 4.3244 × 10^−12 for 5,000 loci, 2.7109 × 10^−8 for 4,000 loci, and 3.4110 × 10^−6 for 3,000 loci. Thus, changing the number of loci picked is unlikely to change the results significantly.
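The locus selection rule can be sketched as below. The function name `pick_top_loci`, the `(locus_id, p_value)` input format, and the fixed random seed are hypothetical. The sketch also generalizes the stated rule slightly: any tie at the cutoff P value is broken at random, which covers the case where more than 5,000 loci share the lowest P value.

```python
import random


def pick_top_loci(loci, n=5000, seed=0):
    """Select n loci with the lowest P values, breaking ties at the cutoff at random.

    loci: list of (locus_id, p_value) pairs.
    """
    if len(loci) <= n:
        return list(loci)
    ranked = sorted(loci, key=lambda item: item[1])
    cutoff = ranked[n - 1][1]
    # Loci strictly better than the cutoff are always kept.
    below = [item for item in ranked if item[1] < cutoff]
    # Loci tied at the cutoff compete for the remaining slots.
    tied = [item for item in ranked if item[1] == cutoff]
    rng = random.Random(seed)
    return below + rng.sample(tied, n - len(below))
```

Calling the function with `n=4000` or `n=3000` mirrors the robustness check described above.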
Based on the GREAT results, we examined the functions of the target genes of the histone modification motifs and constructed a network to represent their relationship. First, the motifs were filtered for enriched terms that contain histone acetylation/deacetylation or histone methylation/demethylation based on the GREAT annotation, which can be slightly different from the other databases. For each motif, we identified the genes involved in each of the term (i.e., the enzymes such as methyltransferase and demethylase). We then determined the histone residue that each enzyme modifies and whether the modification is positive or negative (e.g., methylation or demethylation), based on the GREAT database. The motifs were connected to the specific histone marks. Finally, we pooled together the relationship between motif and genes, gene and histone marks, histone marks and motif to build a network.
Call eQTL SNPs in the TCGA Data.
Processed RNA-seq and mutation data were downloaded from Firehose (28). The data contain tumor samples from 32 cancer types. The R package MatrixeQTL was used to find eQTL SNPs. The linear model was chosen. Age, gender, race, days to death, and days to last follow-up were taken from clinical data to use as covariates. To associate a SNP with a gene, we used SNPs that are on the same chromosome and at most 10^6 bp away from the gene’s TSS. The false-discovery rate cut-off was set at 0.05. We further accounted for linkage disequilibrium using PLINK (29) and SNP data from the 1000 Genomes Project (30). For linkage disequilibrium pruning, the minor allele frequency cut-off was 0.10 and the variance inflation factor threshold was 1.5. eQTL SNPs that were in the pruned-out set were removed from further analyses.
OPKM Calculations.
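The OPKM of a motif is defined as the number of SNPs overlapping with all of the motif-occurring loci divided by the total number of SNPs (in thousands) and by the number of base pairs covered by all of the motif-occurring loci (in millions). Thus, the formula is:
$$\mathrm{OPKM} = \frac{N_{\mathrm{overlap}}}{\left(N_{\mathrm{SNP}}/10^{3}\right)\left(L_{\mathrm{loci}}/10^{6}\right)}$$
where $N_{\mathrm{overlap}}$ is the number of SNPs overlapping the motif-occurring loci, $N_{\mathrm{SNP}}$ is the total number of SNPs, and $L_{\mathrm{loci}}$ is the number of base pairs covered by the motif-occurring loci.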
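The verbal definition of OPKM translates directly into a few lines of code. The sketch below follows that definition; the function and parameter names are hypothetical, and the input validation is an added assumption.

```python
def opkm(n_overlapping_snps, n_total_snps, n_bases_covered):
    """OPKM following the verbal definition in the text.

    n_overlapping_snps: SNPs overlapping the motif-occurring loci.
    n_total_snps: total number of SNPs (scaled to thousands below).
    n_bases_covered: base pairs covered by the loci (scaled to millions below).
    """
    if n_total_snps <= 0 or n_bases_covered <= 0:
        raise ValueError("SNP count and covered bases must be positive")
    return n_overlapping_snps / ((n_total_snps / 1e3) * (n_bases_covered / 1e6))
```

For instance, 50 overlapping SNPs out of 10,000 SNPs, with the motif loci covering 2,000,000 bp, give 50 / (10 × 2) = 2.5.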
Plotting PhastCons Scores.
For each motif, 5,000 loci were picked randomly. The resulting sites were combined based on histone mark association. The PhastCons scores derived from multiple alignments of 45 vertebrate genomes with human were used. To plot the baseline, PhastCons scores of 100,000 randomly chosen 10-bp sites in hg19 were also calculated.
SPAMO Analysis.
The original SPAMO algorithm determines whether a particular spacing between a pair of motifs is significantly enriched. In general, the program calculates the distribution of distances between the loci of the primary and secondary motifs and looks for overrepresented distances. We adapted the algorithm to our case, with the primary loci being TSSs and the secondary loci being the occurrences of H3K4me3 motifs.
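The distance distribution underlying this kind of spacing analysis can be sketched as follows. This is not the SPAMO implementation and omits its significance test; it only tallies signed TSS-to-motif distances on a single chromosome. The function name, the strand-agnostic treatment, and the default `max_distance` of 150 bp are assumptions made for illustration.

```python
from bisect import bisect_left
from collections import Counter


def tss_motif_distance_counts(tss_positions, motif_positions, max_distance=150):
    """Count signed distances (motif - TSS) for motif sites near each TSS.

    Only motif sites within max_distance bp of a TSS are counted.
    """
    motif_sorted = sorted(motif_positions)
    counts = Counter()
    for tss in tss_positions:
        # First motif site at or after the left edge of the window.
        lo = bisect_left(motif_sorted, tss - max_distance)
        for pos in motif_sorted[lo:]:
            if pos - tss > max_distance:
                break
            counts[pos - tss] += 1
    return counts
```

Distances with unusually high counts relative to the rest of the window would be the candidates for an overrepresented spacing.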
Experimental Validation Protocols.
The processes for CRISPR/Cas9 and donor construct design, plasmid construction, cell culture and electroporation, and histone H3K27ac ChIP-qPCR analysis are detailed in SI Appendix.
Data.
Clustered motifs for human (361 motifs) and mouse (369 motifs) can be found in Dataset S1. Additional information can be found in the companion website wanglab.ucsd.edu/star/MouseENCODE/HistoneMotifs.
Acknowledgments
This project is partially supported by NIH Grants U54HG006997 and R01HG009626 and California Institute of Regenerative Medicine Grant RB5 07012.
Footnotes
The authors declare no conflict of interest.
This article is a PNAS Direct Submission.
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1813565116/-/DCSupplemental.
References
- 1. Lee D, et al. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015;47:955–961. doi: 10.1038/ng.3331.
- 2. Benveniste D, Sonntag H-J, Sanguinetti G, Sproul D. Transcription factor binding predicts histone modifications in human cell lines. Proc Natl Acad Sci USA. 2014;111:13367–13372. doi: 10.1073/pnas.1412081111.
- 3. Setty M, Leslie CS. SeqGL identifies context-dependent binding signals in genome-wide regulatory element maps. PLoS Comput Biol. 2015;11:e1004271. doi: 10.1371/journal.pcbi.1004271.
- 4. Whitaker JW, Chen Z, Wang W. Predicting the human epigenome from DNA motifs. Nat Methods. 2015;12:265–272, 7, 272. doi: 10.1038/nmeth.3065.
- 5. Yue F, et al.; Mouse ENCODE Consortium. A comparative encyclopedia of DNA elements in the mouse genome. Nature. 2014;515:355–364. doi: 10.1038/nature13992.
- 6. Kundaje A, et al.; Roadmap Epigenomics Consortium. Integrative analysis of 111 reference human epigenomes. Nature. 2015;518:317–330. doi: 10.1038/nature14248.
- 7. Kolasinska-Zwierz P, et al. Differential chromatin marking of introns and expressed exons by H3K36me3. Nat Genet. 2009;41:376–381. doi: 10.1038/ng.322.
- 8. Calo E, Wysocka J. Modification of enhancer chromatin: What, how, and why? Mol Cell. 2013;49:825–837. doi: 10.1016/j.molcel.2013.01.038.
- 9. Vastenhouw NL, Schier AF, Akhtar A, Neugebauer K. Bivalent histone modifications in early embryogenesis. Curr Opin Cell Biol. 2012;24:374–386. doi: 10.1016/j.ceb.2012.03.009.
- 10. Wolter S, et al. c-Jun controls histone modifications, NF-kappaB recruitment, and RNA polymerase II function to activate the ccl2 gene. Mol Cell Biol. 2008;28:4407–4423. doi: 10.1128/MCB.00535-07.
- 11. Sowa Y, et al. Histone deacetylase inhibitor activates the p21/WAF1/Cip1 gene promoter through the Sp1 sites. Ann N Y Acad Sci. 1999;886:195–199. doi: 10.1111/j.1749-6632.1999.tb09415.x.
- 12. Ogryzko VV, Schiltz RL, Russanova V, Howard BH, Nakatani Y. The transcriptional coactivators p300 and CBP are histone acetyltransferases. Cell. 1996;87:953–959. doi: 10.1016/s0092-8674(00)82001-2.
- 13. Huang S, Li X, Yusufzai TM, Qiu Y, Felsenfeld G. USF1 recruits histone modification complexes and is critical for maintenance of a chromatin barrier. Mol Cell Biol. 2007;27:7991–8002. doi: 10.1128/MCB.01326-07.
- 14. Gupta S, Stamatoyannopoulos JA, Bailey TL, Noble WS. Quantifying similarity between motifs. Genome Biol. 2007;8:R24. doi: 10.1186/gb-2007-8-2-r24.
- 15. Siepel A, et al. Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 2005;15:1034–1050. doi: 10.1101/gr.3715005.
- 16. Whitington T, Frith MC, Johnson J, Bailey TL. Inferring transcription factor complexes from ChIP-seq data. Nucleic Acids Res. 2011;39:e98. doi: 10.1093/nar/gkr341.
- 17. Mikhaylichenko O, et al. The degree of enhancer or promoter activity is reflected by the levels and directionality of eRNA transcription. Genes Dev. 2018;32:42–57. doi: 10.1101/gad.308619.117.
- 18. Weinstein JN, et al.; Cancer Genome Atlas Research Network. The Cancer Genome Atlas pan-cancer analysis project. Nat Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
- 19. Shabalin AA. Matrix eQTL: Ultra fast eQTL analysis via large matrix operations. Bioinformatics. 2012;28:1353–1358. doi: 10.1093/bioinformatics/bts163.
- 20. Xu Z, et al. Liver-specific inactivation of the Nrf1 gene in adult mouse leads to nonalcoholic steatohepatitis and hepatic neoplasia. Proc Natl Acad Sci USA. 2005;102:4120–4125. doi: 10.1073/pnas.0500660102.
- 21. Gen Y, et al. SOX2 identified as a target gene for the amplification at 3q26 that is frequently detected in esophageal squamous cell carcinoma. Cancer Genet Cytogenet. 2010;202:82–93. doi: 10.1016/j.cancergencyto.2010.01.023.
- 22. Ionov Y, Matsui S, Cowell JK. A role for p300/CREB binding protein genes in promoting cancer progression in colon cancer cell lines with microsatellite instability. Proc Natl Acad Sci USA. 2004;101:1273–1278. doi: 10.1073/pnas.0307276101.
- 23. Lawrence MS, et al. Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature. 2013;499:214–218. doi: 10.1038/nature12213.
- 24. McLean CY, et al. GREAT improves functional interpretation of cis-regulatory regions. Nat Biotechnol. 2010;28:495–501. doi: 10.1038/nbt.1630.
- 25. Eom GH, et al. Histone methyltransferase SETD3 regulates muscle differentiation. J Biol Chem. 2011;286:34733–34742. doi: 10.1074/jbc.M110.203307.
- 26. Lee MG, et al. Demethylation of H3K27 regulates polycomb recruitment and H2A ubiquitination. Science. 2007;318:447–450. doi: 10.1126/science.1149042.
- 27. Zhang Y, et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9:R137. doi: 10.1186/gb-2008-9-9-r137.
- 28. Broad Institute TCGA Genome Data Analysis Center. Data from “Analysis-ready standardized TCGA data from Broad GDAC Firehose 2016_01_28 run.” 2016. doi: 10.7908/C11G0KM9. Available at http://gdac.broadinstitute.org/runs/stddata__2016_01_28/. Accessed February 11, 2017.
- 29. Chang CC, et al. Second-generation PLINK: Rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7. doi: 10.1186/s13742-015-0047-8.
- 30. Auton A, et al.; 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. doi: 10.1038/nature15393.