Author manuscript; available in PMC: 2020 Aug 17.
Published in final edited form as: Nature. 2020 Jul 29;583(7818):752–759. doi: 10.1038/s41586-020-2119-x

Spatiotemporal DNA Methylome Dynamics of the Developing Mouse Fetus

Yupeng He 1,2, Manoj Hariharan 1, David U Gorkin 3, Diane E Dickel 4, Chongyuan Luo 1,9, Rosa G Castanon 1, Joseph R Nery 1, Ah Young Lee 3, Yuan Zhao 2,3, Hui Huang 3,5, Brian A Williams 6, Diane Trout 6, Henry Amrhein 6, Rongxin Fang 2,3, Huaming Chen 1, Bin Li 3, Axel Visel 4,7,8, Len A Pennacchio 4,7, Bing Ren 3,9, Joseph R Ecker 1,10
PMCID: PMC7398276  NIHMSID: NIHMS1044798  PMID: 32728242

Summary

Cytosine DNA methylation is essential for mammalian development, but understanding of its spatiotemporal distribution in the developing embryo remains limited. As part of the mouse ENCODE project, we profiled 168 methylomes from 12 mouse tissues/organs at 9 developmental stages spanning from embryogenesis to the adult. We identified 1,808,810 genomic regions showing CG methylation variations in fetal tissues. These DNA elements predominantly lose CG methylation during fetal development, whereas the trend is reversed after birth. During late fetal development stages, non-CG methylation accumulated within gene bodies of key developmental transcription factors, coinciding with their transcriptional repression. Integrating genome-wide DNA methylation, histone modification and chromatin accessibility data allowed prediction of 461,141 putative developmental tissue-specific enhancers, whose human counterparts are enriched for disease-associated genetic variants. These spatiotemporal epigenome maps provide a resource for studies of gene regulation during tissue/organ progression and a starting point for investigating regulatory elements involved in human developmental disorders.


Mammalian embryonic development involves exquisite spatiotemporal gene regulation1–3. These processes are mediated by the sophisticated orchestration of transcription factors (TFs) binding to regulatory DNA elements (primarily enhancers and promoters) and epigenetic modifications that influence these events. Specifically, the accessibility of regulatory DNA to TFs is closely related to the covalent modification of histones and DNA4,5.

Cytosine DNA methylation (mC) is an epigenetic modification that plays a critical role in gene regulation6. This base modification occurs predominantly at cytosines followed by guanine (mCG) in mammalian genomes and is dynamic at regulatory elements in different tissues and cell types7–11. mCG can directly affect the DNA binding affinity of a variety of TFs5,12 and targeted mCG removal/addition in promoters correlates with increases/decreases in gene transcription13. Non-CG methylation (mCH; where H = A, C or T) is also present at appreciable levels in embryonic stem cells, oocytes, heart and skeletal muscle, and is abundant in the mammalian brain7–9,11,14–17. In fact, the level of mCH in human neurons exceeds that of mCG9. Although its precise function(s) are unknown, mCH directly affects DNA binding of MeCP2, the methyl-binding protein responsible for Rett Syndrome18.

mC is actively regulated during mammalian development19. However, compared to preimplantation embryogenesis19–21, epigenomic data are lacking for later stages, where anatomical features of the major organ systems emerge and human birth defects become manifest22. To fill this knowledge gap, as part of the mouse ENCODE project, we used the experimentally tractable mouse embryo to generate epigenomic and transcriptomic maps for 12 tissue types at 9 developmental stages starting from embryonic day 10.5 (E10.5) to birth (postnatal day 0, P0) and for some tissues to adult. In this study, we performed whole-genome bisulfite sequencing (WGBS) to generate base-resolution methylome maps. In companion papers, the same tissue samples were profiled using chromatin immunoprecipitation sequencing (ChIP-seq), assay for transposase-accessible chromatin using sequencing (ATAC-seq)23 (described in Gorkin et al.), and RNA-seq (described in He and Williams et al.) to identify histone modification, chromatin accessibility and gene expression landscapes, respectively.

These datasets allow studies of the dynamics of gene regulation in developing fetal tissues, expanding the scope of the previous phase of mouse ENCODE24 which focused on gene regulation in adult tissues.

Highlights of the findings of this study include:

  • Identification of 1,808,810 genomic regions showing developmental and tissue-specific CG methylation variations in fetal tissues, covering 22.5% of the mouse genome.

  • 91.5% of the mCG variant regions have no overlap with promoters, CpG islands or CpG island shores.

  • The dominant methylation patterns observed were a continuous loss of CG methylation prenatally during fetal progression, and CG remethylation postnatally, primarily at distal regulatory elements in the tissues examined.

  • Non-CG methylation accumulation was found at the bodies of genes encoding developmental transcription factors during fetal progression, which was associated with the future repression of these genes.

  • With integrative analyses of DNA methylation, histone modifications and chromatin accessibility data from mouse ENCODE, we predicted 461,141 putative enhancers across all fetal tissues.

  • The putative fetal enhancers accurately recapitulate experimentally validated enhancers in matched tissue types from matched developmental stages.

  • Predicted regulatory elements displayed spatiotemporal enhancer-like active chromatin, which correlates with the dynamic expression patterns of genes essential for tissue development.

  • The human homologs of the fetal putative enhancers are enriched for genetic risk variants for a variety of human diseases.

  • These comprehensive datasets are publicly accessible at http://encodeproject.org and http://neomorph.salk.edu/ENCODE_mouse_fetal_development.html.

Results

Developing fetal tissue methylomes

To assess the mC landscape in the developing mouse embryo, we generated 168 methylomes to cover most of the major organ systems and tissue types derived from the three primordial germ layers (Fig. 1a). All methylomes exceed Encyclopedia of DNA Elements (ENCODE) standards, with deep sequencing (median depth 31.8×), biological replication, a high conversion rate (>99.5%) and high reproducibility; the Pearson correlation coefficient of mCG quantification between biological replicates is > 0.8 (Supplementary Table 1; Methods). The reproducibility of liver methylomes is slightly lower because of the higher sampling variation due to genome-wide hypomethylation (Pearson correlation coefficient > 0.73). To better understand the epigenomic landscape during fetal development, we also incorporated into our analyses histone modification (ChIP-seq), chromatin accessibility (ATAC-seq) and gene expression (RNA-seq) data from the same tissue/organ samples (from the Gorkin et al., and He and Williams et al. companion papers; Supplementary Table 2).
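The replicate reproducibility criterion above is a Pearson correlation of per-site mCG levels. The following is a minimal sketch of that calculation, not the pipeline used in the study; the function name `pearson` and the five mCG values per replicate are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical mCG levels (fraction of methylated reads) at the same five
# CG sites in two biological replicates; these numbers are made up.
rep1 = [0.92, 0.85, 0.10, 0.47, 0.78]
rep2 = [0.90, 0.88, 0.15, 0.40, 0.80]

r = pearson(rep1, rep2)
passes_threshold = r > 0.8  # threshold quoted in the text for non-liver tissues
```

In practice the comparison would run over millions of CG sites (or bins) per replicate pair; the toy vectors only show the shape of the computation.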

Figure 1. Annotation of methylation variable regulatory elements in developing mouse tissues.

a, Tissue samples (green) profiled in this study. Blue cells indicate published data and grey cells mark the tissues and stages that were not sampled because either the organ is not yet formed or it was not possible to obtain sufficient material for the experiment, or it was too heterogeneous to obtain informative data.

b, Global mCG level of each tissue across their developmental trajectories. The adult forebrain is approximated using postnatal 6-week frontal cortex9.

c, Fetal CG-DMRs identified in this study encompass the majority of the adult CG-DMRs from a previous study6. The numbers with and without parentheses are related to fetal CG-DMRs and adult CG-DMRs, respectively.

d, Categorization of CG-DMRs. Proximal CG-DMRs are those overlapping with promoters, CG islands (CGIs) or CGI shores. The rest are distal CG-DMRs. Fetal enhancer-linked CG-DMRs (feDMRs) are those predicted to be putative enhancers, while those within 1 kb of distal feDMRs are flanking distal feDMRs. The remaining distal CG-DMRs showing hypomethylation are primed distal feDMRs. The rest are unexplained distal CG-DMRs, whose functions are unknown, and they are further stratified based on their overlap with transposons. See Methods for details.

e, mCG, H3K27ac and expression dynamics of the Fabp7 gene. Gold ticks represent CG sites, whose height represents mCG level, ranging from 0 to 1. The bottom three tracks display input-normalized H3K27ac enrichment (RPKM, Reads Per Kilobase per Million mapped reads), ranging from 0 to 20. Fabp7 expression (TPM, Transcripts Per Million mapped reads) is shown on the right.

f, mCG and H3K27ac profiles near an experimentally validated enhancer from the VISTA enhancer dataset26. The experimental results on the right show the number of total mouse E11.5 embryos tested and for a given tissue, the embryos where mm50 is active.

The genomes of all fetal tissues were heavily CG methylated, with global mCG level ranging from 70%−82% with the notable exception of liver (60%−74%; Fig. 1b). Mouse fetal liver showed a signature of partially methylated domains (PMDs)7. Strikingly, PMD formation and dissolution precisely coincided with fetal liver hematopoiesis (Supplementary Note 1; Extended Data Fig. 1).

Despite similar global mCG level in fetal tissues, we identified 1,808,810 CG differentially methylated regions (CG-DMRs), which are, on average, 339bp long and cover 22.5% (614 Mb) of the mouse genome (Extended Data Fig. 2a; Methods). This comprehensive fetal tissue CG-DMR annotation captured ~96% (n = 272,858) of all previously reported adult mouse tissue CG-DMRs11, while identifying over 1.5 million new regions (Fig. 1c).

Strikingly, 76% of the CG-DMRs are more than 10kb away from neighboring transcription start sites (Extended Data Fig. 2b). Only 8.5% (n = 153,019) of CG-DMRs overlapped with promoters, CpG islands (CGIs) or CGI shores (Fig. 1d; Extended Data Fig. 2c–e). ~91.5% (1,655,791) of CG-DMRs were distally located and showed a high degree of evolutionary conservation, suggesting that they are functional (Fig. 1d; Extended Data Fig. 2f, g). By integrating these epigenomic datasets, we computationally delineated 468,141 CG-DMRs that are likely fetal enhancers (fetal enhancer-linked CG-DMRs or feDMRs; see later section “Enhancer annotation based on multi-omic data”; Supplementary Dataset). We further categorized the remaining CG-DMRs into four other types according to the degree of mCG difference and their relationship with transposons (Supplementary Note 2; Extended Data Figs. 2, 3). These results provided a comprehensive annotation of mCG variation throughout the mouse genome.
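The counts and percentages quoted for the CG-DMR annotation can be cross-checked with simple arithmetic. The sketch below uses only figures stated in the text, plus one assumption: an approximate mouse reference genome length of 2.73 Gb, which is not given in this section. Because the mean length is rounded, the covered size computed this way is only approximate.

```python
# Figures quoted in the text.
TOTAL_CG_DMRS = 1_808_810
MEAN_LENGTH_BP = 339      # rounded average CG-DMR length
PROXIMAL = 153_019        # overlap promoters, CGIs or CGI shores
DISTAL = 1_655_791        # all remaining CG-DMRs

# Assumption (not from the text): approximate length of the mouse
# reference assembly, used only to turn base pairs into a genome fraction.
ASSUMED_GENOME_BP = 2_730_000_000

covered_bp = TOTAL_CG_DMRS * MEAN_LENGTH_BP
covered_mb = covered_bp / 1e6
genome_pct = 100 * covered_bp / ASSUMED_GENOME_BP
proximal_pct = 100 * PROXIMAL / TOTAL_CG_DMRS
distal_pct = 100 * DISTAL / TOTAL_CG_DMRS

print(f"covered: {covered_mb:.0f} Mb ({genome_pct:.1f}% of assumed genome)")
print(f"proximal: {proximal_pct:.1f}%  distal: {distal_pct:.1f}%")
```

The proximal and distal counts sum exactly to the total, and the two percentages follow directly from those counts.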

The CG-DMRs show various degrees of mCG difference (effect size). The effect size of 71% of CG-DMRs is larger than 0.2, indicating that these CG-DMRs are present in at least 20% of cells in at least one tissue, while CG-DMRs in different categories showed distinct effect sizes (Extended Data Fig. 4a, b). On average, one CG-DMR contains 9 differentially methylated CG sites (DMSs), and in 62% of CG-DMRs, more than 80% of CG sites are DMSs (Extended Data Fig. 4c, d). Interestingly, CG-DMRs with more DMSs showed stronger predicted regulatory activity (Extended Data Fig. 4e). Similarly, as CG-DMRs with larger effect size more likely reflect bona fide mCG variation, they indeed showed stronger anti-correlation with active histone modifications and the transcription of nearby genes (Extended Data Fig. 4f; Supplementary Note 3).

We found some extensive changes in methylation near genes essential for fetal tissue development in these fetal tissue methylomes. For example, Fabp7 is a gene essential for establishing radial glial fiber in the developing brain25. In forebrain, Fabp7 underwent drastic and continuous demethylation as the forebrain matures, associating with increased forebrain-specific H3K27ac and Fabp7 gene expression (Fig. 1e). Also, an experimentally validated enhancer (from VISTA enhancer browser26) of E11.5 heart, limb, nose and several other tissues, exhibits hypomethylation in matched E11.5 tissue (Fig. 1f).

Distinct pre- and postnatal mCG dynamics

The dominant methylation pattern emerging during fetal progression was a continuous loss of mCG at tissue-specific CG-DMRs, which strongly overlap with predicted enhancers (Fig. 2a; Extended Data Fig. 5a). The widespread demethylation is consistent with results in whole mouse embryos from a previous study27. In striking contrast, these CG-DMRs mainly gained mCG after birth (Fig. 2a). To quantify these changes for each developmental period, we counted loss-of-mCG and gain-of-mCG events (mCG decreasing or increasing by at least 0.1 in one CG-DMR; Fig. 2b–d; Methods). From E10.5 to P0, 77% to 95% of the mCG changes were loss-of-mCG, more than 70% of which occurred between E10.5 and E13.5 in all tissues except heart (46%; Extended Data Fig. 5b). The mCG level of 44–84% of tissue-specific CG-DMRs dropped to below 0.5 at E14.5, compared to 16–31% at E10.5. Since allele-specific methylation is relatively rare8, the observed methylation dynamics suggest that at E14.5, most of the tissue-specific CG-DMRs are unmethylated in more than half of the cells in a tissue.
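The event definition above (a change in mCG of at least 0.1 within one CG-DMR over a developmental period) can be sketched as follows. This is an illustrative reading, not the study's implementation: it assumes each period is the interval between two consecutive sampled stages, and the function name `classify_events` and the example trajectory are hypothetical.

```python
def classify_events(mcg_by_stage, threshold=0.1):
    """Label each consecutive-stage interval as 'loss', 'gain' or 'stable'.

    mcg_by_stage: mCG levels (0 to 1) of one CG-DMR, ordered by stage.
    threshold: minimum absolute change counted as an event (0.1 in the text).
    """
    events = []
    for earlier, later in zip(mcg_by_stage, mcg_by_stage[1:]):
        delta = later - earlier
        if delta <= -threshold:
            events.append("loss")
        elif delta >= threshold:
            events.append("gain")
        else:
            events.append("stable")
    return events


# Hypothetical trajectory of one CG-DMR over five consecutive stages.
trajectory = [0.9, 0.7, 0.65, 0.3, 0.5]
events = classify_events(trajectory)
n_loss = events.count("loss")
n_gain = events.count("gain")
```

Summing such labels over all tissue-specific CG-DMRs, period by period, would give counts of the kind plotted in Fig. 2b.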

Figure 2. Tissue-specific CG-DMRs undergo continuous demethylation during embryogenesis and remethylation after birth.

a, mCG level of tissue-specific CG-DMRs. The adult forebrain is approximated using postnatal 6-week frontal cortex9.

b, The numbers of loss-of-mCG (blue) and gain-of-mCG (red) events in tissue-specific CG-DMRs for each developmental period.

c-d, Fraction of tissue-specific CG-DMRs that undergo loss-of-mCG (blue) and gain-of-mCG (red) at each developmental period. Grey lines show the data for each non-liver tissue and the blue/red line shows the average.

e, mCG and H3K27ac dynamics of forebrain-specific CG-DMRs.

f, Relationship between mCG and H3K27ac in tissue-specific CG-DMRs. For each tissue type, tissue-specific CG-DMRs were grouped by their mCG level into L, M and H. Then, we quantified the fraction of tissue-specific CG-DMRs in each category that showed different levels of H3K27ac enrichment.

Compared to the loss of mCG, 57%−86% of the gain-of-mCG events happened after birth (Extended Data Fig. 5c). As a result, 27%−56% of the tissue-specific fetal CG-DMRs become highly methylated (mCG level > 0.6) in adult tissues (at least 4 weeks old), compared to 0.3%−15% at P0, likely reflecting the silencing of fetal regulatory elements (Extended Data Fig. 5d). In forebrain, 70% of forebrain-specific CG-DMRs underwent both prenatal loss-of-mCG and postnatal gain-of-mCG, coinciding with the dramatic methylomic reconfiguration during postnatal forebrain development9 (Extended Data Fig. 5e). However, only 33% of heart-specific CG-DMRs showed a similar trajectory, which is potentially associated with its relatively earlier maturation (Extended Data Fig. 5e). The percentage (8%−15%) is even lower for CG-DMRs specific to kidney, lung, stomach and intestine, suggesting that major demethylation events likely occur during earlier developmental stages.

This widespread demethylation cannot be explained by the expression dynamics of cytosine methyltransferases Dnmt1 and Dnmt3a or the co-factor Uhrf1 (ref. 28), nor by that of Tet methylcytosine dioxygenases, though a previous study27 reported the involvement of active DNA demethylation (Extended Data Fig. 5f). The absence of gain-of-mCG events until the postnatal period may involve translational and/or posttranslational regulation of these enzymes. Importantly, WGBS does not distinguish between 5-methylcytosine and 5-hydroxymethylcytosine29. Based on earlier studies9,30, 5-hydroxymethylcytosines are relatively rare. However, further studies that directly measure the full complement of cytosine modifications are needed to understand their dynamics during fetal tissue development.

Linking dynamic mCG and chromatin states

To further pinpoint the timing of CG-DMR remethylation and its relationship with enhancer activity, we clustered forebrain-specific CG-DMRs based on their mCG and H3K27ac dynamics across both fetal and adult stages (Fig. 2e; Extended Data Fig. 5g; Methods). In all clusters, mCG increased dramatically between the first and second postnatal weeks and increased even further during tissue maturation at the adult stage (Extended Data Fig. 5h).

We then interrogated the association between mCG dynamics and predicted enhancer activity (approximated by H3K27ac abundance). Although depletion of mCG was not necessarily related to H3K27ac enrichment (e.g. clusters 3, 5 and 6), high mCG was indicative of low H3K27ac (Fig. 2e, f). Only 2–9% of the highly methylated CG-DMRs (mCG level > 0.6) showed high H3K27ac enrichment (>6), whereas 25–28% of lowly methylated ones (mCG level < 0.2) were enriched for H3K27ac (Fig. 2f). These observations suggest that decreases in cytosine methylation occurring during fetal progression may precede and promote enhancer activity by increased TF binding and/or altered histone modifications.
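The comparison above groups CG-DMRs by mCG level and asks what fraction of each group is H3K27ac-enriched. A minimal sketch of that tabulation follows, using the cutoffs quoted in the text (mCG < 0.2, mCG > 0.6, H3K27ac enrichment > 6). Treating everything in between as the middle group is an assumption, and the function names and example values are hypothetical.

```python
def methylation_bin(mcg):
    """Assign a CG-DMR to L (low), M (middle) or H (high) by its mCG level."""
    if mcg < 0.2:
        return "L"
    if mcg > 0.6:
        return "H"
    return "M"  # assumption: everything between the two quoted cutoffs


def fraction_h3k27ac_high(dmrs, h3k27ac_cutoff=6.0):
    """Fraction of CG-DMRs per mCG bin with H3K27ac enrichment above the cutoff.

    dmrs: iterable of (mcg_level, h3k27ac_enrichment) pairs.
    Returns a dict keyed by bin; None marks an empty bin.
    """
    totals = {"L": 0, "M": 0, "H": 0}
    high = {"L": 0, "M": 0, "H": 0}
    for mcg, h3k27ac in dmrs:
        label = methylation_bin(mcg)
        totals[label] += 1
        if h3k27ac > h3k27ac_cutoff:
            high[label] += 1
    return {b: (high[b] / totals[b] if totals[b] else None) for b in totals}


# Hypothetical (mCG, H3K27ac) values for eight CG-DMRs in one tissue.
example = [
    (0.05, 9.0), (0.15, 2.0), (0.10, 7.5), (0.18, 1.0),
    (0.40, 6.5), (0.55, 3.0),
    (0.80, 1.2), (0.95, 0.4),
]
fractions = fraction_h3k27ac_high(example)
```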

Large-scale mCG features

In mouse neurons and a variety of human tissues, some CG-DMRs were found clustered, forming kilobase-scale hypomethylated domains, termed “large hypo CG-DMRs”8,31. We identified 273–1,302 such CG-DMRs in fetal tissues by merging adjacent CG-DMRs (Supplementary Table 3; Methods). For example, we found two limb-specific large hypo CG-DMRs upstream of Lmx1b, a gene critical to limb development32 (Extended Data Fig. 6a). The mCG level of CG-DMRs within the same large hypo CG-DMR is well correlated (average Pearson correlation coefficient 0.76–0.86; Extended Data Fig. 6b). Compared with typical CG-DMRs, large hypo CG-DMRs showed higher levels of H3K4me1 and H3K27ac, while 25–57% of them overlapped with the putative super-enhancers33,34 defined by extremely high H3K27ac (Extended Data Fig. 6c, d; Methods). Similar to super-enhancers, the majority (58%−79%) of large hypo CG-DMRs are intragenic (fold-enrichment = 1.36–1.84, p-value < 0.001, Monte Carlo testing; Methods) and are associated with genes related to tissue functions (Supplementary Table 4).
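Merging adjacent CG-DMRs into larger domains is, at its core, an interval-merging step. The sketch below shows one generic way to do it; the maximum gap (`max_gap`) is a hypothetical parameter, since the actual distance and any size or hypomethylation filters are described in Methods and not in this section.

```python
def merge_adjacent(regions, max_gap):
    """Merge same-chromosome regions separated by at most max_gap bp.

    regions: iterable of (chrom, start, end) tuples, half-open coordinates.
    Returns a list of (chrom, start, end, n_merged) tuples.
    """
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start - merged[-1][2] <= max_gap:
            prev_chrom, prev_start, prev_end, count = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end), count + 1)
        else:
            merged.append((chrom, start, end, 1))
    return merged


# Hypothetical CG-DMR coordinates; the 500 bp gap is illustrative only.
dmrs = [
    ("chr2", 1000, 1300),
    ("chr2", 1500, 1900),
    ("chr2", 9000, 9400),
    ("chr1", 500, 800),
    ("chr2", 2100, 2500),
]
large_candidates = merge_adjacent(dmrs, max_gap=500)
```

Keeping the number of merged CG-DMRs in each output tuple makes it easy to apply a later filter, for example retaining only domains built from several CG-DMRs.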

We also found a different multi-kilobase DNA methylation feature called DNA methylation valley or DMV35,36 (Supplementary Table 5; Methods). DMVs are ubiquitously unmethylated in all tissues across their developmental trajectory, whereas large hypo CG-DMRs display spatiotemporal hypomethylation patterns (Extended Data Fig. 7a, b). In fact, less than 4% of large hypo CG-DMRs overlapped with DMVs. Also, 53–58% of the DMV genes encode TFs compared to 8–17% for large hypo CG-DMRs (Extended Data Fig. 7c). The absence of repressive DNA methylation in DMVs implies that the expression of TF genes may be regulated by alternative mechanisms. Indeed, 510 out of 706 (72.2%) DMV genes are targets of the Polycomb repressive complex reported in the companion paper Gorkin et al. (fold-enrichment = 2.3, p-value < 0.001, hypergeometric test).
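The enrichment statement above combines a fold-enrichment with a hypergeometric test. A minimal sketch of both quantities follows. The background gene count and the number of Polycomb targets in that background are not given in this section, so the worked call uses small toy numbers; only 510, 706 and 2.3 come from the text.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    upper = min(n, K)
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / comb(N, n)


def fold_enrichment(k, n, K, N):
    """Observed annotated fraction in the gene set over the background fraction."""
    return (k / n) / (K / N)


# Toy numbers for illustration: 10 background genes, 4 annotated,
# a set of 3 genes of which 2 are annotated.
toy_p = hypergeom_upper_tail(k=2, n=3, K=4, N=10)
toy_fold = fold_enrichment(k=2, n=3, K=4, N=10)

# From the text: 510 of 706 DMV genes are Polycomb targets, with a reported
# fold-enrichment of 2.3. Dividing gives the background fraction this implies.
observed_fraction = 510 / 706
implied_background_fraction = observed_fraction / 2.3
```

Under these reported figures, roughly 31% of background genes would be Polycomb targets; the exact background used in the study is defined in Methods.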

mCH domains predict gene silencing

A less well-understood form of cytosine DNA methylation found in mammalian genomes is called non-CG methylation (mCH)15. mCH accumulates at detectable levels in nearly all tissues/organs during fetal progression (Fig. 3a). Interestingly, in brain tissues, the timing of mCH accumulation correlates with the developmental maturation (down-regulation of neural progenitor markers37,38 and up-regulation of neuronal markers39) in sequential order of hindbrain, midbrain, and forebrain (Fig. 3a; Extended Data Fig. 8a, b). Previous studies reveal that mCH is preferentially deposited at 5’-CAG-3’ context in embryonic stem cells and at 5’-CAC-3’ in adult tissues15. In all fetal tissues, mCH is enriched at CAC sites and such specificity further increased as the tissues mature, implying a similar DNMT3A-dependent mCH pathway in both fetal and adult tissues15 (Extended Data Fig. 8c).

Figure 3. mCH accumulation predicts reduced gene expression.

a, Global mCH level dynamics for each tissue. The adult forebrain is approximated using postnatal 6-week frontal cortex9.

b, Pax3 overlapping mCH domain.

c, mCH domain clustering based on mCH dynamics.

d, mCH domain genes. Dark blue bars highlight transcription factor encoding genes.

e, The most enriched biological process terms from EnrichR47 for mCH domain genes. P-values were calculated using one-tailed Fisher’s exact test with sample size being 69, 28, 41, 234 and 213 for C1, C2, C3, C4 and C5, respectively. P-values are adjusted for multiple testing correction using the Benjamini-Hochberg method.

Intriguingly, mCH preferentially accumulated at large genomic regions that we termed “mCH domains”, which show higher mCH level than their flanking sequences (Fig. 3b). We identified 384 mCH domains, which averaged 255kb in length (Methods). Strikingly, 92% of them and 61% of their bases are intragenic (fold-enrichment = 1.20 and 1.43, p-value < 0.001, Monte Carlo testing). 22% (128 out of 582) of the mCH domain genes (e.g. Pax3) encode TFs, many of which are related to tissue development/organogenesis (fold-enrichment = 3.23, p-value < 0.001, Monte Carlo testing).
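The fold-enrichments and p-values above come from Monte Carlo testing. One common way to obtain both is to compare the observed statistic with the same statistic computed on many randomly shuffled region sets; the sketch below assumes that design, which may differ from the procedure in Methods. The function name and the null values are hypothetical.

```python
def monte_carlo_enrichment(observed, null_values):
    """Fold-enrichment and one-sided empirical p-value against a shuffled null.

    observed: statistic for the real regions (e.g. fraction of bases in genes).
    null_values: the same statistic for each randomly shuffled region set.
    The +1 terms keep the estimated p-value above zero.
    """
    if not null_values:
        raise ValueError("need at least one null value")
    null_mean = sum(null_values) / len(null_values)
    n_extreme = sum(1 for value in null_values if value >= observed)
    p_value = (n_extreme + 1) / (len(null_values) + 1)
    return observed / null_mean, p_value


# Hypothetical example: 61% of domain bases are genic (from the text),
# compared with four made-up shuffles. A real test would use far more.
fold, p = monte_carlo_enrichment(0.61, [0.40, 0.43, 0.45, 0.42])
```

With this estimator, a p-value below 0.001 can only be reported when at least 1,000 shuffles have been run and none reaches the observed value.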

To further explore the dynamics of mCH accumulation, we grouped mCH domains into 5 clusters, C1–5 (Fig. 3b, c; Extended Data Fig. 8d; Methods). mCH domains in C1, C4 and C5 acquire mCH in all tissues (Fig. 3c). Interestingly, C1 is enriched for genes related to neuron differentiation, whereas C4 and C5 overlap genes associated with embryo development (Fig. 3d, e; Supplementary Table 6). In contrast to these ubiquitous mCH domains, C2 gains mCH mostly in heart, while C3 is brain-specific and overlaps genes related to axon guidance (Fig. 3d, e).

As mCH accumulates in mCH domains during fetal progression, the mCH domain genes tend to be repressed compared to outside genes, especially by P0 (Extended Data Fig. 8e, f). Since mCH domain genes are related to tissue/organ/embryo development, our data suggest that mCH is associated with silencing of the pathways of early fetal development. Interestingly, 382 of the 582 mCH domain genes are targeted by the Polycomb repressive complex, as reported in the companion paper by Gorkin et al. (fold-enrichment = 2.0, p-value < 0.001, hypergeometric test). Consistent with our findings across fetal tissues, one study40 on postnatal brain reported that mCH acquired in gene bodies during postnatal brain development also repressed transcription. Further experiments, especially in the developing embryo, are necessary to delineate the mechanism of mCH regulation and its potential role in transcriptional regulation.

Enhancer annotation based on multi-omic data

To further investigate dynamic transcriptional regulation in developing fetal tissues, we predicted fetal CG-DMRs that are likely associated with enhancer activity using the REPTILE41 algorithm through the integration of mCG, histone modifications and chromatin accessibility profiles. We identified in total 468,141 candidate fetal enhancer-linked CG-DMRs or feDMRs (Methods; Supplementary dataset). feDMRs show enhancer-like chromatin signatures including open chromatin, mCG and H3K27me3 depletion, and H3K4me1 and H3K27ac enrichment7,23,42 (Fig. 4a). 99,582 (21.3%) of feDMRs were not previously reported in adult mouse tissues24 and 58,307 (12.4%) were not captured by the chromatin state model (compared to the putative enhancers from Gorkin et al. companion paper; Fig. 4b).

Figure 4. Enhancer annotation of developing mouse tissues.

a, Chromatin signatures of feDMRs in E11.5 heart. The aggregate plots show the average histone modifications (left), and chromatin accessibility and mCG (right) profiles of +/− 5kb regions flanking the feDMR centers.

b, The overlap between feDMRs, the adult enhancers from Yue et al.24, and putative enhancers from Gorkin et al. (companion paper). The letter in parentheses indicates the enhancer set from which the number is calculated. “g” and “y” refer to the putative enhancers from Gorkin et al. and Yue et al., respectively. Numbers related to feDMRs are underlined.

c, True positive rate of putative enhancers on 100 down-sampled VISTA datasets in each E11.5 tissue for (from left to right): top 1–2,500 and 2,501–5,000 feDMRs, top 2,500 and 2,501–5,000 feDMRs without overlap with the putative enhancers from Gorkin et al., top 2,500 putative enhancers from Gorkin et al. (blue), and random regions (grey). The sample size is 1,000 for boxes of random regions and 100 for the rest. Random regions are 10 sets of randomly selected genomic regions with GC density and evolutionary conservation matching the top 5,000 feDMRs. Blue and black lines indicate the fraction of elements that are experimentally validated enhancers (positives), and random positive rate, respectively. The top and bottom of a box represent the upper quartile (Q3, 75th percentile) and lower quartile (Q1, 25th percentile), while the middle line indicates the median. The upper and lower whiskers extend 1.5 times (Q3–Q1) above Q3 and below Q1. Single points are outlier data points that are beyond the whiskers.

To evaluate the likelihood of functionality of these putative fetal enhancers, we intersected feDMRs with VISTA enhancer browser DNA elements26, which were tested for enhancer activity by in vivo transgenic reporter assay in E11.5 mouse embryos. Even after carefully controlling for biases in the dataset, 37%−55% of the VISTA elements overlapping the 2,500 (top 3%−7%) most confident feDMRs showed in vivo enhancer activity in matched tissues (Fig. 4c; Extended Data Fig. 9; Supplementary Note 4). Also, in any given tissue, feDMRs cover 73%−88% of the chromatin-state-based putative enhancers, and capture experimentally validated enhancers missing from the chromatin-state-based putative enhancers without compromising accuracy (Fig. 4c; Extended Data Fig. 9d). These results are consistent with our previous findings that incorporating DNA methylation data improves enhancer prediction41. The validity of feDMRs is further supported by their evolutionary conservation, enrichment of TF binding motifs related to specific tissue function(s) and the enrichment of neighboring genes in specific tissue-related pathways (Extended Data Fig. 2e–g; Supplementary Tables 7, 8; Methods).

Associating mCG, enhancers and gene expression

Lastly, we interrogated the association of mCG dynamics with the expression of genes in different biological processes/pathways. Using the weighted correlation network analysis (WGCNA)43 method, we identified 33 co-expressed gene clusters (co-expression modules, CEMs) and calculated “eigengenes” to summarize the expression profile of genes within modules (Fig. 5a, b; Extended Data Fig. 10a; Methods). Genes sharing similar expression profiles are more likely to be regulated by a common mechanism and/or involved in the same pathway (Extended Data Fig. 10b; Supplementary Table 9). For example, genes in CEM12, which are related to cell cycle, are highly expressed in early developmental stages but are down-regulated as the tissues mature, matching our knowledge that cells become post-mitotic in mature tissues (Fig. 5c; Extended Data Fig. 10c).

Figure 5. Association between mCG, gene expression and disease-associated SNPs.

a, Expression profiles for 2,500 of the most variable genes.

b, 33 CEMs identified in WGCNA and their eigengene expression. Bolded CEMs are related to (c).

c, The most enriched biological process terms of genes in four representative CEMs using EnrichR47. P-values were based on one-tailed Fisher’s exact test with sample size 6,766, 602, 126 and 2,968 for CEM3, CEM12, CEM29 and CEM32, respectively. P-values were adjusted for multiple testing correction using the Benjamini-Hochberg method.

d, Correlation of the tissue-specific eigengene expression (orange) for each developmental stage with the mCG level / enhancer score (blue/red) z-scores of feDMRs linked to the genes in CEM32. Pearson correlation coefficients were calculated (n = 7, 11 and 8 for E11.5, E14.5 and P0 respectively).

e-f, Pearson correlation coefficients of mCG level/enhancer score (blue/red) of feDMRs linked to the genes in each CEM with (e) tissue-specific eigengene expression across all 33 CEMs on all stages, and (f) temporal eigengene expression across all CEMs in all tissue types, excluding liver. P-values were based on the two-tailed Mann-Whitney test (n = 231 for (e) and n = 363 for (f)). See the legend of Figure 4c for description of boxplots.

g, feDMRs are enriched for human GWAS SNPs associated with tissue/organ specific functions and tissue-related disease states. P-values were calculated using linkage disequilibrium (LD) score regression45, which were adjusted for multiple testing correction using the Benjamini-Hochberg approach.

To understand how mCG and enhancer activity of feDMRs are associated with the expression of genes in CEMs, we linked feDMRs to their neighboring genes. Then, we correlated eigengene expression of each CEM with the average mCG (or enhancer score) of feDMRs linked to the genes in that CEM (Methods). To tease out tissue-specific and temporal associations, we calculated the correlation across tissues and across developmental stages separately. Across all tissue samples from a given developmental stage, mCG of feDMRs was negatively correlated with eigengene expression, whereas enhancer score was positively correlated (Fig. 5d, e). We then calculated the correlation across samples of a given tissue type from different developmental stages. While mCG levels generally decreased at feDMRs over development (Fig. 2a), the enhancer score remained positively correlated with temporal expression (Fig. 5f; Extended Data Fig. 10d). These results imply that feDMRs likely drive both tissue-specific and temporal gene expression.

Human orthologs of feDMRs are enriched for genetic risk factors

The vast majority of genetic variants associated with human diseases identified in genome-wide association studies (GWAS) are located in non-coding regions. These non-coding variants as well as the heritability of human diseases are enriched in the distal regulatory elements of related tissues and cell types44,45. The spatiotemporal mouse enhancer activity annotation (feDMRs) and the degree of evolutionary conservation in regulatory elements between human and mouse24 make it possible to analyze disease- and trait-associated loci and to pinpoint the related tissue(s) and developmental timepoint(s) in the mouse ENCODE data. To do this, we applied stratified linkage disequilibrium (LD) score regression45 to partition the heritability of 27 traits in the human orthologous regions of the mouse feDMRs (Methods). We found that the heritability of human disease- and trait-associated SNPs is significantly enriched in the orthologous regions of mouse feDMRs for each corresponding tissue (Fig. 5g; Supplementary Table 10). For example, the heritability of schizophrenia and “years of education” is enriched in forebrain- and midbrain-specific feDMRs, while craniofacial- and limb-specific feDMRs are enriched for the heritability of height (Fig. 5g). This analysis also showed that some associations between traits/diseases and tissue-specific feDMRs were present only at certain developmental stages (Fig. 5g). For example, schizophrenia loci are associated with forebrain feDMRs only at E12.5–P0. Similar results are also found for the human orthologs of spatiotemporal differential open chromatin regions (Gorkin et al. companion paper). Given current challenges in obtaining human fetal tissues, our results suggest the possibility of integrating human genetic data with fetal spatiotemporal epigenomic data from model organisms to predict the relevant tissue/organ type for a variety of human developmental diseases.

Discussion

In this study, we describe the generation and analysis of a comprehensive collection of base-resolution, genome-wide maps of cytosine DNA methylation for 12 fetal tissues/organs from 8 distinct developmental stages of mouse embryogenesis. By integrating DNA methylation with data from the same tissue samples reported in companion papers (histone modification and chromatin accessibility data from Gorkin et al.; RNA sequencing data from He and Williams et al.), we annotated 1,808,810 methylation variable genomic elements, encompassing nearly a quarter (613 Mb) of the mouse genome, and generated predictions for 461,141 fetal enhancer elements. The counterparts of these fetal enhancers in the human genome are tissue-specifically enriched for genetic risk loci associated with a variety of developmental disorders/diseases. Such enrichments suggest the possibility of generating new mouse models of human disease by introducing the candidate disease-associated alleles into feDMRs using genomic editing techniques46.

The temporal nature of these datasets enabled us to uncover surprisingly simple mCG dynamics at predicted DNA regulatory regions. During fetal development, methylation decreases at predicted fetal regulatory elements in all tissues until birth, after which time it dramatically rises. Since the tissues that we have investigated are composed of a variety of cell types, a fraction of the observed dynamics potentially results from DNA methylation changes during the differentiation of individual cell types and/or the changing cell type composition during development. In spite of the tissue heterogeneity, such dynamics suggest a plausible regulatory principle whereby metastable repressive mCG is removed to enable more rapid, flexible modes of gene regulation (e.g. histone modification/chromatin accessibility).

In addition, our findings extend current knowledge of non-CG methylation, an understudied context of cytosine modification. We observed that during fetal development mCH accumulates preferentially, and in a tissue-specific manner, at genomic locations that are each hundreds of kilobases in size. We termed this novel genomic feature the “mCH domain”. Genes that lie in mCH domains are down-regulated in their expression as mCH further accumulates during the later stages of fetal development. Though its function remains debatable, in vivo and in vitro studies indicated that mCH directly increases the binding affinity of MeCP218, which is highly expressed in the brain and mutation of which leads to Rett Syndrome. Gene-rich mCH domains in non-brain tissues are likely enriched for yet to be discovered mCH binding proteins, which, like MeCP2, may be involved in recruiting transcriptional repressor complexes, promoting the observed gene repression.

Despite the broad scope of this study, it is important to note its limitations. First, several tissues, such as skeleton, gonads and pancreas, were not included in the dataset. Also, sex-related differences were not studied. In addition, the tissues examined in this study are heterogeneous, and thus future efforts to examine the epigenomes of individual cells will be critical to achieving a deeper understanding of the gene regulatory programs.

Overall, we present the most comprehensive set of temporal fetal tissue epigenome mapping data available in terms of the number of developmental stages and tissue types investigated, expanding upon the previous phase of mouse ENCODE project24, which focused exclusively on adult mouse tissues. Our results highlight the power of this dataset for analyzing regulatory element dynamics in fetal tissues during in utero development. These spatiotemporal epigenomic datasets provide a valuable resource for answering fundamental questions about gene regulation during mammalian tissue/organ development as well as the possible origins of human developmental diseases.

Methods

Data Availability

All whole-genome bisulfite sequencing (WGBS) data from mouse embryonic tissues are available at the ENCODE portal (https://www.encodeproject.org/) and/or deposited in the NCBI Gene Expression Omnibus (GEO) (Supplementary Table 1). The additional RNA-seq dataset of forebrain, midbrain, hindbrain and liver is available at the NCBI Gene Expression Omnibus (GEO) under accession GSE100685. All other data used in this study, including chromatin immunoprecipitation sequencing (ChIP-seq), assay for transposase-accessible chromatin using sequencing (ATAC-seq), RNA-seq and additional WGBS data, are available at the ENCODE portal and/or GEO (Supplementary Table 2).

Code availability

methylpy (1.0.2) and REPTILE (1.0) are available at https://github.com/yupenghe/methylpy and https://github.com/yupenghe/REPTILE. Custom code used for this study is available at https://github.com/yupenghe/encode_dna_dynamics. This work used computational resources from the Extreme Science and Engineering Discovery Environment (XSEDE)48.

Abbreviations

AD: adult

CEM: co-expression module

mC: cytosine DNA methylation

mCG: CG methylation

mCH: non-CG methylation

TF: transcription factor

H3K4me1: Histone 3 lysine 4 monomethylation

H3K4me3: Histone 3 lysine 4 trimethylation

H3K27me3: Histone 3 lysine 27 trimethylation

H3K27ac: Histone 3 lysine 27 acetylation

WGBS: whole-genome bisulfite sequencing

REPTILE: Regulatory Element Prediction based on TIssue-specific Local Epigenetic marks

GWAS: genome-wide association study

SNP: single nucleotide polymorphism

TPM: transcripts per million

WGCNA: weighted gene co-expression network analysis

RPKM: Reads Per Kilobase per Million mapped reads

CG-DMR: CG differentially methylated region

feDMR: fetal enhancer linked CG-DMR

fd-feDMR: flanking distal feDMR

pd-feDMR: primed distal feDMR

unxDMR: unexplained CG-DMR

te-unxDMR: transposon overlapping unxDMR

nte-unxDMR: non transposon overlapping unxDMR

TSS: transcription start sites

CGI: CpG island

DMV: DNA methylation valley

PMD: partially methylated domain

AD-A enhancer: adult active enhancer

AD-V enhancer: adult vestigial enhancer

FB: forebrain

MB: midbrain

HB: hindbrain

NT: neural tube

HT: heart

CF: craniofacial

LM: limb

KD: kidney

LG: lung

ST: stomach

IT: intestine

LV: liver

Tissue Collection

All animal work was reviewed and approved by the Lawrence Berkeley National Laboratory Animal Welfare and Research Committee or the University of California, Davis Institutional Animal Care and Use Committee.

Mouse fetal tissues were dissected from embryos of different developmental stages from female C57BL/6N Mus musculus animals. Animals used to obtain tissue materials from the E14.5 and P0 stages were purchased from both Charles River Laboratories (C57BL/6NCrl strain) and Taconic Biosciences (C57BL/6NTac strain). For tissues of the remaining developmental stages, animals (C57BL/6NCrl strain) were purchased from Charles River Laboratories. The number of embryos or P0 pups collected was determined by whether the materials were sufficient for genomic assays, and was not based on statistical considerations. 15–120 embryos/pups were collected for each replicate of each tissue of each stage.

Tissue Excision And Fixation

See Supplementary File 1–2 for details.

MethylC-seq Library Construction and Sequencing

MethylC-seq libraries were constructed as previously described8 and a detailed protocol is available49. An Illumina HiSeq 2500 system was used for all whole genome bisulfite sequencing (WGBS) using either 100 or 130 base single-ended reads.

Mouse Reference Genome Construction

For all analyses in this study, we used mm10 as the reference genome, which includes 19 autosomes and two sex chromosomes (corresponding to the “mm10-minimal” reference in ENCODE portal, https://www.encodeproject.org/). The fasta files of mm10 were downloaded from UCSC genome browser (Jun 9 2013)50.

WGBS Data Processing

All WGBS data were mapped to the mm10 mouse reference genome as previously described51. WGBS processing includes mapping of the bisulfite-treated phage lambda genome spike-in as a control to estimate the sodium bisulfite non-conversion rate. This pipeline, called methylpy, is available on GitHub (https://github.com/yupenghe/methylpy). Briefly, cytosines within WGBS reads were first computationally converted to thymines. The converted reads were then aligned by bowtie (1.0.0) onto the forward strand of the C-T converted reference genome and the reverse strand of the G-A converted reference genome, separately. We filtered out reads that were not uniquely mapped or were mapped to both computationally converted genomes. Next, PCR duplicate reads were removed. Lastly, methylpy counts the methylated basecalls (cytosines) and unmethylated basecalls (thymines) for each cytosine position in the corresponding reference genome sequence (mm10 or lambda).
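The in-silico conversion step can be illustrated with a minimal Python sketch. This is not methylpy's code: the function names are hypothetical, and the sketch only shows the C-to-T and G-to-A substitutions that are applied to sequences before alignment.

```python
def convert_c_to_t(seq: str) -> str:
    """Replace every cytosine with thymine (reads and the forward-strand reference)."""
    return seq.upper().replace("C", "T")


def convert_g_to_a(seq: str) -> str:
    """Replace every guanine with adenine (the reverse-strand reference)."""
    return seq.upper().replace("G", "A")


read = "ACGTCCGA"
c_to_t_read = convert_c_to_t(read)
g_to_a_reference = convert_g_to_a(read)
print(c_to_t_read)
print(g_to_a_reference)
```

After conversion, reads and references share a three-letter alphabet, so alignment no longer depends on the methylation state of individual cytosines; methylation is recovered afterwards by comparing the original read bases with the original reference.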

Calculation of Methylation Level

Methylation level was computed to measure the degree of DNA methylation of single cytosines or larger genomic regions. The methylation level is defined as the ratio of the sum of methylated basecall counts over the sum of both methylated and unmethylated basecall counts at one cytosine or across sites in a given region52, minus the sodium bisulfite non-conversion rate. The sodium bisulfite non-conversion rate is defined as the methylation level of the bisulfite-treated lambda genome.

We calculated this metric for cytosines in both CG context and CH contexts (H=A, C or T). The former is called the CG methylation (mCG) level or mCG level while the latter is called the CH methylation (mCH) level or mCH level.
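As a worked illustration of this definition, the following Python sketch computes the methylation level of a single site or of a region from basecall counts. The function names and the input layout are assumptions made for the example. The text does not state whether negative values are floored at zero, so the sketch leaves them unchanged.

```python
from typing import Iterable, Tuple


def methylation_level(methylated: int, unmethylated: int,
                      non_conversion_rate: float) -> float:
    """Methylated fraction of basecalls minus the non-conversion rate."""
    total = methylated + unmethylated
    if total == 0:
        raise ValueError("no basecalls cover this site or region")
    return methylated / total - non_conversion_rate


def region_methylation_level(sites: Iterable[Tuple[int, int]],
                             non_conversion_rate: float) -> float:
    """Pool (methylated, unmethylated) counts across sites, then apply the same ratio."""
    methylated = 0
    unmethylated = 0
    for mc, uc in sites:
        methylated += mc
        unmethylated += uc
    return methylation_level(methylated, unmethylated, non_conversion_rate)


# Three CG sites in a hypothetical region, with a 0.4% non-conversion rate.
level = region_methylation_level([(8, 2), (1, 9), (0, 10)], 0.004)
print(round(level, 3))
```

Pooling counts before dividing weights each site by its coverage, which matches the "sum of counts" wording above rather than an average of per-site ratios.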

Quality Control of WGBS data

We calculated several quality control metrics for all the WGBS data and the results are presented in Supplementary Table 1. For each tissue sample, we calculated cytosine coverage, sodium bisulfite conversion rate, and reproducibility between biological replicates. 1) Cytosine coverage is the average number of reads that cover cytosine. In the calculation, we combined the data of both strands. 2) Sodium bisulfite conversion rate measures the sodium bisulfite conversion efficiency and is calculated as one minus the methylation level of unmethylated lambda genome. 3) The reproducibility of biological replicates is defined as the Pearson correlation coefficient of mCG quantification between biological replicates for sites covered by at least 10 reads.

All of the WGBS data pass the ENCODE standards (https://www.encodeproject.org/wgbs/#standards) and were accepted by the ENCODE consortium. Almost all of the biological replicates of tissue samples have at least 30× cytosine coverage. All biological replicates have at least 99.5% sodium bisulfite conversion rate. All non-liver tissue samples have reproducibility greater than 0.8. The reproducibility of liver samples is slightly lower but is still greater than 0.7. The reduced reproducibility is due to increased sampling variation, which is a result of genome-wide hypomethylation in the liver genome.

ChIP-seq Data Processing

ChIP-seq data were processed using the ENCODE uniform processing pipeline for ChIP-seq. In brief, Illumina reads were first mapped to the mm10 reference using bwa53 (version 0.7.10) with parameters “-q 5 -l 32 -k 2”. Next, the Picard tool (http://broadinstitute.github.io/picard/, version 1.92) was used to remove PCR duplicates using the following parameters: “REMOVE_DUPLICATES=true”.

For each histone modification mark, we represented it as continuous enrichment values of 100bp bins across the genome. The enrichment was defined as the RPKM (Reads Per Kilobase per Million mapped reads) after subtracting ChIP input. The enrichment across the genome was calculated using bamCompare in Deeptools254 (2.3.1) using options “--binSize 100 --normalizeUsingRPKM --extendReads 300 --ratio subtract”. For the ChIP-seq data of EP300, we used MACS55 (1.4.2) to call peaks using default parameters.

RNA-seq Data

Processed RNA-seq data for all fetal tissues, from all stages was downloaded from the ENCODE portal (https://www.encodeproject.org/; Supplementary Table S2).

To further validate our findings regarding transcriptomes generated across laboratories (Wold and Ecker), we generated an additional two replicates of RNA-seq data for fetal forebrain, midbrain, hindbrain and liver tissues. We first extracted total RNA using the RNeasy Lipid tissue mini kit from Qiagen (cat no.#74804). Then, we used the Truseq Stranded mRNA LT kit (Illumina, RS-122–2101 and RS-122–2102) to construct stranded RNA-seq libraries from 4 µg of the extracted total RNA. An Illumina HiSeq 2500 was used to sequence the libraries and generate 130-base single-ended reads.

RNA-seq Data Processing and Gene Expression Quantification

RNA-seq data was processed using the ENCODE RNA-seq uniform processing pipeline. Briefly, RNA-seq reads were mapped to the mm10 mouse reference using STAR56 aligner (version 2.4.0k) with GENCODE M4 annotation57. We quantified gene expression levels using RSEM (version 1.2.23)58, expressed as transcripts per million (TPM). For all downstream analyses, we filtered out non-expressed genes and only retained genes that showed non-zero TPM in at least 10% of samples.
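The expression filter can be sketched in a few lines of Python. This is an illustration only: the function name and the dictionary layout (gene identifier mapped to a list of TPM values, one per sample) are assumptions, not the pipeline's actual data structures.

```python
from typing import Dict, List


def expressed_genes(tpm_by_gene: Dict[str, List[float]],
                    min_fraction: float = 0.10) -> List[str]:
    """Keep genes with non-zero TPM in at least `min_fraction` of samples."""
    kept = []
    for gene, values in tpm_by_gene.items():
        nonzero = sum(1 for value in values if value > 0)
        if nonzero >= min_fraction * len(values):
            kept.append(gene)
    return kept


# Hypothetical TPM values across 10 samples.
tpm = {
    "geneA": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3.2],   # non-zero in 1 of 10 samples
    "geneB": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],     # never detected
    "geneC": [5.1, 4.8, 6.0, 0, 2.2, 1.1, 0, 0, 3.3, 7.5],
}
print(expressed_genes(tpm))
```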

ATAC-seq Data

ATAC-seq data for all fetal tissues, from all stages was downloaded from the ENCODE portal (https://www.encodeproject.org/; Supplementary Table S2). ATAC-seq reads were mapped to mm10 genome using bowtie (1.1.2) with flag “-X 2000 --no-mixed --no-discordant”. Then, we removed PCR duplicates using samtools53 and discarded mitochondrial reads. Next, we converted read ends to account for Tn5 insertion position by moving the read end position by 4bp towards the center of the fragment. We converted paired-end read ends to single-ended read ends. Last, we used MACS2 (2.1.1.20160309) with flags “--nomodel --shift 37 --ext 73 --pval 1e-2 -B --SPMR --call-summits” to generate signal track file in bigwig format. MACS calculated ATAC-seq read fold enrichment over the background MACS moving window model. Such fold enrichment is used as the intensity/signal of chromatin accessibility.

Genomic Features of Mouse Reference Genome

We used GENCODE M457 gene annotation in this study. CG island (CGI) annotation was downloaded from UCSC genome browser (Sep 5, 2016)50. CGI shores are defined as the upstream 2kb and downstream 2kb regions along CGIs. Promoters are defined as regions from −2.5kb to +2.5kb around transcription start sites (TSSs). CGI promoters are defined as those overlapping with CGIs while the remaining promoters are called non-CGI promoters.

We also obtained a list of mappable transposons using the following procedure. RepeatMasker annotation of the mm10 mouse genome was downloaded from UCSC genome browser (Sep 12, 2016)50. The annotation includes 5,138,231 repeats. We acquired the transposon annotation by selecting only the repeats belonging to one of the following repeat classes (repClass): “DNA”, “SINE”, “LTR” or “LINE”. Then, we excluded any repeat elements with a question mark in their name (repName), class (repClass) or family (repFamily). For the remaining 3,643,962 transposons, we further filtered out elements that contained fewer than 2 CG sites or cases where less than 60% of CG sites within were covered by at least 10 reads across all samples when the data of two replicates were combined. Finally, we utilized the remaining set of 1,688,189 mappable transposons for analyses in this study.

CG Differentially Methylated Region (CG-DMRs)

We identified CG-DMRs using methylpy (https://github.com/yupenghe/methylpy) as previously described51. Briefly, we first called differentially methylated sites (DMSs) and then merged them into blocks if they showed similar sample-specific methylation patterns and were within 250bp of each other. Last, we filtered out blocks containing fewer than three DMSs. In this procedure, we combined the data from the two biological replicates for all tissues, excluding liver samples due to global hypomethylation of the genome.
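The merging step can be sketched as follows. This is a simplified illustration and not methylpy's implementation: the "similar sample-specific methylation pattern" criterion is reduced here to equality of a hypothetical per-site label, and the function name and tuple layout are assumptions.

```python
from typing import List, Tuple

Site = Tuple[str, int, str]          # (chromosome, position, pattern label)
Block = Tuple[str, int, int, int]    # (chromosome, start, end, number of DMSs)


def merge_dms(sites: List[Site], max_gap: int = 250,
              min_sites: int = 3) -> List[Block]:
    """Merge position-sorted DMSs into blocks and drop blocks with too few sites."""
    blocks: List[List[Site]] = []
    current: List[Site] = []
    for chrom, pos, label in sites:
        if current:
            last_chrom, last_pos, last_label = current[-1]
            same_block = (chrom == last_chrom
                          and pos - last_pos <= max_gap
                          and label == last_label)
            if not same_block:
                blocks.append(current)
                current = []
        current.append((chrom, pos, label))
    if current:
        blocks.append(current)
    return [(b[0][0], b[0][1], b[-1][1], len(b))
            for b in blocks if len(b) >= min_sites]


dms = [("chr1", 100, "hypo_FB"), ("chr1", 200, "hypo_FB"), ("chr1", 400, "hypo_FB"),
       ("chr1", 1000, "hypo_FB"), ("chr1", 1100, "hypo_FB")]
print(merge_dms(dms))
```

In this toy input the first three sites form one block, while the last two sites are more than 250bp away from it and are discarded because they contain fewer than three DMSs.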

We overlapped the resulting fetal tissue CG-DMRs with CG-DMRs previously identified in Hon et al11 using “intersectBed” from bedtools59 (v2.27.1). The mm9 coordinates of the CG-DMRs from Hon et al. were first mapped to mm10 using liftOver50 with default parameters. Overlap of CG-DMRs is defined as a CG-DMR with at least one base overlap with another CG-DMR when comparing genomic coordinates between lists.

Identification of Tissue-Specific CG-DMRs

For each fetal tissue type, we defined tissue-specific CG-DMRs as those that showed hypomethylation in a tissue sample from any fetal stage (E10.5 to P0). Hypomethylation is only meaningful with a baseline, thus we used an outlier detection algorithm60 to define the baseline mCG level of each CG-DMR across tissue samples using the mean of the “bulk”, which is defined as the value for the narrowest mCG level range that includes half of all samples.

Specifically, $x_s^i$ is the mCG level of CG-DMR $i$ ($i = 1, 2, \ldots$) in tissue sample $s$ ($s = 1, \ldots, N$). Assuming the samples are ordered such that $x_1^i \le x_2^i \le \cdots \le x_s^i \le \cdots \le x_N^i$, the baseline is defined as the mean of the bulk, $b^i = \frac{1}{[N/2]+1}\sum_{s=a}^{a+[N/2]} x_s^i$, where $a$ is the sample index such that $x_{a+[N/2]}^i - x_a^i$ is minimized, i.e. $a = \arg\min_t \left(x_{t+[N/2]}^i - x_t^i\right)$. $[N/2]$ is defined as the smallest integer that is greater than $N/2$. Lastly, we defined hypomethylated samples as samples in which the mCG level at CG-DMR $i$ is at least 0.3 smaller than the baseline $b^i$, i.e. $\{s \mid (x_s^i - b^i) \le -0.3\}$. Then, CG-DMR $i$ is specific to these tissues. Liver data was not included in this analysis and we excluded CG-DMRs that had zero coverage in any of the non-liver samples. In total, only 402 (~0.02%) CG-DMRs were filtered out.
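The baseline and hypomethylation call for a single CG-DMR can be sketched in Python as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the bulk window follows the index range described above (sorted samples a through a + [N/2], with [N/2] the smallest integer greater than N/2), and the baseline is taken as the mean of that window.

```python
from typing import List


def bulk_baseline(levels: List[float]) -> float:
    """Mean mCG level of the narrowest window spanning [N/2] + 1 sorted samples."""
    n = len(levels)
    half = n // 2 + 1                      # smallest integer greater than N/2
    if n <= half:
        raise ValueError("too few samples to define a bulk window")
    ordered = sorted(levels)
    # Choose the start index a that minimizes x[a + half] - x[a].
    best_a = min(range(n - half), key=lambda a: ordered[a + half] - ordered[a])
    bulk = ordered[best_a:best_a + half + 1]
    return sum(bulk) / len(bulk)


def hypomethylated_samples(levels: List[float], min_drop: float = 0.3) -> List[int]:
    """Indices of samples whose mCG level is at least `min_drop` below the baseline."""
    baseline = bulk_baseline(levels)
    return [i for i, x in enumerate(levels) if x - baseline <= -min_drop]


# Hypothetical mCG levels of one CG-DMR across seven tissue samples.
mcg = [0.84, 0.10, 0.86, 0.80, 0.95, 0.82, 0.85]
print(round(bulk_baseline(mcg), 3))
print(hypomethylated_samples(mcg))
```

Because the baseline is estimated from the tightest cluster of samples, a single strongly hypomethylated tissue (the second sample here) does not pull the baseline down and is flagged as tissue-specific.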

Linking CG-DMRs with Genes

We linked CG-DMRs to their putative target genes based on genomic distance. First, we only considered expressed genes, which showed non-zero TPM in at least 10% of all fetal tissue samples. Next, we obtained coordinates for transcriptional start sites (TSSs) of the expressed genes and paired each CG-DMR with the closest TSS using “closestBed” from bedtools59. In this way, we inferred a target gene for each CG-DMR; these gene/TSS associations were used in all subsequent analyses in this study.

Predicting Fetal Enhancer-linked CG-DMRs (feDMRs)

The REPTILE41 algorithm was used to identify the CG-DMRs that showed enhancer-like chromatin signatures. We called these fetal enhancer-like CG-DMRs or feDMRs. REPTILE uses a random forest classifier to learn and then distinguish the epigenomic signatures of enhancers and genomic background. One unique feature of REPTILE is that by incorporating the data of additional samples (as outgroup/reference), it is able to employ epigenomic variation information to improve enhancer prediction. In this study, REPTILE was run using input data from CG methylation (mCG), chromatin accessibility (ATAC-seq) and six histone marks (H3K4me1, H3K4me2, H3K4me3, H3K27ac, H3K27me3 and H3K9ac).

A REPTILE enhancer model was trained in a similar way as in the previous study41. Briefly, CG-DMRs were called across the methylomes of mouse embryonic stem cells (mESCs) and all eight E11.5 mouse tissues. CG-DMRs were required to contain at least 2 DMSs and they were extended 150bp in each direction (5’ and 3’). The REPTILE model was trained on the mESC data using E11.5 mouse tissues as outgroup. Data from mCG and six histone modifications are available for these samples. The training dataset consists of 5,000 positive instances (putative known enhancers) and 35,000 negative instances. Positives were 2kb regions centered at the summits of the top 5,000 EP300 peaks in mESCs. Negatives include 5,000 randomly chosen promoters and 30,000 2kb genomic bins. The bins have no overlap with any positives or promoters. REPTILE learned the chromatin signatures that distinguish positive instances from negative instances.

Next, using this enhancer model, we applied REPTILE to delineate feDMRs from the 1,808,810 CG-DMRs identified across all non-liver tissues. feDMRs were predicted for each sample based on data from mCG and six core histone marks, while the remaining non-liver samples were used as an outgroup. In REPTILE, the random forest classifier for CG-DMR assigns a confidence score ranging from 0.0 to 1.0 to each CG-DMR in each sample. This score corresponds to the fraction of decision trees in the random forest model that vote in favor of the CG-DMR being an enhancer. Previous benchmarks showed that the higher the score, the more likely that a CG-DMR shows enhancer activity41. We named this confidence score the enhancer score. For each tissue sample, feDMRs are defined as CG-DMRs with enhancer score greater than 0.3. feDMRs were also defined for each tissue type as the CG-DMRs that were identified as an feDMR in at least one tissue sample of that tissue type. For example, if a CG-DMR was predicted as an feDMR only in E14.5 forebrain, it was classified as a forebrain-specific feDMR.
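The two levels of feDMR calls (per sample and per tissue type) can be sketched as below. The sketch assumes that enhancer scores have already been produced by REPTILE; the function name, identifiers and the nested-dictionary layout are hypothetical.

```python
from typing import Dict, Set, Tuple

Sample = Tuple[str, str]  # (tissue, developmental stage)


def call_fedmrs(scores: Dict[Sample, Dict[str, float]], threshold: float = 0.3):
    """Return feDMR sets per sample (score > threshold) and their union per tissue."""
    per_sample: Dict[Sample, Set[str]] = {
        sample: {dmr for dmr, score in dmr_scores.items() if score > threshold}
        for sample, dmr_scores in scores.items()
    }
    per_tissue: Dict[str, Set[str]] = {}
    for (tissue, _stage), dmrs in per_sample.items():
        per_tissue.setdefault(tissue, set()).update(dmrs)
    return per_sample, per_tissue


# Hypothetical enhancer scores for two CG-DMRs in three samples.
enhancer_scores = {
    ("forebrain", "E11.5"): {"dmr1": 0.10, "dmr2": 0.20},
    ("forebrain", "E14.5"): {"dmr1": 0.45, "dmr2": 0.30},
    ("heart", "E14.5"): {"dmr1": 0.20, "dmr2": 0.90},
}
per_sample, per_tissue = call_fedmrs(enhancer_scores)
print(per_tissue)
```

Here "dmr1" passes the cutoff only in E14.5 forebrain and is therefore a forebrain feDMR, whereas a score of exactly 0.3 does not qualify because the cutoff is strict.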

We overlapped the feDMRs with putative adult enhancers from Yue et al24. We utilized a set of coordinates identifying the center base position of putative enhancers for each of the tissues and cell types from http://yuelab.org/mouseENCODE/predicted_enhancer_mouse.tar.gz. Next, we defined putative enhancers as +/− 1kb regions around the centers. Putative enhancers from different tissues and cell types were combined and merged if they overlapped. The merged putative enhancers (mm9) were then mapped to the mm10 reference using liftOver50. Finally, “intersectBed” from bedtools59 was used to overlap feDMRs with these putative enhancers.

Evaluating feDMRs with Experimentally Validated Enhancers

We used the enhancer dataset from the VISTA enhancer browser26 to estimate the fraction of feDMRs that actually display enhancer activity in vivo. Specifically, we calculated the fraction of feDMR-overlapping VISTA elements that are experimentally validated as enhancers, which we termed the “true positive rate”. We evaluated the true positive rate of feDMRs for 6 E11.5 tissues (forebrain, midbrain, hindbrain, heart, limb and neural tube), where there are at least 30 VISTA elements that are experimentally validated enhancers (positives).

However, the VISTA elements were not selected at random. Compared to randomly selected sequences, they are more enriched for enhancers, which will lead to an overestimation of the true positive rate. To reduce the impact of selection bias, we need to first estimate the fraction of VISTA elements that are positives (positive rate) in a given tissue if there is minimal selection bias. We termed this fraction the genuine positive rate. Then, we can sample the current VISTA dataset to construct datasets with a positive rate matching the genuine positive rate. Since the positive rate is not inflated in the constructed datasets, it will allow a fair evaluation of our enhancer prediction approach. See Supplementary Note 4 for details.

Using the bias-controlled datasets, we calculated true positive rate of feDMRs for each E11.5 tissue. First, we ranked feDMRs by their enhancer scores (from highest to lowest). We then overlapped the top 2,500 (or top 2501–5000) feDMRs of given E11.5 tissue with VISTA elements, requiring that at least one feDMR is fully contained for a VISTA element to be counted as overlapped. Last, we calculated the fraction of feDMR-overlapping VISTA that are experimentally validated enhancers in the given tissue, i.e. true positive rate.

To better interpret the true positive rate of feDMRs, we also evaluated 5,000 randomly selected genomic bins with GC content and degree of evolutionary conservation (PhyloP score) matching the top 5,000 feDMRs. We used this method as a baseline. For each E11.5 tissue, we repeated this random selection process 10 times and generated 10 sets of random regions. Next, we calculated the true positive rate of each set of the random regions in the bias-controlled datasets. As an additional baseline method, we also calculated the positive rate of VISTA elements that did not overlap with any feDMRs or H3K27ac peaks.

Comparing feDMRs with Putative Enhancers Based on Chromatin State

Chromatin state based putative enhancers are genomic regions labeled as enhancer states (state 5, 6 and 7) by ChromHMM61 in non-liver tissue samples (Gorkin et al companion paper, this issue). To fairly compare its validation rate with that of feDMRs, we need to select the top 2,500 putative enhancers. ChromHMM does not assign score and therefore we instead ranked these elements using the H3K27ac signal. Then, we calculated the fraction of the top 2,500 putative enhancers that are overlapped with feDMRs.

To test whether feDMRs are able to capture more enhancers than chromatin states, we computed the validation rate of the non-overlapping feDMRs. Also, we calculated the validation rate of ChromHMM enhancers by overlapping them with VISTA elements. This is used as additional baseline for evaluating feDMRs.

Enriched Transcription Factor (TF) Binding Motifs in Tissue-Specific feDMRs

To identify TF motifs enriched in feDMRs, we scanned the genome to delineate TF motif occurrences as previously described62. Briefly, we utilized TF binding position weight matrices (PWMs) from the MEME motif database (v11, 2014 Jan 23. motif sets chen2008, hallikas2006, homeodomain, JASPAR_CORE_2014_vertebrates, jolma2010, jolma2013, macisaac_theme.v1, uniprobe_mouse, wei2010_mouse_mw, wei2010_mouse_pbm, zhao2011). Then, FIMO63 was used to scan the genome to identify TF motif occurrences using options “--output-pthresh 1E-5 --max-stored-scores 500000”.

Next, we performed a hypergeometric test to identify significant motif enrichments. For each tissue type, we calculated the motif enrichment for feDMRs in that tissue (foreground) against a list of feDMRs identified for other tissues not overlapping with the foreground tissue list. For this analysis, we extended the average size of both foreground and background feDMRs to 400bp to avoid bias due to size differences. For a given tissue $t$, the total number of foreground and background feDMRs is $N_{f,t}$ and $N_{b,t}$, respectively, and $N_t = N_{f,t} + N_{b,t}$ is the total number of feDMRs. For a given TF binding motif $m$, TF motif occurrences are overlapped with $n_{f,t,m}$ foreground and $n_{b,t,m}$ background feDMRs, while $n_{t,m} = n_{f,t,m} + n_{b,t,m}$ is the total number of overlapping feDMRs. The probability of observing $n_{f,t,m}$ or more overlapping foreground feDMRs (p-value) is defined as:

$$P(X \ge n_{f,t,m} \mid N_{f,t}, n_{f,t,m}, N_{b,t}, n_{b,t,m}) = \sum_{x=n_{f,t,m}}^{n_{t,m}} \frac{\binom{N_{f,t}}{x}\binom{N_{b,t}}{n_{t,m}-x}}{\binom{N_t}{n_{t,m}}}$$

For each tissue type, we performed this test for all motifs (n=532). Then, the p-values of each tissue were adjusted using Benjamini-Hochberg method and the motifs were called as significant if they passed 1% FDR cutoff. Lastly, we excluded any TF-binding motifs whose TF expression level was less than 10 TPM. The results are listed in Supplementary Table 11.
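The enrichment test and multiple-testing correction described in this subsection can be sketched with the Python standard library. This is an illustrative re-implementation, not the code used in the study; the function and argument names are assumptions, and the counts in the example are made up.

```python
from math import comb
from typing import List


def hypergeom_upper_tail(n_fg: int, n_bg: int, hits_fg: int, hits_bg: int) -> float:
    """P(X >= hits_fg) when hits_fg + hits_bg motif-overlapping feDMRs are drawn
    from n_fg foreground and n_bg background feDMRs."""
    hits = hits_fg + hits_bg
    numerator = sum(comb(n_fg, x) * comb(n_bg, hits - x)
                    for x in range(hits_fg, hits + 1))
    return numerator / comb(n_fg + n_bg, hits)


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):           # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy example: 10 foreground and 10 background feDMRs, motif found in 5 foreground only.
p = hypergeom_upper_tail(n_fg=10, n_bg=10, hits_fg=5, hits_bg=0)
print(p)
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A motif would be called significant for a tissue when its adjusted p-value is at most 0.01, mirroring the 1% FDR cutoff stated above.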

Enriched Pathways and Biological Processes of feDMR Neighboring Genes

For each tissue stage, we used the GREAT64 tool to find enriched pathways and biological processes of genes near feDMRs identified in that tissue. For each tissue stage, GREAT was run under the “Single nearest gene” association strategy on the 10,000 feDMRs with the highest enhancer score. The GREAT analysis results are listed in Supplementary Table 12.

Enrichment of Heritability in feDMRs for Human Diseases and Traits

We applied stratified linkage disequilibrium (LD) score regression45 to test for the heritability enrichment of different traits in feDMRs. The code for LD score regression was from https://github.com/bulik/ldsc (March 2nd, 2018). LD score regression was performed on HapMap365 single nucleotide polymorphisms (SNPs) downloaded from https://data.broadinstitute.org/alkesgroup/LDSCORE/weights_hm3_no_hla.tgz. Then, the SNP list was further filtered to the SNPs used in a pretrained baseline model (https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_baselineLD_v1.1_ldscores.tgz). LD score was calculated based on the data of the European population in the 1000 Genomes Project66 (https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_plinkfiles.tgz) and the minor allele frequency of SNPs in this population was downloaded from https://data.broadinstitute.org/alkesgroup/LDSCORE/1000G_Phase3_frq.tgz. The summary statistics of 27 traits were downloaded from https://data.broadinstitute.org/alkesgroup/sumstats_formatted/. “PASS_Years_of_Education1.sumstats” was ignored because the summary statistics of a more recent study on years of education were available.

To obtain the human orthologous regions of the CG-DMRs, we used liftOver to map mouse CG-DMRs (mm10) to hg19, requiring that at least 50% of the bases in a CG-DMR could be assigned to hg19 (using option -minMatch=0.5). In total, 1,034,801 out of 1,808,810 (57%) mouse CG-DMRs could be aligned to the human genome.

Then, for each tissue sample, we overlapped the human orthologous regions of its feDMRs with SNPs in 1000 genome SNPs and calculated LD score using 1000 genome data. However, only the LD score of SNPs in the pretrained baseline model were reported and used for later analysis. LD score was calculated using option “--ld-wind-cm 1”.

Last, we performed LD score regression for each trait and the feDMRs of each tissue sample with option “--overlap-annot”. The regression model used in the test included feDMRs and the annotations in the pretrained baseline model as before45. The latter was used to control for non-tissue-specific enrichment in generic regulatory elements such as all promoters45. In total, we performed 1,593 tests (27 traits × 59 tissue samples). P-value was calculated based on the reported coefficient z-score (Coefficient_z-score) using the R function pnorm with parameter “lower.tail=F”. The “Coefficient_z-score” was based on 200 rounds of block jackknife resampling and thus the sample size of this statistical test is 200. To correct p-value inflation due to multiple comparisons, we applied the Benjamini-Hochberg approach separately on the p-values from tests on the feDMRs of each tissue sample. The p-value cutoff given a 5% false discovery rate (FDR) was used to call significant enrichment.
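For readers working outside R, the conversion from coefficient z-score to p-value can be reproduced with the Python standard library. R's `pnorm(z, lower.tail=F)` returns the upper-tail probability of the standard normal distribution, which equals `0.5 * erfc(z / sqrt(2))`. The function name below is hypothetical and the z-scores are made-up values.

```python
from math import erfc, sqrt


def upper_tail_p(z: float) -> float:
    """One-sided p-value: probability that a standard normal variable exceeds z."""
    return 0.5 * erfc(z / sqrt(2.0))


# Hypothetical coefficient z-scores for one tissue sample across three traits.
z_scores = {"trait_1": 3.1, "trait_2": 0.0, "trait_3": -1.2}
p_values = {trait: upper_tail_p(z) for trait, z in z_scores.items()}
for trait, p in p_values.items():
    print(trait, p)
```

Because the test is one-sided, negative z-scores (depletion rather than enrichment) give p-values above 0.5 and can never be called significant.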

Categorizing CG-DMRs

To better understand the potential functions of CG-DMRs, we grouped them into various categories based on their genomic location and chromatin signatures. First, we overlapped CG-DMRs with promoters, CGIs and CGI shores and defined the CG-DMRs overlapping these locations as proximal CG-DMRs. Out of the 153,019 proximal CG-DMRs, 46,692, 90,831, 1,710 and 13,786 overlapped with CGI promoters, non-CGI promoters, CGIs and CGI shores, respectively. We avoided assigning proximal CG-DMRs into multiple categories by prioritizing the four genomic features as CGI promoter, non-CGI promoter, CGI and CGI shores (ordered in decreasing priority). Each CG-DMR was assigned to the category with the highest priority.
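The priority rule for proximal CG-DMRs can be sketched as a simple lookup. The category strings and the function name are illustrative assumptions; the sketch presumes that the set of features overlapping each CG-DMR has already been computed.

```python
from typing import Optional, Set

# Genomic features in decreasing priority, as described above.
PRIORITY = ["CGI promoter", "non-CGI promoter", "CGI", "CGI shore"]


def assign_proximal_category(overlaps: Set[str]) -> Optional[str]:
    """Return the highest-priority overlapping feature, or None for a distal CG-DMR."""
    for category in PRIORITY:
        if category in overlaps:
            return category
    return None


print(assign_proximal_category({"CGI shore", "CGI promoter"}))
print(assign_proximal_category({"CGI", "CGI shore"}))
print(assign_proximal_category(set()))
```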

We further classified the remaining 1,655,791 distal CG-DMRs:

  1. 397,320 of them were predicted as distal feDMRs (CG-DMRs that show enhancer-like chromatin signatures42,67) as described in the previous method section.

  2. Next, we defined flanking distal feDMRs as the CG-DMRs that were within 1kb of distal feDMRs but were not predicted as enhancers (feDMRs). In total we found 212,620 such CG-DMRs.

  3. Then, among the remaining, unclassified CG-DMRs, 159,347 CG-DMRs were identified as tissue-specific CG-DMRs in at least one of the tissues because they displayed strong tissue-specific hypomethylation patterns (mCG difference ≥ 0.3). By checking the enrichment of histone marks in their hypomethylated tissues, we found they were enriched for H3K4me1 but not other histone marks, and such chromatin signatures resembled those of primed enhancers68. Therefore, we defined these CG-DMRs as primed distal feDMRs.

  4. Lastly, we defined the remaining CG-DMRs as unexplained CG-DMRs (unxDMRs) because their functional roles could not be assigned yet. We found unxDMRs have strong overlap with transposons and we further divided them into two classes: te-unxDMRs (n = 449,623) and nte-unxDMRs (n = 436,881). te-unxDMRs are unxDMRs that are overlapped with transposons, while the remaining were nte-unxDMRs.

Evolutionary Conservation of CG-DMRs

The evolutionary conservation of CG-DMRs was measured using the PhyloP score69 from the UCSC genome browser50 (http://hgdownload.cse.ucsc.edu/goldenpath/mm10/phyloP60way/mm10.60way.phyloP60way.bw). Next, Deeptools254 was used to generate the profile of evolutionary conservation of the CG-DMR centers and +/− 5kb flanking regions using options “reference-point --referencePoint=center -a 5000 -b 5000”.

To get the fraction of CG-DMRs that are evolutionarily conserved, we overlapped CG-DMRs from different categories with conserved DNA elements in the mouse genome. The list of conserved elements was downloaded from the UCSC genome browser50 (phastConsElements60Way in the mm10 mouse reference).

CG-DMR Effect Size

We defined the effect size of a CG-DMR as the absolute difference in mCG level between the most hypomethylated tissue sample and the average of the samples in the “bulk”. The average mCG level of a CG-DMR in the “bulk” samples estimates the baseline mCG level of that genomic region. The “bulk” samples are the 50% of all samples chosen such that the range of their mCG levels is narrowest (see the “Identification of Tissue-Specific CG-DMRs” Methods subsection for details). Under this definition, the effect size indicates the degree of hypomethylation of a CG-DMR. The effect size of a DMS is defined in the same way.
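As an illustration of this definition, the following Python sketch computes the effect size for one CG-DMR from its per-sample mCG levels. The function name is hypothetical, and rounding the “bulk” size up for an odd number of samples is an assumption that the text does not specify.

```python
def effect_size(mcg_levels):
    """Effect size of one CG-DMR from its per-sample mCG levels (values in [0, 1])."""
    values = sorted(mcg_levels)
    n = len(values)
    k = (n + 1) // 2  # assumption: "50% of all samples" is rounded up for odd n
    # The k samples with the narrowest range are always contiguous in sorted order,
    # so it is enough to slide a window of size k over the sorted values.
    start = min(range(n - k + 1), key=lambda i: values[i + k - 1] - values[i])
    bulk = values[start:start + k]
    baseline = sum(bulk) / k
    # values[0] is the mCG level in the most hypomethylated sample.
    return abs(baseline - values[0])
```

For per-sample levels of 0.2, 0.5, 0.8, 0.82, 0.85 and 0.9, the bulk is {0.8, 0.82, 0.85}, and the effect size is the bulk mean minus 0.2.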

Finding TF-binding Motifs Enriched in Flanking Distal feDMRs

To identify the TF-binding motifs enriched in flanking distal feDMRs relative to feDMRs, we performed motif analysis using the former as foreground and the latter as background. Specifically, for each tissue, the tissue-specific feDMRs were used as background, while flanking distal feDMRs that were within 1kb of these tissue-specific feDMRs were used as foreground. To avoid potential bias residing in different size distribution, both foreground and background regions were extended from both sides (5’ and 3’) such that both had mean size of 400bp. Next a hypergeometric test was performed to find TF-binding motifs that were significantly enriched in foreground. This test was the same as that used for the identification of TF-binding motifs in feDMRs.

TF-binding Motif Enrichment Analysis for Primed Distal feDMRs

We also performed motif analysis to identify TF-binding motifs enriched in primed distal feDMRs. The procedure was similar to the motif enrichment analysis on feDMRs. For each tissue, the primed distal feDMRs hypomethylated in that tissue were considered as foreground while the remaining primed distal feDMRs were considered as background. Then, a hypergeometric test was performed to identify significant motif enrichment.

Next, for each tissue type, we compared the TF-binding motifs enriched in primed distal feDMRs and the tissue-specific feDMRs. The hypergeometric test was used to test the significance of overlap – the chance of obtaining the observed overlap if the two lists were based on random sampling (without replacement) from the TF-binding motifs with TF expression level greater than 10 TPM.

Monte Carlo test of the Overlap between unxDMR and Transposons

To estimate the significance of the overlap between unxDMRs and transposons, we shuffled the locations of unxDMRs using the “shuffleBed” tool from bedtools59 with default settings and recalculated the overlaps. After repeating this step 1,000 times, we obtained an empirical estimate of the overlap expected if unxDMRs were randomly distributed in the genome. Let the observed number of TE-overlapping unxDMRs be $x_{\mathrm{obs}}$ and the number of TE-overlapping shuffled unxDMRs in permutation $i$ be $x_i^{\mathrm{permut}}$. We then calculated p-values as

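$$p = \frac{\left[\sum_{i=1}^{1000} I\left(x_{\mathrm{obs}} \le x_i^{\mathrm{permut}}\right)\right] + 1}{1000 + 1}$$

where $I(x) = \begin{cases} 1 & x = \text{true} \\ 0 & x = \text{false} \end{cases}$.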
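The empirical p-value can be computed as in the short Python sketch below. The function name is hypothetical, and the sketch assumes that the indicator counts permutations whose overlap is at least as large as the observed overlap (a one-sided test for enrichment).

```python
def monte_carlo_p_value(x_obs, x_permuted):
    """Empirical one-sided p-value with +1 added to both numerator and denominator."""
    # Count permutations that reach or exceed the observed overlap.
    n_as_extreme = sum(1 for x in x_permuted if x_obs <= x)
    return (n_as_extreme + 1) / (len(x_permuted) + 1)
```

With 1,000 permutations the smallest attainable p-value is 1/1001, which is reached when no shuffled set overlaps transposons as often as the observed set.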

Identification of Large Hypo CG-DMRs

Large hypo CG-DMRs were called using the same procedure as previously described62. For each tissue type, tissue-specific CG-DMRs were merged if they were within 1kb of each other. Then, we filtered out merged CG-DMRs shorter than 2kb.

We overlapped genes with large hypo CG-DMRs and then filtered out any genes with names starting with “Rik” or “Gm[0–9]”, where [0–9] represents a single digit, because the ontology of these genes was ill-defined.
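The merge-and-filter step can be sketched in Python as follows. This is an illustration only, not the published pipeline: the function name and the tuple layout are hypothetical, and the sketch assumes that the length filter applies to every region remaining after merging.

```python
def call_large_hypo_dmrs(dmrs, max_gap=1000, min_length=2000):
    """dmrs: (chrom, start, end) tuples for one tissue type; returns merged regions >= min_length."""
    merged = []
    for chrom, start, end in sorted(dmrs):
        previous = merged[-1] if merged else None
        if previous and previous[0] == chrom and start - previous[2] <= max_gap:
            previous[2] = max(previous[2], end)  # extend the current merged region
        else:
            merged.append([chrom, start, end])   # start a new merged region
    return [tuple(region) for region in merged if region[2] - region[1] >= min_length]
```

For example, CG-DMRs at chr1:100-600 and chr1:1500-2500 are 900 bp apart and merge into a single 2.4 kb region, which passes the length filter.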

Super-enhancer Calling

Super-enhancers were identified using the ROSE34,70 pipeline. First, H3K27ac peaks were called using the macs255 callpeak module with options “--extsize 300 -q 0.05 --nomodel -g mm”. Control data were used in the peak-calling step. Next, ROSE was run with options “-s 12500 -t 2500”, with H3K27ac peaks, mapped H3K27ac ChIP-seq reads and mapped control reads as input. Super-enhancer calls were generated for each tissue sample. Then, we obtained the super-enhancers of one tissue type by merging the super-enhancers called at each stage of fetal development (E10.5 to P0). Last, we generated a list of merged super-enhancers by merging the super-enhancer calls of all tissue types except liver.

Quantification of the mCG Dynamics in Tissue-Specific CG-DMRs

To quantify mCG dynamics, we defined and counted loss-of-mCG and gain-of-mCG events. A loss-of-mCG (gain-of-mCG) event is a decrease (increase) in mCG level of at least 0.1 in one CG-DMR over one stage interval. For example, if the mCG level of one CG-DMR in heart is 0.8 at E11.5 and 0.7 at E12.5, a loss-of-mCG event is considered to have occurred in the stage interval E11.5-E12.5. A stage interval is defined as the transition between two adjacent sampled stages (e.g. E15.5 and E16.5).
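A minimal Python sketch of the event counting is given below. The function name is hypothetical, and the small tolerance is an assumption added so that a change of exactly 0.1 (such as 0.8 to 0.7) is not lost to floating-point rounding.

```python
def count_mcg_events(levels, min_change=0.1, tol=1e-9):
    """levels: mCG levels of one CG-DMR at consecutive sampled stages in one tissue.

    Returns (number of loss-of-mCG events, number of gain-of-mCG events).
    """
    loss = gain = 0
    for before, after in zip(levels, levels[1:]):  # one pair per stage interval
        delta = after - before
        if delta <= -min_change + tol:
            loss += 1
        elif delta >= min_change - tol:
            gain += 1
    return loss, gain
```

For the levels 0.8, 0.7, 0.72 and 0.9, the first interval is a loss-of-mCG event, the second is neither, and the third is a gain-of-mCG event.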

Clustering Forebrain-specific CG-DMRs based on mCG and H3K27ac Dynamics

We used k-means clustering to identify subgroups of forebrain-specific CG-DMRs based on mCG and H3K27ac dynamics. First, for each forebrain-specific CG-DMR, we calculated the mCG level and H3K27ac enrichment in forebrain samples from E10.5 to adult stages. Here, we used methylome data for postnatal 1, 2 and 6 week frontal cortex from Lister et al.9 to approximate the DNA methylation landscape of adult forebrain. We also incorporated H3K27ac data for postnatal 1, 3 and 7 week forebrain samples. Next, to make the range of H3K27ac enrichment values comparable to that of mCG levels, for each forebrain-specific CG-DMR, negative H3K27ac enrichment values were set to zero and then each value was divided by the maximum. If the maximum was zero for a forebrain-specific CG-DMR, we set all of its values to zero. k-means clustering of subgroups was carried out but no new patterns were observed. Lastly, we used GREAT64 employing the “Single nearest gene” association strategy to identify the enriched gene ontology terms of genes near CG-DMRs for each subgroup.
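The H3K27ac rescaling described above, which puts the enrichment values on the same 0 to 1 scale as methylation levels, can be written as a short Python sketch (the function name is hypothetical):

```python
def normalize_h3k27ac(enrichment):
    """Rescale one CG-DMR's H3K27ac enrichment values across stages into [0, 1]."""
    clipped = [max(value, 0.0) for value in enrichment]  # negative values become zero
    peak = max(clipped)
    if peak == 0:
        return [0.0] * len(clipped)  # no positive enrichment at any stage
    return [value / peak for value in clipped]
```

For example, enrichment values of -1, 2 and 4 become 0, 0.5 and 1.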

Association Between mCG level and H3K27ac Enrichment

To investigate the association between mCG and H3K27ac, for each tissue and each developmental stage, we first divided the tissue-specific CG-DMRs into three categories based on mCG methylation levels: H (highly CG methylated; mCG level > 0.6), M (moderately CG methylated; 0.2 < mCG level ≤ 0.6) and L (lowly CG methylated; mCG level ≤ 0.2). Then, we examined the distribution of H3K27ac enrichment in different groups of CG-DMRs by counting the number of CG-DMRs for each of four levels of H3K27ac: [0,2], (2, 4], (4, 6] and (6, ∞).
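The two binning rules can be sketched in Python as follows. The function names and bin labels are hypothetical, and H3K27ac enrichment is assumed to be non-negative.

```python
from collections import Counter

def mcg_category(mcg):
    """H: mCG > 0.6; M: 0.2 < mCG <= 0.6; L: mCG <= 0.2."""
    if mcg > 0.6:
        return "H"
    if mcg > 0.2:
        return "M"
    return "L"

def h3k27ac_bin(enrichment):
    """Assign a non-negative H3K27ac enrichment value to one of the four intervals."""
    for upper, label in ((2, "[0,2]"), (4, "(2,4]"), (6, "(4,6]")):
        if enrichment <= upper:
            return label
    return "(6,inf)"

def tabulate(dmrs):
    """dmrs: (mCG level, H3K27ac enrichment) pairs for one tissue at one stage."""
    return Counter((mcg_category(m), h3k27ac_bin(h)) for m, h in dmrs)
```

Calling `tabulate` on the tissue-specific CG-DMRs of one sample yields the number of CG-DMRs in each combination of methylation category and H3K27ac level.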

DNA Methylation Valley (DMV) Identification

We identified DMVs as previously described35. First, the genome was divided into 1kb non-overlapping bins. Then, for each tissue sample (replicate), consecutive bins with mCG level less than 0.15 were merged into blocks; bins with no data (no CG sites or no reads) were skipped. Next, any block merged from at least 5 bins with data was called a DMV. For each tissue sample, we filtered for DMVs reproducible in two replicates by first selecting the DMVs identified in one replicate that overlapped any DMV called in the other replicate, and then merging overlapping DMVs. Using this strategy, we obtained DMV calls for each tissue from each developmental stage. Lastly, we obtained a list of merged DMVs for all tissue samples by merging all DMVs identified in any tissue at any developmental stage.

We overlapped genes with DMVs and then filtered out any genes with names starting with “Rik” or “Gm[0–9]”, where [0–9] represents a single digit, because the ontology of these genes was ill-defined.
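The per-replicate block-calling step can be sketched in Python as below. This is an illustration under stated assumptions rather than the code used here: the function name is hypothetical, bins without data are represented as None, and a block is assumed to start and end at bins with data.

```python
def call_dmvs(bin_levels, bin_size=1000, max_mcg=0.15, min_bins=5):
    """bin_levels: mCG level of each consecutive bin on one chromosome (None = no data).

    Returns (start, end) coordinates of blocks built from at least min_bins bins with data.
    """
    dmvs = []
    start = last = None
    n_data = 0
    # A trailing high-methylation sentinel flushes a block that reaches the chromosome end.
    for i, level in enumerate(list(bin_levels) + [1.0]):
        if level is None:
            continue  # bins without data neither extend nor interrupt a block
        if level < max_mcg:
            if start is None:
                start, n_data = i, 0
            last = i
            n_data += 1
        else:
            if start is not None and n_data >= min_bins:
                dmvs.append((start * bin_size, (last + 1) * bin_size))
            start = None
    return dmvs
```

The replicate-reproducibility filter and the final merge across tissues and stages are not shown.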

Partially Methylated Domain (PMD) Identification

PMDs were identified as previously described8 using a random forest classifier. To train the classifier, we first visually selected regions on chromosome 19 as strong candidates for PMDs or non-PMDs in the E14.5 liver sample. Specifically, we manually annotated 4 PMDs that showed obviously lower mCG levels than adjacent genomic regions (chr19:46110000–46240000, chr19:45820000–45960000, chr19:47140000–47340000 and chr19:48060000–52910000) and 7 non-PMD regions (chr19:4713800–4928700, chr19:7420700–7541100, chr19:8738100–8967000, chr19:18633300–18713800, chr19:53315500–53390000, chr19:55256600–55633900 and chr19:59281600–59329200).

Next, these regions were divided into 10kb non-overlapping bins and we calculated the percentiles of the methylation levels at the CG sites within each bin. CG sites that were within CGIs, DMVs35 or any of four Hox loci (see below) were excluded as these regions are typically hypomethylated, which may result in incorrect PMD calling. Additionally, sites covered by fewer than 5 reads were also excluded. We trained the random forest classifier using data from E14.5 liver (combining the two replicates) and we then predicted whether a 10kb bin was a PMD or non-PMD in all liver samples (considering replicates separately). We chose a large bin size (10 kb) to reduce the effect of smaller scale methylation variation (such as DMRs) as PMDs were first discovered as large (mean length = 153kb, PMID: 19829295) regions with intermediate methylation level (< 70%, PMID: 19829295). Furthermore, the features (the distribution of methylation level of CG sites, which measured the fraction of CG sites that showed methylation level at various methylation level ranges) used in the classifier required enough CG sites within each bin to robustly estimate the distribution, which necessitated a relatively large bin. Also, we excluded any 10kb bins containing fewer than 10 CG sites for the same reason. These percentiles were used as features for the random forest. The random forest implementation was from the scikit-learn (version 0.17.1)71 Python module and the following arguments were supplied to the Python function RandomForestClassifier from scikit-learn: n_estimators = 10000, max_features=None, oob_score=True, compute_importances=True.

Lastly, we merged consecutive 10kb bins that were predicted as PMD into blocks and filtered out blocks smaller than 100kb. We further excluded blocks that overlapped with gaps in the mm10 genome (downloaded from the UCSC genome browser, Sep 21, 2013). To obtain a set of PMDs that were reproducible in both replicates, we only considered genomic regions that were larger than 100kb and were covered by PMD calls in both replicates. These regions were the final set of PMDs used for later analyses. Because there was only one replicate for adult liver, we called the PMDs at this stage using the single replicate.

PMDs were originally called using the above procedure without excluding CG sites in Hox gene clusters. However, because these Hox loci are more likely to be considered as large DMVs35, we removed any PMD that overlapped with the four Hox clusters (chr11:96257739–96358516, chr15:102896908–103038064, chr2:74648392–74748841 and chr6:52146273–52277140).

Overlap Between PMDs and LADs

To examine the relationship between PMDs and lamina associated domains (LADs) in normal mouse liver cells (AML12 hepatocytes), we used LAD data from Table S2 of Fu et al.72. The mm9 coordinates of LADs were converted to mm10 using liftOver with default settings. We then used Monte Carlo testing to examine the significance of the overlap between PMDs and LADs. Similar to the procedure for checking the overlap between TEs and unxDMRs, we permuted the genomic locations of PMDs 1,000 times and recorded the number of overlapping bases ($x_i^{\mathrm{shuf}}$ for permutation $i$) between shuffled PMDs and LADs. Then, we compared $x_i^{\mathrm{shuf}}$ with the observed number of overlapping bases ($x_{\mathrm{obs}}$) between PMDs and LADs and computed p-values as:

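$$p = \frac{\left[\sum_{i=1}^{1000} I\left(x_{\mathrm{obs}} \le x_i^{\mathrm{shuf}}\right)\right] + 1}{1000 + 1}$$

where $I(x) = \begin{cases} 1 & x = \text{true} \\ 0 & x = \text{false} \end{cases}$.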

Replication Timing Data

Replication timing data (build mm10) for three mouse cell types were obtained from ReplicationDomain73. The cell types used for these analyses were mESC (id: 1967902&4177902_TT2ESMockCGHRT), neural progenitor cells (id: 4180202&4181802_TT2NSMockCGHRT) and mouse embryonic fibroblasts (id: 304067–1 Tc1A).

Gene Expression in PMDs

We identified the protein-coding genes overlapping PMDs using “intersectBed”. A similar approach was used to obtain the protein-coding genes overlapping PMD flanking regions (100kb upstream and 100kb downstream of PMDs); genes overlapping PMDs were removed from this list. Lastly, we compared the expression of PMD-overlapping genes (n=5,748) with that of the genes overlapping flanking regions (n=2,555).

Sequence Context Preference of mCH

To interrogate the sequence preference of mCH, as previously described8, we first identified CH sites that showed a significantly higher methylation level than the low-level noise (around 0.005 in terms of methylation level) caused by sodium bisulfite non-conversion. For each CH site, we counted the number of reads that supported methylation and the number of reads that did not. Next, we performed a binomial test with the success probability equal to the sodium bisulfite non-conversion rate. FDR (1%) was controlled using the Benjamini-Hochberg approach74. This analysis was independently performed for each three-nucleotide context (e.g., a p-value cutoff was calculated for CAG cytosines). Lastly, we counted sequence motif occurrence in the +/−5bp around the tri-nucleotide context of methylated mCH sites and visualized the sequence preferences using seqLogo75.
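The site-level test can be illustrated with the following Python sketch, which uses only the standard library. The function names are hypothetical, the binomial test is assumed to be one-sided (upper tail), and the sketch handles a single tri-nucleotide context; it is not the implementation used in this study.

```python
from math import comb

def binomial_p_value(methylated, total, non_conversion_rate):
    """P(X >= methylated) for X ~ Binomial(total, non_conversion_rate)."""
    p = non_conversion_rate
    return sum(comb(total, k) * p**k * (1 - p)**(total - k)
               for k in range(methylated, total + 1))

def benjamini_hochberg(p_values, fdr=0.01):
    """Return one boolean per site: True if the site passes the BH procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank  # largest rank whose p-value is under its threshold
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant
```

A site covered by 20 reads, 4 of which support methylation, would be tested with `binomial_p_value(4, 20, 0.005)`, and the resulting p-values for all sites in one context would then be passed to `benjamini_hochberg`.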

Calling mCH Domains

We used an iterative process to call mCH domains, which are genomic regions that are enriched for mCH compared to flanking regions. First, we selected a set of control samples that showed no evidence of mCH. Data from these samples were used in the following steps to filter out genomic regions that are prone to misalignment and showed suspicious mCH abundance. Analysis of the global mCH level and mCH motifs revealed that E10.5 and E11.5 tissues (excluding heart samples) have extremely low mCH and that the significantly methylated non-CG sites showed little CA preference. Therefore, we assumed these samples contain no mCH domains and that any mCH domains called in these control samples by the algorithm were likely artifacts. By filtering out the domains called in the control samples, we were able to exclude genomic regions that were prone to mapping error and to avoid other potential drawbacks in the processing pipeline.

In order to identify genomic regions where sharp changes in mCH levels occurred, we applied a change point detection algorithm with the mCH levels of all 5kb non-overlapping bins across the genome as input. We only included bins that contained at least 500 CH sites, of which at least 50% were covered by 10 or more reads. The identified change points defined the boundaries that separate mCH domains from genomic regions showing background-level mCH. We implemented this step using the function cpt.mean in the R package “changepoint”, with options method=“PELT”, pen.value=0.05, penalty=“Asymptotic” and minseglen=2. To match the range of the chosen penalty, we scaled up mCH levels by a factor of 1,000.

The iterative procedure was carried out as follows: 1) An empty list of excluded regions was created. 2) For each control sample, the change point detection algorithm was applied to the scaled mCH levels of 5kb non-overlapping bins. Bins overlapping with excluded regions were ignored. 3) The genome was segmented into chunks based on the identified change points. 4) The mCH level of each chunk was calculated as the mean mCH level of the overlapping 5kb bins that did not overlap excluded regions. 5) mCH domains were identified as chunks whose mCH level was at least 50% greater than the mCH level of both the upstream and downstream chunks. A pseudo mCH level of 0.001 was used to avoid division by zero. 6) mCH domains were added to the list of excluded regions. 7) Steps 2 to 6 were repeated until the list of excluded regions stopped expanding. 8) Steps 2–5 were then applied to all samples. 9) For each tissue/organ, only regions identified as (part of) an mCH domain in both replicates were retained, and regions shorter than 15kb were filtered out, so that mCH domains spanned at least three bins. The above criteria were used to define mCH domains for each tissue/organ. 10) Individual mCH domains from each tissue and organ were merged to obtain a single combined list of 384 mCH domains.
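Step 5 of the procedure can be illustrated with the Python sketch below. The function name is hypothetical, and two details are assumptions that the text does not specify: the pseudo level is added to both the chunk and its neighbours before the ratio is taken, and chunks lacking a neighbour on either side are not called.

```python
def call_mch_domains(chunk_levels, fold=1.5, pseudo=0.001):
    """chunk_levels: mean mCH level of consecutive chunks on one chromosome.

    Returns the indices of chunks at least 50% higher than both neighbouring chunks.
    """
    domains = []
    for i in range(1, len(chunk_levels) - 1):
        here = chunk_levels[i] + pseudo
        upstream = chunk_levels[i - 1] + pseudo
        downstream = chunk_levels[i + 1] + pseudo
        if here / upstream >= fold and here / downstream >= fold:
            domains.append(i)
    return domains
```

In the iterative part of the procedure, the chunks returned for control samples would be added to the excluded regions and the segmentation repeated until no new chunk is called.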

Clustering of mCH Domains

We applied k-means clustering to group the 384 identified mCH domains into 5 clusters based on the normalized mCH accumulation profile of each mCH domain and its flanking regions (100kb upstream and 100kb downstream). Specifically, 1) in each tissue sample, the mCH accumulation profile of one mCH domain was represented as a vector of length 50: the mCH levels of 20 5kb bins upstream of the mCH domain, 10 bins that equally divided the mCH domain and 20 5kb bins downstream. 2) Then, we normalized all values by the average mCH level of the bins in the flanking regions (the 20 5kb bins upstream and 20 5kb bins downstream of the mCH domain). 3) We next computed the profile in samples of 6 tissue types (midbrain, hindbrain, heart, intestine, stomach and kidney) that showed the most evident mCH accumulation in fetal development. 4) Using the profiles of these tissue samples, k-means (R v3.3.1) was used to cluster mCH domains with k = 5. We also tried higher cluster numbers (e.g. 6) but did not identify any new patterns. Even using the current k setting (k=5), the mCH domains in clusters 1 (C1) and 3 (C3) shared similar mCH accumulation patterns.
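Steps 1 and 2 can be sketched in Python as follows for one mCH domain in one tissue sample. The function name is hypothetical, and the sketch assumes a non-zero average mCH level in the flanking bins.

```python
def normalized_profile(upstream, body, downstream):
    """Build the length-50 mCH profile of one domain, scaled by the flanking average.

    upstream, downstream: mCH levels of the 20 flanking 5 kb bins on each side.
    body: mCH levels of the 10 equal-width bins spanning the domain.
    """
    if not (len(upstream) == 20 and len(body) == 10 and len(downstream) == 20):
        raise ValueError("expected 20 upstream, 10 body and 20 downstream bins")
    flank_mean = (sum(upstream) + sum(downstream)) / 40
    profile = list(upstream) + list(body) + list(downstream)
    return [level / flank_mean for level in profile]
```

After normalization the flanking bins average 1, so values above 1 within the domain indicate mCH accumulation relative to the surrounding 200 kb. The per-sample vectors of the six tissue types would then be combined as the input to k-means.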

Genes in mCH Domains

We obtained the overlapping gene information for each of the mCH domains by overlapping gene bodies with mCH domains using “intersectBed” in bedtools59. Only protein-coding genes were considered. We further filtered out any genes with names starting with “Rik” or “Gm[0–9]”, where [0–9] represents a single digit, because the ontology of these genes was ill-defined. For the overlapping genes of each mCH domain cluster, we used EnrichR47,76 to find the enriched gene ontology terms (“GO_Biological_Process_2015”).

Next we asked whether the identified overlapping genes were enriched for TF-encoding genes. For this purpose, a list of mouse TFs from AnimalTFDB77 (Feb 27, 2017) was used. We then performed a Monte Carlo test to estimate the significance of the findings. Specifically, $x_{\mathrm{obs}}$ is the number of TF-encoding genes among all overlapping genes. We randomly selected the same number of genes 1,000 times; in the $i$th selection, $x_i^{\mathrm{permut}}$ of the randomly selected genes encoded TFs. Lastly, the p-value was calculated as

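$$p = \frac{\left[\sum_{i=1}^{1000} I\left(x_{\mathrm{obs}} \le x_i^{\mathrm{permut}}\right)\right] + 1}{1000 + 1}$$

where $I(x) = \begin{cases} 1 & x = \text{true} \\ 0 & x = \text{false} \end{cases}$.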

mCH Accumulation Indicates Gene Repression

To evaluate the association between mCH abundance and gene expression, we traced the expression dynamics of genes inside mCH domains. For mCH domains in each cluster, we first calculated the TPM z-score for each of the overlapping genes. Specifically, for each tissue type and each overlapping gene, we normalized TPM values in the samples of that tissue type to z-scores. The z-scores showed the trajectory of dynamic expression, from which the amplitude information of expression was removed. If the gene was not expressed, we did not perform the normalization. Next, we calculated the z-scores for all genes that had no overlap with any mCH domain. Lastly, we subtracted the z-scores of all genes outside mCH domains from the z-scores of the overlapping genes. The resulting values indicated the level of expression of genes in mCH domains relative to genes not in mCH domains.
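The per-gene normalization can be sketched in Python as below. The function name is hypothetical, and two choices are assumptions that the text does not specify: the population standard deviation is used, and a gene with constant TPM across the samples (including a gene that is never expressed) is left un-normalized.

```python
from statistics import mean, pstdev

def tpm_z_scores(tpms):
    """z-scores of one gene's TPM values across the samples of one tissue type."""
    sd = pstdev(tpms)
    if sd == 0:
        return None  # assumption: constant or unexpressed genes are not normalized
    mu = mean(tpms)
    return [(value - mu) / sd for value in tpms]
```

Because the mean and scale are removed, two genes with very different expression levels but the same temporal trend receive the same z-score trajectory.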

Weighted Correlation Network Analysis (WGCNA)

We used weighted correlation network analysis (WGCNA)78, an unsupervised method, to detect sets of genes with similar expression profiles across samples (R package, “WGCNA” version 1.51). Briefly, TPM values were first log2 transformed (with pseudo count 1e-5). Then, the TPM value of every gene across all samples was compared against the expression profile of all other genes and a correlation matrix was obtained. To obtain connection strengths between any two genes, we transformed this matrix to an adjacency matrix using a power adjacency function. To choose the parameter (soft threshold) of the power adjacency function, we used the scale-free topology (SFT) criterion, where the constructed network is required to at least approximate scale-free topology. The SFT criterion recommends use of the first threshold parameter value where model-fit saturation is reached as long as it is above 0.8. In this study, the threshold was reached for a power of 5.

Next, the adjacency matrix is further transformed to a topological overlap matrix (TOM) that finds “neighborhoods” of every gene iteratively, based on the connection strengths. The TOM was calculated based on the adjacency matrix derived using the signed hybrid network type, biweight mid correlation and signed TOMtype parameters of the TOMsimilarityFromExpr module in WGCNA. Hierarchical clustering of the TOM was done using the flashClust module using the average method. Next, we used the cutreeDynamic module with the hybrid method, deepSplit = 3 and minClusterSize = 30 parameters to identify modules that have at least 30 genes. A summarized module-specific expression profile was created using the expression of genes within the given module, represented by the eigengene. The eigengene is defined as the first principal component of the log2 transformed TPM values of all genes in a module. In other words, this is a virtual gene that represents the expression profile of all genes in a given module. Next, very similar modules were merged after a hierarchical clustering of the eigengenes of all modules applying a distance threshold of 0.15. Finally, the eigengenes were recalculated for all modules after merging.

Gene Ontology Analysis of Genes in Co-expression Modules (CEM)

To better understand the biological processes of genes in each CEM, we used Enrichr47,76 (http://amp.pharm.mssm.edu/Enrichr/) to identify the enriched gene ontology terms in the “GO_Biological_Process_2015” category.

Correlating Eigengene Expression with mCG and Enhancer Scores of feDMRs

We investigated the association between gene expression and epigenomic signatures of regulatory elements in CEMs. First, for each CEM, we used the eigengene expression to summarize the transcription patterns of all genes in the module. Then, we calculated the normalized average enhancer score and normalized average mCG level of all feDMRs that were linked to the genes in the CEM. Specifically, to reduce the potential batch effect, for each tissue and each stage, we normalized the enhancer score of each feDMR by the mean enhancer score of all feDMRs. mCG levels of feDMRs were normalized in similar way except that the data of all DMRs was used to calculate the mean mCG level for each tissue and each stage. Next, for each CEM, the TPM of its eigengene, the normalized average enhancer score and mCG level of linked feDMRs were converted to z-scores across all fetal stages for each tissue type (for analysis for tissue-specific expression) or across tissue types for each development stage (for analysis for temporal expression). Lastly, for each CEM, we calculated the Pearson correlation coefficient (R 3.3.1) between the z-score of eigengene expression and the z-score of normalized enhancer score (or mCG level) for each module. The correlation coefficients were calculated for two different settings: 1) for each tissue type, the correlation was computed using z-score of normalized eigengene expression values and enhancer scores (or mCG levels) across different development stages or 2) for each developmental stage, the correlation was computed across different tissue types. The coefficients from the former analysis indicate how well temporal gene expression is correlated with enhancer score or mCG level of regulatory elements, while the latter measures the association with tissue-specific gene expression.

We then tested whether the correlation that we observed was significant by comparing it with the correlation based on shuffled data. In the analysis for tissue-specific expression in a given tissue type, we mapped the eigengene expression of one CEM to the enhancer score (or mCG level) of feDMRs linked to genes in a randomly chosen CEM. For example, in the shuffled setting, when the given tissue type was heart, we calculated the correlation between the eigengene expression of CEM14 and the enhancer score of the feDMRs linked to genes in CEM6. In the analysis for temporal expression, given a specific developmental stage, we performed a similar permutation. Next, we calculated the Pearson correlation coefficients for this permutation setting. Lastly, using a two-tailed Mann-Whitney test, we compared the median of the observed correlation coefficients and the median of those based on shuffled data.

Extended Data

Extended Data Figure 1. Global hypomethylation in fetal liver.

a, Average mCG level of PMDs and flanking regions (+/−100kb) in liver samples from different developmental stages.

b, Normalized average mCG level of PMDs and flanking regions in liver samples. The mCG level was normalized (scaled) such that the average mCG level of +/−20kb regions around each PMD is 1.0.

c, The total bases that PMDs encompass in liver at different developmental stages.

d, Percentage of bases in the PMDs identified in each of the liver samples (E12.5 liver, E13.5 liver etc.) that are also within the PMDs identified in E15.5 liver sample.

e, Histone modification profiles for H3K9me3 (top), H3K27me3 (middle) and H3K27ac (bottom) within PMDs and flanking regions (+/−100kb) in liver samples from different developmental stages.

f, Replication timing profiling of PMDs and flanking regions (+/−100kb). The values indicate the tendency to be replicated at an earlier stage in the cell cycle.

g, Expression of genes overlapping PMDs and flanking regions (+/−100kb) (left) compared with those with no PMD overlap (right). Two plots on the bottom show the data from a validation dataset, containing RNA-seq data generated using a different protocol on matched tissues. The top and bottom of a box represents upper quartile (Q3 or 75% percentile) and lower quartile (Q1 or 25% percentile), while the middle line indicates the median. The upper and lower whiskers are 1.5 times of (Q3–Q1) above Q3 and below Q1. Single points are outlier data points that are beyond the whiskers.

Extended Data Figure 2. Categorization of CG-DMRs.

a, CG-DMR size distribution.

b, Distance of CG-DMRs to the nearest transcription start sites.

c, Genomic distribution of proximal CG-DMRs.

d, Evolutionary conservation of proximal CG-DMRs overlapping with: CG islands (CGI), CGI shores, CGI promoters and non-CGI promoters. PhyloP score was used to measure the degree of conservation.

e, Cumulative distribution of conservation score of CG-DMRs in different categories.

f, Fraction of CG-DMRs in different categories that are overlapped with PhastCons conserved elements (Methods).

g, Conservation (PhyloP) score of promoters and different categories of distal CG-DMRs and flanking regions (+/− 5kb).

Extended Data Figure 3. Characterization of primed distal feDMRs and unexplained CG-DMRs.

a, CG methylation (mCG) level of all primed distal fetal enhancer-linked CG-DMRs (feDMRs) in all non-liver tissues. Each row in the heatmap is one tissue sample and each column corresponds to one primed distal feDMR. Both rows and columns were clustered using hierarchical clustering. Colored bars indicate the tissue types and developmental stages of samples, respectively.

b, mCG (left) and histone modification (right) signatures of primed distal feDMRs (blue; n = 618,786) and feDMRs (red; n = 3,715,052). The boxplots were generated using “boxplot” function in R (3.3.1) and show the median and quantiles of the values in all non-liver tissues. The top and bottom of a box represents the upper quartile (Q3 or 75% percentile) and the lower quartile (Q1 or 25% percentile) of data points, while the middle line indicates the median. The upper and lower whiskers are 1.5 times of interquartile range (Q3–Q1) above upper quartile (Q3) and below lower quartile (Q1) respectively. Outliers are data points beyond the whiskers and are plotted as single points.

c, Number of enriched transcription factor binding motifs only in feDMRs (red), only in primed distal feDMRs (orange), both (dark red) and none (grey). Only the motifs linked to expressed transcription factors (transcripts per million, TPM >= 10) were included. Hypergeometric test was used to estimate the significance of overlap between motifs enriched in feDMRs and ones enriched in primed distal feDMRs.

d-e, Similar to (a), heatmaps showing the mCG levels of unexplained CG-DMRs, including transposon overlapping unexplained CG-DMRs (d) and non-transposon overlapping unexplained CG-DMRs (e).

Extended Data Figure 4. CG-DMR effect size analysis.

a, Distribution of CG-DMR effect sizes.

b, The cumulative distribution of CG-DMR effect sizes for CG-DMRs in different categories.

c, Distribution of the number of differentially methylated sites (DMSs) in CG-DMRs.

d, Fraction of CG sites in CG-DMRs that are DMSs given different DMS effect size cutoffs.

e, Number of DMSs in CG-DMRs with different H3K4me1 enrichment, H3K27ac enrichment, enhancer score (from REPTILE), and RNA abundance (log10(TPM+1), transcript per million mapped reads) of the nearest genes whose transcription start site are within 5kb to CG-DMRs. The value was calculated in the most hypomethylated sample of each CG-DMR. The sample size for each violin/box from left to right is 732,389, 626,599, 254,925, 108,012 and 74,277 for H3K4me1, 935,017, 560,213, 136,778, 58,396 and 105,798 for H3K27ac, 1,593,822, 89,797, 70,254, 36,776 and 5,553 for enhancer score, and 1,045,863, 645,080, 98,020, 7,052 and 187 for gene expression.

f, Distribution of Pearson correlation coefficients between mCG level and various metrics for CG-DMRs with different effect sizes. The metrics include H3K4me1 enrichment, H3K27ac enrichment, enhancer score (from REPTILE), and transcription (TPM, transcripts per million mapped reads) of the nearest genes whose transcription start sites are within 5kb of CG-DMRs. The numbers of CG-DMRs with effect size <0.2, 0.2–0.3, 0.3–0.4, 0.4–0.5 and 0.5–1 are 523,106, 615,414, 347,019, 184,116 and 138,512 respectively. The boxplots and violin plots were generated using the ggplot2 (2.2.1) R (3.3.1) package. In the violin plot, the width of the violin represents the density of different data values. In the boxplots, the top and bottom of a box represent the upper quartile (Q3 or 75th percentile) and the lower quartile (Q1 or 25th percentile) of data points, while the middle line indicates the median. The upper and lower whiskers extend 1.5 times the interquartile range (Q3–Q1) above the upper quartile (Q3) and below the lower quartile (Q1) respectively.

Extended Data Figure 5. Link between methylation dynamics and histone modifications at tissue-specific CG-DMRs.

a, Composition of tissue-specific CG-DMRs.

b-c, Percentage of loss-of-mCG (b) and gain-of-mCG (c) events for different fetal stage intervals.

d, Fraction of tissue-specific CG-DMRs that are heavily CG methylated (mCG level > 0.6).

e, Number and fraction of tissue-specific CG-DMRs that only gained mCG (mCG level increases by at least 0.1; red) after P0, only lost mCG (mCG level decreases by at least 0.1; blue), and both (purple) in six tissues where adult methylome data are available.

f, RNA abundance of genes involved in DNA methylation pathways, measured by transcripts per million (TPM).

g, Normalized H3K27ac signals in different clusters.

h, Dynamic mCG level of forebrain-specific CG-DMRs. Grey lines show the mean methylation levels of CG-DMRs in different clusters. The blue line shows the average.
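As a rough illustration of how gain-of-mCG and loss-of-mCG events between consecutive stages could be tallied, the sketch below labels each stage interval of a single CG-DMR. It assumes the 0.1 change threshold quoted in panel e also applies to stage intervals, which the legend does not state; the function names and the example methylation levels are hypothetical.

```python
def classify_intervals(levels, min_change=0.1):
    """Label each consecutive-stage interval as 'gain', 'loss' or 'stable'.

    `levels` holds the mCG level of one CG-DMR at successive stages.
    `min_change` is an assumed threshold borrowed from panel e.
    """
    labels = []
    for earlier, later in zip(levels, levels[1:]):
        delta = later - earlier
        if delta >= min_change:
            labels.append("gain")
        elif delta <= -min_change:
            labels.append("loss")
        else:
            labels.append("stable")
    return labels


def summarize(labels):
    """Collapse interval labels into one category for the CG-DMR."""
    gained = "gain" in labels
    lost = "loss" in labels
    if gained and lost:
        return "both"
    if gained:
        return "gain only"
    if lost:
        return "loss only"
    return "neither"


# Hypothetical mCG levels of one CG-DMR across four stages.
example_labels = classify_intervals([0.9, 0.7, 0.65, 0.85])
example_summary = summarize(example_labels)
print(example_labels, example_summary)
```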

Extended Data Figure 6. Large-scale CG hypomethylation strongly overlaps with super-enhancers.

a, Epigenomic profiles of two limb-specific large hypo CG-DMRs near the Lmx1 gene. The bottom diagram shows the location of the two large hypo CG-DMRs relative to the Lmx1 gene.

b, Correlation between the mCG levels of CG-DMRs within the same large hypo CG-DMR across developmental stages in the given tissue type. If multiple CG-DMRs are within one large hypo CG-DMR, the mean Pearson correlation coefficient of all pairwise comparisons is reported. Numbers of CG-DMRs are shown in parentheses.

c, H3K27ac and H3K4me1 enrichment in large hypo CG-DMRs (red; n = 39,729) and remaining CG-DMRs (green; n = 4,045,384) at all developmental stages across all tissue types except liver.

d, Number of large hypo CG-DMRs identified in each tissue type and the percentage that overlap with super-enhancers (red).

The boxplots (b, c) were generated using the “boxplot” function in R (3.3.1). The top and bottom of a box represent the upper quartile (Q3, or 75th percentile) and the lower quartile (Q1, or 25th percentile), while the middle line indicates the median. The upper and lower whiskers extend 1.5 times the interquartile range (Q3–Q1) above Q3 and below Q1. Single points are outliers that lie beyond the whiskers.
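The summary statistic in panel b, the mean Pearson correlation coefficient over all pairwise comparisons of CG-DMRs within one large hypo CG-DMR, can be sketched as follows. This is an illustrative reimplementation rather than the authors' code; the function names and the example mCG profiles are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_pairwise_pearson(profiles):
    """Mean correlation over all pairs of profiles; None if fewer than two."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        return None
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical mCG profiles (one list per CG-DMR, one value per stage).
profiles = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, 2.0, 1.0],
]
mean_r = mean_pairwise_pearson(profiles)
print(mean_r)
```

In this toy example one pair is perfectly correlated and two pairs are perfectly anti-correlated, so the mean is negative even though two of the three profiles move together.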

Extended Data Figure 7. Comparing large hypo CG-DMRs and DNA methylation valleys (DMVs).

a, mCG level of large hypo CG-DMRs (top) and DMVs (bottom) in all non-liver tissues. Both rows and columns were clustered using hierarchical clustering. Colored bars indicate the tissue types and developmental stages of the samples. The heatmap shows the merged large hypo CG-DMRs and DMVs predicted across all tissue samples.

b, Fraction of large hypo CG-DMRs (left) and DMVs (right) that undergo loss-of-mCG (top, blue) and gain-of-mCG (bottom, red) during development. The blue (loss-of-mCG) or red (gain-of-mCG) line shows the aggregated values over all non-liver tissues, whereas the grey lines show the data for each tissue type.

c, Number of genes overlapping with large hypo CG-DMRs (left) or DMVs (right). The dark blue bar indicates the number of genes that encode transcription factors (TFs).

Extended Data Figure 8. Non-CG methylation accumulation in fetal tissues.

a, Expression of neural progenitor markers, Nes37 and Sox238. TPM – transcripts per million mapped reads.

b, Expression of several neuronal markers from Svendsen et al39.

c, Sequence context preference for non-CG methylation (mCH).

d, Grouping mCH domains into 5 clusters based on the dynamics of methylation accumulation. The heatmap shows normalized methylation levels of mCH domains and flanking genomic regions (up to 100kb upstream and 100kb downstream). mCH in the adult (AD) forebrain was approximated using data from the frontal cortex from 6-week-old mice.

e, Expression dynamics of genes within mCH domains relative to the other genes. Z-scores were calculated for each gene across development and each line shows the mean value of mCH overlapping genes for each cluster.

f, Expression of genes in mCH domains at P0 relative to the expression dynamics of genes outside mCH domains. Each circle represents the value for one mCH domain cluster in one tissue. The red horizontal line indicates the median, which was tested against 0 using a one-sided Wilcoxon signed-rank test (n = 50).
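The per-gene z-scores and per-cluster mean lines described in panel e can be sketched as below. The legend does not say whether the sample or the population standard deviation was used; this sketch assumes the sample standard deviation, and the function names and expression values are hypothetical.

```python
import statistics


def zscore_across_stages(expression):
    """Z-score one gene's expression values across developmental stages.

    Uses the sample standard deviation (an assumption of this sketch).
    """
    mean = statistics.mean(expression)
    sd = statistics.stdev(expression)
    return [(value - mean) / sd for value in expression]


def cluster_mean_profile(genes):
    """Average the z-scored profiles of all genes in one cluster, per stage."""
    zscored = [zscore_across_stages(gene) for gene in genes]
    n_stages = len(zscored[0])
    return [statistics.mean(z[i] for z in zscored) for i in range(n_stages)]


# Hypothetical TPM values for two genes at three stages.
cluster = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
]
profile = cluster_mean_profile(cluster)
print(profile)
```

Z-scoring each gene first removes differences in absolute expression, so the cluster line reflects shared temporal trends rather than being dominated by highly expressed genes.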

Extended Data Figure 9. Fetal-enhancer-linked CG-DMRs (feDMRs).

a, Number of feDMRs predicted in each tissue from each stage.

b, Positive rate (fraction of elements that are experimentally validated as enhancers in a given tissue) of DNA elements in the VISTA enhancer browser (left), ones that do not overlap with feDMRs or H3K27ac peaks (middle), and those that do not overlap with feDMRs or H3K27ac peaks and do not show a high evolutionary PhyloP score (right). Numbers in parentheses indicate the number of VISTA elements.

c, Positive rate in E11.5 tissues of the elements that are experimentally validated as an enhancer (+) or not (−) in a given E11.5 tissue. Numbers indicate the number of VISTA elements.

d, (left) Percentage of putative enhancers from Gorkin et al. (companion paper) that overlap with feDMRs in each tissue sample. (right) Percentage of feDMRs that overlap with putative enhancers from Gorkin et al.

Extended Data Figure 10. WGCNA identification of co-expression modules.

a, The scale-free topology model fit (R²) (top) and the mean connectivity of the co-expression network (bottom) given different soft-thresholding powers. These two plots show how thresholds were chosen for weighted gene co-expression network analysis (WGCNA). The blue horizontal line indicates the model fit cutoff (R² = 0.8). A soft-thresholding power of 5 was chosen to construct the co-expression network because it is the first value at which the model fit is greater than 0.8.

b, Top enriched ontology terms of genes in co-expression modules. EnrichR47 was used for this analysis, which uses a one-tailed Fisher’s exact test to calculate p-values (numbers of overlapping genes are shown in parentheses). The Benjamini-Hochberg method was then used to adjust p-values for multiple testing.

c, Expression of genes in CEM12. Each row is a gene in the module, and the transcripts per million (TPM) z-scores were calculated along each row.

d, Similar to Fig. 5d, correlation of temporal eigengene expression for CEM29 and CEM12 with the z-scores of average mCG level and the z-scores of average enhancer score of neighboring feDMRs (Pearson correlation coefficient; n = 7).
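The enrichment statistics named in panel b, a one-tailed Fisher’s exact test followed by Benjamini-Hochberg adjustment, can be sketched in a few lines. This is a generic illustration of the two methods, not EnrichR's implementation; the function names, gene counts and p-values are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, list_size, term_size, background):
    """One-tailed Fisher's exact p-value for over-representation.

    Probability of seeing at least `overlap` term genes when `list_size`
    genes are drawn from `background` genes, `term_size` of which carry
    the term (the upper tail of the hypergeometric distribution).
    """
    total = comb(background, list_size)
    upper = min(list_size, term_size)
    tail = sum(
        comb(term_size, k) * comb(background - term_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical example: all 3 module genes fall in a 3-gene term,
# with a background of 10 genes.
p_single = fisher_enrichment_p(overlap=3, list_size=3, term_size=3, background=10)
# Hypothetical raw p-values for four ontology terms.
p_adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_single, p_adjusted)
```

The adjustment never lowers a p-value, and terms with similar raw p-values can end up sharing the same adjusted value because of the monotonicity step.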

Supplementary Material

Supplemental Table 1.xlsx
Supplemental Table 7.xlsx
Supplemental Table 8.xls
Supplemental Table 9.xls
Supplementary_Information_v2.pdf
Supplemental Table 10.xls
Supplemental Table 11.xlsx
Supplemental Table 12.xlsx
Supplemental Table 2.xlsx
Supplemental Table 3.xls
Supplemental Table 4.xls
Supplemental Table 5.xls
Supplemental Table 6.xls

Acknowledgements

We thank Drs. Junhao Li, Shao-shan Carol Huang, Eran A. Mukamel and Liang Song for critical comments. D.U.G. is supported by the A.P. Giannini Foundation and NIH IRACDA K12 GM068524. A.V. and L.A.P. were supported by National Institutes of Health grant U54HG006997, and the research conducted at the E.O. Lawrence Berkeley National Laboratory was performed under Department of Energy Contract DE-AC02-05CH11231, University of California. J.R.E. is an Investigator of the Howard Hughes Medical Institute. Use of the Extreme Science and Engineering Discovery Environment (XSEDE) was supported by National Science Foundation grant number ACI-1548562. This work was supported by the National Institutes of Health ENCODE Project (U54 HG006997). The data that support these findings are publicly accessible at https://www.encodeproject.org/ and http://neomorph.salk.edu/ENCODE_mouse_fetal_development.html. Additional RNA-seq datasets are available at the NCBI Gene Expression Omnibus (accession GSE100685). Further details describing the data used in this study can be found in Supplementary Tables 1 and 2.

Footnotes

Conflict of Interest

B.R. is a co-founder and shareholder of Arima Genomics, Inc.

Code Availability

Methylpy (1.0.2) and REPTILE (1.0) are available at https://github.com/yupenghe/methylpy and https://github.com/yupenghe/REPTILE. Custom code used for this study is available at https://github.com/yupenghe/encode_dna_dynamics.

References

  • 1. Reik W Stability and flexibility of epigenetic gene regulation in mammalian development. Nature 447, 425 (2007).
  • 2. Davidson EH Emerging properties of animal gene regulatory networks. Nature 468, 911 (2010).
  • 3. Spitz F & Furlong EEM Transcription factors: from enhancer binding to developmental control. Nat. Rev. Genet 13, 613 (2012).
  • 4. Patel DJ & Wang Z Readout of Epigenetic Modifications. Annu. Rev. Biochem 82, 81–118 (2013).
  • 5. Zhu H, Wang G & Qian J Transcription factors as readers and effectors of DNA methylation. Nat. Rev. Genet 17, 551–565 (2016).
  • 6. Bird A DNA methylation patterns and epigenetic memory. Genes Dev. 16, 6–21 (2002).
  • 7. Lister R et al. Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462, 315–322 (2009).
  • 8. Schultz MD et al. Human body epigenome maps reveal noncanonical DNA methylation variation. Nature 523, 212–6 (2015).
  • 9. Lister R et al. Global epigenomic reconfiguration during mammalian brain development. Science 341, 1237905 (2013).
  • 10. Ziller MJ et al. Charting a dynamic DNA methylation landscape of the human genome. Nature 500, 477–81 (2013).
  • 11. Hon GC et al. Epigenetic memory at embryonic enhancers identified in DNA methylation maps from adult mouse tissues. Nat. Genet 45, 1198–206 (2013).
  • 12. Domcke S et al. Competition between DNA methylation and transcription factors determines binding of NRF1. Nature 528, 575–579 (2015).
  • 13. Stricker SH, Koferle A & Beck S From profiles to function in epigenomics. Nat. Rev. Genet 18, 51–66 (2017).
  • 14. Xie W et al. Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 148, 816–831 (2012).
  • 15. He Y & Ecker JR Non-CG Methylation in the Human Genome. Annu. Rev. Genomics Hum. Genet 16, 150615185749007 (2015).
  • 16. Guo JU et al. Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat. Neurosci 17, 215–22 (2014).
  • 17. Luo C et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. 604, 600–604 (2017).
  • 18. Chen L et al. MeCP2 binds to non-CG methylated DNA as neurons mature, influencing transcription and the timing of onset for Rett syndrome. Proc. Natl. Acad. Sci 201505909 (2015). doi: 10.1073/pnas.1505909112
  • 19. Wang L et al. Programming and Inheritance of Parental DNA Methylomes in Mammals. Cell 157, 979–991 (2014).
  • 20. Guo H et al. The DNA methylation landscape of human early embryos. Nature 511, 606–10 (2014).
  • 21. Smith ZD et al. DNA methylation dynamics of the human preimplantation embryo. Nature 511, 611–615 (2014).
  • 22. Jones KL Recognizable Patterns of Human Malformation. (Saunders, 2005).
  • 23. Buenrostro JD, Giresi PG, Zaba LC, Chang HY & Greenleaf WJ Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–8 (2013).
  • 24. Yue F et al. A comparative encyclopedia of DNA elements in the mouse genome. Nature 515, 355–364 (2014).
  • 25. Feng L, Hatten ME & Heintz N Brain lipid-binding protein (BLBP): a novel signaling system in the developing mammalian CNS. Neuron 12, 895–908 (1994).
  • 26. Visel A, Minovitsky S, Dubchak I & Pennacchio LA VISTA Enhancer Browser - A database of tissue-specific human enhancers. Nucleic Acids Res. 35, 88–92 (2007).
  • 27. Bogdanović O et al. Active DNA demethylation at enhancers during the vertebrate phylotypic period. Nat. Genet 48, 417–426 (2016).
  • 28. Law JA & Jacobsen SE Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat. Rev. Genet 11, 204–220 (2010).
  • 29. Huang Y et al. The Behaviour of 5-Hydroxymethylcytosine in Bisulfite Sequencing. PLoS One 5, e8888 (2010).
  • 30. Yu M et al. Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome. Cell 149, 1368–1380 (2012).
  • 31. Mo A et al. Epigenomic Signatures of Neuronal Diversity in the Mammalian Brain. Neuron 86, 1369–1384 (2015).
  • 32. Johnson RL & Tabin CJ Molecular models for vertebrate limb development. Cell 90, 979–990 (1997).
  • 33. Hnisz D et al. Super-enhancers in the control of cell identity and disease. Cell 155, (2013).
  • 34. Whyte WA et al. Master Transcription Factors and Mediator Establish Super-Enhancers at Key Cell Identity Genes. Cell 153, 307–319 (2013).
  • 35. Xie W et al. Epigenomic analysis of multilineage differentiation of human embryonic stem cells. Cell 153, 1134–1148 (2013).
  • 36. Jeong M et al. Large conserved domains of low DNA methylation maintained by Dnmt3a. Nat. Genet 46, 17 (2013).
  • 37. Lendahl U, Zimmerman LB & McKay RD CNS stem cells express a new class of intermediate filament protein. Cell 60, 585–595 (1990).
  • 38. Ellis P et al. SOX2, a persistent marker for multipotential neural stem cells derived from embryonic stem cells, the embryo or the adult. Dev. Neurosci 26, 148–165 (2004).
  • 39. Svendsen CN, Bhattacharyya A & Tai Y-T Neurons from stem cells: preventing an identity crisis. Nat. Rev. Neurosci 2, 831–834 (2001).
  • 40. Stroud H et al. Early-Life Gene Expression in Neurons Modulates Lasting Epigenetic States. Cell 171, 1151–1164.e16 (2017).
  • 41. He Y et al. Improved regulatory element prediction based on tissue-specific local epigenomic signatures. Proc. Natl. Acad. Sci (2017). doi: 10.1073/pnas.1618353114
  • 42. Heintzman ND et al. Histone modification at human enhancers reflect global cell-type specific gene expression. Nature 459, 108–112 (2009).
  • 43. Langfelder P & Horvath S WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
  • 44. Maurano MT et al. Systematic Localization of Common Disease-Associated Variation in Regulatory DNA. Science 337, 1190 (2012).
  • 45. Finucane HK et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet 47, 1228–1235 (2015).
  • 46. Komor AC, Kim YB, Packer MS, Zuris JA & Liu DR Programmable editing of a target base in genomic DNA without double-stranded DNA cleavage. Nature 533, 420–424 (2016).
  • 47. Chen EY et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).
  • 48. Towns J et al. XSEDE: Accelerating Scientific Discovery. Comput. Sci. Eng 16, 62–74 (2014).
  • 49. Urich MA, Nery JR, Lister R, Schmitz RJ & Ecker JR MethylC-seq library preparation for base-resolution whole-genome bisulfite sequencing. Nat. Protoc 10, 475–483 (2015).
  • 50. Kent WJ et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).
  • 51. Ma H et al. Abnormalities in human pluripotent cells due to reprogramming mechanisms. Nature 511, 177–83 (2014).
  • 52. Schultz MD, Schmitz RJ & Ecker JR ‘Leveling’ the playing field for analyses of single-base resolution DNA methylomes. Trends Genet. 28, 583–585 (2012).
  • 53. Li H & Durbin R Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  • 54. Ramírez F et al. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Res. 44, W160–W165 (2016).
  • 55. Zhang Y et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, 1–9 (2008).
  • 56. Dobin A et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
  • 57. Harrow J et al. GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–74 (2012).
  • 58. Li B & Dewey CN RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).
  • 59. Quinlan AR & Hall IM BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  • 60. Rousseeuw P Least Median of Squares Regression. J. Am. Stat. Assoc 79, 871–880 (1984).
  • 61. Ernst J & Kellis M ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216 (2012).
  • 62. Mo A et al. Epigenomic Signatures of Neuronal Diversity in the Mammalian Brain. Neuron 86, 1369–1384 (2015).
  • 63. Grant CE, Bailey TL & Noble WS FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
  • 64. McLean CY et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol 28, 495–501 (2010).
  • 65. The International HapMap Project. Nature 426, 789–796 (2003).
  • 66. The 1000 Genomes Project Consortium A global reference for human genetic variation. Nature 526, 68 (2015).
  • 67. Heintzman ND et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat. Genet 39, 311–318 (2007).
  • 68. Calo E & Wysocka J Modification of enhancer chromatin: what, how and why? Mol. Cell 49, 10.1016/j.molcel.2013.01.038 (2013).
  • 69. Pollard KS, Hubisz MJ, Rosenbloom KR & Siepel A Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).
  • 70. Lovén J et al. Selective Inhibition of Tumor Oncogenes by Disruption of Super-Enhancers. Cell 153, 320–334 (2013).
  • 71. Pedregosa F et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res 12, 2825–2830 (2011).
  • 72. Fu Y et al. MacroH2A1 associates with nuclear lamina and maintains chromatin architecture in mouse liver cells. Sci. Rep 5, 17186 (2015).
  • 73. Weddington N et al. ReplicationDomain: a visualization tool and comparative database for genome-wide replication timing data. BMC Bioinformatics 9, 530 (2008).
  • 74. Benjamini Y & Hochberg Y Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995).
  • 75. Bembom O seqLogo: Sequence logos for DNA sequence alignments. R package version 1.40.0. (2016).
  • 76. Kuleshov MV et al. Enrichr: a comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90–7 (2016).
  • 77. Zhang H-M et al. AnimalTFDB: a comprehensive animal transcription factor database. Nucleic Acids Res. 40, D144 (2012).
  • 78. Zhang B & Horvath S A General Framework for Weighted Gene Co-Expression Network Analysis. Stat. Appl. Genet. Mol. Biol 4, (2005).

Data Availability Statement

All whole-genome bisulfite sequencing (WGBS) data from mouse embryonic tissues are available at the ENCODE portal (https://www.encodeproject.org/) and/or deposited in the NCBI Gene Expression Omnibus (GEO) (Supplementary Table 1). The additional RNA-seq dataset of forebrain, midbrain, hindbrain and liver is available at the NCBI Gene Expression Omnibus (GEO) under accession GSE100685. All other data used in this study, including chromatin immunoprecipitation sequencing (ChIP-seq), assay for transposase-accessible chromatin using sequencing (ATAC-seq), RNA-seq and additional WGBS data, are available at the ENCODE portal and/or GEO (Supplementary Table 2).
