Nucleic Acids Research. 2023 Feb 10;51(5):2046–2065. doi: 10.1093/nar/gkad050

DNA methylation entropy is associated with DNA sequence features and developmental epigenetic divergence

Yuqi Fang 1,2, Zhicheng Ji 3,4,3, Weiqiang Zhou 5,3, Jordi Abante 6,3, Michael A Koldobskiy 7,8, Hongkai Ji 9, Andrew P Feinberg 10,11,12,13
PMCID: PMC10018346  PMID: 36762477

Abstract

Epigenetic information defines tissue identity and is largely inherited in development through DNA methylation. While studied mostly for mean differences, methylation also encodes stochastic change, defined as entropy in information theory. Analyzing allele-specific methylation in 49 human tissue sample datasets, we find that methylation entropy is associated with specific DNA binding motifs, regulatory DNA, and CpG density. Then applying information theory to 42 mouse embryo methylation datasets, we find that the contribution of methylation entropy to time- and tissue-specific patterns of development is comparable to the contribution of methylation mean, and methylation entropy is associated with sequence and chromatin features conserved with human. Moreover, methylation entropy is directly related to gene expression variability in development, suggesting a role for epigenetic entropy in developmental plasticity.

INTRODUCTION

DNA methylation, a covalent modification of the nucleotide cytosine, heritable during cell division at CpG dinucleotides, is a key component of the epigenetic information, i.e. independent of the DNA sequence itself, defining cell type identity and developmental state. Differences in DNA methylation levels between individuals can be driven by nearby DNA sequence differences termed methylation quantitative trait loci (mQTLs) (1), and DNA methylation level is generally inversely related to mean gene expression levels, particularly at gene promoters (2). In addition to DNA methylation levels, DNA methylation stochasticity, more formally defined as entropy in information theory, is related to processes involving plasticity, such as the epithelial-mesenchymal transition in cancer (3) and differentiation potency, and could also help to regulate developmental plasticity (4). However, it remains poorly understood how epigenetic entropy is influenced by DNA sequence, how entropy might function, its relationship to developmental state or transcription factor binding sites, or what might be its effect on gene expression.

Information theory is the quantitative field that measures and analyzes information and entropy in its transmission, and it is thus a natural fit to address stochasticity in methylation. Figure 1A illustrates this idea applied to multiple consecutive CpG sites on the same DNA molecule. The two examples have similar mean methylation levels but markedly different entropy. The example on the right shows substantial variation from molecule to molecule in terms of the combinatorial configurations of the binary methylation states of multiple CpG sites in the same molecule, and hence high entropy; whereas the example on the left has little such variation and low entropy. Our overall goal was to relate differences in entropy (compared to mean) to the methylation potential energy landscape, genetic sequence, regulatory DNA, transcription factor binding, embryonic development and gene expression (Figure 1A).

Figure 1.


Study overview. (A) Conceptual illustration of applying information theory to a set of DNA reads from WGBS. Each row is a single DNA molecule representing one read, and the dots on each line represent CpG locations. Blue dots correspond to unmethylated CpGs and red dots to methylated CpGs. The mean methylation level (MML) in the left panel is the same as the MML in the right panel. However, when considering the combinatorial configuration of methylation states of multiple CpGs in the same molecule, the left panel shows small variation among reads (rows), whereas the right panel shows large variation. As a result, the right panel has a much larger normalized methylation entropy (NME) than the left panel (see Materials and Methods). Created with BioRender.com. (B) A summary of the analyses performed in this study, which examines methylation entropy and its relationship to functional genomic features including transcription factor binding sites, CpG density and regulatory DNA; and its relationship to functional consequences including embryonic development, the contribution of methylation entropy and mean methylation to the methylation landscape, orthogonal relationships to transcription factor binding sites, CpG density and regulatory DNA, conservation of those features between human and mouse, and the association between gene expression variability and methylation entropy.

Entropy analysis has been applied to whole-genome bisulfite sequencing methylation data (5–7), but it has not incorporated the role of the underlying DNA sequence itself in controlling entropy, nor has it been applied to embryonic development. We first applied an information theory method to measure DNA methylation mean and entropy that can distinguish the two alleles of a gene, providing the perfect control since the two alleles are present in exactly the same cells (8). We analyzed data from 49 polymorphic human samples obtained from the Roadmap Epigenomics Project, which contain not only DNA methylation sequencing data, but also complete DNA sequence (SNPs) (9). We asked three questions comparing entropy and mean methylation (outlined in blue in Figure 1B): (i) How are they related to transcription factor binding sites? (ii) How are they related to CpG density? and (iii) How are they related to regulatory DNA?

To better understand the functional role of methylation entropy, we then performed an information-theoretic analysis of mouse embryonic development, examining comprehensive whole-genome DNA methylation data from the ENCODE3 project (10) on seven individual tissue types during mouse development at six developmental time points, addressing five further questions (outlined in orange in Figure 1B): (i) Does this information theory-based approach identify tissue- and time-dependent changes in the methylation landscape? (ii) What is the relative contribution of mean methylation and methylation entropy to these developmental landscape changes? (iii) Can we integrate these results with orthogonal measures of transcription factor binding sites, CpG density, and regulatory DNA in order to understand the role of DNA methylation entropy? (iv) Are these features conserved between mouse and human? and (v) What is the functional relationship between methylation entropy and gene expression variability, using single-cell RNA-seq data from both mouse and human? These analyses taken as a whole revealed striking associations of methylation entropy with genomic and developmental processes, the sequences that may drive them, and their relationship to developmental gene expression.

MATERIALS AND METHODS

SNP data processing

The single nucleotide polymorphism (SNP) information for subjects H9, HUES64, STL001, STL002, STL003, skin03, HuFGM02, 149, 150 and 112 was extracted from the Roadmap Epigenomics database (11). Whole-genome sequencing (WGS) data for the H1 cell line were downloaded from PRJNA285681 (SRR2048232). The WGS reads were trimmed using Trim Galore (v0.5.0) (12). The trimmed reads were aligned to the hg19 reference genome using the Arioc WGS alignment package (v1.40) (13). Duplicate reads arising from polymerase chain reaction (PCR) were removed using Picard tools (MarkDuplicates v2.18.13) (14). SNPs were called using GATK HaplotypeCaller (v4.0.0) (14) and dbSNP build 15 (15).

Allele-specific methylation detection

The whole-genome bisulfite sequencing (WGBS) data were aligned, and reads were assigned to each allele in the same way as in the correlated potential energy landscape (CPEL) pipeline (8). The first and last 5 base pairs (bp) of each read were not used in the analysis. To perform haplotype-dependent allele-specific methylation analysis, we used the Julia package CpelAsm.jl, a recently developed method (8). For a given haplotype, CpelAsm estimates an allele-specific epigenetic landscape of the random methylation state $X = [X_1, \ldots, X_N]$, where N is the number of CpG sites in the region, by performing maximum likelihood estimation on a set of M independent WGBS reads obtained for each allele as described for CpelAsm (8), resulting in a probability mass function of the methylation state $p(x)$, $x \in \{0,1\}^N$, for each allele (allele subscript not shown henceforth for notational simplicity). Once the parameters of the allele-specific models are estimated, CPEL computes the allele-specific mean methylation level (MML) $\mu$ as well as the normalized methylation entropy (NME) $h$, a measure of epigenetic stochasticity, for each allele. In particular, the MML of a given allele is given by

$$\mu = \frac{1}{N} \sum_{n=1}^{N} E[X_n],$$

where $X_n$ is the random methylation state of the n-th CpG site, and $E[X_n]$ denotes the expected methylation level of the n-th CpG site. The corresponding NME is given by

$$h = -\frac{1}{N} \sum_{x \in \{0,1\}^N} p(x) \log_2 p(x).$$

Both quantities are normalized to produce values in the range [0,1]. Then, CPEL computes the absolute MML difference and NME difference between two alleles. The first statistic measures the absolute difference in MML (dMML) between haplotype alleles, and it is simply given by $|\mu_1 - \mu_2|$, where the subscripts index the two alleles. The second statistic measures the absolute difference in NME (dNME) between haplotype alleles, and it is given by $|h_1 - h_2|$. Thus, a large NME difference suggests that the methylation state in one allele is highly stochastic while that of the other allele behaves almost deterministically. To determine whether the allelic difference in MML or NME is statistically significant, we generate null statistics from homozygous regions of the genome to obtain an empirical null distribution for each test statistic (dMML, dNME). However, the distribution of the null statistics is a function of the number of CpG sites N. Therefore, we generate a set of null statistics for each N considered. To generate a null statistic, we randomly choose a genomic window with no SNPs and select a contiguous set of N CpG sites. Next, we randomly split the reads mapping to that region into two groups, which simulate two alleles. A CPEL model is estimated for each simulated allele, and the corresponding statistics are computed to produce a null statistic. This process is repeated until we obtain 1000 null statistics for each N. These sets of null statistics are then used to compute a P-value for each test statistic in each heterozygous region. The complete details of the procedure used to compute P-values can be found in Supplementary Material section 10 (hypothesis testing) of (8). When a SNP falls on the C or G of a CpG and thereby changes the CpG number, the SNP-containing CpG is excluded from the CPEL model for both alleles. This ensures a fair comparison between the two alleles.
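As an illustration of the MML and NME definitions, the following minimal Python sketch computes both quantities directly from a probability mass function over methylation patterns. It is not the CpelAsm implementation: the function name `mml_nme`, the dictionary representation of the pmf, and the two toy alleles are assumptions made only for this example.

```python
from math import log2

def mml_nme(pmf):
    """Return (MML, NME) for a pmf given as {tuple of 0/1 states: probability}."""
    n_sites = len(next(iter(pmf)))
    # MML: average over CpG sites of the expected methylation state.
    mml = sum(p * sum(x) for x, p in pmf.items()) / n_sites
    # NME: Shannon entropy in bits divided by the number of CpG sites.
    entropy = -sum(p * log2(p) for p in pmf.values() if p > 0)
    return mml, entropy / n_sites

# Hypothetical two-CpG alleles with the same MML but very different NME.
ordered = {(1, 0): 1.0}
disordered = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}

mml_a, nme_a = mml_nme(ordered)
mml_b, nme_b = mml_nme(disordered)
d_mml = abs(mml_a - mml_b)   # absolute MML difference between alleles
d_nme = abs(nme_a - nme_b)   # absolute NME difference between alleles
```

In this toy case the two alleles are indistinguishable by mean methylation, yet one is deterministic and the other maximally stochastic, which is the situation sketched in Figure 1A.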
Subsequently, CpelAsm used the Benjamini–Hochberg procedure to compute adjusted P-values to control the false discovery rate (FDR) in the statistical output (16). The FDR ≤0.1 was used to determine the significance of the MML difference and NME difference.
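The Benjamini–Hochberg adjustment is used repeatedly in this section, so a compact sketch may be helpful. The code below is a generic standard-library implementation of the step-up procedure, not the code used by CpelAsm; the function name and the example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical P-values for four regions.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
significant = [q <= 0.1 for q in fdr]
```

With the FDR ≤ 0.1 threshold used here, the first three hypothetical regions would be called significant.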

Regions with large MML difference enrichment analysis

The list of imprinted genes was downloaded from the Geneimprint website (17). The contingency table was constructed by counting regions with and without large MML differences in the promoters of imprinted versus non-imprinted genes. The odds ratio was calculated and the two-sided Fisher's exact test was performed using this contingency table, in which rows are regions with or without significant dMML and columns are regions in the promoters of imprinted or non-imprinted genes. The same enrichment analysis was done for monoallelically expressed (MAE) genes, where rows are regions with or without significant dMML and columns are regions in the promoters of MAE or non-MAE genes (18).
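The enrichment test on such a 2 × 2 table can be sketched as follows. This is a standard-library illustration of the odds ratio and two-sided Fisher's exact test, not the software used in the study; the function name and the counts in the example table are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Odds ratio and two-sided Fisher P-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p_value = sum(prob(k) for k in range(lowest, highest + 1)
                  if prob(k) <= p_observed * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)

# Hypothetical counts: rows = significant dMML (yes, no),
# columns = promoter of imprinted gene (yes, no).
odds_ratio, p_value = fisher_exact_two_sided(3, 1, 1, 3)
```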

Human motif binding analysis

We used the motifBreakR package in the default setting to first map the motif obtained from the JASPAR database (19) at each SNP to predict the transcription factor binding sites (TFBSs). For each motif site, we then evaluated the binding probability of the corresponding motif in each allele by considering the polymorphism. For each motif site, we counted the occurrence of the region with higher binding probability in the allele with higher NME or MML than the other allele. This observation should be 50% by random chance. On the contrary, if the occurrence of such regions is significantly higher than 50%, using the binomial test, it suggests that a higher transcription factor (TF) binding probability is associated with higher NME or MML. Similarly, we performed the analysis to identify motifs that are associated with lower NME or MML.
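The comparison against the 50% expectation can be sketched with an exact binomial tail probability. The source does not state whether the test was one- or two-sided, so the one-sided upper tail below is an assumption, as are the function name and the example counts.

```python
from math import comb

def binomial_upper_tail(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 0.5)."""
    favorable = sum(comb(trials, k) for k in range(successes, trials + 1))
    return favorable / 2 ** trials

# Hypothetical motif: at 8 of 10 sites, the allele with the higher binding
# probability is also the allele with the higher NME.
p_value = binomial_upper_tail(8, 10)
```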

Analyzing the association between NME and SNP

We examined the trinucleotide context of each SNP, i.e. 1 bp before and after the SNP. For each type of SNP, there are 16 possible trinucleotide changes. We merged trinucleotide changes that are reverse complements, e.g. GCG → GTG and CGC → CAC. The two trinucleotides were connected by → in the direction of the SNP, which was chosen so that the number of CG dinucleotides decreases whenever the SNP changes the CG count, e.g. GCG → GTG. Allele1 is the allele on the left, e.g. GCG, and allele2 is the allele on the right, e.g. GTG. The NME difference was calculated for each region as the NME in allele1 minus the NME in allele2. Note that for both alleles the NME was calculated without using the SNP-containing CpG to ensure a fair comparison between alleles. For each trinucleotide context, the contingency table was constructed by counting the regions in which the NME in allele1 is smaller than the NME in allele2, within that trinucleotide context versus outside it (Supplementary Figure S6). The log(odds ratio) and P-value were calculated using the two-sided Fisher's exact test. The Benjamini–Hochberg correction was used to calculate the FDR for significance (16).
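The bookkeeping for trinucleotide changes can be sketched as below. The helper names are hypothetical, and choosing the lexicographically smaller representation as the merged key is an assumption made for illustration; the source only specifies that reverse-complement changes are merged and that the direction follows the loss of CG.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def orient_by_cpg_loss(tri_a, tri_b):
    """Order a trinucleotide pair so that the CG count does not increase left to right."""
    if tri_b.count("CG") > tri_a.count("CG"):
        return tri_b, tri_a
    return tri_a, tri_b

def merged_key(allele1, allele2):
    """One key shared by a trinucleotide change and its reverse complement."""
    forward = (allele1, allele2)
    reverse = (reverse_complement(allele1), reverse_complement(allele2))
    return min(forward, reverse)

# The two changes named in the text map to the same key.
key_a = merged_key("GCG", "GTG")
key_b = merged_key("CGC", "CAC")
oriented = orient_by_cpg_loss("GTG", "GCG")
```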

Some regions contain >1 SNPs that change the CpG number. To calculate the effect of the CpG number in the allele on NME, we calculated the CpG number changes and NME changes between two alleles. The analyzed regions with different CpG numbers between two alleles due to SNP and significant NME difference (FDR ≤ 0.1) were selected. Among those regions, the distribution of NME in the allele with more CpGs was plotted in Figure 3B in red and the distribution of NME in the allele with fewer CpG was plotted in Figure 3B in green. The one-sided t-test was used to test the significance of NME difference between two alleles because we found the allele with more CpG tends to have smaller NME when analyzing SNPs. The NME in both alleles in all the regions with significant NME differences between the two alleles, regardless of CpG number, was plotted in Figure 3B in blue as a reference.

Figure 3.


Allelic sequence relationship to NME. (A) The analysis for each SNP genotype in all 52 possible trinucleotide contexts. Log(odds ratio) (x-axis) greater than 0 means that the right trinucleotide is more likely than expected by chance to have higher entropy than the left trinucleotide. If the SNP changes the number of CpG sites, the allele with fewer CpG sites is shown on the right and the bar is colored in blue. For each SNP genotype, the trinucleotide contexts associated with the largest allelic entropy differences (i.e. highest odds ratio) were often the ones that change the CpG number (colored in blue). Error bars show the upper and lower bounds of the 95% CI of the log(OR). (B) Distribution of NME comparing the alleles with higher or lower numbers of CpG. For the regions that have CpG number differences, the allele that contains more CpG is labeled ‘more CpG’ (red) and the other allele is labeled ‘fewer CpG’ (green). The allele with more CpG has significantly lower NME than the allele with fewer CpG (paired one-sided t-test, $P < 2.2 \times 10^{-16}$). The distribution of NME for all alleles is displayed as a reference (blue). (C) NME and MML, regardless of SNPs, of genomic regions surrounding transcription start sites (20 kb around each TSS) stratified based on CpG density. The CpG density was defined as the ratio of the observed CpG number to the expected CpG number in each region (see Materials and Methods). The centerline represents the median and the upper and lower lines are the third and first quartiles. As CpG density increases, the NME decreases. MML showed an abrupt decrease at high CpG density while NME had a gradual decrease as CpG density increased.

The number of CpG was counted in each allele with an extended region (500 bp). The expected number of CpG was calculated using equations from (20)

graphic file with name M0009.gif

The density was calculated for each allele by the number of CpG/expected CpG. The CpG density difference was calculated by the difference of CpG density between two alleles. The regions showing significant NME difference (FDR ≤ 0.1) were used to calculate the correlation between CpG density difference and NME difference.
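A minimal sketch of an observed-to-expected CpG density follows. The exact expected-count formula from (20) is not reproduced in the text, so the commonly used form (number of C × number of G)/(sequence length) is assumed here and may differ from the one actually applied; the function name and example sequences are also hypothetical.

```python
def cpg_density(sequence):
    """Observed CpG count divided by an expected CpG count.

    ASSUMPTION: expected = (#C * #G) / length, which may not match (20).
    """
    seq = sequence.upper()
    observed = seq.count("CG")
    expected = seq.count("C") * seq.count("G") / len(seq)
    return observed / expected if expected > 0 else 0.0

# Hypothetical short sequences; the study used 500 bp extended regions.
dense = cpg_density("CGCGATAT")
empty = cpg_density("ATATATAT")
```

The allelic CpG density difference described above would then be the difference between the values returned for the two allele sequences.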

Human NME and MML calculation without separating alleles

Allele-specific analysis can only be performed for regions with heterozygous SNPs. To analyze the whole genome regardless of SNPs, the WGBS data for human samples were aligned without assigning the allele. The first and last 5 bp of each read were not used. The target region was divided into 250 bp segments. For each segment, regardless of the allele of origin, the MML and the NME were calculated as described in the Allele-specific methylation detection section. The CpG density-dependent nature of differential NME (dNME) suggested that simply lumping data from both alleles might give a false impression of information-theoretic entropy. If one simply pooled the reads from a 50–50 mixture of a highly methylated and a lowly methylated allele of a gene, both with near-zero entropy, the combined reads would falsely appear to have relatively high entropy. That is exactly what would happen when examining imprinted regions, where one allele is in fact methylated depending on the parent of origin. Indeed, if one computes NME without regard to SNPs, there is a false skewing to high NME in regions with significant MML difference such as imprinted genes (Supplementary Figure S1A, B), while that is not the case for regions without significant MML difference (Supplementary Figure S1C). Since regions with allelic mean methylation imbalances were rare across the genome (0.422% of all regions analyzed), and since the main differences between the mean allelic methylation entropy and the methylation entropy regardless of SNP were in such regions, we excluded regions showing mean methylation imbalances from this analysis. As a result, in this whole-genome analysis, NME is expected to be similar to the mean allelic NME of the two alleles at each locus.
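The pooling artifact described above can be reproduced with a toy calculation. The sketch below uses the empirical pattern frequencies of the reads rather than a fitted CPEL model, which is a simplification; the function name and the read sets are hypothetical.

```python
from collections import Counter
from math import log2

def empirical_nme(reads):
    """Entropy (bits) of observed read patterns divided by the number of CpG sites."""
    counts = Counter(reads)
    total = len(reads)
    n_sites = len(reads[0])
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return entropy / n_sites

# Hypothetical imprinted-like region with four CpG sites.
methylated_allele = [(1, 1, 1, 1)] * 10
unmethylated_allele = [(0, 0, 0, 0)] * 10

nme_allele_1 = empirical_nme(methylated_allele)      # no variation within the allele
nme_allele_2 = empirical_nme(unmethylated_allele)    # no variation within the allele
nme_pooled = empirical_nme(methylated_allele + unmethylated_allele)
```

Each allele alone has zero entropy, but the pooled reads carry one bit of entropy that reflects allelic identity rather than stochastic methylation.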

NME and CpG density analysis

NME and MML were calculated within 20 kilobase pairs (kb) of each transcription start site (TSS) as described in the Human NME and MML calculation without separating alleles section. We computed the CpG density for each region as (total number of CpG)/(expected number of CpG). The expected number of CpG was calculated in the same way as in the Analyzing the association between NME and SNP section. The correlation between CpG density and NME or MML regardless of the allele was calculated using the two-sided Pearson correlation test.

Human motif analysis

We downloaded the non-redundant CORE motifs from JASPAR (19). The 630 human motifs representing 586 TFs or TF complexes were then mapped to the human genome using CisGenome (21). For each motif, the mapped motif sites were grouped into two classes (regulatory and non-regulatory) based on 167 ENCODE (22) DNase-seq samples. Regulatory DNA is defined as the union set of chromatin accessible sites across the 167 DNase-seq samples which were downloaded from https://github.com/WeiqiangZhou/BIRD-data (23). Briefly, to obtain the regulatory regions in human, the aligned DNase-seq data (alignment based on hg19) from 167 ENCODE samples (representing 74 cell types) were downloaded from https://www.encodeproject.org/ (22). Genomic regions from chromosome Y were excluded. Then, the genome was divided into 250 bp non-overlapping bins. The number of reads mapped to each bin was counted for each DNase-seq sample. To adjust for different sequencing depths, bin read counts for each sample were first divided by the sample's total read count and then scaled by multiplying a constant (minimum total read count from all the samples). Since most genomic loci are noise rather than regulatory elements, we filtered genomic loci to exclude those without strong DNase I hypersensitivity (DH) signal in any training DNase-seq sample. The filtering was done in three steps. First, genomic bins with normalized read count ≤ 20 in all samples were excluded. Second, bins with normalized read count larger than 10 000 in ≥1 sample were considered abnormal and therefore also excluded. Third, a signal-to-noise ratio (SNR) was computed for each bin in each sample, and bins with SNR ≤3 in all samples were considered as noise and filtered out. To compute SNR of a genomic bin in a sample, we first collected 40 bins in the neighborhood of the bin in question. The average DH level of these bins was computed to serve as the background. The log2(SNR) was defined as log2([DH level of a bin]/[background]). 
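The depth normalization and the first two filtering steps described above can be sketched as follows. This is only an illustration: the function names are hypothetical, each sample's total read count is approximated by the sum over the bins supplied, and the SNR step is omitted because the text does not fully specify how the 40 neighboring bins are chosen.

```python
def normalize_bins(counts_by_sample):
    """Divide each bin count by the sample total, then scale by the minimum total."""
    totals = {s: sum(c) for s, c in counts_by_sample.items()}
    scale = min(totals.values())
    return {s: [x / totals[s] * scale for x in c]
            for s, c in counts_by_sample.items()}

def keep_bins(normalized, low=20, high=10_000):
    """Indices of bins passing the low-signal and abnormal-signal filters."""
    n_bins = len(next(iter(normalized.values())))
    kept = []
    for b in range(n_bins):
        values = [sample[b] for sample in normalized.values()]
        if all(v <= low for v in values):
            continue  # no sample shows signal above the floor
        if any(v > high for v in values):
            continue  # abnormally high in at least one sample
        kept.append(b)
    return kept

# Hypothetical read counts for three bins in two samples of different depth.
counts = {"sample_a": [10, 30, 60], "sample_b": [20, 60, 120]}
normalized = normalize_bins(counts)
kept = keep_bins(normalized)
```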
To obtain the regulatory regions in mouse, we downloaded the DNase-seq peak files (mm10) from 72 samples from tissues and developmental stages similar to those of the DNA methylation data. As in human, genomic regions from chromosome Y were excluded. The peak regions were merged from all the samples and divided into 250 bp bins. To get a set of non-regulatory DNA, we used CisGenome (21) to obtain genomic regions that are not located in the regulatory DNA but whose distribution of distance to the TSS is similar to that of the regulatory DNA. For each motif, motif sites that overlap with regulatory DNA were labeled as regulatory motif sites, and motif sites that overlap with non-regulatory DNA were labeled as non-regulatory motif sites. We then characterized the NME of each motif site in regulatory and non-regulatory DNA. The NME regardless of alleles was calculated using the method described in the Human NME and MML calculation without separating alleles section. For each TF, the median value of NME across all motif sites in each of the two motif site classes (i.e. regulatory and non-regulatory) was computed. Based on the median NME for each TF across different samples, we then used a two-sided Wilcoxon signed-rank test to test whether NME is different between the two classes of sites. To adjust for multiple testing, we calculated the false discovery rate (FDR) using the Benjamini–Hochberg procedure (16). Supplementary Figure S2A shows the standardized NME values, where the median NME for each TF was standardized to have zero mean and unit standard deviation across all samples and the two classes of sites. Similarly, we compared the MML between motif sites in regulatory DNA and non-regulatory DNA for each TF. Supplementary Figure S2D shows the standardized MML values based on the median MML for each TF. We also applied a two-sided Wilcoxon signed-rank test to test whether MML is different between the two classes of sites.

Mouse embryonic development data analysis

The aligned WGBS bam files were manually downloaded from the ENCODE3 database. The bam files were sorted using samtools v1.9 and deduplicated using the Picard tools MarkDuplicates function (v2.23.3-4). The uncertainty coefficient (UC) is the Jensen–Shannon divergence (JSD) between the DNA methylation landscapes of two time points, normalized by entropy; it captures both mean methylation changes and methylation entropy differences. We quantified the difference in methylation landscapes between two samples $s_1$ and $s_2$ by computing the uncertainty coefficient (UC)

graphic file with name M00012.gif

where I(X;S) is the mutual information between the methylation state X and the sample S, and it is given by

graphic file with name M00013.gif
graphic file with name M00014.gif

and h(X) is the NME under a model where the sample S has been integrated out. It has been shown that $I(X;S) = D_{JS}(p_1, p_2)$, where $D_{JS}(p_1, p_2)$ is the Jensen–Shannon divergence (JSD), which measures the dissimilarity between the two probability distributions $p_1$ and $p_2$ (8). A large JSD means a large difference in methylation probability distribution between two samples. Given the linear dependency of UC(X;S) on the mutual information I(X;S), large values of UC(X;S) imply a large difference between samples. The NME, MML, differential NME, differential MML and UC were then calculated using the CPEL pipeline for each genomic region (8). Regions with lower than 10× coverage were excluded.
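To make the relationship between mutual information, JSD and UC concrete, the sketch below computes the JSD between two pattern distributions and divides it by the entropy of their equal-weight mixture. The equal weighting of the two samples and this particular normalization are assumptions, since the displayed equations are not reproduced in the text; the function names and toy distributions are hypothetical, and this is not the CPEL pipeline code.

```python
from math import log2

def entropy(pmf):
    """Shannon entropy in bits of a pmf given as {pattern: probability}."""
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

def jensen_shannon_divergence(p1, p2):
    """JSD with equal weights, returned together with the mixture distribution."""
    keys = set(p1) | set(p2)
    mixture = {k: 0.5 * p1.get(k, 0.0) + 0.5 * p2.get(k, 0.0) for k in keys}
    jsd = entropy(mixture) - 0.5 * entropy(p1) - 0.5 * entropy(p2)
    return jsd, mixture

def uncertainty_coefficient(p1, p2):
    """JSD normalized by the entropy of the mixture (assumed normalization)."""
    jsd, mixture = jensen_shannon_divergence(p1, p2)
    h_mixture = entropy(mixture)
    return jsd / h_mixture if h_mixture > 0 else 0.0

# Hypothetical landscapes at two time points for a two-CpG region.
same = {(1, 1): 0.5, (0, 0): 0.5}
uc_identical = uncertainty_coefficient(same, same)
uc_disjoint = uncertainty_coefficient({(1, 1): 1.0}, {(0, 0): 1.0})
```

Under these assumptions the value is 0 for identical landscapes and 1 when the sample label fully determines the methylation pattern.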

For each tissue type, regions with UC >0.1 were first analyzed using unsupervised clustering. Because we found that >28% of them have UC >0.1 in only one tissue, we selected the regions with UC >0.1 only within that tissue (not in other tissues) for at least one pair of time points. UC values for all consecutive pairs of time points (e.g. E10.5 and E11.5, E11.5 and E12.5) were standardized to have mean 0 and variance 1 for each region across consecutive pairs of time points. The standardized values were then used to computationally group regions into 10 UC clusters with k-means clustering. The k-means clustering was performed 10 times with different random seeds. The regions that were assigned to the same cluster in every run were defined as core-cluster regions. The rest of the regions were assigned to a core cluster based on their correlation with the mean UC of that core cluster. The 10 UC clusters were ordered according to which consecutive time points have the largest UC.
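Two steps of this procedure lend themselves to a short sketch: the per-region standardization and the identification of regions that are co-clustered in every k-means run. The code below assumes a population standard deviation and treats cluster labels from repeated runs as given; it does not run k-means itself, and the function names and example values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def standardize(values):
    """Scale one region's UC values to mean 0 and variance 1."""
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]

def co_clustered_groups(label_runs):
    """Group region indices by their tuple of labels across all clustering runs.

    Regions sharing a tuple were placed together in every run, even if the
    label numbers differ from run to run.
    """
    groups = defaultdict(list)
    for region, signature in enumerate(zip(*label_runs)):
        groups[signature].append(region)
    return dict(groups)

# Hypothetical UC values of one region across three consecutive time-point pairs.
z = standardize([1.0, 2.0, 3.0])
# Hypothetical labels for four regions from two clustering runs.
groups = co_clustered_groups([[0, 0, 1, 1], [1, 1, 0, 2]])
```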

We selected UC >0.1 as the cutoff because the top 5% of all UC values is around 0.1. We further tested the other UC cutoffs, 0.025 (∼top 30%), 0.05 (∼top 15%), 0.15 (∼top 3%), 0.2 (∼top 1%), using the same method.

Mouse gene ontology (GO) analysis

Mouse GO analysis was done for each cluster in each tissue. For regions overlapping with enhancers, the target genes were annotated according to the published dataset (24). The genes for which we got complete methylation data in all samples at enhancer regions were used as the background gene list for GO analysis. For each cluster, the regions overlapping with enhancers were annotated to the corresponding gene and were used as target genes for GO analysis. The R package topGO was used to perform GO analysis using the classicFisher method (25). The GO terms with more than 10 annotated genes were retained. For a given GO term, let N denote the number of genes with an enhancer that has methylation landscape change, and let M denote the number of other genes in this GO term. Let Q denote the number of genes not in this GO term but with an enhancer that has methylation landscape change, and let R denote the number of other genes not in this GO term. The association between the GO term and the methylation landscape change is tested by examining N. The GO terms with N = 0 were excluded. The P-values were calculated conditional on the knowledge that the GO terms with N = 0 are excluded. In other words, the probability of observing N = n genes in a GO term whose enhancers have methylation landscape changes conditional on N ≥ 1 is:

graphic file with name M00019.gif

where

graphic file with name M00020.gif

Therefore, the P-value for the observed N is:

graphic file with name M00022.gif

The Benjamini–Hochberg procedure was used to correct p-values for multiple hypothesis testing (16). The GO terms with fold change (FC) >1.5 were reported. The final GO terms for each cluster were ranked by FDR and FC was used to break ties. The top 5 GO terms were selected from each cluster to generate the heatmap. The GO terms for all tissues and clusters were plotted together. The GO terms that do not have FDR ≤0.2 in all tissue-cluster were excluded. The rows in the heatmap were ordered based on which columns show the highest FC.
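The conditional P-value for GO terms can be sketched as a zero-truncated hypergeometric tail. The displayed equations are not reproduced in the text, so the code below is one reading of the description: it assumes the unconditional distribution of N is hypergeometric with fixed margins and conditions on N ≥ 1. The function name and example counts are hypothetical; n, m, q and r follow the N, M, Q and R defined above.

```python
from math import comb

def conditional_go_pvalue(n, m, q, r):
    """P(N >= n | N >= 1) under an assumed hypergeometric null.

    n: genes in the GO term with a changed enhancer
    m: other genes in the GO term
    q: genes outside the GO term with a changed enhancer
    r: other genes outside the GO term
    """
    if n < 1:
        raise ValueError("terms with N = 0 are excluded before testing")
    term_size = n + m
    changed = n + q
    total = n + m + q + r

    def prob(k):
        return (comb(term_size, k) * comb(total - term_size, changed - k)
                / comb(total, changed))

    upper = min(term_size, changed)
    tail = sum(prob(k) for k in range(n, upper + 1))
    # Renormalize by the probability of observing at least one gene.
    return min(1.0, tail / (1.0 - prob(0)))

# Hypothetical GO term of 10 genes in a background of 200 genes, 20 of them changed.
p_three = conditional_go_pvalue(3, 7, 17, 173)
p_one = conditional_go_pvalue(1, 9, 19, 171)
```

Conditioning makes the test more conservative than the unconditional tail, and a term with a single changed gene receives a P-value of 1.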

For regions overlapping promoters, defined as within 2 kb of TSS, the genes were annotated to that TSS and the GO analysis was performed using the same procedure. The genes for which we have complete data in all samples at promoter regions were used as the background list.

We also applied the same analysis to regions defined based on other UC cutoffs (i.e., 0.025, 0.05, 0.15, 0.2).

Categorize regions based on UC-dNME and UC-dMML correlation

For each tissue, we calculated the correlation between NME change and UC (dNME-UC correlation) and the correlation between MML change and UC (dMML-UC correlation). We randomly permuted the labels of the NME/MML differences 20 times, recalculated the correlations between permuted NME/MML and UC as the null distribution, and obtained empirical P-values as the tail areas under the null distribution. FDRs were then calculated based on p-values using the Benjamini–Hochberg procedure (16). Some regions have a high dMML-UC correlation and dNME-UC correlation (Supplementary Figure S3A). To distinguish those regions from the regions that only have high dNME-UC correlation or dMML-UC correlation, we identified different clusters of regions based on the density of dMML-UC correlation and dNME-UC correlation. To that end, we calculated the difference between dNME-UC correlation and dMML-UC correlation (dNME-dMML difference) as well as the mean of two correlations for each region (Supplementary Figure S3B). We binned the mean correlation with a 0.05 interval. For each bin, we search for the first local minimum for the density of dNME-UC and dMML-UC difference below 0 and above 0 (negative local minimum and positive local minimum). The regions with values below the negative local minimum suggested the dMML-UC correlation is larger than the dNME-UC correlation. Thus, among the regions with significant dMML-UC correlation (FDR ≤ 0.2), the regions with dNME-dMML difference below the negative local minimum or have dMML-UC correlation greater than 0 while dNME-UC correlation smaller than 0, were categorized as regions whose UC is predominantly correlated with MML change (predominantly MML-correlated regions), meaning that their UC changes across time can be mainly explained by the MML changes (Supplementary Figure S3C–G). On the other hand, the regions with values above positive local minimum suggested the dNME-UC correlation is larger than the dMML-UC correlation. 
Thus, among the regions with significant dNME-UC correlation (FDR ≤ 0.2), the regions with dNME–dMML difference above the positive local minimum or their dNME–UC correlation is greater than 0 while dMML-UC correlation is smaller than 0, were categorized as regions whose UC is predominantly correlated with NME change (predominantly NME-correlated regions), meaning that their UC changes across time can be mainly explained by the NME changes (Supplementary Figure S3E, F,G). The regions that have values between two local minimums and have significant dNME–UC correlation and dMML-UC correlation (FDR ≤ 0.2) were categorized as regions whose UC is correlated with both NME and MML, meaning that both MML and NME changes contribute to the temporal changes of UC (Both). The regions that have insignificant dMML–UC correlation and dNME–UC correlation were categorized as regions whose UC is independent of both NME and MML (Neither) (Supplementary Figure S3E–G). The one-sided t-test was used to determine if there are more regions whose UC is predominantly correlated with NME change than regions whose UC is predominantly correlated with MML change. We chose a one-sided t-test because visually there are more regions whose UC is predominantly correlated with NME than regions whose UC is predominantly correlated with MML. We did the same analysis using the other UC cutoffs, 0.025, 0.05, 0.15, 0.2.
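The permutation scheme for the dNME-UC and dMML-UC correlations can be sketched as below. The add-one form of the empirical P-value, the fixed random seed and the per-region use of the 20 permutations are assumptions made for this illustration; the function names and the example values are hypothetical.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def permutation_pvalue(changes, uc, n_perm=20, seed=0):
    """Observed correlation and an upper-tail empirical P-value from label shuffling."""
    rng = random.Random(seed)
    observed = pearson(changes, uc)
    shuffled = list(changes)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pearson(shuffled, uc) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)

# Hypothetical dNME and UC values of one region across five time-point pairs.
d_nme = [0.1, 0.2, 0.3, 0.4, 0.5]
uc = [0.05, 0.10, 0.15, 0.20, 0.30]
observed_r, p_value = permutation_pvalue(d_nme, uc)
```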

Mouse motif analysis

We downloaded 736 human and mouse motifs from JASPAR and mapped the motifs to the mouse genome using CisGenome (21). Enrichment of each motif was calculated as the ratio between the odds of motif sites in the target regions (i.e. [number of target regions that contain motif sites]/[number of target regions that do not contain motif sites]) and the odds of motif sites in control regions (i.e. [number of control regions that contain motif sites]/[number of control regions that do not contain motif sites]). The control regions, which have a distribution of distance to TSS similar to that of the target regions, were obtained using CisGenome (21). A one-sided Fisher's exact test was applied to test whether the motif is significantly enriched in the target regions. Multiple testing was adjusted for by converting P-values to FDRs using the Benjamini–Hochberg procedure (16). To compare the enrichment of each motif between regions where the methylation landscape change (i.e. UC) is predominantly correlated with NME and regions where UC is predominantly correlated with MML, we fitted a linear regression model to the normalized log odds ratio of motif enrichment from the two types of regions. The normalized log odds ratio was calculated as the log odds ratio divided by its standard error. We identified TF motifs that were outside the 75% prediction interval and were significantly enriched in either region type (FDR ≤ 0.1). TF motifs that were more enriched in regions where UC is predominantly correlated with NME were marked with blue color and defined as ‘entropy-associated motifs’, and TF motifs that were more enriched in regions where UC is predominantly correlated with MML were marked with red color and defined as ‘mean-associated motifs’.
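
The enrichment calculation above can be illustrated with a short, self-contained sketch. The function names and the example counts are hypothetical. The standard error of the log odds ratio is taken to be the usual large-sample (Woolf) formula, which is an assumption because the text does not state which estimator was used.

```python
from math import comb, log, sqrt


def odds_ratio(target_with, target_without, control_with, control_without):
    """Odds of motif sites in target regions divided by the odds in control regions."""
    return (target_with / target_without) / (control_with / control_without)


def fisher_one_sided(target_with, target_without, control_with, control_without):
    """One-sided Fisher's exact test for enrichment in the target regions.

    Returns P(X >= target_with), where X is hypergeometric with the table
    margins held fixed.
    """
    n_target = target_with + target_without
    n_with = target_with + control_with
    n_total = n_target + control_with + control_without
    upper = min(n_target, n_with)
    numerator = sum(
        comb(n_with, k) * comb(n_total - n_with, n_target - k)
        for k in range(target_with, upper + 1)
    )
    return numerator / comb(n_total, n_target)


def normalized_log_odds_ratio(a, b, c, d):
    """Log odds ratio divided by its standard error (Woolf formula, an assumption)."""
    standard_error = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log(odds_ratio(a, b, c, d)) / standard_error


# Hypothetical counts: 3 of 4 target regions and 1 of 4 control regions contain the motif.
print(odds_ratio(3, 1, 1, 3))
print(fisher_one_sided(3, 1, 1, 3))
```

Exact binomial coefficients keep the sketch dependency-free; for genome-scale counts a log-space or library implementation would be more practical.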

We also compared the entropy-associated TF motifs combined from all tissue types with TF motifs preferring high NME in the allele-specific analysis in human. By performing a one-sided Fisher's exact test, we found that compared to mean-associated TF motifs, entropy-associated TF motifs in mouse have significantly higher overlap with the human high NME-allele preferring motifs. A one-sided test was used because our hypothesis was that the entropy-associated TF motifs in mouse are more enriched for these human motifs than the mean-associated TF motifs. Similar to the analysis in humans, we studied the relationship between NME and regulatory DNA in the mouse. For 736 transcription factor binding motifs, we compared, in each mouse sample, the NME between motif sites located in regulatory DNA, as defined by a union set of mouse DNase I hypersensitive sites, and motif sites located in non-regulatory DNA (obtained as in the human analysis).

Single-cell RNA-seq data processing

Human single-cell RNA-seq raw count data were downloaded from Human Cell Landscape (26) (Microwell-seq platform). Cells with at least 500 expressed genes with non-zero read counts were retained. The read counts were normalized by library size and gene expression values were imputed using SAVER (27) to address the high sparsity.

Mouse single-cell RNA-seq processed log2-transformed FPKM data were downloaded from ENCODE (28) (Fluidigm C1 SMART-seq platform). The log2-transformed expression matrix was transformed back to the original scale and imputed using SAVER (27).

For both human and mouse data, SAVER imputed values were then log2-transformed. Genes with non-zero expression in at least 10% of all cells were retained, and all ribosomal genes were removed. The processed data were used for the subsequent gene expression variability analysis.

Calculation of gene expression mean-adjusted variability (MAV)

Let yij be the library-size normalized, imputed, and log2-transformed expression level for gene i (i = 1, …, I) and cell j (j = 1, …, J). Let mi and si be the mean and standard deviation of the expression level for gene i across all cells respectively. A B-spline regression model was fitted across all genes where si is the response variable and mi is the independent variable, and let ŝi be the fitted values of the standard deviations. The gene expression mean-adjusted variability (MAV) of gene i, hi, is defined as the residual of the regression model, or equivalently the difference between observed and fitted standard deviation: hi = si − ŝi.
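
For reference, the MAV definition above can be written out as follows. The basis functions B_k, their number K and the coefficients β_k are notation introduced here for illustration, since the text does not specify the knots or degree of the B-spline. The J − 1 denominator in the standard deviation is likewise an assumption.

```latex
\begin{aligned}
m_i &= \frac{1}{J}\sum_{j=1}^{J} y_{ij}, &
s_i &= \sqrt{\frac{1}{J-1}\sum_{j=1}^{J}\left(y_{ij}-m_i\right)^{2}}, \\
s_i &= \sum_{k=1}^{K}\beta_k B_k(m_i) + \varepsilon_i, &
\hat{s}_i &= \sum_{k=1}^{K}\hat{\beta}_k B_k(m_i), \\
h_i &= s_i - \hat{s}_i, & i &= 1,\dots,I.
\end{aligned}
```

A positive h_i indicates a gene that is more variable across cells than expected for its mean expression level, and a negative h_i a gene that is less variable.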

In addition, we ran the BASiCS package (29), which uses a Bayesian hierarchical model to calculate residual overdispersion, a mean-corrected measure of gene expression variability, on the same datasets. We found that BASiCS and MAV provide similar results in both human (the Pearson correlation between NME and results from BASiCS near TSS is 0.21, the Pearson correlation between NME and MAV near TSS is 0.23, Supplementary Figure S4) and mouse (the Pearson correlation between NME and results from BASiCS near TSS is 0.11, the Pearson correlation between NME and MAV near TSS is 0.11). However, BASiCS requires spike-in or replicated samples (29), and most of our samples have neither. Thus, we chose to use our MAV calculation over BASiCS.

RESULTS

Methylation entropy can depend on DNA sequences

To identify DNA sequences specifically associated with methylation entropy, we applied our recently developed information-theoretic method for allele-specific methylation analysis (8) to 49 human samples from the Roadmap Epigenomics Project (9). This approach allows us to rigorously analyze genetic sequence-driven differences in methylation in the exact same cellular and tissue context. We used a methylation potential energy landscape model that considers all potential methylation states and cooperative interactions between adjacent sites, and adheres to the rigorous definition of Shannon entropy (8) (see Materials and Methods). In Figure 2, we illustrate four sets of sequencing reads, where the two alleles are distinguished by a single nucleotide polymorphism (SNP), from which we calculated mean methylation level (MML) and normalized methylation entropy (NME), i.e. normalized for the number of methylatable CpG sites. On the top are two genes, PLAG1 and KCNQ1, showing large mean differences, where one allele is much more methylated overall than the other. Explaining the large methylation difference between alleles, both PLAG1 and KCNQ1 are imprinted genes, i.e. with parent-of-origin-specific expression (30,31), consistent with the known role of methylation in silencing imprinted loci on one allele (32,33). Indeed, we found that regions with large mean allelic methylation difference were enriched in imprinted genes (Supplementary Figure S5, Fisher's exact test, OR = 60.5, 95% CI = [47.3–76.9], P < 2.2 × 10−16, Materials and Methods). Note that allelic differences of methylation in imprinted genes are not related to the underlying DNA sequence, the focus of the present study, because what is a maternal allele in one generation can be a paternal allele in the next.
In contrast, at the bottom of Figure 2 are examples of large sequence-driven differential methylation entropy between individual alleles at ASPG and TMC4, with much smaller differences in mean methylation levels than the examples on top. Note that previously sequence-independent entropy was studied in a way that does not separate two alleles, and such analysis identified imprinted genes as in our examples in Figure 2 as if they had high methylation entropy even though they may not at allelic level (11). For example, for regions with the allelic difference in mean methylation like those in the top panel of Figure 2, if one mixes the two alleles, then the entropy may appear to be higher in the allelic mixture, e.g. 0.28 for PLAG1, than in the individual alleles, but at the sequence level, this increase in entropy is actually due to parent of origin-specific imprinting. For each allele, the entropy is low. In examples such as the bottom of Figure 2, the differences in entropy are specific to the DNA sequence (i.e. the entropy is high within an allele), which is our focus as we wish to understand the underlying sequence drivers of entropy. Among the 3 332 744 regions containing heterozygous SNPs we analyzed, 29 681 exhibited significant allele-specific differential NME (FDR ≤ 0.1), but only 6807 regions exhibited significant differential MML (FDR ≤ 0.1). Among those significant regions, 28 863 regions showed significant dNME without significant dMML and 5989 regions showed significant dMML without significant dNME. Only 818 regions showed both significant dMML and dNME. This suggests that the methylation landscape change for a large proportion of regions can only be detected by dNME but not dMML. Indeed, we found that dNME and dMML are not highly correlated, with a median R2 of 0.07.
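
The effect of mixing alleles on apparent entropy can be illustrated with a toy calculation. The sketch below is a deliberate simplification: it estimates Shannon entropy directly from the frequencies of read-level methylation patterns and divides by the number of CpG sites, whereas the analysis in this paper uses a potential energy landscape model (8). The function name and the example reads are hypothetical.

```python
from collections import Counter
from math import log2


def normalized_entropy(reads):
    """Shannon entropy (in bits) of read-level methylation patterns,
    divided by the number of CpG sites covered by each read.

    Each read is a string of '1' (methylated) and '0' (unmethylated),
    and all reads are assumed to cover the same CpG sites.
    """
    n_sites = len(reads[0])
    total = len(reads)
    counts = Counter(reads)
    entropy = sum(-(c / total) * log2(c / total) for c in counts.values())
    return entropy / n_sites


# Toy imprinted-like locus with four CpG sites: one allele fully methylated,
# the other fully unmethylated.
allele_1 = ["1111"] * 10
allele_2 = ["0000"] * 10

print(normalized_entropy(allele_1))             # 0.0
print(normalized_entropy(allele_2))             # 0.0
print(normalized_entropy(allele_1 + allele_2))  # 0.25
```

Each allele on its own has zero entropy, but pooling the reads yields a non-zero value that reflects only the mean difference between the alleles, which is why the alleles are analyzed separately here.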

Figure 2.

Examples of allele-specific methylation mean and entropy analysis. Four example regions with large methylation mean difference (top panel) or large methylation entropy difference (bottom panel) between two alleles. For each region, the continuous grey horizontal lines represent individual sequencing reads. The brown vertical lines are the location of the SNPs used to distinguish two alleles. The red dots represent methylated CpG sites, and the blue dots represent unmethylated CpG sites. The top panel shows two example regions located at two known imprinted genes, PLAG1 and KCNQ1. Consistent with the known association between imprinting and promoter methylation, in one allele almost all CpG sites are methylated and in the other allele almost all CpG sites are unmethylated. The lower panel shows two genes, ASPG and TMC4, with high methylation entropy differences inside the gene. For ASPG, the allele with genotype A has low entropy (0.13) as indicated by most of the reads having the same methylation status. The allele with the genotype of G has high entropy (0.79) as indicated by the highly stochastic color pattern. For TMC4, the allele with the genotype of C has low entropy (0.15), while the allele with the genotype of T has high methylation entropy (0.86) as indicated by the highly variable methylation patterns across reads.

Entropy-associated DNA sequences are associated with predicted transcription factor binding sites

We next scanned the nucleotide sequence around each SNP for transcription factor binding sites (TFBS) based on motifs from the JASPAR database (19) and computed the binding probability in each allele as described (34). We identified 129 motifs with higher binding probability in the allele with a lower mean methylation level than the other allele, and only seven motifs with higher binding probability in the allele with a higher mean methylation level than the other allele (Supplementary Dataset S1). Supporting the validity of our approach, CTCF was ranked fifth among the 129 motifs associated with low methylation levels (Supplementary Dataset S1, Materials and Methods), consistent with the known observation that the allele with higher CTCF binding probability shows lower methylation (35). Similarly, for NFI family members, including NFIC, NFIB and NFIX, 89–93% of motif sites showed a lower mean methylation level in the allele with higher binding probability than in the other allele (Supplementary Dataset S1), consistent with the observation that the NFI family proteins are enriched at demethylated sites during neural development (36).

Using a similar approach, we then asked what motifs are associated with allelic differences in methylation entropy. We found 135 motifs with higher binding probability in the allele with significantly higher methylation entropy than the other allele (binomial test, FDR ≤ 0.1; Supplementary Dataset S2), compared to 14 motifs showing decreased binding probability in the allele with significantly higher methylation entropy at the same FDR level (binomial test, FDR ≤ 0.1; Supplementary Dataset S2). Although dMML and dNME are not strongly linearly related, there is a slight negative correlation between dNME and dMML (median Pearson correlation = −0.18). Thus, we examined the overlap between high entropy-associated motifs and low MML-associated motifs. Among these 135 higher entropy-associated motifs, 55 were not associated with low MML, meaning they cannot be detected using MML (Table 1, Supplementary Dataset S3). Gene ontology (GO) enrichment analysis (37) of the transcription factors associated with these 55 motifs shows enrichment for positive regulation of cell development (enrichment = 5.43, P = 4.72 × 10−5) and positive regulation of cell differentiation (enrichment = 3.32, P = 7.71 × 10−4). Furthermore, several of the transcription factors associated with these motifs are pioneer transcription factors, i.e. factors that open up condensed chromatin, including ASCL1, PBX1, MEIS1, ATF4, ESRRB and KLF4 (38–43). In contrast, low MML-associated motifs not associated with high NME showed no GO enrichment of the associated transcription factors.
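
The per-motif testing described above can be sketched as follows. For each motif, the number of sites at which the allele with higher binding probability also has higher NME is compared against a null proportion of 0.5 with an exact binomial test, and the resulting P-values are converted to FDRs with the Benjamini–Hochberg procedure. The null proportion of 0.5, the two-sided alternative, the function names and the example counts are all assumptions made for illustration.

```python
from math import comb


def binom_two_sided(k, n, p=0.5):
    """Exact two-sided binomial P-value: the total probability of all
    outcomes that are no more likely than the observed count k."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P-values (FDRs), in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical motifs: (sites where the higher-binding allele has higher NME, total sites).
counts = {"motif_A": (9, 10), "motif_B": (6, 10), "motif_C": (10, 10)}
pvals = [binom_two_sided(k, n) for k, n in counts.values()]
for name, fdr in zip(counts, benjamini_hochberg(pvals)):
    print(name, round(fdr, 4))
```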

Table 1.

The FDR is calculated from the P-values of the binomial test (P-values, 95% CI and effect size are in Supplementary Dataset S3). GO enrichment analysis of the ranked list of transcription factors whose motifs are associated only with high NME identified positive regulation of cell development (enrichment = 5.43, P = 4.72 × 10−5) and positive regulation of cell differentiation (enrichment = 3.32, P = 7.71 × 10−4). The same GO enrichment analysis of the ranked list of transcription factors whose motifs are associated only with low MML identified no significantly enriched term at a P-value cutoff of 0.001.

Columns 1–4: high NME-associated motifs not associated with low MML. Columns 5–8: low MML-associated motifs not associated with high NME.

| TF (high NME) | FDR | TF (high NME) | FDR | TF (low MML) | FDR | TF (low MML) | FDR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MYOG | 4.14E-05 | TAL1::TCF3 | 4.75E-02 | LEF1 | 2.27E-05 | ELK1 | 1.65E-02 |
| SOX8 | 4.83E-05 | ATOH1(var.2) | 4.92E-02 | ERF | 2.00E-04 | SOX9 | 1.65E-02 |
| ZNF148 | 2.33E-04 | SOX15 | 4.92E-02 | SP1 | 3.20E-04 | TCF21(var.2) | 1.65E-02 |
| ASCL1 | 3.56E-04 | PBX2 | 4.92E-02 | JUN::JUNB | 4.06E-04 | ZBTB7A | 1.77E-02 |
| SOX10 | 5.73E-04 | PRDM4 | 4.92E-02 | JDP2 | 4.51E-04 | LHX9 | 2.11E-02 |
| MYF5 | 6.09E-04 | FOXK1 | 5.12E-02 | ELF5 | 8.87E-04 | MSGN1 | 2.19E-02 |
| FOXG1 | 6.52E-04 | TGIF2 | 5.25E-02 | ETV5 | 1.05E-03 | CEBPB | 3.05E-02 |
| CEBPA | 1.25E-03 | RARA::RXRA | 5.70E-02 | HNF4A(var.2) | 1.11E-03 | CEBPE | 3.05E-02 |
| ZNF682 | 3.36E-03 | MEIS1(var.2) | 5.96E-02 | SOX2 | 1.91E-03 | HMBOX1 | 3.73E-02 |
| MITF | 4.43E-03 | MSC | 6.96E-02 | FOXA2 | 2.37E-03 | RXRB | 4.22E-02 |
| BHLHA15(var.2) | 4.82E-03 | ESRRB | 6.96E-02 | KLF11 | 3.40E-03 | TCF7L1 | 4.68E-02 |
| MAFF | 5.25E-03 | PKNOX1 | 6.96E-02 | KLF2 | 3.45E-03 | SP2 | 5.27E-02 |
| ZNF263 | 5.26E-03 | ZNF317 | 6.99E-02 | KLF3 | 3.45E-03 | TFAP4(var.2) | 6.02E-02 |
| KLF4 | 5.46E-03 | JUND(var.2) | 7.17E-02 | KLF15 | 4.89E-03 | RFX2 | 6.02E-02 |
| TGIF1 | 8.00E-03 | ATF2 | 7.20E-02 | TCF7 | 6.62E-03 | OTX2 | 7.38E-02 |
| RORA | 8.19E-03 | TEAD4 | 7.49E-02 | TCF7L2 | 7.94E-03 | GRHL2 | 7.38E-02 |
| PBX1 | 1.19E-02 | EWSR1-FLI1 | 8.33E-02 | SP9 | 8.41E-03 | FOXF2 | 8.40E-02 |
| FOXD2 | 1.23E-02 | JUN | 8.60E-02 | HNF4A | 1.00E-02 | ZNF410 | 8.67E-02 |
| KLF10 | 1.24E-02 | NHLH1 | 8.61E-02 | HNF4G | 1.00E-02 | TWIST1 | 8.99E-02 |
| RORB | 1.39E-02 | SOX14 | 8.61E-02 | IRF1 | 1.27E-02 | KLF14 | 9.73E-02 |
| FERD3L | 1.49E-02 | PKNOX2 | 8.61E-02 | GBX1 | 1.33E-02 | CEBPG | 9.76E-02 |
| RORC | 1.68E-02 | RARA::RXRG | 8.66E-02 | SP4 | 1.47E-02 | | |
| TBX4 | 1.70E-02 | POU2F1 | 8.66E-02 | ZBTB7B | 1.54E-02 | | |
| ZKSCAN5 | 2.70E-02 | TEAD2 | 9.04E-02 | FOXD1 | 1.54E-02 | | |
| FOXB1 | 3.29E-02 | POU2F2 | 9.23E-02 | FOXI1 | 1.54E-02 | | |
| POU3F4 | 3.81E-02 | PBX3 | 9.28E-02 | FOXO3 | 1.54E-02 | | |
| ATF4 | 4.18E-02 | PRRX2 | 9.42E-02 | FOXO6 | 1.54E-02 | | |
| POU1F1 | 4.57E-02 | | | NFATC4 | 1.65E-02 | | |

Given that transcription factors generally bind to regulatory DNA to regulate their target genes (44), we then compared NME at regulatory DNA vs. non-regulatory DNA, defined using DNase I hypersensitive site sequencing data (see Materials and Methods). We identified 583 out of 630 motifs whose motif sites in regulatory DNA showed significantly higher NME than those in non-regulatory DNA (Wilcoxon's signed-rank test, FDR < 0.05; Supplementary Figure S2A, B, and Supplementary Dataset S4). In contrast, the motif sites in regulatory DNA showed lower MML than those in non-regulatory DNA for all motifs (Wilcoxon's signed-rank test, FDR < 0.05; Supplementary Figure S2D, E). These data also show that NME changes and MML changes between regulatory and non-regulatory regions do not always co-occur. GO analysis of transcription factors for motifs with higher NME in regulatory DNA was also highly enriched for developmental categories (Supplementary Dataset S5). Examples include the LHX5 motif (Supplementary Figure S2C, left panel), which regulates neuronal differentiation and dendritogenesis of Purkinje cells (45,46), and the SRF motif (Supplementary Figure S2C, right panel), which is required for vascular smooth muscle cell differentiation (47). Low MML was also associated with regulatory DNA, as expected (Wilcoxon's signed-rank test, FDR < 0.05; Supplementary Dataset S6 and Supplementary Figure S2D, E), e.g. CTCF (48) and NRF1 (49) (Supplementary Figure S2F).

Entropy is inversely related to CpG density

We then performed allele-specific analysis to explore how normalized methylation entropy (NME) is related to DNA sequence other than transcription factor binding sites per se, by comparing allele-specific entropy to the trinucleotide DNA sequence contexts containing the given CpG dinucleotide (Figure 3A, Supplementary Figure S6; see Materials and Methods). This analysis revealed several such contexts, but the strongest effects were seen where the SNP results in a lost CpG in one allele: the allele with fewer CpGs was 2.1-fold more likely to show higher NME than the alternate allele (Figure 3A, Fisher's exact test, 95% CI [2.0–2.2], P < 2.2 × 10−16). Comparing the number of CpG sites between the two alleles across all regions with significant allele-specific methylation entropy differences, the allele with more CpGs showed an average decrease in NME of 0.24 (Figure 3B, one-sided paired t-test, P < 2.2 × 10−16), supporting the observation that lower CpG number was associated with higher NME. We further examined NME across the genome within 20 kb of transcriptional start sites, regardless of allele, to include regions lacking SNPs. Here as well, there was an inverse correlation between NME and CpG density (Figure 3C, top panel, Pearson correlation test, correlation = −0.21, 95% CI = [−0.21, −0.21], P < 2.2 × 10−16; see Materials and Methods). We also observed a negative correlation between the mean methylation level and CpG density (Figure 3C, bottom panel, Pearson correlation test, correlation = −0.35, 95% CI = [−0.35, −0.35], P < 2.2 × 10−16). MML showed a more abrupt decrease at high CpG density than did NME (Figure 3C), consistent with the known compartmentalization of CpG islands.
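
A confidence interval whose two bounds round to the same value, such as [−0.21, −0.21] above, is what one would expect when the number of regions is very large. The sketch below illustrates this with the Fisher z-transformation, a standard way to obtain a confidence interval for a Pearson correlation. Whether this exact method was used here is an assumption, and the sample size in the example is hypothetical.

```python
from math import atanh, sqrt, tanh


def pearson_ci(r, n, z_crit=1.959963984540054):
    """Approximate 95% confidence interval for a Pearson correlation r
    estimated from n observations, using the Fisher z-transformation."""
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)


# Hypothetical sample sizes: the interval narrows as n grows.
for n in (100, 10_000, 1_000_000):
    low, high = pearson_ci(-0.21, n)
    print(n, round(low, 2), round(high, 2))
```

With one million observations both bounds round to −0.21, so the reported interval reflects the size of the dataset rather than a formatting error.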

Information-theory based analysis reveals a developmental role of entropy

To explore the developmental significance of methylation information content, we analyzed mouse prenatal whole-genome bisulfite sequencing (WGBS) ENCODE3 data from seven tissues at sequential embryonic developmental time points (E10.5–E16.5) (10), examining 46 samples with data for 1035 pairwise sample comparisons at each of 5 017 785 genomic regions. We did not include postnatal time points since the environment of the animal is drastically different after birth, which could directly affect the epigenetic landscape, supported as well by the lack of a consistent developmental trajectory when a postnatal timepoint was included (Supplementary Figure S7). For each pairwise comparison, we calculated the information-theoretic distance between two samples’ methylation potential energy landscape, quantified by the uncertainty coefficient (UC). UC measures the inherent information in a sample distinct from another sample, corrected for entropy (mathematically defined in the Methods).
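
As a quick check on the numbers above, 46 samples give 46 × 45 / 2 = 1035 unordered sample pairs. A minimal sketch, with placeholder sample names:

```python
from itertools import combinations
from math import comb

# Placeholder sample identifiers; the real samples are tissue/time-point combinations.
samples = [f"sample_{i}" for i in range(1, 47)]

# Every unordered pair of distinct samples is one pairwise comparison.
pairs = list(combinations(samples, 2))

print(len(pairs))
print(len(pairs) == comb(46, 2))
```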

Multidimensional scaling (MDS) of all 1035 comparisons across the entire genome showed that samples from the same tissue were grouped together, while samples from different tissues were separated (Figure 4A, top). Dissimilar tissue types, such as limb and liver, were separated from each other to a greater degree than similar tissue types, such as hindbrain, midbrain, and forebrain (Figure 4A top). Furthermore, MDS showed that samples from the same tissue type were ordered based on developmental stage, with samples from earlier stages being closer to a common center across tissues, and samples from later stages being away from the center, progressing in different directions for different tissue types (Figure 4A, bottom). Thus, the tissue-specific developmental trajectories can be derived from the information-theoretic content of DNA methylation.

Figure 4.

Information-theoretic analysis of methylation landscape in mouse embryonic development. (A) Multi-dimensional scaling (MDS) plot of mouse embryonic samples from different tissues and developmental time points based on samples’ pairwise distance using uncertainty coefficient (UC). Samples (dots) are color-coded either by tissues (top) or by developmental time points (bottom). Samples from the same tissue or similar tissues tended to be clustered together. The MDS also captured samples’ temporal ordering, with samples from earlier developmental time points being closer to a common center (labeled by the red dot in the bottom plot) and samples from later time points gradually moving away from the center toward different directions representing different tissues. (B) Heatmap of UC values in mouse embryogenesis revealed tissue- and developmental stage-specific DNA methylation landscape changes. Regions with high UC in only one tissue were grouped into 10 clusters, which were ordered according to the consecutive time points with the highest UC. The clustering reflected the temporal cascade of methylation landscape changes in that tissue. High UC (red) means large methylation change (including mean and entropy changes) and low UC (blue) means small change. The temporal patterns were tissue-specific, as patterns observed in one tissue cannot be observed in other tissues. (C) Gene ontology (GO) analysis demonstrates the biological significance of UC-based developmental clustering. For each tissue and region cluster identified in (B), enriched biological functions associated with enhancer regions in the cluster are identified and the top 5 GO terms with largest fold-enrichment are shown (rows: GO terms; columns: tissues and region clusters; ‘*’FDR ≤ 0.1). Large fold-enrichments were labeled in blue while small fold-enrichments were labeled in green. The enriched GO terms revealed tissue-specific functions. 
(Note: although GO terms in cluster 6, hindbrain have high fold-enrichments in other tissues, they are not significant.) The GO terms in the heart followed the developmental order, i.e. we observed atrial septum and cardiac atrium development in early stages and heart valve development in late stages. Similarly, in the forebrain, we saw the GO terms related to neuron development in early stages and brain subdivision formation in later stages. (D) Bar plot showing the relative contribution of MML and NME to the methylation landscape changes as characterized by UC. In each tissue, regions were classified into four categories based on whether a region's UC change across time points was predominantly correlated with the change in MML (red), NME (blue), both (green), or uncertain (purple). The proportion of each category among all regions that have high UC in at least one pair of time points is shown. In all tissues, there were more regions where UC is predominantly correlated with NME change (blue) than regions where UC is predominantly correlated with MML change (red), suggesting a larger role of methylation entropy than mean methylation in shaping the dynamic changes of the methylation landscape. (E) Gene ontology (GO) analysis demonstrates the biological significance of regions whose UC is predominantly correlated with NME, using the same method as in (C). All tissues show tissue-specific GO terms from those regions whose UC is predominantly correlated with NME. (F) Gene ontology (GO) analysis demonstrates the biological significance of regions whose UC is predominantly correlated with MML. Some tissues, such as EFP, forebrain, limb and midbrain, show tissue-specific GO terms.

We then performed clustering analysis within each tissue in order to determine whether there was a specific set of genomic regions undergoing methylation landscape changes at each developmental stage, and if so, whether those time-specific clusters were also specific to the given tissue in which they were observed. To do this, for each tissue, we performed clustering analysis on all pairs of adjacent time points from earliest to latest, analyzing 2 085 884 regions showing large methylation landscape change (defined by UC > 0.1 in at least one tissue between any two time points, see Materials and Methods, Supplementary Figure S8) for each comparison. The ordering of the regions in the clustering heatmap recaptured the developmental process, confirming that regions with high UC capture developmental change (Supplementary Figure S8). The temporal patterns were tissue-specific, as clustering patterns observed in one tissue were not observed in other tissues. To explore the background signal in Supplementary Figure S8, we also measured the proportion of regions with high UC differences between any time point pair, by the number of tissues in which it was observed, and this was bimodal with the highest peak for unique tissues (Supplementary Figure S9). These tissue-specific regions with high UC also captured time-dependent developmental changes (Figure 4B). Furthermore, these developmental clusters in a given tissue were almost entirely specific to that tissue (Figure 4B). UC > 0.1 corresponded to the top 5% of all UC values in the entire dataset. Using other UC cutoffs from 0.025 to 0.2, we still observed the tissue-specific clustering patterns (Supplementary Figure S10, see Materials and Methods).

Next, we asked whether the genes linked to these developmental- and tissue-specific regions are functionally related to those tissues by GO analysis. We mapped the regions to regulatory elements of specific genes, including promoters (within 2 kb of the transcriptional start site), and enhancers as described (24). Enhancers showed substantially larger changes in both NME and MML than promoters (Supplementary Figure S11), so we focused initially on enhancer regions (Figure 4C, Supplementary Figure S12 and Supplementary Dataset S7). The GO terms identified in this manner were strikingly tissue-specific as well as specific to each temporal cluster, and highly related to the development function of each tissue as well. For example, GO terms related to embryonic facial prominence (EFP), such as positive regulation of chondrocyte differentiation, showed large and significant enrichment only in EFP and not in other tissues; and GO terms related to liver, such as monosaccharide biosynthetic process, showed large and significant enrichment only in liver and not in other tissues (Figure 4C). Moreover, in some tissues, the temporal order of GO terms also reflected the temporal developmental program (forebrain and heart, see Figure 4C). Using the same method with other UC cutoffs from 0.025 to 0.2, we still found significant tissue- and time-specific GO enrichment (Supplementary Figure S13).

Although our focus was on enhancer regions, as explained above, we also analyzed the genes whose promoters showed significant methylation landscape change. We found that those genes were enriched in fewer tissue-specific terms than genes with methylation landscape changes in enhancer regions (Supplementary Figure S14). This was consistent with the finding that enhancers showed larger NME and MML changes than promoters (Supplementary Figure S11), confirming that epigenetic landscape changes at enhancers were more strongly related to tissue-specific development.

UC shows comparable or stronger correlation with entropy compared to mean methylation

UC measures methylation landscape changes between samples and captures both mean methylation changes and methylation entropy changes. To determine the relative contribution of mean methylation and methylation entropy change to the developmental landscape distance as measured by UC, we next calculated the correlations of MML change and of NME change with UC, classifying methylation changes at each region as predominantly correlated with MML, NME, both, or neither (see Materials and Methods). In the regions with UC > 0.1 between developmental time points, UC was predominantly correlated with MML change in 2–14% of regions, depending on the tissue, with NME change in 22–43%, with both MML and NME change in 18–36% and with neither in 19–34% (Figure 4D). These data suggested that in all seven tissues, the contribution of NME change to UC is at least comparable to, if not larger than, the contribution of MML change to UC. Even using other UC cutoffs from 0.025 to 0.2, we still observed a similar trend in development (Supplementary Figure S15, Supplementary Text).

To understand the biological function of the regions in the different categories shown in Figure 4D, we performed a GO annotation analysis on the genes with enhancers in the four UC-correlation categories described above. Predominantly NME-correlated regions have a slightly larger number of enriched GO terms for tissue-specific functions compared to predominantly MML-correlated regions (Figure 4E, F, Supplementary Figure S16A, B and Supplementary Dataset S8 and S9). Given that these two sets of regions are non-overlapping, this shows that NME and MML each contain unique information about tissue-specific functions not captured by the other. There was also substantial enrichment for tissue-specific functions in UC regions correlated with both mean and entropy change (Supplementary Figure S16C, and Supplementary Dataset S10), and UC also captured some information independent of both mean and entropy change (Supplementary Figure S16D, and Supplementary Dataset S11). We also annotated regions based on whether they are linked to tissue-specific genes, and we found that the predominantly NME-correlated regions and the predominantly MML-correlated regions covered similar numbers of tissue-specific genes (Supplementary Text). These results demonstrated that methylation entropy contains at least as much unique information about the developmental epigenetic landscape as does mean methylation. Thus, analyzing only mean methylation prevents one from identifying a substantial fraction of DNA methylation landscape changes characterized by the information-theoretic measure UC.

Developmental entropy and mean are associated with different transcription factor binding sites

Given the results above showing that mean and entropy were associated with different transcription factor motifs in human, and also associated with different genomic regions based on embryonic development, we asked whether different transcription factor motifs are associated with mean- or entropy-related developmental methylation changes in the mouse. We compared regions where UC was predominantly correlated with NME change to regions where UC was predominantly correlated with MML change (Figure 5, see Materials and Methods). Different sets of motifs were found to be associated with entropy versus mean. In total, we identified 344 entropy-associated motifs and 324 mean-associated motifs (Supplementary Dataset S12). Figure 5 shows the enrichment of each motif in the regions where UC was predominantly correlated with NME change versus the regions where UC was predominantly correlated with MML change in all seven tissues we analyzed. For example, KLF family motifs such as KLF4 and KLF5 are among the entropy-associated motifs that appear in multiple tissue types, including heart, forebrain, limb, EFP, midbrain, hindbrain, and liver (Figure 5A–G, Supplementary Dataset S12). KLF4 is known to be a key transcription factor in embryonic development and is also one of the four transcription factors used to generate induced pluripotent stem cells (iPSCs) (50). Among mean-associated motifs, NFI family motifs such as NFIC, NFIX, and NFIB were identified in multiple tissues, including heart, limb, EFP, hindbrain, and liver (Figure 5A, C, D, E, F, and Supplementary Dataset S12). Mutations in Nfi family members are associated with brain development defects (51). Using entropy-associated motifs, we annotated enhancers with entropy changes as well as their target genes (Table 2, Supplementary Dataset S13, and Supplementary Figure S17), and similarly for enhancers with mean methylation changes (Supplementary Dataset S14).
For instance, Celsr2, which has been shown to regulate forebrain wiring (52), showed a large entropy change between E12.5 and E13.5 at its enhancer, which is also the KLF4 transcription factor binding site (Supplementary Figure S17) (50). Similarly, Gata5 plays a critical role in heart development and Gata5 knockout mice develop bicuspid aortic valve (53). Gata5 showed a large entropy change between E11.5 and E12.5 at its enhancer, which is also the GLI3 binding site (Supplementary Figure S17) (50). These data suggest that entropy-associated transcription factor binding may play an important role in modulating enhancer function in development.

Figure 5.

Association of methylation entropy changes with transcription factor binding motifs. Scatterplots comparing the enrichment level of each TF motif in regions where DNA methylation changes were predominantly correlated with NME change versus the enrichment level of the same motif in regions where methylation changes were predominantly correlated with MML change. The analysis was run for (A) embryonic facial prominence (EFP), (B) forebrain, (C) heart, (D) hindbrain, (E) limb, (F) liver, (G) midbrain, in the mouse embryonic development dataset. For each tissue, we identified motifs that were more enriched in the regions where UC was predominantly correlated with NME change compared to regions where UC was predominantly correlated with MML change (entropy-associated motifs, marked with blue color) as well as motifs that were more enriched in regions where UC was predominantly correlated with MML change than the regions where UC was predominantly correlated with NME change (mean-associated motifs, marked with red color). Motifs that were also identified in the human analysis which prefer high NME are marked with text labels. Entropy-associated motifs that prefer high NME in the human analysis are marked with blue labels. Mean-associated motifs that preferred high NME in the human analysis are marked with red labels. Among the entropy-associated motifs, a substantial number (32, 27, 16, 24, 26, 27 and 19 motifs from heart, forebrain, limb, EFP, midbrain, hindbrain and liver, respectively) also prefer high NME in the human analysis such as KLF4 and KLF5.

Table 2.

Genes whose enhancers contain entropy-associated motifs (listed in the ‘Predicted motif’ column). At the enhancers of these genes, the correlation between UC and the NME difference (dNME) is much greater than the correlation between UC and the MML difference (dMML), and dNME is much larger than dMML. These genes are known from the literature (listed in the PMID column) to be functional in the corresponding tissue. This suggests that these entropy-associated motifs are located at the enhancers of functional genes.

| Tissue | Predicted motif | Cluster | Gene | dNME | dNME-UC cor | dMML | dMML-UC cor | Stage | PMID |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EFP | NR5A1; NR2F2 | 4 | Edn1 | 0.53 | 0.98 | 0.05 | −0.26 | E12.5-E13.5 | 24268655; 25772936 |
| EFP | NR2F2 | 10 | Hat1 | 0.50 | 0.79 | 0.0 | −0.80 | E14.5-E15.5 | 23754951 |
| EFP | ZNF263; CTCFL; GATA2 | 2 | Trps1 | 0.45 | 0.97 | 0.22 | 0.74 | E11.5-E12.5 | 22581230 |
| EFP | NR5A1; ETV4; NR2F2 | 2 | Tgfbr2 | 0.43 | 0.95 | 0.01 | −0.60 | E11.5-E12.5 | 15731757 |
| EFP | GATA2 | 1 | Alx4 | 0.35 | 0.77 | 0.01 | 0.26 | E13.5-E14.5 | 11137991 |
| Forebrain | KLF5 | 5 | Prdm16 | 0.29 | 0.66 | 0.01 | −0.21 | E15.5-E16.5 | 28698301 |
| Forebrain | IKZF1; ELF3; ETV1 | 8 | Otx2 | 0.41 | 0.95 | 0.04 | −0.47 | E14.5-E15.5 | 1353865; 14625556 |
| Forebrain | KLF4; KLF5 | 10 | Celsr2 | 0.37 | 0.81 | 0.06 | −0.06 | E15.5-E16.5 | 25002511 |
| Forebrain | IKZF1; ETV1; SOX10; NR5A1 | 10 | Nfia | 0.43 | 0.86 | 0.13 | 0.17 | E15.5-E16.5 | 12514217 |
| Forebrain | ZNF740; KLF4; KLF5; KLF6 | 2 | Nfib | 0.40 | 0.97 | 0.01 | −0.02 | E11.5-E12.5 | 27965439 |
| Limb | PBX1 | 9 | Kat6b | 0.60 | 0.99 | 0.08 | −0.06 | E13.5-E14.5 | 22265014; 22077973 |
| Limb | SPIB; ETV4 | 2 | Ror2 | 0.25 | 0.76 | 0.07 | −0.17 | E11.5-E12.5 | 20660756; 21316585; 10700182 |
| Limb | ZNF740 | 10 | Ski | 0.49 | 0.97 | 0.15 | 0.66 | E14.5-E15.5 | 12435627 |
| Limb | ZNF740; SPIB | 2 | Irx3 | 0.49 | 0.79 | 0.05 | −0.39 | E10.5-E11.5 | 24726282 |
| Limb | ETV4 | 2 | Wnt5a | 0.45 | 0.97 | 0.04 | −0.13 | E10.5-E11.5 | 21316585 |
| Heart | KLF5; KLF4; ETV4; KLF9; SP8; ZNF148 | 8 | Gata4 | 0.51 | 0.80 | 0.15 | 0.56 | E14.5-E15.5 | 27984724; 22522929; 26875865; 31080136 |
| Heart | PRDM4; SOX4 | 10 | Ppif | 0.44 | 0.96 | 0.02 | −0.32 | E15.5-E16.5 | 15800627; 20890047 |
| Heart | PBX3; PKNOX1; SOX10 | 4 | Ror1 | 0.46 | 0.87 | 0.15 | 0.62 | E11.5-E12.5 | 30342492; 11713269 |
| Heart | KLF6; TEAD2; RORC | 8 | Ndrg2 | 0.62 | 1.00 | 0.20 | 0.75 | E14.5-E15.5 | 22523601; 16520977 |
| Heart | ETV4; ATF2 | 8 | Stat6 | 0.59 | 0.99 | 0.07 | −0.15 | E13.5-E14.5 | 33192584; 25722436 |
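The two conditions stated in the Table 2 legend (a larger UC correlation for dNME than for dMML, and a larger dNME than dMML) can be checked row by row. The sketch below does this for three rows copied from Table 2. The cutoffs `cor_margin` and `diff_ratio`, the type and function names, and the counter-example row are hypothetical and are not the selection criteria used in this study, which are described in Materials and Methods.

```python
from typing import NamedTuple


class EnhancerRow(NamedTuple):
    gene: str
    d_nme: float       # NME difference between consecutive stages
    cor_nme_uc: float  # correlation between dNME and UC
    d_mml: float       # MML difference between consecutive stages
    cor_mml_uc: float  # correlation between dMML and UC


def is_entropy_dominated(row, cor_margin=0.2, diff_ratio=2.0):
    """Flag rows where entropy change dominates mean change.

    cor_margin and diff_ratio are hypothetical cutoffs chosen for
    illustration only.
    """
    larger_correlation = row.cor_nme_uc - row.cor_mml_uc >= cor_margin
    larger_difference = row.d_nme >= diff_ratio * row.d_mml
    return larger_correlation and larger_difference


# Values copied from Table 2 (EFP, EFP, forebrain).
table_rows = [
    EnhancerRow("Edn1", 0.53, 0.98, 0.05, -0.26),
    EnhancerRow("Trps1", 0.45, 0.97, 0.22, 0.74),
    EnhancerRow("Celsr2", 0.37, 0.81, 0.06, -0.06),
]

# Hypothetical counter-example in which the mean change dominates.
mean_dominated = EnhancerRow("hypothetical", 0.10, 0.30, 0.40, 0.90)

if __name__ == "__main__":
    for row in table_rows + [mean_dominated]:
        print(row.gene, is_entropy_dominated(row))
```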

Entropy-associated sequence and its association with regulatory DNA are conserved between human and mouse

Next, we asked whether the entropy-associated sequence and chromatin features are conserved between human and mouse. The percentage of overlap between human motifs associated with high NME and mouse entropy-associated motifs was significantly higher than the percentage of overlap between the human motifs associated with high NME and the mouse MML-related motifs (Fisher's exact test, odds ratio = 1.43, P = 0.04; Supplementary Dataset S12). The mouse entropy-associated motifs that overlap with the human motifs preferring high NME are highlighted in blue boxes in Figure 5. These entropy-associated motifs can play a critical role in the corresponding tissue. Examples include PBX3, whose entropy-associated binding motifs are conserved in human and mouse and are associated with heart development but not with forebrain or limb, and which is necessary for normal heart development (54); PKNOX1, which is associated with congenital heart defects (55); and FOXC1, which is required for cardiovascular development (56). In the forebrain, we found SOX10, SP8 and RORA (Figure 5B), which, respectively, regulate myelination-related genes (57), promote olfactory bulb interneurons during development (58), and regulate genes associated with autism spectrum disorder (59). In limb, we found SOX4, which is related to massive cartilage fusion (60); ETV4, associated with sonic hedgehog pathways in limb outgrowth (61); and PBX1, required for chondrocyte proliferation (62).
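For readers unfamiliar with the overlap test used above, the following is a minimal sketch of a two-sided Fisher's exact test on a 2 × 2 table, written with the Python standard library only. The function name is an assumption, and the counts in the usage example are hypothetical; they are not the motif counts from this study, which are given in Supplementary Dataset S12.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the hypergeometric
    probabilities of all tables (with the same margins) that are no more
    likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-9))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


if __name__ == "__main__":
    # Hypothetical counts: rows = mouse entropy- vs mean-associated motifs,
    # columns = overlaps vs does not overlap human high-NME motifs.
    odds, p = fisher_exact_two_sided(30, 70, 20, 80)
    print(f"odds ratio = {odds:.2f}, P = {p:.3f}")
```

This sketch reports the sample odds ratio; statistical packages may instead report a conditional maximum-likelihood estimate, so values can differ slightly.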

We found similar conservation between mouse and human for the relationship of entropy to CpG density and regulatory DNA. For CpG density, similar to human (Figure 3C), we observed a significant negative correlation between local CpG density and NME in mouse samples (Pearson correlation test, correlation = −0.21, 95% CI = [−0.21, −0.21], P < 2.2 × 10^−16; Supplementary Figure S18). To explore the relationship between NME and transcription factor motif binding sites at regulatory DNA in mouse, we compared, in each mouse sample and for 736 TF motifs, the NME at motif sites located in regulatory DNA (defined as a union set of mouse DNase I hypersensitive sites) with the NME at motif sites located in non-regulatory DNA. Similar to the human motif analysis (Supplementary Figure S2), we found that most of the motifs in mouse have significantly higher NME in regulatory DNA than in non-regulatory DNA (661 motifs with FDR ≤ 0.05) (Supplementary Figure S19 and Supplementary Dataset S15). We further examined whether the motifs with high NME in regulatory regions are conserved between human and mouse. In the tissues that exist in both human and mouse, we found that 79% (583 motifs) of the motifs show higher NME in regulatory regions than in non-regulatory regions in both human and mouse. In the tissues that exist in only one species, we observed that 81.5% (600 motifs) of the motifs show higher NME in regulatory regions than in non-regulatory regions. This suggests that the observation of high NME in regulatory regions is largely tissue-independent.
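The FDR threshold applied across motifs above can be illustrated with a short sketch of the Benjamini–Hochberg adjustment (16). We assume here, for illustration, that one P value is available per motif; the function name and the P values in the usage example are hypothetical and are not results from this study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical per-motif P values.
    motif_pvalues = {"motif_A": 0.01, "motif_B": 0.04,
                     "motif_C": 0.03, "motif_D": 0.005}
    fdr = benjamini_hochberg(list(motif_pvalues.values()))
    for name, q in zip(motif_pvalues, fdr):
        print(f"{name}: FDR = {q:.3f}, significant = {q <= 0.05}")
```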

Together, these results showed that the relationships between NME and transcription factor binding, CpG density and regulatory DNA were highly conserved between mouse and human, thus supporting their functional significance.

Methylation entropy is associated with gene expression variability in both human and mouse

Because NME is associated with expression variability in human cancer cells (63), we asked whether methylation entropy is related to gene expression variability in normal human somatic tissues and mouse embryonic development. First, we re-analyzed single-cell RNA-seq data (26) from 14 different human somatic tissues and human ESCs for which we had already performed methylation analysis using WGBS data from the corresponding tissues (see Materials and Methods for details). We found a correlation between NME near the transcription start site (TSS) of genes and the mean-adjusted variability (MAV) of gene expression (see Materials and Methods). Genes with higher MAV tended to have larger NME near the TSS, while genes with lower MAV tended to have smaller NME near the TSS (Figure 6A) (26). The correlation between the NME close to the TSS and the gene expression variability was highly significant (Pearson correlation test, average correlation = 0.23, average 95% CI = [0.21, 0.25], average P = 7.6 × 10^−5). We observed a similar pattern in other human samples except for undifferentiated stem cell lines (Figure 6B). Thus, NME was generally positively correlated with gene expression variability but not with mean expression (Supplementary Figure S20).
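The stratified comparison described above can be sketched as follows: genes are ranked by MAV, split into four equal-sized strata, and the average NME near the TSS is reported per stratum alongside the Pearson correlation. The function names and the toy MAV and NME values are assumptions for illustration only; they are not data from this study, and the actual MAV computation is described in Materials and Methods.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_nme_by_mav_quartile(mav, nme):
    """Average NME in each of four rank-based MAV strata (lowest first).

    Requires at least four genes so that every stratum is non-empty.
    """
    order = sorted(range(len(mav)), key=lambda i: mav[i])
    n = len(order)
    means = []
    for q in range(4):
        members = order[q * n // 4:(q + 1) * n // 4]
        means.append(sum(nme[i] for i in members) / len(members))
    return means


if __name__ == "__main__":
    # Toy per-gene values (hypothetical).
    toy_mav = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
    toy_nme = [0.2, 0.1, 0.3, 0.3, 0.4, 0.6, 0.5, 0.7]
    print("mean NME per MAV quartile:",
          mean_nme_by_mav_quartile(toy_mav, toy_nme))
    print("Pearson r:", round(pearson(toy_mav, toy_nme), 3))
```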

Figure 6.

Association of methylation entropy changes with gene expression variability. (A) Association between NME surrounding transcription start sites (TSS) and gene expression variability. An example showing that NME near the TSS was lower for genes with low mean-adjusted expression variability (MAV) than for genes with higher MAV. Genes were stratified based on the quartiles of their MAV. For each stratum, the average NME surrounding the TSS across all genes in the stratum is shown. The genes in the lowest MAV quartile (red line) had the lowest NME near their TSS. The genes in the highest MAV quartile (purple line) had much higher NME near their TSS. The Pearson correlation between genes’ MAV and NME was 0.269 (P < 2.2 × 10^−16). (B) Heatmap showing NME values within ±500 bp of genes’ TSS in different samples. Each row is a sample. In each sample, genes were stratified based on their expression variability using quantiles of MAV (X-axis), and the heatmap shows the average NME of all genes in each stratum. We observed increasing NME (from pink to dark red) as the MAV quantile increases, except for embryonic cell lines, indicating that gene expression variability (MAV) is correlated with NME in regions surrounding genes’ TSS.

Given the enrichment of regions with large MML difference in imprinted genes, we also explored the relationship between MML and monoallelic gene expression (MAE) (18). We observed significant enrichment of the regions with large allelic MML difference in the promoters of MAE genes (Fisher's exact test, odds ratio = 1.49, 95% CI = [1.04–2.15], P = 0.03, Supplementary Dataset S16). In contrast, as expected, we did not observe an enrichment of the regions with large allelic NME difference (Fisher's exact test, odds ratio = 1.03, 95% CI = [0.79–1.35], P = 0.84, Supplementary Dataset S16) suggesting no association between methylation entropy and mean gene expression. Thus, while mean methylation is associated with mean expression, entropic methylation is associated with expression variability (MAV).

We next analyzed the relationship between NME and MAV in mouse, using mouse fetal limb tissue scRNA-seq data from the ENCODE3 project (28) and calculating MAV in the same way as in human. Consistent with the human analysis, genes with higher MAV showed larger NME near the TSS while genes with lower MAV showed smaller NME near the TSS (Supplementary Figure S21A), and there was a smaller but still statistically significant correlation between NME and MAV (Pearson correlation test, average correlation = 0.11, average 95% CI = [0.10, 0.11], average P < 2.2 × 10^−16; Supplementary Figure S20B) (28). These observations show that the association between NME and gene expression variability in normal tissue is also conserved between human and mouse.

DISCUSSION

In summary, we have performed a systematic characterization of DNA methylation entropy, relating it to DNA sequence, predicted transcription factor binding sites, regulatory DNA, its relative contribution to epigenetic landscape changes during mouse development, and its relationship to variable gene expression. Applying an information-theoretic approach to mouse embryo DNA methylation data, we identified tissue- and time-specific transitions that are much more discriminative than conventional mean-based DNA methylation analyses, identifying temporal transitions remarkably well and essentially uniquely for each tissue type. By using the information metric of UC, we could also directly compare the relative contribution of differences in entropy versus mean methylation to these informational changes in the epigenetic landscape of mouse development. In all tissues, we observed more regions where UC is predominantly correlated with NME change (Figure 4D, 22–43%) than regions where UC is predominantly correlated with MML change (Figure 4D, 2–14%). This, along with the gene ontology enrichment analysis (Figure 4C, E, F) and additional analysis of tissue-specific genes (Supplementary Text), suggests that the role of methylation entropy in shaping the dynamic changes of the methylation landscape is comparable to, if not larger than, that of mean methylation. Gene ontology enrichment analysis supported a functional role of these entropy changes, particularly at enhancers, identifying many known critical genes for development in those tissues. These specific functional categories were often congruent with the functional development of those tissues, as evidenced particularly in the heart.

We also identified certain DNA sequence relationships with methylation entropy, using allele-specific analysis comparing directly the entropy or mean of sequences differing by a single nucleotide in the same cells. There were ∼3-fold more entropy- than mean-distinguishing SNPs. Moreover, these sequence differences were directly related to changes in many transcription factor binding sites. Note that imprinted genes were associated with differential mean methylation between alleles, although if the alleles are pooled they will appear entropic as in (11). There was also a strong inverse relationship of entropy with small differences in CpG density, in contrast to the comparatively weak inverse association of CpG density with mean methylation except at very high-density islands.

Moreover, entropic methylation was strongly associated with gene expression variability in human normal somatic cells and normal mouse development, in contrast to monoallelic expression, which was associated with differential mean methylation as seen here and elsewhere (11,64). The fact that so many entropy-associated features were found to be conserved between human and mouse suggests that they serve an important functional role. This relationship supports a potential role of epigenetic entropy in mediating either buffering or phenotypic divergence, two sides of the same developmental coin (4). The link between epigenetic entropy and the sequence of specific transcription factor binding sites suggests that binding of some transcription factors may regulate a variable response to environmental signaling, since many of these transcription factors are signaling-dependent, as classified by Brivanlou and Darnell (65). These include nuclear receptor family members such as NR5A1, RORA, RORB, and accessory factors like KLFs, and receptor-ligand targets including ETS-family members (ETVs, ELFs), receptor tyrosine kinases like EGFRs, and HIPPO-signaling targets like TEADs, all of which were conserved entropy-associated binding regions between human and mouse.

A limitation of the mouse data analysis is inherent in the datasets from which they were derived (and as those authors also acknowledged as a limitation) (10), namely that a given tissue sample would include multiple cell types which could affect inferences about developmental changes (Supplementary Text). This admixture would likely affect conclusions regarding entropy changes over time in development. However, entropy itself likely does contribute meaningfully to developmental transitions based on the current analysis for two reasons. First, the regions associated with high entropy change are located at the enhancers of the genes that are specific to the development of corresponding tissue as shown in the gene ontology (GO) analysis. Second, this limitation of admixture does not apply to the human data sets since in that case entropy differences were at the level of individual alleles. Moreover, the entropy-associated transcription factor motifs were conserved between the human data set and the mouse embryonic data set.

It will be interesting to explore the relationship between polymorphisms and epigenetic entropy in large cohorts, which could not be done on the relatively small number of human samples that we were able to analyze here, since rare variants in transcription factor binding sites have been associated with altered methylation at specific CpG sites in human lymphocytes (66). Previously, Mendelian transmission of 383 regions with variable mean methylation in the human genome was identified (67), and two recent studies using GTEx data show SNPs associated with variation in mean methylation in a tissue-specific manner (68,69). An alternative approach would be to analyze genetically complex mouse strains like the Collaborative Cross (70,71), which was not possible in the inbred mouse strain studied here; this mouse approach would allow for hypothesis testing of specific engineered transcription factor mutations in development.

Finally, the results shown here emphasize the importance of including entropy analysis in genome-wide studies of DNA methylation. While this has been widely adopted in some human diseases such as cancer (72), the role of methylation entropy in embryonic development may be under-appreciated. A recent study supports that idea, showing that knockout of DNA methyltransferases increases methylation entropy and gene expression variability in cultured embryonic stem cells (7). The results presented here identify genetic sequences and their associated motifs underlying epigenetic entropy and their critical role in development.

DATA AVAILABILITY

All the code for analysis is available on GitHub and Zenodo: https://github.com/AndyFeinberg/methyl_entropy, https://doi.org/10.5281/zenodo.7570687. The data used in the paper are publicly available. Mouse WGBS data were downloaded from the ENCODE3 project: https://www.encodeproject.org/. Human WGBS data were downloaded from the ENCODE project and the Roadmap Epigenomics Project: http://www.roadmapepigenomics.org/. GEO and ENCODE accession numbers are listed in Supplementary Dataset S17. Processed data are available from the authors upon request.

ACKNOWLEDGEMENTS

We thank Mauro Maggioni, Bill Nelson, Sarah Wheelan, Patrick Cahan and Nilanjan Chatterjee for their careful review and thoughtful suggestions.

Author contributions: Conceptualization, A.P.F.; methodology, investigation and writing, Y.F., Z.J., W.Z., J.A., M.A.K., H.J., A.P.F.; visualization, Y.F., Z.J., W.Z.; funding acquisition and supervision, A.P.F., H.J.

Contributor Information

Yuqi Fang, Center for Epigenetics, Johns Hopkins University, 855 N. Wolfe St., Baltimore, MD 21205, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.

Zhicheng Ji, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD 21205, USA; Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC 27708, USA.

Weiqiang Zhou, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD 21205, USA.

Jordi Abante, Department of Electrical & Computer Engineering, Johns Hopkins University, Baltimore, MD 21218, USA.

Michael A Koldobskiy, Center for Epigenetics, Johns Hopkins University, 855 N. Wolfe St., Baltimore, MD 21205, USA; Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, 855 N. Wolfe St., Baltimore, MD 21205, USA.

Hongkai Ji, Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD 21205, USA.

Andrew P Feinberg, Center for Epigenetics, Johns Hopkins University, 855 N. Wolfe St., Baltimore, MD 21205, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA; Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore, MD 21205, USA; Department of Medicine, Johns Hopkins University School of Medicine, 600 N Wolfe St, Baltimore, MD 21205, USA.

SUPPLEMENTARY DATA

Supplementary Data are available at NAR Online.

FUNDING

NIH [DP1 DK119129 to A.P.F., R01HG010889, R01HG009518 to H.J.]; NSF [1933303 to A.P.F.]. Funding for open access charge: NIH [DP1 DK119129].

Conflict of interest statement. None declared.

REFERENCES

  • 1. Hannon E., Gorrie-Stone T.J., Smart M.C., Burrage J., Hughes A., Bao Y., Kumari M., Schalkwyk L.C., Mill J. Leveraging DNA-methylation quantitative-trait loci to characterize the relationship between methylomic variation, gene expression, and complex traits. Am. J. Hum. Genet. 2018; 103:654–665.
  • 2. Anastasiadi D., Esteve-Codina A., Piferrer F. Consistent inverse correlation between DNA methylation of the first intron and gene expression across tissues and species. Epigenetics Chromatin. 2018; 11:37.
  • 3. Feinberg A.P., Koldobskiy M.A., Gondor A. Epigenetic modulators, modifiers and mediators in cancer aetiology and progression. Nat. Rev. Genet. 2016; 17:284–299.
  • 4. Pujadas E., Feinberg A.P. Regulated noise in the epigenetic landscape of development and disease. Cell. 2012; 148:1123–1131.
  • 5. Landan G., Cohen N.M., Mukamel Z., Bar A., Molchadsky A., Brosh R., Horn-Saban S., Zalcenstein D.A., Goldfinger N., Zundelevich A. et al. Epigenetic polymorphism and the stochastic formation of differentially methylated regions in normal and cancerous tissues. Nat. Genet. 2012; 44:1207–1214.
  • 6. Jenkinson G., Pujadas E., Goutsias J., Feinberg A.P. Potential energy landscapes identify the information-theoretic nature of the epigenome. Nat. Genet. 2017; 49:719–729.
  • 7. Tsankov A.M., Wadsworth M.H., Akopian V., Charlton J., Allon S.J., Arczewska A., Mead B.E., Drake R.S., Smith Z.D., Mikkelsen T.S. et al. Loss of DNA methyltransferase activity in primed human ES cells triggers increased cell-cell variability and transcriptional repression. Development. 2019; 146:dev174722.
  • 8. Abante J., Fang Y., Feinberg A.P., Goutsias J. Detection of haplotype-dependent allele-specific DNA methylation in WGBS data. Nat. Commun. 2020; 11:5238.
  • 9. Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M.J. et al. Integrative analysis of 111 reference human epigenomes. Nature. 2015; 518:317–330.
  • 10. He Y., Hariharan M., Gorkin D.U., Dickel D.E., Luo C., Castanon R.G., Nery J.R., Lee A.Y., Zhao Y., Huang H. et al. Spatiotemporal DNA methylome dynamics of the developing mouse fetus. Nature. 2020; 583:752–759.
  • 11. Onuchic V., Lurie E., Carrero I., Pawliczek P., Patel R.Y., Rozowsky J., Galeev T., Huang Z., Altshuler R.C., Zhang Z. et al. Allele-specific epigenome maps reveal sequence-dependent stochastic switching at regulatory loci. Science. 2018; 361:eaar3146.
  • 12. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.Journal. 2011; 17:10–12.
  • 13. Wilton R., Li X., Feinberg A.P., Szalay A.S. Arioc: GPU-accelerated alignment of short bisulfite-treated reads. Bioinformatics. 2018; 34:2673–2675.
  • 14. Brouard J.S., Schenkel F., Marete A., Bissonnette N. The GATK joint genotyping workflow is appropriate for calling variants in RNA-seq experiments. J. Anim. Sci. Biotechnol. 2019; 10:44.
  • 15. Sherry S.T., Ward M.H., Kholodov M., Baker J., Phan L., Smigielski E.M., Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001; 29:308–311.
  • 16. Benjamini Y., Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B (Methodological). 1995; 57:289–300.
  • 17. Jirtle R. 2012; Imprinted Genes: by Species, Geneimprint.
  • 18. Savova V., Chun S., Sohail M., McCole R.B., Witwicki R., Gai L., Lenz T.L., Wu C.T., Sunyaev S.R., Gimelbrant A.A. Genes with monoallelic expression contribute disproportionately to genetic diversity in humans. Nat. Genet. 2016; 48:231–237.
  • 19. Fornes O., Castro-Mondragon J.A., Khan A., van der Lee R., Zhang X., Richmond P.A., Modi B.P., Correard S., Gheorghe M., Baranašić D. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 2020; 48:D87–D92.
  • 20. Saxonov S., Berg P., Brutlag D.L. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl. Acad. Sci. U.S.A. 2006; 103:1412–1417.
  • 21. Ji H., Jiang H., Ma W., Johnson D.S., Myers R.M., Wong W.H. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat. Biotechnol. 2008; 26:1293–1300.
  • 22. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489:57.
  • 23. Zhou W., Ji Z., Fang W., Ji H. Global prediction of chromatin accessibility using small-cell-number and single-cell RNA-seq. Nucleic Acids Res. 2019; 47:e121.
  • 24. Gorkin D.U., Barozzi I., Zhao Y., Zhang Y., Huang H., Lee A.Y., Li B., Chiou J., Wildberg A., Ding B. et al. An atlas of dynamic chromatin landscapes in mouse fetal development. Nature. 2020; 583:744–751.
  • 25. Alexa A., Rahnenführer J. Gene set enrichment analysis with topGO. Bioconductor Improv. 2009; 27:1–26.
  • 26. Han X., Zhou Z., Fei L., Sun H., Wang R., Chen Y., Chen H., Wang J., Tang H., Ge W. et al. Construction of a human cell landscape at single-cell level. Nature. 2020; 581:303–309.
  • 27. Huang M., Wang J., Torre E., Dueck H., Shaffer S., Bonasio R., Murray J.I., Raj A., Li M., Zhang N.R. SAVER: gene expression recovery for single-cell RNA sequencing. Nat. Methods. 2018; 15:539–542.
  • 28. He P., Williams B.A., Trout D., Marinov G.K., Amrhein H., Berghella L., Goh S.T., Plajzer-Frick I., Afzal V., Pennacchio L.A. et al. The changing mouse embryo transcriptome at whole tissue and single-cell resolution. Nature. 2020; 583:760–767.
  • 29. Eling N., Richard A.C., Richardson S., Marioni J.C., Vallejos C.A. Correcting the mean-variance dependency for differential variability testing using single-cell RNA sequencing data. Cell Syst. 2018; 7:284–294.
  • 30. Iglesias-Platas I., Court F., Camprubi C., Sparago A., Guillaumet-Adkins A., Martin-Trujillo A., Riccio A., Moore G.E., Monk D. Imprinting at the PLAGL1 domain is contained within a 70-kb CTCF/cohesin-mediated non-allelic chromatin loop. Nucleic Acids Res. 2013; 41:2171–2179.
  • 31. Umlauf D., Goto Y., Cao R., Cerqueira F., Wagschal A., Zhang Y., Feil R. Imprinting along the Kcnq1 domain on mouse chromosome 7 involves repressive histone methylation and recruitment of Polycomb group complexes. Nat. Genet. 2004; 36:1296–1300.
  • 32. Li E., Beard C., Jaenisch R. Role for DNA methylation in genomic imprinting. Nature. 1993; 366:362–365.
  • 33. Monk D., Mackay D.J.G., Eggermann T., Maher E.R., Riccio A. Genomic imprinting disorders: lessons on how genome, epigenome and environment interact. Nat. Rev. Genet. 2019; 20:235–248.
  • 34. Coetzee S.G., Coetzee G.A., Hazelett D.J. motifbreakR: an R/bioconductor package for predicting variant effects at transcription factor binding sites. Bioinformatics. 2015; 31:3847–3849.
  • 35. Bell A.C., Felsenfeld G. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature. 2000; 405:482–485.
  • 36. Sanosaka T., Imamura T., Hamazaki N., Chai M., Igarashi K., Ideta-Otsuka M., Miura F., Ito T., Fujii N., Ikeo K. et al. DNA methylome analysis identifies transcription factor-based epigenomic signatures of multilineage competence in neural stem/progenitor cells. Cell Rep. 2017; 20:2992–3003.
  • 37. Eden E., Navon R., Steinfeld I., Lipson D., Yakhini Z. GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics. 2009; 10:48.
  • 38. Park N.I., Guilhamon P., Desai K., McAdam R.F., Langille E., O’Connor M., Lan X., Whetstone H., Coutinho F.J., Vanner R.J. et al. ASCL1 reorganizes chromatin to direct neuronal fate and suppress tumorigenicity of glioblastoma stem cells. Cell Stem Cell. 2017; 21:209–224.
  • 39. Sun Y., Zhou B., Mao F., Xu J., Miao H., Zou Z., Phuc Khoa L.T., Jang Y., Cai S., Witkin M. et al. HOXA9 reprograms the enhancer landscape to promote leukemogenesis. Cancer Cell. 2018; 34:643–658.
  • 40. Adachi K., Kopp W., Wu G., Heising S., Greber B., Stehling M., Araúzo-Bravo M.J., Boerno S.T., Timmermann B., Vingron M. et al. Esrrb unlocks silenced enhancers for reprogramming to naive pluripotency. Cell Stem Cell. 2018; 23:266–275.
  • 41. Soufi A., Garcia M.F., Jaroszewicz A., Osman N., Pellegrini M., Zaret K.S. Pioneer transcription factors target partial DNA motifs on nucleosomes to initiate reprogramming. Cell. 2015; 161:555–568.
  • 42. Shan J., Fu L., Balasubramanian M.N., Anthony T., Kilberg M.S. ATF4-dependent regulation of the JMJD3 gene during amino acid deprivation can be rescued in Atf4-deficient cells by inhibition of deacetylation. J. Biol. Chem. 2012; 287:36393–36403.
  • 43. Magnani L., Ballantyne E.B., Zhang X., Lupien M. PBX1 genomic pioneer function drives ERα signaling underlying progression in breast cancer. PLoS Genet. 2011; 7:e1002368.
  • 44. Wittkopp P.J., Kalay G. Cis-regulatory elements: molecular mechanisms and evolutionary processes underlying divergence. Nat. Rev. Genet. 2011; 13:59–69.
  • 45. Zhao Y., Sheng H.Z., Amini R., Grinberg A., Lee E., Huang S., Taira M., Westphal H. Control of hippocampal morphogenesis and neuronal differentiation by the LIM homeobox gene Lhx5. Science. 1999; 284:1155–1158.
  • 46. Lui N.C., Tam W.Y., Gao C., Huang J.D., Wang C.C., Jiang L., Yung W.H., Kwan K.M. Lhx1/5 control dendritogenesis and spine morphogenesis of Purkinje cells via regulation of Espin. Nat. Commun. 2017; 8:15079.
  • 47. Owens G.K., Kumar M.S., Wamhoff B.R. Molecular regulation of vascular smooth muscle cell differentiation in development and disease. Physiol. Rev. 2004; 84:767–801.
  • 48. Stadler M.B., Murr R., Burger L., Ivanek R., Lienert F., Schöler A., van Nimwegen E., Wirbelauer C., Oakeley E.J., Gaidatzis D. et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011; 480:490–495.
  • 49. Domcke S., Bardet A.F., Adrian Ginno P., Hartl D., Burger L., Schubeler D. Competition between DNA methylation and transcription factors determines binding of NRF1. Nature. 2015; 528:575–579.
  • 50. Karagiannis P., Takahashi K., Saito M., Yoshida Y., Okita K., Watanabe A., Inoue H., Yamashita J.K., Todani M., Nakagawa M. et al. Induced pluripotent stem cells and their use in human models of disease and development. Physiol. Rev. 2019; 99:79–114.
  • 51. Zenker M., Bunt J., Schanze I., Schanze D., Piper M., Priolo M., Gerkes E.H., Gronostajski R.M., Richards L.J., Vogt J. et al. Variants in nuclear factor I genes influence growth and development. Am. J. Med. Genet. C Semin. Med. Genet. 2019; 181:611–626.
  • 52. Qu Y., Huang Y., Feng J., Alvarez-Bolado G., Grove E.A., Yang Y., Tissir F., Zhou L., Goffinet A.M. Genetic evidence that Celsr3 and Celsr2, together with Fzd3, regulate forebrain wiring in a Vangl-independent manner. Proc. Natl. Acad. Sci. U.S.A. 2014; 111:E2996–E3004.
  • 53. Laforest B., Andelfinger G., Nemer M. Loss of Gata5 in mice leads to bicuspid aortic valve. J. Clin. Invest. 2011; 121:2876–2887.
  • 54. Stankunas K., Shang C., Twu K.Y., Kao S.C., Jenkins N.A., Copeland N.G., Sanyal M., Selleri L., Cleary M.L., Chang C.P.. Pbx/Meis deficiencies demonstrate multigenetic origins of congenital heart disease. Circ. Res. 2008; 103:702–709. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55. Arrington C.B., Dowse B.R., Bleyl S.B., Bowles N.E.. Non-synonymous variants in pre-B cell leukemia homeobox (PBX) genes are associated with congenital heart defects. Eur. J. Med. Genet. 2012; 55:235–237. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56. Lambers E., Arnone B., Fatima A., Qin G., Wasserstrom J.A., Kume T.. Foxc1 Regulates early cardiomyogenesis and functional properties of embryonic stem cell derived cardiomyocytes. Stem Cells. 2016; 34:1487–1500. [DOI] [PubMed] [Google Scholar]
  • 57. Aston C., Jiang L., Sokolov B.P.. Transcriptional profiling reveals evidence for signaling and oligodendroglial abnormalities in the temporal cortex from patients with major depressive disorder. Mol. Psychiatry. 2005; 10:309–322. [DOI] [PubMed] [Google Scholar]
  • 58. Waclaw R.R., Allen Z.J., Bell S.M., Erdelyi F., Szabo G., Potter S.S., Campbell K.. The zinc finger transcription factor Sp8 regulates the generation and diversity of olfactory bulb interneurons. Neuron. 2006; 49:503–516. [DOI] [PubMed] [Google Scholar]
  • 59. Sarachana T., Hu V.W.. Genome-wide identification of transcriptional targets of RORA reveals direct regulation of multiple genes associated with autism spectrum disorder. Mol Autism. 2013; 4:14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60. Bhattaram P., Penzo-Mendez A., Kato K., Bandyopadhyay K., Gadi A., Taketo M.M., Lefebvre V.. SOXC proteins amplify canonical WNT signaling to secure nonchondrocytic fates in skeletogenesis. J. Cell Biol. 2014; 207:657–671. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61. Mao J., McGlinn E., Huang P., Tabin C.J., McMahon A.P.. Fgf-dependent Etv4/5 activity is required for posterior restriction of Sonic Hedgehog and promoting outgrowth of the vertebrate limb. Dev. Cell. 2009; 16:600–606. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62. Selleri L., Depew M.J., Jacobs Y., Chanda S.K., Tsang K.Y., Cheah K.S., Rubenstein J.L., O’Gorman S., Cleary M.L.. Requirement for Pbx1 in skeletal patterning and programming chondrocyte proliferation and differentiation. Development. 2001; 128:3543–3557. [DOI] [PubMed] [Google Scholar]
  • 63. Koldobskiy M.A., Jenkinson G., Abante J., Rodriguez DiBlasi V.A., Zhou W., Pujadas E., Idrizi A., Tryggvadottir R., Callahan C., Bonifant C.L.et al.. Converging genetic and epigenetic drivers of paediatric acute lymphoblastic leukaemia identified by an information-theoretic analysis. Nat. Biomed. Eng. 2021; 5:360–376. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64. Gupta S., Lafontaine D.L., Vigneau S., Vinogradova S., Mendelevich A., Igarashi K.J., Bortvin A., Alves-Pereira C.F., Clement K., Pinello L.. DNA methylation is a key mechanism for maintaining monoallelic expression on autosomes. 2020; bioRxiv doi:23 July 2020, preprint: not peer reviewed 10.1101/2020.02.20.954834. [DOI]
  • 65. Brivanlou A.H., Darnell J.E. Jr. Signal transduction and the control of gene expression. Science. 2002; 295:813–818. [DOI] [PubMed] [Google Scholar]
  • 66. Martin-Trujillo A., Patel N., Richter F., Jadhav B., Garg P., Morton S.U., McKean D.M., DePalma S.R., Goldmuntz E., Gruber D.et al.. Rare genetic variation at transcription factor binding sites modulates local DNA methylation profiles. PLoS Genet. 2020; 16:e1009189. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67. Plongthongkum N., van Eijk K.R., de Jong S., Wang T., Sul J.H., Boks M.P., Kahn R.S., Fung H.L., Ophoff R.A., Zhang K.. Characterization of genome-methylome interactions in 22 nuclear pedigrees. PLoS One. 2014; 9:e99313. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 68. Gunasekara C.J., Scott C.A., Laritsky E., Baker M.S., MacKay H., Duryea J.D., Kessler N.J., Hellenthal G., Wood A.C., Hodges K.R.et al.. A genomic atlas of systemic interindividual epigenetic variation in humans. Genome Biol. 2019; 20:105. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69. Rizzardi L.F., Hickey P.F., Idrizi A., Tryggvadóttir R., Callahan C.M., Stephens K.E., Taverna S.D., Zhang H., Ramazanoglu S., Hansen K.D.et al.. Human brain region-specific variably methylated regions are enriched for heritability of distinct neuropsychiatric traits. Genome Biol. 2021; 22:116. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70. Churchill G.A., Airey D.C., Allayee H., Angel J.M., Attie A.D., Beatty J., Beavis W.D., Belknap J.K., Bennett B., Berrettini W.et al.. The Collaborative Cross, a community resource for the genetic analysis of complex traits. Nat. Genet. 2004; 36:1133–1137. [DOI] [PubMed] [Google Scholar]
  • 71. Aylor D.L., Valdar W., Foulds-Mathes W., Buus R.J., Verdugo R.A., Baric R.S., Ferris M.T., Frelinger J.A., Heise M., Frieman M.B.et al.. Genetic analysis of complex traits in the emerging collaborative cross. Genome Res. 2011; 21:1213–1222. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72. Teschendorff A.E., Feinberg A.P.. Statistical mechanics meets single-cell biology. Nat. Rev. Genet. 2021; 22:459–476. [DOI] [PMC free article] [PubMed] [Google Scholar]

Supplementary Materials

gkad050_Supplemental_Files

Data Availability Statement

All analysis code is available on GitHub and Zenodo: https://github.com/AndyFeinberg/methyl_entropy, https://doi.org/10.5281/zenodo.7570687. The data used in the paper are publicly available. Mouse WGBS data were downloaded from the ENCODE3 project: https://www.encodeproject.org/. Human WGBS data were downloaded from the ENCODE project and the Roadmap Epigenomics project: http://www.roadmapepigenomics.org/. GEO and ENCODE accession numbers are listed in Supplementary Dataset S17. Processed data are available from the authors upon request.


Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
