Science Advances. 2025 Oct 24;11(43):eadw3027. doi: 10.1126/sciadv.adw3027

KnowYourCG: Facilitating base-level sparse methylome interpretation

David C Goldberg 1,, Hongxiang Fu 1,, Daniel Atkins 1, Ethan Moyer 1, Chin Nien Lee 2, Yanxiang Deng 2, Wanding Zhou 1,2,*
PMCID: PMC12551721  PMID: 41134907

Abstract

Decoding DNA methylomes for biological insights is critical in epigenetics research. We present KnowYourCG (KYCG), a data interpretation framework designed for functional DNA methylation analysis. Unlike existing tools that target genes or genomic intervals, KYCG features direct base-level screenings of diverse biological and technical influences, including sequence motifs, transcription factor binding, histone modifications, replication timing, cell-type–specific methylation, and trait associations. Through implementing efficient infrastructure that rapidly screens and investigates thousands of knowledgebases, KYCG addresses the challenges of data sparsity in various methylation datasets, including low-pass or single-cell DNA methylomes, 5-hydroxymethylation (5hmC) profiles, spatial DNA methylation maps, and array-based datasets for epigenome-wide association studies. Applying KYCG to these datasets provides valuable insights into cell differentiation, cancer origins, epigenome-trait associations, and technical issues such as array artifacts, single-cell batch effects, and Nanopore 5hmC detection accuracy. Our tool simplifies large-scale methylation analysis and integrates seamlessly with standard assay technologies.


KnowYourCG is a scalable, CpG-based framework for DNA methylome interpretation for broad biological and technical links.

INTRODUCTION

Cytosine modification at the 5′-carbon in the CpG dinucleotide context is one of the most studied epigenetic marks in higher eukaryotes. In mammals, DNA methylation is extensively implicated in gene regulation, genome evolution, organismal development, and disease (1). Despite the prevalent interest in characterizing the DNA methylome, understanding the functional implications of methylation changes can be difficult. This is partly because DNA methylation is encoded on specific sequence units, e.g., CpG dinucleotides, but is also highly plastic and jointly governed by multiple intrinsic and external factors, such as cell identity (2), genetics (3), pathology (4), sex (5), age (6), and other environmental conditions (7). Functional DNA methylation analysis often demands awareness of the sequence structures and all explicit and hidden biological covariates (8) and technical confounders (9).

Effective computational methods for mining biological links from DNA methylation data have been lacking compared to their gene expression counterparts (10–12). Most functional enrichment analysis methods for DNA methylation data piggyback on tools initially designed to investigate gene sets [e.g., DAVID (12)] and genomic intervals [e.g., HOMER (13) and GREAT (14)]. Methods specifically designed for DNA methylation data follow a similar gene-centric (15, 16) or genomic interval–based approach (14, 17, 18). In other words, investigators must first link CpGs to genes or form a differentially methylated region (DMR) based on genomic proximity (14, 17, 19, 20).

There are fundamental drawbacks to these strategies. First, DNA methylation data are inherently sparse due to CpG depletion outside CpG islands and additional sparsity introduced by practical constraints of profiling methods (Fig. 1A). The Infinium arrays, widely used in epigenome-wide association studies (EWAS), cover only 1 to 3% of the genomic CpGs (9). Reduced representation bisulfite sequencing (RRBS) covers ~10% but is limited to CpG-dense regions. Whole-genome bisulfite sequencing (WGBS) covers the entire genome but frequently lacks per-base depth and quantification granularity. Epitomizing both forms of sparsities, single-cell methylomes typically cover 1 to 10% of the entire CpG set in the genome (Fig. 1A) (21). These data sparsities make accurate definitions of DMRs difficult and often subjective, even when true differences exist.

Fig. 1. Overview of the KnowYourCG analysis framework.

(A) Visualization of DNA methylation data sparsity in terms of the genome coverage and sequencing depth across common profiling methods. M, million; NGS, next-generation sequencing. (B) Schematic comparison between traditional gene-centric and KnowYourCG (KYCG) CpG-centric analytical workflows. BS-seq, bisulfite sequencing. (C) Overview of curated CpG knowledgebases used in KYCG for enrichment analysis. LINE, long interspersed nuclear element; kbp, kilo–base pair; ERV, endogenous retroviruses. (D) Memory and speed performance benchmarking of KYCG’s vectorized approach versus traditional set-based CpG representations. Gb, gigabytes. (E) Speed benchmarking of KYCG compared to a standard pipeline for computing enrichment statistics over increasing knowledgebase numbers. (F) Evaluation of enrichment testing in sparse datasets. ChromHMM state rankings were tested at varying levels of CpG sparsity from N (~28 million CpGs) to N/2^14 (~1700 CpGs). P values are based on Fisher’s exact tests.

Second, gene-centric approaches face challenges in establishing meaningful CpG-gene associations and unbiased gene weighting (21–23). Methylation at different gene regions plays distinct regulatory roles (24), and gene-centric analysis often misses biology at intergenic, geneless regions. Intergenic methylation is known to implicate cell replication (25, 26), genome instability (25–28), cell differentiation (2, 29), and aberrant writer/eraser enzyme activity (30, 31). Because of the discrete nature of CpG dinucleotides and their depletion from deamination, proximity-based CpG-gene associations or DMRs may fail to reveal clear enrichment patterns. Instead, focal and dispersed methylation changes are more common and implicate transcription factor (TF) binding (29).

The alternative strategy to study functional links in DNA methylation data is to use CpGs as the units of analysis based on a fixed CpG index, as implemented in methods such as eFORGE (32, 33), which were designed for array-based datasets with 20,000 to 900,000 probes (34). However, as newer datasets scale to whole-genome coverage (20 million to 30 million CpGs), overlap counting across hundreds to thousands of knowledgebase sets becomes computationally inefficient.

To address the above needs, we developed a comprehensive computational framework for DNA methylation data interpretation (Fig. 1B). KnowYourCG (KYCG) analyzes CpG sets for biological links and technical confounders. Capitalizing on a key technical innovation that rapidly enumerates CpG set differences across the whole genome, we achieve fast enrichment testing of methylomes against up to thousands of curated biological and technical covariates. Below, we first describe the implementation and then apply the tool to five broad application scenarios: (i) low-input DNA methylation profiles, including single-cell and spatial DNA methylation; (ii) 5-hydroxymethylation (5hmC) profiles and Nanopore-based direct detection; (iii) cell-type composition dynamics; (iv) interpretation of predictive machine learning tools such as epigenetic clocks and cancer classifiers; and last, (v) the detection of technical confounders. Collectively, we show that KYCG unveils previously unreported links between CpG groups and demonstrates a variety of practical functionalities for analyzing large-scale DNA methylome data. Our tool is compatible with sequencing-based data and array platforms and has a user-friendly web-based application.

RESULTS

CpG-centric interpretation of sparse DNA methylomes

KYCG is a framework consisting of a web application, an R/Bioconductor application programming interface, a C command-line tool, and a database designed for DNA methylation data exploratory enrichment analysis, analogous to gene set enrichment analysis but focused on CpGs (Fig. 1B and fig. S1A). A CpG set linked to known biological functions, such as the specific binding sites of TFs, is called a knowledgebase set to distinguish it from the query. The significance of overlap between query CpGs and knowledgebase sets is evaluated using the hypergeometric distribution (Materials and Methods). To automate discovery, we uniformly processed 12,114,567 CpG-indexed knowledgebases for download and online query (Data and materials availability). These sets are constructed from human and mouse genome sequences, annotations, and public sequencing and array-based profiling (11,806 bulk and 480,012 single cells) and 1067 EWAS studies (Fig. 1C, table S1A, fig. S1B, and Materials and Methods).
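The overlap test described above can be illustrated with a minimal Python sketch. It assumes a one-sided (enrichment) hypergeometric tail; the function name `hypergeom_enrichment_p` and the toy CpG indices are hypothetical and are not part of the KYCG interface.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, kb_size, query_size, overlap):
    """Return P(X >= overlap) when drawing query_size CpGs from a universe
    that contains kb_size knowledgebase CpGs (upper hypergeometric tail)."""
    max_overlap = min(kb_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(kb_size, k) * comb(universe_size - kb_size, query_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy universe of 20 CpGs indexed 0..19 (illustrative numbers only).
universe_size = 20
query = {0, 1, 2, 3, 10}          # hypothetical query CpG set
knowledgebase = {0, 1, 2, 3, 4}   # hypothetical knowledgebase CpG set
overlap = len(query & knowledgebase)
p_value = hypergeom_enrichment_p(
    universe_size, len(knowledgebase), len(query), overlap
)
```

The exact summation is practical only for small examples; with genome-scale universes, a log-space or library implementation of the same tail probability would be needed.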

To manage statistical complexity and improve interpretability, we grouped the CpG sets into biologically distinct testing knowledgebase domains representing separate hypothesis spaces with varying term counts, biological relevance, and structural organization. These domains are further classified into the following four major categories: (i) sequence features (e.g., k-mer, tetranucleotide, and TF binding motifs), (ii) genomic features (e.g., chromatin states, histone modifications, gene links, transposable elements, TF bindings, and evolutionary conservation), (iii) trait associates (e.g., cell-type–specific methylations, human EWAS associates, and epigenetic clocks), and (iv) technical associates (e.g., sequence maskers, array hybridization, and extension masks). We extensively validated these knowledgebases, which form biologically relevant communities (fig. S1, C to E, and Materials and Methods). These testing domains define independent hypothesis spaces. Testing within domains preserves statistical power and biological focus.
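One way to picture domain-wise testing is to correct for multiple comparisons separately within each domain, so that a domain with thousands of terms does not dilute a domain with a few dozen. The sketch below assumes the Benjamini-Hochberg procedure, which this section does not specify; the function names, domain labels, and P values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def adjust_within_domains(results):
    """results maps domain -> {term: raw P value}; each domain is
    corrected independently of the others."""
    adjusted = {}
    for domain, term_p in results.items():
        terms = list(term_p)
        fdr = benjamini_hochberg([term_p[t] for t in terms])
        adjusted[domain] = dict(zip(terms, fdr))
    return adjusted


# Hypothetical raw P values for two testing domains.
raw = {
    "chromatin_states": {"TssA": 0.01, "Enh": 0.04, "Het": 0.03},
    "tf_binding": {"CTCF": 0.002},
}
fdr_by_domain = adjust_within_domains(raw)
```

In this toy input, the single TF binding term keeps its raw P value because it is the only hypothesis in its domain, whereas pooling all four terms would have inflated it.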

To optimize performance, we used adaptive encoding to compress CpG sets, achieving compact disk storage and efficient in-memory manipulation (Materials and Methods). The comparison algorithm, implemented in C with bitwise vectorization, substantially accelerates the set overlap analysis. Our results demonstrate that for queries with 1 million CpGs, this method is ~10× faster and uses ~60× less memory than traditional set-based representations of CpGs. Unlike set representations, comparison time remains constant and scalable to large query sizes (Fig. 1D). Compared to a BEDTools-based pipeline of counting query overlaps (35), KYCG achieves a 25-fold speedup (Fig. 1E), supporting large-scale enrichment testing across thousands of knowledgebases. Similar performance gains extend to other functionalities, such as rapid methylation aggregation over knowledgebases (fig. S1F).
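The idea of comparing CpG sets through bitwise operations on a fixed CpG index can be sketched as follows. This is not KYCG's actual encoding or its C implementation: Python integers stand in for bit vectors, and the helper names are hypothetical.

```python
def to_bitset(cpg_indices):
    """Encode a CpG set as an integer whose bit i is 1 if CpG i is present."""
    bits = 0
    for i in cpg_indices:
        bits |= 1 << i
    return bits


def popcount(bits):
    """Count the set bits, i.e., the number of CpGs in the encoded set."""
    return bin(bits).count("1")


def overlap_counts(query_bits, kb_bits):
    """Return (shared, query only, knowledgebase only) CpG counts."""
    shared = popcount(query_bits & kb_bits)
    return shared, popcount(query_bits) - shared, popcount(kb_bits) - shared


# Toy CpG indices on a fixed reference index (illustrative only).
query_bits = to_bitset({0, 2, 5, 7})
kb_bits = to_bitset({2, 3, 7, 9})
shared, query_only, kb_only = overlap_counts(query_bits, kb_bits)
```

Because every set is laid out on the same CpG index, one bitwise AND plus a bit count yields the overlap, and the work depends on the length of the index rather than on the number of CpGs in the query.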

We first tested KYCG’s performance under query sparsity, as seen in RRBS, capture methylation sequencing (methyl-seq), and Infinium arrays, which target only a small subset of CpGs. To assess the enrichment testing feasibility, we simulated sparsity by downsampling CCCTC-binding factor (CTCF) binding–associated CpG sets from the full-genome set (N ~ 28 million) to N/2^14. We then evaluated the stability of ChromHMM state rankings by comparing sparse and full-genome enrichment (Fig. 1F). Active promoters consistently ranked highest, but sparsity introduced variations. The top-ranking ChromHMM terms remained stable at sparsity levels down to N/2^10 (~27,000 CpGs), with HM450, EPIC, and RRBS-based results resembling nonsparse predictions. However, the top enrichment term changed in 26% of runs at the extreme sparsity level (N/2^14; ~1700 CpGs). These findings illustrate KYCG’s stability for enrichment testing with sparse CpG inputs.
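The sparsity levels in this simulation follow from repeatedly halving a CpG universe of ~28 million sites: 10 halvings leave ~27,000 CpGs and 14 halvings leave ~1700. The short sketch below reproduces that arithmetic and shows one way to draw a random subsample; the function names are hypothetical, and uniform random sampling is an assumption rather than the documented procedure.

```python
import random


def sparsity_levels(n_full, max_halvings):
    """Return the CpG counts n_full / 2**k for k = 0..max_halvings."""
    return [n_full // 2**k for k in range(max_halvings + 1)]


def downsample(cpg_set, target_size, seed=0):
    """Draw a reproducible uniform random subset of a CpG set."""
    rng = random.Random(seed)
    return set(rng.sample(sorted(cpg_set), target_size))


levels = sparsity_levels(28_000_000, 14)
# levels[10] is 27343 and levels[14] is 1708, matching the approximate
# ~27,000 and ~1700 CpG counts quoted for these sparsity levels.

toy_universe = set(range(1000))
toy_subset = downsample(toy_universe, 10)
```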

KYCG reveals biology from low-input, single-cell, and spatial DNA methylomes

Next, we evaluated KYCG’s performance in real sparse sequencing data by first analyzing methylomes (~2 million to 8 million CpGs) from various stages of primordial germ cell (PGC) development, where limited DNA precludes deep profiling (36). Enrichment analysis of methylated CpGs against TF binding sites (TFBS) and histone mark knowledgebases (Fig. 2A) showed that regions escaping global hypomethylation were enriched for heterochromatic (Het) marks, including histone H3 lysine 9 trimethylation (H3K9me3) and zinc finger protein 57 (ZFP57) binding. This enrichment was absent in male embryonic day 16.5 (E16.5) PGCs, consistent with known methylation rebound at this stage (37). These findings demonstrate KYCG’s ability to reveal biology at intergenic regions.

Fig. 2. Application of KYCG to sparse low-input, single-cell, Nanopore, and spatial methylomes.

(A) Enrichment analysis of sparse DNA methylomes (~2 million to 8 million CpGs) during PGC development. (B) Evaluation of 50 pairs of single-cell colon cancer versus adjacent normal methylomes. Spearman correlation of cancer hypermethylation enrichment results was tested relative to the least sparse pair (~6 million CpGs), indicated by the red dot. (C) t-SNE visualization of the selected 50 pairs of cells for comparison between the KYCG motif database and HOMER using single-cell colon cancer hypermethylation data. (D) Enrichment analysis of cell-type–specific H3K27me3 histone modifications of hypermethylated CpGs in bladder cancer (BLCA) and breast cancer (BRCA) TCGA datasets. (E) Neural tube and heart enrichment testing of TFBS from spatial mouse E11.5 embryo data. (F) Heatmap showing cell-specific TFBS methylation identified by aggregating methylation over KYCG knowledgebases. Forty-eight cells per major cell-type class are shown as rows, and TFBS knowledgebases are columns. Meth., methylation; t-SNE, t-distributed stochastic neighbor embedding; TCGA, The Cancer Genome Atlas.

We next evaluated whether KYCG captures biology from highly sparse single-cell methylomes (200,000 to 1 million CpGs), a common scenario when pseudobulk aggregation is limited by biological availability or cost. In a pairwise comparison between a randomly selected single colon tumor cell and an adjacent normal cell, KYCG reveals the signature enrichment of hypermethylation at bivalent chromatin, marked by H3K27me3 and bound by Polycomb repressive complex members [e.g., polyhomeotic homolog 1 (PHC1), polycomb group ring finger 2 (PCGF2), jumonji and AT-rich interaction domain containing 2 (JARID2), ring finger protein 1 (RING1), enhancer of zeste homolog 2 (EZH2), etc.] (fig. S2A) (38). Hypomethylated CpGs were enriched in quiescent (Quies) and Het regions, Hi-C B compartments, and WCGWs (fig. S2B), as previously characterized (25). The result is robust to cell pairs of different sparsity levels. Notably, the cancer-specific hypermethylation pattern was robustly detected in extremely sparse methylome profiles covering as few as ~12,000 CpGs (~0.05% genomic CpGs), showing strong correlation with the most deeply sequenced cells (Fig. 2B). Similarly, KYCG also captured cell-type–specific differences, with differential methylation between single forkhead box protein p2–positive (Foxp2+) neurons and oligodendrocytes enriched at enhancer binding sites (fig. S2C) (39).
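Agreement between enrichment results from a sparse cell pair and from the most deeply covered pair can be summarized with a rank correlation. The sketch below is a plain-Python Spearman correlation with average ranks for ties; the enrichment scores are invented for illustration, and the function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical enrichment scores for the same five knowledgebase terms.
deep_pair = [12.1, 8.4, 3.3, 0.9, 0.2]
sparse_pair = [6.0, 5.1, 1.2, 1.0, 0.1]
rho = spearman(deep_pair, sparse_pair)
```

A rank-based measure suits this comparison because sparse inputs shrink the magnitude of enrichment statistics while often preserving the ordering of terms.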

To assess KYCG’s advantage in sparse methylome analysis, we compared it to HOMER, a widely used genomic interval–based enrichment tool (13). We used the above colon cancer hypermethylation as our query and tested the enrichment of TF binding motifs. KYCG identified biologically relevant motifs, such as caudal type homeobox 2 (CDX2), a key player in intestinal differentiation and often acting as a tumor suppressor and a prognostic marker (40, 41), as well as the FOX family and the androgen receptor ANDR, both implicated in colon cancer (42, 43). Testing the larger DMRs against similar TF binding databases (Materials and Methods), HOMER missed the colon relevance and picked up general TFs, affecting cellular differentiation and proliferation instead, such as sine oculis homeobox 4 (SIX4) and zinc finger protein 41 (ZNF41) (Fig. 2C) (44, 45). Notably, when using aggregated pseudobulks, HOMER did detect CDX2 and FOX motif enrichment. However, this signal diminished with smaller cell numbers (fig. S2D), suggesting that DMR calling may dilute the signal in sparse settings.

Furthermore, we observed that cancer-associated hypermethylation patterns align with the cancer cell’s tissue of origin. For example, while hypermethylated CpGs in TCGA bladder cancers were broadly enriched for H3K27me3 across many cell types, the strongest enrichment was observed when comparing H3K27me3 marks in immortalized urothelium cells. Likewise, breast cancer hypermethylation is most enriched in the same mark profiled from MCF7 breast epithelium cells (Fig. 2D).

To demonstrate KYCG’s broad applicability, we applied KYCG to a spatial DNA methylation dataset from a mouse E11.5 embryo (Fig. 2E) (46). Methylation differences between cells from two spatial regions (B and H) located near the brain and heart areas on the light-field image were analyzed. Differential methylation was primarily linked to embryogenesis-specific TFs, including zinc finger proteins, which is consistent with the developmental stage (Fig. 2E). Region B hypomethylation was enriched for brain-specific TFs [e.g., peroxisome proliferator-activated receptor delta (PPARD), LIM homeobox 1 (LHX1), eomesodermin (EOMES), NK6 homeobox 1 (NKX6-1), single-minded homolog 2 (SIM2)], while region H hypomethylation was enriched for heart-specific factors such as heart and neural crest derivatives expressed 2 (Hand2) (47–49). Notably, the brain-specific TF distal-less homeobox 6 (DLX6) was hypomethylated in region H, suggesting a preference for methylated DNA binding. These results highlight KYCG’s capability to resolve region-specific methylation differences and connect them to biological processes.

Aggregating methylation signals can mitigate missingness in single-cell datasets. However, large bin– or continuous genomic interval–based aggregation may obscure biologically relevant trans-acting features that span multiple genomic sites. Using KYCG’s fast aggregation capability (fig. S1F), we analyzed 1188 TFBS knowledgebases across 4000 single cells from 20 brain cell types to uncover transcriptional networks underlying cell identity (50). Differential methylation analysis revealed distinct patterns (Fig. 2F), such as hypomethylation at oligodendrocyte transcription factor 2 (OLIG2), SRY-box transcription factor 2 (SOX2), and SRY-box transcription factor 8 (SOX8) binding in oligodendrocytes, key regulators of their development (51, 52), and at nuclear factor, interleukin 3 regulated (NFIL3) and lymphoblastic leukemia derived sequence 1 (LYL1) binding in microglia, linked to immune function (53, 54). In addition, TFBS methylation distinguished superficial cortical neurons (L1-3/L2-4) from deeper layers (L4-5/L5-6), highlighting epigenetic regulation of cortical layer development. These findings demonstrate KYCG’s utility for dimensionality reduction and feature aggregation in sparse single-cell data.
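Aggregation over knowledgebase sets, as opposed to fixed genomic bins, can be sketched as averaging the methylation of whichever member CpGs a cell happens to cover. The example below is a simplified illustration with hypothetical names and toy values; it does not reproduce KYCG's optimized implementation.

```python
def aggregate_methylation(cell_meth, knowledgebases):
    """Average a cell's methylation over each knowledgebase CpG set.

    cell_meth maps CpG index -> methylation fraction for covered CpGs only.
    Returns {name: (mean methylation or None, number of covered CpGs)}.
    """
    result = {}
    for name, cpgs in knowledgebases.items():
        values = [cell_meth[c] for c in cpgs if c in cell_meth]
        mean = sum(values) / len(values) if values else None
        result[name] = (mean, len(values))
    return result


# One sparse single cell: only four CpGs have any coverage.
cell = {1: 1.0, 2: 0.0, 5: 1.0, 9: 0.0}
# Hypothetical TFBS knowledgebases, each a set of CpG indices that need
# not be contiguous in the genome.
tfbs_sets = {"TF_A": {1, 2, 3}, "TF_B": {5, 6}, "TF_C": {7, 8}}
features = aggregate_methylation(cell, tfbs_sets)
```

Reporting the number of covered CpGs alongside the mean lets downstream analyses discard or down-weight features supported by too few observations in a given cell.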

KYCG facilitates 5hmC analysis and assesses Oxford Nanopore Technologies direct detection

5hmC, an intermediate in 5-methylcytosine (5mC) oxidation and demethylation, plays a critical role in epigenetic cell identity. Despite its importance, 5hmC exhibits dynamic and sparse distribution (55–59). Even in brain tissues, where 5hmC is most abundant, it reaches only 10 to 20% of 5mC levels (60), posing substantial challenges for data analysis (21, 61).

To address the challenges of analyzing sparse 5hmC data, we tested KYCG on 5hmC profiles from recent single-cell studies. Using snhmC-seq2 data (57), we evaluated brain cell types where 5hmC was measured at only 0.2 to 1% CpGs in astrocytes and oligodendrocytes. Pairwise comparisons revealed that 5hmC differences between cell types were enriched in TF binding and genes linked to brain cell differentiation programs (Fig. 3A and fig. S3, A and B). T-box brain transcription factor 1 (TBR1) and Eomes emerged as the most significant TFs discriminating between excitatory and inhibitory neurons. The two TFs are essential for the development of glutamatergic excitatory neurons in the cerebral cortex and are typically absent in GABA-releasing inhibitory neurons (62, 63). In addition, myocyte enhancer factor 2A (Mef2a), an important transcription factor for excitatory neurons (64), emerged as a TF with binding significantly enriched at 5hmC differences between excitatory neurons and oligodendrocytes (Fig. 3A).

Fig. 3. Application of KYCG to 5hmC analysis and ONT direct detection.

(A) Pairwise comparison of 5hmC profiles in major brain cell types (astrocytes, oligodendrocytes, excitatory neurons, and inhibitory neurons) derived from snhmC-seq2 data. AS, astrocytes; OL, oligodendrocytes; EX, excitatory; IN, inhibitory. (B) Marker gene enrichment for hyper-5hmC from human bisulfite APOBEC-coupled epigenetic sequencing (bACE)-array data. (C) ChromHMM state enrichment of ONT-derived 5mC and 5hmC signals across four mouse tissues (lung, blood, uterus, and cortex) and deep ACE sequencing (ACE-seq). (D) Tissue-specific chromatin state enrichment of tissue-specific 5mC from ONT. (E) Comparison of ONT-derived 5hmC profiles and single-cell 5hmC datasets (SIMPLE-seq and snhmC-seq2). ASC, astrocytes; ODC, oligodendrocytes; OPC, oligodendrocyte precursor cell; Exc, excitatory; Inh, inhibitory.

In nonbrain high-turnover tissues, 5hmC is even scarcer (60), as 5hmC is a poor substrate for DNA methyltransferase 1 (DNMT1) and unmaintained in rapidly dividing cells (65, 66). This ultrasparsity leaves the interval and per-locus analysis of genome-wide 5hmC patterns largely impractical (61). To assess KYCG’s utility in this context, we analyzed 104 human 5hmC profiles across 25 tissue types generated using the bACE-array technology (67), applying KYCG to evaluate tissue-specific 5hmC signals (Fig. 3B). 5hmC sites in proliferative tissues, such as lymphocytes and placenta, were enriched near marker genes of corresponding cell types (Fig. 3B). For example, the placenta-specific gain of 5hmC is localized to ADAM12 and EPAS1, genes expressed in trophoblasts that regulate placental vascularization, nutrient availability, and immune tolerance (68–70). In lymph nodes, 5hmC was enriched near IGHM, IGKC, and other genes involved in B cell signaling and antibody production (71, 72). These observations demonstrate KYCG’s versatility in uncovering tissue-specific epigenetic regulation from ultrasparse 5hmC datasets.

Oxford Nanopore Technologies (ONT) sequencing is an emerging approach to directly discriminate 5mC, 5hmC, and unmodified C from ion current signals (73, 74), bypassing cytosine deamination methods that cannot separate 5mC and 5hmC (75). However, ONT’s 5hmC detection remains undercalibrated (74, 76), and per-site accuracy is difficult to assess due to the sparse and heterogeneous nature of 5hmC. To address this, we used KYCG to evaluate the biological relevance of ONT-based 5mC and 5hmC signals across four mouse tissues (lung, blood, uterus, and cortex) profiled with low-pass Flongle flow cells (~1 million CpGs per sample).

Our results showed that ONT-derived 5mC and 5hmC maps are consistent with established biology. 5mC was enriched at gene bodies (Tx) and Het (Fig. 3C and fig. S3C) (67). From the sparse methylomes, we identified specific methylation patterns contrasting one sample against the others. These patterns exhibited tissue-specific chromatin state enrichment, such as PromF7 (77) in brain cells and EnhA13 (77) in blood and immune cells (Fig. 3D). 5hmC shares 5mCs’ enrichment in gene bodies but is depleted in Het. Furthermore, 5hmC was enriched at enhancers, where 5mC is depleted, highlighting the unique role of 5hmCs in ten-eleven translocation (TET)-mediated active demethylation and cis-regulation. The ONT 5hmC enrichment patterns closely mirrored deep ACE sequencing (ACE-seq) data, supporting its biological accuracy (Fig. 3C). Further validation using single-cell 5hmC datasets (SIMPLE-seq and snhmC-seq2) showed strong cross-dataset concordance. Comparing these with ONT 5hmC signals, all brain cell types showed higher enrichment in brain ONT profiles compared to blood, lung, and uterus (Fig. 3E). While limited by the bulk nature of the ONT data, these findings support the broad biological relevance of ONT in resolving 5hmC landscapes.

KYCG detects cell composition dynamics through enrichment testing

DNA methylation has long been established as a robust biomarker to discriminate cell types and analyze their composition in heterogeneous tissues (78). We reasoned that testing methylation changes for enrichment in cell-type–specific methylation signatures would inform cell composition dynamics. To test this, we compiled KYCG knowledgebases, each holding CpG sites whose methylations discriminate two cell-type groups (a cell type contrast), including commonly used “one-versus-rest” comparisons (Fig. 4A and Materials and Methods). We used a nonparametric linear discriminant analysis approach to construct these knowledgebases while prioritizing CpGs showing large methylation differences between the contrasting groups (Fig. 4A).

Fig. 4. Detection of cell composition dynamics through KYCG cell-type–specific DNA methylation signature enrichment.

(A) Construction of cell-type–specific methylation knowledgebases with contrasts defined as pairwise comparisons of cell-type groups. Dendr., dendritic; Pla., plasma. (B) Chromatin state enrichment of hyper– and hypo–cell-type–specific methylation knowledgebases for immune, brain, and pan-tissue datasets. (C) Methylation signatures of cell types are enriched in marker genes for the corresponding cell type. (D) Validation of cell-type–specific methylation knowledgebases across datasets using normalized pointwise mutual information (NPMI). (E) Shared methylation signatures between unrelated cell types. CGE VIP, caudal ganglionic eminence vasoactive intestinal peptide–expressing interneurons. (F) Comparison of local methylation environment analysis at the NKX2-1 locus for inhibitory neurons and lung bronchus cells with other cell types. (G) Expression analysis of NKX2-1 across 79 cell types. nTPM, normalized transcripts per million. (H) Heatmap showing enrichment of EWAS hit CpG sets in cell-specific methylation CpG sets. The –log10 FDR values from the enrichment tests are z-score normalized within each trait (columns). Trait-related methylation enriches the cell types where the trait manifests. MSC, mesenchymal stem cell; GI, gastrointestinal.

To verify the quality of cell-type–specific methylation sets, we investigated their genomic distribution and validated sets across studies. First, consistent with prior reports (79, 80), cell-type–identifying methylation signals were more often based on the absence than the presence of methylation in the target cell types (Fig. 4B) and represent cell-type–specific enhancer chromatin (fig. S4A) (81). Second, cell-type–specific methylations likely regulate marker genes of the target cell type, suggesting an immediate transcriptional consequence (Fig. 4C). Genomic proximity analysis found that hypermethylation knowledgebases are more spatially clustered than the hypomethylation sets, suggesting their localization to CpG islands and involvement with the target gene expression (fig. S4B). Third, using normalized pointwise mutual information (NPMI) to measure set overlaps, we found that related cell types from different sequencing projects were associated with similar methylation signatures with concordant directionalities (Fig. 4D). Last, the cell-type–specific methylations are linked to cell lineage specification. For instance, brain cell methylation signatures are enriched in genes implicated in neurodevelopment and the differentiation of the specific neuron or glial cell types (fig. S4C).
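For reference, NPMI between two CpG sets can be computed from their sizes and overlap relative to a shared CpG universe, using the standard definition NPMI = log[p(a,b) / (p(a) p(b))] / (−log p(a,b)). The sketch below assumes that probabilities are simple set-size fractions; the function name and toy sets are hypothetical.

```python
from math import log


def npmi(set_a, set_b, universe_size):
    """Normalized pointwise mutual information of two CpG sets.

    Returns 1 for identical sets, about 0 for an overlap expected by
    chance, and -1 for sets that never co-occur.
    """
    n_ab = len(set_a & set_b)
    if n_ab == 0:
        return -1.0
    p_a = len(set_a) / universe_size
    p_b = len(set_b) / universe_size
    p_ab = n_ab / universe_size
    if p_ab == 1.0:
        # Both sets cover the whole universe; -log(p_ab) would be zero.
        return 1.0
    return log(p_ab / (p_a * p_b)) / -log(p_ab)


# Toy sets on a universe of 100 CpGs (illustrative only).
identical = npmi(set(range(10)), set(range(10)), 100)
by_chance = npmi(set(range(0, 50)), set(range(25, 75)), 100)
disjoint = npmi(set(range(10)), set(range(50, 60)), 100)
```

The normalization bounds the score between −1 and 1, which makes overlaps comparable across knowledgebases of very different sizes.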

Some unrelated cell types share methylation changes at overlapping CpG sites, suggesting regulatory network reuse (Fig. 4E). For example, inhibitory medial ganglionic eminence (MGE) neurons and lung bronchus cells, despite functioning in disparate organ systems and arising from different developmental origins, shared methylation signatures (Fig. 4, E and F). Although unexpected, we confirmed that these regions are indeed similarly methylated at the NKX2-1 locus and share similar NKX2-1 expression patterns relative to all other cell types they were compared to (Fig. 4G).

Cell composition dynamics may be mechanisms of methylation associations in EWAS studies of bulk tissues. Using our cell-specific knowledgebases, we tested whether KYCG could detect cell composition changes across disease states. We observed a concordant enrichment of trait-associated CpGs in the corresponding cell-type signatures (Fig. 4H and table S1B). For example, inflammatory bowel disease and Crohn’s disease–associated CpGs were enriched in lower gastrointestinal cell markers, while CpGs with type 2 diabetes–linked methylation showed enrichment in pancreatic cells. Similarly, methylation variations interrogated in liver aging and hepatocellular carcinoma studies were enriched in CpGs carrying hepatocyte-specific methylations. These observations likely reflect disease-associated shifts in cell-type proportions or aberrant methylation at cell identity–linked sites.
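When enrichment results for many traits are displayed together, the −log10 FDR values can be z-score normalized within each trait so that traits with very different overall significance share a common scale. The sketch below assumes the population standard deviation and uses invented values; the function name is hypothetical.

```python
def zscore_within_trait(neglog10_fdr):
    """Z-score normalize -log10(FDR) values across cell types per trait.

    neglog10_fdr maps trait -> {cell type: -log10(FDR)}.
    """
    normalized = {}
    for trait, by_cell in neglog10_fdr.items():
        values = list(by_cell.values())
        mean = sum(values) / len(values)
        # Population standard deviation (an assumption of this sketch).
        sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
        normalized[trait] = {
            cell: ((v - mean) / sd if sd > 0 else 0.0)
            for cell, v in by_cell.items()
        }
    return normalized


# Hypothetical enrichment values for one trait across three cell types.
toy = {"trait_x": {"hepatocyte": 3.0, "T cell": 2.0, "neuron": 1.0}}
z = zscore_within_trait(toy)
```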

KYCG facilitates machine learning model interpretation

DNA methylation–based predictive models have been widely used in translational applications. However, interpreting these “black-box” models remains challenging. We hypothesize that KYCG could reveal the workings of predictive models by analyzing model features. Below, we focus on epigenetic clocks and cancer classifiers as two examples.

We queried eight epigenetic clocks that predict chronological aging and biological causes that alter organismal aging. First, we observed that different clock models’ features are associated with different enrichment terms, potentially reflecting the clocks’ prediction targets (Fig. 5A). The DunedinPACE clock, designed to predict the pace of aging from 19 different physiological measures (82), was highly enriched in sites with methylations linked to body weight and metabolic traits. The EpiTOC clock measures mitotic activity (83) and was enriched in cancer studies, partially methylated domains (PMDs), and Polycomb group targets. The Horvath, Levine, and Hannum clocks that predict chronological or phenotypic age were enriched in aging EWAS studies from independent cohorts not seen during training by the respective clock. Bohlin and Knight gestational age clocks (84, 85) were enriched in independent gestational age EWAS studies (86), while the Lee clock (87), trained on placental tissues, was also enriched in one gestational age study. Similar to EpiTOC, it was also enriched in cancer-associated methylations, bivalent chromatin, Polycomb group targets, and PMDs.

Fig. 5. Machine learning model interpretation with KYCG.

(A) Enrichment testing of eight epigenetic clocks reveals feature-specific associations with chromatin states and 17 EWAS studies. P values are based on Fisher’s exact tests before FDR correction. (B) Bohlin gestational age clock features enrichment in HOXB gene clusters and H3K36me2 histone modifications. OR, odds ratio. (C) Enrichment of DunedinPACE clock probes in genes and chromatin states. (D) Enrichment analysis of high- and low-importance CpG features from cancer classifiers. (E) Differential methylation enrichment analysis between correctly and incorrectly classified tumor samples. (F) Misclassified meningiomas compared to correctly classified tumors reveal CpG enrichments in neuronal, endothelial, and microglial signatures. P values are based on Fisher’s exact tests before FDR correction. (G) Heatmap of TNXB-associated methylation differences between correctly and incorrectly classified meningiomas. TNXB, tenascin XB.

Besides linking the clock features to related traits, KYCG also generated hypotheses regarding the models’ workings. The Lee clock enrichment likely reflects placental tissue’s high proliferation and cancer-like properties and may explain the poor performance of other cord blood–trained clocks on placental samples (87). For the Horvath and the Hannum clocks that predict chronological ages, we observed enrichments in cell-specific methylations from immune cell types such as monocytes, natural killer (NK) cells, and dendritic cells (fig. S5A). These enrichments reflect altered blood composition during the aging process (88) and are leveraged by epigenetic clocks to predict age (89). Compared to other aging clocks, the Bohlin gestational clock was enriched in HOXB genes and histone H3K36me2 marks (Fig. 5B), suggesting that the clock might have used the methylation of homeobox (HOX) genes, which are important for gestational development and body patterning (90), and the methylation gain might be mediated by H3K36me2, which recruits DNMT3s via the PWWP domains (91). The same HOXB3 site (cg15908709) can also be associated with gestational age in an independent dataset (fig. S5B) (86), validating this link. Last, KYCG found an enrichment of DunedinPACE clock features in overweight phenotypes, e.g., body mass index, obesity, and hepatic fat, as well as inflammatory disease signaling, e.g., Crohn’s disease, irritable bowel syndrome, and C-reactive protein (fig. S5C). Notably, DunedinPACE features are spatially linked to the gene LGALS3BP (Fig. 5C), which regulates immune responses in colon epithelial cells (92), cancer (93), HIV infection (94), and organ decline (95), suggesting a potential mechanism of the clock tracking diseases via the epigenetic regulation of a key circulating glycoprotein.

We next asked whether KYCG could help interpret cancer classifiers (96). We trained a random forest classifier on 2801 public brain tumor methylomes from more than 80 tumor classes (Materials and Methods). KYCG found that features with the highest importance scores were enriched in enhancers and actively transcribed genes, whereas less important CpGs were enriched only in gene bodies (Fig. 5D). This suggests that the tumor cells of origin and the regulatory networks underlying cell identity differences are the main signal sources in cancer classification.

Furthermore, KYCG can help explain misclassifications. For example, we compared five correctly classified meningiomas to five misclassified tumors (Materials and Methods), separated by the leading principal component (Fig. 5E and fig. S5D). The 200 CpGs with the greatest positive loading scores along the leading principal component were enriched in neuronal, endothelial, and microglial signatures, suggesting that these samples may have different cells of origin (Fig. 5F). Linear modeling between the classification groups identified 30,686 differentially methylated CpGs that distinguished correct classifications from misclassifications. These CpGs were enriched in TNXB (Fig. 5, F and G), which was previously shown to be differentially methylated across the dura and leptomeningeal layers of the meninges (97). This suggests that the misclassification likely reflects meningiomas originating from different leptomeningeal layers.

KYCG detects technical confounders in single-cell and EWAS datasets.

Hidden technical confounders can mislead methylation biology interpretation (8, 98) and can be hard to detect even for experienced researchers. The KYCG knowledgebases include CpG sets linked to sequencing- and array-specific artifacts, e.g., methylation measurements influenced by genetic variations or poor coverage uniformity, to enable automatic sanity checks (Fig. 1C). To demonstrate this utility, we first applied KYCG to analyze 12 single-cell methylation studies on mouse tissues using eight assay technologies (Fig. 6A). Clustering these single-cell methylomes by their genomic feature enrichments revealed the impact of profiling technology on coverage uniformity (Fig. 6A). Most single-cell methylome datasets are biased in coverage toward CpG-dense regions, e.g., the transcription start sites (Tss/TssBiv), and depleted in Het and Quies regions, although most library preparation protocols do not intentionally enrich specific genomic regions. As a positive control, this bias is most prominent in single-cell reduced-representation bisulfite sequencing (scRRBS) and single-cell extended representation bisulfite sequencing (scXRBS), as they explicitly target CpG-dense regulatory regions (99). iscCOOL (100), scCOOL (101), and sciMETv2 (102) showed a reverse depletion pattern in CpG-rich regions and slight enrichment in Het (Fig. 6A). This reversed nonuniformity was potentially linked to adopting a tailing and ligation method as opposed to the usual postbisulfite adaptor tagging (100). Technologies based on isolated nuclei (e.g., snmC-seq) are depleted in mitochondrial CpGs, while those that profile total cellular DNA are enriched in the mitochondrial genome, reflecting its high copy number (fig. S6A). We integrated two single-cell brain datasets profiled using two different assay technologies and found that cells of the same cell type formed different clusters. KYCG revealed that the difference is primarily linked to the bias in capturing different chromatin features, with Luo et al. (50) better capturing the Quies regions (Fig. 6B) and being slightly more depleted in TssA/TssFlnk chromatin states, particularly in neurons and oligodendrocytes, compared to the sites covered in Lee et al. (121) (fig. S6, B and C).

Fig. 6. Technical confounder discovery with KYCG in single-cell and EWAS datasets.

(A) Coverage biases in single-cell methylome technologies of 12 single-cell methylation studies using eight assay technologies. (B) Technical variations in three single-cell brain methylation datasets. t-SNE plots illustrate clustering by assay technology and differential capture of chromatin features. (C and D) Identification and enrichment of probes with poor correlation to methylation titration in the mouse array (C) and the human methylation screening array (D). exp., experiment; Pop., population. (E) Detection of ancestry-associated artifacts in EWAS datasets. P values are based on Fisher’s exact tests before FDR correction. FC, fold change.

Genetic polymorphism and sequence mappability can substantially affect methylation array measurement but are often overlooked. To demonstrate KYCG’s utility in detecting such artifacts, we used a methylation titration dataset to identify probes whose methylation readings are uncorrelated with the known titrated methylation fractions. KYCG found an enrichment of these probes in probes with known sequence mismatches, single-nucleotide polymorphism (SNP) probes, non-CpG methylation probes, negative control probes, and probes with suboptimal or nonunique mappings (e.g., targeting repetitive elements; Fig. 6, C and D). These enrichments suggested that probe sequence artifacts contributed to the probes’ poor performance, as revealed by the titration experiments. Furthermore, we checked array probes with variable signal intensity in a dataset of nonmalignant human tissues. KYCG found these probes enriched in mapping and color channel artifacts, suggesting an immediate consequence of probe hybridization and base extension (fig. S6D). Last, we applied KYCG to check CpG sets supposedly associated with ancestry, as in a previous study (104). We observed that these CpGs are significantly enriched in methylation readings influenced by human genetic polymorphisms (Fig. 6E), highlighting the critical need to distinguish true methylation quantitative trait loci (meQTLs) or ancestry-linked DNA methylation from measurement artifacts. These analyses demonstrate the utility of KYCG in detecting technical confounders and sanity-checking EWAS discoveries.

DISCUSSION

Efficient enrichment testing tools are critical to the effective learning of omics datasets. CpG sites are the base units for DNA methylation data with a fixed length of 2 base pairs (bp) and a globally depleted prevalence, presenting an intrinsic sparsity. Gene-centric and DMR-based methods, originally designed for other omics data types (13, 14, 105), may be insufficient at fully capturing methylation biology. Gene-centric methods suffer from a CpG-gene linkage challenge and do not cover intergenic changes, which are now also known to have a regulatory role. On the other hand, DMR-based approaches assume that the methylations of nearby CpGs vary at a certain genomic scale, are coregulated by common chromatin features, and should be analyzed as units. However, this assumption can break down when methylation biology functions at finer or broader genomic scales. For example, TFBS often span just 5 to 30 nucleotides and may involve only single CpGs. In such scenarios, a base-level approach, as in KYCG, can be more sensitive at capturing fine-scale patterns. KYCG benefits not only sparse but also nonsparse datasets in providing multiscale interpretations of discrete methylation datasets.

Furthermore, many population-scale epigenetic studies operate within a “CpG subspace,” such as that set by Infinium microarray design. CpG-indexed enrichment analysis is well suited for these contexts, as implemented by existing tools (31, 32). However, a unified framework that generalizes across data types—including sequencing-based assays that may (e.g., WGBS) or may not (e.g., RRBS) target a fixed CpG set—has been lacking. Toward this goal, we conducted in silico experiments to evaluate the stability of enrichment testing across different CpG subspaces. Our analysis suggests that, when the proper testing universe is used, enrichment results from array-defined CpG subspaces faithfully track results from whole-genome datasets, except in extremely sparse scenarios (Fig. 1E). The results likely depend on the query and knowledgebase sets. Using CTCF binding sites as the query, we observed a slight reduction in the number of significant terms relative to uniformly downsampled data of similar genome coverage (Fig. 1E). This is likely due to array-based spaces being biased toward genic and enhancer regions, which may miss intergenic signals. Nonetheless, the top enriched terms remain stable. As methylation microarrays have much smaller CG subspaces, this resilience of enrichment to sparsity would justify the adoption of an array technology for lower experimental and computational costs.

A key strength of KYCG is its unified design that integrates data with curated resources, agnostic to assay platforms. For common array platforms (67, 106–110), KYCG precomputed knowledgebases indexed by CpG probe IDs (“cg” numbers). For sequencing data–based knowledgebases, KYCG dynamically sets the appropriate background universe based on the query scope. This flexibility enables consistent enrichment analysis across both array and sequencing platforms, facilitating integration of data with knowledge derived from diverse assay types.

Beyond defined CpG subspaces, KYCG scales base-level interpretation to highly sparse DNA methylome datasets, including single-cell (e.g., snmC-seq or sci-MET) and spatial methylomes (e.g., Spatial-DMT) (46). These assays offer high-resolution insights but suffer from signal dropout and low per-site coverage, limiting traditional DMR-based approaches. Aggregation to pseudobulks can also be challenging when not enough cells of the cell type are captured. KYCG offers a solution for studying “dirty” differential methylations where the difference per locus is not statistically examined and DMR boundaries are murky. This strategy may also benefit biological scenarios of global but subtle methylation changes, e.g., methylation reader defects (111).

Key to the feasibility of comprehensive testing is the efficiency of KYCG in scanning the whole genome. Compared to gene enrichments focusing on ~30,000 human genes, enumerating ~28 million CpGs imposes a major computational hurdle to CG-based enrichment testing. When the knowledgebases are small and CGs can be indexed in the CG subspaces, one could adopt the traditional approach of set comparisons. However, a more efficient approach is needed when the queries and knowledgebases become larger. Here, we explored both pathways for addressing this hurdle and provided flexible computational solutions. We index the CGs based on the genomic coordinates for large queries and knowledgebases and use a vectorized counting approach to calculate the set overlaps quickly. This substantially enhanced the performance of set comparisons and enabled the efficient testing of thousands of knowledgebases. The same idea can apply to 5hmCs and non-CpG methylations, which are greater in number and more memory demanding. More powerful compression methods may be used to further enhance computational efficiency.
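The vectorized counting idea can be illustrated with a minimal sketch. It assumes that every CpG is identified by its rank in a fixed reference ordering, so that a CpG set becomes a bit vector and set overlaps reduce to bitwise operations. This is not YAME's implementation: the function names are hypothetical, and Python integers stand in for packed bit vectors.

```python
def to_bitvector(cpg_indices):
    """Encode a CpG set as an integer whose bit i is 1 if CpG i is a member."""
    bits = 0
    for i in cpg_indices:
        bits |= 1 << i
    return bits


def popcount(bits):
    """Number of set bits, i.e., the size of the encoded CpG set."""
    return bin(bits).count("1")


def overlap_counts(query, knowledgebase, universe):
    """Count set sizes and their overlap after restricting both sets to the universe."""
    q = query & universe
    k = knowledgebase & universe
    return {
        "universe": popcount(universe),
        "query": popcount(q),
        "knowledgebase": popcount(k),
        "overlap": popcount(q & k),
    }


# Toy example: 16 CpGs are covered in the experiment (the universe).
universe = to_bitvector(range(16))
query = to_bitvector([1, 2, 3, 8, 20])      # index 20 lies outside the universe
knowledgebase = to_bitvector([2, 3, 4, 5])
counts = overlap_counts(query, knowledgebase, universe)
```

The four counts are exactly the inputs needed for a hypergeometric or Fisher's exact test, and no genomic coordinates are touched during the comparison.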

In implementing KYCG’s strategy, we noted that CpG-indexed enrichment testing requires both query and knowledgebase sets, and potentially the universe sets, to share the same CpG index. This is likely the reason some tools such as HOMER do not natively support 2-bp queries. While tools such as LOLA (17) can accept 2-bp queries, bias arises if the knowledgebases remain interval based. Converting these intervals to 2-bp resolution eliminates the bias but greatly increases storage and computation time without efficient indexing, limiting scalability to large numbers of databases. For example, comparing the end-to-end run time of KYCG and LOLA in performing the analysis described in Fig. 2B, KYCG was substantially more efficient (fig. S6E), although the two tools produced similar results.

Automated sanity checks against collinear biology and technical confounders, which may contribute to observed trait associations, are a pressing need even for seasoned scientists. For example, copy number polymorphism has masqueraded as epigenetic silencing events (98). Leukocyte contamination may confound the discovery of cancer-associated epigenetic silencing (112). Global methylation variation linked to proliferation and impaired DNMT recruitment can be misinterpreted as altered epigenetic aging (25, 113). CpG-rich genomic features, e.g., CpG islands, canyons, and nadirs, overlap extensively and share similar methylation biology, such as mitotic hypermethylation. KYCG represents a step toward addressing this challenge and allows one to check these collinear associations by automatically testing genomic colocalization and comparing enrichment levels. For example, our analysis demonstrated that one can dissect the cell-type context by comparing enrichment levels of the same histone modification features but in different cell types. We also caution that array-based meQTL discovery can be affected by SNP-originated reading artifacts (22). Further expanding and improving the comprehensive collection of knowledgebases is critical to keeping awareness of all hidden biological and technical links.

MATERIALS AND METHODS

Whole-genome encoding and compression via YAME

The KYCG framework is designed to streamline whole-genome–wide CpG encoding, data storage, and statistical analyses, leveraging efficient compression and parsing capabilities provided by its core component, YAME (Yet Another MEthylation analysis tool). To minimize storage requirements, genomic coordinates are not explicitly stored. Instead, all knowledgebases and query datasets are preprocessed and indexed according to a default ordering of CpGs based on the reference genome (e.g., GRCh38), with or without contig information. The genomic coordinates are compactly stored separately and can be flexibly combined or built using generic tools such as AWK and BEDTools (35). This coordinate-free design reduces redundancy while ensuring consistency across datasets.

YAME, the command-line tool within KYCG, handles the encoding, parsing, and compression of CpG-related data. A combination of bit-packing, run-length encoding, and the DEFLATE algorithm is used for sparse methylomes dominated by zeros, substantially reducing file sizes for optimal storage, inflation, and access. Categorical data, such as sequence context or chromosome annotations, are compressed using a specialized state encoding scheme. This separates textual state definitions from indices, optimizing repetitive patterns for space savings. For methyl-seq data, YAME uses a unique MU specification, where methylated (M) and unmethylated (U) read counts are stored in a 64-bit integer. The upper 32 bits represent the methylated allele (M), and the lower 32 bits represent the unmethylated allele (U). This encoding is both space efficient and computationally optimized. These integers are further compressed, ensuring compact storage for large datasets.
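The MU layout described above (M in the upper 32 bits, U in the lower 32 bits of one 64-bit integer) can be sketched as follows. The helper names are hypothetical and are not YAME's API; the sketch only illustrates the bit layout and how a methylation fraction can be recovered from a packed value.

```python
MASK32 = (1 << 32) - 1  # largest count that fits in 32 bits


def pack_mu(m, u):
    """Pack methylated (m) and unmethylated (u) read counts into one 64-bit value."""
    if not (0 <= m <= MASK32 and 0 <= u <= MASK32):
        raise ValueError("read counts must fit in 32 unsigned bits")
    return (m << 32) | u


def unpack_mu(mu):
    """Recover (m, u) from a packed value."""
    return mu >> 32, mu & MASK32


def beta(mu):
    """Methylation fraction M / (M + U); None when the site has no coverage."""
    m, u = unpack_mu(mu)
    total = m + u
    return m / total if total else None


packed = pack_mu(3, 1)       # 3 methylated reads, 1 unmethylated read
fraction = beta(packed)      # 0.75
uncovered = beta(pack_mu(0, 0))
```

Because uncovered sites pack to zero, a sparse methylome becomes a long run of zeros, which is the kind of input that run-length encoding and DEFLATE compress well.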

YAME also enables flexible data manipulation. It supports combining multiple knowledgebases or datasets into a single indexed file, enabling random-access queries with constant time complexity. For extending datasets to higher dimensions (e.g., non-CpG methylation signals or larger genomes), YAME supports data inflation to different levels of precision. This feature allows it to function efficiently in memory-constrained environments. Furthermore, YAME provides extensive functionality for data manipulation, including efficient subsetting of sites and samples, aggregation, masking, downsampling, chunking, and performing rowwise operations. These features make YAME a versatile tool for preparing and analyzing complex datasets.

Comprehensive CpG annotation

Multilayer CpG annotations are organized as knowledgebases, encompassing 12,114,567 CpG-indexed datasets systematically curated for automated discovery and analysis (see Data and materials availability). This annotation integrates data from human and mouse genome sequences, annotations, and extensive public resources, including 11,806 bulk and 480,012 single-cell sequencing and array-based profiling studies, as well as EWAS projects (table S1A). The annotations are organized into four broad categories of testing domains: (i) sequence features: includes k-mers, tetranucleotides, and TF binding motifs; (ii) genomic features: includes chromatin states, histone modifications, gene associations, local modules of CpGs correlated in methylation levels across tissues, transposable elements, TFBS, and evolutionary conservation; (iii) trait associations: includes cell-type–specific methylation, human EWAS associations, and epigenetic clocks; and (iv) technical associates: includes sequence maskers, array hybridization artifacts, and extension masks. Each testing domain includes a varying number of CpG sets linked to biological and technical ontologies.

Sequence features: This category includes key sequence composition metrics such as CpG density, GC fraction, sequence motifs, and k-mer contexts. Transcription factor binding models were obtained from HOCOMOCO (114), and motif locations in the human and mouse genomes were identified using the FIMO tool from the MEME suite (115). These motif locations were extended by ±10 bp to define corresponding CpG sets. Tetranucleotide sequence contexts were integrated with three-dimensional (3D) chromatin compartment data to capture CpG sets associated with biologically relevant features, such as PMD solo-WCGW sequences, which are indicative of replicative methylation loss (25), and other sequence contexts known to be more subject to biased DNMT (116) or TET-mediated modifications (117). For most sequence feature knowledgebases, including tetranucleotide contexts, CpG references were standardized by merging the C and its complementary palindromic G. In addition, stranded CpG sets were constructed to assess strand-specific preferences for hemimethylation and non-CpG methylation, providing deeper insights into sequence-context–specific methylation patterns.

Genomic features: CpG sets were characterized across genomic scales, from large-scale features such as Hi-C AB compartments and topologically associating domains (TADs) to smaller-scale events such as histone modifications and TFBS. ChromHMM annotations, TFBS, and histone modifications were used to construct both consensus and cell-type–specific knowledgebases. Data were sourced from Cistrome (118) and ReMap 2022 (103), which integrate ENCODE data. The peaks were intersected with human and mouse reference genome CpG coordinates. The top 50,000 to 100,000 CpGs with the highest overlap frequencies (including variations due to ties) were selected to construct consensus TFBS and histone modification knowledgebases. Different consensus ChromHMM annotations were taken from the human and mouse data generated in the Roadmap Epigenomics Mapping Consortium (119) and ENCODE (24, 120), targeting primary tissue and cell lines, respectively. To address the underrepresentation of cell-type– or tissue-specific chromatin states (e.g., enhancers or promoters) in consensus annotations, full-stack ChromHMM segmentation (77, 81) was incorporated to construct CpG-indexed knowledgebases for specific cell or tissue types. These were refined into MU-style knowledgebase sets by calculating the frequency of CpG overlaps across samples to capture consensus and specific features. Additional features include the integration of the PhastCons evolutionary conservation score to capture conservation metrics and indexing metagene data relative to gene coordinates for positional annotations of CpGs within genes. Gene links were derived for CpGs within a region from 10-kb upstream TSS to transcription termination sites. Enhancer-overlapping subsets were constructed on the basis of CpGs in regions marked by H3K4me1 and H3K27ac and the absence of H3K4me3, defining active enhancers. These annotations enable quick data summaries, such as using metagene knowledgebases for generating metagene plots and flanking sequence sets for sequence logo visualizations, ensuring comprehensive and flexible genomic analyses.

Trait associations: This category includes cell-type–specific methylation as identified by single-cell and sorted cell methylome profiles and those linked to human traits, as primarily identified from previous array experiments. To construct cell-specific CpG knowledgebase sets, BED/bigWig files for single-cell brain (50, 78, 121), sorted pan tissue (79), and sorted immune cell WGBS (122) data were downloaded and used for marker identification. To reduce the sparsity of single-cell brain data, pseudobulk methylomes were generated by averaging methylation over the cell-type labels obtained by previously reported unsupervised clustering analysis. To define cell signatures, we first developed 1038 contrast groups (table S2) by manually curating the hierarchy of cell types, each defining a sample set. The curation was guided by global methylome similarity and biological knowledge (Fig. 4A). We then investigated every pair of sample sets across major cell-type groups and hierarchically within major groups. Targeting these contrast groups, we performed a nonparametric discriminant analysis as follows: Pairwise Wilcoxon rank sum testing was performed between the target and the background groups at each CpG site to identify cell-specific markers. CpG sites with an area under the curve (AUC) > 0.95 and a difference in β value of >0.5 between the target and the background groups were selected. Cell signature knowledgebases were tested for enrichment against consensus and full-stack ChromHMM knowledgebases in KYCG using the testEnrichment function. For human trait associations, 1067 EWAS studies were curated from the literature and EWAS databases [EWAS catalog (123) and EWAS atlas (124)] and converted to knowledgebases by intersecting the trait-associated CpG probes with each array platform.
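The marker selection rule can be sketched for a single CpG as below. The AUC is computed from pairwise comparisons, which is equivalent to the Wilcoxon rank sum (Mann-Whitney) U statistic divided by the number of target-background pairs. Two points are assumptions made for illustration: the β difference is taken as a difference in group means, and only the hypermethylated direction is shown. The function names are hypothetical.

```python
def auc(target, background):
    """AUC = U / (n_target * n_background), counting ties as half a win."""
    wins = 0.0
    for t in target:
        for b in background:
            if t > b:
                wins += 1.0
            elif t == b:
                wins += 0.5
    return wins / (len(target) * len(background))


def is_marker(target, background, min_auc=0.95, min_delta=0.5):
    """Apply the AUC > 0.95 and delta-beta > 0.5 thresholds to one CpG site."""
    delta = sum(target) / len(target) - sum(background) / len(background)
    return auc(target, background) > min_auc and delta > min_delta


# Toy beta values for one CpG in target versus background samples.
strong = is_marker([0.90, 0.95, 0.85], [0.10, 0.20, 0.15, 0.30])
# Perfect separation but a small effect size fails the delta-beta filter.
weak = is_marker([0.60, 0.70], [0.40, 0.50])
```

Requiring both thresholds keeps sites that separate the groups cleanly and that also differ by a large margin in methylation level.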

Technical associates: This category includes CpG groups useful for controlling data quality in sequencing and array experiments. Besides checking for sex and mitochondrial chromosome enrichment, sequence-based knowledgebases include the ENCODE exclusion list (125), centromeres, telomeres, and micro- and macrosatellite sequences. Probe array masks were obtained from previous studies (9). Briefly, they cover probe hybridization and extension artifacts due to sequence polymorphism and nonuniqueness.

Knowledgebase cross-validation

The curated CpG knowledgebases are diverse in biological category and size (fig. S1B). To understand the redundancies and relationships between the knowledgebase sets, we computed the NPMI, a statistical measure of co-occurrence (−1 = never, 0 = independence, and 1 = always co-occurs) for each pair. Figure S1C shows a graph of a small subset of intergroup knowledgebase sets sharing the highest NPMIs (>0.5) across all computed pairs. The remaining edge list was graphed in Cytoscape (126) version 3.9.1 with the Prefuse Force Directed layout. NPMIs between histone modifications were graphed in ComplexHeatmap (127) version 2.19.0. Although it was not uncommon for knowledgebases from different groups to share some CpG sites after thresholding for NPMI, six general communities emerged: (i) CpG islands and TSS, (ii) gene bodies, (iii) Het regions, (iv) bivalent and polycomb repressive complex 2 (PRC2) targets, (v) CTCF binding sites, and (vi) enhancer-like elements. NPMI was also computed for every pair of cell-signature knowledgebases. Sets with an NPMI >0.4 were selected for visualization using the Circlize package (version 0.4.15) (128).
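For reference, the standard definition of NPMI for two CpG sets X and Y over a universe of N CpGs is PMI(X, Y) / (−log p(X, Y)), where PMI = log[p(X, Y) / (p(X) p(Y))] and each probability is a set size divided by N. A minimal sketch follows; the function name and the handling of the two boundary cases are assumptions for illustration, not the package's code.

```python
import math


def npmi(n_x, n_y, n_xy, n_universe):
    """Normalized pointwise mutual information between two CpG sets.

    n_x, n_y: set sizes; n_xy: size of their overlap; n_universe: total CpGs.
    """
    if n_xy == 0:
        return -1.0   # the sets never co-occur
    if n_xy == n_universe:
        return 1.0    # both sets cover the whole universe (convention)
    p_x = n_x / n_universe
    p_y = n_y / n_universe
    p_xy = n_xy / n_universe
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


identical = npmi(10, 10, 10, 100)     # close to 1: the sets always co-occur
independent = npmi(50, 20, 10, 100)   # close to 0: overlap matches expectation
disjoint = npmi(10, 10, 0, 100)       # -1: no shared CpGs
```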

We explored the overlap of the 83 histone modifications and 1188 TFBS knowledgebases with ChromHMM genomic features. Related histone modification–overlapping CpG sets are clustered together based on NPMI, forming distinct groups (fig. S1D). Notably, the promoter group is overrepresented by various activating histone acetylation and H3K4me3 marks. Other histone modification–overlapping CpG knowledgebases are organized into broad categories representing bivalent chromatin, gene transcription, and Het (fig. S1D). Transcription factor binding sites rarely co-occurred with Het and Quies regions, with mean NPMIs of −0.244 and −0.236, respectively. A total of 161 TFs of the 1188 (13.6%) did not have an NPMI > 0.25 with any ChromHMM feature. This group of TFs was enriched in the ZNF family of proteins (P = 9.952 × 10^−10; Fisher’s exact test), and gene ontology analysis revealed enrichment relating to DNA replication. Of the remaining TFs, 944 (79%) showed the highest NPMI with TssA, consistent with TFs generally binding adjacent to promoters. A total of 31 (3%) TFs displayed the highest preference for EnhA1 regions, 21 (2%) for TssBiv regions, 13 (1%) for genic enhancers (EnhG2), 10 (0.8%) for TssFlnkU, 4 for Tx, and 3 for ZNF_OR_Rpts. Overall, TFs are generally localized with Tss elements and enhancers (fig. S1E). TFBS-overlapping CpGs were analyzed across multiple experiments, aggregating overlaps to compute NPMI with ChromHMM features. TFBS with NPMI > 0.25 were grouped by their highest NPMI ChromHMM feature. Gene Ontology analysis for TFs in each group was performed using Enrichr (129). This validates the construction and confirms the expected biological relationships among the knowledgebases.

To validate cell-type–specific signatures, each knowledgebase was first tested for enrichment in gene knowledgebases within 10 kb of the query CpGs, identified with the buildGeneDBs function. Enriched genes [false discovery rate (FDR) < 0.05; Fisher’s exact test] for each signature branch were overlapped with the marker genes for each nontumor human cell type from the CellMarker2.0 database (130), and cell types from pairs that had four or more overlapping genes were selected for visualization in ComplexHeatmap (version 2.19.0) (127). For brain cell enrichment testing, one versus all signatures for excitatory neurons, inhibitory neurons, and glia were tested for enrichment against gene (identified with buildGeneDBs), consensus ChromHMM, and TFBS knowledgebases.

CpG set enrichment testing

Building on YAME’s ability to rapidly compute CpG counts and overlaps (with an optional universe set constraint), the KYCG R/Bioconductor package provides statistical analysis functionalities and visualization for enrichment results. For pairwise methylome analysis, YAME’s pairwise function efficiently identifies differential methylation CpG sets (DMCs) that represent various contrasts (e.g., hypermethylation, hypomethylation, or both combined) with customizable filters, using the set of CpGs involved in the comparison (covered in both profiles and comparable) as the universe. KYCG uses the hypergeometric distribution as the null distribution for enrichment testing. The package supports fast calculation of Fisher’s exact test statistics (via R’s phyper function) and FDR correction, offering one- and two-sided testing options. While efficient, this test assumes statistical independence among CpGs. Multiple test corrections, by default via Benjamini-Hochberg, are done within each testing domain to avoid domain size imbalance. This is justified by the distinct hypothesis space with different term counts, biological relevance, and structural organization.
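The one-sided test and the per-domain correction can be sketched as below. This is a pure-Python stand-in for illustration rather than the package's code (which the text says relies on R's phyper): the upper tail of the hypergeometric distribution gives the enrichment P value, and Benjamini-Hochberg adjustment would then be applied separately to the P values of each testing domain. Function names are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, n_query, n_kb, n_universe):
    """P(X >= overlap) when n_query CpGs are drawn from a universe holding n_kb hits."""
    total = comb(n_universe, n_query)
    upper = min(n_query, n_kb)
    tail = sum(
        comb(n_kb, k) * comb(n_universe - n_kb, n_query - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Query of 4 CpGs, knowledgebase of 4 CpGs, overlap of 2, universe of 16 CpGs.
p_enrich = hypergeom_upper_tail(2, 4, 4, 16)
# Adjust the P values of one testing domain on its own.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])
```

Correcting within each domain means that a domain with thousands of terms does not dilute the significance of terms in a domain with only a few.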

In addition, KYCG uses a gene set enrichment analysis–like strategy to compare set-based query or knowledgebases and continuous vector variables on a defined universe. Significance is assessed using a Kolmogorov-Smirnov test on the permuted null distribution, with a Gaussian approximation of the null offered as an efficient alternative for large query or knowledgebases. In addition, the framework integrates gene-CpG associations, enabling pathway-level analyses of genes linked to enriched CpGs. A suite of visualization tools, including dot plots, waterfall plots, volcano plots, and track plots, is available to ensure a clear and interpretable presentation of results. Enrichment testing considers a universe set built for each experiment. The YAME binarize and pairwise functions conveniently produce a paired target and universe set from data for subsequent enrichment testing.

KYCG performance and stability

For each platform (whole genome, EPIC, and HM450), random queries of 1 million, 0.5 million, and 0.1 million were generated by sampling (with replacement when necessary) the respective platforms’ universe space. The queries were tested for enrichment in consensus ChromHMM features, using the respective platform as the background universe space. Testing for each query size–platform pair was repeated 100 times. Compute times for set-based testing in R were measured using the Sys.time() function. For vectorized testing, the command-line time function was used. Compute times were measured only for the Fisher’s exact testing process and not the time elapsed for I/O of the knowledgebase and universe files or query generation. Memory usage was tested using the same queries and ChromHMM features. Maximum resident set size was recorded with time -f “%M” parameters for the maximum memory usage from the time of loading files to testing enrichment. To compare whole-genome computing of enrichment statistics, BEDTools intersect (v2.30.0) was used to intersect query and knowledgebase sets using the -sorted option, followed by counting in AWK (v5.1.0). For methylation aggregation over knowledgebase sets, BEDTools intersect and groupby functions were used. Enrichment statistics and methylation aggregation in KYCG were both computed using the yame summary -m function.

CTCF binding sites were identified from ENCODE chromatin immunoprecipitation sequencing data (131) and intersected with the reference genome (GRCh38) CpG coordinates to use as a query for enrichment testing in ChromHMM features. The GRCh38 reference genome CpG space was uniformly downsampled by factors of 2, 2^2, 2^4, 2^6, 2^8, 2^10, 2^12, and 2^14 to create universe subsets for enrichment testing. RRBS data from 17 tissues were downloaded from ENCODE (24). Fifty iterations of downsampling and enrichment testing were performed for each universe size and type. RRBS and array data were not downsampled.

Genomic proximity testing

Proximity testing of hyper- and hypomethylated CpG markers was modeled with a Poisson distribution with a λ parameter representing the number of CpGs occurring in fixed 1500-bp intervals. For a given query set of CpGs, a null distribution was generated by performing 1000 simulations of random samples of equal size to the query and calculating the mean number of events (CpGs co-occurring in a 1500-bp interval) as the λ parameter. This λ was used as the Poisson point estimate to compute the probability for the number of co-occurrences in the query set.
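The procedure can be sketched as follows under several stated assumptions: all positions lie on a single chromosome, a co-occurrence event is counted as a pair of neighboring CpGs (after sorting) no more than 1500 bp apart, and the reported probability is the upper tail P(X ≥ observed). The text does not spell out these details, so this is one plausible reading with hypothetical function names, not the authors' code.

```python
import math
import random


def count_cooccurrences(positions, window=1500):
    """Number of neighboring CpG pairs that fall within `window` bp of each other."""
    ordered = sorted(positions)
    return sum(1 for a, b in zip(ordered, ordered[1:]) if b - a <= window)


def poisson_upper_tail(observed, lam):
    """P(X >= observed) for X ~ Poisson(lam)."""
    if observed <= 0:
        return 1.0
    term = math.exp(-lam)        # P(X = 0)
    cdf = term
    for k in range(1, observed):
        term *= lam / k          # P(X = k) from P(X = k - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def proximity_test(query, universe, n_sim=1000, window=1500, seed=0):
    """Estimate lambda from random equal-size samples, then score the query."""
    rng = random.Random(seed)
    observed = count_cooccurrences(query, window)
    sims = [
        count_cooccurrences(rng.sample(universe, len(query)), window)
        for _ in range(n_sim)
    ]
    lam = sum(sims) / n_sim
    return observed, lam, poisson_upper_tail(observed, lam)


# Toy universe: one CpG every 1 kb; the query is ten consecutive CpGs.
universe = list(range(0, 10_000_000, 1000))
query = list(range(0, 10_000, 1000))
observed, lam, p_value = proximity_test(query, universe)
```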

Benchmarking datasets

Nucleosome occupancy and methylome sequencing (NOMe-seq) data from PGCs were downloaded from a prior study (132). Methylated CpGs (methylation fraction ≥ 0.3, minimum coverage = 1) were used as a query for enrichment testing against full-stack ChromHMM, histone modification, and TFBS knowledgebase sets using all CpGs with non–not available (NA) values for each sample as the universe. Enrichment testing was performed using the YAME summary function.

Single-cell DNA methylome data from Bian et al. (38) and Liu et al. (39) were downloaded and stored using YAME. Fifty pairs of cells were randomly selected, and methylation differences were calculated to define hyper- and hypomethylated sites (methylation differences of 1 or −1). The universe set is defined as sites covered by both cells. Spearman correlation was used to compare the enrichment ordering of the sampled pairs with the most deeply sequenced pairs (i.e., the pair with the greatest number of CpGs covered in both cells). Differential methylation regions were merged from differentially methylated sites within 10-kb windows and used as inputs for HOMER (13) motif analysis via findMotifsGenome.pl. For TFBS analysis, single-cell data were downloaded from Luo et al. (50), and methylation was aggregated over the 1188 TFBS knowledgebase sets using the YAME summary function. Cells were grouped according to the major class label reported by the authors. Wilcoxon rank sum testing was performed between the target cell type and the background groups at each TFBS feature, and each TFBS that discriminated the target cell type with an AUC of 0.8 or higher was selected for further analysis.

Cancer WGBS data were obtained from TCGA. Two cancer types (bladder and breast cancer) were selected. Sites hypermethylated relative to adjacent normal tissues were tested against cell-type–specific histone modification features. For the spatial E11.5 embryo methylation data, pseudobulks were merged for the heart and neural tube regions (3 × 3 pixels), and methylation differences were tested to identify cell-type–specific TFs.

For 5hmC and Oxford Nanopore sequencing analysis, SIMPLE-seq (133) and snhmC-seq (57) datasets were downloaded and processed into YAME-compatible formats. Pseudobulks were merged for each major brain cell type. Pairwise comparisons of the snhmC-seq data among the four major brain cell types were conducted using the YAME pairwise function, focusing on sites with methylation differences of 40% or more. ONT 5mC and 5hmC data (Supplementary Materials) were analyzed against chromatin states across four distinct cell types. ACE-seq (134) data from embryonic stem cells were used as a benchmark to validate ONT 5hmC enrichment.

For 5hmC array-based analyses, bACE-array data were obtained from a previous study (67). One-versus-all comparisons were performed for the displayed tissue type groups using Wilcoxon rank sum testing between the target and the background group at each CpG site. CpG sites with 5% or more methylation differences and an AUC > 0.8 for discriminating the target tissue type were considered for further analysis. Marker CpGs were linked to genes (GENCODE v19) located within ±1500 bp of the CpG site. Linked genes for each tissue type were tested for enrichment against the CellMarker 2024 and Human Gene Atlas gene ontology databases using Enrichr (129, 135).
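
The one-versus-all marker filter can be sketched in Python as follows. The AUC is obtained from the rank-sum statistic (U divided by the product of the group sizes). Treating discrimination as direction-agnostic, using the difference in group means as the effect size, and all function names are assumptions for illustration rather than details stated above.

```python
def auc(target, background):
    """Rank-sum AUC: probability that a target value exceeds a background value (ties count 0.5)."""
    wins = 0.0
    for t in target:
        for b in background:
            if t > b:
                wins += 1.0
            elif t == b:
                wins += 0.5
    return wins / (len(target) * len(background))


def mean(values):
    return sum(values) / len(values)


def select_markers(betas, labels, target_label, min_delta=0.05, min_auc=0.8):
    """betas: dict CpG id -> list of methylation levels, one per sample.
    labels: list of tissue labels in the same sample order."""
    markers = []
    for cpg, values in betas.items():
        target = [v for v, lab in zip(values, labels) if lab == target_label]
        background = [v for v, lab in zip(values, labels) if lab != target_label]
        a = auc(target, background)
        discrimination = max(a, 1.0 - a)  # assumption: either direction qualifies
        delta = abs(mean(target) - mean(background))
        if delta >= min_delta and discrimination > min_auc:
            markers.append(cpg)
    return markers
```

The pairwise loop is quadratic in sample count; it is kept here because it makes the definition of the AUC explicit.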

For RNA expression comparisons, cell-type–specific RNA sequencing count data were downloaded from the Human Protein Atlas “RNA single-cell type data” database (136), and expression levels of NKX2-1 were log transformed and plotted across 79 cell types. To evaluate KYCG’s capacity for screening array probe artifacts, we used methylation titrations from prior studies (34, 67, 137). β values from 10 mouse DNA samples with varying methylation titration levels (0, 5, 10, 25, 50, 75, and 100%) generated by EpigenDx (137) were used to test the correlation between β values and methylation titrations. For each CpG on the MM285 array, Pearson’s correlation was computed between the methylation reading and the expected methylation level of the titration. CpG probes with a correlation coefficient < 0.9 were used as a query to test enrichment in MM285 technical database sets.
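
A minimal Python sketch of the titration screen is shown below. The handling of missing readings (dropped pairwise) and of probes with zero variance (flagged, since the correlation is undefined) are assumptions for illustration, as are the function names.

```python
import math


def pearson(x, y):
    """Pearson correlation over pairs where both values are present; NaN if undefined."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < 2:
        return float("nan")
    mx = sum(a for a, _ in pairs) / len(pairs)
    my = sum(b for _, b in pairs) / len(pairs)
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def flag_probes(betas_by_probe, expected_levels, min_r=0.9):
    """Return probes whose readings do not track the expected titration level."""
    flagged = []
    for probe, betas in betas_by_probe.items():
        r = pearson(betas, expected_levels)
        if not (r >= min_r):  # also flags NaN (assumption)
            flagged.append(probe)
    return flagged


# Toy example with hypothetical probe names; expected levels are in percent.
expected = [0, 5, 10, 25, 50, 75, 100]
readings = {
    "cg_good": [0.02, 0.06, 0.12, 0.27, 0.50, 0.74, 0.97],
    "cg_bad": [0.50, 0.48, 0.52, 0.50, 0.49, 0.51, 0.50],
    "cg_flat": [0.90] * 7,
}
flagged = flag_probes(readings, expected)
```

Because Pearson correlation is invariant to scale, the expected levels can be given in percent while the readings are fractions.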

EWAS and predictive model feature interpretation

EWAS trait associations were downloaded from databases (123, 124), and associated CpGs were converted into knowledgebases by intersecting each trait CpG set with array manifests. Each one-versus-rest cell-type–specific methylation knowledgebase was tested for enrichment against the HM450 EWAS trait knowledgebase using the testEnrichment() function, and the top six most significantly enriched traits were plotted for each cell type. Epigenetic clock CpG query sets were downloaded and tested against EWAS trait, gene, cell-type–specific methylation signature, ChromHMM, histone modification, and PMD knowledgebases under each clock’s respective assay platform. Gestational aging methylation data were downloaded from Koeck et al. (86). The Pearson correlation coefficient was computed between the methylation of the clock CpGs in the HOXB3 gene on the EPIC array and the corresponding sample’s gestational age.
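
The knowledgebase construction and ranking steps can be illustrated with a generic over-representation test in Python. The one-sided hypergeometric test below is an assumption chosen for illustration; this section does not state which statistic testEnrichment() computes, and the function names here are hypothetical and are not part of the knowYourCG API.

```python
from math import comb


def build_knowledgebase(trait_to_cpgs, manifest):
    """Keep, for each trait, only the CpGs present on the array manifest."""
    on_array = set(manifest)
    return {trait: set(cpgs) & on_array for trait, cpgs in trait_to_cpgs.items()}


def enrichment_pvalue(query, feature, universe):
    """One-sided hypergeometric P(overlap >= observed) within the universe."""
    universe = set(universe)
    q = set(query) & universe
    f = set(feature) & universe
    n_universe, n_feature, n_query = len(universe), len(f), len(q)
    overlap = len(q & f)
    tail = sum(
        comb(n_feature, i) * comb(n_universe - n_feature, n_query - i)
        for i in range(overlap, min(n_feature, n_query) + 1)
    )
    return tail / comb(n_universe, n_query)


def top_traits(query, knowledgebase, universe, n_top=6):
    """Rank traits by ascending P value and return the n_top most significant."""
    scored = [
        (enrichment_pvalue(query, cpgs, universe), trait)
        for trait, cpgs in knowledgebase.items()
    ]
    return [trait for _, trait in sorted(scored)[:n_top]]
```

Exact binomial coefficients are convenient for small examples; for array-scale universes, a statistics library with log-space hypergeometric tails would be the more practical choice.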

To analyze the central nervous system tumor classifier features, data from Capper et al. (96) were downloaded and preprocessed using the SeSAMe package (98). The 32,000 most variable CpGs were used as features to train a random forest classifier using the randomForest package in R with default parameters. The reference cohort of 2801 samples was used for training, and testing was performed on 1100 samples from the prospective cohort. Importance scores for classifier features were ranked according to the decrease in the Gini index for each CpG. The top and bottom 16,000 CpGs based on the Gini index were considered high- and low-importance features, respectively. Differential methylation analysis between correctly and incorrectly classified meningioma samples was performed using the SeSAMe DML() function, and differentially methylated CpGs were tested for enrichment against all knowledgebases. Visualization of the TNXB gene was performed using the SeSAMe visualizeGene function.

Acknowledgments

We thank L. D. Cowen for assistance with the KYCG web application user interface design, H. Zhu for help with Flongle data production, and T. Triche Jr. and W. Ding for discussion. The TCGA data presented in this study are based upon data generated by The Cancer Genome Atlas Research Network: www.cancer.gov/tcga. We acknowledge the efforts of the TCGA research teams and consortium for providing access to comprehensive datasets.

Funding: The work is supported by the National Institutes of Health under award numbers R35-GM146978 (to W.Z.) and DP2AI177913 (to Y.D.) and the Packard Fellowship for Science and Engineering (to Y.D.). C.N.L. was supported, in part, by the Institute for the RNA Innovation of the Perelman School of Medicine at the University of Pennsylvania.

Author contributions: Conceptualization: H.F., E.M., and W.Z. Methodology: D.C.G., H.F., E.M., and W.Z. Investigation: Y.D., H.F., D.C.G., C.N.L., and W.Z. Data curation: D.C.G. and W.Z. Validation: D.C.G., H.F., and W.Z. Formal analysis: D.C.G., H.F., and W.Z. Software: D.A., H.F., D.C.G., E.M., and W.Z. Resources: Y.D., C.N.L., and W.Z. Visualization: D.C.G., H.F., and W.Z. Supervision: Y.D. and W.Z. Funding acquisition: W.Z. Project administration: W.Z. Writing—original draft: H.F., D.C.G., and W.Z. Writing—review and editing: Y.D., H.F., D.C.G., C.N.L., and W.Z.

Competing interests: The authors declare that they have no competing interests.

Data and materials availability: KYCG and user documentation are available as an R/Bioconductor package at www.bioconductor.org/packages/release/bioc/html/knowYourCG.html and the developmental version at www.bioconductor.org/packages/devel/bioc/html/knowYourCG.html. The interactive web application for online queries is hosted at https://zhouserver.research.chop.edu/knowyourcg/. In addition, the source code is available on Zenodo at https://zenodo.org/records/17373673 and on GitHub at https://github.com/zhou-lab/knowYourCG. The sequence-level enrichment analysis is provided as a command-line C program, available on Zenodo at https://zenodo.org/records/17373677 and on GitHub at https://github.com/zhou-lab/YAME. The YAME documentation is available at https://zhou-lab.github.io/YAME/. All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials. ONT DNA 5mC and 5hmC profiles of the four mouse tissues are available at https://doi.org/10.5061/dryad.zgmsbccq9.

Supplementary Materials

The PDF file includes:

Figs. S1 to S6

Legends for tables S1 and S2

Other Supplementary Material for this manuscript includes the following:

Tables S1 and S2

REFERENCES AND NOTES

  • 1. Greenberg M. V. C., Bourc’his D., The diverse roles of DNA methylation in mammalian development and disease. Nat. Rev. Mol. Cell Biol. 20, 590–607 (2019).
  • 2. Reizel Y., Sabag O., Skversky Y., Spiro A., Steinberg B., Bernstein D., Wang A., Kieckhaefer J., Li C., Pikarsky E., Levin-Klein R., Goren A., Rajewsky K., Kaestner K. H., Cedar H., Postnatal DNA demethylation and its role in tissue maturation. Nat. Commun. 9, 2040 (2018).
  • 3. Villicaña S., Bell J. T., Genetic impacts on DNA methylation: Research findings and future perspectives. Genome Biol. 22, 127 (2021).
  • 4. Robertson K. D., DNA methylation and human disease. Nat. Rev. Genet. 6, 597–610 (2005).
  • 5. Xia Y., Dai R., Wang K., Jiao C., Zhang C., Xu Y., Li H., Jing X., Chen Y., Jiang Y., Kopp R. F., Giase G., Chen C., Liu C., Sex-differential DNA methylation and associated regulation networks in human brain implicated in the sex-biased risks of psychiatric disorders. Mol. Psychiatry 26, 835–848 (2021).
  • 6. Horvath S., Raj K., DNA methylation-based biomarkers and the epigenetic clock theory of ageing. Nat. Rev. Genet. 19, 371–384 (2018).
  • 7. Zipple M. N., Zhao I., Kuo D. C., Lee S. M., Sheehan M. J., Zhou W., Ecological realism accelerates epigenetic aging in mice. Aging Cell 24, e70098 (2025).
  • 8. Teschendorff A. E., Relton C. L., Statistical and integrative system-level analysis of DNA methylation data. Nat. Rev. Genet. 19, 129–147 (2018).
  • 9. Zhou W., Laird P. W., Shen H., Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res. 45, e22 (2017).
  • 10. Reimand J., Arak T., Adler P., Kolberg L., Reisberg S., Peterson H., Vilo J., g:Profiler—A web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res. 44, W83–W89 (2016).
  • 11. Yu G., Wang L.-G., Han Y., He Q.-Y., clusterProfiler: An R package for comparing biological themes among gene clusters. OMICS 16, 284–287 (2012).
  • 12. Huang D. W., Sherman B. T., Lempicki R. A., Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57 (2009).
  • 13. Heinz S., Benner C., Spann N., Bertolino E., Lin Y. C., Laslo P., Cheng J. X., Murre C., Singh H., Glass C. K., Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
  • 14. McLean C. Y., Bristor D., Hiller M., Clarke S. L., Schaar B. T., Lowe C. B., Wenger A. M., Bejerano G., GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
  • 15. Wang Y., Franks J. M., Whitfield M. L., Cheng C., BioMethyl: An R package for biological interpretation of DNA methylation data. Bioinformatics 35, 3635–3641 (2019).
  • 16. Ren X., Kuan P. F., methylGSA: A Bioconductor package and Shiny app for DNA methylation data length bias adjustment in gene set testing. Bioinformatics 35, 1958–1959 (2019).
  • 17. Sheffield N. C., Bock C., LOLA: Enrichment analysis for genomic region sets and regulatory elements in R and Bioconductor. Bioinformatics 32, 587–589 (2016).
  • 18. Halachev K., Bast H., Albrecht F., Lengauer T., Bock C., EpiExplorer: Live exploration and global analysis of large epigenomic datasets. Genome Biol. 13, R96 (2012).
  • 19. Silva T. C., Coetzee S. G., Gull N., Yao L., Hazelett D. J., Noushmehr H., Lin D.-C., Berman B. P., ELMER v.2: An R/Bioconductor package to reconstruct gene regulatory networks from DNA methylation and transcriptome profiles. Bioinformatics 35, 1974–1977 (2019).
  • 20. Yao L., Shen H., Laird P. W., Farnham P. J., Berman B. P., Inferring regulatory element landscapes and transcription factor networks from cancer methylomes. Genome Biol. 16, 105 (2015).
  • 21. Iqbal W., Zhou W., Computational methods for single-cell DNA methylome analysis. Genomics Proteomics Bioinformatics 21, 48–66 (2023).
  • 22. Robertson A. G., Yau C., Carrot-Zhang J., Damrauer J. S., Knijnenburg T. A., Chambwe N., Hoadley K. A., Kemal A., Zenklusen J. C., Cherniack A. D., Beroukhim R., Zhou W., Integrative modeling identifies genetic ancestry-associated molecular correlates in human cancer. STAR Protocols 2, 100483 (2021).
  • 23. Saghafinia S., Mina M., Riggi N., Hanahan D., Ciriello G., Pan-cancer landscape of aberrant DNA methylation across human tumors. Cell Rep. 25, 1066–1080.e8 (2018).
  • 24. ENCODE Project Consortium, Moore J. E., Purcaro M. J., Pratt H. E., Epstein C. B., Shoresh N., Adrian J., Kawli T., Davis C. A., Dobin A., Kaul R., Halow J., Van Nostrand E. L., Freese P., Gorkin D. U., Shen Y., He Y., Mackiewicz M., Pauli-Behn F., Williams B. A., Mortazavi A., Keller C. A., Zhang X.-O., Elhajjajy S. I., Huey J., Dickel D. E., Snetkova V., Wei X., Wang X., Rivera-Mulia J. C., Rozowsky J., Zhang J., Chhetri S. B., Zhang J., Victorsen A., White K. P., Visel A., Yeo G. W., Burge C. B., Lécuyer E., Gilbert D. M., Dekker J., Rinn J., Mendenhall E. M., Ecker J. R., Kellis M., Klein R. J., Noble W. S., Kundaje A., Guigó R., Farnham P. J., Cherry J. M., Myers R. M., Ren B., Graveley B. R., Gerstein M. B., Pennacchio L. A., Snyder M. P., Bernstein B. E., Wold B., Hardison R. C., Gingeras T. R., Stamatoyannopoulos J. A., Weng Z., Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
  • 25. Zhou W., Dinh H. Q., Ramjan Z., Weisenberger D. J., Nicolet C. M., Shen H., Laird P. W., Berman B. P., DNA methylation loss in late-replicating domains is linked to mitotic cell division. Nat. Genet. 50, 591–602 (2018).
  • 26. Zhou W., Reizel Y., On correlative and causal links of replicative epimutations. Trends Genet. 41, 60–75 (2025).
  • 27. Howard G., Eiges R., Gaudet F., Jaenisch R., Eden A., Activation and transposition of endogenous retroviral elements in hypomethylation induced tumors in mice. Oncogene 27, 404–408 (2008).
  • 28. Eden A., Gaudet F., Waghmare A., Jaenisch R., Chromosomal instability and tumors promoted by DNA hypomethylation. Science 300, 455 (2003).
  • 29. Yin Y., Morgunova E., Jolma A., Kaasinen E., Sahu B., Khund-Sayeed S., Das P. K., Kivioja T., Dave K., Zhong F., Nitta K. R., Taipale M., Popov A., Ginno P. A., Domcke S., Yan J., Schübeler D., Vinson C., Taipale J., Impact of cytosine methylation on DNA binding specificities of human transcription factors. Science 356, eaaj2239 (2017).
  • 30. Liu M., Ohtani H., Zhou W., Ørskov A. D., Charlet J., Zhang Y. W., Shen H., Baylin S. B., Liang G., Grønbæk K., Jones P. A., Vitamin C increases viral mimicry induced by 5-aza-2′-deoxycytidine. Proc. Natl. Acad. Sci. U.S.A. 113, 10238–10244 (2016).
  • 31. Roulois D., Loo Yau H., Singhania R., Wang Y., Danesh A., Shen S. Y., Han H., Liang G., Jones P. A., Pugh T. J., O’Brien C., De Carvalho D. D., DNA-demethylating agents target colorectal cancer cells by inducing viral mimicry by endogenous transcripts. Cell 162, 961–973 (2015).
  • 32. Breeze C. E., Reynolds A. P., van Dongen J., Dunham I., Lazar J., Neph S., Vierstra J., Bourque G., Teschendorff A. E., Stamatoyannopoulos J. A., Beck S., eFORGE v2.0: Updated analysis of cell type-specific signal in epigenomic data. Bioinformatics 35, 4767–4769 (2019).
  • 33. Breeze C. E., Paul D. S., van Dongen J., Butcher L. M., Ambrose J. C., Barrett J. E., Lowe R., Rakyan V. K., Iotchkova V., Frontini M., Downes K., Ouwehand W. H., Laperle J., Jacques P.-É., Bourque G., Bergmann A. K., Siebert R., Vellenga E., Saeed S., Matarese F., Martens J. H. A., Stunnenberg H. G., Teschendorff A. E., Herrero J., Birney E., Dunham I., Beck S., eFORGE: A tool for identifying cell type-specific signal in epigenomic data. Cell Rep. 17, 2137–2150 (2016).
  • 34. Kaur D., Lee S. M., Goldberg D., Spix N. J., Hinoue T., Li H.-T., Dwaraka V. B., Smith R., Shen H., Liang G., Renke N., Laird P. W., Zhou W., Comprehensive evaluation of the infinium human MethylationEPIC v2 BeadChip. Epigenetics Commun. 3, 6 (2023).
  • 35. Quinlan A. R., Hall I. M., BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
  • 36. Lee S. M., Loo C. E., Prasasya R. D., Bartolomei M. S., Kohli R. M., Zhou W., Low-input and single-cell methods for Infinium DNA methylation BeadChips. Nucleic Acids Res. 52, e38 (2024).
  • 37. Seisenberger S., Andrews S., Krueger F., Arand J., Walter J., Santos F., Popp C., Thienpont B., Dean W., Reik W., The dynamics of genome-wide DNA methylation reprogramming in mouse primordial germ cells. Mol. Cell 48, 849–862 (2012).
  • 38. Bian S., Hou Y., Zhou X., Li X., Yong J., Wang Y., Wang W., Yan J., Hu B., Guo H., Wang J., Gao S., Mao Y., Dong J., Zhu P., Xiu D., Yan L., Wen L., Qiao J., Tang F., Fu W., Single-cell multiomics sequencing and analyses of human colorectal cancer. Science 362, 1060–1063 (2018).
  • 39. Liu H., Zhou J., Tian W., Luo C., Bartlett A., Aldridge A., Lucero J., Osteen J. K., Nery J. R., Chen H., Rivkin A., Castanon R. G., Clock B., Li Y. E., Hou X., Poirion O. B., Preissl S., Pinto-Duarte A., O’Connor C., Boggeman L., Fitzpatrick C., Nunn M., Mukamel E. A., Zhang Z., Callaway E. M., Ren B., Dixon J. R., Behrens M. M., Ecker J. R., DNA methylation atlas of the mouse brain at single-cell resolution. Nature 598, 120–128 (2021).
  • 40. Dalerba P., Sahoo D., Paik S., Guo X., Yothers G., Song N., Wilcox-Fogel N., Forgó E., Rajendran P. S., Miranda S. P., Hisamori S., Hutchison J., Kalisky T., Qian D., Wolmark N., Fisher G. A., van de Rijn M., Clarke M. F., CDX2 as a prognostic biomarker in stage II and stage III colon cancer. N. Engl. J. Med. 374, 211–222 (2016).
  • 41. Graule J., Uth K., Fischer E., Centeno I., Galván J. A., Eichmann M., Rau T. T., Langer R., Dawson H., Nitsche U., Traeger P., Berger M. D., Schnüriger B., Hädrich M., Studer P., Inderbitzin D., Lugli A., Tschan M. P., Zlobec I., CDX2 in colorectal cancer is an independent prognostic factor and regulated by promoter methylation and histone deacetylation in tumors of the serrated pathway. Clin. Epigenetics 10, 120 (2018).
  • 42. Albasri A. M., Elkablawy M. A., Clinicopathological and prognostic significance of androgen receptor overexpression in colorectal cancer. Experience from Al-Madinah Al-Munawarah, Saudi Arabia. Saudi Med. J. 40, 893–900 (2019).
  • 43. Laissue P., The forkhead-box family of transcription factors: Key molecular players in colorectal cancer pathogenesis. Mol. Cancer 18, 5 (2019).
  • 44. Shoichet S. A., Hoffmann K., Menzel C., Trautmann U., Moser B., Hoeltzenbein M., Echenne B., Partington M., Van Bokhoven H., Moraine C., Fryns J.-P., Chelly J., Rott H.-D., Ropers H.-H., Kalscheuer V. M., Mutations in the ZNF41 gene are associated with cognitive deficits: Identification of a new candidate for X-linked mental retardation. Am. J. Hum. Genet. 73, 1341–1354 (2003).
  • 45. Ozaki H., Watanabe Y., Takahashi K., Kitamura K., Tanaka A., Urase K., Momoi T., Sudo K., Sakagami J., Asano M., Iwakura Y., Kawakami K., Six4, a putative myogenin gene regulator, is not essential for mouse embryonal development. Mol. Cell. Biol. 21, 3343–3350 (2001).
  • 46. Lee C. N., Fu H., Cardilla A., Zhou W., Deng Y., Spatial joint profiling of DNA methylome and transcriptome in mammalian tissues. Nature (2025). 10.1038/s41586-025-09478-x.
  • 47. George R. M., Firulli A. B., Hand factors in cardiac development. Anat Rec (Hoboken) 302, 101–107 (2019).
  • 48. Schumacher J. A., Bloomekatz J., Garavito-Aguilar Z. V., Yelon D., tal1 Regulates the formation of intercellular junctions and the maintenance of identity in the endocardium. Dev. Biol. 383, 214–226 (2013).
  • 49. Greulich F., Rudat C., Kispert A., Mechanisms of T-box gene function in the developing heart. Cardiovasc. Res. 91, 212–222 (2011).
  • 50. Luo C., Liu H., Xie F., Armand E. J., Siletti K., Bakken T. E., Fang R., Doyle W. I., Stuart T., Hodge R. D., Hu L., Wang B.-A., Zhang Z., Preissl S., Lee D.-S., Zhou J., Niu S.-Y., Castanon R., Bartlett A., Rivkin A., Wang X., Lucero J., Nery J. R., Davis D. A., Mash D. C., Satija R., Dixon J. R., Linnarsson S., Lein E., Behrens M. M., Ren B., Mukamel E. A., Ecker J. R., Single nucleus multi-omics identifies human cortical cell regulatory genome diversity. Cell Genomics 2, 100107 (2022).
  • 51. Freudenstein D., Lippert M., Popp J. S., Aprato J., Wegner M., Sock E., Haase S., Linker R. A., González Alvarado M. N., Endogenous Sox8 is a critical factor for timely remyelination and oligodendroglial cell repletion in the cuprizone model. Sci. Rep. 13, 22272 (2023).
  • 52. Zhang S., Zhu X., Gui X., Croteau C., Song L., Xu J., Wang A., Bannerman P., Guo F., Sox2 is essential for oligodendroglial proliferation and differentiation during postnatal brain myelination and CNS remyelination. J. Neurosci. 38, 1802–1820 (2018).
  • 53. Kim H. S., Sohn H., Jang S. W., Lee G. R., The transcription factor NFIL3 controls regulatory T-cell function and stability. Exp. Mol. Med. 51, 1–15 (2019).
  • 54. Zohren F., Souroullas G. P., Luo M., Gerdemann U., Imperato M. R., Wilson N. K., Göttgens B., Lukov G. L., Goodell M. A., The transcription factor Lyl-1 regulates lymphoid specification and the maintenance of early T lineage progenitors. Nat. Immunol. 13, 761–769 (2012).
  • 55. Bai D., Zhu C., SIMPLE-seq to decode DNA methylation dynamics in single cells. Nat. Rev. Genet. 25, 377 (2024).
  • 56. Cao Y., Bai Y., Yuan T., Song L., Fan Y., Ren L., Song W., Peng J., An R., Gu Q., Zheng Y., Xie X. S., Single-cell bisulfite-free 5mC and 5hmC sequencing with high sensitivity and scalability. Proc. Natl. Acad. Sci. U.S.A. 120, e2310367120 (2023).
  • 57. Fabyanic E. B., Hu P., Qiu Q., Berríos K. N., Connolly D. R., Wang T., Flournoy J., Zhou Z., Kohli R. M., Wu H., Joint single-cell profiling resolves 5mC and 5hmC and reveals their distinct gene regulatory effects. Nat. Biotechnol. 42, 960–974 (2024).
  • 58. Parry A., Rulands S., Reik W., Active turnover of DNA methylation during cell fate decisions. Nat. Rev. Genet. 22, 59–66 (2021).
  • 59. Shi D.-Q., Ali I., Tang J., Yang W.-C., New insights into 5hmC DNA modification: Generation, distribution and function. Front. Genet. 8, 100 (2017).
  • 60. Wen L., Tang F., Genomic distribution and possible functions of DNA hydroxymethylation in the brain. Genomics 104, 341–346 (2014).
  • 61. He B., Yao H., Yi C., Advances in the joint profiling technologies of 5mC and 5hmC. RSC Chem. Biol. 5, 500–507 (2024).
  • 62. Fazel Darbandi S., Robinson Schwartz S. E., Pai E. L.-L., Everitt A., Turner M. L., Cheyette B. N. R., Willsey A. J., State M. W., Sohal V. S., Rubenstein J. L. R., Enhancing WNT signaling restores cortical neuronal spine maturation and synaptogenesis in tbr1 mutants. Cell Rep. 31, 107495 (2020).
  • 63. Lv X., Ren S.-Q., Zhang X.-J., Shen Z., Ghosh T., Xianyu A., Gao P., Li Z., Lin S., Yu Y., Zhang Q., Groszer M., Shi S.-H., TBR2 coordinates neurogenesis expansion and precise microcircuit organization via Protocadherin 19 in the mammalian cortex. Nat. Commun. 10, 3946 (2019).
  • 64. Flavell S. W., Cowan C. W., Kim T.-K., Greer P. L., Lin Y., Paradis S., Griffith E. C., Hu L. S., Chen C., Greenberg M. E., Activity-dependent regulation of MEF2 transcription factors suppresses excitatory synapse number. Science 311, 1008–1012 (2006).
  • 65. Hashimoto H., Liu Y., Upadhyay A. K., Chang Y., Howerton S. B., Vertino P. M., Zhang X., Cheng X., Recognition and potential mechanisms for replication and erasure of cytosine hydroxymethylation. Nucleic Acids Res. 40, 4841–4849 (2012).
  • 66. Valinluck V., Sowers L. C., Endogenous cytosine damage products alter the site selectivity of human DNA maintenance methyltransferase DNMT1. Cancer Res. 67, 946–950 (2007).
  • 67. Goldberg D. C., Cloud C., Lee S. M., Barnes B., Gruber S., Kim E., Pottekat A., Westphal M. S., McAuliffe L., Majounie E., KalayilManian M., Zhu Q., Tran C., Hansen M., Stojakovic J., Parker J. B., Kohli R. M., Porecha R., Renke N., Zhou W., Scalable screening of ternary-code DNA methylation dynamics associated with human traits. Cell Genomics 5, 100929 (2025).
  • 68. Varberg K. M., Dominguez E. M., Koseva B., Varberg J. M., McNally R. P., Moreno-Irusta A., Wesley E. R., Iqbal K., Cheung W. A., Schwendinger-Schreck C., Smail C., Okae H., Arima T., Lydic M., Holoch K., Marsh C., Soares M. J., Grundberg E., Extravillous trophoblast cell lineage development is associated with active remodeling of the chromatin landscape. Nat. Commun. 14, 4826 (2023).
  • 69. Sammar M., Drobnjak T., Mandala M., Gizurarson S., Huppertz B., Meiri H., Galectin 13 (PP13) facilitates remodeling and structural stabilization of maternal vessels during pregnancy. Int. J. Mol. Sci. 20, 3192 (2019).
  • 70. Aghababaei M., Perdu S., Irvine K., Beristain A. G., A disintegrin and metalloproteinase 12 (ADAM12) localizes to invasive trophoblast, promotes cell invasion and directs column outgrowth in early placental development. Mol. Hum. Reprod. 20, 235–249 (2014).
  • 71. Hikida M., Casola S., Takahashi N., Kaji T., Takemori T., Rajewsky K., Kurosaki T., PLC-γ2 is essential for formation and maintenance of memory B cells. J. Exp. Med. 206, 681–689 (2009).
  • 72. Fu C., Turck C. W., Kurosaki T., Chan A. C., BLNK: A central linker protein in B cell activation. Immunity 9, 93–103 (1998).
  • 73. Ahsan M. U., Gouru A., Chan J., Zhou W., Wang K., A signal processing and deep learning framework for methylation detection using Oxford Nanopore sequencing. Nat. Commun. 15, 1448 (2024).
  • 74. Liu Y., Rosikiewicz W., Pan Z., Jillette N., Wang P., Taghbalout A., Foox J., Mason C., Carroll M., Cheng A., Li S., DNA methylation-calling tools for Oxford Nanopore sequencing: A survey and human epigenome-wide evaluation. Genome Biol. 22, 295 (2021).
  • 75. Huang Y., Pastor W. A., Shen Y., Tahiliani M., Liu D. R., Rao A., The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLOS ONE 5, e8888 (2010).
  • 76. Kong Y., Zhang Y., Mead E. A., Chen H., Loo C. E., Fan Y., Ni M., Zhang X.-S., Kohli R. M., Fang G., Critical assessment of nanopore sequencing for the detection of multiple forms of DNA modifications. bioRxiv 2024.11.19.624260 [Preprint] (2024). 10.1101/2024.11.19.624260.
  • 77. Vu H., Ernst J., Universal chromatin state annotation of the mouse genome. Genome Biol. 24, 153 (2023).
  • 78. Luo C., Keown C. L., Kurihara L., Zhou J., He Y., Li J., Castanon R., Lucero J., Nery J. R., Sandoval J. P., Bui B., Sejnowski T. J., Harkins T. T., Mukamel E. A., Behrens M. M., Ecker J. R., Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
  • 79. Loyfer N., Magenheim J., Peretz A., Cann G., Bredno J., Klochendler A., Fox-Fisher I., Shabi-Porat S., Hecht M., Pelet T., Moss J., Drawshy Z., Amini H., Moradi P., Nagaraju S., Bauman D., Shveiky D., Porat S., Dior U., Rivkin G., Or O., Hirshoren N., Carmon E., Pikarsky A., Khalaileh A., Zamir G., Grinbaum R., Abu Gazala M., Mizrahi I., Shussman N., Korach A., Wald O., Izhar U., Erez E., Yutkin V., Samet Y., Rotnemer Golinkin D., Spalding K. L., Druid H., Arner P., Shapiro A. M. J., Grompe M., Aravanis A., Venn O., Jamshidi A., Shemer R., Dor Y., Glaser B., Kaplan T., A DNA methylation atlas of normal human cell types. Nature 613, 355–364 (2023).
  • 80. Moss J., Magenheim J., Neiman D., Zemmour H., Loyfer N., Korach A., Samet Y., Maoz M., Druid H., Arner P., Fu K.-Y., Kiss E., Spalding K. L., Landesberg G., Zick A., Grinshpun A., Shapiro A. M. J., Grompe M., Wittenberg A. D., Glaser B., Shemer R., Kaplan T., Dor Y., Comprehensive human cell-type methylation atlas reveals origins of circulating cell-free DNA in health and disease. Nat. Commun. 9, 5068 (2018).
  • 81. Vu H., Ernst J., Universal annotation of the human genome through integration of over a thousand epigenomic datasets. Genome Biol. 23, 9 (2022).
  • 82. Belsky D. W., Caspi A., Corcoran D. L., Sugden K., Poulton R., Arseneault L., Baccarelli A., Chamarti K., Gao X., Hannon E., Harrington H. L., Houts R., Kothari M., Kwon D., Mill J., Schwartz J., Vokonas P., Wang C., Williams B. S., Moffitt T. E., DunedinPACE, a DNA methylation biomarker of the pace of aging. eLife 11, e73420 (2022).
  • 83. Yang Z., Wong A., Kuh D., Paul D. S., Rakyan V. K., Leslie R. D., Zheng S. C., Widschwendter M., Beck S., Teschendorff A. E., Correlation of an epigenetic mitotic clock with cancer risk. Genome Biol. 17, 205 (2016).
  • 84. Knight A. K., Craig J. M., Theda C., Bækvad-Hansen M., Bybjerg-Grauholm J., Hansen C. S., Hollegaard M. V., Hougaard D. M., Mortensen P. B., Weinsheimer S. M., Werge T. M., Brennan P. A., Cubells J. F., Newport D. J., Stowe Z. N., Cheong J. L. Y., Dalach P., Doyle L. W., Loke Y. J., Baccarelli A. A., Just A. C., Wright R. O., Téllez-Rojo M. M., Svensson K., Trevisi L., Kennedy E. M., Binder E. B., Iurato S., Czamara D., Räikkönen K., Lahti J. M. T., Pesonen A.-K., Kajantie E., Villa P. M., Laivuori H., Hämäläinen E., Park H. J., Bailey L. B., Parets S. E., Kilaru V., Menon R., Horvath S., Bush N. R., LeWinn K. Z., Tylavsky F. A., Conneely K. N., Smith A. K., An epigenetic clock for gestational age at birth based on blood methylation data. Genome Biol. 17, 206 (2016).
  • 85. Bohlin J., Håberg S. E., Magnus P., Reese S. E., Gjessing H. K., Magnus M. C., Parr C. L., Page C. M., London S. J., Nystad W., Prediction of gestational age based on genome-wide differentially methylated regions. Genome Biol. 17, 207 (2016).
  • 86. Koeck R. M., Busato F., Tost J., Consten D., van Echten-Arends J., Mastenbroek S., Wurth Y., Remy S., Langie S., Nawrot T. S., Plusquin M., Alfano R., Bijnens E. M., Gielen M., van Golde R., Dumoulin J. C. M., Brunner H., van Montfoort A. P. A., Zamani Esteki M., Methylome-wide analysis of IVF neonates that underwent embryo culture in different media revealed no significant differences. NPJ Genom. Med. 7, 39 (2022).
  • 87. Lee Y., Choufani S., Weksberg R., Wilson S. L., Yuan V., Burt A., Marsit C., Lu A. T., Ritz B., Bohlin J., Gjessing H. K., Harris J. R., Magnus P., Binder A. M., Robinson W. P., Jugessur A., Horvath S., Placental epigenetic clocks: Estimating gestational age using placental DNA methylation levels. Aging (Albany NY) 11, 4238–4253 (2019).
  • 88. Rossi D. J., Bryder D., Zahn J. M., Ahlenius H., Sonu R., Wagers A. J., Weissman I. L., Cell intrinsic alterations underlie hematopoietic stem cell aging. Proc. Natl. Acad. Sci. U.S.A. 102, 9194–9199 (2005).
  • 89. Le Garff-Tavernier M., Béziat V., Decocq J., Siguret V., Gandjbakhch F., Pautas E., Debré P., Merle-Beral H., Vieillard V., Human NK cells display major phenotypic and functional changes over the life span. Aging Cell 9, 527–535 (2010).
  • 90. Slieker R. C., Roost M. S., van Iperen L., Suchiman H. E. D., Tobi E. W., Carlotti F., de Koning E. J. P., Slagboom P. E., Heijmans B. T., Chuva de Sousa Lopes S. M., DNA methylation landscapes of human fetal development. PLOS Genet. 11, e1005583 (2015).
  • 91.Weinberg D. N., Papillon-Cavanagh S., Chen H., Yue Y., Chen X., Rajagopalan K. N., Horth C., McGuire J. T., Xu X., Nikbakht H., Lemiesz A. E., Marchione D. M., Marunde M. R., Meiners M. J., Cheek M. A., Keogh M.-C., Bareke E., Djedid A., Harutyunyan A. S., Jabado N., Garcia B. A., Li H., Allis C. D., Majewski J., Lu C., The histone mark H3K36me2 recruits DNMT3A and shapes the intergenic DNA methylation landscape. Nature 573, 281–286 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Cho S.-H., Shim H.-J., Park M.-R., Choi J.-N., Akanda M. R., Hwang J.-E., Bae W.-K., Lee K.-H., Sun E.-G., Chung I.-J., Lgals3bp suppresses colon inflammation and tumorigenesis through the downregulation of TAK1-NF-κB signaling. Cell Death Discov. 7, 65 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Capone E., Iacobelli S., Sala G., Role of galectin 3 binding protein in cancer progression: A potential novel therapeutic target. J. Transl. Med. 19, 405 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Lodermeyer V., Ssebyatika G., Passos V., Ponnurangam A., Malassa A., Ewald E., Stürzel C. M., Kirchhoff F., Rotger M., Falk C. S., Telenti A., Krey T., Goffinet C., The antiviral activity of the cellular glycoprotein LGALS3BP/90K is species specific. J. Virol. 92, e00226-18 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Costa J., Pronto-Laborinho A., Pinto S., Gromicho M., Bonucci S., Tranfield E., Correia C., Alexandre B. M., de Carvalho M., Investigating LGALS3BP/90 K glycoprotein in the cerebrospinal fluid of patients with neurological diseases. Sci. Rep. 10, 5649 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Capper D., Jones D. T. W., Sill M., Hovestadt V., Schrimpf D., Sturm D., Koelsche C., Sahm F., Chavez L., Reuss D. E., Kratz A., Wefers A. K., Huang K., Pajtler K. W., Schweizer L., Stichel D., Olar A., Engel N. W., Lindenberg K., Harter P. N., Braczynski A. K., Plate K. H., Dohmen H., Garvalov B. K., Coras R., Hölsken A., Hewer E., Bewerunge-Hudler M., Schick M., Fischer R., Beschorner R., Schittenhelm J., Staszewski O., Wani K., Varlet P., Pages M., Temming P., Lohmann D., Selt F., Witt H., Milde T., Witt O., Aronica E., Giangaspero F., Rushing E., Scheurlen W., Geisenberger C., Rodriguez F. J., Becker A., Preusser M., Haberler C., Bjerkvig R., Cryan J., Farrell M., Deckert M., Hench J., Frank S., Serrano J., Kannan K., Tsirigos A., Brück W., Hofer S., Brehmer S., Seiz-Rosenhagen M., Hänggi D., Hans V., Rozsnoki S., Hansford J. R., Kohlhof P., Kristensen B. W., Lechner M., Lopes B., Mawrin C., Ketter R., Kulozik A., Khatib Z., Heppner F., Koch A., Jouvet A., Keohane C., Mühleisen H., Mueller W., Pohl U., Prinz M., Benner A., Zapatka M., Gottardo N. G., Driever P. H., Kramm C. M., Müller H. L., Rutkowski S., von Hoff K., Frühwald M. C., Gnekow A., Fleischhack G., Tippelt S., Calaminus G., Monoranu C.-M., Pfister S. M., DNA methylation-based classification of central nervous system tumours. Nature 555, 469–474 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Maier A. D., Christiansen S. N., Haslund-Vinding J., Krogager M. E., Melchior L. C., Scheie D., Mathiesen T., DNA methylation profile of human dura and leptomeninges. J. Neuropathol. Exp. Neurol. 82, 641–649 (2023). [DOI] [PubMed] [Google Scholar]
  • 98.Zhou W., Triche T. J., Laird P. W., Shen H., SeSAMe: Reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res. 46, e123 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Meissner A., Gnirke A., Bell G. W., Ramsahoye B., Lander E. S., Jaenisch R., Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 33, 5868–5877 (2005). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 100.Gu C., Liu S., Wu Q., Zhang L., Guo F., Integrative single-cell analysis of transcriptome, DNA methylome and chromatin accessibility in mouse oocytes. Cell Res. 29, 110–123 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Guo F., Li L., Li J., Wu X., Hu B., Zhu P., Wen L., Tang F., Single-cell multi-omics sequencing of mouse early embryos and embryonic stem cells. Cell Res. 27, 967–988 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Nichols R. V., O’Connell B. L., Mulqueen R. M., Thomas J., Woodfin A. R., Acharya S., Mandel G., Pokholok D., Steemers F. J., Adey A. C., High-throughput robust single-cell DNA methylation profiling with sciMETv2. Nat. Commun. 13, 7627 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 103.Hammal F., de Langen P., Bergon A., Lopez F., Ballester B., ReMap 2022: A database of Human, Mouse, Drosophila and Arabidopsis regulatory regions from an integrative analysis of DNA-binding sequencing experiments. Nucleic Acids Res. 50, D316–D325 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Carrot-Zhang J., Chambwe N., Damrauer J. S., Knijnenburg T. A., Robertson A. G., Yau C., Zhou W., Berger A. C., Huang K.-L., Newberg J. Y., Mashl R. J., Romanel A., Sayaman R. W., Demichelis F., Felau I., Frampton G. M., Han S., Hoadley K. A., Kemal A., Laird P. W., Lazar A. J., Le X., Oak N., Shen H., Wong C. K., Zenklusen J. C., Ziv E., Cancer Genome Atlas Analysis Network, Cherniack A. D., Beroukhim R., Comprehensive analysis of genetic ancestry and its molecular correlates in cancer. Cancer Cell 37, 639–654.e6 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 105.Yu G., Wang L.-G., He Q.-Y., ChIPseeker: An R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31, 2382–2383 (2015). [DOI] [PubMed] [Google Scholar]
  • 106.Maden S. K., Thompson R. F., Hansen K. D., Nellore A., Human methylome variation across Infinium 450K data on the Gene Expression Omnibus. NAR Genom. Bioinform. 3, lqab025 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium, Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Wahl S., Drong A., Lehne B., Loh M., Scott W. R., Kunze S., Tsai P.-C., Ried J. S., Zhang W., Yang Y., Tan S., Fiorito G., Franke L., Guarrera S., Kasela S., Kriebel J., Richmond R. C., Adamo M., Afzal U., Ala-Korpela M., Albetti B., Ammerpohl O., Apperley J. F., Beekman M., Bertazzi P. A., Black S. L., Blancher C., Bonder M.-J., Brosch M., Carstensen-Kirberg M., de Craen A. J. M., de Lusignan S., Dehghan A., Elkalaawy M., Fischer K., Franco O. H., Gaunt T. R., Hampe J., Hashemi M., Isaacs A., Jenkinson A., Jha S., Kato N., Krogh V., Laffan M., Meisinger C., Meitinger T., Mok Z. Y., Motta V., Ng H. K., Nikolakopoulou Z., Nteliopoulos G., Panico S., Pervjakova N., Prokisch H., Rathmann W., Roden M., Rota F., Rozario M. A., Sandling J. K., Schafmayer C., Schramm K., Siebert R., Slagboom P. E., Soininen P., Stolk L., Strauch K., Tai E.-S., Tarantini L., Thorand B., Tigchelaar E. F., Tumino R., Uitterlinden A. G., van Duijn C., van Meurs J. B. J., Vineis P., Wickremasinghe A. R., Wijmenga C., Yang T.-P., Yuan W., Zhernakova A., Batterham R. L., Smith G. D., Deloukas P., Heijmans B. T., Herder C., Hofman A., Lindgren C. M., Milani L., van der Harst P., Peters A., Illig T., Relton C. L., Waldenberger M., Järvelin M.-R., Bollati V., Soong R., Spector T. D., Chambers J. C., Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature 541, 81–86 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 109.Cancer Genome Atlas Research Network, Weinstein J. N., Collisson E. A., Mills G. B., Shaw K. R. M., Ozenberger B. A., Ellrott K., Shmulevich I., Sander C., Stuart J. M., The Cancer Genome Atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Bibikova M., Barnes B., Tsan C., Ho V., Klotzle B., Le J. M., Delano D., Zhang L., Schroth G. P., Gunderson K. L., Fan J.-B., Shen R., High density DNA methylation array with single CpG site resolution. Genomics 98, 288–295 (2011). [DOI] [PubMed] [Google Scholar]
  • 111.Johnson B. S., Zhao Y.-T., Fasolino M., Lamonica J. M., Kim Y. J., Georgakilas G., Wood K. H., Bu D., Cui Y., Goffin D., Vahedi G., Kim T. H., Zhou Z., Biotin tagging of MeCP2 in mice reveals contextual insights into the Rett syndrome transcriptome. Nat. Med. 23, 1203–1214 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 112.Thorsson V., Gibbs D. L., Brown S. D., Wolf D., Bortone D. S., Yang T.-H. O., Porta-Pardo E., Gao G. F., Plaisier C. L., Eddy J. A., Ziv E., Culhane A. C., Paull E. O., Sivakumar I. K. A., Gentles A. J., Malhotra R., Farshidfar F., Colaprico A., Parker J. S., Mose L. E., Vo N. S., Liu J., Liu Y., Rader J., Dhankani V., Reynolds S. M., Bowlby R., Califano A., Cherniack A. D., Anastassiou D., Bedognetti D., Mokrab Y., Newman A. M., Rao A., Chen K., Krasnitz A., Hu H., Malta T. M., Noushmehr H., Pedamallu C. S., Bullman S., Ojesina A. I., Lamb A., Zhou W., Shen H., Choueiri T. K., Weinstein J. N., Guinney J., Saltz J., Holt R. A., Rabkin C. S., Cancer Genome Atlas Research Network, Lazar A. J., Serody J. S., Demicco E. G., Disis M. L., Vincent B. G., Shmulevich I., The immune landscape of cancer. Immunity 48, 812–830.e14 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Ferilli M., Ciolfi A., Pedace L., Niceta M., Radio F. C., Pizzi S., Miele E., Cappelletti C., Mancini C., Galluccio T., Andreani M., Iascone M., Chiriatti L., Novelli A., Micalizzi A., Matraxia M., Menale L., Faletra F., Prontera P., Pilotta A., Bedeschi M. F., Capolino R., Baban A., Seri M., Mammì C., Zampino G., Digilio M. C., Dallapiccola B., Priolo M., Tartaglia M., Genome-wide DNA methylation profiling solves uncertainty in classifying NSD1 variants. Genes 13, 2163 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Vorontsov I. E., Eliseeva I. A., Zinkevich A., Nikonov M., Abramov S., Boytsov A., Kamenets V., Kasianova A., Kolmykov S., Yevshin I. S., Favorov A., Medvedeva Y. A., Jolma A., Kolpakov F., Makeev V. J., Kulakovskiy I. V., HOCOMOCO in 2024: A rebuild of the curated collection of binding models for human and mouse transcription factors. Nucleic Acids Res. 52, D154–D163 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Cuellar-Partida G., Buske F. A., McLeay R. C., Whitington T., Noble W. S., Bailey T. L., Epigenetic priors for identifying active transcription factor binding sites. Bioinformatics 28, 56–62 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Ming X., Zhang Z., Zou Z., Lv C., Dong Q., He Q., Yi Y., Li Y., Wang H., Zhu B., Kinetics and mechanisms of mitotic inheritance of DNA methylation and their roles in aging-associated methylome deterioration. Cell Res. 30, 980–996 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Ravichandran M., Rafalski D., Davies C. I., Ortega-Recalde O., Nan X., Glanfield C. R., Kotter A., Misztal K., Wang A. H., Wojciechowski M., Rażew M., Mayyas I. M., Kardailsky O., Schwartz U., Zembrzycki K., Morison I. M., Helm M., Weichenhan D., Jurkowska R. Z., Krueger F., Plass C., Zacharias M., Bochtler M., Hore T. A., Jurkowski T. P., Pronounced sequence specificity of the TET enzyme catalytic domain guides its cellular function. Sci. Adv. 8, eabm2427 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 118.Zheng R., Wan C., Mei S., Qin Q., Wu Q., Sun H., Chen C.-H., Brown M., Zhang X., Meyer C. A., Liu X. S., Cistrome data browser: Expanded datasets and new tools for gene regulatory analysis. Nucleic Acids Res. 47, D729–D735 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 119.Roadmap Epigenomics Consortium, Kundaje A., Meuleman W., Ernst J., Bilenky M., Yen A., Heravi-Moussavi A., Kheradpour P., Zhang Z., Wang J., Ziller M. J., Amin V., Whitaker J. W., Schultz M. D., Ward L. D., Sarkar A., Quon G., Sandstrom R. S., Eaton M. L., Wu Y.-C., Pfenning A. R., Wang X., Claussnitzer M., Liu Y., Coarfa C., Harris R. A., Shoresh N., Epstein C. B., Gjoneska E., Leung D., Xie W., Hawkins R. D., Lister R., Hong C., Gascard P., Mungall A. J., Moore R., Chuah E., Tam A., Canfield T. K., Hansen R. S., Kaul R., Sabo P. J., Bansal M. S., Carles A., Dixon J. R., Farh K.-H., Feizi S., Karlic R., Kim A.-R., Kulkarni A., Li D., Lowdon R., Elliott G., Mercer T. R., Neph S. J., Onuchic V., Polak P., Rajagopal N., Ray P., Sallari R. C., Siebenthall K. T., Sinnott-Armstrong N. A., Stevens M., Thurman R. E., Wu J., Zhang B., Zhou X., Beaudet A. E., Boyer L. A., De Jager P. L., Farnham P. J., Fisher S. J., Haussler D., Jones S. J. M., Li W., Marra M. A., McManus M. T., Sunyaev S., Thomson J. A., Tlsty T. D., Tsai L.-H., Wang W., Waterland R. A., Zhang M. Q., Chadwick L. H., Bernstein B. E., Costello J. F., Ecker J. R., Hirst M., Meissner A., Milosavljevic A., Ren B., Stamatoyannopoulos J. A., Wang T., Kellis M., Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.van der Velde A., Fan K., Tsuji J., Moore J. E., Purcaro M. J., Pratt H. E., Weng Z., Annotation of chromatin states in 66 complete mouse epigenomes during development. Commun. Biol. 4, 239 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Lee D.-S., Luo C., Zhou J., Chandran S., Rivkin A., Bartlett A., Nery J. R., Fitzpatrick C., O’Connor C., Dixon J. R., Ecker J. R., Simultaneous profiling of 3D genome structure and DNA methylation in single human cells. Nat. Methods 16, 999–1006 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 122.Stunnenberg H. G., International Human Epigenome Consortium, Hirst M., The international human epigenome consortium: A blueprint for scientific collaboration and discovery. Cell 167, 1145–1149 (2016). [DOI] [PubMed] [Google Scholar]
  • 123.Battram T., Yousefi P., Crawford G., Prince C., Sheikhali Babaei M., Sharp G., Hatcher C., Vega-Salas M. J., Khodabakhsh S., Whitehurst O., Langdon R., Mahoney L., Elliott H. R., Mancano G., Lee M. A., Watkins S. H., Lay A. C., Hemani G., Gaunt T. R., Relton C. L., Staley J. R., Suderman M., The EWAS catalog: A database of epigenome-wide association studies. Wellcome Open Res. 7, 41 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 124.Li M., Zou D., Li Z., Gao R., Sang J., Zhang Y., Li R., Xia L., Zhang T., Niu G., Bao Y., Zhang Z., EWAS Atlas: A curated knowledgebase of epigenome-wide association studies. Nucleic Acids Res. 47, D983–D988 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 125.Amemiya H. M., Kundaje A., Boyle A. P., The ENCODE blacklist: Identification of problematic regions of the genome. Sci. Rep. 9, 9354 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 126.Shannon P., Markiel A., Ozier O., Baliga N. S., Wang J. T., Ramage D., Amin N., Schwikowski B., Ideker T., Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 127.Gu Z., Eils R., Schlesner M., Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847–2849 (2016). [DOI] [PubMed] [Google Scholar]
  • 128.Gu Z., Gu L., Eils R., Schlesner M., Brors B., circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014). [DOI] [PubMed] [Google Scholar]
  • 129.Chen E. Y., Tan C. M., Kou Y., Duan Q., Wang Z., Meirelles G. V., Clark N. R., Ma’ayan A., Enrichr: Interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinformatics 14, 128 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 130.Hu C., Li T., Xu Y., Zhang X., Li F., Bai J., Chen J., Jiang W., Yang K., Ou Q., Li X., Wang P., Zhang Y., CellMarker 2.0: An updated database of manually curated cell markers in human/mouse and web tools based on scRNA-seq data. Nucleic Acids Res. 51, D870–D876 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 131.Wang J., Zhuang J., Iyer S., Lin X., Whitfield T. W., Greven M. C., Pierce B. G., Dong X., Kundaje A., Cheng Y., Rando O. J., Birney E., Myers R. M., Noble W. S., Snyder M., Weng Z., Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 132.Guo H., Hu B., Yan L., Yong J., Wu Y., Gao Y., Guo F., Hou Y., Fan X., Dong J., Wang X., Zhu X., Yan J., Wei Y., Jin H., Zhang W., Wen L., Tang F., Qiao J., DNA methylation and chromatin accessibility profiling of mouse and human fetal germ cells. Cell Res. 27, 165–183 (2017). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 133.Bai D., Zhang X., Xiang H., Guo Z., Zhu C., Yi C., Simultaneous single-cell analysis of 5mC and 5hmC with SIMPLE-seq. Nat. Biotechnol. 43, 85–96 (2025). [DOI] [PubMed] [Google Scholar]
  • 134.Schutsky E. K., DeNizio J. E., Hu P., Liu M. Y., Nabel C. S., Fabyanic E. B., Hwang Y., Bushman F. D., Wu H., Kohli R. M., Nondestructive, base-resolution sequencing of 5-hydroxymethylcytosine using a DNA deaminase. Nat. Biotechnol. 36, 1083–1090 (2018). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 135.Kuleshov M. V., Jones M. R., Rouillard A. D., Fernandez N. F., Duan Q., Wang Z., Koplev S., Jenkins S. L., Jagodnik K. M., Lachmann A., McDermott M. G., Monteiro C. D., Gundersen G. W., Ma’ayan A., Enrichr: A comprehensive gene set enrichment analysis web server 2016 update. Nucleic Acids Res. 44, W90–W97 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 136.Karlsson M., Zhang C., Méar L., Zhong W., Digre A., Katona B., Sjöstedt E., Butler L., Odeberg J., Dusart P., Edfors F., Oksvold P., von Feilitzen K., Zwahlen M., Arif M., Altay O., Li X., Ozcan M., Mardinoglu A., Fagerberg L., Mulder J., Luo Y., Ponten F., Uhlén M., Lindskog C., A single-cell type transcriptomics map of human tissues. Sci. Adv. 7, eabh2169 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 137.Zhou W., Hinoue T., Barnes B., Mitchell O., Iqbal W., Lee S. M., Foy K. K., Lee K.-H., Moyer E. J., VanderArk A., Koeman J. M., Ding W., Kalkat M., Spix N. J., Eagleson B., Pospisilik J. A., Szabó P. E., Bartolomei M. S., Vander Schaaf N. A., Kang L., Wiseman A. K., Jones P. A., Krawczyk C. M., Adams M., Porecha R., Chen B. H., Shen H., Laird P. W., DNA methylation dynamics and dysregulation delineated by high-throughput profiling in the mouse. Cell Genomics 2, 100144 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Figs. S1 to S6

Legends for tables S1 and S2


Tables S1 and S2


Articles from Science Advances are provided here courtesy of American Association for the Advancement of Science
