Communications Biology. 2025 Oct 14;8:1468. doi: 10.1038/s42003-025-08859-2

Single-cell DNA methylation analysis tool Amethyst resolves distinct non-CG methylation patterns in human astrocytes and oligodendrocytes

Lauren E Rylaarsdam 1, Benjamin W Skubi 2,3, Ruth V Nichols 1, Brendan L O’Connell 1,2, Jack Henry Kotnik 4, Stephen D Coleman 2,5,6, Galip Gürkan Yardımcı 2,5,6, Andrew C Adey 1,2,6,7
PMCID: PMC12521561  PMID: 41087658

Abstract

Single-cell sequencing technologies have revolutionized biomedical research by enabling deconvolution of cell type-specific properties from heterogeneous tissue. While robust tools have been developed to handle bioinformatic challenges posed by single-cell RNA and ATAC data, options for emergent modalities such as methylation are limited, impeding the utility of results. Here we present Amethyst, a comprehensive R package for atlas-scale single-cell methylation sequencing data analysis. Amethyst begins with base-level methylation calls and enables clustering of distinct biological populations, cell type annotation, differentially methylated region calling, and interpretation of results - facilitating rapid data interaction in a local environment. We introduce the workflow using published single-cell methylation human peripheral blood mononuclear cell and cortex data. We further apply Amethyst to an atlas-scale brain dataset and deconvolute non-CG methylation patterns in human astrocytes and oligodendrocytes, challenging the notion that this form of methylation is principally relevant to neurons in the brain. Tools such as Amethyst will make single-cell methylation data analysis more accessible, catalyzing research progress across diverse contexts.

Subject terms: Epigenetics in the nervous system, Genome informatics, Epigenomics


A study uses Amethyst, a comprehensive single-cell methylation analysis tool, to resolve distinct non-CG methylation patterns in human astrocytes and oligodendrocytes. Findings challenge historically neuron-centric perspectives on noncanonical methylation.

Introduction

Within an organism, thousands of distinct cell types are established and maintained using the same underlying genomic sequence. This diversity is largely mediated by the orchestration of gene expression through epigenetic modifications, including the covalent attachment of a methyl group to the carbon-5 position of cytosines. Methylation can then promote or inhibit the binding of various transcription factors to DNA, thereby precisely tuning gene expression to meet the specific needs of each cell1. Aberrant methylation patterns are associated with nearly every disease state, including many types of cancers, metabolic disorders, and neurological disorders2–8, underscoring the relevance of methylation to a multitude of biomedical research fields.

Methylation canonically occurs at a cytosine followed by a guanine (mCG). In mammalian tissues, while most available CG sites are methylated genome-wide, regulatory regions are CG-enriched and display highly variable methylation levels. It was recently established that select mammalian cell types can additionally utilize methylation at cytosines followed by non-guanine nucleotides (mCH)9–14. This noncanonical pattern is most abundantly found in brain tissue9–13. At birth, the brain has high mCG levels but is relatively devoid of mCH10. Millions of CH sites are then rapidly methylated in an early postnatal window coinciding with peak synaptogenesis, until total mCH sites may exceed the number of mCG sites10,15. The resulting patterns are more subtype-specific than mCG15–17 and are thought to fine-tune the expression of genes distinguishing closely related neuronal subtypes18. While mCH is also present in glia at about fivefold lower levels9,10,15,16, this population has historically been overlooked given the relative abundance in neurons, except for one report of NeuN− bulk tissue exhibiting hyper-mCH in a gene subset with key roles in neuronal development10.

Despite the high heterogeneity of brain tissue, studies investigating subtype-specific methylation principles are limited. This is largely due to the lag in single-cell methylation technologies compared to those for other epigenetic modalities like RNA or ATAC. While methods such as scBS-seq19, scRRBS20, and snmC-seq21,22 have made it possible to analyze methylation on a single-cell level for over a decade, these approaches are constrained by limitations such as coverage, throughput, and cost. These hurdles were recently ameliorated with the advent of combinatorial indexing-based strategies for methylation analysis (sciMET)23–26. By iteratively applying indexes, sciMET circumvents the need to process each cell in isolation, thereby exponentially increasing throughput while simultaneously reducing processing costs per cell. The next challenge lies in data analysis: although comprehensive packages exist for exploration of single-cell RNA and ATAC data, available tools for single-cell methylation data analysis27 generally (1) are not equipped to handle large datasets; (2) focus solely on mCG; or (3) address a few steps in the scope of the entire workflow. One exception is ALLCools15,28, a package written to analyze the output of the snmC-seq17,21,22 workflow. However, ALLCools is Python-based, while a rich single-cell data analysis ecosystem exists within the R framework due to the ubiquity of tools like Seurat29–31, Signac32, Monocle33,34, and ArchR35.

Here we introduce Amethyst, a comprehensive R package specifically designed for single-cell methylation analysis, lowering the bioinformatic expertise required to work with this modality. Amethyst is capable of efficiently processing data from hundreds of thousands of high-coverage cells by performing initial computationally intensive steps on a cluster, followed by rapid local interaction of the output in RStudio. Versatile functions are provided to facilitate integration, doublet detection, clustering, annotation, differentially methylated region (DMR) identification, and interpretation of results. We demonstrate the utility of Amethyst by exploring mCH patterns specific to mature glia of ectodermal lineage, a context typically emphasized in neuronal populations. We find that non-CG methylation in astrocytes and oligodendrocytes follows similar principles to what has been shown in neurons: it accumulates across important neuronal genes in a manner anticorrelated with expression, the composite trinucleotide contexts are methylated at similar frequencies, and both populations display hyper-mCH across genes escaping X-inactivation. Tools such as Amethyst will increase accessibility to single-cell methylation data analysis, catalyzing research progress.

Results

Amethyst facilitates comprehensive analysis of high-throughput single-cell methylation datasets

Following initial processing of reads, the first objective in single-cell methylation data analysis is to resolve distinct biological populations from gigabytes of base-level methylation calls (Fig. 1a). This is done in Amethyst by first calculating methylation levels over a feature set of genomic regions for each cell. While transcriptomic analysis typically starts with a gene-count matrix, methylation analysis poses a much greater computational burden, as each cytosine position is a feature itself. Amethyst navigates this challenge by storing base-level methylation calls in an hdf5 format. Aggregate methylation levels over distinct features can then be read in and calculated as needed, either by using provided functions in Amethyst or by applying the highly efficient Python-based helper package Facet36, which writes results to the parent hdf5 file.
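The window-aggregation step described above can be illustrated with a minimal Python sketch. It assumes that base-level calls for one cell and one cytosine context arrive as (chromosome, position, methylated count, unmethylated count) tuples. The function name `window_methylation`, the tuple layout, and the `min_cov` filter are hypothetical illustrations and are not part of the Amethyst or Facet API.

```python
from collections import defaultdict

WINDOW_SIZE = 100_000  # 100 kb windows, matching the feature set used in the text


def window_methylation(calls, window_size=WINDOW_SIZE, min_cov=1):
    """Aggregate one cell's base-level calls into % methylation per window.

    calls: iterable of (chrom, pos, methylated, unmethylated) tuples
           (hypothetical layout, not the Amethyst hdf5 schema).
    Returns {(chrom, window_start): percent_methylated}.
    """
    meth = defaultdict(int)   # methylated observations per window
    total = defaultdict(int)  # all observations per window
    for chrom, pos, m, u in calls:
        key = (chrom, (pos // window_size) * window_size)
        meth[key] += m
        total[key] += m + u
    # Windows with too few observations are dropped rather than reported as 0%.
    return {k: 100.0 * meth[k] / total[k] for k in total if total[k] >= min_cov}


# Toy data for illustration only.
example_calls = [
    ("chr1", 10_500, 1, 0),
    ("chr1", 99_999, 0, 1),
    ("chr1", 100_000, 1, 0),
    ("chr2", 250_000, 0, 1),
]
levels = window_methylation(example_calls)
```

Repeating this per cell yields a cell-by-window matrix, which is the kind of input the dimensionality reduction step operates on.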

Fig. 1. Amethyst facilitates comprehensive analysis of high-throughput single-cell methylation datasets.


a Overview of the Amethyst workflow. Sequencing output is pre-processed using pipelines such as Premethyst to generate base-level methylation calls per cell. Methylation levels over a feature set are used to resolve distinct biological populations. Following cell type annotation, differentially methylated regions (DMRs) are identified and visualized. b Qualitative comparison of Amethyst to other single-cell methylation sequencing analysis tools generated since 2017. Tools are grouped by principal focus. Yr = year the manuscript describing the package was published. Max test = maximum dataset cell size, in either the manuscript or literature, for which methylation-specific analysis was demonstrated. c Quantitative comparison of collective processing time and maximum resident set size using a test brain mCG dataset of n = 1346 cells24. Not all packages address each step in this process. 10 threads were utilized if multithreading was supported. See “Methods” and Supplementary Data 1.

Methylation levels across feature sets are then condensed to a lower-dimensional space. Any dimensionality reduction method can be used, but the Amethyst default is a fast truncated singular value decomposition computed with the Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA)37. Amethyst provides a helper function for estimating how many dimensions are needed to achieve the desired amount of variance explained. Batch correction with Harmony38, mitigation of potential coverage biases, and doublet removal can be applied after this step if appropriate (Supplementary Fig. 1). This output is then used to calculate cluster membership with either a Louvain39,40- or Leiden41-based method and two-dimensional coordinates with either Uniform Manifold Approximation and Projection (UMAP)42,43 or t-Distributed Stochastic Neighbor Embedding (t-SNE)44. Biological identity of each cluster is then determined by assessing methylation levels over canonical marker genes or by correlating to a pre-existing atlas. Finally, users can run DMR analysis between any combination of groups. Amethyst provides a suite of visualization tools to aid in the interpretation of the output.
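The idea of choosing how many dimensions to keep for a target amount of variance explained can be sketched as follows. This is an illustrative Python example, not Amethyst's helper function: the name `dims_for_variance` is hypothetical, and it assumes variance is proportional to the squared singular values and is measured relative to the components that were actually computed (a truncated decomposition does not see the full variance).

```python
def dims_for_variance(singular_values, target=0.9):
    """Smallest number of leading components reaching `target` variance.

    Variance per component is taken as the squared singular value, and the
    total is the sum over the supplied components only (an assumption that
    matters when the decomposition is truncated).
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    variances = sorted((s * s for s in singular_values), reverse=True)
    total = sum(variances)
    cumulative = 0.0
    for k, v in enumerate(variances, start=1):
        cumulative += v
        if cumulative / total >= target:
            return k
    return len(variances)


# Toy singular values for illustration only.
toy_singular_values = [4.0, 2.0, 1.0, 1.0]
n_dims_90 = dims_for_variance(toy_singular_values, target=0.90)
n_dims_95 = dims_for_variance(toy_singular_values, target=0.95)
```

In this toy case the squared values are 16, 4, 1, and 1, so two components cover 20/22 of the variance and three cover 21/22.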

We compared the breadth of Amethyst to existing packages for single-cell methylation analysis (Fig. 1). Of the 16 packages assessed27,28,45–59, most focused on a specific component of the process (Fig. 1b): for example, MELISSA49, BPRMeth50, and Epiclomal51 all emphasize cluster resolution; MethSCAn47 identifies variably methylated regions (VMRs) and DMRs; and scMET48 does both. Three additional pipeline packages sought to encompass most major steps in addition to Amethyst: ALLCools28, EpiScanpy45, and MethylStar46. ALLCools28 was the most robust alternative of the three, as EpiScanpy45 is best documented for handling RNA + ATAC data, and MethylStar46 relies on other tools for visualization and interpretation of results.

We next benchmarked Amethyst’s performance against ALLCools and other applicable packages (Fig. 1c and Supplementary Data 1) using a test dataset of 1346 human brain cells24. We broadly categorized steps in each workflow into three main components: preparation, clustering, and DMR identification. Preparation typically involved calculation of the feature set, for which we used mCG levels over 100 kb genomic windows. This modest test dataset proved too large for packages MELISSA and BPRMeth, which required in-memory storage of billions of base-level calls. At the clustering stage, dataset size also posed challenges for scMET and Epiclomal, which were specifically designed to handle sparse single-cell methylomes. Clustering proceeded quickest for Amethyst due to the utilization of the fast truncated singular value decomposition package IRLBA for dimensionality reduction. However, if clustering methods from ALLCools, MethSCAn, MOFA+, or any other alternative strategies are more appropriate for a dataset, Amethyst can readily incorporate externally calculated results into the object framework. This versatility will enable the rapid adoption of novel computational or molecular workflows as this fast-moving field evolves.

Next, we benchmarked DMR testing between excitatory and inhibitory neurons in our brain dataset with Amethyst, ALLCools, and MethSCAn. When leveraging Facet, all three methods performed similarly, with MethSCAn completing the DMR step the fastest at the cost of a computationally expensive preparation stage. ALLCools had implementation challenges and fewer methylation-specific visualization tools for downstream interpretation: for example, users may need to export methylation tracks into the Integrative Genomics Viewer to visualize DMR results, adding intermediate steps. Overall, Amethyst performs faster than or comparably to all existing single-cell methylation packages and has the advantage of built-in methylation-specific visualization features.

Amethyst enables the resolution of biologically distinct populations

Next, we applied Amethyst to two published single-cell methylation datasets: 3138 human peripheral blood mononuclear cells25 (PBMCs) and 1346 human brain cells from the middle frontal gyrus24. We first applied dimensionality reductions on methylation levels over fixed genomic windows, as this is a good starting point for most tissues and the recommended default for Amethyst. mCG levels over 100 kb windows effectively resolved distinct populations for both PBMC (Fig. 2a) and brain tissue (Fig. 2b). However, selection of the optimal feature set may depend on the dataset. For example, clustering PBMCs using previously determined VMRs resolved one more group than the 100 kb window method (Fig. 2a), while the uniquely high mCH status of neurons enabled better separation of clusters when using both mCG and mCH modalities compared to either context alone (Fig. 2b; respective silhouette scores of 0.67, 0.57, and 0.50). In both PBMC (Fig. 2a) and brain (Fig. 2b) data, MethSCAn and MOFA+ factors also cleanly resolved individual populations. Amethyst has a flexible object structure that can incorporate any number of alternative clustering strategies if using methylation levels over fixed genomic windows is insufficient.
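For readers unfamiliar with the silhouette score used above to compare feature sets, the following is a minimal Python sketch of the standard definition. It assumes Euclidean distance on low-dimensional embedding coordinates; the source does not state which distance or implementation was used, and the function name and toy points are illustrative only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points, using Euclidean distance.

    For each point: a = mean distance to its own cluster, b = lowest mean
    distance to any other cluster, s = (b - a) / max(a, b).
    Singleton clusters score 0 by convention.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, point in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            scores.append(0.0)
            continue
        a = sum(math.dist(point, points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(math.dist(point, points[j]) for j in members) / len(members)
            for lab, members in clusters.items()
            if lab != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2D embedding: two well-separated pairs of cells.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good_score = silhouette_score(toy_points, [0, 0, 1, 1])  # labels match structure
bad_score = silhouette_score(toy_points, [0, 1, 0, 1])   # labels cut across it
```

Scores near 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned groups, which is why a higher score favors the combined mCG and mCH feature set.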

Fig. 2. Amethyst enables the resolution of biologically distinct populations.


a Examples of dimensionality reductions using different modalities for a test dataset of n = 3318 peripheral blood mononuclear cells (PBMCs)25: mCG score over 100 kb windows in Amethyst; mCG variably methylated regions (VMRs) identified using MethSCAn47; MOFA+55 factors generated from top 3000 variably CG methylated 100 kb windows; and mCG score over VMRs previously identified using Amethyst. b Examples of dimensionality reductions using different modalities for a test dataset of n = 1346 brain cells24: mCG score over 100 kb windows in Amethyst; mCG VMRs identified using MethSCAn47; MOFA+55 factors generated from top 3000 variably CG methylated 100 kb windows; and mCG score over 100 kb windows plus mCH percent over 100 kb windows in Amethyst. c, d Visualization functions heatMap (left) and histograM (right) show mCG levels over canonical marker genes in human PBMCs (c) and brain tissue data (d). Color represents %mCG for each 500 bp genome window averaged by cell type, with blue indicating hypomethylated regions. Exons are in black and putative promoter regions (start ±1500 bp) in pink. e The histograM function used to show gene body mCH levels over canonical inhibitory neuron gene GAD1 and excitatory transcription factor SATB2. Cell type headings in (d) apply. f The dotM function output shows promoter mCG levels of marker genes in PBMCs averaged by cell type. Dot size represents %mCG and color indicates Z score relative to promoter %mCG of all protein-coding genes. g The dotM function used to plot gene body %mCH levels in brain data averaged by cell type. Dot size represents gene body %mCH; color indicates gene body %mCH normalized by the mean for all protein-coding genes within that subtype. 
h The dimM function shows gene body %mCH hypermethylation of GAD1 in excitatory neurons (green; mean ± SE: 6.96 ± 0.20 vs 1.55 ± 0.05; W = 54280, pWilcoxon < 2.2E−16) and SATB2 in inhibitory neurons (red; mean ± SE: 7.19 ± 0.18 vs 1.67 ± 0.03; W = 57048, pWilcoxon < 2.2E−16) superimposed over UMAP embeddings generated in Fig. 2b. Statistics were calculated using a two-sided Wilcoxon rank-sum test (pShapiro < 2.2E−16). Data are imputed with Rmagic82 to smooth values; see Supplementary Fig. 2 for unimputed results. i Pearson correlation between mean gene body mCH levels of 1502 variable features identified in Luo et al.16 (rows) and data used in this study24 (columns; please see the “Cell type annotation” section in “Methods” and Supplementary Data 1 for further details).

Cell type assignment is the next step after clustering. This is challenging in any single-cell analysis, but particularly when exploring data from an emerging modality. Amethyst provides multiple methods to facilitate this process, including: (1) functions for visualizing mCG patterns across known marker genes (Fig. 2c, d); (2) tools to investigate gene body mCH or promoter mCG status between groups (Fig. 2e–h); and (3) comparison to existing references (Fig. 2i). Assessing mCG patterns across known marker genes is a powerful method for cell type annotation. Hypo-mCG across the promoter region is canonically associated with gene expression, but this principle does not always apply. Many validated marker genes will have no variable methylation patterns between groups; some will have universal promoter hypomethylation, yet distinct patterns of hypomethylation in the group predicted to express the gene (e.g., SLC17A7 in Fig. 2d); and still others will appear to utilize a transposed promoter region from the bioinformatically predicted site (e.g., C1QA in Fig. 2d). It is therefore essential to visualize mCG patterns across the entire gene body. Amethyst provides two functions to do this, heatMap and histograM (Fig. 2c–e), facilitating nuanced analysis of fluctuations in methylation levels at a user-defined resolution. Consensus between known marker genes using this method is often sufficient for the deduction of biological identity.

Amethyst also provides summary tools for calculating and visualizing aggregate methylation levels over a feature (Fig. 2f–h). This can work well for quantifying promoter mCG levels of genes when the methylation status is known to be variable between groups, such as select PBMC markers shown in Fig. 2f. Tools for aggregate values are especially useful in the context of annotating brain tissue data, as mCH levels over canonical marker genes are highly subtype specific. For example, excitatory neurons have high mCH across inhibitory neuron enzyme glutamate decarboxylase 1 (GAD1; pWilcoxon < 2.2E−16), while inhibitory neurons inversely have high mCH across excitatory neuron transcription factor SATB homeobox 2 (SATB2; pWilcoxon < 2.2E−16; Fig. 2g, h)15. Plotting GAD1 and SATB2 %mCH levels over UMAP coordinates clearly resolves these two broad classes of neurons (Fig. 2h).

Cell type annotation using gene-specific methylation information requires complex prior knowledge of expression patterns and puts significant weight on a few genes. A more robust method is to compare with an annotated reference atlas. These are currently limited, but do exist for brain data15–17. We therefore leveraged a landmark human brain single-cell methylation dataset16 to determine population identity by correlating mean %mCH over genes with known transcriptomic variability to mean values in the reference dataset16 (Fig. 2i). Currently, Amethyst provides two built-in references to facilitate brain and PBMC data annotation, and more tissue types will continue to be made available as the body of single-cell methylation atlases expands.
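The reference-correlation approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: each cluster and each reference cell type is summarized as a vector of mean gene body %mCH over the same ordered gene list, and each cluster takes the label of the reference type with the highest Pearson correlation. The function names and all numeric values are hypothetical toy data, not values from the study or the Amethyst API.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)  # undefined if either vector is constant


def annotate_clusters(cluster_profiles, reference_profiles):
    """Assign each cluster the best-correlated reference cell type."""
    return {
        cluster: max(
            reference_profiles,
            key=lambda ref: pearson(profile, reference_profiles[ref]),
        )
        for cluster, profile in cluster_profiles.items()
    }


# Toy mean gene body %mCH over four shared genes (made-up numbers).
toy_reference = {
    "Exc": [1.0, 6.5, 2.0, 5.0],
    "Inh": [7.0, 1.5, 6.0, 2.0],
}
toy_clusters = {
    "cluster_1": [1.5, 7.0, 2.5, 4.0],
    "cluster_2": [6.0, 2.0, 6.5, 1.0],
}
toy_labels = annotate_clusters(toy_clusters, toy_reference)
```

In practice it is worth inspecting the full correlation matrix, as in Fig. 2i, rather than only the top hit, since closely related subtypes can correlate almost equally well.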

Amethyst robustly identifies DMRs

Successful identification of DMRs is at the core of analysis for most single-cell methylation experiments. Amethyst overcomes sparsity challenges by aggregating observations of methylated and unmethylated cytosines per cluster over short, smoothed genomic regions. A variation of a Fisher’s exact test from methylKit60 is then used for each genomic window on the proportion of methylated and unmethylated cytosines. This test can either be performed to identify population-specific DMRs or to compare between individual groups. A correction for multiple testing is applied and then adjacent significant windows are collapsed into one locus. To help with functional interpretation of DMRs, an annotation column is added listing any overlapping genes or left blank if the region is intergenic. Amethyst functions such as heatMap can then be used to visualize top DMRs within their genomic context (Fig. 3a–c).

Fig. 3. Amethyst robustly identifies differentially methylated regions.


a, b heatMap of top hypomethylated regions identified in peripheral blood mononuclear cells (PBMCs) (a) and brain tissue (b) in the mCG context. Color represents %mCG for each 500 bp genome window averaged by cell type, with blue showing hypomethylated regions. Any overlapping genes are plotted below. Exons are in black and putative promoter regions (start ±1500 bp) in pink. Encode candidate cis-regulatory elements (cCREs) are plotted below in orange. Statistics were calculated using a variation of a two-sided Fisher’s exact test with Bonferroni correction (see “Methods”); the black rectangle indicates the region identified as hypomethylated. ****padj < 0.0001; see Supplementary Data 2 and 3. c, d heatMap (c) and histograM (d) of top differentially methylated regions (DMRs) identified in brain data in the mCH context. See Supplementary Data 4 for statistics. Color represents %mCH for each 500 bp genome window averaged by cell type. Overlapping gene bodies are plotted below. Exons are in black and putative promoter regions (start ±1500 bp) in pink. In (c), hyper-mCH DMRs for each group are plotted in their respective color below. In (d), hyper-mCH regions are plotted in orange and hypo-mCH regions are plotted in blue below each group. Statistics were calculated as in (a, b) (see “Methods” and Supplementary Data 4 for full DMR list). e, f Bar chart showing number of PBMC (e) and brain (f) mCG DMRs per cell type overlapping GeneHancer61 annotated regulatory regions. g, h Gene ontology (GO) terms for genes overlapping hypo-mCG regions in PBMC (g) and brain (h) data. Color represents the cell type the term was identified in; size is a combined rank-based metric of enrichment and significance (see “Methods” for filtering criteria; Supplementary Data 1 for full list). + is an abbreviation for positive. i mCG DMR benchmarking was performed on n = 184 inhibitory neurons and n = 313 excitatory neurons from a brain dataset of 1346 cells24. 
j Venn diagram showing the percent overlap between differentially methylated sites identified between excitatory and inhibitory neurons using Amethyst (with Benjamini–Hochberg correction), MethSCAn47, and ALLCools28. k Density plot of DMR size distribution identified with Amethyst (Benjamini–Hochberg correction), MethSCAn47, and ALLCools28. l Percent DMRs identified with each method overlapping GeneHancer61 regulatory element annotations. Any base pair length overlap was considered. Both high-confidence DMRs (Bonferroni correction) and medium-confidence DMRs (Benjamini–Hochberg) were assessed for Amethyst. m Examples of DMRs identified with each method over LINGO1 and SLC6A1. heatMap parameters in (a) apply. See Supplementary Data 5 for full DMR list. Pink rectangles indicate the region zoomed in (n). n Base-pair resolution of an example region in (m). Red indicates methylated sites and blue unmethylated; white denotes a non-cytosine base pair or absence of data. DMRs identified with each package are plotted below.

For our test PBMC25 and brain24 datasets, top DMRs for each cell type fall within canonical marker genes (Fig. 3a–d and Supplementary Data 2 and 3), strongly supporting the efficacy of this method. For example, a top mCG hypomethylated DMR in the natural killer (NK) cell group encompasses the natural cytotoxicity triggering receptor 1 (NCR1) gene (padj = 1.72E−22), which encodes activating receptors that catalyze the NK cell immune response (Fig. 3a). In brain data, a top hypomethylated DMR for the excitatory layer 6 transducin-like enhancer of split 4 (Exc L6 TLE4) neuron group falls over its namesake TLE4 (padj = 6.94E−05), supporting the annotation which was generated by a correlation to a reference atlas15,17 (Fig. 3b). This approach can directly be applied to the mCH context as well (Supplementary Data 4). Top mCH DMRs include canonical marker genes15,16 like glutamate receptor ionotropic kainate type subunit 3 (GRIK3; Fig. 3c; see Supplementary Data 4 for all mCH DMR statistics) or cut like homeobox 1 (CUX1; Fig. 3d). Notably, the SATB2 gene body is hypermethylated specifically in medial and caudal ganglionic eminence (MGE/CGE)-derived inhibitory neurons (Fig. 3c), but not in the excitatory groups in which SATB2 is expressed, in concordance with the current understanding of mCH as a repressive epigenetic mark.

To further confirm the efficacy of our DMR identification approach, we assessed how many mCG DMRs overlapped with regulatory elements (Fig. 3e, f and Supplementary Data 1), which are known to contain the highest variability in methylation levels. In PBMC data, 91.2% of DMRs fell within enhancer or promoter regions (Fig. 3e), despite only about 12% of the genome being annotated as such in GeneHancer61. To determine putative functional relevance, we performed gene ontology (GO) analysis62 for genes overlapping hypomethylated regions. Top PBMC GO terms (Fig. 3g and Supplementary Data 1) were strongly related to expected functions, including “positive T cell selection” for CD8+ T cells (pelim = 1.3E−04) or “positive regulation of cell killing” for CD4+ T cells (pelim = 1.6E−04). Similar patterns were observed from DMRs identified in brain data: 83.1% fell within regulatory regions (Fig. 3f) and top ontology enrichments for genes with hypo-mCG regions were expected biological processes. Examples include “long-term synaptic potentiation” for inhibitory neurons derived from the medial ganglionic eminence (Inh MGE; pelim = 4.7E−06) and “microglial cell activation” for microglia (pelim = 1.2E−04; Fig. 3h and Supplementary Data 1). The enrichment of DMRs for regulatory regions and expected biological processes supports the robustness of Amethyst’s DMR identification approach.
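The regulatory-overlap percentage reported above can be computed along the following lines. This is a minimal Python sketch, assuming half-open genomic intervals and counting a DMR as overlapping if it shares at least one base pair with any annotated element. The function names and toy coordinates are hypothetical, and the quadratic scan is chosen for readability rather than speed.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def percent_overlapping(dmrs, annotations):
    """Percent of DMRs sharing at least one base with any annotation.

    dmrs, annotations: lists of (chrom, start, end) tuples.
    """
    if not dmrs:
        return 0.0
    by_chrom = {}
    for chrom, start, end in annotations:
        by_chrom.setdefault(chrom, []).append((start, end))
    n_hit = sum(
        any(overlaps(start, end, s, e) for s, e in by_chrom.get(chrom, []))
        for chrom, start, end in dmrs
    )
    return 100.0 * n_hit / len(dmrs)


# Toy coordinates for illustration only.
toy_dmr_regions = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
toy_regulatory = [("chr1", 150, 300), ("chr2", 80, 120)]
toy_percent = percent_overlapping(toy_dmr_regions, toy_regulatory)
```

Comparing the resulting percentage with the fraction of the genome covered by the annotation (about 12% for GeneHancer in the text) gives a sense of how strongly DMRs are enriched in regulatory regions.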

Finally, we compared DMRs between 313 excitatory and 184 inhibitory neurons (Fig. 3i) identified using Amethyst against those detected with ALLCools28 and MethSCAn47 during the benchmarking analysis (Fig. 1c)47. Across DMRs, over five million differentially methylated CG sites were identified in total (Supplementary Data 5), 412,202 of which were in consensus with all methods (Fig. 3j). DMR length distribution was similar for Amethyst and MethSCAn (Fig. 3k) because both overcome sparsity challenges by first binning over short tiled genomic regions. ALLCools, on the other hand, has a site-based DMR identification approach and therefore identified much smaller regions on average (Fig. 3k). This method provides exceptional resolution but requires very high data coverage. While Amethyst’s default high-confidence settings resulted in fewer total DMRs, they overlapped more frequently with known regulatory elements (Fig. 3l), suggesting a greater likelihood of biological relevance. Examples of DMRs over leucine-rich repeat and Ig domain containing 1 (LINGO1) and solute carrier family 6 member 1 (SLC6A1) illustrate a general consensus between methods (Fig. 3m). However, zooming in (Fig. 3n) shows MethSCAn is more lenient (using default settings) and the site-based approach of ALLCools may result in division of functionally similar loci. Altogether, Amethyst produces higher-confidence DMRs than other methods using default settings. Users can readily adjust parameters (such as multiple testing method, fold change threshold, and distance between windows to be considered the same locus) according to what is most suitable for each individual dataset.
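The difference between the high-confidence (Bonferroni) and medium-confidence (Benjamini–Hochberg) settings mentioned above comes down to how raw p-values are adjusted. The sketch below shows the two standard procedures in plain Python, with hypothetical function names and toy p-values. It illustrates the corrections themselves and is not code from Amethyst.

```python
def bonferroni(pvals):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (false discovery rate control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy p-values for illustration only.
toy_pvals = [0.01, 0.04, 0.03, 0.005]
n_bonferroni = sum(p < 0.05 for p in bonferroni(toy_pvals))
n_bh = sum(p < 0.05 for p in benjamini_hochberg(toy_pvals))
```

With these toy values, Bonferroni keeps two of the four tests at a 0.05 threshold while Benjamini–Hochberg keeps all four, mirroring the trade-off in the text between fewer, higher-confidence DMRs and a larger, more permissive set.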

Amethyst resolves distinct non-CG methylation signatures in human astrocytes and oligodendrocytes

Differential mCH testing for regions (Supplementary Data 4) and gene bodies (differentially methylated genes (DMGs); Supplementary Data 6) in our brain dataset produced an unexpected finding: while the vast majority of hyper-mCH features were identified in neurons as anticipated (99.9% of hypermethylated regions and 99.0% of hypermethylated genes), Amethyst also uncovered hits in astrocytes and oligodendrocytes (Fig. 4a). This phenomenon was reported in 2013 by Lister and colleagues10 and then largely overlooked due to the relative abundance of mCH in neurons. Since this previous work was performed using bulk bisulfite analysis in NeuN+ vs. NeuN− fractions, we leveraged the single-cell nature of sciMET technology to determine the glial subtype of origin. Fourteen genes were identified as hypermethylated in glia in both analyses (Fig. 4b). Of these consensus results, hits were typically driven by one cell type, demonstrating the utility of a single-cell perspective. For example, gamma subfamily C clustered protocadherins PCDHGC3, PCDHGC4, and PCDHGC5 were only hypermethylated in oligodendrocytes, while other hits were predominantly hypermethylated in astrocytes (Fig. 4b).

Fig. 4. Amethyst resolves distinct non-CG methylation signatures in human astrocytes and oligodendrocytes.


a Dot plot showing the number of differentially methylated regions (DMRs) and genes (DMGs) in the mCH context identified by type (color). Size corresponds to the percentage of identified DMGs unique to that group. Both axes are shown on a log10 scale. See “Methods” for testing parameters, Supplementary Data 4 for all DMRs, and Supplementary Data 6 for all DMGs. b Dot plot showing mean %mCH values per cell type of glial hyper-mCH genes identified in both this study and Lister et al.10 (n = 14). Size indicates mean gene body %mCH; color represents %mCH normalized by the mean for all protein-coding genes within each respective subtype. c heatMaps of top four astrocyte-specific hyper-mCH regions identified using a variation of a two-sided Fisher’s exact test from methylKit60 with Bonferroni adjustment (see “Methods”). Color represents mean %mCH over 500 bp windows. Gene bodies are plotted below with exons in black and putative promoter regions (start ±1500 bp) in pink. Hypermethylated regions identified in all groups are plotted below. d Top: %mCH levels of astrocyte-specific hyper-mCH gene BMI1 (padj = 1.45E−14) superimposed over UMAP embeddings generated in Fig. 2b. Bottom: mean gene body %mCH of all 53 astrocyte-specific hyper-mCH genes normalized by mean %mCH of all protein-coding genes. Hyper-methylated genes were identified using a two-sided Wilcoxon rank-sum test with Bonferroni adjustment. e heatMaps of top four oligodendrocyte-specific hyper-mCH regions; parameters described in (c) apply. f Top: %mCH levels of oligodendrocyte-specific hyper-mCH gene BROX (padj = 8.00E−12) superimposed over UMAP embeddings generated in Fig. 2b. Bottom: Mean %mCH of five oligodendrocyte-specific hyper-mCH genes normalized by mean %mCH of all protein-coding genes. Hypermethylated genes were identified using a two-sided Wilcoxon rank-sum test with Bonferroni adjustment. 
g Mean normalized transcript abundance in a paired RNA + methylation human brain atlas16 of hyper-mCH genes for each cell type. Columns have RNA data from Luo et al.16; rows show hyper-mCH gene sets per type identified in this study. See “Methods” and Supplementary Data 1.

We next examined the most prominent instances of noncanonical hypermethylation in glia. Top hyper-mCH regions for the astrocyte group included cortactin binding protein 2 (CTTNBP2; padj = 1.52E−22), a neuron-enriched protein involved with dendritic spine formation63; and neural transcription factor POU class 4 homeobox 1 (POU4F1; padj = 3.30E−11; Fig. 4c). A top result for hyper-mCH over the entire gene body was BMI1, which encodes a polycomb ring finger component of a chromatin remodeling complex with widespread roles in neurodevelopment (padj = 1.45E−14; Fig. 4d). Strongest hyper-mCH regions for oligodendrocytes (Fig. 4e) included aforementioned clustered protocadherins PCDHGC3-5 (padj = 2.17E−25) with reported roles in regulating cortical interneuron programmed cell death64; and BROX (padj = 1.46E−34), a nuclear envelope-associated factor involved in mitotic membrane reassembly (also a hyper-mCH gene in Fig. 4f; padj = 8.00E−12). Visualization of methylation patterns across these regions shows that glia hyper-mCH is not typically bound by gene bodies (Fig. 4c, e) and is often decoupled from changes in mCG (Supplementary Fig. 3), suggesting more nuance to mCH deposition than previously appreciated.

While prior research has shown that gene body mCH is anti-correlated with gene expression in neurons65, subtype-specific mechanisms in glia have rarely been investigated, aside from Lister and colleagues’ report in bulk NeuN− tissue10. We next quantified relative expression of hyper-mCH genes by leveraging the paired transcriptomic information published with the atlas reference used to annotate cell types16. This RNA dataset captured 40/53 hyper-mCH genes in astrocytes and 2/5 hyper-mCH genes in oligodendrocytes (BROX and ARF-like GTPase 4C, ARL4C), as protocadherins are expressed at very low levels. We found that hyper-mCH genes in astrocytes and oligodendrocytes are lowly expressed in their respective classes (Fig. 4g and Supplementary Data 1), supporting the hypothesis that mCH plays a similarly repressive role in glia as it does in neurons10. Intriguingly, BROX and ARL4C were expressed at the highest levels in a subtype of CGE-derived inhibitory neurons that express the canonical oligodendrocyte marker myelin basic protein (MBP), extending the hypothesis that mCH accumulates over key genes delineating closely related cell types15,18 to glia. Altogether, the hyper-mCH signatures resolved by Amethyst support a biologically relevant, repressive role for noncanonical methylation in both astrocytes and oligodendrocytes.

Atlas-scale brain data demonstrates common features of non-CG methylation in glia and neurons

Since protocols such as sciMET23–26 easily produce up to a million single-cell methylomes, we next tested how Amethyst handles atlas-scale data, a limitation of many other single-cell methylation analysis packages. We analyzed data from Brodmann Area 46 (BA46) across four human individuals and two sequencing platforms produced using sciMETv326. Amethyst was capable of processing mCG and mCH data from all 145,219 cells that passed filtering, cleanly separating out distinct neural populations (Fig. 5a, b) without sequencing platform or sample origin bias (Fig. 5c, d). Benchmarking initial processing steps with successively down-sampled populations revealed a linear relationship between processing time and number of cells (Fig. 5e), on the order of nanoseconds per cytosine. If further efficiency is desired, we also developed Facet36 (Fig. 1), a Python-based helper package that performs computationally intensive calculations of methylation levels over features and writes the output to the parent hdf5 file. This improvement will allow Amethyst to accommodate the rate of dataset expansion as the field evolves.

Fig. 5. Atlas-scale brain data demonstrates common features of non-CG methylation in glia and neurons.

Fig. 5

a dimFeature plot showing distribution of each cell type (represented by color) from an atlas-scale dataset of n = 145,219 human middle frontal gyrus cells superimposed over a UMAP projection26. b dimFeature plot showing the distribution of global %mCH values superimposed over UMAP embeddings, demonstrating clear separation of neurons and glia. Color is on a natural log scale. c Bar chart colored by cell type composition for each adult donor (n = 4; x axis) separated by sequencing platform. For a more detailed color legend, see (a). Local Inverse Simpson’s Index suggests data integrates well between platforms (LISIplatform = 1.73; IQR = 1.50–1.90). d Results from (a) divided by individual donor; see (a) for more details (LISIsample = 2.72; IQR = 2.18–3.20). e Time and memory requirements for processing successively downsampled datasets of the top 100,000 coverage cells in (a) (x̄ = 3.95 M cytosines/cell). Top: seconds to complete each process for each dataset size. Color represents step; line type corresponds to methylation context. Both axes are on a log10 scale. All steps except clustering (IRLBA37, Rphenograph39, and UMAP42 steps combined) were run with 10 cores. Bottom: maximum resident set size in gigabytes for chromosome indexing and window generation (mCH) plus clustering steps (mCG) when using 10 cores. Line color indicates methylation context and dot size represents the average number of nanoseconds required for processing each cytosine. f Gene body %mCH levels of BROX (red) and CTTNBP2 (green) in atlas-scale data superimposed over UMAP embeddings. Values are normalized by the average %mCH of all protein-coding genes with a maximum of 4. In atlas data, BROXmCH_norm is elevated in oligodendrocytes (mean ± SE: 3.46 ± 0.05 vs. 0.73 ± 0.01; W = 482049861, p < 2.2E−16 using a two-sided Wilcoxon rank-sum test), and CTTNBP2mCH_norm is elevated in astrocytes (mean ± SE: 2.17 ± 0.01 vs. 
0.69 ± 0.002; W = 2054098960, p < 2.2E−16 using a two-sided Wilcoxon rank-sum test). g Bar chart showing mean %mCHH for each trinucleotide context and cell type in hyper-mCH genes (top) and those not identified as hypermethylated (bottom; see “Methods”). x axis shows normalized proportion out of 1. h Bar chart showing the number of differentially methylated genes (DMGs) identified between individuals (n = 435) for each chromosome (columns) and sample (fill color). Statistics were calculated using a two-sided Wilcoxon rank-sum test with Bonferroni correction; see “Methods” and Supplementary Data 7. i Scatterplot of each result shown in (h). x axis shows log2 fold-change of the mCH levels for one individual compared to the other three within the same cell type; y axis shows the adjusted p value (see “Methods”) on a log10 scale. The point color represents the individual; the gene label color represents the cell type in which the DMG was identified. j heatMap showing mCH levels over gene USP9X. Color represents mean %mCH over 500 bp windows. Gene bodies are plotted below with exons in black and putative promoter regions (start ±1500 bp) in pink. Results are averaged by individual (major rows) and further subdivided by cell type (minor rows).

Next, we asked whether this atlas-scale brain dataset supported our findings of astrocyte- and oligodendrocyte-specific hyper-mCH signatures. Consistent with our earlier conclusions, oligodendrocytes had elevated mCH levels over BROX (pWilcoxon < 2.2E−16) and astrocytes had increased mCH over CTTNBP2 (pWilcoxon < 2.2E−16; Fig. 5f), despite having much lower global mCH levels than neurons (Fig. 5b). Since the mCH classification broadly encompasses any cytosine not followed by a guanine, we leveraged this large-scale dataset to determine if glia utilized a different trinucleotide noncanonical methylation context from neurons, which are known to be enriched for mCAC10,12,66. Discrimination of all 12 mCHH trinucleotide possibilities in hyper-mCH genes revealed that the CAH sequences, particularly CAC, were most abundantly methylated regardless of cell type (Fig. 5g). However, genes without hyper-mCH in glia had mean %mCHH ratios that converged more towards a random even distribution, providing further support that this unique hyper-mCH subset in glia is biologically relevant (Fig. 5g). Altogether, results support the hypothesis that glia utilize noncanonical methylation in a similar manner to neurons.

We next asked whether Amethyst could detect differences in gene body mCH between individuals, in addition to cell types. Between the four samples from our atlas-scale dataset, we identified a total of 453 DMGs. The majority (91%) of hits were genes residing on the X chromosome (Fig. 5h), indicating strong sex-specific differences in noncanonical methylation patterns. The top differentially methylated candidate among both neurons and glia was over the ubiquitin-specific peptidase 9 X-linked (USP9X) gene (Fig. 5i). Visualization of methylation patterns with the heatMap function supported a strong hyper-mCH enrichment specifically over USP9X in all female neural populations, while the surrounding X chromosome region was less methylated overall (Fig. 5j). This pattern was unique to the noncanonical context and not observed at mCG sites (Supplementary Fig. 4). Intriguingly, many top hits, including USP9X and lysine demethylase 6A (KDM6A), are known to escape X-inactivation10,67,68. This finding reinforces the hypothesis that non-CG methylation provides an additional layer of repressive epigenetic control over X-inactivation-escaping genes10. The absence of hyper-mCH over USP9X in mesoderm-originating microglia also further suggests that noncanonical methylation in ectoderm-originating astrocytes, oligodendrocytes, and neurons shares common principles, despite the historical emphasis of investigating non-CG methylation in a neuronal context.

Discussion

While other single-cell modalities such as RNA and ATAC have a plethora of robust tools available to aid users in data interpretation29–35, platforms for single-cell methylation analysis have lagged far behind, necessitating significant computational expertise to interact with the data. Here we aim to lower this barrier with Amethyst, the first comprehensive R package for high-throughput single-cell methylation sequencing data analysis. Amethyst provides a flexible workflow suitable for a variety of model systems to fully analyze data from base-level information through interpretation of DMRs.

During our benchmarking analysis, many alternative single-cell methylation analysis tools failed to handle our modest test dataset of 1346 cells. Only three packages had the bandwidth to process the throughput of data produced by sciMETv3 technology: Amethyst, MethSCAn, and ALLCools. MethSCAn robustly identified VMRs and DMRs in our relatively small, high-coverage test dataset. Since Amethyst provides a flexible data structure, users can readily integrate dimensionality reductions provided by MethSCAn VMRs into the object for downstream processing. However, initial processing steps for MethSCAn are quite computationally intensive for large datasets, particularly in the mCH context. While ALLCools is capable of processing mCG and mCH for hundreds of thousands of cells, its site-based DMR identification approach necessitates incredibly high-coverage data. In addition, implementation requires extensive Python expertise, and viewing of methylation tracks may involve data export to an additional program. Amethyst can process similarly large datasets in both contexts. It robustly identifies DMRs, which can be rapidly assessed using methylation-specific visualization functions directly in RStudio, eliminating cumbersome intermediate steps.

Using Amethyst, we identified astrocyte- and oligodendrocyte-specific hyper-mCH patterns, a noncanonical methylation context that has been historically underappreciated in glia. We found that gene-specific hyper-mCH accumulates in mature glia of ectodermal lineage but not mesoderm-originating microglia. Glial hypermethylation at non-CG sites accumulates in both sex-specific patterns—for example, over genes escaping X-inactivation in females; and in a cell type-specific manner—such as the elevated mCH observed over PCDHGC3-5 in oligodendrocytes, but not astrocytes. Similar to what has been shown of neuronal mCH, hyper-mCH in glia tends to accumulate across critical neural genes in a manner anticorrelated with expression10,18. For example, CTTNBP2 in astrocytes has elevated mCH and low levels of expression, but the opposite pattern is observed in neuronal populations, where CTTNBP2 plays critical roles in dendritic spine formation and maintenance63. This observation bolsters the hypothesis that non-CG methylation plays a unique role in fine-tuning gene expression between closely related cell types in the brain18 and suggests the principle also applies to glia, in agreement with Lister and colleagues10.

Upon deconvolution of the trinucleotide possibilities contained within the mCHH classification, we found that astrocytes and oligodendrocytes have context frequency distributions similar to what has been previously shown in neurons. mCA contexts are most abundantly methylated, particularly mCAC, with little methylation observed in other contexts aside from mCTC12,13,66. This pattern strongly coincides with the binding capacity for the only known reader of mCH, methyl-CpG binding protein 2 (MeCP2)66,69, loss of which causes the devastating neurodevelopmental disorder Rett syndrome70,71. Symptoms of Rett syndrome include loss of ability to coordinate purposeful movement, repetitive hand motions, severe intellectual disability, seizures, breathing issues, and a myriad of other debilitating problems70,72. An abundance of evidence points to mCH being critical to Rett pathology: symptom onset occurs just after the mCH accumulation plateau; mice with a MeCP2 methyl binding domain that cannot bind to mCH (but can still bind mCG) exhibit severe Rett-like phenotypes73; and genes that acquire elevated mCH after birth are preferentially misregulated in mouse models74. While neurons have often been the focus of Rett research due to the higher relative abundance of MeCP2 and mCH, our results support hypotheses of glial dysfunction in Rett syndrome75–77 and suggest that, as in neurons, inability to bind mCAC contributes to glial pathology.

In addition to the biological questions revealed by this analysis, there are many opportunities for expanding Amethyst’s computational toolkit. Areas for future development include automating the clustering and annotation process, broadening the compatibility to a wider set of species, improving methods for handling missing data, generating multimodal analysis tools, and expanding the DMR test to include covariates. Our inter-individual analysis underscores the importance of including sex as a covariate in future larger-scale efforts to identify autosomal regions with variable non-CG methylation levels, as our dataset with four individuals was limited in power. Utilities are regularly being added and will continue to be optimized as the body of single-cell methylation literature expands. This will be catalyzed by Amethyst and other efforts to make single-cell methylation analysis more accessible, enabling a deeper understanding of how methylation guides biological processes across a multitude of contexts.

Methods

PBMC and brain tissue processing

Example datasets were prepared as described previously24–26. All ethical regulations relevant to human research participants were followed. In brief, human banked peripheral blood mononuclear cells (PBMCs) were purchased from the St. Charles River Labs (PB009C-1/D340161). PBMCs from one F71 (sex/age) individual were taken through the splint ligation version of the sciMETv2 protocol24 and enriched for regulatory regions (sciMET-cap) using the Twist Human Methylome Panel from Twist Bioscience (105520). Please see Acharya et al.25 for a detailed description of the sciMET-cap workflow and PBMC data processing methods.

Human middle frontal gyrus tissue used in Figs. 1–4 was obtained as de-identified specimens from the Oregon Brain Bank, which is overseen by the OHSU Institutional Review Board. Samples were consented for genetic data sharing and genomic data under restricted access for research purposes only. Middle frontal gyrus samples were processed using either the linear amplification or splint ligation version of the sciMETv2 protocol24. Please see Nichols et al.24 for an extensive description of the sciMETv2 workflow and brain data processing methods.

The four human middle frontal gyrus (Brodmann Area 46) tissues used in Fig. 5 were obtained from the NIH NeuroBiobank as a part of the NIH BRAIN Initiative Cell Atlas Network (BICAN) collections. Data were consented for open data release. Ethical oversight was carried out by the OHSU Institutional Review Board. Individuals were F38, M48, F27, and M40. Tissue was processed using the splint ligation version of sciMETv3, which utilizes a third round of barcoding to exponentially increase throughput26. Detailed sciMETv3 workflow and sample processing methods are available in our corresponding manuscript26.

Initial processing of FASTQ files

Base-level methylation calls were determined from raw sequence reads using Premethyst, a series of custom Perl scripts (https://github.com/adeylab/premethyst). Within Premethyst, reads were first demultiplexed using unidex (https://github.com/adeylab/unidex) by matching each specified index to a whitelist of expected barcodes, allowing a Hamming distance of 2. Bases 9–29 of read 2 were then removed, as they contain the transposase mosaic end recognition sequence. Reads were trimmed for adapter sequences using fastq_trim.pm, followed by alignment with BSBolt (v1.4.8) via the wrapper script fastq_align.pm. This script runs the aligner with reads 1 and 2 swapped due to the opposite configuration of sciMET adapters compared to traditional bisulfite sequencing adapters, followed by read sorting by name (cell barcode). PCR duplicates were removed for each cell using bam_rmdup.pm and then methyl calls for each context (mCG and mCH) were extracted using bam_extract.pm. All calls were then wrapped into one .h5 file using calls2h5.py. The .h5 file is organized such that the first level contains groups named for each methylation context, the second level contains groups for each cell barcode, and the third level contains datasets with corresponding base-call information for each captured cytosine. Other pipelines that produce this output file structure are equally compatible with Amethyst.
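The whitelist matching step described above can be illustrated with a minimal Python sketch. This is not the unidex implementation; the function names, the example barcodes, and the rule of rejecting reads that tie between two whitelist entries are assumptions made only for illustration.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def match_barcode(observed: str, whitelist: Iterable[str], max_dist: int = 2) -> Optional[str]:
    """Return the closest whitelist barcode within max_dist, or None.

    Assumption of this sketch: a read whose best distance is shared by
    two or more whitelist entries is treated as ambiguous and rejected.
    """
    best, best_dist, tied = None, max_dist + 1, False
    for candidate in whitelist:
        if len(candidate) != len(observed):
            continue  # skip barcodes of a different length
        d = hamming(observed, candidate)
        if d < best_dist:
            best, best_dist, tied = candidate, d, False
        elif d == best_dist and d <= max_dist:
            tied = True
    return None if tied else best
```

With a hypothetical whitelist of `["ACGTACGT", "TTTTCCCC"]`, the read `"ACGTACGA"` is one mismatch from the first entry and would be assigned to it, whereas `"GGGGGGGG"` exceeds the allowed distance to both entries and would be discarded.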

Amethyst benchmarking analysis

For the benchmarking comparisons in Fig. 1c, we ran key steps individually in separate scripts and tracked memory and time statistics using /usr/bin/time -v. Only the essential components from the Nichols et al. brain dataset24 were loaded for each workspace. The following steps were benchmarked for preparation: indexChr(brain, type = “CG”, threads = 10), makeWindows(brain, type = “CG”, stepsize = 100000, metric = “percent”, threads = 10, nmin = 2); for clustering: runIrlba(brain, genomeMatrices = “cg_100k”, dims = c(26)), runUmap(brain, neighbors = 25, dist = 0.1, method = “euclidean”); and for DMR identification: calcSmoothedWindows(neurons, type = “CG”, threads = 10, step = 500, smooth = 3, groupBy = “major_type”), testDMR(neurons_sum_matrix, comparisons = exc_vs_inh, nminTotal = 10, nminGroup = 10), high confidence DMRs = filterDMR(neuron_dmrs, method = “bonferroni”, pThreshold = 0.01, logThreshold = 1.5), collapseDMR(maxDist = 1000, minLength = 2000); medium confidence DMRs = filterDMR(method = “BH”, pThreshold = 0.01, logThreshold = 1.25), collapseDMR(maxDist = 1000, minLength = 1000, reduce = T, annotate = T). DMR results are in Supplementary Data 5.

For benchmarking analysis on atlas-scale data, we first made an object using the top 100,000 highest coverage cells at an average of 3.95 million cytosines captured. Each successive decreasing order of magnitude was randomly sampled from this dataset using base R’s sample function. We then ran separate scripts for each downsampled dataset and tracked memory and time statistics using /usr/bin/time -v. For each mCG subset, we benchmarked indexChr(type = “CG”, threads = 10), makeWindows(stepsize = 100000, type = “CG”, metric = “score”, threads = 10, nmin = 2), runIrlba(genomeMatrices = “cg_100k_score”, dims = c(10)), regressCovBias(reduction = “irlba”), runCluster(k = 50, method = “louvain”), runUmap(neighbors = 30, dist = 0.1, method = “euclidean”, reduction = “irlba”), calcSmoothedWindows(type = “CG”, threads = 10, step = 500, groupBy = “cluster_id”), testDMR(eachVsAll = TRUE, nminTotal = 10, nminGroup = 10), filterDMR(method = “bonferroni”, pThreshold = 0.01, logThreshold = 2), and collapseDMR(maxDist = 2000, minLength = 1500). For each CH subset, we benchmarked indexChr(type = “CH”, threads = 10), makeWindows(stepsize = 100000, type = “CH”, metric = “percent”, threads = 10, nmin = 2), and makeWindows(genes = protein_coding, type = “CH”, metric = “percent”, threads = 10, nmin = 2). Benchmarking values can be found in Supplementary Data 1. All Amethyst benchmarking analysis was performed using v0.0.0.9000.

Facet benchmarking analysis

Benchmarking comparisons in Fig. 1c; Supplementary Data 1 were performed with Facet v0.1.1. We first subset the hdf5 file containing the Nichols et al. brain dataset24 to passing cells only. We then applied /usr/bin/time -v to collect memory and time statistics for each aggregation command. We benchmarked the following steps: facet agg -c CG -u 100000 = 100000 -p 10 for 100 kb window feature generation and facet agg -c CG -u 500 = 1500:500 -p 10 to generate the 1500 × 500 bp sliding window matrix needed for DMR input. Because this step does not change depending on clustering, we classified it as part of the preparation phase.

ALLCools benchmarking analysis

We applied ALLCools28 on our test brain dataset of 1346 cells24 based on tutorials provided at https://lhqing.github.io/ALLCools/. First, methylation calls were extracted from deduplicated, position-sorted bam files using a custom Python script modified from the ScaleMethyl pipeline (v1.2.3), which can be found at https://github.com/ScaleBio/ScaleMethyl. Parquet files for passing cells were converted to ALLC files using a separate custom Python script modified from the ScaleMethyl pipeline. Following successful generation of indexed ALLC files and the sample table, memory and time metrics were calculated with /usr/bin/time -v for the allcools generate-dataset command with parameters --obs_dim cell --cpu 10 --chunk_size 50 --regions chrom 100k 100000 --quantifiers chrom100k count CG.

After generation of the .mcds file, we followed the Github tutorials for initial analysis. We did not vary parameters from the suggested defaults whenever possible. We used “chrom100k” as var_dim and “cell” as obs_dim. We then loaded the data with MCDS.open and added cell metadata and mean coverage per 100 kb window as features. The dataset was filtered using the same metrics that were applied to the dataset in Amethyst of global %mCH <12 and total cytosines covered between 6 million and 120 million. We excluded regions in the hg38 ENCODE blacklist with black_list_fraction = 0.2 as suggested by ALLCools developers. Chromosomes chrM and chrY were also excluded. Top 20,000 variable 100 kb CG windows were calculated with mcds.calculate_hvf_svr. The AnnData object was transformed with log_scale. Principal component analysis was performed on highly variable windows using Scanpy. The number of passing components was calculated with significant_pc_test and pc_cutoff = 0.1. This was then scaled and used for Scanpy Leiden-based clustering with default parameters knn = 2 x log2(# cells), resolution = 1 and projection with tsne(metric=’euclidean’, exaggeration = -1, perplexity = 30). Time and memory metrics were calculated for mcds.calculate_hvf_svr, sc.tl.pca, and tsne using time.time() and memory_usage(retval = TRUE, max_usage = TRUE) commands. Benchmarking results are in Supplementary Data 1.

For DMR analysis benchmarking, we first pseudobulked ALLC files with allcools merge-allc to generate one file for excitatory neurons and one for inhibitory neurons. Strands were merged according to the suggested approach with allcools extract-allc --strandedness merge. We next calculated differentially methylated sites between the two files with call_dms(--cpu = 10, max_row_count = 50, n_permute = 3000, min_pvalue = 0.01). Time and memory metrics were captured using time.time() and memory_usage(retval = TRUE, max_usage = TRUE) commands. Finally, we collapsed DMS sites to DMRs using call_dmr78,79. Parameters were kept as default whenever possible (p_value_cutoff = 0.001, frac_delta_cutoff = 0.2, max_dist = 250, residual_quantile = 0.6, corr_cutoff = 0.3, dms_ratio = 0.8). In R, an adjustment for multiple testing was applied with p.adjust(method = “BH”) from the stats (v4.3.0) package. DMR results are in Supplementary Data 5.
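For readers unfamiliar with the Benjamini-Hochberg (BH) adjustment applied here, the following Python sketch reproduces the standard step-up procedure: each p value is scaled by n divided by its ascending rank, and monotonicity is enforced from the largest p value downward. The function name is hypothetical, and the sketch is meant as an illustration of the procedure rather than a replacement for p.adjust.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(pvalues)
    # Indices sorted from the largest p value to the smallest.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0  # adjusted values are capped at 1
    for position, idx in enumerate(order):
        rank = n - position  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, the p values 0.01, 0.04, 0.03, and 0.005 adjust to 0.02, 0.04, 0.04, and 0.02, respectively.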

MethSCAn benchmarking analysis

We implemented MethSCAn47 on our test brain dataset of 1346 cells24 based on tutorials provided at anders-biostat.github.io/MethSCAn/tutorial.html. First, we converted .cov files for passing cells produced by the Premethyst output to the starting format for MethSCAn using a custom awk script. We then followed the tutorial to carry out each command separately using /usr/bin/time -v to collect summary metrics. We used default parameters for (1) methscan prepare, (2) methscan smooth, (3) methscan scan --threads 10, and (4) methscan matrix --threads 10.

For clustering, we implemented the suggested methods on the MethSCAn Github tutorial page. The output VMR matrix was loaded into R. We then used their prcomp_iterative function for dimensionality reduction, which uses irlba::prcomp_irlba (v2.3.5.1). We calculated 26 components for consistency. Next, we ran umap (v0.2.10.0; method = “naive”, dims = 2, n_components = 2, n_neighbors = 30, min_dist = .1, metric = “euclidean”). Parameters were chosen to match methods applied to the 100 kb mCG score matrix. Benchmarking results can be found in Supplementary Data 1.

For DMR analysis, we ran the command methscan diff --threads 10 --bandwidth 1000 --stepsize 500. These changes from the default were applied to make the output more comparable with settings used for Amethyst. Resulting DMRs were further filtered for those containing at least 10 CG sites and having at least 10% of cells with values in each group (n_sites > 10 & n_cell_group1 > 31 & n_cells_group2 > 18). DMR results are in Supplementary Data 5.

scMET benchmarking analysis

We applied scMET48 analysis to our test dataset of 1346 brain cells24 according to the suggested methods described in the tutorial at rpubs.com/cakapourani/scmet-analysis. First, we used the example data to guide the generation of the scMET object, which involved calculating total positions and methylated positions for each feature per cell. To make the analysis most comparable, we used 100 kb mCG windows as features. We also calculated CG density per feature for the covariate matrix value X. After construction of the inputs, we used /usr/bin/time -v to collect memory and time statistics for the scmet(Y = cg_windows, X = X, L = 4, iter = 1000, seed = 12, n_cores = 10) command. We then used the suggested defaults to calculate variable features with scmet_hvf(delta_e = 0.75, evidence_thresh = 0.8, efdr = 0.1). These parameters resulted in only nine features being identified as variable. We acknowledge that scMET was intended for much smaller features and would likely perform much better with optimized parameters. However, given that it took 481 minutes with a 14,146,736 kb max resident set size to run—and requires in-memory storage of the object—this was not a feasible option for regular implementation. Benchmarking results can be found in Supplementary Data 1.

BPRMeth/MELISSA benchmarking analysis

We attempted to implement BPRMeth50 and sister package MELISSA49 on 1346 brain cells24. However, the starting object structure requires in-memory storage at the single-cytosine level, which would be 1.8 billion CG sites and 38.4 billion CH sites for this modestly sized dataset. We were not able to get the starting object properly constructed. Given the threats to R’s internal vector limits, we did not pursue BPRMeth or MELISSA as viable options for processing large datasets.

Epiclomal benchmarking analysis

To implement Epiclomal51, we followed guidelines outlined at github.com/molonc/Epiclomal. We again used the test dataset produced by Nichols et al.24 and used 100 kb window boundaries for the required region file input. First, we converted .cov files for passing cells produced by the Premethyst output to the correct starting format using custom python and awk scripts. We modified the config.yaml file to match the correct paths as needed, but left everything as default when possible; for example, NUM_CELLS_CUTOFF = 5, MISS_PROP_CUTOFF = 0.95, and NLOCI_CUTOFF = 0.0. The pipeline was then run with snakemake -j 10. The snakemake file was slightly modified to gather time and memory statistics for each rule with /usr/bin/time -v. Using default settings, only 13 features were identified as variable. We acknowledge that more appropriate features and parameters could likely have improved results. However, given that Epiclomal took days to run on this modest test dataset, we did not pursue it as a viable option for processing datasets at the scale produced by sciMET assays. Benchmarking results can be found in Supplementary Data 1.

MOFA+ benchmarking analysis

Since MOFA+ factors are a type of dimensionality reduction, we also benchmarked the MOFA+ (v1.8.0) package for clustering. This multiomics-oriented method of clustering is well-suited for methylation analysis as different contexts—like mCG and mCH—can be considered different MOFA “views.” For benchmarking purposes in Fig. 1c, we used 100 kb window mCG score values as the only view. All windows were used to facilitate the most direct comparison to other methods. We implemented create_mofa under default settings to create the object after getting it in the right starting format with tidyr (v1.2.1) and dplyr (v1.1.4). Default data options, model options, and training options were loaded. We then ran run_mofa(use_basilisk = TRUE).

For clustering results in Fig. 2a, b, we calculated MOFA+ factors over a subset of variable features, as recommended by the developers. We first identified variable 100 kb mCG score windows by removing windows with a low observation rate and then calculating the mean and standard deviation. The top 3000 100 kb windows with the highest standard deviation were then selected. After calculating MOFA+ factors from the 100 kb mCG score matrix subset to 3000 highly variable features (HVFs), we again ran run_mofa(use_basilisk = TRUE). UMAP coordinates from this output were projected using run_umap(n_neighbors = 25, min_dist = 0.1, metric = “euclidean”). Benchmarking results can be found in Supplementary Data 1.
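The variable-window selection described above can be sketched in Python as follows. The text does not report the observation-rate cutoff, so the `min_obs_rate` default below is a hypothetical placeholder, as are the function name and the dictionary-based input layout (window name mapped to per-cell values, with `None` marking missing data).

```python
import statistics
from typing import Dict, List, Optional


def top_variable_windows(
    matrix: Dict[str, List[Optional[float]]],
    n_top: int = 3000,
    min_obs_rate: float = 0.5,  # hypothetical cutoff; not stated in the text
) -> List[str]:
    """Rank windows by standard deviation across cells and keep the top n_top."""
    scored = []
    for window, values in matrix.items():
        observed = [v for v in values if v is not None]
        # Drop windows observed in too few cells to estimate variability.
        if len(observed) < 2 or len(observed) / len(values) < min_obs_rate:
            continue
        scored.append((statistics.stdev(observed), window))
    # Highest standard deviation first; window name breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [window for _, window in scored[:n_top]]
```

Only the windows returned by such a filter would then be passed on as the single mCG view for factor calculation.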

PBMC dataset dimensionality reduction and clustering

After base-level calls were wrapped into an hdf5 file, an empty Amethyst (v0.0.0.9000) object was generated with createObject. Intermediate Premethyst metadata files generated during initial processing containing coverage and global methylation metrics were added using helper functions addCellInfo and addAnnot. To select for high-coverage cells, the metadata slot was filtered to cells containing 10 M to 40 M cytosines captured using dplyr::filter (v1.1.2). The rows corresponding to each chromosome for every cell were cataloged using the indexChr function. Following chromosome indexing, methylation score was calculated over various genomic features using the makeWindows function. Score—a normalized measure of deviation from baseline—was calculated as: (mCGfeature – mCGglobal)/(1 − mCGglobal) if mCGfeature – mCGglobal > 0, and (mCGfeature –  mCGglobal)/(mCGglobal) if mCGfeature – mCGglobal < 0. Features chosen were fixed 100 kb genomic windows, previously identified DMRs (based on clusters generated using the 100 kb window method), VMRs identified with MethSCAn (see MethSCAn benchmarking methods), and MOFA+ factors generated from variably methylated 100 kb windows (see MOFA+ benchmarking methods). We also calculated %mCG for promoter regions using makeWindows where promoters were defined as the strand-aware start site indicated by the Gencode annotation file ±1500 bp. We applied a lax minimum observation threshold of two to all mCG genomeMatrices to alleviate bias issues generated from missing values. However, users can easily manipulate this threshold if a more stringent filter is desired.
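The score definition above can be written as a short Python function. This is a sketch of the stated formula only, not the makeWindows source; it assumes methylation levels are expressed as fractions between 0 and 1, and it returns 0 when the feature and global levels are equal, a case the text leaves implicit.

```python
def methylation_score(m_feature: float, m_global: float) -> float:
    """Normalized deviation of a feature's methylation from the cell's global level.

    Both inputs are fractions in [0, 1]. The result lies in [-1, 1]:
    positive when the feature is more methylated than the global level,
    negative when it is less methylated.
    """
    diff = m_feature - m_global
    if diff > 0:
        return diff / (1.0 - m_global)  # scale by the room above the global level
    if diff < 0:
        return diff / m_global  # scale by the room below the global level
    return 0.0  # assumption of this sketch: equal levels score 0
```

For a cell with a global mCG level of 0.8, a window at 0.9 scores 0.5 and a window at 0.4 scores −0.5, so hyper- and hypomethylation are placed on a comparable scale.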

For each set of genomic windowing approaches tested, runIrlba was used to calculate truncated singular values. We used 24 dimensions for 100 kb windows and 18 dimensions for DMR-based features based on suggested parameters calculated using dimEstimate. For example, based on a dimEstimate test of 40 dimensions, 24 explained 99% of the variance in 100 kb mCG score windows. 0 was substituted for missing data; no other imputation was used. The function regressCovBias was applied to each resulting matrix, which calculates a linear regression for each dimension against the natural log of cytosines covered and returns the residuals. This does not change the result for a high-coverage dataset like this one but was run to maintain consistency with our standard workflow. The IRLBA output was then used for runCluster(k = 30, method = “louvain”) to calculate cluster membership and runUmap(neighbors = 30, dist = 0.1, method = “euclidean”) to compute a UMAP. The resulting UMAP coordinates and membership assignments were stored in the Amethyst object and explored with plotting functions such as dimFeature.
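The coverage regression described above can be illustrated for a single reduced dimension with ordinary least squares. The sketch below follows only the description in the text (regress each dimension on the natural log of cytosines covered and keep the residuals); the function name is hypothetical and the actual regressCovBias implementation may differ in detail.

```python
import math
from typing import List, Sequence


def regress_out_log_coverage(
    component: Sequence[float], cytosines_covered: Sequence[int]
) -> List[float]:
    """Residuals of one reduced dimension after a linear fit to ln(coverage)."""
    x = [math.log(c) for c in cytosines_covered]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(component) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        # All cells have identical coverage: only the mean can be removed.
        return [yi - mean_y for yi in component]
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, component)) / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, component)]
```

Applying the function to each dimension in turn yields a matrix in which the linear dependence on log coverage has been removed.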

For promoter mCG analysis in Fig. 2f, %mCG was first calculated for all protein-coding genes with makeWindows(nmin = 2, promoter = TRUE). Mean promoter values per cell type were calculated with Amethyst’s aggregateMatrix function. We then used R’s scale(center = TRUE, scale = TRUE) function to calculate Z-score values. Please see the analysis code on Amethyst’s GitHub page (https://github.com/lrylaarsdam/amethyst) for further details.
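For readers outside R, the centering and scaling step corresponds to the following calculation on one vector of values. This sketch assumes the vector holds the mean promoter %mCG of one gene across cell types; which matrix dimension was scaled in the original analysis is not stated here, and the function name is hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Center to mean 0 and scale by the sample standard deviation (n - 1 denominator),
    which is the convention used by R's scale() with default centering."""
    mu = mean(values)
    sd = stdev(values)
    if sd == 0:
        raise ValueError("all values are identical; Z-scores are undefined")
    return [(v - mu) / sd for v in values]
```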

Brain dataset (Figs. 2–4) dimensionality reduction and clustering

In Amethyst (v0.0.0.9000), an object was initiated with createObject and intermediate output files containing coverage and global methylation metrics were added using functions addCellInfo and addAnnot. High-coverage cells (6 M to 100 M) were selected in the metadata slot using dplyr::filter (v1.1.2). For both the mCG and mCH contexts, the rows corresponding to each chromosome for every cell were cataloged using the indexChr function. These indexes were used when calculating mCG score over 100 kb windows and %mCH across 100 kb genomic regions using the makeWindows function. As with PBMCs, the score was calculated as: (mCGfeature – mCGglobal)/(1 − mCGglobal) if mCGfeature – mCGglobal > 0, and (mCGfeature – mCGglobal)/(mCGglobal) if mCGfeature – mCGglobal < 0. The minimum number of observations was again two for mCG windows to alleviate bias issues generated from missing values. We chose nmin = 5 for mCH windows due to the much higher number of genomic sites, and to alleviate skewing towards biologically impossible values that would occur with few observations.
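The effect of the minimum-observation threshold can be seen in a small sketch. It assumes that "observations" means the number of covered cytosine calls in a window for one cell, and the function name is hypothetical; it is not the makeWindows implementation.

```python
def window_percent_methylation(c_count, t_count, nmin):
    """Percent methylation for one cell in one window, or None if too few observations.

    c_count: methylated calls; t_count: unmethylated calls.
    nmin: minimum number of calls required to report a value.
    """
    total = c_count + t_count
    if total < nmin:
        # Too little evidence: treat the window as missing for this cell.
        return None
    return 100.0 * c_count / total
```

With a single call, the only possible values are 0% and 100%. For mCH, where true levels are typically a few percent, such windows would be strongly skewed, which is the rationale given above for the higher threshold.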

After calculation of methylation levels over fixed genomic windows, runIrlba was used to calculate truncated singular values for both mCG (dims = 26; explains 98.2% of the variance) and mCH (dims = 23; explains 99% of the variance) contexts. 0 was substituted for missing data; no other imputation was used. dimEstimate was used as a guide, but a higher threshold for mCH is needed, as a vast amount of variance is explained by the first two dimensions alone due to the large discrepancy of global %mCH between neurons and glia. We were also guided by prior knowledge of how many populations should roughly be present in the data set. The function regressCovBias was applied to each resulting matrix, which calculates a linear regression for each dimension against the natural log of cytosines covered and returns the residuals. This does not change the result for a high-coverage dataset like this one but was run to maintain consistency with our typical workflow. The IRLBA output was then used as input for runCluster(k_phenograph = 25, method = “louvain”) to calculate cluster membership and runUmap(neighbors = 25, dist = 0.1, method = “euclidean”) to compute UMAP coordinates. We used a smaller neighbors value than the PBMC dataset due to the relative dataset sizes and high heterogeneity in brain tissue. In this context, Louvain clustering resolved groups better than Leiden, but either method can be specified with runCluster.

For data in Fig. 2g, h, %mCH of all protein-coding gene bodies (defined as the start to end base position in the hg38 Gencode annotation file) was calculated with the makeWindows function using default parameters. We used the mean %mCH of all genes per type for normalization instead of global %mCH due to known differences in regional deposition. Because we were using the dim(blend = TRUE) plotting feature in Fig. 2h, the signal was smoothed by applying Amethyst’s impute function to the %mCH gene matrix, which applies Markov Affinity-based Graph Imputation of Cells with the Rmagic (v2.0.3.999) package. Unimputed results are shown in Supplementary Fig. 2.

Atlas-scale brain dataset (Fig. 5) dimensionality reduction and clustering

Atlas-scale human middle frontal gyrus data were processed as previously described26. In brief, after object generation with createObject, cells with fewer than 1 M or more than 20 M cytosines covered were removed from the metadata. mCG score and mCH percent were calculated over fixed 100 kb genomic windows using the makeWindows function. As with other analyses, the score was calculated as: (mCGfeature – mCGglobal)/(1 − mCGglobal) if mCGfeature – mCGglobal > 0, and (mCGfeature – mCGglobal)/(mCGglobal) if mCGfeature – mCGglobal < 0. runIrlba was then performed to collapse the mCG context to 10 dimensions and mCH to 12. Fewer dimensions were used than in other analyses because fewer dimensions described real biological signal before noise was introduced. This was due to factors such as lower sequencing depth, multiple sequencing platforms, and multiple sample sources. regressCovBias was applied to the resulting matrix, followed by Harmony38 to reconcile bias introduced by having data from two different sequencing platforms. runCluster(k = 50, method = “louvain”) and runUmap(neighbors = 30, dist = 0.01, method = “euclidean”) functions were used to resolve distinct populations. The k parameter was increased for runCluster from previous analyses due to the much larger dataset size. We removed two small clusters containing 2682 cells in total that had much lower coverage than average and were predominantly restricted to one sequencing platform. To test how well cells integrated across batches, Local Inverse Simpson’s Index (LISI) scores were calculated using the lisi (v1.0) package by applying the compute_lisi(perplexity = 50) command to UMAP embeddings shown in Fig. 5a.

Cell type annotation

Broad cell class for both PBMC and brain data was determined by visually assessing mCG status over canonical marker genes with functions heatMap and histograM. For brain data, both global %mCH levels and %mCH over canonical marker genes were also used to support classification. Further specification was achieved in brain data by correlation of gene body %mCH to a human brain single-cell methylome reference atlas16. Reference %mCH gene body data was downloaded from https://github.com/lhqing/snmCAT-seq_integration/input. HGNC symbols were mapped to Ensembl80 IDs using biomaRt80 (v2.56.1). Mean %mCH gene body levels were calculated for each cell type based on the provided public metadata. For our data, the mean gene %mCH data for each cluster was calculated with the Amethyst function aggregateMatrix. Next, we used stats::cor (v4.2.2) to calculate pairwise Pearson correlation coefficients for gene body %mCH between reference data cell types and our clusters for 1502 differentially expressed genes reported in BA1016. Results are plotted with pheatmap (v1.0.12) in Fig. 2i. The top annotation label was utilized if the correlation coefficient exceeded 0.8. For groups failing to meet this threshold, alternative methods like comparing mCG patterns over canonical marker genes were used to determine annotations.
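The correlation-based labeling rule can be sketched as follows. The data structures and function names are hypothetical; the sketch only illustrates computing Pearson correlations between one cluster's gene body %mCH profile and each reference cell type, then accepting the top label when the coefficient exceeds 0.8.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # Constant profile: correlation undefined, treated as 0 here.
    return sxy / math.sqrt(sxx * syy)


def annotate_cluster(cluster_profile, reference_profiles, threshold=0.8):
    """Return (label, r) for the best-correlated reference type.

    The label is None when the best coefficient does not exceed the threshold,
    signaling that another annotation method is needed.
    """
    best_label, best_r = None, float("-inf")
    for label, ref in reference_profiles.items():
        r = pearson(cluster_profile, ref)
        if r > best_r:
            best_label, best_r = label, r
    return (best_label if best_r > threshold else None, best_r)
```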

Doublet detection tool

High-throughput single-cell assays inevitably generate instances of multiple cells sharing the same barcode. The frequency of these instances—termed doublets—can be tuned, but at the cost of final yield. It is therefore optimal to have a computational method for doublet removal to facilitate analysis while maintaining throughput. Amethyst includes a tool for doublet detection specifically designed to accommodate sciMET data (Supplementary Fig. 1), adapting principles from approaches developed for other modalities.

The Amethyst doublet detection tool first involves the duplication of the original object. A tunable percentage of artificial doublets is generated, where the methylation values from the genomic window matrices are averaged across two randomly selected cells. A comparable number of dropout windows are retained to mitigate effects from increases in coverage. The new matrices are then taken through the dimensionality reduction process with runIrlba. The function buildDoubletModel uses the IRLBA output to build a prediction model using a random forest approach trained on the doublet object containing true cells vs. artificial doublets. The resulting model is then applied to the same object with the function predictDoubletScores. One can then compare the doublet score distributions between the simulated doublets and true cells to determine an appropriate doublet score threshold. The function addDoubletScores adds the resulting doublet score to the original object metadata. Failing cells can then easily be filtered out with methods such as dplyr::filter. An example in Supplementary Fig. 1 using the PBMC dataset demonstrates the efficacy of this approach. Artificial doublets had higher doublet scores (mean ± SE 0.73 ± 0.008 vs. 0.07 ± 0.001; p < 2.2E−16 using a two-sided Wilcoxon rank-sum test) and a higher fail rate (88.54% vs. 0.45%; p < 2.2E−16 using a two-sided Chi-square test) than true cells.
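The artificial doublet step can be illustrated with a short sketch. It is not the package's code: the function name and dictionary layout are hypothetical, and keeping a window missing whenever it is missing in either parent cell is one crude assumption for limiting the coverage gain of simulated doublets. The random forest training that follows is not shown.

```python
import random


def make_artificial_doublets(window_matrix, fraction, seed=0):
    """Simulate doublets by averaging window values from two randomly chosen cells.

    window_matrix: dict mapping cell ID -> list of window values (None = missing).
    fraction: number of doublets to create, as a fraction of the number of cells.
    """
    rng = random.Random(seed)
    cells = list(window_matrix)
    if len(cells) < 2:
        return {}
    n_doublets = int(round(fraction * len(cells)))
    doublets = {}
    for i in range(n_doublets):
        a, b = rng.sample(cells, 2)  # two distinct parent cells
        merged = []
        for va, vb in zip(window_matrix[a], window_matrix[b]):
            if va is None or vb is None:
                merged.append(None)  # assumption: retain dropout if either parent lacks data
            else:
                merged.append((va + vb) / 2)
        doublets[f"doublet_{i}"] = merged
    return doublets
```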

DMR analysis

PBMC and brain mCG DMRs (Supplementary Data 2 and 3, respectively) were calculated in the following manner for every cell group. First, the sum of member and all other c (methylated) and t (unmethylated) observations were pseudobulked in a 1500 bp × 500 bp sliding window matrix across the genome using the calcSmoothedWindows function. Windows with fewer than ten observations recorded in either member or non-member groups were removed. For the remaining contingency tables of c and t sums in member and nonmember cells, a variation of a two-sided Fisher’s exact test from methylKit60 was applied within testDMR to calculate a two-tailed p value approximation using hypergeometric distribution probabilities with a continuity correction. Adjustment for multiple testing was then performed with filterDMR using a Bonferroni correction (for each window × group tested). Results were filtered to a threshold of padj < 0.01 and |logFC| > 2. Neighboring significant windows (within the same group and direction of change) were collapsed into one functionally similar locus using collapseDMR with an allowance of up to 2000 bp between significant regions. We then filtered results to have a minimum length of 1000 bp for PBMCs and 2000 bp for brain. The higher threshold for the brain was applied for increased confidence considering the higher group number. For mCH DMR analysis in the brain (Supplementary Data 4), the same methods were used as for mCG, until the filtering stage. We employed the same padj < 0.01 threshold but slightly lowered the |logFC| threshold to >1.5 and increased the minimum length threshold to 2500 bp.
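The merging of neighboring significant windows can be sketched as an interval merge. This is an illustration under stated assumptions rather than the collapseDMR implementation: the tuple layout and function name are hypothetical, windows are assumed to belong to a single cell group, and windows of the opposite direction lying between two windows are assumed not to block a merge.

```python
def collapse_windows(windows, max_gap=2000, min_length=1000):
    """Merge significant windows that share chromosome and direction of change.

    windows: list of (chrom, start, end, direction) tuples for one cell group.
    max_gap: largest distance in bp allowed between windows being merged.
    min_length: merged regions shorter than this (in bp) are discarded.
    """
    merged = []
    ordered = sorted(windows, key=lambda w: (w[0], w[3], w[1]))
    for chrom, start, end, direction in ordered:
        if merged:
            last = merged[-1]
            same_track = last[0] == chrom and last[3] == direction
            if same_track and start - last[2] <= max_gap:
                last[2] = max(last[2], end)  # extend the current region
                continue
        merged.append([chrom, start, end, direction])
    return [tuple(m) for m in merged if m[2] - m[1] >= min_length]
```

For example, with 1500 bp windows stepped every 500 bp, significant windows at 0-1500, 500-2000, and 3500-5000 on the same chromosome and in the same direction collapse into a single region spanning 0-5000.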

We also often utilized a rank-based metric to mitigate biases introduced by selecting top hits using either minimum padj or max |logFC|. This method was applied to choose candidates for Fig. 3a, b. First, we ordered DMR results by ascending padj and added a column for rank by cell type and direction of change. Next, we ordered DMR results by descending |logFC| and added a second rank column for each cell type and direction of change. The minimum combined rank score was then used to select the strongest results.
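A sketch of the rank-based selection follows. The text does not state how the two ranks are combined; summing them is an assumption of this example, as are the function name and record layout. Ties are broken by input order here.

```python
def combined_rank(results):
    """Rank results for one cell type and direction by padj and by |logFC|, then combine.

    results: list of dicts with keys 'id', 'padj', and 'logFC'.
    Returns (combined_rank, id) pairs sorted so the strongest result comes first.
    """
    by_padj = sorted(results, key=lambda r: r["padj"])          # ascending padj
    by_logfc = sorted(results, key=lambda r: -abs(r["logFC"]))  # descending |logFC|
    rank_padj = {r["id"]: i + 1 for i, r in enumerate(by_padj)}
    rank_logfc = {r["id"]: i + 1 for i, r in enumerate(by_logfc)}
    return sorted((rank_padj[r["id"]] + rank_logfc[r["id"]], r["id"]) for r in results)
```

A result that is first by both criteria receives the minimum possible combined rank of 2, so neither a tiny padj with a modest effect nor a large effect with weak support dominates the selection on its own.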

GO analysis

The package topGO62 (v2.52.0) was implemented to determine if genes overlapping hypomethylated mCG DMRs were enriched for biological processes. Any length of overlap between the start and end site of the gene body was considered. For each cell type, we first constructed an enrichment analysis object using all overlapping genes as the query set and all protein-coding genes as background with nodeSize = 10. topGO::runTest was performed using the “elim” algorithm with statistic = “fisher”. As the elim algorithm inherently considers hierarchical relationships within the GO structure and eliminates redundant testing, the developers consider the resulting p values corrected62. After running topGO::GenTable(topNodes = 500, numChar = 60), results were filtered for pelim < 0.01, number of significant genes contributing to the term >5, and fold change (significant/expected) > 2.
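The post-hoc filtering of the enrichment table amounts to three conditions per term. The sketch below assumes each row carries the elim p value and the significant and expected gene counts; the key names are hypothetical and do not reproduce the column names returned by topGO::GenTable.

```python
def filter_go_terms(rows, p_max=0.01, min_significant=5, min_fold=2.0):
    """Keep GO terms passing the p value, gene count, and fold change thresholds.

    rows: list of dicts with keys 'term', 'p_elim', 'significant', 'expected'.
    Fold change is computed as significant / expected, as described in the text.
    """
    kept = []
    for row in rows:
        if row["expected"] <= 0:
            continue  # fold change undefined; skipped in this sketch
        fold = row["significant"] / row["expected"]
        if row["p_elim"] < p_max and row["significant"] > min_significant and fold > min_fold:
            kept.append({**row, "fold_change": fold})
    return kept
```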

We then applied a similar rank-based metric as was used for DMRs to mitigate biases introduced by selection of top results using either minimum pelim or maximum fold change. First, we ordered GO results by ascending pelim and added a column for rank by cell type. Next, we ordered GO results by descending fold change and added a second rank column for each cell type and direction of change. The combined rank score was then used as the “size” aesthetic for plots in Fig. 3g, h. Some of the top terms were generic in nature due to an extremely high number of annotations (e.g., “nervous system development”), or had an extremely high fold change due to a very low expected value. For plotting purposes in Fig. 3g, h, we therefore filtered annotated < 150 and expected > 0.3 to highlight more relevant terms. Full GO results can be found in Supplementary Data 1.

DMR analysis comparison between packages

For DMR analysis comparisons between ALLCools, MethSCAn, and Amethyst in Fig. 3i–n, we first calculated mCG DMRs between 184 inhibitory and 313 excitatory neurons (Fig. 3i) as described in each corresponding methods section. Full results can be found in Supplementary Data 5. To assess the extent of overlap, we first constructed genomic ranges from each list using the GenomicRanges (v1.52.0) package. GenomicRanges::reduce was applied to this list to generate a non-overlapping master list of each DMR captured. To obtain the percent overlap of DMRs in each package (Fig. 3j), we first applied the GenomicRanges::tile command to expand each DMR into single-nucleotide bases (yielding differentially methylated sites; DMS). All non-CG sites were filtered out by applying GenomicRanges::subsetByOverlaps to the DMS list and a range of all CG sites in the human genome, which were extracted from GRCh38.p13 (sense and antisense) using a custom Perl script. Lastly, the GenomicRanges::countOverlaps function was used in conjunction with ggvenn (from ggvenn v0.1.10) to calculate the percent of DMSs shared between each package. Results are shown in Fig. 3j.
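The site-level overlap calculation can be sketched with plain sets. This example mirrors the described logic (expand regions to single bases, keep CG positions, count shared sites) but not the GenomicRanges calls; the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and expressing the shared fraction relative to the union of both lists is an assumption.

```python
def cg_sites_in_regions(regions, cg_positions):
    """Expand regions to single bases and keep only positions that are CG sites.

    regions: iterable of (chrom, start, end), 1-based inclusive (assumed).
    cg_positions: set of (chrom, position) tuples for all CG sites considered.
    """
    return {
        (chrom, pos)
        for chrom, start, end in regions
        for pos in range(start, end + 1)
        if (chrom, pos) in cg_positions
    }


def percent_shared(sites_a, sites_b):
    """Percent of differentially methylated sites found by both packages,
    relative to all sites found by either (an assumption of this sketch)."""
    union = sites_a | sites_b
    return 100.0 * len(sites_a & sites_b) / len(union) if union else 0.0
```

Comparing at the level of CG sites rather than whole regions avoids penalizing packages that report the same locus with different boundaries.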

DMG analysis

For detecting DMGs in Fig. 4, we first defined gene bodies as the start to end base pair in the hg38 Gencode annotation file. If multiple entries tagged as “gene” had the same name, the longest entry (in bp) was used. Gene body %mCH for all protein-coding genes listed in the Gencode annotation file (n = 20,013) was calculated using the makeWindows function with a minimum threshold of 2 observations. Next, to alleviate potential biases introduced by the large disparity between group sizes, we normalized all groups to the lowest common denominator of n = 33. We did this while simultaneously avoiding information loss by dividing each cell type into 33 non-overlapping subgroups and calculating the mean %mCH per gene. We then performed a two-sided Wilcoxon rank-sum test for %mCH over each gene body in the group of interest vs. all other groups. This nonparametric test was implemented because the data were typically non-Gaussian due to the large discrepancy between global %mCH in neurons and glia. After all calculations were performed, p values were adjusted for multiple testing using the Bonferroni method. We applied various filters to rule out potential sources of noise: padj < 0.05, |logFC| > 1, number of members > 20 (out of 33), number of nonmembers > 200 (out of 363), and gene length > 3000. Finally, we also filtered out hypermethylated mCH genes where the mean of the member group was less than 2% mCH. Uniqueness of the result to the group was tested using dplyr (v1.1.4) logic to calculate the number of representations for each gene in the results table. Results are in Supplementary Data 6.
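The group-size normalization can be sketched for one gene in one cell type. Random assignment of cells to subgroups is an assumption of this example (the text only states that the subgroups were non-overlapping), and the function name is hypothetical.

```python
import random


def pseudo_replicate_means(values, n_groups=33, seed=0):
    """Split per-cell %mCH values into non-overlapping subgroups and average each.

    values: per-cell %mCH for one gene in one cell type (None = not observed).
    Returns n_groups means; a subgroup with no observed cells yields None.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)  # assumption: cells are assigned to subgroups at random
    means = []
    for i in range(n_groups):
        subgroup = shuffled[i::n_groups]  # every n_groups-th cell, so no cell is reused
        observed = [v for v in subgroup if v is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return means
```

Every cell type then contributes the same number of observations to the rank-sum test, while each observation still summarizes all cells assigned to it.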

For inter-individual DMG analysis in Fig. 5h, i, the same two-sided Wilcoxon rank-sum test with Bonferroni correction approach described above was applied to mCH levels over each protein-coding gene between cells from one individual vs. all three others. We did not apply the downsampling method described above because it introduced artifacts driven by populations with very low cell counts. Each test was run within the 19 distinct cell types. As before, we applied extensive filters: padj < 0.05, |logFC| > 1, number of members > 20, number of non-members > 20, gene length > 3000, and hypermethylated results had to be > 2% mCH. Finally, we also filtered out results from the chrY genes, because we only had two samples from male individuals. Future studies with more donors should be performed to obtain sufficient power to investigate regions with variable inter-individual non-CG methylation levels on chrY and autosomes. Results are in Supplementary Data 7.

Correlation of gene %mCH to RNA levels

To determine the expression status of hyper-mCH genes in glia, we downloaded the paired transcriptomic data from the human methylome atlas16 used for annotation from NCBI GEO accession GSE140493. We then used Seurat29–31 (v4.3.0.1) to construct an object with the count matrix using CreateSeuratObject(min.features = 200, min.cells = 10). We then applied NormalizeData and ScaleData using default parameters and extracted the resulting data from the “data” slot. Next, we calculated the mean expression of all genes for each group and mapped the HGNC symbols to Ensembl80 IDs using biomaRt80 (v2.56.1). Next, we filtered out genes where fewer than ten values were observed across all groups. Despite the normalization measures applied in Seurat, we found that groups had variable average transcription levels, so we re-normalized the summary count matrix to 10,000 total counts per group using a sweep function in base R.
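The final re-normalization step rescales each group's column so that it sums to 10,000. The sketch below shows the arithmetic with a nested dictionary standing in for the summary matrix; the function name and layout are hypothetical, and the original analysis used a sweep operation in base R.

```python
def renormalize_groups(matrix, target=10_000.0):
    """Rescale each group's mean expression values so they sum to `target`.

    matrix: dict mapping group name -> dict mapping gene -> mean expression.
    Groups whose values sum to zero are returned unchanged.
    """
    out = {}
    for group, genes in matrix.items():
        total = sum(genes.values())
        if total > 0:
            out[group] = {gene: value * target / total for gene, value in genes.items()}
        else:
            out[group] = dict(genes)
    return out
```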

After obtaining a normalized and annotated matrix, we then joined the matrix with our hyper-mCH results. We calculated the mean expression level of each set of genes (e.g., hyper-mCH genes in astrocytes, hyper-mCH genes in MGE-derived inhibitory neurons, etc.) for each annotated group in the transcriptomic dataset16. The result was then plotted using pheatmap (v1.0.12) in Fig. 4g.

Trinucleotide context analysis

To calculate the %mCHH values for each gene in the atlas-scale analysis in Fig. 5g, we first mapped all the trinucleotide mCHH positions in the hg38 genome using a custom Perl script. Both sense and antisense strand orientations were considered. We then used bedtools intersect (v2.29.1) to annotate each site with any overlapping genes. We next read mCH positions for each barcode into R and used data.table (v1.14.6) to annotate the positions with any corresponding trinucleotide contexts and overlapping genes. Finally, we used data.table syntax to calculate %mCHH per gene for each context and barcode. We only calculated values for genes identified as hyper-mCH in the analysis described in Fig. 4 (n = 5987) and 2651 other genes across chromosomes 1 and X. However, it is important to note that while these genes were not identified as hyper-mCH, %mCH was still frequently elevated in neurons relative to glial populations. After %mCHH was calculated for each cell, we determined summary metrics for each major type. Genes with > 70%mCH, < 3000 bp in length, and < 20 observations in a group were removed. We then used dplyr::summarise to calculate mean %mCHH per context, gene, atlas cell type, and hyper-mCH gene set identified in the test brain dataset of 1346 cells24.
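The context annotation and per-context summary can be sketched as follows. This is not the Perl or data.table code used in the analysis: the function names are hypothetical, positions are assumed to be 0-based on the forward-strand sequence, and CHH is taken strictly as a cytosine followed by two non-G bases.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trinucleotide_context(sequence, pos, strand):
    """Return the CHH trinucleotide starting at a cytosine, read 5'->3' on its own strand.

    sequence: forward-strand reference sequence (uppercase).
    pos: 0-based forward-strand position of the cytosine ('+') or of its paired G ('-').
    Returns None if the site is not in a CHH context or lies too close to an edge.
    """
    if strand == "+":
        tri = sequence[pos:pos + 3]
    else:
        if pos < 2:
            return None
        # Reverse complement of the three forward bases ending at pos.
        tri = sequence[pos - 2:pos + 1].translate(COMPLEMENT)[::-1]
    if len(tri) != 3 or tri[0] != "C":
        return None
    if tri[1] not in "ACT" or tri[2] not in "ACT":
        return None  # excludes CG, CHG, and ambiguous bases
    return tri


def percent_mchh_by_context(calls):
    """calls: iterable of (context, methylated_count, unmethylated_count) for one gene and cell."""
    totals = {}
    for context, c, t in calls:
        methylated, covered = totals.get(context, (0, 0))
        totals[context] = (methylated + c, covered + c + t)
    return {ctx: 100.0 * m / n for ctx, (m, n) in totals.items() if n > 0}
```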

Statistics and reproducibility

For DMR analysis, contingency tables were first made from summed member and non-member c and t observations in each 1500 bp × 500 bp sliding window across the genome. A variation of a two-sided Fisher’s exact test was applied to calculate a two-tailed p value approximation using hypergeometric distribution probabilities with a continuity correction. Adjustment for multiple testing (all windows × all groups) was then performed using Bonferroni correction in every case except Benjamini–Hochberg results discussed in Fig. 3j–n. Thresholds of padj < 0.01 and |logFC| > 2 were implemented for mCG analysis. Thresholds of padj < 0.01 and |logFC| > 1.5 were used for mCH analysis. Neighboring significant windows were then collapsed to one locus (see “Methods”). padj and logFC for collapsed DMRs were calculated from the mean values returned for individual windows within the locus.
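For orientation, a plain two-sided Fisher's exact test on one window's contingency table, followed by Bonferroni adjustment, is sketched below. The analysis itself used a methylKit-derived approximation with a continuity correction, which is not reproduced here; the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(c_member, t_member, c_other, t_other):
    """Two-sided Fisher's exact test for a 2x2 table of methylated (c) and
    unmethylated (t) counts in member vs. non-member cells.

    Sums hypergeometric probabilities of all tables at least as extreme as observed.
    """
    row_member = c_member + t_member
    row_other = c_other + t_other
    col_c = c_member + c_other
    denom = comb(row_member + row_other, col_c)

    def prob(k):
        return comb(row_member, k) * comb(row_other, col_c - k) / denom

    p_observed = prob(c_member)
    lowest = max(0, col_c - row_other)
    highest = min(row_member, col_c)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lowest, highest + 1)) if p <= p_observed * (1 + 1e-7))
    return min(1.0, total)


def bonferroni(p_values):
    """Multiply each p value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

Because the Bonferroni factor is the number of windows multiplied by the number of groups tested, only windows with very small raw p values survive the padj < 0.01 threshold.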

For differentially methylated gene (DMG) analysis in Figs. 4 and 5h–j, we performed a two-sided Wilcoxon rank-sum test for %mCH over each gene body in the group of interest vs. all other groups, treating each cell as an individual observation. This nonparametric test was implemented because data was non-Gaussian due to the large discrepancy between global %mCH in neurons and glia. After all comparisons were performed, correction for multiple testing was implemented using the Bonferroni method. We applied various filters to rule out potential sources of noise: padj < 0.05, |logFC| > 1, number of observations in each group > 20, and gene length > 3000. Finally, we also filtered out hypermethylated mCH genes where the mean of the member group was less than 2% mCH.

For direct comparison between %mCH levels of SATB2 and GAD1 in excitatory vs. inhibitory neurons (Fig. 2h) and BROX and CTTNBP2 in glia from atlas-scale data (Fig. 5f), a Shapiro test of normality was first performed. Data was not normally distributed (pShapiro < 2.2E−16), so we applied a two-sided Wilcoxon rank-sum test, treating each cell as an individual observation. The results from atlas-scale data (Fig. 5f) reproduced earlier observations (Fig. 4c, e, f) and were consistent across all four individual donors. Sex-specific inter-individual results described in Fig. 5h–j were consistent between the two donor replicates from each sex. See “Methods” for further discussion of statistical methods used in each analysis.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Acknowledgements

Funding for this work was provided by: NIH BRAIN Initiative RF1MH128842 and a Silver Family Foundation Innovator Award to A.C.A.; a Medical Research Foundation Early Clinical Investigator Grant to L.E.R.; and NIGMS MIRA R35 GM147698 to G.G.Y. The authors would like to acknowledge Dr. Sonia Acharya for her contributions to generation of the published PBMC data25 used in Figs. 2–4. We would additionally like to thank other members of the Adey Lab, as well as the Saunders and O’Roak Labs at OHSU, for helpful discussions and critiques. In particular, we would like to thank Dr. Marissa Co for her proofreading of figures and contributions to naming Amethyst’s color palettes. Finally, the authors would like to thank Dr. Ryan Lister for providing valuable feedback on our preprint.

Author contributions

L.E.R. and A.C.A. devised the concept for Amethyst as a unification and expansion of the pre-existing workflow developed by A.C.A. R.V.N. and B.L.O. both developed the sciMETv2-v3 protocols23,24,26 and generated all brain24,26 and PBMC25 datasets used in Figs. 1–5. L.E.R. wrote Amethyst code with contributions from S.D.C. and A.C.A. B.W.S. wrote all code for Facet. L.E.R. performed analyses and wrote the manuscript under the guidance of A.C.A. J.H.K. contributed to analysis of inter-individual differences in Fig. 5. L.E.R. generated all figures. S.D.C. and G.G.Y. provided feedback on computational aspects of package development. All authors reviewed and approved this manuscript.

Peer review

Peer review information

Communications Biology thanks Maria Colomé-Tatché and the other anonymous reviewers for their contribution to the peer review of this work. Primary handling editor: David Favero. A peer review file is available.

Data availability

Previously published human peripheral blood mononuclear cell (PBMC) data are available from the NCBI Gene Expression Omnibus (https://ncbi.nlm.nih.gov/geo/) under accession GSE250282. Human brain sciMETv2 data are available from the NCBI Database of Genotypes and Phenotypes (https://ncbi.nlm.nih.gov/gap/) under accession phs003091.v2.p1. Raw and processed human sciMETv3 data are available on the NCBI Gene Expression Omnibus under accession GSE273592. Processed PBMC and brain data relevant to this manuscript are available under accession GSE303678. Source data used to generate Figs. 1b, c; 2i; 3e–h, l; 4a, b, g; and 5e, g are in Supplementary Data 1. Full DMR/DMG results are reported in Supplementary Data 2–7. All other data are available from the corresponding author upon reasonable request.

Code availability

Initial processing of sequencing output to base-level methylation calls can be performed with Premethyst commands, which are available at https://github.com/adeylab/premethyst. Amethyst is publicly available for installation at https://github.com/lrylaarsdam/amethyst. Example vignettes are available on GitHub to demonstrate the major analysis steps in Amethyst for PBMCs (Supplementary Data 8), which only have biologically relevant mCG; and frontal cortex (Supplementary Data 9), which also has substantial non-CG methylation. Vignettes are also available for incorporating clustering methods from other packages (including MethSCAn), doublet detection, imputation, batch integration, and other utilities. All analysis code for this manuscript is available at https://github.com/lrylaarsdam/amethyst. Source code, vignettes, and analysis scripts associated with this manuscript are also deposited in Zenodo under 10.5281/zenodo.16922952 (ref. 81).

Competing interests

The authors declare the following competing interests: A.C.A. is an author of one or more patents that pertain to sciMET technology. A.C.A. is also an advisor to Scale Biosciences, which has commercialized the technology. This potential conflict is managed by the office of research integrity at OHSU.

Footnotes

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Lauren E. Rylaarsdam, Email: rylaarsd@ohsu.edu

Andrew C. Adey, Email: adey@ohsu.edu

Supplementary information

The online version contains supplementary material available at 10.1038/s42003-025-08859-2.

References

  • 1.Moore, L. D., Le, T. & Fan, G. DNA methylation and its basic function. Neuropsychopharmacology38, 23–38 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Robertson, K. D. DNA methylation and human disease. Nat. Rev. Genet.6, 597–610 (2005). [DOI] [PubMed] [Google Scholar]
  • 3.Das, P. M. & Singal, R. DNA methylation and cancer. J. Clin. Oncol.22, 4632–4642 (2004). [DOI] [PubMed] [Google Scholar]
  • 4.Lakshminarasimhan, R. & Liang, G. The role of DNA methylation in cancer. Adv. Exp. Med. Biol.945, 151–172 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Grayson, D. R. & Guidotti, A. The dynamics of DNA methylation in schizophrenia and related psychiatric disorders. Neuropsychopharmacology38, 138–166 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Keil, K. P. & Lein, P. J. DNA methylation: a mechanism linking environmental chemical exposures to risk of autism spectrum disorders? Environ. Epigenet.2, dvv012 (2016). [DOI] [PMC free article] [PubMed]
  • 7.Awamleh, Z. et al. Generation of DNA methylation signatures and classification of variants in rare neurodevelopmental disorders using EpigenCentral. Curr. Protoc.2, e597 (2022). [DOI] [PubMed] [Google Scholar]
  • 8.Bakulski, K. M. et al. Autism-associated DNA methylation at birth from multiple tissues is enriched for autism genes in the Early Autism Risk Longitudinal InvestIgation. Front. Mol. Neurosci. 14, 775390 (2021). [DOI] [PMC free article] [PubMed]
  • 9.He, Y. & Ecker, J. R. Non-CG methylation in the human genome. Annu. Rev. Genomics Hum. Genet.16, 55–77 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Lister, R. et al. Global epigenomic reconfiguration during mammalian brain development. Science341, 1237905 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Varley, K. E. et al. Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res.23, 555–567 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Xie, W. et al. Base-resolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell148, 816–831 (2012). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Guo, J. U. et al. Distribution, recognition and regulation of non-CpG methylation in the adult mammalian brain. Nat. Neurosci.17, 215–222 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Shirane, K. et al. Mouse oocyte methylomes at base resolution reveal genome-wide accumulation of non-CpG methylation and role of DNA methyltransferases. PLoS Genet.9, e1003439 (2013). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Liu, H. et al. DNA methylation atlas of the mouse brain at single-cell resolution. Nature598, 120–128 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Luo, C. et al. Single nucleus multi-omics identifies human cortical cell regulatory genome diversity. Cell Genomics2, 100107–100106 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Liu, H. et al. Single-cell DNA methylome and 3D multi-omic atlas of the adult mouse brain. Nature624, 366–377 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Moore, J. R. et al. MeCP2 and non-CG DNA methylation stabilize the expression of long genes that distinguish closely related neuron types. Nat. Neurosci.28, 1185–1198 (2025). [DOI] [PMC free article] [PubMed]
  • 19.Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods11, 817–820 (2014). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Guo, H. et al. Profiling DNA methylome landscapes of mammalian cells with single-cell reduced-representation bisulfite sequencing. Nat. Protoc. 10, 645–659 (2015).
  • 21. Luo, C. et al. Robust single-cell DNA methylome profiling with snmC-seq2. Nat. Commun. 9, 3824 (2018).
  • 22. Luo, C. et al. Single cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
  • 23. Mulqueen, R. M. et al. Highly scalable generation of DNA methylation profiles in single cells. Nat. Biotechnol. 36, 428–431 (2018).
  • 24. Nichols, R. V. et al. High-throughput robust single-cell DNA methylation profiling with sciMETv2. Nat. Commun. 13, 7627 (2022).
  • 25. Acharya, S. N. et al. sciMET-cap: high-throughput single-cell methylation analysis with a reduced sequencing burden. Genome Biol. 25, 186 (2024).
  • 26. Nichols, R. V., Rylaarsdam, L. E., O’Connell, B. L., Acharya, S. N. & Adey, A. C. Atlas-scale single-cell DNA methylation profiling with sciMETv3. Cell Genom. 5, 100726 (2025).
  • 27. Iqbal, W. & Zhou, W. Computational methods for single-cell DNA methylome analysis. Genomics Proteomics Bioinformatics 21, 48–66 (2023).
  • 28. ALLCools: ALL methyl-Cytosine tools. https://lhqing.github.io/ALLCools/intro.html.
  • 29. Satija, R., Farrell, J. A., Gennert, D., Schier, A. F. & Regev, A. Spatial reconstruction of single-cell gene expression. Nat. Biotechnol. 33, 495–502 (2015).
  • 30. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).
  • 31. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888.e21–1902.e21 (2019).
  • 32. Stuart, T., Srivastava, A., Madad, S., Lareau, C. A. & Satija, R. Single-cell chromatin state analysis with Signac. Nat. Methods 18, 1333–1341 (2021).
  • 33. Trapnell, C. et al. The dynamics and regulators of cell fate decisions are revealed by pseudotemporal ordering of single cells. Nat. Biotechnol. 32, 381–386 (2014).
  • 34. Qiu, X. et al. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods 14, 979–982 (2017).
  • 35. Granja, J. M. et al. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat. Genet. 53, 403–411 (2021).
  • 36. amethyst-facet: Compute window aggregations and alter contents of Amethyst HDF5 files. https://pypi.org/project/amethyst-facet/ (2025).
  • 37. Baglama, J. & Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19–42 (2005).
  • 38. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).
  • 39. JinmiaoChenLab/Rphenograph. Jinmiao Chen’s Lab. https://github.com/JinmiaoChenLab/Rphenograph (2024).
  • 40. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding of communities in large networks. J. Stat. Mech. 2008, P10008 (2008).
  • 41. Traag, V. A., Waltman, L. & van Eck, N. J. From Louvain to Leiden: guaranteeing well-connected communities. Sci. Rep. 9, 5233 (2019).
  • 42. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: uniform manifold approximation and projection. J. Open Source Softw. 3, 861 (2018).
  • 43. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
  • 44. van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008).
  • 45. Danese, A. et al. EpiScanpy: integrated single-cell epigenomic analysis. Nat. Commun. 12, 5228 (2021).
  • 46. Shahryary, Y., Hazarika, R. R. & Johannes, F. MethylStar: a fast and robust pre-processing pipeline for bulk or single-cell whole-genome bisulfite sequencing data. BMC Genomics 21, 479 (2020).
  • 47. Kremer, L. P. M. et al. Analyzing single-cell bisulfite sequencing data with MethSCAn. Nat. Methods 21, 1616–1623 (2024).
  • 48. Kapourani, C.-A., Argelaguet, R., Sanguinetti, G. & Vallejos, C. A. scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biol. 22, 114 (2021).
  • 49. Kapourani, C.-A. & Sanguinetti, G. Melissa: Bayesian clustering and imputation of single-cell methylomes. Genome Biol. 20, 61 (2019).
  • 50. Kapourani, C.-A. & Sanguinetti, G. BPRMeth: a flexible Bioconductor package for modelling methylation profiles. Bioinformatics 34, 2485–2486 (2018).
  • 51. Souza, C. P. E. et al. Epiclomal: probabilistic clustering of sparse single-cell DNA methylation data. PLoS Comput. Biol. 16, e1008270 (2020).
  • 52. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 18, 67 (2017).
  • 53. Zhou, J. et al. Deep learning imputes DNA methylation states in single cells and enhances the detection of epigenetic alterations in schizophrenia. Cell Genom. 5, 100774 (2025).
  • 54. Uzun, Y., Wu, H. & Tan, K. Integrating single-cell methylome and transcriptome data with MAPLE. Methods Mol. Biol. 2624, 43–54 (2023).
  • 55. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
  • 56. Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2024).
  • 57. Zeng, P. & Lin, Z. coupleCoC+: an information-theoretic co-clustering-based transfer learning framework for the integrative analysis of single-cell genomic data. PLoS Comput. Biol. 17, e1009064 (2021).
  • 58. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873.e17–1887.e17 (2019).
  • 59. Jin, S., Zhang, L. & Nie, Q. scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles. Genome Biol. 21, 25 (2020).
  • 60. Akalin, A. et al. methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles. Genome Biol. 13, R87 (2012).
  • 61. Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database 2017, bax028 (2017).
  • 62. Alexa, A., Rahnenführer, J. & Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–1607 (2006).
  • 63. Chen, Y.-K. & Hsueh, Y.-P. Cortactin-Binding Protein 2 modulates the mobility of cortactin and regulates dendritic spine formation and maintenance. J. Neurosci. 32, 1043–1055 (2012).
  • 64. Mancia Leon, W. R. et al. Clustered gamma-protocadherins regulate cortical interneuron programmed cell death. eLife 9, e55374 (2020).
  • 65. Clemens, A. W. et al. MeCP2 represses enhancers through chromosome topology-associated DNA methylation. Mol. Cell 77, 279.e8–293.e8 (2020).
  • 66. Lagger, S. et al. MeCP2 recognizes cytosine methylated tri-nucleotide and di-nucleotide sequences to tune transcription in the mammalian brain. PLoS Genet. 13, e1006793 (2017).
  • 67. Carrel, L. & Willard, H. F. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature 434, 400–404 (2005).
  • 68. Sharp, A. J. et al. DNA methylation profiles of human active and inactive X chromosomes. Genome Res. 21, 1592–1600 (2011).
  • 69. Gabel, H. W. et al. Disruption of DNA methylation-dependent long gene repression in Rett syndrome. Nature 522, 89–93 (2015).
  • 70. Rett, A. On a unusual brain atrophy syndrome in hyperammonemia in childhood. Wien. Med. Wochenschr. 116, 723–726 (1966).
  • 71. Amir, R. E. et al. Rett syndrome is caused by mutations in X-linked MECP2, encoding methyl-CpG-binding protein 2. Nat. Genet. 23, 185–188 (1999).
  • 72. Hagberg, B., Aicardi, J., Dias, K. & Ramos, O. A progressive syndrome of autism, dementia, ataxia, and loss of purposeful hand use in girls: Rett’s syndrome: report of 35 cases. Ann. Neurol. 14, 471–479 (1983).
  • 73. Tillotson, R. et al. Neuronal non-CG methylation is an essential target for MeCP2 function. Mol. Cell 81, 1260.e12–1275.e12 (2021).
  • 74. Chen, L. et al. MeCP2 binds to non-CG methylated DNA as neurons mature, influencing transcription and the timing of onset for Rett syndrome. Proc. Natl. Acad. Sci. USA 112, 5509–5514 (2015).
  • 75. Lioy, D. T. et al. A role for glia in the progression of Rett’s syndrome. Nature 475, 497–500 (2011).
  • 76. Rakela, B., Brehm, P. & Mandel, G. Astrocytic modulation of excitatory synaptic signaling in a mouse model of Rett syndrome. eLife 7, e31629 (2018).
  • 77. Kahanovitch, U., Patterson, K. C., Hernandez, R. & Olsen, M. L. Glial dysfunction in MeCP2 deficiency models: implications for Rett syndrome. Int. J. Mol. Sci. 20, 3813 (2019).
  • 78. Schultz, M. D. et al. Human body epigenome maps reveal noncanonical DNA methylation variation. Nature 523, 212–216 (2015).
  • 79. He, Y. et al. Spatiotemporal DNA methylome dynamics of the developing mouse fetus. Nature 583, 752–759 (2020).
  • 80. Dyer, S. C. et al. Ensembl 2025. Nucleic Acids Res. 53, D948–D957 (2025).
  • 81. Rylaarsdam, L. Amethyst: a comprehensive toolkit for single-cell methylation sequencing data analysis. Zenodo 10.5281/zenodo.16922953 (2025).
  • 82. van Dijk, D. et al. Recovering gene interactions from single-cell data using data diffusion. Cell 174, 716.e27–729.e27 (2018).

Associated Data

Supplementary Materials

42003_2025_8859_MOESM3_ESM.pdf (25.4KB, pdf)

Description of Additional Supplementary Files

Supplementary Data 1 (265.5KB, xlsx)
Supplementary Data 2 (727.7KB, xlsx)
Supplementary Data 3 (922.6KB, xlsx)
Supplementary Data 4 (52.9MB, xlsx)
Supplementary Data 5 (26.3MB, xlsx)
Supplementary Data 6 (4.7MB, xlsx)
Supplementary Data 7 (70.5KB, xlsx)
Supplementary Data 8 (2.2MB, pdf)
Supplementary Data 9 (2.9MB, pdf)
Reporting summary (1.8MB, pdf)

Data Availability Statement

Previously published human peripheral blood mononuclear cell (PBMC) data are available from the NCBI Gene Expression Omnibus (https://ncbi.nlm.nih.gov/geo/) under accession GSE250282. Human brain sciMETv2 data are available from the NCBI Database of Genotypes and Phenotypes (https://ncbi.nlm.nih.gov/gap/) under accession phs003091.v2.p1. Raw and processed human sciMETv3 data are available on the NCBI Gene Expression Omnibus under accession GSE273592. Processed PBMC and brain data relevant to this manuscript are available under accession GSE303678. Source data used to generate Figs. 1b, c; 2i; 3e–h, l; 4a, b, g; and 5e, g are in Supplementary Data 1. Full DMR/DMG results are reported in Supplementary Data 2–7. All other data are available from the corresponding author upon reasonable request.

Initial processing of sequencing output to base-level methylation calls can be performed with Premethyst commands, which are available at https://github.com/adeylab/premethyst. Amethyst is publicly available for installation at https://github.com/lrylaarsdam/amethyst. Example vignettes are available on GitHub to demonstrate the major analysis steps in Amethyst for PBMCs (Supplementary Data 8), in which only mCG is biologically relevant, and for frontal cortex (Supplementary Data 9), which also has substantial non-CG methylation. Vignettes are also available for incorporating clustering methods from other packages (including MethSCAn), doublet detection, imputation, batch integration, and other utilities. All analysis code for this manuscript is available at https://github.com/lrylaarsdam/amethyst. Source code, vignettes, and analysis scripts associated with this manuscript are also deposited in Zenodo under 10.5281/zenodo.16922952 (ref. 81).


Articles from Communications Biology are provided here courtesy of Nature Publishing Group