Author manuscript; available in PMC: 2026 Apr 10.
Published before final editing as: Nat Biotechnol. 2025 Dec 3:10.1038/s41587-025-02914-3. doi: 10.1038/s41587-025-02914-3

Mapping single-cell diploid chromatin fiber architectures using DAF-seq

Elliott G Swanson 1,*, Yizi Mao 2,*, Benjamin J Mallory 1, Mitchell R Vollger 2, Stephanie C Bohaczuk 2, Christopher B Oliveira 1, Daniel B Lyon 3,4, Jane Ranchalis 2, Nancy L Parmalee 5, Barak A Cohen 3,4, James T Bennett 5,6, Andrew B Stergachis 1,2,7
PMCID: PMC13063407  NIHMSID: NIHMS2137785  PMID: 41339527

Abstract

Gene regulation is orchestrated by the co-binding of proteins along chromosome-length chromatin fibers within single cells, yet the heterogeneity of this occupancy between haplotypes and cells remains poorly resolved in diploid organisms. Here, we present Deaminase-Assisted single-molecule chromatin Fiber sequencing (DAF-seq), which enables single-molecule footprinting at near-nucleotide resolution while synchronously profiling single-molecule chromatin states and DNA sequence. DAF-seq illuminates cooperative protein occupancy at individual regulatory elements and resolves the functional impact of somatic variants and rare chromatin epialleles. Single-cell DAF-seq (scDAF-seq) generates chromosome-length protein co-occupancy maps across 99% of each individual cell’s mappable genome. ScDAF-seq uncovers extensive chromatin plasticity both within and between single diploid cells, with chromatin actuation diverging by 61% between haplotypes within a cell, and 63% between cells. Moreover, we find that regulatory elements are preferentially co-actuated along the same fiber in a distance-dependent manner that mirrors cohesin-mediated loops. Overall, DAF-seq enables the characterization of protein occupancy across entire chromosomes with single-nucleotide, single-molecule, single-haplotype, and single-cell precision.


Human gene regulation occurs at the level of an individual chromatin fiber within a single diploid cell. However, these regulatory patterns can diverge between individual chromatin fibers based on the stochastic nature of protein-DNA interactions, cell-to-cell variability in the abundance of chromatin proteins, the presence of underlying genetic variants (germline or somatic), or via the coordinated co-occupancy or co-actuation of chromatin features across loci or entire chromosomes. The chromatin architecture of each fiber is intrinsically related to its functional output, yet it remains unknown how rigidly these architectures are maintained. Although numerous methods exist for resolving single-molecule protein occupancy or single-cell chromatin accessibility, these existing approaches are limited in either their ability to achieve deep sequencing coverage1–4, high-resolution protein occupancy patterns5–7, or comprehensive high-resolution patterns across a single cell8–11. Specifically, long-read methyltransferase stenciling approaches are all bulk assays owing to the erasure of these methylation marks during DNA amplification. Consequently, these approaches limit our understanding of the chromatin landscape within a single cell to individual 10–100 kb fibers that cover only ~0.001% of that cell’s genome. Furthermore, Tn5-dependent single-cell chromatin assays8, including those that use deaminase footprinting10,11, result in sparse single-cell chromatin data owing to inherent sampling and signal-to-noise limitations with cleavage-based chromatin mapping. These drawbacks limit these approaches to single-cell readouts of scattered ~100 bp chromatin fibers that cover only ~0.01% of that cell’s genome. Given these fundamental technical limitations, we lack a detailed understanding of the basic principles guiding how chromatin is organized and regulated within a single cell, including the extent to which a cell’s chromatin epigenome varies.

To overcome these limitations, we have developed a single-molecule chromatin fiber sequencing method that enables the mapping of single-molecule protein occupancy patterns across nearly the entire genome of an individual cell with haplotype resolution (Fig. 1a). Specifically, this method leverages a variant of the double-stranded cytidine deaminase toxin A (DddA)12,13 to stencil protein occupancy in the form of deaminated cytidines. Upon DNA amplification, these deaminated cytidines create distinct modifications to the DNA sequence that can be used to track sequencing reads that arose from the same DNA template and resolve the genetic composition and protein occupancy of each chromatin fiber.

Figure 1 |. Deaminase-assisted single-molecule chromatin fiber sequencing (DAF-seq).


(a) Schematic for DAF-seq using the non-specific cytidine deaminase SsDddA to selectively stencil single-molecule protein occupancy using deaminated cytidines. (b) Demonstration of how protein occupancy impacts cytidine deamination and how this results in C->T or G->A transitions relative to the reference depending on whether the top or bottom strand is used as the template. (c) Diagram showing deamination of cytidine to uridine and subsequent conversion to thymidine upon PCR amplification. Mass spectrometry quantification of cytidine in untreated genomic DNA, as well as genomic DNA treated with wild-type SsDddA and the variant SsDddA 5. (d) Motif logos showing the sequence context of cytidine deamination genome-wide after treatment of nuclei with wild-type SsDddA and the variant SsDddA 5. (e) Genomic locus showing GM12878 DNase-seq, pseudo-bulked scATAC-seq, Fiber-seq and targeted DAF-seq data at the NAPA promoter. Region targeted for amplification noted in red, as well as deamination rate, and coverage of chromatin features derived from the DAF-seq data. Note that DAF-seq used only 1% of a SMRT cell for this sequencing, and only 78/25,768 DAF-seq molecules from this locus are displayed. (f) (left) Median per-base deamination rates within the NAPA and WASF1 promoters (blue), and a well-positioned nucleosome and highly occupied CTCF element (red) after treatment of GM12878 cells with various DAF-seq reaction conditions. (right) Violin plots showing percent deamination at the highest (4 μM) and lowest (0.25 μM) SsDddA reaction concentrations. Bases are colored by TpC (dark blue) and non-TpC (light blue) dinucleotide context. (g) Enrichment of the targeted region versus genome-wide sequencing after targeted DAF-seq of four separate loci in GM12878 cells.

Results

Simiaoa sunii DddA (SsDddA) efficiently and specifically modifies accessible cytidines

Cytidine deaminases with activity towards double-stranded DNA (dsDNA), such as the recently described enzyme DddA, provide a potential solution for mapping chromatin architectures at C/G base pairs. Specifically, DddA modifies cytidine to uridine, resulting in C->T mutations upon amplification (Fig. 1b,c), which are readily observable by both short- and long-read DNA sequencing. However, for cytidine deaminases to be optimal for chromatin stenciling, they must have minimal sequence biases, be highly catalytically active, and be highly specific for accessible DNA. Although the originally described DddA enzymes have substantial TC sequence biases12,14, subsequent studies employing directed evolution or phylogenetic approaches have identified DddA variants with reduced sequence bias when tethered to a Cas9 enzyme15–17. To test whether these enzymes are suitable for chromatin stenciling, we optimized the production of two recombinant DddA variants from the bacterial species Simiaoa sunii (SsDddA) using a bacterial system (Supplementary Fig. 1) and quantified their cytidine deamination activity on purified dsDNA using mass spectrometry (Fig. 1c). This revealed that recombinant SsDddA13 is highly catalytically active, deaminating 99.8% of cytidines within dsDNA. To confirm that SsDddA has minimal sequence bias within a chromatin context, we treated nuclei with SsDddA and performed long-read sequencing of whole-genome amplified DNA from these SsDddA-treated nuclei. This demonstrated that, unlike other cytidine deaminases18, cytidine deamination with SsDddA has no appreciable sequence bias (Fig. 1d). Furthermore, we quantified the impact of 5-methylcytidine (5mC) on SsDddA activity using DNA templates treated with the CpG methyltransferase M.SssI, demonstrating that SsDddA can deaminate 5mCpG, albeit with reduced activity (Extended Data Fig. 1).

To determine the optimal SsDddA reaction conditions for chromatin stenciling, we treated GM12878 nuclei with a range of enzyme concentrations and treatment times and evaluated each condition using long-read sequencing of targeted amplicons from two genomic loci, the NAPA and WASF1 promoters (Supplementary Note), that have extensive paired DNase-seq, ATAC-seq, and single-molecule chromatin fiber sequencing (Fiber-seq) data (Fig. 1e). Using 1% of a PacBio Revio SMRT cell we sequenced these regions to 25,672x coverage for NAPA and 46,264x for WASF1. Cytidine deamination was highly specific to orthogonally defined regulatory elements and internucleosomal linker regions in a manner that mirrored the paired DNase-seq, ATAC-seq, and Fiber-seq data (Fig. 1e, Supplementary Fig. 2). We observed that treating nuclei with 4 μM SsDddA for 10 minutes provided the highest quality data, with a median of 82% and 73% deamination rates within accessible portions of the NAPA and WASF1 promoters, respectively, and a median of 2.6% and 2.3% deamination rates within well positioned CTCF and nucleosome footprints within these promoters, respectively (Fig. 1f). Furthermore, this reaction condition demonstrated minimal sequence biases (Fig. 1f, Extended Data Fig. 2), enabling the near nucleotide-precise mapping of protein occupancy events, especially in cytidine-rich CpG islands. In addition, deamination rates within the NAPA promoter were significantly higher than m6A rates from Fiber-seq performed on the same sample (one-sided t-test, P = 1.7 × 10⁻¹¹), indicating that this reaction condition results in saturated SsDddA activity (Extended Data Fig. 3). Furthermore, we observed that each molecule’s deamination pattern could be readily used as a unique molecular identifier (UMI) to identify reads arising from PCR duplicates (Supplementary Fig. 3) due to the practically countless combinations of C->T mutations that could exist along a 5 kb fiber.
We further benchmarked targeted DAF-seq across 10 loci in human GM12878 cells, K562 cells, and frozen primary human post-mortem descending colon tissue, which showed strong agreement between DAF-seq, Fiber-seq, and ATAC-seq bulked chromatin accessibility measures (Extended Data Fig. 4). Overall, we observed that targeted DAF-seq can result in a 230,000-fold enrichment relative to untargeted genome-wide chromatin stenciling approaches (Fig. 1g), with the majority of sequenced reads constituting unique molecules at a sequencing depth of 100,000 (Supplementary Fig. 3). Together, these findings demonstrate that recombinant SsDddA can be readily purified using a bacterial system, has minimal sequence bias, is highly catalytically active, and is highly specific for accessible DNA—features that make it well suited for chromatin stenciling with single-molecule and single-nucleotide precision.
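The use of a deamination pattern as a UMI can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the toy reference, and the toy reads are hypothetical, reads are assumed to be top-strand reads already aligned to the reference without indels, and exact pattern matching is a simplification that ignores sequencing errors.

```python
from collections import defaultdict

def deamination_key(read_bases, cytosine_positions):
    """Tuple of booleans, True where a reference C is read as T (deaminated)."""
    return tuple(read_bases[pos] == "T" for pos in cytosine_positions)

def collapse_duplicates(reads, cytosine_positions):
    """Group read names that share an identical deamination pattern.

    Assumption: identical patterns indicate PCR duplicates of one template.
    """
    groups = defaultdict(list)
    for name, bases in reads.items():
        groups[deamination_key(bases, cytosine_positions)].append(name)
    return list(groups.values())

# Hypothetical 12 bp reference and three aligned top-strand reads.
reference = "ACGTCCATCGAC"
cytosine_positions = [i for i, base in enumerate(reference) if base == "C"]
reads = {
    "read1": "ATGTCTATCGAT",
    "read2": "ATGTCTATCGAT",  # same pattern as read1, treated as a duplicate
    "read3": "ACGTTCATTGAC",  # different pattern, treated as a distinct molecule
}

unique_molecules = collapse_duplicates(reads, cytosine_positions)

# Each cytosine is either deaminated or not, so n cytosines allow 2**n patterns.
n_patterns = 2 ** len(cytosine_positions)
```

Because the number of possible patterns doubles with every additional cytosine, even a short amplicon admits far more patterns than there are molecules in a typical library, which is why the pattern can act as an identifier.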

DAF-seq disentangles the regulatory logic of single-molecule TF co-occupancy

The single-molecule and single-nucleotide precision of DAF-seq combined with its high sequencing depth offers the potential to precisely delineate how individual transcription factors (TFs) occupy and co-occupy a given regulatory element. Furthermore, unlike Tn5-based enrichment strategies10,11, or approaches that rely on harsh bisulfite conversion19, targeted DAF-seq enables the single-molecule assessment of protein occupancy patterns across large stretches of regulatory DNA. To explore this, we identified 11 high-confidence TF binding elements within the NAPA promoter and quantified their overall per-molecule occupancy (Fig. 2a,b, Supplementary Table 1), demonstrating that the occupancy of each element ranged from 13% (element 6) to 96% (element 11) of fibers (Fig. 2b). We observed that occupancy at element 2 rarely occurred unless element 1 was also occupied along the same molecule (Fig. 2c), suggestive of a potential cooperative interaction between the proteins that occupy elements 1 and 2.

Figure 2 |. DAF-seq disentangles the regulatory logic of single-molecule TF co-occupancy.


(a) (top) Targeted DAF-seq chromatin actuation and ChIP-seq data from the NAPA promoter in GM12878 cells. (bottom) Zoom-in of the NAPA promoter showing single-molecule DAF-seq profiles with deaminated bases marked in red. Only a subset of ‘top-strand’ reads are shown, with reads clustered based on their occupancy pattern at 11 well-defined TF binding elements within the NAPA promoter. (b) Bar graph showing single-molecule protein occupancy measurements of the 11 binding elements within the NAPA promoter. (c) Bar graph showing the number of chromatin fibers with occupancy at different combinations of elements 1 and 2. (d) Thermodynamic stability of TF occupancy and co-occupancy on the NAPA promoter. (top) Heatmaps showing computed ΔG values for individual protein-DNA interactions and pair-wise interactions. Only pair-wise interactions that passed significance testing (P < 0.01, two-tailed binomial test) are shown (Supplemental Table X). (bottom) Cartoon diagram representing pair-wise interactions on the NAPA promoter. Blue lines represent thermodynamically favorable interactions relative to the reference state where no footprints are occupied, and red represents thermodynamically unfavorable interactions. The thickness of the lines is proportional to the absolute value of the interactions. (e) Computed ΔG values (methods) for individual protein-DNA interactions (top, blue) and pair-wise protein-protein interactions (bottom, yellow). (f) Computed ΔG values for three-way protein-protein interactions including elements 1 and 2 (top, yellow and light blue) and three-way protein-protein interactions not including either element 1 or element 2 (bottom, dark blue).

To quantify this, we implemented a thermodynamic formalism20,21 that relates the frequencies of DAF-seq reads representing different TF-bound states to the free energies of the TF-DNA and TF-TF interactions (Fig. 2d–f, Supplementary Table 2). This approach assumes that the NAPA promoter is close to equilibrium and that the reads representing each of the protein-bound states will follow the Boltzmann distribution, thereby enabling us to calculate the change in free energy (ΔG) of each singly bound, doubly bound, or triply bound TF state relative to the unbound state. This demonstrated that the relative binding affinity of the protein-protein interaction between the proteins occupying elements 1 and 2 is 180,000 times stronger than the relative binding affinity of the protein-DNA interaction at element 2, indicating that element 2 is primarily occupied via a cooperative interaction (Methods) (Fig. 2e). In contrast, when elements 1 and 2 are co-occupied, it appears unfavorable for proteins to synchronously occupy elements 4, 5, 7, 8, 9, or 10 (Fig. 2f, Supplementary Table 3), indicating that occupancy at elements 1 and 2 establishes a protein complex that is not reliant upon occupancy at the other elements within the NAPA promoter. To discern whether element 1 or 2 is driving this cooperative interaction, we employed a network graph approach to quantify the conditional codependency between these elements22, revealing that element 1 was by far the most essential, supporting a clear directional effect whereby element 1 drives the codependent occupancy at element 2 (Extended Data Fig. 5, Supplementary Table 4). Of note, whereas element 1 contains a predicted high-affinity element for USF1 & USF2, the CAAT box in element 2 contains a predicted low-affinity sequence element for NFY-A, consistent with distance-dependent cooperative binding between USF1/2 and NFY-A, whereby USF1/2 binds and recruits NFY-A through its USF Specific Region (USR) domain23–25.
Favorable cooperative interactions were not limited to elements 1 and 2 (Fig. 2d), and three-way interactions that excluded elements 1 and 2 overall appeared favorable (Fig. 2f), suggesting that the NAPA promoter is occupied by two multiprotein complexes, one occupying elements 1 and 2, and a second occupying elements 4, 5, 7, 8, 9, and 10. Overall, these findings demonstrate that DAF-seq can accurately quantify single-molecule protein occupancy and co-occupancy patterns, revealing cooperative TF interactions within near single-nucleotide precision.
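The relationship between read frequencies and free energies can be sketched in a few lines of Python. The sketch below assumes the standard Boltzmann relation ΔG = −RT ln(N_state / N_unbound) and a simple additive decomposition of the doubly bound state into two protein-DNA terms plus one interaction term; the exact formulation is given in the Methods and may differ. The read counts and the temperature are hypothetical and are not taken from the paper.

```python
import math

R_KCAL = 0.0019872        # gas constant in kcal/(mol*K)
TEMPERATURE_K = 310.15    # 37 degrees C; an assumed value
RT = R_KCAL * TEMPERATURE_K

def delta_g(count_state, count_unbound, rt=RT):
    """Free energy of a bound state relative to the unbound state (kcal/mol)."""
    return -rt * math.log(count_state / count_unbound)

# Hypothetical read counts per occupancy state of elements 1 and 2.
counts = {
    (): 1000,       # neither element occupied
    (1,): 4000,     # only element 1 occupied
    (2,): 50,       # only element 2 occupied
    (1, 2): 2000,   # both elements occupied on the same fiber
}

dg_1 = delta_g(counts[(1,)], counts[()])
dg_2 = delta_g(counts[(2,)], counts[()])
dg_12 = delta_g(counts[(1, 2)], counts[()])

# Assumed additive model: dG(1,2) = dG(1) + dG(2) + dG(interaction).
dg_interaction = dg_12 - dg_1 - dg_2

# Fold enrichment of co-occupancy over the independent-binding expectation.
cooperativity = math.exp(-dg_interaction / RT)
```

With these toy counts, element 2 alone is unfavorable relative to the unbound state, yet the interaction term is negative, which is the signature of occupancy that depends on a partner element. The cooperativity factor reduces to (N_12 × N_unbound) / (N_1 × N_2).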

Synchronous single-molecule genomic and chromatin profiles using DAF-seq

We next evaluated the ability of DAF-seq to disentangle SsDddA-induced deaminations from germline genetic variants, leveraging the fact that SsDddA only modifies one strand of a C/G base pair, and that reads can be readily partitioned into ‘top-strand’ and ‘bottom-strand’ reads relative to the reference based on their predominance of C->T versus G->A changes, respectively (Fig. 1b). Specifically, at a C/G base pair, although the top strand may be variably deaminated and converted to a T, the bottom strand will always remain a G, enabling one to readily distinguish whether a position was originally a C/G or T/A base-pair (Fig. 3a). Consequently, DAF-seq reads mapping to a reference position that has a C/G base-pair on both haplotypes will contain 100% G on the bottom-strand, and between 0 and 100% C on the top strand, with the base content of the top strand reflecting the frequency by which that site is deaminated (Fig. 3a). This approach enabled the accurate delineation of germline variants, as well as the haplotype phasing of DAF-seq reads based on these germline variants, including instances where the only heterozygous germline variant in a read is a C/T or G/A variant (i.e., only the bottom or top strand can be accurately phased in these cases, respectively). To validate this, we applied targeted DAF-seq to a 4.4 kb region containing only a single C/T heterozygous variant (rs56269549) in GM12878 cells, allowing us to successfully phase DAF-seq reads from the bottom strand. This targeted region is located on the X chromosome and spans four UBA1 transcriptional start sites (TSSs), three of which are selectively accessible on only one of the haplotypes (i.e., Xa) in GM12878 cells with allelically skewed X chromosome inactivation26.
This haplotype-resolved DAF-seq data demonstrated that whereas the canonical UBA1 TSS is accessible on both the Xa and Xi, the three upstream TSSs are selectively accessible on only the Xa, confirming that DAF-seq can accurately haplotype phase reads, even when they contain only a single heterozygous C/T variant (Extended Data Fig. 6a,b).
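The strand logic described above can be sketched as follows. This is a simplified illustration and not the authors' implementation: the function names and toy sequences are hypothetical, and reads are assumed to be gap-free alignments written in reference orientation. In that orientation, a bottom-strand read shows the reference C at a C/G base pair (the complement of the unmodifiable G), so a T in bottom-strand reads at a reference C points to a genuine T/A allele instead of a deamination event.

```python
def assign_strand(reference, read):
    """Label a read by its predominant mismatch type relative to the reference."""
    c_to_t = sum(1 for ref, base in zip(reference, read) if ref == "C" and base == "T")
    g_to_a = sum(1 for ref, base in zip(reference, read) if ref == "G" and base == "A")
    if c_to_t > g_to_a:
        return "top"
    if g_to_a > c_to_t:
        return "bottom"
    return "ambiguous"

def bottom_strand_t_fraction(reference, reads, position):
    """Fraction of bottom-strand reads with T at a reference C position.

    Near 0 suggests a C/G base pair whose top-strand T calls are deaminations;
    near 0.5 suggests a heterozygous C/T germline variant.
    """
    if reference[position] != "C":
        raise ValueError("position must be a reference C")
    bottom = [r for r in reads if assign_strand(reference, r) == "bottom"]
    if not bottom:
        return None
    return sum(r[position] == "T" for r in bottom) / len(bottom)

# Hypothetical reference and four aligned reads.
reference = "ACCGTGCA"
reads = [
    "ATCGTGTA",  # top strand: C->T at positions 1 and 6
    "ATTGTGCA",  # top strand: C->T at positions 1 and 2
    "ACCATACA",  # bottom strand: G->A at positions 3 and 5, reference C at position 1
    "ATCATACA",  # bottom strand: G->A at positions 3 and 5, T at position 1
]

strands = [assign_strand(reference, r) for r in reads]
variant_fraction_pos1 = bottom_strand_t_fraction(reference, reads, 1)
variant_fraction_pos2 = bottom_strand_t_fraction(reference, reads, 2)
```

In this toy example, position 1 carries a T in half of the bottom-strand reads, consistent with a heterozygous variant, whereas position 2 shows T only on top-strand reads and is therefore interpreted as deamination. The mirror-image test at reference G positions would use top-strand reads.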

Figure 3 |. SLC39A4 haplotypes modulate single-molecule promoter actuation patterns.


(a) (left) Diagram showing the evaluation of the underlying genetic architecture at each reference genome position. Specifically, reads are divided into ‘top-strand’ and ‘bottom-strand’ based on their predominance of C->T versus G->A mutations relative to reference. Below, the base content at a position across all reads from the top and bottom strands is quantified and used to evaluate the germline genetic content at that position. (right) Hexbin plots showing the base content at all of the targeted DAF-seq regions used in this manuscript. (b) Liver eQTL data from GTEx of three single nucleotide polymorphisms (SNPs) within the SLC39A4 promoter as well as their linkage disequilibrium using 1,000 genomes data. Below are DNase-seq data at this region in GM12878 cells and primary liver tissue. (c) (top) Diagram showing targeted DAF-seq on the SLC39A4 promoter in two cell types heterozygous for rs2280838 followed by the Leiden clustering of reads based on their deamination profiles and subsequent projection of these reads into UMAP space. (bottom-right) Pie chart showing the relative contribution of each cluster to the total number of reads. (d) (left) Aggregate chromatin actuation and nucleosome occupancy of reads from the different clusters identified in panel c. (right) Stacked bar chart showing the relative contribution of liver or GM12878 reads to the different clusters, as well as the relative contribution of rs2280838-C or rs2280838-T. (e) Results from sub-clustering of the reads from cluster 1 in panel c. (left) Aggregate chromatin actuation and nucleosome occupancy of reads from the different subclusters. (right) Stacked bar chart showing the relative contribution of liver or GM12878 reads to the different sub-clusters, as well as the relative contribution of rs2280838-C or rs2280838-T. (bottom) Promoter modules defined using above aggregate profiles.
(f) (left) Barplot showing the number of liver reads actuated at 0, 1, 2, 3, 4, or 5 of the SLC39A4 promoter modules defined in panel e, split by haplotype. (right) Stacked barplot showing the proportion of fibers with chromatin actuation at only a set number of the 5 modules. (g) Diagram showing the preference of rs2280838-T to actuate module C to switch from the closed to open chromatin state at the SLC39A4 promoter.

Leveraging this haplotype-phased data, we evaluated whether these UBA1 TSSs are being independently actuated along the Xa. We observed that although these four UBA1 TSSs showed strong co-actuation along the Xa (Extended Data Fig. 6c), the actuation of each of these TSSs was largely occurring independently of each other (Extended Data Fig. 6d). Furthermore, we observed that protein occupancy at the major binding elements within the UBA1 isoform 4 TSS that escapes XCI was also largely occurring in an independent manner and did not show appreciable differences in codependent occupancy between the Xa and Xi (two-sided paired t-test, P = 0.22) (Extended Data Fig. 6g). Together, these findings demonstrate that DAF-seq can accurately resolve chromatin patterns in a haplotype-aware manner and that actuation of the upstream UBA1 TSSs is being largely driven in an independent manner along the Xa.

Resolving chromatin transition states at the SLC39A4 promoter using DAF-seq

We next sought to determine whether DAF-seq could resolve distinct chromatin epialleles formed at a regulatory element. To test this, we focused on the SLC39A4 promoter, which is known to have haplotype- and cell-selective activity, and for which rare promoter variants can cause acrodermatitis enteropathica27. We performed targeted DAF-seq on primary post-mortem frozen liver tissue and a lymphoblastoid cell line from individuals heterozygous for the common rs2280838-T haplotype, which is associated with modestly increased SLC39A4 transcript levels selectively in liver tissue28 (Fig. 3b). We sequenced these fibers to a depth of ~1,200,000x targeted coverage in each sample, combined the deduplicated DAF-seq reads from both samples, and clustered them by their single-molecule deamination patterns (Fig. 3c, Supplementary Fig. 4). This exposed distinct single-molecule patterns in nucleosome positioning and chromatin actuation along the SLC39A4 promoter that differed by haplotype and cell type (Fig. 3d). Specifically, clusters 4, 5, and 6, which consist of variably positioned nucleosome arrays, were selective to reads from lymphoblastoid cells. In contrast, clusters 2 and 3, which consist of nucleosome arrays with exquisitely well-positioned nucleosomes within the SLC39A4 promoter, were represented by reads from both liver and lymphoblastoid cells, with a preference for liver reads from the rs2280838-C haplotype. Notably, only cluster 1 showed actuation of the SLC39A4 promoter, and this cluster consisted almost exclusively of chromatin fibers from liver tissue, with 72% of those reads originating from the rs2280838-T haplotype (Fig. 3d), indicating that the association of rs2280838 with SLC39A4 transcript levels in liver is likely mediated through differences between the two haplotypes in their propensity for forming actuated chromatin at the SLC39A4 promoter.

Further sub-clustering of the 17% of fibers corresponding to cluster 1 (Supplementary Fig. 4) revealed distinct single-molecule patterns of focal chromatin actuation within the SLC39A4 promoter that differed between the rs2280838-C and rs2280838-T haplotypes in liver (Fig. 3e). Specifically, the position immediately above rs2280838 was 2.0-fold more likely (one-sided Fisher’s exact test, P = 5.4 × 10⁻¹⁰) to be focally actuated along the rs2280838-T haplotype as opposed to the rs2280838-C haplotype in liver tissue (Fig. 3f). Notably, this position is predominantly occupied by an exquisitely well-positioned nucleosome along non-actuated liver fibers, indicating that rs2280838-T is likely increasing SLC39A4 transcript levels by modulating the propensity of an overlying nucleosome to occlude the SLC39A4 promoter (Fig. 3g). Overall, these findings demonstrate that DAF-seq can capture rare chromatin epialleles with single-molecule and single-haplotype precision.

Quantifying the functional impact of non-coding mosaic mutations using DAF-seq

We further sought to determine if DAF-seq can accurately capture the chromatin state along rare genetic alleles, as is often encountered when evaluating the genetic and functional impact of somatic variants with a low variant allele fraction (VAF). We have recently demonstrated that single-molecule chromatin assays are well suited for measuring the functional impacts of mosaic variants2, as unlike Tn5-based enrichment approaches, single-molecule chromatin fiber sequencing can independently measure VAF and allelic chromatin imbalance. To test this, we leveraged a cell-based model of somatic variation, which includes a 49:1 mixture of the B lymphoblast cell line COLO829BL (BL), and a melanoma tumor line COLO829T (T) derived from the same individual (Fig. 4a). Fiber-seq data from these two unmixed cell lines26 identified a CC>TT somatic dinucleotide mutation on one haplotype of COLO829T (chr17:19,447,245–19,447,246) that ablates an overlying CTCF binding element, causing selective loss of CTCF occupancy and chromatin accessibility on the variant haplotype relative to reference (Fig. 4b). Illumina PCR-free whole-genome sequencing of the 49:1 BL:T mixture identified 6 of 437 reads with the CC>TT variant for a VAF of 1.4%. Application of targeted DAF-seq to a 3.8 kb region spanning the variant in the same 49:1 BL:T mixture readily exposed the presence of the variant (Supplementary Fig. 5), and identified 1,701 of 115,991 bottom strand (G-to-A) reads as containing the CC>TT variant, for a VAF of 1.5% (Fig. 4a), demonstrating minimal amplification biases using targeted DAF-seq. Furthermore, comparison of the chromatin architectures between the DAF-seq reads containing the reference or CC>TT variant sequence readily exposed that the variant reads lost CTCF occupancy, chromatin accessibility and nucleosome phasing at the targeted element (Fig. 4c).
Overall, these results demonstrate that DAF-seq can accurately quantify the genetic and chromatin architecture of individual sequencing reads with minimal biases, and showcase targeted DAF-seq as a powerful tool for functionally characterizing low VAF somatic variants within human tissues.
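The two VAF estimates follow directly from the read counts reported above, as the short Python sketch below reproduces. The Wilson score interval is an addition for illustration and is not an analysis reported in the paper; it is included only to show how much more uncertain the estimate from 437 reads is than the estimate from 115,991 reads.

```python
import math

def variant_allele_fraction(variant_reads, total_reads):
    """Fraction of reads carrying the variant."""
    return variant_reads / total_reads

def wilson_interval(successes, total, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / total
    denom = 1 + z ** 2 / total
    center = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return center - half_width, center + half_width

# Read counts as reported in the text.
illumina_vaf = variant_allele_fraction(6, 437)
daf_vaf = variant_allele_fraction(1701, 115991)

illumina_low, illumina_high = wilson_interval(6, 437)
daf_low, daf_high = wilson_interval(1701, 115991)

print(f"Illumina VAF: {100 * illumina_vaf:.1f}%")
print(f"DAF-seq VAF:  {100 * daf_vaf:.1f}%")
```

The DAF-seq point estimate falls inside the interval around the Illumina estimate, which is consistent with the conclusion that targeted amplification did not noticeably skew the allele fraction.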

Figure 4 |. Resolving the functional impact of low VAF mosaic mutations.


(a) (left) Schematic showing the generation of the COLO829 BLT50 cell mixture, which is a mixture of the lymphoblastoid cell line COLO829BL and the melanoma cell line COLO829T derived from the same individual. Fiber-seq was performed on each of these cell lines separately, as is shown in panel b. (right) Whole genome PCR-free Illumina sequencing and targeted DAF-seq of the COLO829 BLT50 mixture showing the VAF at the chr17:19447245_6 CC>TT variant. (b) Fiber-seq data from the COLO829BL and COLO829T cells showing the per-molecule and aggregate chromatin actuation data surrounding the chr17:19447245_6 CC>TT variant, as well as the impact of this variant on a CTCF binding element. Note the complete loss of per-molecule CTCF occupancy and chromatin actuation on the reads containing the variant sequence. (c) Targeted DAF-seq of the same region as panel b in the COLO829 BLT50 cell mixture showing the single-molecule and aggregate chromatin patterns on reads containing the reference sequence (top) and the chr17:19447245_6 CC>TT variant (bottom). Note the complete loss of chromatin actuation and nucleosome positioning along the variant reads.

Comprehensive reconstruction of single-cell diploid genomes using single-cell DAF-seq

Having demonstrated that DAF-seq enables the accurate reconstruction of the genomic and chromatin architecture of individual reads, we next sought to apply this technology to single cells to comprehensively resolve the gene regulatory architecture of an individual cell. To accomplish this, we used GM24385 lymphoblastoid cells, as a highly accurate diploid genome (HG002) exists for this cell line29, enabling the benchmarking of our single-cell genomic accuracy. Specifically, we treated permeabilized GM24385 cells with SsDddA and sorted individual cells into sample wells using fluorescence-activated cell sorting (FACS). We then performed whole-genome amplification (WGA) separately on each cell using primary template-directed amplification (PTA)30 and sequenced the amplification products from 12 cells with long-read sequencing (Fig. 5a). Eight cells were sequenced to a median depth of 12 Gb, two to ~22 Gb, one to 91 Gb, and one to 133 Gb (N50 for sequencing fragment length of 4.0 kb) (Fig. 5b,c, Supplementary Table 5).

Figure 5 |. Chromosome-scale genomic phasing in single cells.


(a) Schematic for single-cell DAF-seq. Specifically, permeabilized cells are treated with SsDddA and then sorted into individual wells of a plate, with each well being subjected to a custom PTA. PTA reads from each well are then sequenced using PacBio HiFi sequencing and mapped to GRCh38. Individual reads are then identified as arising from either the ‘top’ or ‘bottom’ strand based on their pattern of either C>T or G>A mutations relative to the reference, respectively. Overlapping reads are then combined to generate consensus reads for each ‘haplotype-strand’, which are then haplotype-phased across the entire genome using parental short-read data. (b) (top) Swarm plot showing the sequencing depth of each cell in terms of gigabase pairs (Gbp). Note that only a fraction of the library was sequenced for each cell. (bottom) Number of ultra-long consensus reads >100,000 bp in length from each cell. (c) Density plots showing the size distribution of sequencing reads (top) or ‘haplotype-strand’ consensus reads (bottom) from four of the cells subjected to scDAF-seq. (d) (top) Genomic coverage of chromosome 13 from the raw sequencing reads from cell 2, as well as the coverage from the consensus reads from this same cell. (bottom) Genomic locus showing the consensus reads from cell 2, with reads colored based on whether they are from the top (blue) or bottom (yellow) strand, and split based on their phasing. (e) Swarm plot showing genomic coverage of consensus reads from each cell for all reads, as well as haplotype-phased reads.

Importantly, autosomes within diploid cells contain four ‘haplotype-strand’ templates at each genomic position, corresponding to the top and bottom strands of each of the two haplotypes. SsDddA treatment creates a unique deamination pattern along each ‘haplotype-strand’ template, as the precise locations of deaminase-induced mutations are stochastic due to incomplete SsDddA deamination and heterogeneous strand-specific protein occupancy patterns. Critically, PTA’s increased preference for priming on the primary template produces numerous partially overlapping amplicons originating from the same ‘haplotype-strand’ template (Extended Data Fig. 7). Consequently, we reasoned that the unique molecular identifiers created by template-specific deamination events could be used to group and collapse reads arising from the same ‘haplotype-strand’, even in situations where the underlying genomic sequence lacked haplotype-unique variants (Fig. 5a), analogous to generating consensus reads from multiple sequencing passes during circular consensus sequencing (CCS). To accomplish this, we mapped reads from each cell to GRCh38 and collapsed overlapping reads originating from the same ‘haplotype-strand’ template to generate individual ‘consensus reads’ (Methods) (Fig. 5d). This resulted in a single-cell consensus read N50 that was as high as 34.5 kb for the deepest sequenced cell (Fig. 5c), with 5,608 consensus reads from that cell >100 kb in length (Fig. 5b) and a haplotype switch error rate of 2.4–3.3%, which is comparable to current assembly methods with similar coverage3133. Overall, this resulted in an even coverage of consensus reads and readily exposed anomalous autosomal genomic loci harboring >4 ‘haplotype-strand’ templates, indicative of possible duplications present in that cell relative to GRCh38. In total, each cell had at least one read spanning between 60% and 99% of the mappable, autosomal portions of GRCh38 (Fig. 5d,e), a rate that mirrored the sequencing depth of each cell (Extended Data Fig. 8c). Moreover, between 27% and 80% of the mappable, autosomal portions of GRCh38 within a cell were covered by consensus reads that could be assigned to either the paternal or maternal haplotype using parental short-read data (Fig. 5e, Extended Data Fig. 8c, Supplementary Table 6). Together, these data demonstrate that scDAF-seq enables the accurate reconstruction of the chromosome-scale haplotype-phased diploid genome from a single cell, with thousands of consensus reads >100,000 bp in length.
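
For readers less familiar with the N50 statistic quoted for the consensus reads, the following is a minimal sketch of how an N50 is conventionally computed from a list of read lengths. The function name and the toy lengths are illustrative assumptions and are not taken from the scDAF-seq pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths.

    The N50 is the length L such that reads of length >= L together
    contain at least half of all sequenced bases.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (hypothetical lengths in bp): the two longest reads
# already contain more than half of the 100 total bases.
toy_lengths = [10, 20, 30, 40]
toy_n50 = n50(toy_lengths)
```

Because the statistic is weighted by bases rather than by read count, a small number of very long consensus reads can raise the N50 substantially.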

Widespread plasticity in the chromatin epigenome of a single cell

The degree of plasticity that is permissible within a cell’s chromatin epigenome is currently unknown, as current single-cell chromatin assays rely on Tn5-based accessibility measurements, which are sparse and unable to measure the inaccessible chromatin state. As scDAF-seq enables the genetic evaluation of up to 99% of each cell’s mappable genome, with up to 80% of the genome being haplotype phased within a single cell (Fig. 5), we reasoned that this method would be well suited for comprehensively evaluating chromatin patterns between haplotypes within a single cell, as well as along the same haplotype between cells. To evaluate this, we first benchmarked the ability of scDAF-seq to qualitatively and quantitatively measure single-molecule chromatin accessibility patterns genome-wide. Specifically, we demonstrated that pseudo-bulked scDAF-seq measures of chromatin accessibility are comparable to those of Fiber-seq, scATAC-seq, and ATAC-seq (Fig. 6a,b, Extended Data Fig. 9). In addition, we demonstrated that scDAF-seq single-molecule chromatin actuation measurements monotonically mirror single-molecule Fiber-seq chromatin actuation measurements across both accessible regulatory elements (Fig. 6a), as well as different euchromatic and heterochromatic genomic loci (Supplementary Fig. 6). Finally, we established that scDAF-seq enables single-molecule and single-cell measures of TF occupancy and co-occupancy with near single-nucleotide resolution (Fig. 6c). Together, these findings demonstrate that when using haplotype-phased samples, scDAF-seq permits profiling gapped chromatin architectures along single fibers that are up to >200 Mb in length (Fig. 6b), with the length and completeness of each fiber simply limited by the length of the underlying chromosome and library sequencing depth, respectively (Extended Data Fig. 8).

Figure 6 |. Single-molecule chromatin epigenome of a single cell.

(a) (left) Enrichments of scDAF-seq actuated elements (MSP > 150 bp) within Fiber-seq peaks from the same cell line (GM24385). Peaks are grouped into 10% Fiber-seq actuation bins ranging from 10–20% to 90–100%. (right) Enrichments of scDAF-seq actuated elements (MSP > 150 bp) within transcriptional start sites (TSS). (b) Example genomic loci showing single-molecule and aggregate Fiber-seq data in bulk GM24385 cells (top), as well as single-molecule deamination patterns along all ‘haplotype-strand’ consensus reads at these positions within cells 2 and 4. (c) scDAF-seq single-molecule deamination patterns along top and bottom-strand consensus reads at the same NAPA promoter position shown in Figure 2, as well as the position of TF binding elements within this promoter. (d) Violin plots displaying Jaccard distance comparisons of single-molecule scDAF-seq actuation patterns between haplotypes within each cell (green) and between the same haplotype of different cells (grey) at genomic loci that are covered across both cells (left). Elements are defined using paired Fiber-seq data and are further divided into promoter-proximal (center) and promoter-distal (right) peaks. Chromatin actuation patterns were significantly more similar between haplotypes of the same cell than between the same haplotype within different cells when comparing all peaks (two-sided t-test, P = 0.006) and promoter-distal peaks (P = 0.005) but not when comparing promoter-proximal peaks (P = 0.44). (e) Violin plots showing Jaccard distance comparisons between the same haplotype of different cells for Fiber-seq peaks grouped into 10% Fiber-seq actuation bins as in a. (f) Jaccard distance comparisons between the same haplotype of different cells for TSS-overlapping Fiber-seq peaks grouped by binned log2 full-length transcript gene expression from the same cell line (GM24385). Peaks in gene expression bin 0 produced no detectable transcripts.

Bulk Fiber-seq performed on GM24385 cells identified 144,870 actuated regulatory elements within mappable autosomal regions. However, we observed that, on average, only 46% of these regulatory elements were actuated on at least one haplotype within a single cell, indicating that within most cells only a minority of regulatory elements are present in the open configuration. To compare these patterns between cells, we identified haplotype-phased GM24385 regulatory elements that had sequencing coverage across a pair of cells subjected to scDAF-seq and compared the actuation status at each element between the two cells in a haplotype-aware manner (Fig. 6d). This revealed that on average the actuation status of a regulatory element between two individual GM24385 cells differs by ~63% (range 60% to 67%) (Fig. 6d). However, individual elements varied quite substantially in their degree of plasticity, with GM24385 regulatory elements that are consistently actuated in bulk Fiber-seq data only differing in their actuation status by ~9% between cells (Fig. 6e). Notably, Jaccard distances were 27% lower for promoter-proximal elements than for promoter-distal elements (Fig. 6d), suggesting that a regulatory element’s chromatin plasticity may be related to its function. Consistent with this, we observed that genes with greater steady state transcriptional outputs had markedly fewer single-cell differences in their promoter actuation status, with the highest expressing genes in GM24385 differing in their actuation status by only ~16% between cells (Fig. 6f).
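
The comparisons above are summarized as Jaccard distances between actuation patterns (Fig. 6d). As a rough illustration only, the sketch below computes a Jaccard distance between two fibers, assuming that each fiber is represented by the set of jointly covered regulatory elements that are actuated on it. The element identifiers and the set-based representation are assumptions made for this example; the exact procedure used in the study may differ.

```python
def jaccard_distance(actuated_a, actuated_b):
    """Jaccard distance between two sets of actuated elements.

    Both inputs should be restricted beforehand to elements that have
    sequencing coverage on both fibers being compared.
    """
    a, b = set(actuated_a), set(actuated_b)
    union = a | b
    if not union:
        # No element is actuated on either fiber: treat as identical.
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical element identifiers for two fibers.
fiber_1 = {"elem1", "elem2", "elem3"}
fiber_2 = {"elem2", "elem3", "elem4"}
distance = jaccard_distance(fiber_1, fiber_2)  # 1 - 2/4
```

Under this definition, elements that are closed on both fibers do not contribute to the distance, so the measure reflects disagreement among elements that are open on at least one of the two fibers.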

Finally, to determine whether this plasticity is mediated by cell-to-cell variation in the trans environment, we compared the actuation status of elements between the two haplotypes within the same cell. This revealed that on average, the actuation status of a regulatory element between both haplotypes within an individual GM24385 cell differs by ~61% (range 59–65%) (Fig. 6d), with only a minority of these elements representing sites with consistent haplotype-selective chromatin across GM24385 cells, as measured using bulk Fiber-seq (Supplementary Fig. 7). Overall, these findings demonstrate pervasive plasticity within the chromatin actuation status of a single cell’s regulatory landscape, and that this plasticity is directly related to the function of the chromatin epigenome. Furthermore, cell-to-cell differences in the abundance of trans-acting factors play only a minor role in governing the plasticity in the chromatin epigenome of a cell.

Codependent single-molecule chromatin actuation is largely limited to ~100 kb domains

3D genome folding plays an integral role in modulating gene regulatory patterns by bringing pairs of regulatory elements into close proximity34,35, thereby enabling elements to modulate their respective activity along the same chromatin fiber. Consistent with this model, bulk Fiber-seq measurements have shown that individual regulatory elements are preferentially co-actuated along the same chromatin fiber1. However, these measurements are limited by the sequencing lengths of Fiber-seq (i.e., 10–100 kb), and are confounded by cell-to-cell differences in the abundance of trans-acting factors, which can result in molecules appearing to have codependent actuation simply because they are present within cells that have divergent levels of trans-acting factors. We sought to leverage our chromosome-length single-cell and single-molecule measurements of TF occupancy and chromatin actuation to test whether regulatory elements are indeed preferentially actuated along the same chromatin fiber. To validate this approach, we first focused on CTCF loop anchors, as these by definition form along the same chromatin fiber, and are mechanistically mediated by CTCF binding sites arranged within a +/− asymmetric orientation (Fig. 7a)3638. Overall, we observed that CTCF loop anchors defined using ChIA-PET39 exhibit significantly higher single-molecule CTCF co-occupancy than expected (two-sided t-test, P < 2.2×10^−16). However, by leveraging the near single-nucleotide resolution of scDAF-seq we find that CTCF co-occupancy at loop anchors is largely limited to CTCF binding elements bound in the +/− asymmetric orientation (Fig. 7a,b), consistent with prior reports3638. Loop anchors with more 3D genome contacts show higher single-molecule CTCF co-occupancy, with the strongest 500 3D genome contacts exhibiting a median CTCF co-occupancy of 71% (Fig. 7c). Overall, this establishes that scDAF-seq can accurately measure single-molecule co-occupancy events related to 3D genome folding.

Figure 7 |. Chromosome-length single fiber co-actuation and protein co-occupancy.

(a) (top left) Diagram of chromatin loop formation. The N-terminal domains of CTCF bound to DNA in opposite orientations halts progression of the cohesin complex during loop extrusion. (center) Heatmap showing Micro-C chromatin interaction frequency in HFFc6 cells. (bottom) Example chromatin loop region identified by CTCF ChIA-PET with aggregate Fiber-seq data in bulk GM24385 cells, as well as single-molecule deamination patterns at CTCF sites within loop anchors. CTCF occupancy status is displayed for each ‘haplotype-strand’ consensus read covering each CTCF site. (b) Box plots showing the percentage of scDAF-seq chromatin fibers co-occupied by CTCF within loop anchor pairs from the top 500 ChIA-PET interactions, stratified by CTCF motif orientation (n = 395 +/−, n = 129 −/−, n = 117 +/+, n = 30 −/+ CTCF motif pairs). (c) Box plots showing scDAF-seq co-occupancy of CTCF sites in the +/− orientation at loop anchors as in panel b, binning loop anchor pairs by ChIA-PET strength (blue, bin1 n = 18, bin2 n = 63, bin3 n = 122, bin4 n = 192, bin5 n = 186, bin6 n = 183, bin7 n = 583, bin8 n = 999, bin9 n = 795) (see Extended Data Fig. 10d). CTCF co-occupancy at shuffled regions is shown in purple (n = 138,747). Boxes represent the median and interquartile range and whiskers extend to the farthest samples within 1.5× the interquartile range of the median. (d) Diagram showing how regulatory element codependency scores are calculated. (e) Difference in average single-molecule codependency between regulatory elements along the same chromatin fiber (haplotype-strand) and opposite haplotypes within the same cell, binned by genomic distance. Stars indicate bins with significantly increased codependency scores along the same chromatin fiber (one-sided t-test, P < 0.05).

To determine whether regulatory elements are preferentially co-actuated along the same chromatin fiber, we first calculated the expected co-actuation of any two elements based on their respective actuation percentages across the 12 cells with scDAF-seq data. We then quantified whether the observed single-molecule co-actuation at these elements differed from this expected co-actuation (i.e., the ‘codependency score’), with positive codependency scores indicating elements that are preferentially co-actuated along the same molecule (Fig. 7d). To determine whether these codependency scores simply reflected differences in the trans environment between the 12 cells, we calculated a ‘pseudo codependency score’ that uses the co-actuation status of the two elements on the different haplotypes within the same cell (Extended Data Fig. 10a), under the assumption that human cells do not utilize transvection. Controlling for these ‘pseudo codependency scores’, we observed that genome-wide, regulatory elements indeed are preferentially co-actuated along the same chromatin fiber in a distance dependent manner (Fig. 7e). However, preferential single-molecule co-actuation in GM24385 cells is limited to regulatory elements that are ~100 kb apart or closer (Fig. 7e, Supplementary Table 7), a distance that mirrors that of cohesin-mediated chromatin loops40 and is an order of magnitude smaller than that of topologically associating domains41. Overall, these findings demonstrate that regulatory elements can be preferentially co-actuated along the same chromatin fiber over distances consistent with chromatin loops.
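
The codependency score described above can be illustrated with a short sketch. The version below assumes that the expected co-actuation of two elements is the product of their marginal actuation fractions (i.e., independence) and that the score is the observed co-actuation minus this expectation. Both the formula and the function name are assumptions made for illustration; the exact calculation is defined in Fig. 7d and the Methods. For simplicity, the marginals here are computed from the same fibers that cover both elements.

```python
def codependency_score(fibers):
    """Observed minus expected co-actuation for a pair of elements.

    `fibers` is a list of (a_actuated, b_actuated) boolean pairs, one
    per fiber that covers both elements. The expectation assumes the
    two elements are actuated independently of one another.
    """
    n = len(fibers)
    if n == 0:
        raise ValueError("at least one fiber covering both elements is required")
    frac_a = sum(1 for a, _ in fibers if a) / n
    frac_b = sum(1 for _, b in fibers if b) / n
    observed = sum(1 for a, b in fibers if a and b) / n
    expected = frac_a * frac_b
    return observed - expected


# Hypothetical fibers on which the two elements always share the same state.
linked = [(True, True), (True, True), (False, False), (False, False)]
# Hypothetical fibers on which the two elements behave independently.
unlinked = [(True, True), (True, False), (False, True), (False, False)]

linked_score = codependency_score(linked)      # 0.5 - 0.25
unlinked_score = codependency_score(unlinked)  # 0.25 - 0.25
```

A ‘pseudo codependency score’ in the sense used above could be obtained from the same function by pairing the state of one element on one haplotype with the state of the other element on the opposite haplotype of the same cell.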

Discussion

We present DAF-seq for studying the structure and function of the non-coding genome, and leverage DAF-seq to reveal a comprehensive map of the diploid genome and chromatin epigenome from a single cell. As DAF-seq chromatin stencils are maintained upon DNA amplification, these deamination patterns can be used as a UMI to identify reads arising from the same modified DNA template. This is essential for identifying PCR duplicates and stitching PTA-produced reads together, thereby enabling true single-molecule footprinting across chromosome-length chromatin fibers, a ~10,000-fold improvement beyond what is currently possible with PCR-free chromatin stenciling methods1,4,5,7, and a ~1,000,000-fold improvement beyond what is possible with Tn5-based chromatin stenciling methods10,11. In addition, the non-sequence specific nature of SsDddA with our reaction conditions enables this method to have near single-nucleotide resolution, unlike methods that rely on enzymes with significant sequence preferences5,18. Furthermore, whereas short-read based deaminase stenciling methods are limited by biased read mapping issues10,11, pairing high-resolution deaminase stenciling with long-read sequencing permits 99.9% of our reads to be mapped back to the genome (Extended Data Fig. 10b), likely owing to the presence of unmodified DNA within nucleosome footprints that allow for seeding read mapping. Consequently, scDAF-seq enables the comprehensive mapping of the chromatin epigenome across nearly the entire genome of a single-cell in a haplotype-aware manner. This permits us to study single-fiber co-actuation and protein co-occupancy measurements of genomic elements located >200 Mb apart, a distance and resolution unobtainable with current technologies. 
Although scDAF-seq uses more sequencing per cell than traditional single-cell chromatin assays, we demonstrate that fundamental principles of gene regulation can be derived from just 12 cells, breaking the widely-accepted paradigm that high quality single-cell chromatin analyses require data from millions of cells. We anticipate that further reduction in the cost of long-read sequencing will permit scDAF-seq to capture the diversity of cells within a sample.

Using DAF-seq, we expose widespread heterogeneity in the primary chromatin architecture of individual regulatory elements at the single-molecule and single-cell level. Specifically, we observe critical chromatin transition states that illuminate TF and regulatory element cooperativity patterns (Fig. 2, Extended Data Fig. 5,6), patterns that to-date have largely necessitated time-consuming genome and/or epigenome editing methods to resolve42,43. As DAF-seq does not require Tn5-based enrichment, it can accurately measure diverse chromatin states, including quantifying regulatory elements present within a closed state. These features permitted us to comprehensively measure the accessible chromatin epigenome of a single cell, exposing that the accessible chromatin landscape of individual GM24385 cells can diverge by ~63% within a cell line. As scDAF-seq measures the number of haplotype-strands at each location along the genome, we can confidently state that all twelve of the cells assayed were in G1. These findings indicate that GM24385 cells are tolerant of widespread plasticity in their accessible chromatin epigenome. Furthermore, we leveraged the single-cell haplotype-resolved data to directly account for the impact of the cellular trans environment in this finding, showing that cell-to-cell differences in the cellular trans environment only modestly contribute to this heterogeneity. Together, these findings raise questions as to the dynamics underlying this heterogeneity, and how a cell stores an epigenetic memory that a specific regulatory element should become accessible within a population of GM24385 cells.

DAF-seq improves how we study the functional impact of somatic variants. Specifically, sequencing reads arising from the same DNA template can be readily identified and leveraged to construct a highly-accurate consensus sequence of the primary template, analogous to duplex sequencing methods44. In addition, DAF-seq can accurately distinguish SsDddA-induced deamination events from germline and somatic variants, permitting DAF-seq to accurately identify and quantify low VAF somatic variants using an economical experimental design (i.e., genomic PCR followed by amplicon sequencing using only a small fraction of a sequencing run per target). Furthermore, as DAF-seq is compatible with long-read sequencing, it can be leveraged to interrogate somatic variants in complex genomic regions. However, unlike ddPCR45 or other amplicon-based approaches, DAF-seq enables the simultaneous quantification of somatic genetic variants and their functional impact on chromatin patterns with improved resolution and throughput relative to existing targeted methods2 or genome-wide methods46,47. Critically, this is not possible with Tn5-based enrichment strategies10,11 as the functional impact of a variant on chromatin accessibility inherently disrupts the ability to directly quantify that variant’s VAF using the same sequencing reads.

We demonstrate that DAF-seq can be performed on frozen primary human tissue as well as cultured cells, and that DAF-seq chromatin stencils are compatible with all current short- and long-read sequencing platforms – positioning DAF-seq as an experimental tool for resolving the functional impact of the millions of genetic variants associated with both common48 and rare49 disease risk. In addition, we demonstrate that the high sequencing accuracy of PacBio HiFi sequencing coupled with the high fidelity and primary template preference provided by PTA enables scDAF-seq to generate highly accurate single-cell consensus sequences from each of the four ‘haplotype-strand’ templates within a single diploid cell. The process of generating consensus template sequences dramatically improves the read coverage within a single cell, readily exposing genomic locations with >4 ‘haplotype-strand’ templates that correspond to duplications in that cell relative to the reference. We show that scDAF-seq enables the generation of thousands of ultra-long consensus reads from a single cell, and the generation of these ultra-long consensus reads can be readily increased by simply sequencing the library from each cell to higher depths. These single-cell ultra-long consensus reads enable the evaluation of single-cell genomic and epigenomic variation within the most complex regions of the genome and lay the groundwork for potentially assembling complete telomere-to-telomere genomes and chromatin epigenomes from single cells.

Methods

Bacterial strains and culture conditions

All bacterial strains used in this study were grown in Lysogeny Broth (LB) at 37 °C or on LB medium solidified with agar (RPI, cat# L24030–100.0). Filter sterilized kanamycin (Gold Biotechnology, cat# K-120) (100 mg/L for plasmid propagation, or 30 mg/L for protein expression), and IPTG (ThermoFisher, cat# R0393) (0.5 mM) were added to culture when necessary. E. coli strains DH5α (NEB, cat# C2987H) and BL21(DE3) (NEB, cat# C2527H) were used for cloning and producing plasmids, and protein expression respectively.

Cloning and purification of SsDddA

The genes for SsDddA WT, SsDddA5, and the corresponding immunity protein (SsDddI), which is required for SsDddA purification, were codon optimized for E. coli expression and synthesized by IDT as gBlocks with corresponding restriction enzyme recognition sites flanking each end. SsDddI was inserted between NdeI and XhoI, and WT SsDddA or SsDddA5 was inserted between NcoI and NotI with an N-terminal 6xHis tag in the vector pColADuet-1 (LifescienceMarket, #PVT0105). The deaminases were cloned into the vector after the immunity protein was successfully cloned into the vector. The whole plasmid sequence was confirmed by PlasmidSaurus.

The purification of deaminases was performed as previously described12 with the following modifications: protein expression was induced at 16 °C overnight, and cells were resuspended in Ni-NTA Buffer A (50 mM Tris-HCl, pH 7.8, 600 mM NaCl, 10% Glycerol, 10 mM 2-Mercaptoethanol, 0.1% Triton X-100) with protease inhibitor cocktail (Thermo Scientific, cat#PIA32955). The cell lysate was loaded on a HisTrap HP His tag protein purification column (5 mL, Cytiva # 17524801) and the deaminase-immunity protein complex was eluted during a gradient from 100% Ni-NTA Buffer A to 100% Ni-NTA Buffer B (50 mM Tris-HCl, pH 8, 600 mM NaCl, 500 mM imidazole, 10% Glycerol) using the NGC Quest 10 Plus Chromatography System (Bio-Rad, #7880003) and the fractions of corresponding A280 peaks were pooled. The major peak eluted during the gradient was pooled, verified using SDS-PAGE, and collected for the denaturing and renaturing steps to separate the immunity protein and the deaminase. The pooled protein complex samples were added to denaturing buffer (50 mM Tris-HCl pH 7.8, 20 mM imidazole, 500 mM NaCl, 6 M guanidine HCl, prepared from Guanidine-HCl (ThermoFisher, cat# 24110), and 5 mM 2-Mercaptoethanol) at 1:10 (v:v) ratio and incubated overnight. The mixture was then loaded back to the 5 mL HisTrap column, with 50 mL of denaturing buffer. The deaminase was renatured during a gradient from 100% denaturing buffer to 100% renaturing buffer (50 mM Tris-HCl pH 7.8, 500 mM NaCl, 10 μM ZnCl2, and 10 mM 2-Mercaptoethanol) at 1 mL/min and washed with an additional 50 mL of renaturing buffer. The renatured deaminase was further purified using a HiLoad Superdex 200 pg preparative SEC column (120 mL, Cytiva # 28989335) with storage buffer (50 mM Tris-HCl pH 7.8, 500 mM NaCl, 10 μM ZnCl2, 1 mM DTT, and 10% Glycerol). The fractions were evaluated by SDS-PAGE, and the purest fractions were pooled, aliquoted into 20 μL stocks, flash frozen with liquid nitrogen, and stored at −80 °C. 
All Tris-based buffers were prepared from 1 M Tris-HCl (pH8, molecular biology grade ultrapure, ThermoScientific, cat# J22638-K2).

Mass Spectrometry validation of SsDddA activity

SsDddA activity was validated via quantification of deoxycytidine deamination by UHPLC-MS/MS. Samples for quantification were treated as previously described with modifications (Kong et al. 2022). In brief, 2 ng/mL of stable isotope-labeled 2-deoxycytidine triphosphate (MilliporeSigma, cat# 646229) and 2-deoxyadenosine triphosphate (MilliporeSigma, cat# 646237) were used as references and added to 50 ng of DNA from each sample before any mass spectrometry sample preparation. The DNA from each sample was mixed with 0.02 U phosphodiesterase I (Worthington, cat# LS003926), 1 U Benzonase (MilliporeSigma, cat# E1014), and 2 U Quick CIP (NEB, cat# M0525S) in digestion buffer (10 mM Tris, 1 mM MgCl2, pH 8 at RT) for 3 hours at 37°C, with a total reaction volume of 50 μL. Single nucleotides were separated from the enzymes by collecting the flow-through of a Nanosep centrifugal filter (MWCO 3 kDa, Pall, cat# OD003C33). The UHPLC-MS/MS analysis of cytosine and adenosine was performed on an ACQUITY Premier UPLC System coupled with a XEVO-TQ-XS triple quadrupole mass spectrometer. UPLC was performed on a ZORBAX Eclipse Plus C18 column (2.1 × 50 mm I.D., 1.8 μm particle size) (Agilent, cat# 959757–902) using solvent A consisting of 0.1% ammonium hydroxide in 100% acetonitrile (v/v) and solvent B consisting of 0.1 M ammonium acetate in water, with the following gradient at 0.3 mL/min: 0–1 min 100% A, 1–6 min 100–30% A and 0–70% B, 6–7 min 30–5% A and 70–95% B, 7–8 min 5–100% A and 95–0% B, 8–10 min 100% A. MS/MS analysis was operated in positive ionization mode with 3000 V capillary voltage as well as 350°C and 1000 L/hour nitrogen drying gas. 
A multiple reaction monitoring (MRM) mode was adopted with the following m/z transitions: 227.9 -> 94.82, 227.9 -> 98.98, 227.9 -> 111.98, and 227.9 -> 116.99 for dC (collision energies 32, 18, 6, and 12 eV, respectively); 238.9 -> 100.8 and 238.9 -> 118.9 for isotope-labeled deoxycytidine (collision energies 32 and 6 eV, respectively); and 252.10 -> 136.09 for dA (collision energy 14 eV). The 267.1 -> 146.1 transition for isotope-labeled dA (collision energy 14 eV) was also monitored as a control. MassLynx was used to quantify the data. All reagents used for mass spectrometry analysis were molecular grade or above.

Bulk Whole-Genome Amplification (WGA) DAF-seq

2 million K562, HG002, or GM12878 cells were permeabilized as previously described1, except that permeabilized cells were resuspended in freshly prepared Buffer C (15 mM Tris, pH 8.0; 15 mM NaCl; 60 mM KCl; 1 mM EDTA, pH 8.0; 0.5 mM EGTA, pH 8.0; 0.5 mM Spermidine, 10 nM ZnCl2). The permeabilized cells were treated with 0.25 μM WT SsDddA or SsDddA5 at 25 °C for 10 min. The reaction was quenched with 5% SDS (ThermoFisher, cat# AM9820) before gDNA extraction using HMW DNA Extraction kit (Promega, cat# A2920). The gDNA was then subjected to whole genome amplification with REPLI-G Mini kit (Qiagen, cat#150023) according to the manufacturer’s protocol before being prepared for sequencing.

Nuclei isolation

GM12878, COLO829BL, and COLO829T cell lines were permeabilized using a digitonin containing isotonic buffer. Briefly, we added 800,000–1,000,000 cells per sample to a 1.5 mL tube (Eppendorf, 022363204) and centrifuged at 400g for 5 min at 4°C. The supernatant was removed, and the cell pellet was resuspended in 100 μL of chilled isotonic Perm Buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 3 mM MgCl2, 0.05% digitonin) by pipette-mixing 10 times. Cells were incubated on ice for 5 min, after which they were diluted with 1 mL of isotonic Wash Buffer (20 mM Tris-HCl pH 7.4, 150 mM NaCl, 3 mM MgCl2, 10 nM ZnCl2) by pipette-mixing five times. Cells were centrifuged at 400g for 5 min at 4°C and the supernatant was removed. The cell pellet was resuspended in chilled isotonic Wash Buffer. Cells were counted using a Cellometer Spectrum Cell Counter (Nexcelom) using ViaStain acridine orange/propidium iodide solution (Nexcelom, C52–0106-5).

Tissue Preparation

Liver, heart, and colon tissues were separately homogenized using a Dounce homogenizer with ten strokes using pestle A, followed by ten strokes using pestle B. 1.5 mL of cold homogenization buffer was added to the sample, which was then filtered through a 70-micron filter. Cells were counted using a Cellometer Spectrum Cell Counter (Nexcelom) using ViaStain acridine orange/propidium iodide solution (Nexcelom, C52–0106-5). 200,000 cells were added to a 1.5 mL tube (Eppendorf, 022363204) and centrifuged at 500g for 5 min at 4°C. The supernatant was removed and the cell pellet was resuspended in 60 μL of chilled Buffer A with ZnCl2 (15 mM Tris, pH 8.0; 15 mM NaCl; 60 mM KCl; 1 mM EDTA, pH 8.0; 0.5 mM EGTA, pH 8.0; 0.5 mM Spermidine, 10 nM ZnCl2) by pipette-mixing, followed by addition of 60 μL of 2X lysis buffer (0.025% IGEPAL, 15 mM Tris, pH 8.0; 15 mM NaCl; 60 mM KCl; 1 mM EDTA, pH 8.0; 0.5 mM EGTA, pH 8.0; 0.5 mM Spermidine, 10 nM ZnCl2). Cells were incubated on ice for 10 minutes and nuclei were counted using a Cellometer Spectrum Cell Counter (Nexcelom) using ViaStain acridine orange/propidium iodide solution (Nexcelom, C52–0106-5). ~100,000 nuclei were used as input to the SsDddA reaction.

DddA Reactions

All targeted DAF-seq reactions utilized SsDddA WT. Enzyme optimization experiments targeted NAPA and WASF1 and treated 100,000 GM12878 permeabilized cells at 25°C for 10 minutes and 20 minutes, at enzyme concentrations of 0.25 μM, 1 μM, and 4 μM. All subsequent SsDddA reactions were performed at 25°C for 10 minutes with a 4 μM enzyme concentration. Reactions were neutralized by the addition of 20% sodium dodecyl sulfate (SDS) to a final concentration of 5%. Genomic DNA was extracted using the Monarch Genomic DNA Purification Kit (New England Biolabs, T3010S) and quantified using a Qubit 1X dsDNA High-Sensitivity kit (Invitrogen, Q33231).

PCR Amplification of SsDddA-treated genomic DNA

Regions-of-interest were amplified from SsDddA-treated genomic DNA in 50 μL reactions consisting of 25 μL LongAmp Hot Start Taq 2X Master Mix (New England Biolabs, M0533L), 2 μL each of 10 μM forward and reverse primers (Supplementary Table 8), and 30–100 ng of gDNA. PCR conditions: 94°C for 60s, 35 cycles of 94°C for 30s, 30s annealing, and 65°C extension, followed by a final extension of 65°C for 10 min. Annealing temperatures and extension times varied by target (Supplementary Table 8). Amplicons were purified using the Monarch PCR & DNA Cleanup Kit (New England Biolabs, T1030L). In general, primers were designed to target AT-rich regions and to lie at least 500 bp away from any actuated regulatory element. A pool of primers was used for amplification, which included primers with the random incorporation of A or G at C-complementary positions. In general, primers were screened to avoid common polymorphisms and repeat elements. Primer pairs were tested for amplification of a single band of the predicted size after gel electrophoresis, using gDNA from DAF-seq treated nuclei. After sequencing, primer pairs were evaluated for biases in terms of their amplification of ‘top-strand’ or ‘bottom-strand’ fibers, and only those pairs with comparable amplification of both strands were used for downstream applications.

Library preparation and sequencing

Libraries were prepared as previously described52. Multiplexed library preparation was performed using the SMRTbell prep kit 3.0 (PacBio, cat#102–141-700) and SMRTbell barcoded adapter plate 3.0 (PacBio, cat#102–009-200). Final sequencing libraries were sequenced on the PacBio Revio platform using v3.2 chemistry.

DAF-seq alignment and preprocessing

PacBio HiFi reads were converted to FastQ format using samtools fastq53 (v1.17, parameters: -T) and aligned to hg38 using minimap2 (v2.22-r1101, parameters: --MD -Y -y -a -x map-pb). C-to-T and G-to-A changes were identified by comparing the sequencing read and the hg38 reference using pysam (v0.21.0, https://github.com/pysam-developers/pysam). Secondary and supplementary alignments were filtered out. The designation of the original SsDddA modified DNA strand as “top” or “bottom” was identified by quantifying C-to-T and G-to-A changes. Reads with at least 90% of these changes being either C-to-T or G-to-A were classified as “CT” or “GA”, respectively, and retained for subsequent analyses. This cutoff allowed us to accurately assign read strands in the presence of germline or somatic variants. C-to-T and G-to-A changes were converted to the IUPAC DNA ambiguity codes Y and R in CT and GA reads, respectively. Modified reads were realigned to hg38 using the same parameters. Identification of modification sensitive patches (MSPs) and the generation of nucleosome and accessibility pileups was done using the fibertools52 commands ddda-to-m6a (v0.6.4, default parameters) and add-nucleosomes (v0.6.4, parameters: -n 60, -c 70, --min-distance-added 15, -d 10). MSPs > 150bp in length were identified as actuated elements.
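
The strand-assignment and recoding steps described above can be sketched as follows. This is a simplified illustration that assumes the alignment has already been reduced to a list of (reference base, read base) pairs at matched positions; how those pairs are extracted from the alignment is outside the sketch. The function names are hypothetical, and this is not the code used in the study.

```python
def classify_strand(pairs, min_fraction=0.9):
    """Classify a read as 'CT' (top) or 'GA' (bottom), or None if ambiguous.

    `pairs` is a list of (ref_base, read_base) tuples at aligned positions.
    A read is classified only if at least `min_fraction` of its C-to-T
    plus G-to-A changes are of a single type.
    """
    ct = sum(1 for ref, read in pairs if ref == "C" and read == "T")
    ga = sum(1 for ref, read in pairs if ref == "G" and read == "A")
    total = ct + ga
    if total == 0:
        return None  # no deamination-compatible changes observed
    if ct / total >= min_fraction:
        return "CT"
    if ga / total >= min_fraction:
        return "GA"
    return None


def mask_deaminations(pairs, strand):
    """Recode C-to-T changes as Y (CT reads) or G-to-A changes as R (GA reads)."""
    recoded = []
    for ref, read in pairs:
        if strand == "CT" and ref == "C" and read == "T":
            recoded.append("Y")
        elif strand == "GA" and ref == "G" and read == "A":
            recoded.append("R")
        else:
            recoded.append(read)
    return "".join(recoded)


# Toy alignment: reference ACCGTC, read ATTGTT (three C-to-T changes).
toy_pairs = list(zip("ACCGTC", "ATTGTT"))
toy_strand = classify_strand(toy_pairs)
toy_recoded = mask_deaminations(toy_pairs, toy_strand)
```

The 90% threshold tolerates a small number of opposite-type changes, which is what allows reads carrying genuine germline or somatic C>T or G>A variants to be assigned to a strand.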

DAF-seq Target Enrichment

The proportion of total primary read alignments that mapped within the target region was calculated for both targeted DAF-seq and Fiber-seq data from the same cell line (GM12878). DAF-seq target enrichment was calculated as the proportion of on-target reads in DAF-seq over Fiber-seq.
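
As a worked illustration of this ratio, the sketch below computes target enrichment from four read counts. The function name and the counts are hypothetical and serve only to make the arithmetic explicit.

```python
def target_enrichment(daf_on_target, daf_total, fiber_on_target, fiber_total):
    """Ratio of the on-target read proportion in DAF-seq to that in Fiber-seq."""
    if daf_total <= 0 or fiber_total <= 0:
        raise ValueError("total read counts must be positive")
    if fiber_on_target <= 0:
        raise ValueError("Fiber-seq needs at least one on-target read")
    daf_fraction = daf_on_target / daf_total
    fiber_fraction = fiber_on_target / fiber_total
    return daf_fraction / fiber_fraction


# Hypothetical counts: 90% of targeted DAF-seq reads are on target,
# versus 0.1% of Fiber-seq reads from the same cell line.
enrichment = target_enrichment(900, 1_000, 1, 1_000)  # about 900-fold
```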

Targeted DAF-seq multi-region benchmarking

Targeted DAF-seq was performed as described above using the primers and PCR conditions described in Supplementary Table 9. Purified PCR products were sequenced using the Oxford Nanopore platform. Reads were aligned to hg38 using minimap2 with the “map-ont” preset. Deamination events and read strands were identified as described above. For each library and strand, outlier reads with deamination proportions above or below 1.5 times the interquartile range were filtered out. MSPs were identified using fibertools52 add-nucleosomes (v0.6.4, parameters: -n 55, -c 65, --min-distance-added 5).
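The outlier filter can be sketched as follows. The text does not state how the 1.5 × IQR bounds were anchored, so this example assumes conventional Tukey fences (Q1 − 1.5 × IQR and Q3 + 1.5 × IQR) applied to the per-read deamination proportions of one library and strand; the function name is hypothetical.

```python
import statistics


def filter_iqr_outliers(proportions, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (assumed Tukey fences)."""
    q1, _, q3 = statistics.quantiles(proportions, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in proportions if low <= p <= high]
```

The quartile estimator used here is the default of `statistics.quantiles`; a different estimator would shift the fences slightly.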

PCR Duplicate Identification

DAF-seq PCR duplicate reads were identified by comparing the deamination status of every position susceptible to deamination (each C on top strand reads, each G on bottom strand reads). This combination of deamination statuses was treated as a unique identifier and all reads sharing this identifier were grouped. One read from each group was randomly selected to be marked as unique while the remaining reads in each group were marked as duplicates (Supplementary Fig. 3a,c). Unique GM12878 and Liver tissue reads identified in this manner were randomly selected for SLC39A4 analyses (Fig. 4). For comparison, duplicate reads were separately identified using the PacBio tool pbmarkdup (v1.0.2, default parameters), which grouped reads with 98% sequence identity at the first and last 500 bp (Supplementary Fig. 3b).
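As an illustration of the signature-based grouping, the sketch below treats each read's deamination pattern as a hashable key. The input layout (a dictionary from read name to a tuple of 0/1 calls at every susceptible position) and the function name are assumptions made for this example.

```python
import random
from collections import defaultdict


def mark_duplicates(reads, seed=0):
    """reads: dict of read_id -> tuple of 0/1 deamination calls.

    Returns (unique_ids, duplicate_ids) as two sets.
    """
    groups = defaultdict(list)
    for read_id, pattern in reads.items():
        groups[tuple(pattern)].append(read_id)  # identical patterns share a key
    rng = random.Random(seed)
    unique, duplicates = set(), set()
    for ids in groups.values():
        keep = rng.choice(ids)  # one randomly chosen representative per group
        unique.add(keep)
        duplicates.update(i for i in ids if i != keep)
    return unique, duplicates
```

Because the deamination pattern of a long amplicon is very unlikely to be reproduced by independent molecules, exact matching acts as an intrinsic molecular identifier.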

CpG Methylation Analysis

NAPA and UBA1 regions were amplified for 30 PCR cycles as described above (Supplementary Table 8) using 70 ng of HEK293 genomic DNA as template. 1 μg of amplicons from each target was treated with M.SssI CpG methyltransferase (New England Biolabs, M0226S) per the vendor’s recommendations. 600 ng of M.SssI treated and untreated amplicons were treated with SsDddA for 10 minutes at 25°C. DNA treated with SsDddA and M.SssI, and M.SssI only (5mC negative control) were re-amplified for 20 PCR cycles. These two samples along with M.SssI treated DNA (5mC positive control) were sequenced using PacBio HiFi as described above. Amplicons were purified after each PCR using the Monarch PCR & DNA Cleanup Kit (New England Biolabs, T1030L).

Sequencing reads from each target were aligned to reference fasta files containing only the target region. Primary alignments from the M.SssI + SsDddA treatment beginning and ending within 100bp of the region boundaries were used in analysis. Genomic positions were grouped into CpG and non-CpG cytidine (or guanosine for bottom strand) categories, and the deamination occurrences at each position within each read were aggregated. Positions within 28bp of the target region boundaries overlapped priming sites and were omitted from analysis. CpG methylation of the 5mC positive control was quantified using pb-CpG-tools (aligned_bam_to_cpg_scores, v2.3.1, https://github.com/PacificBiosciences/pb-CpG-tools).

Deamination Motif Analysis

Sequence bias in SsDddA activity was evaluated for GM12878 WGA realigned data using pysam (v0.21.0, https://github.com/pysam-developers/pysam), considering only primary alignments. For each deaminated base, denoted in the read sequence by the ambiguity codes Y and R, the seven base (7mer) reference sequence centered on the deaminated base was tracked. Sequences originating from GA reads were reverse-complemented to orient the 7mer to the cytidine context. All 7mers were combined to generate a position weight matrix (PWM) which was used to generate a sequence logo (Logomaker v0.8)54.
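The 7mer collection and position weight matrix can be sketched as below. The example assumes gap-free alignments so that read and reference indices coincide, and it stops at the frequency matrix (the sequence logo itself was drawn with Logomaker). All function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers_at_deaminated(read_seq, ref_seq, strand, flank=3):
    """Reference 7mers centered on each Y (CT reads) or R (GA reads)."""
    code = "Y" if strand == "CT" else "R"
    kmers = []
    for i, base in enumerate(read_seq):
        if base == code and i - flank >= 0 and i + flank < len(ref_seq):
            kmer = ref_seq[i - flank : i + flank + 1]
            # GA reads are flipped so the central base is always a cytidine
            kmers.append(kmer if strand == "CT" else revcomp(kmer))
    return kmers


def position_weight_matrix(kmers):
    """Per-position base frequencies as a list of {base: frequency} dicts."""
    matrix = []
    for j in range(len(kmers[0])):
        counts = Counter(k[j] for k in kmers)
        total = sum(counts.values())
        matrix.append({b: counts.get(b, 0) / total for b in "ACGT"})
    return matrix
```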

Transcription Factor Footprinting and Codependency

DAF-seq footprint density was calculated as the density of all single-molecule stretches of three consecutive non-deaminated cytidines within the NAPA promoter. We used FIMO55 to scan the hg38 reference sequence of an applicable region for transcription factor motifs contained within the JASPAR database 56 and filtered the results for motifs with a q-value <= 0.05. We then used Fibertools52 footprint to identify single-fiber TF footprints for each motif. We filtered motifs to those footprinted on >= 5% of fibers on both top (CT) and bottom (GA) strands. Footprinted motifs that overlapped either motif by 80% were merged using Bedops57 (v2.4.41), and the resulting elements that overlapped by 90% were merged to combine elements encompassed entirely within a larger element. We quantified occupancy within each merged element as the proportion of fibers footprinted at a motif contained within the merged element. TF co-occupancy and codependency were calculated as described previously22. Briefly, for each pair of footprinted elements, we calculated the expected co-occupancy as the product of their proportion of fibers bound, while the observed co-occupancy was calculated as the proportion of fibers with an accessible element (MSP > 150 bp) spanning both elements and bound at each element. We quantified the essentiality of each element by constructing codependency graphs, with nodes representing TF elements and edge weights representing codependency scores. We constructed graphs that omitted individual elements and limited each analysis to fibers accessible but unbound at that respective element. We also constructed a baseline codependency graph containing all elements. We quantified the total codependency of each graph by summing all edge weights and normalizing by the number of edges. 
Finally, we calculated essentiality as the ratio of total codependency of the baseline graph over element-excluded graphs, with a higher ratio indicating a larger reduction in codependency following the loss of TF binding at that element.
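A minimal sketch of the pairwise calculation follows. The exact codependency score is defined in the cited prior work; here it is assumed to be the observed co-occupancy minus the expected co-occupancy. The sketch also simplifies the observed term by omitting the requirement that a single MSP > 150 bp spans both elements. Function names and the input layout are hypothetical.

```python
from itertools import combinations


def codependency(fibers, elements):
    """fibers: list of dicts mapping element -> bool (bound on that fiber).

    Returns {(a, b): observed - expected} for every element pair.
    """
    n = len(fibers)
    bound_fraction = {e: sum(f[e] for f in fibers) / n for e in elements}
    scores = {}
    for a, b in combinations(elements, 2):
        observed = sum(f[a] and f[b] for f in fibers) / n
        expected = bound_fraction[a] * bound_fraction[b]
        scores[(a, b)] = observed - expected  # assumed form of the score
    return scores


def total_codependency(scores):
    """Sum of edge weights normalized by the number of edges."""
    return sum(scores.values()) / len(scores)
```

Under this sketch, essentiality for an element would be the ratio of `total_codependency` for the baseline graph to that of the graph built without the element, using only fibers that are accessible but unbound at it.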

Thermodynamic analysis of the NAPA promoter.

We used a thermodynamic formalism that relates the frequencies of DAF-seq reads representing the different TF-bound states of the NAPA promoter to the free energies of the TF-DNA and TF-TF interactions that occur on the NAPA promoter20,21. We assumed that the NAPA promoter is close to equilibrium and that therefore the reads representing each of the protein-bound states will follow the Boltzmann Distribution. We used the unbound state, the state in which none of the footprints are occupied, as the reference state in all of our analyses. Thus, in these analyses each ΔG is a unitless value that relates the frequency of a protein-bound state relative to the state in which no proteins are bound to NAPA. Because footprint 11 (the CTCF site) was occupied more than 99% of the time, we excluded this footprint from our analyses because we could not compute a reliable value for its interaction with DNA nor would we be able to detect interactions between CTCF and the other footprints. The custom Python script we wrote for this analysis is available on GitHub.

Analysis of individual TF-DNA interactions.

We first computed the change in free energy of each singly bound TF state relative to the unbound reference state. The Boltzmann Factor relates the relative probabilities of a TF bound state and the unbound reference state to the change in free energy between the states as,

$$\frac{P(\mathrm{TF}_i\ \text{bound state})}{P(\text{unbound state})} = e^{-\Delta G_i / RT}$$

In all analyses we ignored the RT term because it factors out of the analyses. The ratio of the probability of a singly bound TF state to the unbound state can be computed directly from DAF-seq data as the ratio of the read counts representing each state,

$$\frac{P(\mathrm{TF}_i\ \text{bound state})}{P(\text{unbound state})} = \frac{\text{read count of } \mathrm{TF}_i \text{ bound state}}{\text{read count of unbound state}}$$

The change in free energy associated with each singly bound TF state is then computed as,

$$\Delta G_i = -\ln\left(\frac{\text{read count of } \mathrm{TF}_i \text{ bound state}}{\text{read count of unbound state}}\right)$$

Analysis of two-way and three-way TF interactions.

We calculated ΔGij values for each pairwise interaction between bound TFs on the NAPA promoter using

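$$\frac{P(\mathrm{TF}_{ij}\ \text{bound state})}{P(\text{unbound state})} = \frac{\text{read count of } \mathrm{TF}_{ij} \text{ bound state}}{\text{read count of unbound state}} = e^{-(\Delta G_i + \Delta G_j + \Delta G_{ij})}$$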

The value of each ΔGij was then computed from the observed read counts of the TFij bound state and the previously computed values of ΔGi and ΔGj as,

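$$\Delta G_{ij} = -\ln\left(\frac{\text{read count of } \mathrm{TF}_{ij} \text{ bound state}}{\text{read count of unbound state}}\right) - \Delta G_i - \Delta G_j$$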

Likewise, the values for ΔGijk were computed as,

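$$\Delta G_{ijk} = -\ln\left(\frac{\text{read count of } \mathrm{TF}_{ijk} \text{ bound state}}{\text{read count of unbound state}}\right) - \Delta G_i - \Delta G_j - \Delta G_k - \Delta G_{ij} - \Delta G_{ik} - \Delta G_{jk}$$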
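The single, pairwise, and three-way terms share one pattern: the total free energy of a state minus every lower-order term contained in it. The sketch below illustrates this with read counts keyed by the set of occupied footprints. It is an independent illustration rather than the authors' script; the `frozenset` keying and function names are assumptions, and it assumes every lower-order state has a nonzero read count.

```python
import math


def total_dg(count_state, count_unbound):
    """Free energy of a state relative to the unbound reference (RT omitted)."""
    return -math.log(count_state / count_unbound)


def interaction_terms(counts):
    """counts: dict of frozenset(bound footprints) -> read count.

    frozenset() is the unbound reference state. Returns the dG term for every
    bound state with one to three occupied footprints.
    """
    unbound = counts[frozenset()]
    terms = {}
    states = sorted((s for s in counts if 1 <= len(s) <= 3), key=len)
    for state in states:
        # subtract all previously computed terms for proper subsets
        lower = sum(v for s, v in terms.items() if s < state)
        terms[state] = total_dg(counts[state], unbound) - lower
    return terms
```

As a check on the arithmetic: with 1,000 unbound reads, 100 reads bound only at footprint 1, 200 bound only at footprint 2, and 80 bound at both, independence predicts 1,000 × 0.1 × 0.2 = 20 doubly bound reads, so the fourfold excess gives ΔG12 = −ln 4 ≈ −1.39.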

To test for the significance of the ΔGij values we tested whether the observed read count of the TFij bound state was significantly higher or lower than the expected read count of the TFij bound state assuming that ΔGij=0. To perform this test, we used the Binomial Distribution where

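$$np = \text{read count of unbound state} \times e^{-(\Delta G_i + \Delta G_j)}$$

$$n = \text{counts}_{\text{unbound}} + \text{counts}_i + \text{counts}_j + \text{counts}_{ij}$$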

Because n is large, we used the Normal Distribution as an approximation for the Binomial where

$$z = \frac{\text{observed counts}_{ij} - np}{\sqrt{np(1-p)}}$$

We used a Bonferroni-corrected P-value threshold for the z-scores corrected for the number of individual ΔGij values being tested (i.e., 45). Likewise, for testing the significance of ΔGijk values we performed two-sided binomial tests assuming ΔGijk=0 where

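$$np = \text{read count of unbound state} \times e^{-(\Delta G_i + \Delta G_j + \Delta G_k + \Delta G_{ij} + \Delta G_{ik} + \Delta G_{jk})}$$

$$n = \text{counts}_{\text{unbound}} + \text{counts}_i + \text{counts}_j + \text{counts}_k + \text{counts}_{ij} + \text{counts}_{ik} + \text{counts}_{jk}$$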

We used a Bonferroni-corrected P-value threshold corrected for the number of individual ΔGijk values being tested (i.e., 120).
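The pairwise significance test can be sketched as follows, using the normal approximation described above. The two-sided P value is obtained from the complementary error function; the function name is hypothetical, and the family-wise significance level is not given in this section, so any threshold applied to the returned P value is the user's choice.

```python
import math


def pair_z_test(c_unbound, c_i, c_j, c_ij):
    """z-score and two-sided P value for the null hypothesis dG_ij = 0."""
    dg_i = -math.log(c_i / c_unbound)
    dg_j = -math.log(c_j / c_unbound)
    expected = c_unbound * math.exp(-(dg_i + dg_j))  # this is np
    n = c_unbound + c_i + c_j + c_ij
    p = expected / n
    z = (c_ij - expected) / math.sqrt(expected * (1 - p))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

With 45 pairwise tests, a Bonferroni-corrected threshold is the chosen significance level divided by 45. For counts of 1,000 (unbound), 100, 200, and 80 (doubly bound), the expected doubly bound count is 20 and the z-score is approximately 13.5.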

Identification of heterozygous positions

Top and bottom strand base calls were counted at each genomic position within each target region for each primary read alignment. Positions within 25bp of the target region boundaries overlapped priming sites and were omitted from analysis. Top strand and bottom strand base call proportions were plotted in R using the ggplot2 hexbin package (parameters: bins = 30).

UBA1 Transcription Start Site Co-actuation and Codependency

UBA1 bottom strand reads were assigned haplotypes according to base calls at chrX:47,194,331 (rs56269549). The four UBA1 transcriptional start sites (TSSs) were identified previously using full-length transcript data26. MSPs > 150bp long were intersected with each TSS using Bedops57 (bedmap, parameters: --ec --fraction-ref 0.8 --echo --echo-map-id, v2.4.41). TSS co-actuation was calculated as the proportion of fibers with actuated elements overlapping both TSSs. Codependency was calculated as above using TSS co-actuation instead of TF co-occupancy. Codependent protein occupancy was compared between the Xa and Xi using a two-sided paired t-test (P = 0.22, n = 14, t = 1.28, df = 14).

DAF-seq Clustering

Liver and GM12878 reads were assigned haplotypes according to base calls at chr8:144,416,180 (rs2280839). Reads from each sample with a Hamming distance of 3 or less were identified as duplicates and one read from each group was randomly selected as a unique read. The deamination status at each applicable genomic position was recorded and used as a feature for clustering DAF-seq reads. Clustering analysis was performed using Scanpy (v1.10.3). Briefly, 5000 unique reads from each haplotype of each tissue were randomly selected. A neighborhood graph was computed (pp.neighbors, parameters: n_pcs=0, n_neighbors=200) and visualized with Uniform Manifold Approximation and Projection (UMAP). Clustering was performed using the Leiden algorithm (tl.leiden, parameters: flavor=“igraph”). Clusters containing fewer than 1000 fibers were removed. Fibers from cluster 1 were re-clustered using the same parameters. Sub-clusters containing fewer than 200 fibers were removed. Liver tissue fibers from cluster 1 were 2.0-fold more likely to be focally actuated at rs2280838 in the rs2280838-T haplotype than the rs2280838-C haplotype (one-sided Fisher’s exact test, P = 5.4 × 10−10, n = 606, 95% CI = 0 – 0.43).

COLO829 cell mixture

Pure populations of COLO829BL cells (cat# CRL-1980, lot# 70022927) and COLO829 (referred to as COLO829T) cells (cat# CRL-1974, lot# 70024393) were obtained from ATCC and expanded. All cells were grown at 37°C with 5% CO2. COLO829BL lymphoblastic suspension cells were grown in RPMI-1640 (Fisher cat# 11875093) with 10% fetal bovine serum (FBS, Fisher cat# 10082147) shaking at 90 rpm. COLO829T adherent melanoma cells were grown in RPMI-1640 10% FBS. Both cell lines were expanded and cryopreserved in freezing media consisting of RPMI-1640 10% FBS with 10% DMSO (Sigma Aldrich cat# D2650–100ML) at a cell density of 3 million viable cells per mL. Cryopreserved COLO829T and COLO829BL cells were thawed and mixed in a 1:49 ratio using viable cell counts based on Countess cell counter readings (Invitrogen Thermo Fisher) with trypan blue (Fisher cat# 15250061). Cells were aliquoted into cryovials with constant swirling to maintain the homogeneity of the mix. Mixed cells were cryopreserved at a cell density of 2.55 million viable cells per mL.

Single-cell Fluorescence Activated Cell Sorting (FACS)

DddA-treated cells were sorted on a BD FACSAriaII using sequential gating on forward-scatter (FSC) and side-scatter (SSC) to detect doublets and debris. Individual cells were sorted into wells of a 96-well plate containing 3 μl of Cell Buffer (BioSkryb Genomics, 100183). Cells were immediately flash frozen on dry ice.

Single-cell PTA Whole Genome Amplification

After dispensation of individual nuclei into independent plate wells, the ResolveServices (SM) team performed a custom protocol derived from the Services Custom PacBio Long Read Amplification (BioSkryb Genomics 101157) assay. Briefly, the ResolveDNA(TM) workflow was customized to increase amplicon size. After 2.5 hours of isothermal amplification, samples were quantified using Qubit HS DNA (Thermofisher Q33231) and shipped to the Stergachis laboratory for library preparation and sequencing.

Single-cell read collapsing

PacBio HiFi reads were converted to FastQ format using Samtools fastq53 (v1.17, parameters: -T) and aligned to hg38 using minimap2 (v2.22-r1101, parameters: --MD -Y -y -a -x map-pb). Soft-clipped bases were trimmed from the alignments to remove chimeric portions of each read. The original template strand of clipped reads was identified as “top” or “bottom” strand by quantifying the proportion of C-to-T and G-to-A changes relative to the hg38 reference using pysam (v0.21.0, https://github.com/pysam-developers/pysam). Primary alignments in which at least 90% of these changes were either C-to-T or G-to-A were classified as “CT” or “GA”, respectively, and retained for subsequent analyses.

The hg38 reference genome was partitioned into overlapping 150 bp bins with a 25 bp sliding window. For each bin, the sequence Hamming distance between reads that completely spanned that bin was computed, ignoring insertions and deletions. Similar reads were identified as those sharing a minimum of 11 bins (400 bp in total) and with >= 99% identical sequence within >= 80% of shared bins. Similar reads were sequentially grouped together such that each read shared similarity with at least one other read within the group. All reads within each group were required to meet the sequence similarity requirements with each other (i.e. reads in disagreement with >= 1 read within the group were grouped separately).
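The bin-based similarity criterion can be illustrated with the sketch below. It assumes each read is represented as a start coordinate plus a gap-free, reference-projected sequence, and that the bin grid is anchored at coordinate 0; both are simplifications for illustration, and the function names are hypothetical.

```python
BIN, STEP = 150, 25


def spanned_bins(start, end):
    """Start coordinates of bins [b, b + BIN) fully inside [start, end)."""
    first = -(-start // STEP) * STEP  # round start up to the bin grid
    return range(first, end - BIN + 1, STEP)


def similar(read_a, read_b, min_bins=11, min_identity=0.99, min_bin_fraction=0.8):
    """Each read is (start, sequence). True if the similarity rule is met."""
    (sa, qa), (sb, qb) = read_a, read_b
    lo, hi = max(sa, sb), min(sa + len(qa), sb + len(qb))
    bins = list(spanned_bins(lo, hi))  # bins spanned by both reads
    if len(bins) < min_bins:
        return False
    passing = 0
    for b in bins:
        seg_a = qa[b - sa : b - sa + BIN]
        seg_b = qb[b - sb : b - sb + BIN]
        mismatches = sum(x != y for x, y in zip(seg_a, seg_b))  # Hamming distance
        if 1 - mismatches / BIN >= min_identity:
            passing += 1
    return passing / len(bins) >= min_bin_fraction
```

A shared span of 400 bp contains exactly 11 bins (starts at 0, 25, ..., 250), which matches the minimum stated above, and a 150 bp bin tolerates at most one mismatch at the 99% identity cutoff.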

Consensus sequences were generated for each group as follows: for each reference position represented within the read group, the composition of base calls at that position was quantified. In cases with disagreement between reads, bases comprising >= 50% of all base calls were used, except in cases where the two most common bases are C & T or G & A and comprised >= 50% of all base calls, in which case C or G was used to overcome putative spurious post-amplification deamination. Deletions and insertions were ignored unless they were present in all reads at that position. When insertions were present in every read the shortest insertion was used. Consensus sequences were subjected to a second round of merging in which overlapping consensus sequences could be merged if they overlapped by a minimum of 7 bins (300 bp in total) and had >= 99% identical sequence within >= 80% of shared bins. In cases of base-level disagreements, the base call of the consensus sequence with the most raw reads in the applicable bin was chosen. Consensus reads were assigned a random identifier and converted to fasta format. Consensus reads were realigned to hg38 as above and C-to-T and G-to-A changes were converted to the IUPAC DNA ambiguity codes Y and R in CT and GA reads, respectively.
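The per-position base-calling rule can be sketched as below. Returning `N` when no base reaches 50% is an assumption made for this example, since the text does not specify that case; the function name is hypothetical, and indels are not handled.

```python
from collections import Counter


def consensus_base(calls):
    """Consensus for one reference position from a string or list of base calls."""
    counts = Counter(calls)
    total = len(calls)
    ranked = counts.most_common(2)
    if len(ranked) == 2:
        top_two = {ranked[0][0], ranked[1][0]}
        combined = ranked[0][1] + ranked[1][1]
        if combined / total >= 0.5:
            # C/T or G/A disagreement: keep the unconverted base, treating the
            # T or A calls as putative post-amplification deamination
            if top_two == {"C", "T"}:
                return "C"
            if top_two == {"G", "A"}:
                return "G"
    base, n = ranked[0]
    return base if n / total >= 0.5 else "N"  # "N" fallback is an assumption
```

Under this rule a position with calls `TTTC` resolves to `C`, whereas `AAAT` resolves to `A` by simple majority.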

Consensus read phasing and coverage quantification

Consensus reads were assigned haplotypes with Whatshap haplotag58 (v2.3, parameters: --ignore-read-groups, --output-haplotag-list) using previously phased parental variants26. The mappable hg38 genome was computed as regions not overlapping the ENCODE blacklist (accession ENCFF356LFX)59 or hg38 regions with unreliable coverage in HG002 Fiber-seq data26. Regions of greater than 400bp contiguous bases that consisted of more than 1 consensus read per haplotype strand in at least one cell were identified using samtools depth53 (v1.17) and excluded from the hg38 mappable genome. Coverage calculations were performed using samtools depth in combination with Bedtools60 (v2.31.0) and Bedops57 (v2.4.41). Read length statistics were calculated in python using custom scripts (see Zenodo entry).

Single-cell chromatin actuation

Fibertools52 add-nucleosomes (v0.5.4, parameters: -n 60, -c 70, --min-distance-added 10) was used to calculate deamination autocorrelation and identify modification-sensitive patches (MSPs) within consensus reads. HG002 regulatory elements were identified previously from Fiber-seq data using the FIRE pipeline26. scDAF-seq regulatory elements were classified as actuated if they were overlapped by an MSP of >150bp on either strand. These overlaps were required to span at least 50% of either the length of the MSP or the length of the FIRE peak. Read lengths, deamination rates, Jaccard distances, and percent actuation were computed in python using custom scripts (see Zenodo entry, Supplementary Tables 10,11). Chromatin actuation Jaccard distances were significantly lower between haplotypes of the same cell than between the same haplotype within different cells when comparing all peaks (two-sided t-test, P = 0.006, n = 12, t = −3.23, df = 13.6, 95% CI = −0.031 - −0.001) and promoter-distal peaks (P = 0.005, n = 12, t = −3.36, df = 13.7, 95% CI = −0.033 - −0.007) but not when comparing promoter-proximal peaks (P = 0.44, n = 12, t = −0.80, df = 12.6, 95% confidence interval = −0.024 – 0.011). FIRE peaks within 250 bp of an Ensembl canonical transcript TSS in the Gencode 45 release were classified as promoter-proximal. FIRE peaks beyond 250 bp of a canonical transcript TSS were classified as promoter-distal. Enrichments of deaminated cytidines and actuated regulatory elements by genomic repeat class was performed by intersecting these features with RepeatMasker (Smit, AFA, Hubley, R. & Green, P “RepeatMasker” at http://www.repeatmasker.org) annotations within the mappable autosomal portions of GRCh38. Regulatory element codependency was calculated as described previously22. 
Briefly, for each pair of Fiber-seq FIRE peaks, we calculated the expected co-actuation as the product of the proportions of fibers actuated at each peak, while the observed co-actuation was calculated as the proportion of fibers actuated at both elements. Codependency scores were binned by genomic distance between peaks using a log2 scale. Codependency scores of each bin were calculated as the mean of codependency scores within that distance bin for peak pairs covered by at least 8 fibers.
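The actuation call and the Jaccard comparison described in this section can be sketched as follows. Intervals are assumed to be half-open `(start, end)` tuples on the same chromosome and strand, and the function names are hypothetical rather than taken from the authors' scripts.

```python
def is_actuated(peak, msps, min_msp_len=150, min_frac=0.5):
    """True if an MSP > min_msp_len overlaps the peak by at least min_frac
    of either the MSP length or the peak length."""
    ps, pe = peak
    for ms, me in msps:
        if me - ms <= min_msp_len:
            continue  # only MSPs longer than 150 bp count as actuated elements
        overlap = min(pe, me) - max(ps, ms)
        if overlap > 0 and (
            overlap >= min_frac * (me - ms) or overlap >= min_frac * (pe - ps)
        ):
            return True
    return False


def jaccard_distance(actuated_a, actuated_b):
    """1 - |A ∩ B| / |A ∪ B| for two collections of actuated peak identifiers."""
    a, b = set(actuated_a), set(actuated_b)
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0
```

Comparing the sets of actuated peaks between two haplotypes of one cell, or between the same haplotype in two cells, then reduces to a single `jaccard_distance` call per pair.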

Full-length transcript data processing

GM24385 ISO-seq reads were aligned to hg38 using pbmm2 align (PacBio; parameters: --preset ISOSEQ) and the aligned reads were collapsed using isoseq collapse (PacBio; parameters: --do-not-collapse-extra-5exons). Transcript starts were intersected with GM24385 Fiber-seq promoter-proximal peaks using Bedops57 bedmap and transcript counts were quantified using custom scripts.

Accessibility enrichment scores

Bulk ATAC-seq datasets were aligned to hg38 using Bowtie261, deduplicated for enrichments using gatk MarkDuplicates, and peaks were called separately for each dataset using macs2 (parameters: --nomodel --shift -100 --extsize 200). MSP and TSS enrichment scores were calculated for scDAF-seq and Fiber-seq libraries by mapping actuated elements (MSP > 150 bp) within 2 kb of GM24385 Fiber-seq FIRE peaks and TSSs using Bedops57 (v2.4.41). TSS enrichment scores were calculated for ATAC-seq libraries as above by mapping deduplicated sequencing reads to sample-specific TSS-overlapping peak sets. Single-cell ATAC-seq fragments were used as pseudobulked input to peak calling and TSS enrichment. All scATAC-seq datasets were pseudobulked using all sequenced fragments with the exception of GM12878, which was filtered to only include sequenced fragments from cells passing the cellranger-atac pipeline (10x Genomics v.2.1.0). Genomic density at these positions was calculated using Bedtools60 genomecov (v2.31.0) and signal was aggregated across all regions using custom python scripts. TSS signal was oriented by strand.

Single-cell CTCF co-occupancy

We identified CTCF motifs within the GRCh38 reference using FIMO55. We then filtered these motifs to include only those that fully overlapped CTCF ChIP-seq peaks from the lymphoblastoid cell line GM12878 (ENCODE accessions ENCFF356LIU and ENCFF960ZGP), GM12878 CTCF ChIA-PET anchor regions (ENCODE accession ENCFF780PGS), and GM24385 Fiber-seq FIRE peaks using Bedtools60 (v2.31.0) and Bedops57 (v2.4.41). We decided to limit CTCF footprinting to the core CTCF binding motif, which includes modules two and three62. CTCF footprints were identified within actuated elements that completely overlapped modules two and three of a motif using Fibertools52 footprint. Thus, fibers with partial deamination within a footprint region are classified as ‘unbound’ at that site. We limited CTCF co-occupancy analyses to FIRE peaks actuated on at least 30% of fibers in Fiber-seq data and required a minimum of four haplotype-strand fibers covering both CTCF motifs. For each loop, we quantified the percentage of fibers co-occupied by CTCF within anchor regions (anchor regions often contain multiple CTCF motifs) for each motif orientation observed within the loop anchor pair. CTCF loop anchors defined using ChIA-PET exhibit significantly higher single-molecule CTCF co-occupancy than expected by chance (two-sided t-test, P < 2.2×10−16, n = 2,440,641, t = 9.19, df = 10351, 95% CI = 0.02 – 0.03).

Extended Data

Extended Data Figure 1: SsDddA activity at 5mCpG dinucleotides.

(a) Experimental overview of the comparison of SsDddA deamination rate between 5mC methylated and unmethylated cytidines. (b) IGV browser displaying CpG methylation for 5mC negative control (top), 5mC positive control (middle), and M.SssI plus SsDddA treated DNA (bottom). (c) Joint violin and box plots displaying 5mC modification scores at CpG dinucleotides for untreated and M.SssI treated controls (NAPA n = 160, UBA1 n = 168 CpG dinucleotides). Boxes represent the median and interquartile range and whiskers extend to the farthest samples within 1.5× the interquartile range of the median. (d) Percentage of cytidine deamination by SsDddA grouped by CpG dinucleotides and all other cytidine bases. Data is shown for two targeted regions, the NAPA promoter region and the UBA1 promoter region.

Extended Data Figure 2: Deamination frequency by SsDddA enzyme concentration.

(a) Deamination rate of the targeted NAPA region and promoter element in particular after treatment of GM12878 with various DAF-seq reaction conditions. (b) Deamination percentages of cytidine bases within the NAPA promoter, excluding bases within TF footprint regions, for each DAF-seq library from GM12878 cells treated with a range of SsDddA enzyme concentrations and treatment times. Bases are colored by TpC (red) and non-TpC (grey) dinucleotide context. (c) Violin plots showing the distributions of the data displayed in b. (d) Deamination percentages of cytidine bases within the NAPA promoter as in b, colored by CpG (blue) and non-CpG (grey) dinucleotide context. (e) Violin plots showing the distributions of the data displayed in d. (f,g) Deamination percentages of all cytidine bases within the WASF1 promoter (f) and the CD40 promoter (g) from the variable GM12878 SsDddA treatments in a. Median conversion percentage of each treatment is colored blue.

Extended Data Figure 3: NAPA modification saturation.

Boxplots showing modification percentages of each applicable position within the NAPA promoter for targeted DAF-seq (red, n = 188 positions) or Fiber-seq libraries (purple, n = 125 positions each). NAPA promoter modification rates were significantly higher in GM12878 cells in targeted DAF-seq than in Fiber-seq (one-sided t-test, P = 1.7 × 10−11, t = 6.97, df = 228, 95% CI = 0.15 – ∞). Boxes represent the median and interquartile range and whiskers extend to the farthest samples within 1.5× the interquartile range of the median.

Extended Data Figure 4: Targeted DAF-seq multi-region benchmarking.

IGV Browser displaying percent actuation of targeted DAF-seq Oxford Nanopore data (red), percent actuation of Fiber-seq PacBio HiFi data (purple), and bulk ATAC-seq read coverage (blue) from GM12878 cells, K562 cells, and colon tissue, at each of 10 target regions.

Extended Data Figure 5: NAPA TF Codependency.

(a) Heatmap showing single-molecule pair-wise TF co-occupancy of the 11 elements within the NAPA promoter. (b) Diagram showing how TF codependency scores are calculated. (c) Heatmap showing single-molecule pair-wise TF codependency of the 11 elements within the NAPA promoter, as well as a network diagram of elements 1, 2, and 3 with the edge weights corresponding to the strength of the codependency. (d) Scatter plot comparing codependency scores (x-axis) and chi-squared test statistics (y-axis) for each TF combination within the NAPA promoter. Interactions which were significant after Bonferroni correction (P <= 0.01, chi-square test of independence, n = 55) are colored red, non-significant interactions are colored blue. (e) Bar graph showing the number of molecules with occupancy at different combinations of elements 1, 2, or 3. (f) Single-molecule conditional codependency with a bar graph showing the impact that removal of occupancy at elements 1, 2, or 3 has on the codependency score of the remaining elements within this cluster. A higher score indicates that an element is essential for a codependent network. (g) Directed network diagram showing how elements 1, 2, or 3 go from the unbound to fully bound state. Individual nodes are colored based on their occupancy patterns and are sized based on the data from panel e. The edges connecting the unbound and fully bound state are weighted based on the size of the smallest node through which they traverse during this path, and the translucency of each edge leaving the 1 bound state is dependent upon the data from panel f.

Extended Data Figure 6: UBA1 transcriptional start site codependency.

(a) Haplotype-phased single-molecule and aggregate targeted DAF-seq data of the UBA1 promoter in GM12878 cells, which have allelically skewed XCI. The single C/T germline variant used for phasing is indicated. (b) Bar graph showing chromatin actuation at each of the four UBA1 TSSs by haplotype. (c,d) Heatmap showing single-molecule pair-wise co-actuation (c) and codependency (d) of the four UBA1 TSSs along the Xa. (e,f) UBA1 TSS co-actuation (e) and codependency (f) for all reads regardless of haplotype. (g) (left) Zoom-in of UBA1 isoform 4 TSS showing single-molecule protein occupancy at four binding elements. (right) Bar graph showing single-molecule protein occupancy at each of the four UBA1 isoform 4 TSS binding elements by haplotype.

Extended Data Figure 7: Generation of single-cell consensus reads.

IGV browser displaying single-cell DAF-seq reads from one cell. Deamination events for raw reads (top) are colored red for reads originating from the top strand and green for reads originating from the bottom strand. Consensus reads (bottom) are grouped by strand (top or bottom) and ordered by haplotype.

Extended Data Figure 8: scDAF-seq Karyoplots.

(a) Genomic coverage of consensus reads from cell 2. (b) Genomic coverage of chromosome 1 consensus reads for each cell. (c) Scatter plot comparing sequenced bases with percent coverage of mappable GRCh38 for all consensus reads (blue) and phased consensus reads (red).

Extended Data Figure 9: scDAF-seq enrichments.

(a-c) Enrichments of accessible elements +/− 2 kb from transcriptional start sites (TSS) for (a) GM24385 single-cell DAF-seq, (b) Fiber-seq libraries, and (c) ATAC-seq libraries. The median scDAF-seq TSS enrichment is displayed as a dashed black line in a-c. (d) Joint violin and box plot showing the distribution of percent deamination of each cell (n = 12 cells). Boxes represent the median and interquartile range and whiskers extend to the farthest samples within 1.5× the interquartile range of the median. (e) Violin and box plots as in d showing per-cell scDAF-seq percent actuation at promoter-distal (left, blue, n = 123,232) and promoter-proximal (right, yellow, n = 21,638) GM24385 Fiber-seq FIRE peaks. (f) Autocorrelation plot of single-molecule deamination patterns in each cell showing a pattern consistent with nucleosomes being the predominant modulator of single-cell and single-molecule deamination by SsDddA. The vertical dashed line at 147bp represents the theoretical nucleosome footprint size.

Extended Data Figure 10: scDAF-seq codependency.

(a) Single-cell codependent regulatory element actuation by binned log2 genomic distance. Average codependency scores between regulatory elements in each distance bin are displayed for single chromatin fibers (same cell and haplotype) in red, for opposite haplotypes within the same cell in yellow, and for chromatin fibers from different cells in blue. (b) Percentage of raw scDAF-seq reads from each cell aligning to hg38 vs percent cytidine deamination within that cell. Mapping percentage is not significantly correlated with deamination rate (Pearson’s correlation, R2 = 0.23, two-sided t-test P = 0.12, n = 12). (c) Mean genome-wide CTCF occupancy within actuated regulatory elements from each cell vs percent cytidine deamination within that cell. CTCF occupancy is not significantly correlated with deamination rate (Pearson’s correlation, R2 = 0.08, two-sided t-test P = 0.39, n = 12). (d) Box plots showing the distribution of CTCF ChIA-PET scores in each bin shown in Fig. 7 (bin1 n = 18, bin2 n = 63, bin3 n = 122, bin4 n = 192, bin5 n = 186, bin6 n = 183, bin7 n = 583, bin8 n = 999, bin9 n = 795). Boxes represent the median and interquartile range and whiskers extend to the farthest samples within 1.5× the interquartile range of the median. (e) Stacked barplots showing the proportions of CTCF motif orientations contained within each loop anchor pair for each ChIA-PET bin and shuffled regions.

Supplementary Material

Supplemental Tables
Supplemental Information
Extended Data Captions

Acknowledgements

We thank Northwest Genome Center and Katherine M. Munson for their assistance in PacBio sequencing, UW Mass Spectrometry Center for their assistance in mass spectrometry experiments, the Fowler lab and Hyeon-Jin Kim for assistance with nuclei sorting, Thomas J. Bell and Kathryn Leonard at The National Disease Research Interchange (NDRI) for primary frozen tissue, and members of the UW-SCRI Somatic Mosaicism across Human Tissues (SMaHT) Genome Characterization Center for generating the COLO829 cell mixture, Illumina sequencing data, and their feedback and support. Funding: A.B.S. holds a Career Award for Medical Scientists from the Burroughs Wellcome Fund and is a Pew Biomedical Scholar. This research is supported by the National Institutes of Health (NIH) Common Fund, through the Office of Strategic Coordination/Office of the NIH Director under award UM1DA058220 to A.B.S. This study was also supported by NIH grant 1DP5OD029630 and a UW ADRC Developmental Project (NIH grant P30AG066509) to A.B.S. M.R.V. was supported by an NIH Pathway to Independence Award from NIGMS (1K99GM155552-01), and both M.R.V. and S.C.B. were supported by a training grant (T32) from the NIH (2T32GM007454-46). E.S. was supported by a Curci Fellowship, as well as a training grant (T32) from the NIH (T32HG000035).

Footnotes

Competing Interests Statement

A.B.S., E.G.S., and Y.M. are co-inventors on the U.S. Provisional Patent Application 63/687,924 that includes discoveries described in this manuscript regarding ‘Chromatin Stenciling’ using DAF-seq. The remaining authors declare no competing interests.

Data availability

DNA sequencing data have been deposited to the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/) under accession number PRJNA1203351 (ref. 50). Processed data are available in the Supplementary Tables or on GitHub (https://github.com/StergachisLab/DAF-seq-Manuscript). Accession numbers for the publicly available datasets used in this manuscript are provided in Supplementary Table 12. Fiber-seq datasets used in this manuscript were generated previously as part of Vollger et al. 2024 (ref. 26) and are available at https://s3-us-west-1.amazonaws.com/stergachis-manuscript-data/index.html?prefix=2024/Vollger_et_al/FIRE/.

Code availability

Custom code for data analysis, filtering and visualization can be found on Zenodo (ref. 51; https://doi.org/10.5281/zenodo.14563107) and GitHub (https://github.com/StergachisLab/DAF-seq-Manuscript).

References

  • 1. Stergachis AB, Debo BM, Haugen E, Churchman LS & Stamatoyannopoulos JA. Single-molecule regulatory architectures captured by chromatin fiber sequencing. Science 368, 1449–1454 (2020).
  • 2. Bohaczuk SC et al. Resolving the chromatin impact of mosaic variants with targeted Fiber-seq. bioRxiv 2024.07.09.602608 (2024) doi: 10.1101/2024.07.09.602608.
  • 3. Battaglia S et al. Long-range phasing of dynamic, tissue-specific and allele-specific regulatory elements. Nat. Genet. 54, 1504–1513 (2022).
  • 4. Abdulhay NJ et al. Massively multiplex single-molecule oligonucleosome footprinting. eLife 9, 1–23 (2020).
  • 5. Lee I et al. Simultaneous profiling of chromatin accessibility and methylation on human cell lines with nanopore sequencing. Nat. Methods 17, 1191–1199 (2020).
  • 6. Krebs AR et al. Genome-wide Single-Molecule Footprinting Reveals High RNA Polymerase II Turnover at Paused Promoters. Mol. Cell 67, 411–422.e4 (2017).
  • 7. Shipony Z et al. Long-range single-molecule mapping of chromatin accessibility in eukaryotes. Nat. Methods 1–9 (2020).
  • 8. Buenrostro JD et al. Single-cell chromatin accessibility reveals principles of regulatory variation. Nature 523, 486–490 (2015).
  • 9. Cusanovich DA et al. Multiplex single cell profiling of chromatin accessibility by combinatorial cellular indexing. Science 348, 910–914 (2015).
  • 10. He R et al. Genome-wide single-cell and single-molecule footprinting of transcription factors with deaminase. Proc. Natl. Acad. Sci. U. S. A. 121, e2423270121 (2024).
  • 11. Yu T et al. Deaminase-mediated chromatin accessibility profiling with single-allele resolution. bioRxiv 2024.12.17.628768 (2024) doi: 10.1101/2024.12.17.628768.
  • 12. Mok BY et al. A bacterial cytidine deaminase toxin enables CRISPR-free mitochondrial base editing. Nature 583, 631–637 (2020).
  • 13. Mi L et al. DddA homolog search and engineering expand sequence compatibility of mitochondrial base editing. Nat. Commun. 14, 874 (2023).
  • 14. Yin L, Shi K & Aihara H. Structural basis of sequence-specific cytosine deamination by double-stranded DNA deaminase toxin DddA. Nat. Struct. Mol. Biol. 30, 1153–1159 (2023).
  • 15. Mok BY et al. CRISPR-free base editors with enhanced activity and expanded targeting scope in mitochondrial and nuclear DNA. Nat. Biotechnol. 40, 1378–1387 (2022).
  • 16. Guo J et al. A DddA ortholog-based and transactivator-assisted nuclear and mitochondrial cytosine base editors with expanded target compatibility. Mol. Cell 83, 1710–1724.e7 (2023).
  • 17. Huang J et al. Discovery of deaminase functions by structure-based protein clustering. Cell 186, 3182–3195.e14 (2023).
  • 18. Roh H et al. Coupling CRISPR scanning with targeted chromatin accessibility profiling using a double-stranded DNA deaminase. bioRxiv 2024.12.17.628791 (2024) doi: 10.1101/2024.12.17.628791.
  • 19. Kelly TK et al. Genome-wide mapping of nucleosome positioning and DNA methylation within individual DNA molecules. Genome Res. 22, 2497–2506 (2012).
  • 20. Bintu L et al. Transcriptional regulation by the numbers: models. Curr. Opin. Genet. Dev. 15, 116–124 (2005).
  • 21. Sherman MS & Cohen BA. Thermodynamic state ensemble models of cis-regulation. PLoS Comput. Biol. 8, e1002407 (2012).
  • 22. Grasberger H et al. STR mutations on chromosome 15q cause thyrotropin resistance by activating a primate-specific enhancer of MIR7–2/MIR1179. Nat. Genet. 56, 877–888 (2024).
  • 23. Bernardini A et al. The USR domain of USF1 mediates NF-Y interactions and cooperative DNA binding. Int. J. Biol. Macromol. 193, 401–413 (2021).
  • 24. Ito Y, Zhang Y, Dangaria S, Luan X & Diekwisch TGH. NF-Y and USF1 transcription factor binding to CCAAT-box and E-box elements activates the CP27 promoter. Gene 473, 92–99 (2011).
  • 25. Zhu J, Giannola DM, Zhang Y, Rivera AJ & Emerson SG. NF-Y cooperates with USF1/2 to induce the hematopoietic expression of HOXB4. Blood 102, 2420–2427 (2003).
  • 26. Vollger MR et al. A haplotype-resolved view of human gene regulation. bioRxiv 2024.06.14.599122 (2024) doi: 10.1101/2024.06.14.599122.
  • 27. Zalusky MP et al. 3-hour genome sequencing and targeted analysis to rapidly assess genetic risk. Genet. Med. Open 2, 101833 (2024).
  • 28. Aguet F et al. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science 369, 1318–1330 (2020).
  • 29. Rautiainen M et al. Telomere-to-telomere assembly of diploid chromosomes with Verkko. Nat. Biotechnol. (2023) doi: 10.1038/S41587-023-01662-6.
  • 30. Gonzalez-Pena V et al. Accurate genomic variant detection in single cells with primary template-directed amplification. Proc. Natl. Acad. Sci. U. S. A. 118, e2024176118 (2021).
  • 31. Jarvis ED et al. Semi-automated assembly of high-quality diploid human reference genomes. Nature 611, 519–531 (2022).
  • 32. Cheng H, Concepcion GT, Feng X, Zhang H & Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175 (2021).
  • 33. Porubsky D et al. A familial, telomere-to-telomere reference for human de novo mutation and recombination from a four-generation pedigree. Genomics (2024).
  • 34. Lieberman-Aiden E et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 326, 289–293 (2009).
  • 35. Cremer T & Cremer C. Chromosome territories, nuclear architecture and gene regulation in mammalian cells. Nat. Rev. Genet. 2, 292–301 (2001).
  • 36. Monahan K et al. Role of CCCTC binding factor (CTCF) and cohesin in the generation of single-cell diversity of protocadherin-α gene expression. Proc. Natl. Acad. Sci. U. S. A. 109, 9125–9130 (2012).
  • 37. Rao SSP et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
  • 38. Guo Y et al. CRISPR inversion of CTCF sites alters genome topology and enhancer/promoter function. Cell 162, 900–910 (2015).
  • 39. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
  • 40. Grubert F et al. Landscape of cohesin-mediated chromatin loops in the human genome. Nature 583, 737–743 (2020).
  • 41. Dixon JR et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature 485, 376–380 (2012).
  • 42. Dixit A et al. Perturb-Seq: Dissecting Molecular Circuits with Scalable Single-Cell RNA Profiling of Pooled Genetic Screens. Cell 167, 1853–1866.e17 (2016).
  • 43. Gilbert LA et al. Genome-Scale CRISPR-Mediated Control of Gene Repression and Activation. Cell 159, 647–661 (2014).
  • 44. Kennedy SR et al. Detecting ultralow-frequency mutations by Duplex Sequencing. Nat. Protoc. 9, 2586–2606 (2014).
  • 45. Vogelstein B & Kinzler KW. Digital PCR. Proc. Natl. Acad. Sci. U. S. A. 96, 9236–9241 (1999).
  • 46. Muyas F et al. De novo detection of somatic mutations in high-throughput single-cell profiling data sets. Nat. Biotechnol. 42, 758–767 (2024).
  • 47. Dou J et al. Single-nucleotide variant calling in single-cell sequencing data with Monopogen. Nat. Biotechnol. 42, 803–812 (2024).
  • 48. Maurano MT et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
  • 49. Cheng YHH, Bohaczuk SC & Stergachis AB. Functional categorization of gene regulatory variants that cause Mendelian conditions. Hum. Genet. 143, 559–605 (2024).
  • 50. Swanson EG. NCBI BioProject database for DAF-seq. Datasets. NCBI SRA. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1203351 (2025).
  • 51. Swanson EG. StergachisLab/DAF-seq-Manuscript. Source code. Zenodo. doi: 10.5281/zenodo.14563107 (2025).
