PLOS Biology. 2020 Apr 10;18(4):e3000684. doi: 10.1371/journal.pbio.3000684

Precise genomic mapping of 5-hydroxymethylcytosine via covalent tether-directed sequencing

Povilas Gibas 1,#, Milda Narmontė 1,#, Zdislav Staševskij 1, Juozas Gordevičius 1, Saulius Klimašauskas 1,*, Edita Kriukienė 1,*
Editor: Tom Misteli 2
PMCID: PMC7176277  PMID: 32275660

Abstract

5-hydroxymethylcytosine (5hmC) is the most prevalent intermediate on the oxidative DNA demethylation pathway and is implicated in regulation of embryogenesis, neurological processes, and cancerogenesis. Profiling of this relatively scarce genomic modification in clinical samples requires cost-effective high-resolution techniques that avoid harsh chemical treatment. Here, we present a bisulfite-free approach for 5hmC profiling at single-nucleotide resolution, named hmTOP-seq (5hmC-specific tethered oligonucleotide–primed sequencing), which is based on direct sequence readout primed at covalently labeled 5hmC sites from an in situ tethered DNA oligonucleotide. Examination of distinct conjugation chemistries suggested a structural model for the tether-directed nonhomologous polymerase priming enabling theoretical evaluation of suitable tethers at the design stage. The hmTOP-seq procedure was optimized and validated on a small model genome and mouse embryonic stem cells, which allowed construction of single-nucleotide 5hmC maps reflecting subtle differences in strand-specific CG hydroxymethylation. Collectively, hmTOP-seq provides a new valuable tool for cost-effective and precise identification of 5hmC in characterizing its biological role and epigenetic changes associated with human disease.


This study describes hmTOP-seq, a bisulfite-free approach for profiling of the epigenetic mark 5-hydroxymethylcytosine (5hmC) at single-nucleotide resolution, based on direct sequence readout primed at an in situ tethered DNA oligonucleotide.

Introduction

DNA methylation is involved in many biological processes such as embryogenesis, establishment of cell identity and organismal fate, and development of various pathological conditions, including cancer. A well-documented repressive role of 5-methylcytosine (5mC) can be reversed via the action of the Ten-Eleven Translocation enzymes (TET1, 2, and 3), which remove 5mC through the formation of several oxidized forms of 5mC: 5-hydroxymethylcytosine (5hmC), 5-formylcytosine, and 5-carboxylcytosine. In addition to its role as a 5mC demethylation intermediate, 5hmC has been reported in a number of studies as a stable epigenetic mark with a biological role in transcription regulation in both physiological and pathological states [1–6]. Of all known oxidized forms of 5mC, 5hmC is the most abundant in mammalian tissues; its levels vary in a tissue-specific manner, reaching up to 1.8% of total cytosine in human neurons [7–11]. In various solid tumors, 5hmC was found at significantly reduced levels, reinforcing the diagnostic and prognostic significance of 5hmC in cancer research [12–14]. However, as whole-genome sequencing and data analysis remain largely inaccessible to large-scale population and clinical studies, cost-effective sensitive techniques are in high demand for unlocking the epigenetic role and diagnostic potential of DNA hydroxymethylation marks.

Enzymatic, chemical treatment–based, or antibody enrichment–based technologies have been developed for analysis of 5hmC genome-wide [2,15–19]. Single-nucleotide-resolution whole-genome bisulfite sequencing (WGBS) and its derivatives offer unprecedented capability to infer absolute DNA modification levels [20–23]. However, noticeable degradation of DNA and the high sequencing depths required for confident determination of scarce modifications [24] demand sizeable amounts of input DNA and significant experimental/computational resources, which limit widespread application of WGBS in large-scale clinical and population studies.

Methods based on covalent 5hmC modification have revealed their potential in genome-wide analysis of 5hmC [2]. However, despite the high sensitivity arising from covalent 5hmC derivatization, all methods employing affinity enrichment, including hMe-Seal [2], suffer from low resolution (200–500 bp). Acknowledging the constraints of the existing methods and inspired by the success of our recently developed uTOP-seq (uCG-specific tethered oligonucleotide–primed sequencing) strategy for single-base-resolution mapping of unmodified CG sites (uCGs) [25], we went on to develop a high-resolution bisulfite-free method for analysis of 5hmC, named hmTOP-seq (5hmC-specific tethered oligonucleotide–primed sequencing). We examined a series of chemo-enzymatic tethering strategies for their suitability to support the tethered oligonucleotide–primed sequencing (TOP-seq) reaction and suggested a first predictive model for this unconventional mode of polymerase action: nonhomologous proximity-driven internal priming. Based on this knowledge, we optimized and validated the developed procedure on model DNA systems and mouse embryonic stem cells (mESCs).

Results

Covalent derivatization and sequence readout at 5hmC sites

The hmTOP-seq analysis relies on a previously uncharacterized feature of DNA polymerases—proximity-driven nonhomologous priming to produce adjoining DNA molecules for downstream sequencing and genomic mapping. This event is brought about by a DNA oligonucleotide probe covalently tethered at target sites in DNA. As originally described for TOP-seq analysis of unmodified CGs [25], the covalent tethering of 5hmC residues was achieved in two steps: chemo-enzymatic derivatization of 5-hydroxyl-group with a reactive chemical group followed by selective conjugation of an appropriately derivatized oligodeoxyribonucleotide (ODN). We explored the effects of structural features of the chemical tether on the efficiency of the reaction by Pfu polymerase. In particular, we compared three tethering strategies in terms of their capacity to efficiently and accurately prime the polymerase reaction on DNA (Fig 1A). The first two exploited the bacteriophage T4 β-glucosyltransferase (BGT)-directed transfer of azide-containing chemical moieties to 5hmC from a modified cofactor, UDP-6-azidoglucose [2], but differed in that a copper(I)-catalyzed (Fig 1A, compounds 1a,b) or copper-free azide-alkyne cycloaddition (compounds 2a,b) coupling of the ODN was used in the second step. The third strategy (compound 3) was based on the capacity of the eM.SssI MTase (an engineered version of the CG-specific DNA cytosine-5 methyltransferase M.SssI) to covalently modify 5-hydroxymethylated CG sites (5hmCGs) with aliphatic thiols such as cysteamine (2-mercaptoethylamine), permitting subsequent tethering of an N-hydroxysuccinimide (NHS) ester–containing ODN via the attached terminal amino group [26]. Altogether, this gave us the opportunity to explore chemical linkers of different length, rigidity, and overall bulk. We also examined two different tether attachment points in the ODN strand (S1 Table and S1 Fig). 
We found that the 1a and 3 labeling chemistries were well compatible with the nonhomologous priming at or near 5hmC residues in our testing system composed of two model DNA fragments (S2A, S2B and S2C Fig). However, in both cases, higher yields of the desired products were consistently obtained when the tether was linked at the 5-position of the second T residue rather than the 5′-phosphate group (S1 Table). In contrast to the Cu(I) click–derived product, the ODN conjugated through the dibenzocyclooctyne (DBCO) cycloaddition 2a,b failed to generate priming products of comparable quality and quantity in our hands (S2D Fig). A recent application of the DBCO chemistry (apparently tethered to the 5′-phosphate group as in 2b) for 5hmC analysis included a lengthy conjugation time (24 h) but ultimately showed a lower accuracy of the priming reaction, which limited the resolution of the method to ±20 nucleotides (nt) [27]. Our developed procedure based on the copper(I) click coupling (compound 1a) generated sequence reads that predominantly included the full sequence of the tethered ODN immediately followed by the target nucleotide and adjoining sequence (. . .AAAGNNN . . ., see Fig 1B).

Fig 1. Tethered oligonucleotide priming and analysis of 5hmC.

(A) Strategies of 5hmC derivatization. (B) Schematic of a conventional primer extension reaction by a DNA polymerase derived from Pfu-DNA cocrystal structure (PDB: 5OMF) (left) and a putative mechanism of tethered ODN–primed polymerase reaction (right). The chemical tether (shown as green/red line) connecting T2 in the bound DNA duplex and the 5hmC residue in gDNA strand facilitates the capture of the 5hmC in a stacked position required for the priming reaction. Beige areas show regions of extensive contacts between the Pfu polymerase and the bound DNA. (C) Left, chemical structure of the tethering linker 1 (top) derived by MM2 conformational energy minimization (S1 Fig) is shown in the context of the template DNA strand 5′..T4G5T6.. of the KOD polymerase–DNA complex (PDB: 4AIL) resembling the positions of the hmC, T1, and T2 nucleotides (bottom) in Fig 1B. Right, a large cleft of Pfu polymerase (space-fill orange) seen from the major groove side of the bound DNA duplex (blue and green strands). (D) Workflow of the hmTOP-seq procedure. (Step 1) Fragmented gDNA is tagged with an azide group through BGT-glucosylation. (Step 2) Azide-modified DNA is ligated to partially complementary adaptors. (Step 3) The ODN containing a biotin group is tethered to azide groups using click chemistry. (Step 4) The biotin-labeled fragments are captured on streptavidin beads. (Step 5) TO-primed strand extension. (Step 6) PCR amplification with Ad-A2 and Ad-TO primers containing NGS platform-specific 5′-end adaptor sequences. Unidirectional sequencing from the A adaptor sequence included in the 5′ part of Ad-TO-barcode amplification primer. 
5hmC, 5-hydroxymethylcytosine; A1 and A2, strands of a partially complementary adaptor; Ad, extended sections of platform-specific adapters; BGT, β-glucosyltransferase; DBCO, dibenzocyclooctyne; eM.SssI, an engineered version of the CG-specific DNA cytosine-5 methyltransferase M.SssI; gDNA, genomic DNA; hmC, 5-hydroxymethylcytosine; hmCG, hydroxymethylated CG site; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; KOD, the KOD DNA polymerase from Thermococcus kodakaraensis; mCG, methylated CG site; NGS, next generation sequencing; NHS, N-hydroxysuccinimide; ODN, oligodeoxyribonucleotide; PDB, Protein Data Bank; Pfu, the Pfu DNA polymerase from Pyrococcus furiosus; TO, tethered oligodeoxyribonucleotide.

Based on these observations and available cocrystal structures of related Pfu and KOD polymerases in complex with their DNA substrates [28,29], we proposed a model for the putative priming complex. The crystallographic evidence indicates that the polymerases tightly bind the duplex part of the DNA, whereas the 5′ part of the template strand is less structured and can assume multiple conformations (S1 Fig). The template-dependent addition of an incoming nucleotide at the 3′ end of the priming strand occurs upon stacking of the template nucleotide with the 5′-terminal nucleotide in the bound duplex (Fig 1B). Similarly, in the case of the tethered-ODN priming, this can be achieved if the ODN duplex binds in the active site of the enzyme, bringing the tethered nucleotide of the template DNA strand (5hmC) into spatial proximity and thereby promoting its binding in a stacked position in lieu of the template nucleotide. Although this conformation of the tethered DNA strand may not be the most preferred in general, it has a certain probability of occurring (depending on the particular structure of the linker) upon prolonged initial amplification cycles, leading to primer strand extension and successive stepwise translocation of the duplex away from the active site. A wide cleft of Pfu traversing the contour of the bound duplex DNA appears to be well suited to accommodate fairly bulky structural extensions on the major groove side of the template strand, such as the downstream part of the template strand and the tethering linkers of a certain size (Fig 1C). Consistent with our experimental observations (S1 Table), MM2 conformational modeling studies suggested, however, that accommodating the DBCO coupling linker 2a might be problematic because of its bulkiness and/or steric clashes with the bound DNA duplex (S1 Fig).

Evaluation of hmTOP-seq on model genome and mESCs

Since the BGT Cu(I)-catalyzed click labeling using 1a showed the best performance compared with the other tether chemistries (S2 Fig, S1 Table), it was selected for further optimization as part of the ultimate hmTOP-seq procedure (Fig 1D). The sensitivity and specificity of hmTOP-seq were first assessed on a model bacteriophage lambda genome that was pre-hydroxymethylated at the GCGC sites (215 sites in the 42-kb DNA) to various extents using our previously discovered atypical reactivity of the bacterial M.HhaI MTase (the GCGC-specific DNA cytosine-5 methyltransferase M.HhaI) ([26,30] and S2E Fig). All the GCGC sites were detected by hmTOP-seq, and correlation between two technical replicates of each library type was very high (Pearson mean r = 0.98; sd = 0.003; p < 2.2 × 10⁻¹⁶) (Fig 2A). Moreover, we observed increasing median read coverage with increasing hydroxymethylation of GCGC sites, which demonstrated quantitative detection of 5hmC by our method (quadratic regression; adjusted R² = 0.9057; p = 1 × 10⁻⁴; Fig 2B). At other CG sites, the median read coverage was a constant zero. Importantly, the majority of mapped reads started precisely at GCGC sites, with a minor fraction distributed in adjacent positions (−1 to −4, S3A Fig), so that altogether 98% of total reads identified hydroxymethylated GCGC sites.

Fig 2. Analysis of hmTOP-seq libraries in model lambda genome and mESCs.

(A) Correlation between the coverage of 2.5% and 40% pre-hydroxymethylated GCGC sites in the technical replicates of hmTOP-seq libraries of lambda bacteriophage genome. (B) Dependence of hmTOP-seq coverage on the level of hydroxymethylation of GCGC sites in bacteriophage lambda DNA. Quadratic regression was used to fit the plotted data (Y = 99.5 + 15.7X − 0.240X²). (C) Distance distribution of read start positions from a nearest CG site in the hmTOP-seq library of mESC DNA (500-ng input). (D) Correlation of 5hmCG coverage and h-density signals in replicates of hmTOP-seq libraries prepared with varying amounts of mESC DNA. h-density was computed by normalizing coverage values by the unweighted CG density as described in [25]. (E) Comparison of hmTOP-seq coverage and 5hmC percentages estimated by bisulfite-based TAB-seq. Each dot represents the average hmTOP-seq value for specific TAB-seq percentage group (97% of all CG that overlap between hmTOP-seq and TAB-seq are used for analysis). (F) Correlation between hmTOP-seq coverage at 5hmCHs in technical replicates of 500-ng input mESC DNA libraries (OR = 347, p < 2.2 × 10⁻¹⁶; Fisher’s exact test). (G) Correlation of 5hmC signal between mESCs hmTOP-seq and nano-hmC-Seal data (average peak region size 615 bp). Within each nano-hmC-Seal peak region, total amount of signal from both methods was square-root transformed and correlated per each autosome. The data underlying this figure are included in S1 Data. 5hmC, 5-hydroxymethylcytosine; 5hmCG, hydroxymethylated CG site; 5hmCH, hydroxymethylated CH site, where H = A, C, or T; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell; OR, odds ratio; TAB-seq, Tet-assisted bisulfite sequencing.
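
As a reading aid for panel B, the reported quadratic fit can be evaluated directly. The short Python sketch below uses only the coefficients given in the caption; it assumes that X is the percentage of hydroxymethylation at GCGC sites and Y the median read coverage, which is our reading of the text rather than something the caption states. The function names are illustrative.

```python
# Coefficients taken from the Fig 2B caption: Y = 99.5 + 15.7X - 0.240X^2.
# Assumption: X is hydroxymethylation in percent, Y is median read coverage.
INTERCEPT = 99.5
LINEAR = 15.7
QUADRATIC = -0.240


def fitted_coverage(x_percent: float) -> float:
    """Evaluate the reported quadratic fit at a given hydroxymethylation level."""
    return INTERCEPT + LINEAR * x_percent + QUADRATIC * x_percent ** 2


def fit_maximum_x() -> float:
    """Position of the parabola's maximum, -b / (2a); a property of the fit only."""
    return -LINEAR / (2 * QUADRATIC)


if __name__ == "__main__":
    for x in (0, 2.5, 10, 40):
        print(f"X = {x:>4}%  fitted Y = {fitted_coverage(x):.2f}")
    print(f"fitted curve peaks near X = {fit_maximum_x():.1f}%")
```

The location of the maximum is a mathematical property of the fitted parabola and should not be read as a biological claim beyond the range of the plotted data.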

We next sought to examine the ability of hmTOP-seq to localize 5hmC residues in mammalian genomic DNA (gDNA). We created base resolution hmTOP-seq maps of mESCs in two replicates using various input DNA amounts (5, 50, and 500 ng) (for sequencing statistics, see S2 Table). We also prepared control hmTOP-seq libraries using the same pipeline without the BGT labeling step. Similarly to the lambda DNA experiment, the majority (93%) of mapped reads started at or in the immediate vicinity of CG sites, with only a fraction of reads distributing around CGs (Fig 2C and S3B Fig). Such a precise readout of CG sites clearly demonstrates the single-base resolution capacity of the method. Replicates of the higher-input hmTOP-seq libraries correlated well (Pearson mean r was 0.46 and 0.8 for 50-ng and 500-ng input libraries, respectively) (Fig 2D), and notably, subsampling of the original datasets down to 30% (approximately 9.6 million and 6.7 million processed reads, respectively) only marginally influenced the correlation (S4 Fig), suggesting that the sequencing depths could be lowered considerably. The technical replicates of 5-ng DNA input libraries showed considerably smaller correlation (Pearson r = 0.11).
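
The distance analysis summarized in Fig 2C can be illustrated with a minimal Python sketch that reports, for each read start, the signed offset to the nearest CG site. This is not the pipeline used in the study: it assumes 0-based coordinates on a single strand and a toy sequence, and the function names are hypothetical.

```python
from bisect import bisect_left


def cg_positions(seq: str) -> list[int]:
    """Return 0-based positions of the C in every CG dinucleotide (sorted)."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "CG"]


def distance_to_nearest_cg(read_start: int, cg_sites: list[int]) -> int:
    """Signed offset (read_start minus CG position) to the closest CG site.

    0 means the read starts exactly at the C of a CG. Ties are resolved in
    favour of the upstream site. Strand handling is deliberately omitted.
    """
    if not cg_sites:
        raise ValueError("no CG sites supplied")
    i = bisect_left(cg_sites, read_start)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = cg_sites[max(i - 1, 0):i + 1]
    return min((read_start - c for c in candidates), key=abs)


if __name__ == "__main__":
    toy = "AACGTTTTCGAA"  # CG sites at positions 2 and 8
    sites = cg_positions(toy)
    for start in (0, 2, 3, 7, 11):
        print(start, distance_to_nearest_cg(start, sites))
```

Tabulating these offsets over all mapped reads gives a distribution of the kind plotted in Fig 2C.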

With the 500-ng and 50-ng input DNA samples, we found approximately 4.8 million and 2 million 5hmCGs, respectively, that overlapped between replicates of the libraries (mean coverage 5.7× and 7×, respectively). Only about 0.26 million overlapping 5hmCGs were detected in the 5-ng input samples (mean coverage 13.7×). Taken together, these results showed that reducing the input DNA library size leads to an increased variability even at high sequencing depths. However, the variability dropped and correlations increased after calculation of a normalized density (h-density) for 180-bp windows as described for uTOP-seq [25] for all input DNA libraries (Fig 2D). Importantly, the 500-ng control libraries had mean coverage 1.5× and showed very low correlation (Pearson r = 0.03), which indicates that hmTOP-seq detects 5hmCG modification even at very low DNA inputs.
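
The idea behind a window-based normalized density can be sketched as follows. The exact h-density definition is given in [25] and is not reproduced here; the Python example below is only a simplified stand-in that assumes fixed, non-overlapping 180-bp windows and divides the summed coverage in each window by the number of CG sites it contains. All names and the toy data are hypothetical.

```python
from collections import defaultdict

WINDOW_BP = 180  # window size mentioned in the text


def window_density(cg_coverage: dict[int, int],
                   all_cg_positions: list[int],
                   window: int = WINDOW_BP) -> dict[int, float]:
    """Summed coverage per window divided by the CG count of that window.

    Assumption: fixed, non-overlapping windows indexed by position // window.
    Windows without any CG site are skipped; coverage at positions that are
    not listed as CG sites is ignored.
    """
    cg_count: dict[int, int] = defaultdict(int)
    cov_sum: dict[int, int] = defaultdict(int)
    for pos in all_cg_positions:
        cg_count[pos // window] += 1
    for pos, cov in cg_coverage.items():
        cov_sum[pos // window] += cov
    return {w: cov_sum[w] / n for w, n in sorted(cg_count.items())}


if __name__ == "__main__":
    cgs = [10, 50, 170, 200, 400]        # toy CG coordinates
    coverage = {10: 4, 170: 2, 200: 3}   # toy read counts at covered CGs
    print(window_density(coverage, cgs))
```

Pooling signal over neighbouring CGs in this way trades single-site resolution for lower variance, which is consistent with the improved replicate correlations reported for the density values.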

We then compared our 5hmCG datasets with the bisulfite treatment–based TAB-seq data [18] and first calculated the overlap of 5hmCGs detected by TAB-seq (1.9 million CGs) and hmTOP-seq. hmTOP-seq recovered 50% and 25% of TAB-seq–identified 5hmCGs in the 500-ng and 50-ng input DNA datasets, respectively (odds ratio [OR] = 4 and OR = 3.8, respectively, p < 2.2 × 10⁻¹⁶; Fisher’s exact test). Notably, the amounts of identified 5hmCGs in 500- and 50-ng input DNA hmTOP-seq libraries outnumbered those obtained by TAB-seq, which normally requires micrograms of starting DNA. Comparison between the hmTOP-seq coverage and the percentages of 5hmC estimated by TAB-seq in the overlapping set of CGs indicated a good agreement between the two methods (Fig 2E).
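
For readers less familiar with overlap statistics, the following Python sketch shows how an odds ratio and a Fisher-type p-value are obtained from a 2 × 2 table of sites called by two methods. The counts are invented for illustration and are not the study's data; the p-value computed here is the one-sided hypergeometric tail, which may differ from the variant of the test used in the paper.

```python
from math import comb


def odds_ratio(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Sample odds ratio of a 2x2 overlap table: (both * neither) / (only_a * only_b)."""
    if only_a == 0 or only_b == 0:
        return float("inf")
    return (both * neither) / (only_a * only_b)


def fisher_one_sided(both: int, only_a: int, only_b: int, neither: int) -> float:
    """P(overlap >= observed) under the hypergeometric null (one-sided Fisher test)."""
    called_a = both + only_a
    called_b = both + only_b
    total = both + only_a + only_b + neither
    denominator = comb(total, called_b)
    largest_overlap = min(called_a, called_b)
    tail = sum(comb(called_a, k) * comb(total - called_a, called_b - k)
               for k in range(both, largest_overlap + 1))
    return tail / denominator


if __name__ == "__main__":
    # Hypothetical toy table: 8 CG sites, 4 called by each method, 3 shared.
    table = dict(both=3, only_a=1, only_b=1, neither=3)
    print("OR =", odds_ratio(**table))
    print("one-sided p =", round(fisher_one_sided(**table), 4))
```

With genome-scale counts the exact sum becomes slow, and a statistics library would normally be used instead of this direct enumeration.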

The inherent single-nucleotide resolution of hmTOP-seq and sequence nonselectivity of BGT [2] opened a possibility for 5hmC identification in non-CG context. In mESCs, cytosine methylation and hydroxymethylation are present at CHG and CHH sites (H = A, C, or T) [18,31]. Using six hmTOP-seq control libraries, we observed 55,025 non-CG sites of which only 190 overlapped in at least two control libraries, and only 284 of them overlapped with the hydroxymethylated CH sites (5hmCHs) (76,665 occurrences; see below) that we identified in the 500-ng hmTOP-seq libraries, suggesting that these sites resulted from random priming events rather than from BGT-directed covalent labeling and could be defined as false positives. Moreover, correlation between the control libraries was close to zero, for example, r = 0.03 for the 500-ng libraries. Of all 5hmCHs detected in the 500-ng hmTOP-seq libraries, for further analysis we selected only those sites that overlapped between technical replicates, resulting in a final set of 76,665 5hmCHs (mean coverage 2.7×; Pearson r = 0.76) (Fig 2F). In comparison, CH positions that did not overlap had lower sequencing counts (1.3×; p < 2.2 × 10⁻¹⁶, two-sided t test). Of note, as sequencing counts indicate relative hydroxymethylation of cytosine, the lower average coverage of the called 5hmCHs (2.7×) relative to 5hmCGs (5.7×) points to a lower 5hmC occupancy at 5hmCHs. Fifty percent of detected 5hmCHs were found in CA sites (CA:CT:CC = 0.50:0.33:0.17) and distributed in a ratio 1:2 in CHG and CHH context, respectively (34% and 66%). These proportions corresponded well to those reported by the restriction enzyme–based 5hmC profiling methods ABA-seq and Pvu-Seal-seq [19,32]. The total number of the hmTOP-seq called 5hmCHs constituted 1.6% of 5hmCGs. This is close to 1.3% reported by TAB-seq [18] but differs slightly from the data presented by ABA-seq and Pvu-Seal-seq (4.1% of 5hmC exist in CH context [1.4% CHG and 2.7% CHH]).
Similarly to the above-mentioned restriction enzyme–based methods, we identified more 5hmCHs than TAB-seq (26,616 CH positions were detected by TAB-seq) and generally confirmed the presence of 5hmC at non-CG sites in mESCs. Overall, all these data indicate a good reproducibility and capacity of hmTOP-seq to localize genomic 5hmC sites in gDNA.
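
The CG, CHG, and CHH contexts discussed above (H = A, C, or T) follow from the two bases downstream of the cytosine. A minimal Python sketch of this classification is given below; it assumes an unambiguous A/C/G/T sequence read in the 5′ to 3′ direction on the strand carrying the cytosine, and the function name is hypothetical.

```python
def cytosine_context(seq: str, pos: int) -> str:
    """Classify the cytosine at 0-based index `pos` as CG, CHG, or CHH.

    Assumptions: `seq` contains only A/C/G/T and is given 5'->3' on the strand
    that carries the cytosine. Returns "unknown" when the sequence ends before
    the context can be determined.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError(f"position {pos} is {seq[pos]!r}, not a cytosine")
    downstream = seq[pos + 1:pos + 3]
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2:
        return "unknown"
    return "CHG" if downstream[1] == "G" else "CHH"


if __name__ == "__main__":
    for s, p in (("ACGT", 1), ("ACAGT", 1), ("ACATT", 1), ("ACCGA", 1)):
        print(s, p, cytosine_context(s, p))
```

Cytosines on the opposite strand can be classified the same way after taking the reverse complement of the reference.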

Next, we performed a pairwise comparison between hmTOP-seq and nano-hmC-Seal, a low-resolution method that also adds an azide-modified glucose moiety to 5hmC and then pulls down modified DNA fragments without preserving single-CG hydroxymethylation information [33]. We observed a considerable agreement between the two types of data for all hmTOP-seq datasets (Pearson mean r = 0.88; sd = 0.03; p < 2.2 × 10⁻¹⁶), despite different resolutions of the two methods (the average size of the nano-hmC-Seal regions used was 615 bp) (Fig 2G).
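
The comparison described above (per-region signal totals, square-root transformed and then correlated, as stated in the Fig 2G caption) can be sketched in Python as follows. The input values are invented, a single chromosome is shown, and the helper names are hypothetical; this is an illustration of the calculation, not the analysis code used in the study.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def sqrt_transformed_correlation(signal_a: list[float],
                                 signal_b: list[float]) -> float:
    """Correlate per-region totals after a square-root transform."""
    return pearson([sqrt(v) for v in signal_a], [sqrt(v) for v in signal_b])


if __name__ == "__main__":
    # Hypothetical per-peak signal totals for one chromosome.
    hmtop = [0.0, 1.0, 4.0, 9.0]
    seal = [0.0, 4.0, 16.0, 36.0]
    print(round(sqrt_transformed_correlation(hmtop, seal), 3))
```

The square-root transform dampens the influence of a few very high-count regions, which is a common choice for count-like data.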

The analysis of 5hmCG distribution at various genomic elements demonstrated good agreement with the published data [18,19,34]. The highly hydroxymethylated CG sites (top 20% of hmTOP-seq data) were enriched in poised enhancers marked by histone H3 lysine 4 monomethylation (H3K4me1), active enhancers (marked by histone H3 lysine 27 acetylation [H3K27ac] and histone H3 lysine 4 trimethylation [H3K4me3]), exons, 3′ untranslated regions (UTRs), downstream regions of protein-coding genes, shores of CG islands (CGIs), and nonactive promoters depleted in the histone H3 lysine 9 acetylation (H3K9ac) mark (Fig 3A). CGIs, active promoters marked by H3K9ac, intergenic regions, and all major types of repeats were depleted in 5hmCGs (data shown only for long terminal repeats [LTRs]), except for short interspersed nuclear elements (SINEs), which demonstrated moderate enrichment for less hydroxymethylated 5hmCGs. Along a composite gene, 5hmCGs were most abundant in introns, and their density increased toward the 3′ end of the gene (Fig 3B). Because of the small number of identified 5hmCHs, we were unable to accurately estimate the metagene profile of 5hmCHs, even though 53% of identified sites resided in protein-coding genes (S5 Fig) and showed enrichment towards the sense strand (OR = 1.1, p = 5.3 × 10⁻⁸; Fisher’s exact test).

Fig 3. Genomic distribution of 5hmCGs in mESCs.

(A) OR from Fisher’s exact test for enrichment of the high- (top 20%), medium- (middle 20%), and low-coverage (bottom 20%) 5hmCGs across various genomic features. Poised enhancers (“enh.”): regions with H3K4me1 mark only; active enhancers: regions with H3K4me1 and H3K27ac histone marks; active promoters: 2-kb regions upstream of the gene start that overlap H3K9ac histone mark; nonactive promoters: 2-kb upstream regions depleted in H3K9ac. All analyses are shown for a 500-ng input hmTOP-seq library. All shown enrichment values have p ≤ 1.1 × 10⁻⁶. The data underlying this figure are included in S1 Data. (B) hmTOP-seq coverage profile normalized to CG density over different gene-associated regions: upstream (2 kb), 5′ UTR, exons, introns, 3′ UTRs, and downstream (2 kb). Distribution of (C) 5hmCGs and (D) uCGs across the sense and the antisense strands of genes grouped according to their expression level. Numbers of genes in each group and p-values for the modification difference between the strands are shown above each graph. All analyses are shown for a 500-ng input hmTOP-seq library. 5hmCG, hydroxymethylated CG site; CGI, CG island; H3K4me1, histone H3 lysine 4 monomethylation; H3K9ac, histone H3 lysine 9 acetylation; H3K27ac, histone H3 lysine 27 acetylation; H3K36me3, histone H3 lysine 36 trimethylation; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; LTR, long terminal repeat; mESC, mouse embryonic stem cell; OR, odds ratio; SINE, short interspersed nuclear element; TSS, transcription start site; TTS, transcription termination site; uCG, unmodified CG site; UTR, untranslated region.

Strand-specific analysis of 5hmCGs at gene bodies

As hmTOP-seq targets the two strands of a CG dinucleotide separately, it has the capacity to display strand-specific hydroxymethylation information. Following the previous observations that 5hmCGs are asymmetrically modified on different strands [18,35], we analyzed strand-specific distribution of 5hmCGs across gene-associated elements. In TAB-seq analysis, the average abundance of 5hmC at the called CG sites is 20% compared with 10.9% at the opposite cytosine [18]. We detected a shift in CG hydroxymethylation toward the sense strand relative to the antisense strand in gene bodies. Importantly, the extent of the strand-specific 5hmCG bias increased according to the expression levels of genes, showing the strongest bias for the highly expressed gene group (p = 0.4 and p = 1 × 10⁻⁶³, for low- and high-expression groups, respectively, two-sided paired t test) (Fig 3C). The detected skewed hydroxymethylation at CGs is an important indicator demonstrating that hmTOP-seq can sense subtle differences in 5hmCG levels. Furthermore, the association of 5hmCG enrichment at the sense strand with gene expression levels is in good agreement with the proposed association of 5hmC with active gene expression [4]. Consistent with a suggested repressive role of 5hmC at promoter regions [4], promoters of the highly expressed genes were less enriched in 5hmCGs. Of note, the detected strand-specific 5hmCG bias persisted in the 50-ng input hmTOP-seq library (S6 Fig), suggesting that hmTOP-seq is able to discern high-resolution 5hmCG patterns in limited samples. We then compared hydroxymethylation and general cytosine modification levels at CGs across the same gene groups, using hmTOP-seq and uTOP-seq data, respectively (Fig 3D). Accordingly, promoters of more highly expressed genes were relatively less methylated as compared with those of weakly expressed genes.
There was no noticeable difference for the uCG distribution between the sense and the antisense strands, consistent with the symmetrical methylation of CGs [18].

To further demonstrate the capacity of hmTOP-seq for high-resolution analysis, we analyzed distribution of 5hmCGs around exon-intron boundaries. DNA hydroxymethylation and modification differences at the exon-intron junction were reported for the immediate boundary (within the first 5 nt) and up to the first 20 nt from the boundary [17,36]. We selected all internal protein-coding exons that contained 5hmCG both on the intronic and exonic side within 25-nt distance from the splicing site. In our analysis, a cross-boundary change in 5hmCG distribution at the site of transition from exon to intron was most evident in the first 5–10 nt from the boundary on the sense strand, whereas hydroxymethylation levels in the antisense strand remained constant (Fig 4A). We identified a 5hmC peak at the −1 and −2 positions on the exonic side, then a sharp drop at around +5 and again an increase in 5hmC levels for longer distances (up to +25 nt at the intron side). Strikingly, at the site of transition from intron to exon, we detected a prominent intronic 5hmC peak at around −5 position on the coding strand. For the opposite strand, CGs that localized immediately adjacent to the intron-exon junction, as well as the peri-boundary CGs, showed a lower 5hmC level across introns relative to exons. The observed cross-boundary 5hmCG changes are in good agreement with TAB-seq data of mouse and human tissues [35], indicating that our method can approach a precision previously achievable only with the gold-standard methods. Importantly, the cross-boundary differences were also evident in the 50-ng input DNA hmTOP-seq analysis, in which considerably lower numbers of 5hmCGs were identified (S7 Fig). Of note, the general cross-boundary DNA modification levels of CGs followed a similar trend, as evidenced by the TOP-seq profiles of uCGs (Fig 4B).

Fig 4. CG modification profiles at exon-intron cross-boundaries.

Distribution of 5hmCGs (A) (500-ng input DNA hmTOP-seq libraries) or uCGs (B) at both sides of the exon-intron boundary is presented for the sense and the antisense strands. The x-axis shows the distance (nt) of CGs from the boundary. p-Values indicate a difference in coverage between exonic and intronic side of the boundary for the first 25 nt. At the exon-intron boundary (left part), general 5hmCG modification levels are higher at the exonic side as compared with the intronic side for both strands. At the intron-exon boundary (right part), the antisense strand shows the same trend, whereas the sense strand shows higher general 5hmCG levels on the intronic side. The data underlying this figure are included in S1 Data. 5hmCG, hydroxymethylated CG site; nt, nucleotide; uCG, unmodified CG site.

Discussion

To date, the TOP-seq approach, which relies on targeted nonhomologous priming of a DNA polymerase to provide enhanced mapping resolution and sequencing economy, has been implemented for analysis of unmodified [25] or hydroxymethylated CG sites in DNA ([27] and this work). Taking advantage of available empirical data, we were able, for the first time to our knowledge, to propose a general mechanism for this unconventional mode of polymerase action. At the heart of the mechanism is proximity-driven capture of a covalently tagged target base, located at an internal position of the DNA template strand, in the active site of a DNA polymerase; the latter is attracted to the DNA by the short duplex composed of the tethered ODN and its complementary priming strand. This apparent invasion of the DNA helix requires no “jumping” of the polymerase, as has been suggested elsewhere [27]. Most importantly, the new level of mechanistic understanding enables a better evaluation of suitable tethering chemistries by simple modeling prior to empirical examination, which is exemplified by the superior resolution of the newly developed hmTOP-seq procedure as compared with its predecessor [27]. The attained precision enabled the readout and mapping of genomic 5hmC in both CG and non-CG contexts.

The nondestructive nature of hmTOP-seq allowed the construction of 5hmC maps with smaller amounts of starting DNA than required for routine WGBS. Besides its high robustness and inherent single-base resolution, the hmTOP-seq approach offers cost-efficiency, stemming from the target-selective sequencing and facile data processing. Such analysis results in extremely informative datasets, as demonstrated by whole-genome high-coverage mapping of 5hmC in mESCs using, on average, only 20 million processed reads. To our knowledge, no other method to date can offer a similar performance.

We demonstrated that the method can assess subtle hydroxymethylation changes at individual CGs distributed around exon-intron cross-boundaries, a precision previously achievable only with the gold-standard WGBS. Furthermore, even lower-input (50 ng) DNA hmTOP-seq libraries permitted detection of strand-specific CG hydroxymethylation.

Most recent studies have proposed bisulfite-free methods for single-base resolution 5hmC analysis which exploit differential sensitivity of the cytosine modifications to enzymatic deamination by an AID/APOBEC family DNA deaminase [37] or chemical modification of 5hmC [38,39] to afford differential readout of 5hmC in conventional sequencing. These methods require whole-genome sequencing, which bears a hefty price tag for large-scale studies of epigenetic diseases and cancer.

hmTOP-seq, like all other methods based on the TOP-seq strategy, cannot directly determine the absolute modification levels of CGs but can infer relative hydroxymethylation of each CG based on their sequencing coverage. In addition to 5hmCG-based single-nucleotide-resolution analysis, calculation of regional h-density profiles, especially with lower-input hmTOP-seq libraries (5-ng input DNA), should be used for comparison of 5hmC levels among multiple samples. As a read-count-based epigenome profiling approach, hmTOP-seq is affected by confounders such as aneuploidy and copy number variations (CNVs). However, although studies of heavily genetically transformed genomes, such as cancer genomes, require prior knowledge of the regions affected by CNV, the presence of such large megabase regions does not prevent computation of differentially modified regions within CNVs.

Overall, hmTOP-seq is an attractive semiquantitative method for screening hydroxymethylated cytosines genome-wide in various tissues and clinical samples. We envision that the combination of inherent high resolution and cost-efficiency will pave the way for its wide application in harnessing complex human epigenome variations.

Materials and methods

Cell culture

Murine embryonic stem cells E14TG2a were kindly provided by Prof. Guoliang Xu (Shanghai Institute of Biochemistry and Cell Biology). mESCs were grown on 0.1% gelatin-coated dishes in the presence of 1,000 U/ml LIF (ESGRO Millipore) and 15% fetal bovine serum as described previously [40].

gDNA

DNA from mESCs was purified using standard phenol-chloroform extraction.

Validation of hmTOP-seq in a model DNA fragment system

Two model DNA fragments were made by PCR from the human BMX gene: 1H, 155 bp (oligonucleotides for PCR, 1H-dir 5′-TGTGTTACTGTGTGGAAAAGACC-3′, 1H-rev 5′-CCACTCCTTATAGTTTGGCTGA-3′) and 2H, 202 bp (2H-dir 5′-GCAATGTGTTGTGGAGGAGA-3′, 2H-rev 5′-CCTACTTGGGTTTGCCCTCT-3′). The 5hmC was introduced at GCGC sites of the DNA fragments by incubating 1 μg of 1H/2H DNA fragment mix with a 5-fold molar excess of M.HhaI and 13 mM formaldehyde for 1 h at room temperature followed by Proteinase K treatment (0.2 mg/ml) for 30 min at 55°C and column purification (GeneJet PCR Purification kit [TS]) [26,30]. The efficiency of hydroxymethylation was approximately 90%, according to R.Hin6I endonuclease protection and qPCR analysis (S2E Fig).

To label 5hmC with an azide group, M.HhaI-treated 1H/2H were labeled with 10 U T4 BGT (TS) supplemented with 50 μM UDP-glucose-azide (Jena Bioscience) for 2 h at 37°C, followed by enzyme inactivation at 65°C for 20 min and column purification (DNA Clean & Concentrator-5, Zymo Research).

To label 5hmC-DNA with cysteamine, M.HhaI-treated 1H/2H DNA fragments were incubated with a 10-fold molar excess of M.SssI (TS), 12.5 mM cysteamine (Sigma) in 40 μl of 50 mM NaOAc (pH 6.0), 0.2 mg/ml BSA for 1 h at 30°C, followed by enzyme inactivation at 65°C for 15 min and column purification (GeneJet PCR Purification kit [TS]). DNA was eluted in 40 μl of 20 mM NaHCO3 (pH 9.0) and supplemented with 40 μl of 0.3 M NaHCO3, 20 μl Azide Amine-Activator (Interchim) to final 2 mg/ml concentration, incubated for 2 h at room temperature, and purified through GeneJet PCR Purification kit (Protocol A [TS]) columns.

Following the covalent labeling procedures, DNA was further processed as follows: after ligation of the partially complementary adapters as described previously (step 2 [25]), DNA was supplemented with 20 μM alkyne DNA oligonucleotide (ODN; 5′-T[alkyneT]TTATATATTTATTGGAGACTGACTACCAGATGTAACA-3′, Base-click) and 8 mM CuBr: 24 mM THPTA mixture (Sigma) in 50% of DMSO, incubated for 20 min at 45°C, and subsequently diluted to <1% DMSO before a column purification (GeneJet NGS Cleanup kit, Protocol A [TS]).

A DBCO moiety containing ODN was produced by incubating amine DNA oligonucleotide (5′-C[T-C2-NH2]TTTATATATTTATTGGAGACTGACTACTACCAGATGTAACA-3′, Metabion) with 100× excess of DBCO-sulfo-NHS ester (Glen Research) in 150 mM NaHCO3 for 2 h at room temperature and subsequently purified with Oligo Clean & Concentrator (Zymo Research) spin columns. Azide-DNA was supplemented with 10 μM DBCO ODN in 25% DMSO, incubated for 2 h at 45°C, and subsequently diluted to <1% DMSO before a column purification (GeneJet NGS Cleanup kit, Protocol A [TS]).

Coupling efficiency (S1 Table) was calculated from 2–3 independent experiments as a percentage of model DNA fragments conjugated to ODN on Agilent Bioanalyzer profiles.

A 20 μl reaction containing 5-ng ODN-conjugated DNA, 0.5 μM of a complementary priming strand (EP; 5′-TGTTACATCTGGTAGTCAGTCTCCAATAAATATAT-3′, with custom LNA modifications [Exiqon] and phosphorothioate linkages at the 3′ end), and 2.5 U Pfu DNA polymerase (TS) in Pfu buffer supplemented with 0.2 mM dNTP was incubated at the following cycling conditions: 95°C for 2 min; 5 cycles at 95°C for 1 min, 65°C for 10 min, 72°C for 10 min. Primed DNA was amplified with 2× Platinum SuperFi PCR Master Mix (TS), 0.5 μM barcoded fusion PCR primers A(Ad)-EP-barcode-primer (63 nt), and trP1(Ad)-A2-primer (45 nt) (both primers contained phosphorothioate modifications). Thermocycler conditions were 94°C for 4 min; 15–25 cycles at 95°C for 1 min, 60°C for 1 min, 72°C for 1 min. The final amplified DNA fragments were column purified (GeneJet NGS Cleanup kit, Protocol A [TS]).

To evaluate the priming efficiency, 10 μl of the priming reaction mixture containing 2.5 ng of ODN-conjugated 1H fragment was added to 50 μl mixture of Platinum SuperFi PCR Master mix, EP and 1H-dir or EP and 1H-rev primers (0.5 μM each), and 0.08x SYBR Green I (Sigma-Aldrich) and amplified on Rotor-Gene Q (Qiagen) real-time PCR machine using 30 cycles of the same thermocycling program as stated above.

Validation of hmTOP-seq in a model DNA system

GCGC-hydroxymethylated bacteriophage lambda gDNA was prepared by treating 3 μg of fragmented lambda DNA with a 3-fold molar excess of M.HhaI and 13 mM formaldehyde for 1 h at room temperature. After reaction, DNA was purified using GeneJet PCR Purification kit (TS). Five different DNA mixtures containing 2.5%, 5%, 10%, 20%, and 40% 5hmC at GCGC sites were produced by premixing the HhaI-modified and untreated lambda DNA. Libraries were prepared as described above (with mESC gDNA) with the following changes: 300-ng DNA samples were labeled with T4 BGT as above. ODN-conjugated DNA was used in 25 μl of priming reaction mixture with 2.5 U Pfu DNA polymerase. PCR amplification (for 12 cycles) was carried out by adding all of the above reaction mixture to 100 μl of amplification reaction.

Analysis of bacteriophage lambda hmTOP-seq data

Processing of the bacteriophage lambda hmTOP-seq sequencing data was performed as described previously [25] except for the minimal length of retained reads set at 120 nt. Processed reads were mapped to a lambda phage genome, and only the reads with mapping quality equal or above 60 and starting exactly at a CG site in a GCGC context were used for further analysis. PCR duplicates were removed using the following algorithm: for each uniquely mapped read, its start coordinate (5′ end) and read length without the 3′ adapter were obtained; reads sharing identical start coordinate and read length were considered duplicates and only one was retained.
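
The read filtering and duplicate removal described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the `Read` record, the function names, and the 0-based forward-strand coordinates are assumptions made for the example, and reverse-strand reads are not handled.

```python
from typing import Iterable, List, NamedTuple


class Read(NamedTuple):
    """Hypothetical minimal read record (forward-strand reads only)."""
    start: int   # 0-based reference coordinate of the read's 5' end
    length: int  # read length after removal of the 3' adapter
    mapq: int    # mapping quality reported by the aligner


def starts_at_gcgc_cg(genome: str, start: int) -> bool:
    """True if `start` is the inner C of a GCGC site (G-C-G-C, C at index 1)."""
    return start >= 1 and genome[start - 1:start + 3].upper() == "GCGC"


def filter_and_dedup(reads: Iterable[Read], genome: str, min_mapq: int = 60) -> List[Read]:
    """Keep well-mapped reads starting at a GCGC CG site, one per (start, length)."""
    kept = {}
    for read in reads:
        if read.mapq < min_mapq:
            continue  # mapping quality below the threshold
        if not starts_at_gcgc_cg(genome, read.start):
            continue  # read does not start exactly at a CG in a GCGC context
        # Reads sharing start coordinate and length are treated as PCR duplicates;
        # only the first one encountered is retained.
        kept.setdefault((read.start, read.length), read)
    return list(kept.values())
```

Keying the dictionary on the (start, length) pair mirrors the stated duplicate definition; any additional criteria a production pipeline might apply (for example, strand) are outside this sketch.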

To test the relationship between the level of hydroxymethylation of GCGC sites (2.5%, 5%, 10%, 20%, 40% 5hmC) and their coverage, a quadratic regression model was applied in which median coverage was the dependent variable and 5hmC level was the independent variable.
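
As an illustration of the model form only, a quadratic least-squares fit of median coverage against 5hmC level can be sketched as below. The function name `fit_quadratic` is hypothetical, the solver (normal equations with Gauss-Jordan elimination) is one of several possible choices, and the coverage values in the self-check are synthetic numbers generated from known coefficients, not data from this study.

```python
def fit_quadratic(x, y):
    """Least-squares fit of y = b0 + b1*x + b2*x**2; returns (b0, b1, b2)."""
    # Normal equations (X^T X) b = X^T y for the design matrix [1, x, x^2].
    power_sums = [sum(xi ** k for xi in x) for k in range(5)]
    rhs = [sum(yi * xi ** k for xi, yi in zip(x, y)) for k in range(3)]
    a = [[power_sums[i + j] for j in range(3)] + [rhs[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(3):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return tuple(a[i][3] / a[i][i] for i in range(3))


if __name__ == "__main__":
    levels = [2.5, 5, 10, 20, 40]  # % 5hmC at GCGC sites, as in the mixtures
    # Synthetic median coverage from known coefficients (illustration only).
    coverage = [1 + 2 * v - 0.01 * v * v for v in levels]
    print(fit_quadratic(levels, coverage))
```

Here the 5hmC level is the independent variable and the median coverage the dependent variable, as stated in the text; significance testing of the coefficients is not covered by this sketch.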

Preparation of hmTOP-seq libraries of mESCs

Extracted gDNA of mESCs was sonicated on M220 Focused-ultrasonicator (Covaris) in 10 mM Tris-HCl (pH 8.5) buffer to yield fragments with a peak size of approximately 200 bp. The protocol of hmTOP-seq (Fig 1D) is as follows.

  • Step 1. The 5hmC glycosylation was carried out in a 50 μl reaction mixture with 5, 50, or 500ng of fragmented gDNA supplemented with 50 μM UDP-6-azide-glucose (Jena Bioscience) and 5 U T4 BGT (TS) for 2 h at 37°C, followed by enzyme inactivation at 65°C for 20 min and column purification (GeneJet PCR Purification kit [TS]).

  • Step 2. Azide-tagged DNA was end-filled using a DNA End Repair Kit (TS) according to the vendor’s recommendations, and DNA was purified using the GeneJet Purification Kit (TS). A 3′-dA mononucleotide extension was added to end-repaired DNA by incubating with Klenow exo- polymerase in Klenow Buffer (TS) in the presence of 0.5 mM dATP at 37°C for 45 min, enzyme inactivated at 75°C for 15 min followed by purification through DNA Clean & Concentrator-5 columns (Zymo Research). Partially complementary adaptors A1/A2 (4.5 mM) (produced by annealing of partially complementary 32/33-nt oligonucleotides A1/A2, A1 5′ P-GATTGGAAGAGTGGTTCAGCAGGAATGCTGAG, and A2 5′-ACACTCTTTCCCTACATGACACTCTTCCAATCT) were ligated by incubating the DNA with 15 U of T4 DNA Ligase (TS) in Ligase buffer at 22°C overnight in a total volume of 30 μl, followed by thermal inactivation at 65°C for 10 min and column purification (DNA Clean & Concentrator-5, Zymo Research).

  • Step 3. DNA from Step 2 was supplemented with 20 μM biotinylated alkyne-containing DNA oligonucleotide (5′-T[alkyneT]TTTTGTGTGGTTTGGAGACTGACTACCAGATGTAACA-biotin, Base-click) and 8 mM CuBr: 24 mM THPTA mixture (Sigma) in 50% of DMSO, incubated for 20 min at 45°C, and subsequently diluted to <1.5% DMSO before a column purification (GeneJet NGS Cleanup kit, Protocol A [TS]).

  • Step 4. DNA recovered after the biotinylation step was incubated with 0.1 mg Dynabeads MyOne C1 Streptavidin (TS) in buffer A (10 mM Tris-HCl [pH 8.5], 1 M NaCl) at room temperature for 3 h on a roller. DNA-bound beads were washed 2× with buffer B (10 mM Tris-HCl [pH 8.5], 3 M NaCl, 0.05% Tween 20); 2× with buffer A (supplemented with 0.05% Tween 20); and 1× with 100 mM NaCl and were finally resuspended in water and incubated for 5 min at 95°C to recover enriched DNA fraction.

  • Step 5. Enriched DNA was subsequently used in 30 μl of priming reaction in Pfu buffer with 1.5 U Pfu DNA polymerase (TS), 0.2 mM dNTP, 0.5 μM complementary priming strand (EP; 5′-TGTTACATCTGGTAGTCAGTCTCCAAACCACACAA-3′, with custom LNA modifications [Exiqon] and phosphorothioate linkages at the 3′ end). Reaction mixture was incubated at the following cycling conditions: 95°C for 2 min; 5 cycles at 95°C for 1 min, 65°C for 10 min, 72°C for 10 min.

  • Step 6. Amplification of a primed DNA library was carried out by adding 22 μl of the priming reaction mixture to 100 μl of amplification reaction containing 50 μl of 2× Platinum SuperFi PCR Master Mix (TS) and barcoded fusion PCR primers A(Ad)-EP-barcode-primer (63 nt) and trP1(Ad)-A2-primer (45 nt) at 0.5 μM each (both primers contained phosphorothioate modifications). Thermocycler conditions were as follows: 94°C for 4 min; 12 or 15 cycles at 95°C for 1 min, 60°C for 1 min, 72°C for 1 min. The final libraries were size-selected for approximately 300-bp fragments (MagJet NGS Cleanup and Size-selection kit [TS]), and their quality was tested on Agilent 2100 Bioanalyzer (Agilent Technologies) and by qPCR (TS, Qiagen). Libraries were subjected to Ion Proton (TS) sequencing.

Preparation of uTOP-seq libraries of mESCs

uTOP-seq libraries of mESC using 300-ng input DNA were prepared as described previously [25].

Analysis of mESC hmTOP-seq data

hmTOP-seq data were processed as described for bacteriophage lambda hmTOP-seq with the following exceptions: a minimal raw read length was 80 nt and reads were mapped to the mouse genome build mm10 with a minimal mapping quality of 30. The 5hmCH coverage was calculated using reads that start exactly at those CH sites that do not contain a CG closer than 7 nt in the downstream direction. h-density was calculated and TOP-seq data were processed as described previously [25]. Additionally, the h-density signal was log2 transformed.
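
The CH-site criterion can be made concrete with a short sketch. This reflects one reading of the rule and is not the authors' code: the function name `is_isolated_ch` is hypothetical, positions are 0-based on the read strand, and “closer than 7 nt” is interpreted as the C of a downstream CG lying 1 to 6 nt from the candidate C.

```python
def is_isolated_ch(seq: str, pos: int, min_gap: int = 7) -> bool:
    """True if seq[pos] is the C of a CH site (H = A, C, or T) and no CG
    dinucleotide begins fewer than `min_gap` nt downstream of it."""
    seq = seq.upper()
    if pos + 1 >= len(seq) or seq[pos] != "C" or seq[pos + 1] not in "ACT":
        return False  # not a CH site (or no downstream base to inspect)
    # Any CG starting at offsets 1 .. min_gap-1 lies fully inside this window.
    window = seq[pos + 1:pos + min_gap + 1]
    return "CG" not in window
```

Under this reading, a read start would contribute to 5hmCH coverage only if `is_isolated_ch` returns `True` for its start position; the exclusion presumably guards against reads that originate from a nearby 5hmCG but start a few nucleotides away.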

To test correlations between the technical replicates, in silico subsampling was performed by randomly selecting a fraction of reads assigned to a CG site. Correlation between hmTOP-seq and nano-hmC-Seal signal was determined in high-confidence nano-hmC-Seal regions. For each nano-hmC-Seal region, a total signal from each method was calculated and square-root transformed. Correlation between hmTOP-seq and nano-hmC-Seal as well as for hmTOP-seq technical replicates was calculated for each autosome separately.
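
A minimal sketch of the two operations described above, read subsampling and correlation of square-root-transformed regional totals, is given below. The function names, the per-read Bernoulli sampling scheme, and the use of the Pearson coefficient are assumptions; the text does not specify the sampling procedure or the correlation measure.

```python
import math
import random


def subsample_counts(counts, fraction, seed=0):
    """Keep each read assigned to a CG site independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed makes the subsample reproducible
    return {site: sum(rng.random() < fraction for _ in range(n))
            for site, n in counts.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def sqrt_region_correlation(totals_a, totals_b):
    """Correlate square-root-transformed per-region signal totals of two methods."""
    return pearson([math.sqrt(v) for v in totals_a],
                   [math.sqrt(v) for v in totals_b])
```

To follow the per-chromosome scheme in the text, `sqrt_region_correlation` would be called once per autosome on the regions located on that chromosome.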

Enrichment of genomic elements for the 5hmCG signal was calculated as follows. The 5hmCGs were divided into three signal-strength groups (low: bottom 20%; middle: mid 40%–50%; top: top 80%). Then, a contingency table was created for each CG site falling into a signal group and overlapping a genomic region. Fisher’s exact test was performed to estimate the OR and p-value. For 5hmCH signal enrichment, the same algorithm was used, but only with one signal-strength group (all available 5hmCHs). A general gene profile was created by dividing each gene element into 10 equally sized bins and for each bin calculating average 5hmCG signal normalized by CG density. To calculate strand differences in 5hmCG and uCG along the protein-coding gene, we used the following algorithm. First, we removed genes that were shorter than 1% or longer than 99% of all the genes. Next, each gene was assigned an expression group and divided into 60 equally sized bins. Four expression groups were defined: genes without expression and genes with low, mid, and high expression divided into three equally sized groups. To have a representative signal in the gene body, we selected only those genes that had modification signal in at least 20 bins on both strands. Two-sided paired t test was used to calculate differences between the strands per each expression group. To calculate profiles around the exon-intron boundaries, we selected all internal protein-coding exons that contained modification sites on both sides of the boundary (within absolute distance 1–25 nt). To test differences between the intron and exon sides, the average signal for each exon-intron boundary was calculated and a two-sided paired t test was performed.
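
The enrichment test can be illustrated with a small pure-Python implementation of Fisher's exact test on a 2×2 table. This is a sketch under stated assumptions: the table layout (sites in the signal group versus all other sites, overlapping versus not overlapping the genomic element) and the function names are hypothetical, and the two-sided p-value sums all tables no more probable than the observed one, which is one common convention.

```python
from math import comb


def contingency(in_group, in_region):
    """Build [[a, b], [c, d]] from per-site boolean flags of equal length."""
    a = sum(g and r for g, r in zip(in_group, in_region))          # group, overlapping
    b = sum(g and not r for g, r in zip(in_group, in_region))      # group, not overlapping
    c = sum(r and not g for g, r in zip(in_group, in_region))      # other, overlapping
    d = sum(not g and not r for g, r in zip(in_group, in_region))  # other, not overlapping
    return [[a, b], [c, d]]


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test; returns (sample odds ratio, p-value)."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    p_value = 0.0
    for k in range(max(0, col1 - row2), min(row1, col1) + 1):
        p = prob(k)
        if p <= p_obs * (1 + 1e-9):  # small tolerance for floating-point ties
            p_value += p
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)
```

The exact enumeration is adequate for small tables; for genome-scale counts a dedicated statistics library would be the practical choice.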

Annotations

Mouse genome sequence (build mm10), CGI, and repeat coordinates were downloaded from the UCSC genome browser. CGI shores were defined as 2-kb regions around a CGI. Gene dataset was downloaded from the GENCODE encyclopedia of genes [41]. Gene upstream (promoter) or downstream regions were set as 2-kb regions from a given transcription start or end site, respectively. Intergenic regions were defined as regions that are 50 kb away from the protein-coding genes. mESC histone marks and RNA dataset was taken from the ENCODE data portal (ENCODE project consortium [42,43]). TAB-seq and nano-hmC-Seal datasets were obtained from the Gene Expression Omnibus (TAB-seq: GSE36173; nano-hmC-Seal: GSE77967) [18,33]. Both 5hmC modification datasets were converted to mm10 mouse genome version.
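
The strand-aware definition of the 2-kb upstream (promoter) and downstream regions can be sketched as follows. The function names and the half-open, 0-based interval convention are assumptions made for this example, and treating CGI shores as 2-kb flanks on each side of an island is one interpretation of “2-kb regions around a CGI”.

```python
def gene_flanks(start, end, strand, size=2000):
    """Return (upstream, downstream) half-open intervals for a gene at [start, end)."""
    left = (max(0, start - size), start)  # clipped at the chromosome start
    right = (end, end + size)             # not clipped at the chromosome end
    if strand == "+":
        return left, right   # TSS at `start`, transcript end at `end`
    if strand == "-":
        return right, left   # TSS at `end`, transcript end at `start`
    raise ValueError("strand must be '+' or '-'")


def cgi_shores(start, end, size=2000):
    """Return the two flanking shore intervals of a CG island at [start, end)."""
    return (max(0, start - size), start), (end, end + size)
```

Clipping at the chromosome end would additionally require chromosome lengths, which are omitted here for brevity.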

Supporting information

S1 Fig. Putative conformations of the tethering linkers 1a, 2a, and 3 in relation to T2 (tethered ODN) and target 5hmC (gDNA) nucleotides.

Linker conformations were determined by ChemBio3D Ultra MM2 energy minimization using the target values derived from the coordinates of dT4 and dT6 nucleotides in the template strand of the KOD polymerase–DNA complex (PDB: 5omf) as follows: C5–C5 distance = 8.99 Å; C5-methyl bond angles for dT6 (corresponds to T2 in the tethered ODN) and dT4 (corresponds to 5hmC) = 58.8° and 124.4°, respectively; dihedral angle (dT4–dT6 helical twist) = 59.8°. Actual refined values are shown in fine print. 5hmC, 5-hydroxymethylcytosine; ODN, oligodeoxyribonucleotide; PDB, Protein Data Bank.

(TIF)

S2 Fig. Specificity and efficiency of 5hmC mapping by hmTOP-seq.

(A) Schematic view shows four priming products generated from the two model DNA fragments, 1H and 2H. Theoretical sizes of the four specific hmTOP-seq and uTOP-seq products (including 135-bp adapters) are as follows: 1H-dir 176 bp, 2H-dir 194 bp, 1H-rev 247 bp, and 2H-rev 276 bp. (B) Agilent Bioanalyzer profiles of hmTOP-seq priming products obtained from 1H and 2H model DNA fragments, each containing a single 5hmC, followed by derivatization with cysteamine (black line) or the azide group (red line). For each type of derivatization, different number of PCR cycles was required to detect comparable amounts of four products (20 cycles for azide- and 25 cycles for cysteamine-derivatization). (C) Comparison of hmTOP-seq and uTOP-seq [25] in the 1H/2H model DNA system. Bioanalyzer profiles of the products obtained from 1H and 2H model DNA fragments each containing a single 5hmC or an unmodified CG processed through the hmTOP-seq (red line) or uTOP-seq procedure (black line). In both cases, the corresponding workflow generates four specific products with similar efficiencies (15 cycles of PCR were used). Azide-labeling of unmethylated CG sites in the model DNA fragments was performed as described [25]. (D) Agilent Bioanalyzer profiles of hmTOP-seq priming products obtained from 1H/2H, followed by copper-free (black line) or Cu(I)-catalyzed (red line) click conjugation to DNA oligonucleotide. In both cases, 25 cycles of PCR were used. (E) Assessment of the M.HhaI-directed hydroxymethylation efficiency on two model DNA fragments. After incubation of 1H/2H model DNA fragments with M.HhaI in the presence of formaldehyde, the DNA fragments were cleaved with Hin6I restriction endonuclease and the amount of uncleaved DNA was evaluated by qPCR with the respective primer pairs (1H-dir/1H-rev or 2H-dir/2H-rev; see “Validation of hmTOP-seq in a model DNA fragment system” in Materials and methods). The data underlying this section are included in S2 Data. 
5hmC, 5-hydroxymethylcytosine; A1 and A2, strands of a partially complementary adaptor; Ad-A2 and Ad-TO, adapters containing NGS platform-specific 5′-end sequences; hmTOP-seq, 5hmC-specific TOP-seq; qPCR, quantitative PCR; TO, tethered oligodeoxyribonucleotide; uTOP-seq, uCG-specific TOP-seq.

(TIFF)

S3 Fig

Distance distribution of read start positions from (A) a nearest GCGC or (B) a CG site in the hmTOP-seq library of pre-hydroxymethylated lambda DNA (2.5% 5hmC at GCGC sites) and mESCs, respectively. The data underlying this figure are included in S2 Data. 5hmC, 5-hydroxymethylcytosine; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

(TIFF)

S4 Fig. Correlation between technical replicates using subsamples of the hmTOP-seq data.

Sequencing reads were sampled from 500-ng and 50-ng input DNA hmTOP-seq libraries, respectively. The data underlying this section are included in S2 Data. hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing.

(TIFF)

S5 Fig. hmTOP-seq analysis of non-CG sites.

Odds ratio (Fisher’s test) for enrichment (A) or distribution (B) of 76,665 5hmCHs detected in mESCs across various genomic features. All enrichments have p < 0.05. The data underlying this section are included in S2 Data. 5hmCH, hydroxymethylated CH site; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

(TIFF)

S6 Fig. Genomic distribution of 5hmCGs in mESCs.

Distribution of 5hmCGs in 50-ng input DNA hmTOP-seq libraries across the sense and the antisense strands of genes grouped according to their expression level. Numbers of genes in each group and p-values for the modification difference between the strands are shown above each graph. hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

(TIFF)

S7 Fig. Changes in 5hmCG distribution at exon-intron cross-boundaries.

Distribution of 5hmCGs in 50-ng input DNA hmTOP-seq libraries at both sides of the exon-intron boundary is presented for the sense and the antisense strands. The x-axis shows the distance (nt) of CGs from boundary. p-Values indicate a difference in coverage between exonic and intronic side of the boundary for the first 25 nt. The data underlying this section are included in S2 Data. 5hmCG, hydroxymethylated CG site; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; nt, nucleotide.

(TIFF)

S1 Table. Structural parameters and TOP-seq performance of tethering linkers.

TOP-seq, tethered oligonucleotide–primed sequencing.

(XLSX)

S2 Table. Sequencing statistics of hmTOP-seq and uTOP-seq libraries of mESC DNA.

hmTOP-seq, 5hmC-specific TOP-seq; mESC, mouse embryonic stem cell; uTOP-seq, uCG-specific TOP-seq.

(XLSX)

S1 Data. Numerical data underlying Figs 2A, 2B, 2C, 2D, 2E, 2F, 2G, 3A, 4A and 4B.

(XLSX)

S2 Data. Numerical data underlying S2E, S3A, S3B, S4, S5A, S5B and S7 Figs.

(XLSX)

Acknowledgments

We are grateful to Janina Ličytė for initial experiments of 5hmC labeling with BGT and gDNA of mESCs and Prof. Viktoras Masevičius for advice on MM2 modeling. We thank Prof. Guoliang Xu for mESCs and Dr. Vaidotas Stankevičius for cultivating them.

Abbreviations

5hmC

5-hydroxymethylcytosine

5hmCG

hydroxymethylated CG site

5hmCH

hydroxymethylated CH site

5mC

5-methylcytosine

ABA-seq

DNA modification–dependent AbaSI restriction coupled with sequencing

BGT

β-glucosyltransferase

CGI

CG island

CNV

copy number variation

DBCO

dibenzocyclooctyne

eM.SssI MTase

an engineered version of the CG-specific DNA cytosine-5 methyltransferase M.SssI

gDNA

genomic DNA

H3K27ac

histone H3 lysine 27 acetylation

H3K36me3

histone H3 lysine 36 trimethylation

H3K4me1

histone H3 lysine 4 monomethylation

H3K4me3

histone H3 lysine 4 trimethylation

H3K9ac

histone H3 lysine 9 acetylation

hMe-Seal

5hmC selective chemical labeling

hmTOP-seq

5hmC-specific tethered oligonucleotide–primed sequencing

KOD polymerase

the thermophilic DNA polymerase from Thermococcus kodakaraensis

LTR

long terminal repeat

mCG

methylated CG site

mESC

mouse embryonic stem cell

M.HhaI MTase

the GCGC-specific DNA cytosine-5 methyltransferase HhaI

NHS

N-hydroxysuccinimide

nt

nucleotide

ODN

oligodeoxyribonucleotide

OR

odds ratio

PDB

Protein Data Bank

Pfu polymerase

the thermophilic DNA polymerase from Pyrococcus furiosus

Pvu-Seal-seq

DNA modification–dependent 5hmC enrichment coupled with sequencing

SINE

short interspersed nuclear element

TAB-seq

Tet-assisted bisulfite sequencing

TET

Ten-Eleven Translocation enzyme

TOP-seq

tethered oligonucleotide–primed sequencing

TTS

transcription termination site

uCG

unmodified CG site

uTOP-seq

uCG-specific tethered oligonucleotide–primed sequencing

UTR

untranslated region

WGBS

whole-genome bisulfite sequencing

Data Availability

Raw and processed hmTOP-seq data generated in this study have been deposited in the NCBI Gene Expression Omnibus under accession number GSE140206.

Funding Statement

The work was supported by the Research Council of Lithuania (https://www.lmt.lt/en/) (Researcher groups project No. MIP-58-17 to EK) and the European Research Council (https://erc.europa.eu/) (ERC-AdG-2016/742654 to SK). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Pastor WA, Pape UJ, Huang Y, Henderson HR, Lister R, Ko M, et al. Genome-wide mapping of 5-hydroxymethylcytosine in embryonic stem cells. Nature. 2011;473: 394–397. doi:10.1038/nature10102
  • 2. Song C-X, Szulwach KE, Fu Y, Dai Q, Yi C, Li X, et al. Selective chemical labeling reveals the genome-wide distribution of 5-hydroxymethylcytosine. Nat Biotechnol. 2011;29: 68–72. doi:10.1038/nbt.1732
  • 3. Szulwach KE, Li X, Li Y, Song CX, Wu H, Dai Q, et al. 5-hmC-mediated epigenetic dynamics during postnatal neurodevelopment and aging. Nat Neurosci. 2011;14: 1607–1616. doi:10.1038/nn.2959
  • 4. Wu H, D’Alessio AC, Ito S, Wang Z, Cui K, Zhao K, et al. Genome-wide analysis of 5-hydroxymethylcytosine distribution reveals its dual function in transcriptional regulation in mouse embryonic stem cells. Genes Dev. 2011;25: 679–684. doi:10.1101/gad.2036011
  • 5. Spruijt CG, Gnerlich F, Smits AH, Pfaffeneder T, Jansen PWTC, Bauer C, et al. Dynamic readers for 5-(Hydroxy)methylcytosine and its oxidized derivatives. Cell. 2013;152: 1146–1159. doi:10.1016/j.cell.2013.02.004
  • 6. Bachman M, Uribe-Lewis S, Yang X, Williams M, Murrell A, Balasubramanian S. 5-Hydroxymethylcytosine is a predominantly stable DNA modification. Nat Chem. 2014;6: 1049–1055. doi:10.1038/nchem.2064
  • 7. Szwagierczak A, Bultmann S, Schmidt CS, Spada F, Leonhardt H. Sensitive enzymatic quantification of 5-hydroxymethylcytosine in genomic DNA. Nucleic Acids Res. 2010;38. doi:10.1093/nar/gkq684
  • 8. Globisch D, Münzel M, Müller M, Michalakis S, Wagner M, Koch S, et al. Tissue Distribution of 5-Hydroxymethylcytosine and Search for Active Demethylation Intermediates. PLoS ONE. 2010;5: e15367. doi:10.1371/journal.pone.0015367
  • 9. Terragni J, Bitinaite J, Zheng Y, Pradhan S. Biochemical characterization of recombinant β-glucosyltransferase and analysis of global 5-Hydroxymethylcytosine in unique genomes. Biochemistry. 2012;51: 1009–1019. doi:10.1021/bi2014739
  • 10. Nestor CE, Ottaviano R, Reddington J, Sproul D, Reinhardt D, Dunican D, et al. Tissue type is a major modifier of the 5-hydroxymethylcytosine content of human genes. Genome Res. 2012;22: 467–477. doi:10.1101/gr.126417.111
  • 11. Wagner M, Steinbacher J, Kraus TFJ, Michalakis S, Hackner B, Pfaffeneder T, et al. Age-Dependent Levels of 5-Methyl-, 5-Hydroxymethyl-, and 5-Formylcytosine in Human and Mouse Brain Tissues. Angew Chemie Int Ed. 2015;54: 12511–12514. doi:10.1002/anie.201502722
  • 12. Haffner MC, Chaux A, Meeker AK, Esopi D, Gerber J, Pellakuru LG, et al. Global 5-hydroxymethylcytosine content is significantly reduced in tissue stem/progenitor cell compartments and in human cancers. Oncotarget. 2011;2: 627–37. doi:10.18632/oncotarget.316
  • 13. Jin S-G, Jiang Y, Qiu R, Rauch TA, Wang Y, Schackert G, et al. 5-Hydroxymethylcytosine Is Strongly Depleted in Human Cancers but Its Levels Do Not Correlate with IDH1 Mutations. Cancer Res. 2011;71: 7360–7365. doi:10.1158/0008-5472.CAN-11-2023
  • 14. Kraus TFJ, Globisch D, Wagner M, Eigenbrod S, Widmann D, Münzel M, et al. Low values of 5-hydroxymethylcytosine (5hmC), the “sixth base,” are associated with anaplasia in human brain tumors. Int J Cancer. 2012;131: 1577–1590. doi:10.1002/ijc.27429
  • 15. Ficz G, Branco MR, Seisenberger S, Santos F, Krueger F, Hore TA, et al. Dynamic regulation of 5-hydroxymethylcytosine in mouse ES cells and during differentiation. Nature. 2011;473: 398–402. doi:10.1038/nature10008
  • 16. Booth MJ, Branco MR, Ficz G, Oxley D, Krueger F, Reik W, et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science. 2012;336: 934–7. doi:10.1126/science.1220671
  • 17. Khare T, Pai S, Koncevicius K, Pal M, Kriukiene E, Liutkeviciute Z, et al. 5-hmC in the brain is abundant in synaptic genes and shows differences at the exon-intron boundary. Nat Struct Mol Biol. 2012;19: 1037–43. doi:10.1038/nsmb.2372
  • 18. Yu M, Hon GC, Szulwach KE, Song C-X, Zhang L, Kim A, et al. Base-Resolution Analysis of 5-Hydroxymethylcytosine in the Mammalian Genome. Cell. 2012;149: 1368–1380. doi:10.1016/j.cell.2012.04.027
  • 19. Sun Z, Dai N, Borgaro JG, Quimby A, Sun D, Corrêa IR, et al. A Sensitive Approach to Map Genome-wide 5-Hydroxymethylcytosine and 5-Formylcytosine at Single-Base Resolution. Mol Cell. 2015;57: 750–761. doi:10.1016/j.molcel.2014.12.035
  • 20. Chen K, Zhang J, Guo Z, Ma Q, Xu Z, Zhou Y, et al. Loss of 5-hydroxymethylcytosine is linked to gene body hypermethylation in kidney cancer. Cell Res. 2016;26: 103–118. doi:10.1038/cr.2015.150
  • 21. Johnson KC, Houseman EA, King JE, von Herrmann KM, Fadul CE, Christensen BC. 5-Hydroxymethylcytosine localizes to enhancer elements and is associated with survival in glioblastoma patients. Nat Commun. 2016;7: 13177. doi:10.1038/ncomms13177
  • 22. Raiber E-A, Beraldi D, Martínez Cuesta S, McInroy GR, Kingsbury Z, Becq J, et al. Base resolution maps reveal the importance of 5-hydroxymethylcytosine in a human glioblastoma. npj Genomic Med. 2017;2: 6. doi:10.1038/s41525-017-0007-6
  • 23. Zhou D, Alver BM, Li S, Hlady RA, Thompson JJ, Schroeder MA, et al. Distinctive epigenomes characterize glioma stem cells and their response to differentiation cues. Genome Biol. 2018;19: 43. doi:10.1186/s13059-018-1420-6
  • 24. Libertini E, Heath SC, Hamoudi RA, Gut M, Ziller MJ, Czyz A, et al. Information recovery from low coverage whole-genome bisulfite sequencing. Nat Commun. 2016;7: 11306. doi:10.1038/ncomms11306
  • 25. Staševskij Z, Gibas P, Gordevičius J, Kriukienė E, Klimašauskas S. Tethered Oligonucleotide-Primed Sequencing, TOP-Seq: A High-Resolution Economical Approach for DNA Epigenome Profiling. Mol Cell. 2017;65: 554–564.e6. doi:10.1016/j.molcel.2016.12.012
  • 26. Liutkevičiūtė Z, Kriukienė E, Grigaitytė I, Masevičius V, Klimašauskas S. Methyltransferase-Directed Derivatization of 5-Hydroxymethylcytosine in DNA. Angew Chemie Int Ed. 2011;50: 2090–2093. doi:10.1002/anie.201007169
  • 27. Hu L, Liu Y, Han S, Yang L, Cui X, Gao Y, et al. Jump-seq: Genome-Wide Capture and Amplification of 5-Hydroxymethylcytosine Sites. J Am Chem Soc. 2019;141: 8694–8697. doi:10.1021/jacs.9b02512
  • 28. Bergen K, Betz K, Welte W, Diederichs K, Marx A. Structures of KOD and 9°N DNA Polymerases Complexed with Primer Template Duplex. ChemBioChem. 2013;14: 1058–1062. doi:10.1002/cbic.201300175
  • 29. Wynne SA, Pinheiro VB, Holliger P, Leslie AGW. Structures of an Apo and a Binary Complex of an Evolved Archeal B Family DNA Polymerase Capable of Synthesising Highly Cy-Dye Labelled DNA. Maga G, editor. PLoS ONE. 2013;8: e70892. doi:10.1371/journal.pone.0070892
  • 30. Liutkevičiūtė Z, Lukinavičius G, Masevičius V, Daujotytė D, Klimašauskas S. Cytosine-5-methyltransferases add aldehydes to DNA. Nat Chem Biol. 2009;5: 400–402. doi:10.1038/nchembio.172
  • 31. Stadler MB, Murr R, Burger L, Ivanek R, Lienert F, Schöler A, et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature. 2011;480: 490–495. doi:10.1038/nature10716
  • 32. Sun Z, Terragni J, Jolyon T, Borgaro JG, Liu Y, Yu L, et al. High-resolution enzymatic mapping of genomic 5-hydroxymethylcytosine in mouse embryonic stem cells. Cell Rep. 2013;3: 567–76. doi:10.1016/j.celrep.2013.01.001
  • 33. Han D, Lu X, Shih AH, Nie J, You Q, Xu MM, et al. A Highly Sensitive and Robust Method for Genome-wide 5hmC Profiling of Rare Cell Populations. Mol Cell. 2016;63: 711–719. doi:10.1016/j.molcel.2016.06.028
  • 34. Song C-X, Szulwach KE, Dai Q, Fu Y, Mao S-Q, Lin L, et al. Genome-wide Profiling of 5-Formylcytosine Reveals Its Roles in Epigenetic Priming. Cell. 2013;153: 678–691. doi:10.1016/j.cell.2013.04.001
  • 35. Wen L, Li X, Yan L, Tan Y, Li R, Zhao Y, et al. Whole-genome analysis of 5-hydroxymethylcytosine and 5-methylcytosine at base resolution in the human brain. Genome Biol. 2014;15: R49. doi:10.1186/gb-2014-15-3-r49
  • 36. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, et al. Dynamic changes in the human methylome during differentiation. Genome Res. 2010;20: 320–331. doi:10.1101/gr.101907.109
  • 37. Schutsky EK, DeNizio JE, Hu P, Liu MY, Nabel CS, Fabyanic EB, et al. Nondestructive, base-resolution sequencing of 5-hydroxymethylcytosine using a DNA deaminase. Nat Biotechnol. 2018;36: 1083–1090. doi:10.1038/nbt.4204
  • 38. Zeng H, He B, Xia B, Bai D, Lu X, Cai J, et al. Bisulfite-Free, Nanoscale Analysis of 5-Hydroxymethylcytosine at Single Base Resolution. J Am Chem Soc. 2018;140: 9. doi:10.1021/jacs.8b08297
  • 39. Wang Y, Zhang X, Wu F, Chen Z, Zhou X. Bisulfite-free, single base-resolution analysis of 5-hydroxymethylcytosine in genomic DNA by chemical-mediated mismatch. Chem Sci. 2019;10: 447–452. doi:10.1039/c8sc04272a
  • 40. Xu Y, Wu F, Tan L, Kong L, Xiong L, Deng J, et al. Genome-wide Regulation of 5hmC, 5mC, and Gene Expression by Tet1 Hydroxylase in Mouse Embryonic Stem Cells. Mol Cell. 2011;42: 451–464. doi:10.1016/j.molcel.2011.04.005
  • 41. Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res. 2019;47: D766–D773. doi:10.1093/nar/gky955
  • 42.Consortium TEP. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489: 57–74. 10.1038/nature11247 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 43.Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, et al. The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res. 2018;46: D794–D801. 10.1093/nar/gkx1081 [DOI] [PMC free article] [PubMed] [Google Scholar]

Decision Letter 0

Ines Alvarez-Garcia

8 Nov 2019

Dear Dr Kriukiene,

Thank you for submitting your manuscript entitled "Precise Genomic Mapping of 5-hydroxymethylcytosine via Covalent Tether-directed Sequencing" for consideration as a Research Article by PLOS Biology. Thank you also for your patience as we completed our editorial process, and please accept my apologies for the delay in providing you with our decision.

Your manuscript has now been evaluated by the PLOS Biology editorial staff as well as by an academic editor with relevant expertise and I am writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Nov 12 2019 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

Carlyle House, Carlyle Road

Cambridge, CB4 3DN

+44 1223–442810

Decision Letter 1

Ines Alvarez-Garcia

20 Dec 2019

Dear Dr Kriukiene,

Thank you very much for submitting your manuscript "Precise Genomic Mapping of 5-hydroxymethylcytosine via Covalent Tether-directed Sequencing" for consideration as a Methods and Resources at PLOS Biology. Thank you also for your patience as we completed our editorial process, and please accept my apologies for the delay in providing you with our decision. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

As you will see, the reviewers are positive in general and find the method described in the manuscript interesting and novel. However, they also raise several issues that need to be addressed before we can consider the manuscript for publication. After consulting with the academic editor, we would like to highlight several points, which we deem particularly important and consider essential. These include issues relating to priming efficiency (Rev. 2, point 1), yield of reaction (Rev. 2, point 2), demonstration of b-GT activity outside the CG context (Rev. 2, point 5), and demonstration of a clear benchmark (Rev. 3, point 1). In addition, inclusion of a discussion of limitations of the method, as suggested by Reviewer 1, will also be important.

In light of the reviews (attached below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 2 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point-by-point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

Carlyle House, Carlyle Road

Cambridge, CB4 3DN

+44 1223–442810

---------------------------------------------------------------

Reviewers’ comments

Rev. 1: Sriharsa Pradhan – please note that this reviewer has waived anonymity

This manuscript describes a new sequencing technique called hmTOP-seq that can be used for identifying 5hmC at single base resolution genome-wide. In this method, fragmented genomic DNA is tagged with an azide group through BGT-glucosylation, followed by tethering of an ODN containing a single biotin group to the azide groups using click chemistry. The biotin-labeled fragments are captured on streptavidin beads, and TO-primed strand extension and PCR amplification are performed to generate a sequencing library. The authors hypothesized that this method would identify 5hmC on any genomic DNA. To validate their approach, the authors used mouse ES genomic DNA and compared their sequencing data with data obtained from TAB-seq (bisulphite based) and ABA-seq and Pvu-Seal Seq (restriction enzyme based) and found good correlation between 5hmC mapped regions. The total number of the hmTOP-seq-called 5hmCH sites constituted 1.6% of 5hmCGs, similar to the 1.3% reported by TAB-seq. There are some differences between the restriction enzyme-based methods. The authors have also compared 5hmC distributions at other epigenomic features such as histone modifications and genic distribution. Based on their analyses, the authors conclude that the non-destructive nature of hmTOP-seq allowed the construction of 5hmC maps with several hundred-fold less starting DNA than for WGBS, and that it is thus a cost-effective target enrichment protocol. Furthermore, on average, only 20 M processed reads can be used for mES genome 5hmC analysis. This is a good and thorough study supported by chemistry and Pfu enzymatic amplification characteristics. The method described is likely to be used by various labs studying DNA hydroxymethylation and function. I do not have any major criticisms; a few minor clarifications are needed to improve the manuscript’s readability.

Minor comments:

1. Would the authors mention the shortcomings of this method? In general, all protocols have some issues that need to be handled carefully for optimal results.

2. Similarly, a stepwise working protocol that is easy to follow must be included in the supplementary section to make the protocol reproducible.

Rev. 2:

Gibas et al. have developed a new method to sequence and map 5hmC, an important modified DNA mark found in humans. Their technique relies on enzymatic modification of 5hmC with an azido-containing sugar, then conjugation to an oligonucleotide primer. Following enrichment of the oligo-containing DNA, this primer is used to amplify the surrounding region of the 5hmC. Excitingly, they show the first base extension of this priming occurring mostly at the 5hmC it is attached to. This gives them the ability to map the position of 5hmC at single-base resolution, not currently possible with pull-down techniques. I believe this is an interesting piece of work and deserves to be published; however, there are a number of pieces of information missing to confirm their findings before this should happen.

Major comments:

1. The initial analysis of the technique on model DNA fragments is weak and very badly explained. The authors must re-explain what is happening in Figure S2, as it is not clear. This is vital as it underpins their whole technology. It is important for them to sequence the amplified products to find out what percentage actually prime at the 5hmC site versus the neighbouring sites. Additionally, the use of ‘high’, ‘medium’, and ‘low’ in Table S1 is unacceptable. The authors must quantify their findings. For instance, in the text they mention the DBCO addition ‘failed to generate any substantial amounts of the priming’, but in Table S1 they mark it as ‘low’.

2. The analysis of the technique on the bacteriophage was also lacking. How do formaldehyde and bacteriophage genomic DNA form 5hmC? Did they analyse the DNA to confirm the yield of this reaction? The authors seem to assume 100% yield. Why was analysis only considered at GCGC sites? The authors should have again mapped the priming sites to see if this matches the analysis of the synthetic DNA. Additionally, in Fig 2B, why was a curved line used as a trend line? This looks linear through the medians.

3. Regarding the analysis of the mESC data, the quantification of mapped reads around CG sites is vague. In the text it says 93% ‘at or around CGs’; why can’t this be quantified as the % at each position relative to the 5hmC? From Fig 2C this looks to be ~65% at the 5hmC site. This should then be compared to the data from the model DNA and bacteriophage mapping, which was not conducted. I think I understand, but the authors should really explain why their method can analyse asymmetric 5hmC so well (I believe because the direction of priming has to be 5’ to 3’). It would be good to make a figure for this.

4. A lot of the time it is unclear if the mESC data is being analysed with only the 500 ng library, or combined with either or both of the 50 and 5 ng libraries. This needs to be clear at all times. How many replicates of each library size were prepared and analysed? The Pearson correlation coefficient is good for the 500 ng library, but poor for the 50 ng library and no correlation is seen for the 5 ng library. This indicates that they require at least 500 ng input DNA for good coverage of 5hmC (even if less of the library can be sequenced following this). This goes against one of the main advantages the authors are claiming. Additionally, how good is it to then use data from the 50 and 500 ng libraries together?

5. The authors analyse 5hmC in non-CG context with their method. However, has it been previously shown that the b-GT is active outside of the CG context? If not, the authors should quantify this activity in synthetic DNA to confirm their method is actually functional in these contexts.

Other comments:

6. More citations are required where reference 1 is, to highlight the amount of work that has been carried out into quantifying the levels of 5hmC across multiple tissue types.

7. 5hmC is only mentioned as an intermediate in the demethylation pathway, please include the extra details that demonstrate it may be an epigenetic mark in its own right.

8. The authors mention that sequencing to high depth is ‘prohibitively expensive’ (page 2). I think this is unfair, as many groups do sequence to high depths. ‘Prohibitively’ is not the correct term.

9. As the authors use the ‘TOP-seq’ method also, they need to explain how this works.

10. The ‘h-density’ is not explained at all, and I do not know what this term refers to. Please explain why they have used this, and what the ‘180 bp window’ is. Does this remove the single base resolution advantage?

11. At the end of page 8 the authors say they compare ‘hydroxymethylation and general cytosine modification levels’ using hmTOP and TOP. How was this done as the authors haven’t demonstrated that TOP-seq can show where modified cytosines are present, just where non-modified cytosine is present. I don't see how they can call a modified cytosine from the negative of TOP-seq, as their method is not truly quantifiable.

12. The authors mention reference 18 in passing. However, this is an incredibly important literature reference, as they perform almost the same method described within. This should be described in more detail. From reading reference 18, it does seem this paper has a major advantage of closer to single-base resolution, but this needs to be discussed in more detail.

13. Why were LNA and phosphorothioate modifications used on the priming strand (page 14)?

Rev. 3:

In this study, Gibas et al. set a goal to develop a method to map 5-hydroxymethylcytosine (5hmC) in DNA. Previously, the group demonstrated that tethering of non-complementary oligonucleotides to unmodified CpGs enables priming of DNA polymerases. As a consequence, loci containing unmodified CpGs can be enriched by amplification and identified using short read sequencing (TOP-seq).

To map 5hmC genome-wide, the authors explore several strategies. DNA methyltransferase SssI and β-glucosyltransferase (BGT) were used to modify the 5-hydroxymethyl group of 5hmC with functional groups, which can be chemically ligated to oligonucleotides. The best efficiency was achieved by employing BGT, which installed azidoglucose, enabling azide-alkyne cycloaddition (CLICK). Oligonucleotide complementary primers are then used to amplify and enrich 5hmC-containing DNA, which is sequenced and mapped to localise the modification (hmTOP-seq). To evaluate the performance of the method, the authors sequence in vitro hydroxymethylated phage DNA and mouse embryonic stem cell DNA. Sequencing results demonstrated near-single-nucleotide resolution: 60% of reads start exactly at a CpG, whereas 93% of reads are localised within 4 bp. An important advantage of hmTOP-seq is that the method works with as little as 5 ng of DNA. However, the performance of the method when starting with 500 ng of DNA is substantially better, achieving a twofold improvement in the correlation between replicates and in the coverage of 5hmCpGs. Finally, the authors were able to detect both 5hmCpG and 5hmCpH sequences, strand-specific enrichment of 5hmCpGs in the mES cell genome, and enrichment of 5hmCs around the splice junction, demonstrating the sensitivity of the method.

A number of 5hmC mapping techniques have already been published and are in use. A similar approach to install oligonucleotides for 5hmC mapping was used by Hu et al (2019, Jump-seq). The authors of this manuscript, however, employed different oligonucleotides and managed to achieve a better resolution. Overall, I think the manuscript does provide convincing evidence that a priming-based 5hmC mapping strategy could be successfully used to map 5hmCs at a lower cost than other methods.

I have the following criticisms, which need to be addressed before publication:

1. It is important to benchmark the performance of hmTOP-seq in its ability to detect 5hmC sites with different modification frequencies. The authors could use TAB-seq data to select sites with different degrees of 5hmC and examine the correlation with hmTOP-seq coverage.

2. What are the sources of the inefficiency of hmTOP-seq - is it labelling of 5hmCs or DNA polymerase priming from the non-complementary oligonucleotide?

3. Authors state - "Using six hmTOP-seq control libraries we observed 55,025 false-positive sites". I was not able to find what are the cut-offs for false positives and what criteria were used to define them.

4. When discussing strand bias, the term "template" strand should be used instead of "opposite" strand.

Decision Letter 2

Ines Alvarez-Garcia

12 Feb 2020

Dear Dr Kriukiene,

Thank you for submitting your revised Methods and Resources manuscript entitled "Precise Genomic Mapping of 5-hydroxymethylcytosine via Covalent Tether-directed Sequencing" for publication in PLOS Biology. I have now obtained advice from two of the original reviewers and have discussed their comments with the Academic Editor.

Based on the reviews (attached below), we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the two remaining points raised by Reviewer 2. Please also make sure to address the data and other policy-related requests noted at the end of this email.

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer. In addition to the remaining revisions and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

*Copyediting*

Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

*Submitting Your Revision*

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include a cover letter, a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable), and a track-changes file indicating any changes that you have made to the manuscript.

Please do not hesitate to contact me should you have any questions.

Sincerely,

Ines

--

Ines Alvarez-Garcia, PhD

Senior Editor

PLOS Biology

Carlyle House, Carlyle Road

Cambridge, CB4 3DN

+44 1223–442810

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it:

Fig. 2A, B, C, D, E, F, G; Fig. 3A; Fig. 4A, B; Fig. S2E; Fig. S3A, B; Fig. S4; Fig. S5A, B and Fig. S7

NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on WHERE THE UNDERLYING DATA CAN BE FOUND, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

---------------------------------------------------------------

Reviewers’ comments

Rev. 2:

The authors have substantially improved their manuscript. I only have two further comments:

1) Regarding the trend line in Fig 2B. After plotting a linear regression, this also does not look as though it fits. Potentially it is linear from 0-20, but then it plateaus. I didn't mean for the authors to just put a straight line on the graph. They should actually use some statistics to identify the trend.

2) The authors mention input quantity as an improvement over bisulfite sequencing in the introduction ("demands sizeable amounts of input DNA") and in the discussion ("hmTOP-seq allowed the construction of 5hmC maps with several hundred-fold less starting DNA amounts than for WGBS"). The authors' method does have some distinct advantages over WGBS as commonly performed; however, input quantity is not one of them, and 'hundred-fold less' grossly overstates the difference even relative to the input quantities of conventional bisulfite sequencing (0.5-5 μg DNA) (DOI 10.1186/s13059-018-1408-2). At this reference, you can also find links to papers where WGBS has been performed on a few hundred cells and even on single cells. In the authors' manuscript, 5 ng could not be used for single base resolution.

Rev. 3:

The authors addressed my concerns. The manuscript is now suitable for publication.

Decision Letter 3

Ines Alvarez-Garcia

27 Mar 2020

Dear Dr Kriukiene,

On behalf of my colleagues and the Academic Editor, Tom Misteli, I am pleased to inform you that we will be delighted to publish your Methods and Resources in PLOS Biology.

The files will now enter our production system. You will receive a copyedited version of the manuscript, along with your figures for a final review. You will be given two business days to review and approve the copyedit. Then, within a week, you will receive a PDF proof of your typeset article. You will have two days to review the PDF and make any final corrections. If there is a chance that you'll be unavailable during the copy editing/proof review period, please provide us with contact details of one of the other authors whom you nominate to handle these stages on your behalf. This will ensure that any requested corrections reach the production department in time for publication.

Early Version

The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards,

Alice Musson

Publication Assistant,

PLOS Biology

on behalf of

Ines Alvarez-Garcia,

Senior Editor

PLOS Biology

Associated Data


    Supplementary Materials

    S1 Fig. Putative conformations of the tethering linkers 1a, 2a, and 3 in relation to T2 (tethered ODN) and target 5hmC (gDNA) nucleotides.

    Linker conformations were determined by ChemBio3D Ultra MM2 energy minimization using the target values derived from the coordinates of dT4 and dT6 nucleotides in the template strand of the KOD polymerase–DNA complex (PDB: 5omf) as follows: C5–C5 distance = 8.99 Å; C5-methyl bond angles for dT6 (corresponds to T2 in the tethered ODN) and dT4 (corresponds to 5hmC) = 58.8° and 124.4°, respectively; dihedral angle (dT4–dT6 helical twist) = 59.8°. Actual refined values are shown in fine print. 5hmC, 5-hydroxymethylcytosine; ODN, oligodeoxyribonucleotide; PDB, Protein Data Bank.

    (TIF)

    S2 Fig. Specificity and efficiency of 5hmC mapping by hmTOP-seq.

    (A) Schematic view of the four priming products generated from the two model DNA fragments, 1H and 2H. Theoretical sizes of the four specific hmTOP-seq and uTOP-seq products (including 135-bp adapters) are as follows: 1H-dir 176 bp, 2H-dir 194 bp, 1H-rev 247 bp, and 2H-rev 276 bp. (B) Agilent Bioanalyzer profiles of hmTOP-seq priming products obtained from 1H and 2H model DNA fragments, each containing a single 5hmC, followed by derivatization with cysteamine (black line) or the azide group (red line). For each type of derivatization, a different number of PCR cycles was required to detect comparable amounts of the four products (20 cycles for azide- and 25 cycles for cysteamine-derivatization). (C) Comparison of hmTOP-seq and uTOP-seq [25] in the 1H/2H model DNA system. Bioanalyzer profiles of the products obtained from 1H and 2H model DNA fragments, each containing a single 5hmC or an unmodified CG, processed through the hmTOP-seq (red line) or uTOP-seq procedure (black line). In both cases, the corresponding workflow generates four specific products with similar efficiencies (15 cycles of PCR were used). Azide-labeling of unmethylated CG sites in the model DNA fragments was performed as described [25]. (D) Agilent Bioanalyzer profiles of hmTOP-seq priming products obtained from 1H/2H, followed by copper-free (black line) or Cu(I)-catalyzed (red line) click conjugation to a DNA oligonucleotide. In both cases, 25 cycles of PCR were used. (E) Assessment of the M.HhaI-directed hydroxymethylation efficiency on two model DNA fragments. After incubation of 1H/2H model DNA fragments with M.HhaI in the presence of formaldehyde, the DNA fragments were cleaved with Hin6I restriction endonuclease and the amount of uncleaved DNA was evaluated by qPCR with the respective primer pairs (1H-dir/1H-rev or 2H-dir/2H-rev; see “Validation of hmTOP-seq in a model DNA fragment system” in Materials and methods). The data underlying this figure are included in S2 Data. 5hmC, 5-hydroxymethylcytosine; A1 and A2, strands of a partially complementary adapter; Ad-A2 and Ad-TO, adapters containing NGS platform-specific 5′-end sequences; hmTOP-seq, 5hmC-specific TOP-seq; qPCR, quantitative PCR; TO, tethered oligodeoxyribonucleotide; uTOP-seq, uCG-specific TOP-seq.

    (TIFF)

    S3 Fig

    Distance distribution of read start positions from (A) the nearest GCGC site or (B) the nearest CG site in the hmTOP-seq libraries of pre-hydroxymethylated lambda DNA (2.5% 5hmC at GCGC sites) and mESCs, respectively. The data underlying this figure are included in S2 Data. 5hmC, 5-hydroxymethylcytosine; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

    (TIFF)

    S4 Fig. Correlation between technical replicates using subsamples of the hmTOP-seq data.

    Sequencing reads were sampled from the 500-ng and 50-ng input DNA hmTOP-seq libraries. The data underlying this figure are included in S2 Data. hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing.

    (TIFF)

    S5 Fig. hmTOP-seq analysis of non-CG sites.

    Odds ratio (Fisher’s test) for enrichment (A) or distribution (B) of 76,665 5hmCHs detected in mESCs across various genomic features. All enrichments have p < 0.05. The data underlying this section are included in S2 Data. 5hmCH, hydroxymethylated CH site; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

    (TIFF)

    S6 Fig. Genomic distribution of 5hmCGs in mESCs.

    Distribution of 5hmCGs in 50-ng input DNA hmTOP-seq libraries across the sense and the antisense strands of genes grouped according to their expression level. Numbers of genes in each group and p-values for the modification difference between the strands are shown above each graph. hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; mESC, mouse embryonic stem cell.

    (TIFF)

    S7 Fig. Changes in 5hmCG distribution at exon-intron cross-boundaries.

    Distribution of 5hmCGs in 50-ng input DNA hmTOP-seq libraries on both sides of the exon-intron boundary is presented for the sense and the antisense strands. The x-axis shows the distance (nt) of CGs from the boundary. p-Values indicate a difference in coverage between the exonic and intronic sides of the boundary for the first 25 nt. The data underlying this figure are included in S2 Data. 5hmCG, hydroxymethylated CG site; hmTOP-seq, 5hmC-specific tethered oligonucleotide–primed sequencing; nt, nucleotide.

    (TIFF)

    S1 Table. Structural parameters and TOP-seq performance of tethering linkers.

    TOP-seq, tethered oligonucleotide–primed sequencing.

    (XLSX)

    S2 Table. Sequencing statistics of hmTOP-seq and uTOP-seq libraries of mESC DNA.

    hmTOP-seq, 5hmC-specific TOP-seq; mESC, mouse embryonic stem cell; uTOP-seq, uCG-specific TOP-seq.

    (XLSX)

    S1 Data. Numerical data underlying Figs 2A, 2B, 2C, 2D, 2E, 2F, 2G, 3A, 4A and 4B.

    (XLSX)

    S2 Data. Numerical data underlying S2E, S3A, S3B, S4, S5A, S5B and S7 Figs.

    (XLSX)

    Attachment

    Submitted filename: Rebuttal letter.pdf

    Data Availability Statement

    Raw and processed hmTOP-seq data generated in this study have been deposited in the NCBI Gene Expression Omnibus under accession number GSE140206.


    Articles from PLoS Biology are provided here courtesy of PLOS