Bioinformatics. 2013 Aug 21;29(20):2645–2646. doi: 10.1093/bioinformatics/btt459

MLML: consistent simultaneous estimates of DNA methylation and hydroxymethylation

Jianghan Qu 1, Meng Zhou 1, Qiang Song 1, Elizabeth E Hong 1, Andrew D Smith 1,*
PMCID: PMC3789553  PMID: 23969133

Abstract

Motivation: The two major epigenetic modifications of cytosines, 5-methylcytosine (5-mC) and 5-hydroxymethylcytosine (5-hmC), coexist with each other in a range of mammalian cell populations. Increasing evidence points to important roles of 5-hmC in demethylation of 5-mC and epigenomic regulation in development. Recently developed experimental methods allow direct single-base profiling of either 5-hmC or 5-mC. Meaningful analyses seem to require combining these experiments with bisulfite sequencing, but doing so naively produces inconsistent estimates of 5-mC or 5-hmC levels.

Results: We present a method to jointly model read counts from bisulfite sequencing, oxidative bisulfite sequencing and Tet-Assisted Bisulfite sequencing, providing simultaneous estimates of 5-hmC and 5-mC levels that are consistent across experiment types.

Availability: http://smithlab.usc.edu/software/mlml

Contact: andrewds@usc.edu

Supplementary information: Supplementary material is available at Bioinformatics online.

1 INTRODUCTION

DNA methylation is an important epigenetic mark in mammals. In addition to the extensively studied 5-methylcytosine (5-mC) modification, its oxidation product, 5-hydroxymethylcytosine (5-hmC), has been observed at substantial levels in both somatic and embryonic stem cells (Kriaucionis and Heintz, 2009; Tahiliani et al., 2009). Recent studies of 5-hmC in mouse TET knock-out models (Ito et al., 2010), mouse zygotic development (Iqbal et al., 2011) and multiple cell types (Globisch et al., 2010; Ito et al., 2011; Kinney et al., 2011; Sun et al., 2013) suggest that 5-hmC is involved in epigenetic regulation.

Currently, the most comprehensive and accurate method for profiling cytosine methylation is bisulfite sequencing (BS-seq). Treatment with sodium bisulfite converts unmethylated cytosines to uracils but does not distinguish between 5-mC and 5-hmC (Huang et al., 2010); consequently, the methylation level measured by BS-seq is the sum of the 5-mC and 5-hmC levels. Two recently developed techniques, oxidative bisulfite sequencing (oxBS-seq) (Booth et al., 2012) and Tet-Assisted Bisulfite sequencing (TAB-seq) (Yu et al., 2012), provide high-throughput single-base resolution measurements of 5-mC and 5-hmC, respectively. Any two of BS-seq, TAB-seq or oxBS-seq can be combined to profile both the 5-mC and 5-hmC methylomes of a cell population; especially when studying 5-hmC, proper interpretation of results depends on having some estimate of the 5-mC level. However, naive manipulation of read count frequencies from independent sequencing experiments often produces two kinds of ‘overshoot’ problems in estimating 5-mC and 5-hmC levels. When combining BS-seq with TAB-seq, the 5-mC level at a given CpG site can be estimated by subtracting the 5-hmC level (TAB-seq) from the combined 5-mC + 5-hmC level (BS-seq). The result can be negative because of random sampling (or systematic error) in each experiment. Similarly, combining TAB-seq and oxBS-seq can lead to estimates of 5-mC and 5-hmC levels whose sum exceeds 100%. Overshoot sites can make up a substantial proportion of the data. In one dataset based on oxBS-seq technology, 17% of CpG sites captured by reduced representation bisulfite sequencing (RRBS) and oxRRBS experiments exhibited overshoot (Booth et al., 2012). Fully leveraging the information in these data requires a method for making consistent estimates of 5-mC and 5-hmC levels.
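The two overshoot cases can be illustrated with a small numerical sketch. The function names and the read counts below are hypothetical and chosen only for illustration; they are not taken from the MLML software or from any dataset.

```python
def naive_bs_tab(bs_c, bs_t, tab_c, tab_t):
    """Naive frequency estimates from BS-seq and TAB-seq read counts.

    BS-seq C reads report 5-mC + 5-hmC; TAB-seq C reads report 5-hmC only.
    Returns (5-mC level, 5-hmC level); the 5-mC level may come out negative.
    """
    hmc = tab_c / (tab_c + tab_t)
    mc = bs_c / (bs_c + bs_t) - hmc
    return mc, hmc


def naive_oxbs_tab(ox_c, ox_t, tab_c, tab_t):
    """Naive estimates from oxBS-seq (5-mC only) and TAB-seq (5-hmC only)."""
    mc = ox_c / (ox_c + ox_t)
    hmc = tab_c / (tab_c + tab_t)
    return mc, hmc


def is_overshoot(mc, hmc):
    """True when the two levels are not a valid pair of proportions."""
    return mc < 0 or hmc < 0 or mc + hmc > 1


# Made-up site: BS-seq has 3 C reads out of 10, TAB-seq has 5 C reads out of 10.
mc, hmc = naive_bs_tab(3, 7, 5, 5)
print(mc, hmc, is_overshoot(mc, hmc))  # the 5-mC estimate is negative

# Made-up site: oxBS-seq has 8 C reads out of 10, TAB-seq has 4 C reads out of 10.
mc, hmc = naive_oxbs_tab(8, 2, 4, 6)
print(mc, hmc, is_overshoot(mc, hmc))  # the two levels sum to more than 1
```

Both sites have plausible read counts for each experiment taken alone; the inconsistency appears only when the two experiments are combined.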

We present maximum likelihood methylation levels (MLML) for simultaneous estimation of 5-mC and 5-hmC, combining data from any two of BS-seq, TAB-seq or oxBS-seq, or all three when available. Our estimates are consistent in that 5-mC and 5-hmC levels are non-negative and never sum to more than 1. In an important subset of cases, our estimates are not only consistent but also show significantly greater accuracy at sites with lower coverage.

2 METHODS

Each of BS-seq, TAB-seq and oxBS-seq provides some amount of information about both the 5-mC and 5-hmC levels. Our approach is to combine information from any pair or all three of these experiments, and arrive at maximum likelihood estimates (MLEs) for the 5-mC and 5-hmC levels. A similar method has been developed in the context of haplotype frequency estimation from pooled sequencing (Kessner et al., 2013). To explain our method, we assume the data are from TAB-seq and BS-seq experiments for the same biological sample. The more general formulation is provided in Supplementary Information.

Focusing on an individual CpG site, let $p_m$ denote the methylation level (a probability), $p_h$ the hydroxymethylation level and $p_u = 1 - p_m - p_h$ the level of unmethylated C. In the TAB-seq experiment, let $h$ denote the number of C reads mapping over the CpG site, and let $g$ denote the number of T reads mapping over the same CpG. The total number of reads covering the CpG site in the TAB-seq experiment is then $h + g$. Similarly, let $t$ denote the number of C reads mapping over the site in the BS-seq experiment and $u$ the number of T reads, so that the total number of reads covering the CpG in the BS-seq experiment is $t + u$. If values for $p_m$ and $p_h$ are known, $h$ and $u$ are binomial random variables, i.e. $h \sim \mathrm{Binom}(h + g, p_h)$ and $u \sim \mathrm{Binom}(t + u, p_u)$:

$$\Pr(h \mid p_h) = \binom{h+g}{h} p_h^{h} (1 - p_h)^{g}$$

$$\Pr(u \mid p_u) = \binom{t+u}{u} p_u^{u} (1 - p_u)^{t}$$
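As a minimal sketch of the binomial model above, the observed-data log-likelihood for one CpG site can be written directly from the two probabilities. The function names are hypothetical and the read counts are made up; this is not the MLML implementation.

```python
import math


def binom_logpmf(k, n, p):
    """Log of the binomial p.m.f., with the boundary cases p = 0 and p = 1 handled."""
    if p <= 0.0:
        return 0.0 if k == 0 else -math.inf
    if p >= 1.0:
        return 0.0 if k == n else -math.inf
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1.0 - p)


def log_likelihood(h, g, t, u, pm, ph):
    """Log-likelihood of TAB-seq counts (h C, g T) and BS-seq counts (t C, u T).

    h is binomial with success probability ph; u is binomial with success
    probability pu = 1 - pm - ph (the level of unmethylated C).
    """
    pu = 1.0 - pm - ph
    return binom_logpmf(h, h + g, ph) + binom_logpmf(u, t + u, pu)


# Made-up counts with no overshoot: the frequency estimates are
# ph = 2/10 = 0.2 and pm = 6/10 - 2/10 = 0.4.
h, g, t, u = 2, 8, 6, 4
print(log_likelihood(h, g, t, u, pm=0.4, ph=0.2))  # at the frequency estimates
print(log_likelihood(h, g, t, u, pm=0.3, ph=0.3))  # at a different valid point
```

For these counts the frequency estimates form a valid pair of levels, so they should give the larger of the two log-likelihood values.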

Given observations of $(h, g, t, u)$, when no overshoot would result, we use the frequencies to estimate $\mathbf{p} = (p_m, p_h, p_u)$, where $p_u = 1 - p_m - p_h$. In this case, the frequencies directly give MLEs. At overshoot sites, we introduce latent variables and use expectation maximization to approximate the MLE for $\mathbf{p}$. Let $k_1$ ($k_2$) be the number of C (T) reads in BS-seq (TAB-seq) that correspond to 5-mCs. Then $t - k_1$ ($g - k_2$) is the number of C (T) reads corresponding to 5-hmC (unmethylated C). The complete data likelihood is then

$$L(\mathbf{p} \mid h, g, t, u, k_1, k_2) = f(k_1,\, t - k_1,\, u;\, \mathbf{p}) \; f(k_2,\, h,\, g - k_2;\, \mathbf{p})$$

where $f(\cdot\,; \mathbf{p})$ is a multinomial p.m.f. with counts ordered as (5-mC, 5-hmC, unmethylated C). Estimates for $p_h$ and $p_m$ are then computed by the expectation maximization algorithm to account for the latent $k_1$ and $k_2$ (Supplementary Information). The MLEs can be compared with binomial confidence intervals around the corresponding frequency estimates if direct readouts (e.g. for 5-hmC in the case of TAB-seq) are available. When estimates fall outside the specified confidence interval, sites are flagged as being ‘strongly’ inconsistent. An overabundance of such sites might suggest systematic error.
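The expectation maximization step for the BS-seq + TAB-seq case can be sketched as follows. This is an illustrative reading of the latent-variable description above, not the MLML source code: the function name, the uniform starting point, the tolerance and the choice to run EM at every site (rather than only at overshoot sites) are all assumptions.

```python
def em_bs_tab(h, g, t, u, tol=1e-10, max_iter=1000):
    """EM sketch for 5-mC and 5-hmC levels from TAB-seq (h C, g T) and BS-seq (t C, u T).

    Latent counts: BS-seq C reads that are 5-mC, and TAB-seq T reads that are 5-mC.
    Returns (pm, ph); both are non-negative and pm + ph <= 1 by construction.
    """
    n = h + g + t + u
    if n == 0:
        raise ValueError("no reads cover this site")
    pm, ph, pu = 1.0 / 3, 1.0 / 3, 1.0 / 3  # assumed interior starting point
    for _ in range(max_iter):
        # E-step: expected latent counts given the current levels.
        k1 = t * pm / (pm + ph) if pm + ph > 0 else 0.0  # BS-seq C reads from 5-mC
        k2 = g * pm / (pm + pu) if pm + pu > 0 else 0.0  # TAB-seq T reads from 5-mC
        # M-step: multinomial MLE from the completed counts.
        new_pm = (k1 + k2) / n
        new_ph = (h + t - k1) / n
        new_pu = (u + g - k2) / n
        delta = max(abs(new_pm - pm), abs(new_ph - ph))
        pm, ph, pu = new_pm, new_ph, new_pu
        if delta < tol:
            break
    return pm, ph


# Made-up overshoot site: the naive 5-mC estimate is 3/10 - 5/10 = -0.2.
print(em_bs_tab(h=5, g=5, t=3, u=7))
# Made-up site without overshoot: naive estimates are pm = 0.4, ph = 0.2.
print(em_bs_tab(h=2, g=8, t=6, u=4))
```

Because every iterate is a set of normalized non-negative counts, the estimates cannot leave the valid region, which is the consistency property described in the text. At the overshoot site the 5-mC estimate is pushed toward the boundary value 0 instead of becoming negative.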

3 RESULTS

To understand the properties of our estimators and the frequency method, we used simulations with fixed coverage and precisely set levels for 5-mC and 5-hmC, assuming the experiments were BS-seq and TAB-seq. The case of BS-seq and oxBS-seq is symmetric, with the estimates for $p_h$ and $p_m$ exchanged. For each valid combination of 5-mC and 5-hmC levels from Inline graphic, we simulated from binomial distributions for both BS-seq and TAB-seq. Estimates for $p_h$ and $p_m$ were made using the maximum likelihood method and the frequency method, which estimates $p_h$ using $h/(h+g)$ and $p_m$ using $t/(t+u) - h/(h+g)$. The relative error ($|\hat{p} - p|/p$) for both estimation methods was computed and then averaged over 100 000 simulations for each parameter combination. The average estimation errors are presented in Supplementary Table S1. Estimates of $p_h$ are more accurate using MLML, especially at lower values of $p_h$ and low coverage. For example, when the true values are Inline graphic, MLML reduces the average relative error by >23% at overshoot sites compared with frequency estimates when the coverage is Inline graphic, and this reduction in error increases to 57% for such sites covered only Inline graphic. The trend for errors of $p_h$ estimates is shown in Figure 1a, indicating the accuracy advantage for MLML as a function of coverage. The simulation also revealed substantial proportions of overshoot sites under different 5-mC and 5-hmC level combinations (Fig. 1b, Supplementary Tables).
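A much smaller version of this kind of simulation can be sketched with the standard library alone. The sketch below only estimates the proportion of overshoot sites under the frequency method for BS-seq + TAB-seq; the function names, the parameter values, the number of simulations and the seed are assumptions for illustration and do not reproduce the settings behind Supplementary Table S1 or Figure 1.

```python
import random


def binomial_draw(rng, n, p):
    """Draw one binomial(n, p) value as a sum of n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def overshoot_fraction(pm, ph, coverage, n_sims=2000, seed=0):
    """Fraction of simulated sites where the frequency estimate of 5-mC is negative.

    Both experiments are simulated at the same fixed coverage, so the naive
    estimate t/coverage - h/coverage is negative exactly when t < h.
    """
    rng = random.Random(seed)
    n_overshoot = 0
    for _ in range(n_sims):
        h = binomial_draw(rng, coverage, ph)       # TAB-seq C reads (5-hmC)
        t = binomial_draw(rng, coverage, pm + ph)  # BS-seq C reads (5-mC + 5-hmC)
        if t < h:
            n_overshoot += 1
    return n_overshoot / n_sims


# Illustrative levels only: low 5-mC relative to 5-hmC makes overshoot likely.
for coverage in (5, 10, 20, 50):
    print(coverage, overshoot_fraction(pm=0.1, ph=0.3, coverage=coverage))
```

With BS-seq and TAB-seq the only possible overshoot is a negative 5-mC estimate, since the 5-hmC frequency and the BS-seq frequency are each between 0 and 1 on their own.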

Fig. 1. Accuracy is improved at lower coverage using MLML (BS-seq + TAB-seq). (a) Average absolute errors of 5-hmC level estimates at overshoot sites. (b) Proportion of overshoot sites in simulated data


ACKNOWLEDGEMENTS

The authors thank Fang Fang, Philip Uren, Benjamin Decato and Jun Zhou for suggestions and testing of this software. They also thank an anonymous referee for suggesting the functionality of flagging sites that are so inconsistent as to suggest non-random error.

Funding: This work was supported by a grant from the US National Institutes of Health National Human Genome Research Institute (R01 HG005238).

Conflict of interest: none declared.

REFERENCES

  1. Booth M, et al. Quantitative sequencing of 5-methylcytosine and 5-hydroxymethylcytosine at single-base resolution. Science. 2012;336:934–937. doi: 10.1126/science.1220671.
  2. Globisch D, et al. Tissue distribution of 5-hydroxymethylcytosine and search for active demethylation intermediates. PLoS One. 2010;5:e15367. doi: 10.1371/journal.pone.0015367.
  3. Huang Y, et al. The behaviour of 5-hydroxymethylcytosine in bisulfite sequencing. PLoS One. 2010;5:e8888. doi: 10.1371/journal.pone.0008888.
  4. Iqbal K, et al. Reprogramming of the paternal genome upon fertilization involves genome-wide oxidation of 5-methylcytosine. Proc. Natl Acad. Sci. USA. 2011;108:3642–3647. doi: 10.1073/pnas.1014033108.
  5. Ito S, et al. Role of Tet proteins in 5mC to 5hmC conversion, ES-cell self-renewal and inner cell mass specification. Nature. 2010;466:1129–1133. doi: 10.1038/nature09303.
  6. Ito S, et al. Tet proteins can convert 5-methylcytosine to 5-formylcytosine and 5-carboxylcytosine. Science. 2011;333:1300–1303. doi: 10.1126/science.1210597.
  7. Kessner D, et al. Likelihood estimation of frequencies of known haplotypes from pooled sequence data. Mol. Biol. Evol. 2013;30:1145–1158. doi: 10.1093/molbev/mst016.
  8. Kinney S, et al. Tissue specific distribution and dynamic changes of 5-hydroxymethylcytosine in mammalian genome. J. Biol. Chem. 2011;286:24685–24693. doi: 10.1074/jbc.M110.217083.
  9. Kriaucionis S, Heintz N. The nuclear DNA base 5-hydroxymethylcytosine is present in Purkinje neurons and the brain. Science. 2009;324:929–930. doi: 10.1126/science.1169786.
  10. Sun Z, et al. High-resolution enzymatic mapping of genomic 5-hydroxymethylcytosine in mouse embryonic stem cells. Cell Rep. 2013;3:567–576. doi: 10.1016/j.celrep.2013.01.001.
  11. Tahiliani M, et al. Conversion of 5-methylcytosine to 5-hydroxymethylcytosine in mammalian DNA by MLL partner TET1. Science. 2009;324:930–935. doi: 10.1126/science.1170116.
  12. Yu M, et al. Base-resolution analysis of 5-hydroxymethylcytosine in the mammalian genome. Cell. 2012;149:1368–1380. doi: 10.1016/j.cell.2012.04.027.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
