Author manuscript; available in PMC: 2018 Apr 2.
Published in final edited form as: Int J Math Comput Sci. 2010;5(2):87–100.

An Empirical Bayes Approach for Methylation Differentiation at the Single Nucleotide Resolution

Kenneth McCallum 1, Wenxin Jiang 1, Ji-Ping Wang 1
PMCID: PMC5880554  NIHMSID: NIHMS827227  PMID: 29619112

Abstract

DNA methylation is an important epigenetic phenomenon that is associated with a variety of diseases, particularly cancers. The recent development of high throughput sequencing technology has enabled researchers to investigate the methylation rate at single nucleotide resolution for any given sample. Testing for equality of, or a difference in, methylation rates between two samples, however, is challenged by the small sample size observed at many sites across the genome. Fisher’s exact test is typically used in this situation; however, it is conservative and it cannot be used to test for a specific difference in methylation rate between two samples. In this paper, we propose an empirical Bayes approach that utilizes the genome-wide data as prior information for methylation differentiation between two samples. We show that this new approach is more powerful than Fisher’s exact test. In addition, it can be used to test for any specific methylation difference while controlling the false discovery rate (FDR). The new method is applied to a real data set from a colon tumor study.

Keywords: Empirical Bayes, DNA Methylation, Single-nucleotide

AMS subject classification: 62J05, 62J07, 62H35, 62P10

1. Introduction

The epigenetic phenomenon of DNA methylation, in which cytosines in CpG dinucleotides are chemically modified by the addition of a methyl group, plays an important role in genetic regulation [1, 2]. Methylation rates are known to change throughout the genome during development in mammals [3, 4]. Furthermore, differential methylation rates are associated with a variety of diseases, including neurodevelopmental disorders [5], and numerous cancers [6, 7, 8, 9].

Most research up to now has focused on methylation rates over large regions of the genome; however, an increasing number of studies attempt to quantify and analyze methylation rates at specific sites [1, 2]. Methods making use of universal bead arrays have been able to detect differential methylation rates at the single nucleotide resolution and have demonstrated that these differences can be used to distinguish normal and cancer tissues [10]. However, the array based methods are limited in terms of the number of sites that can be examined; for example, only 1536 sites were included in the study by Bibikova et al. [10]. Another approach, the one which will be the focus of this study, makes use of high throughput bisulfite sequencing. This method is increasingly common and has the potential to examine hundreds of thousands or millions of sites simultaneously. For example, Laurent et al. [11] and Gu et al. [12] both made use of bisulfite sequencing to generate maps of methylation with single-nucleotide resolution, Han et al. [9] tested for site specific differential methylation in samples taken from subjects with and without lung cancer using bisulfite sequencing, and Houseman et al. [13] used clustering methods to differentiate methylation rates.

The data set produced by Gu et al. [12] is used for illustration in this study. The data was for two tissue samples, a colon tumor and normal colon tissue, both taken from the same donor. Bisulfite sequencing was used to determine methylation status at targeted CpG sites across the genome. At each site (corresponding to the C in a CpG dinucleotide), the data included the number of reads (number of sequenced DNA fragments) that covered the given site, and the number of reads that were positive for methylation at the given site. Figure 1 illustrates the format of the data. A total of 920,441 sites had at least one read for both tissue samples and only these sites were included in the present study. Although a few sites had very large numbers of reads, some with more than 1400, the majority were small, with a median of 10 or fewer across both samples. Summary statistics for the number of reads are given in Table 1.

Figure 1.

Format of data.

Figure 1 shows two regions of equal length that contain the targeted CpG sites for methylation examination. Bisulfite sequencing generated two groups of short reads of identical length, each of which covered one of these regions. For example, in region 1, four reads were generated. Within each read, the methylation status at each site was identified as positive or negative. In the figure, n denotes the total number of reads observed at a given site, and x the number of methylations (positives) observed across reads.

Table 1.

Quantiles for number of reads per site (nij).

          Minimum   Q1   Median   Q3   Maximum
Normal    1         2    6        14   14043
Tumor     1         3    10       25   14361

The central goal of methylation studies is to identify CpG sites or regions that show differential methylation rates between disease or cancer tissue and normal tissue. Given the small number of reads (or sample size) at a given site, Fisher’s exact test is often the only choice for testing the equality of methylation rates between two samples. Fisher’s exact test, however, is conservative and therefore has reduced power. Furthermore, it cannot be used to test for a specific difference in the methylation rates between two samples. The latter is particularly important because, in practice, a meaningful difference in methylation is often called only if the methylation rate in one sample is higher or lower than the other by a predefined threshold value (see details below). These two limitations motivate us to seek an alternative approach.

In this paper, we propose an empirical Bayes (eB) approach in which we use the large amount of data observed across the entire genome to construct a prior distribution for the methylation rate. Based on the posterior distribution of the methylation rate at each site, we test for a difference in methylation rates while controlling the false discovery rate. We show that this new approach has improved power compared to Fisher’s exact test. In addition, it can be used to test for any specific difference in methylation rates between two samples.

2. Methods & Results

2.1. The Model

Let xij be the observed number of methylated reads out of a total of nij reads at site i from sample j, for i = 1, …, M and j = 1, 2, and let θij be the true, unobserved methylation rate. Let N = (N1, N2) and X = (X1, X2), where Nj = {nij : i = 1, …, M} and Xj = {xij : i = 1, …, M}. We assume that the methylation rates are independent across samples and sites, and have a common distribution within each sample,

$$\theta_{ij} \sim \mathrm{Beta}(\gamma_j, \lambda_j).$$

Note that although the model allows for the possibility that the samples may differ with respect to the hyper-parameters γ and λ, results given below show that in practice a single set of hyper-parameters can be used if the samples are similar. We further assume that each read is a random observation from the population, i.e., the entire underlying tissue. Then xij follows a binomial distribution

$$x_{ij} \mid (n_{ij}, \theta_{ij}) \sim \mathrm{Binomial}(n_{ij}, \theta_{ij}).$$

The posterior distribution of the methylation rate given the read data is then

$$\theta_{ij} \mid (n_{ij}, x_{ij}) \sim \mathrm{Beta}(\gamma_j + x_{ij},\ \lambda_j + n_{ij} - x_{ij}).$$

To estimate the hyper-parameters, observe that the likelihood function is

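$$L(\gamma_j, \lambda_j; N_j, X_j) = \prod_{i=1}^{M} \int_0^1 P[X_{ij} = x_{ij} \mid \theta_{ij}, n_{ij}]\, p(\theta_{ij} \mid \gamma_j, \lambda_j)\, d\theta_{ij}.$$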

Under the model,

$$P[X_{ij} = x_{ij} \mid \theta_{ij}, n_{ij}] = \binom{n_{ij}}{x_{ij}}\, \theta_{ij}^{x_{ij}} (1 - \theta_{ij})^{n_{ij} - x_{ij}}$$

and

$$p(\theta_{ij} \mid \gamma_j, \lambda_j) = B^{-1}(\gamma_j, \lambda_j)\, \theta_{ij}^{\gamma_j - 1} (1 - \theta_{ij})^{\lambda_j - 1},$$

where B stands for the Beta function. Therefore,

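$$\begin{aligned}
L(\gamma_j, \lambda_j; N_j, X_j) &= \prod_{i=1}^{M} \int_0^1 \binom{n_{ij}}{x_{ij}} B^{-1}(\gamma_j, \lambda_j)\, \theta_{ij}^{\gamma_j + x_{ij} - 1} (1 - \theta_{ij})^{\lambda_j + n_{ij} - x_{ij} - 1}\, d\theta_{ij} \\
&= \prod_{i=1}^{M} \binom{n_{ij}}{x_{ij}} B^{-1}(\gamma_j, \lambda_j)\, B(\gamma_j + x_{ij},\ \lambda_j + n_{ij} - x_{ij}).
\end{aligned}$$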

The maximum likelihood estimates of γj and λj, denoted γ^j and λ^j, can easily be found using a method such as the Newton-Raphson algorithm. For the tumor and normal colon tissue data from [12], we fitted the Beta-binomial model for each sample separately, and then for the combined data. Results are summarized in Table 2. The MLEs for γ are approximately equal across samples while the MLEs for λ show a greater difference, with the tumor tissue having a λ value approximately 10% greater than the normal tissue. Despite this small discrepancy in λ, the density curves, shown in Figure 2, appear almost identical. This suggests that little would be gained by specifying separate priors for the two samples in this case.
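
As a rough illustration of the fitting step, the sketch below evaluates the Beta-binomial log-likelihood derived above and maximises it numerically. It is a sketch under stated assumptions, not the original analysis code: it uses a coarse-to-fine grid search in place of the Newton-Raphson algorithm mentioned in the text, and the function names (`log_likelihood`, `fit_grid`), the search range, and the grid sizes are hypothetical choices made for brevity.

```python
import math
from collections import Counter


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_choose(n, x):
    """Logarithm of the binomial coefficient C(n, x)."""
    return math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)


def log_likelihood(gamma, lam, pair_counts):
    """Beta-binomial log-likelihood for one sample.

    pair_counts maps (x, n) -> number of sites with x methylated reads
    out of n reads, so identical sites are evaluated only once.
    """
    base = log_beta(gamma, lam)
    return sum(
        m * (log_choose(n, x) + log_beta(gamma + x, lam + n - x) - base)
        for (x, n), m in pair_counts.items()
    )


def fit_grid(sites, lo=0.05, hi=5.0, steps=25, rounds=4):
    """Approximate MLE of (gamma, lambda) by coarse-to-fine grid search.

    Assumption: the grid search stands in for Newton-Raphson purely for
    brevity. sites is an iterable of (x, n) tuples.
    """
    pair_counts = Counter(sites)
    g_lo, g_hi, l_lo, l_hi = lo, hi, lo, hi
    g_best = l_best = None
    for _ in range(rounds):
        dg = (g_hi - g_lo) / (steps - 1)
        dl = (l_hi - l_lo) / (steps - 1)
        grid = [(g_lo + i * dg, l_lo + k * dl)
                for i in range(steps) for k in range(steps)]
        g_best, l_best = max(
            grid, key=lambda p: log_likelihood(p[0], p[1], pair_counts))
        # Zoom in on a window of two grid steps around the current best point.
        g_lo, g_hi = max(1e-3, g_best - 2 * dg), g_best + 2 * dg
        l_lo, l_hi = max(1e-3, l_best - 2 * dl), l_best + 2 * dl
    return g_best, l_best
```

Calling `fit_grid` on the (xij, nij) pairs of one sample, or on the pooled pairs of both samples, would give estimates analogous to the separate and combined columns of Table 2. Grouping identical (x, n) pairs keeps the cost manageable because most sites have few reads.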

Table 2.

Maximum likelihood estimates of hyperparameter values.

Parameter Normal Tumor Combined
γ 0.365518 0.359081 0.362389
λ 0.550387 0.614117 0.582426

Estimates of the parameters for the prior distribution are given for normal colon tissue data, tumor colon tissue data, and the combined data set.

Figure 2.

Prior Distribution Densities.

The densities of the fitted prior distributions are given. The methylation rates are assumed to be i.i.d. draws from these priors across all sites in the data set used to fit them.

To verify the fit of the model, we calculated the expected distribution of the observed methylation rates, defined as xij/nij, under the beta model fitted to the combined data and the empirical distribution of nij, treating xij as a random variable. For a given nij,

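$$P(X_{ij} = x_{ij} \mid n_{ij}, \hat{\gamma}, \hat{\lambda}) = \int_0^1 P[X_{ij} = x_{ij} \mid n_{ij}, \theta_{ij}]\, p(\theta_{ij} \mid \hat{\gamma}, \hat{\lambda})\, d\theta_{ij}.$$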

The probability for observing methylation rate q ≡ xij/nij is then found by taking the weighted average over all pairs (x, n) such that x/n = q. That is, if Sq = {(x, n): x/n = q}, then the expected probability to observe q in the sample given the empirical distribution of nij is

$$\sum_{(x_{ij},\, n_{ij}) \in S_q} P(X_{ij} = x_{ij} \mid n_{ij}, \hat{\gamma}, \hat{\lambda})\, P(N_{ij} = n_{ij}),$$

where P (Nij = nij) is the probability of a site having nij reads based on the empirical distribution. This is then compared to the observed distribution of methylation rates. Figure 3 gives the shape of the distributions. The shape is similar to that of the curves in Figure 2, though it reflects the fact that the distribution of observed rates is discrete. The spikes that occur near 0 and 1 in the plot are partially due to the large number of sites with small numbers of reads, which are highly constrained in terms of values they can take on. Overall, the evidence shows that the model is a very good fit for the data.
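
The weighted sum over Sq can be sketched as follows. This is an illustrative reading of the calculation, assuming the read counts are supplied as a plain list of nij values (each at least 1); the function names `betabinom_pmf` and `expected_rate_distribution` are hypothetical.

```python
import math
from collections import Counter
from fractions import Fraction


def betabinom_pmf(x, n, gamma, lam):
    """P(X = x | n, gamma, lambda) under the Beta-binomial model."""
    def log_beta(a, b):
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    log_c = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_c + log_beta(gamma + x, lam + n - x)
                    - log_beta(gamma, lam))


def expected_rate_distribution(read_counts, gamma, lam):
    """Expected probability of each observed rate q = x/n.

    read_counts is the list of observed n values (all >= 1); its empirical
    frequencies play the role of P(N = n).
    """
    weights = Counter(read_counts)
    total = sum(weights.values())
    dist = {}
    for n, m in weights.items():
        for x in range(n + 1):
            # Exact fractions ensure that 1/2 and 2/4 land in the same set S_q.
            q = Fraction(x, n)
            dist[q] = dist.get(q, 0.0) + betabinom_pmf(x, n, gamma, lam) * m / total
    return dist
```

Multiplying each probability by the number of sites gives expected counts that could be compared with the observed histogram of xij/nij, as in Figure 3.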

Figure 3.

Theoretical and Observed Methylation Rates.

Proportion of positive reads out of total reads (xij/nij) is given on the x-axis. Number of sites matching a given proportion is shown on the y-axis. The grey bars represent the observed data while the bars with black diagonal stripes indicate the theoretical number given the prior distribution for the underlying methylation rate and the empirical distribution for the number of reads.

2.2. Hypothesis Tests

Two different sets of hypotheses are considered. The first is a simple test of equality

$$H_0: \theta_{i1} = \theta_{i2} \quad \text{vs.} \quad H_A: \theta_{i1} \neq \theta_{i2}.$$

The second is a test of difference of rates given by

$$H_0: |\theta_{i1} - \theta_{i2}| \le c \quad \text{vs.} \quad H_A: |\theta_{i1} - \theta_{i2}| > c$$

for some constant c. The second hypothesis is particularly interesting because, in practice, differential methylation is often called only when the difference is substantial, e.g., c = 0.2 [11].

Given γ^j and λ^j, the posterior distribution of the methylation rate is

$$\theta_{ij} \mid (x_{ij}, n_{ij}) \sim \mathrm{Beta}(\hat{\gamma}_j + x_{ij},\ \hat{\lambda}_j + n_{ij} - x_{ij}).$$

For convenience, we denote the posterior distribution by πθ|X,N(θij) in what follows. For testing H0: θi1 = θi2 versus HA: θi1 ≠ θi2, we define the posterior log odds as follows:

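$$\Delta_i = \log\left[\frac{\pi_{\theta \mid X,N}(\theta_{i1} > \theta_{i2})}{1 - \pi_{\theta \mid X,N}(\theta_{i1} > \theta_{i2})}\right].$$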

We reject H0 if |Δi| > δα, where δα is the cutoff value corresponding to level α.
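
The text does not state how the posterior probability inside Δi is evaluated. One simple possibility, sketched below as an assumption, is to draw from the two Beta posteriors and count how often θi1 exceeds θi2. The function name `posterior_log_odds`, the number of draws, and the add-half smoothing are all hypothetical choices.

```python
import math
import random


def posterior_log_odds(x1, n1, x2, n2, prior1, prior2, draws=20000, rng=random):
    """Monte Carlo estimate of Delta_i = logit P(theta_i1 > theta_i2 | data).

    prior1 and prior2 are (gamma, lambda) pairs for the two samples; pass the
    same pair twice to use a single combined prior.
    """
    a1, b1 = prior1[0] + x1, prior1[1] + n1 - x1
    a2, b2 = prior2[0] + x2, prior2[1] + n2 - x2
    wins = sum(rng.betavariate(a1, b1) > rng.betavariate(a2, b2)
               for _ in range(draws))
    # Assumption: add-half smoothing keeps the estimate strictly inside (0, 1),
    # so the log odds stay finite even when every draw agrees.
    p = (wins + 0.5) / (draws + 1.0)
    return math.log(p / (1.0 - p))
```

With a finite number of draws the magnitude of the estimate is capped at roughly log(2 × draws), so critical values calibrated with a different evaluation method would not carry over directly to this sketch.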

Given the prior distribution and the number of reads at a site for each sample, it is possible to calculate the level (α) and power of the test for a given critical value (δα) analytically. However, doing so for every combination of number of reads appearing in the sample would be extremely computationally intensive. Here we estimate it using Monte Carlo simulations. We first generate Monte-Carlo samples as follows:

  1. Sample (ni1, ni2) pairs with replacement from the observed data. We sample pairs rather than individual nij values to account for possible dependence of read counts between samples due to various factors, including DNA sequence features.

  2. Sample θij values for each site in each sample from the fitted prior distribution, i.e., Beta(γ^,λ^) from the combined data or Beta(γ^j,λ^j) from separate samples.

  3. Generate xij from Binomial(nij, θij).

Two simulated data sets of size equal to the original data are generated. In one set, we use Beta(γ^,λ^) to generate the θij values for both samples. For the second set, Beta(γ^,λ^) is used only for the sites with equal θij values while the remaining sites are simulated using the separate estimates from the normal and tumor tissues (i.e., Beta(γ^j,λ^j)). In both cases, the first 100,000 sites are set so that θi1 = θi2 while the remaining ones are allowed to vary.

After the simulated data set is complete, the posterior log odds can be calculated for each site. A suitable critical value can then be selected for a level α test by setting δα equal to the 100(1 − α)th percentile of the absolute values of the log odds for the subset of sites with θi1 = θi2. Similarly, power can be estimated as the proportion of sites with θi1 ≠ θi2 whose posterior log odds have absolute values greater than δα. In implementing this test for the simulated data sets, rather than refitting the values of γ^ and λ^, the values used in the simulation were reused in calculating the posteriors. This is justified by the large size of the data sets and the resulting accuracy and precision of the MLE.
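
The simulation and calibration steps above can be sketched as follows for the single-prior case (Simulation 1). This is a small-scale illustration under stated assumptions: the log odds are evaluated by Monte Carlo with add-half smoothing, binomial draws are generated as sums of Bernoulli trials, and the names `abs_log_odds`, `binomial`, and `calibrate` are hypothetical.

```python
import math
import random


def abs_log_odds(x1, n1, x2, n2, g, l, rng, draws=200):
    """|Delta_i| estimated from Monte Carlo draws of the two Beta posteriors."""
    wins = sum(rng.betavariate(g + x1, l + n1 - x1) >
               rng.betavariate(g + x2, l + n2 - x2) for _ in range(draws))
    p = (wins + 0.5) / (draws + 1.0)  # assumption: add-half smoothing
    return abs(math.log(p / (1.0 - p)))


def binomial(n, theta, rng):
    """Draw x ~ Binomial(n, theta) as a sum of n Bernoulli trials."""
    return sum(rng.random() < theta for _ in range(n))


def calibrate(n_pairs, g, l, n_sites, n_null, alpha, rng):
    """Return (critical value, realised level, estimated power).

    The first n_null simulated sites have theta_i1 = theta_i2; the rest
    draw the two rates independently from Beta(g, l).
    """
    null_stats, alt_stats = [], []
    for i in range(n_sites):
        n1, n2 = rng.choice(n_pairs)                 # step 1: resample a read-count pair
        t1 = rng.betavariate(g, l)                   # step 2: draw rates from the prior
        t2 = t1 if i < n_null else rng.betavariate(g, l)
        x1 = binomial(n1, t1, rng)                   # step 3: generate methylated counts
        x2 = binomial(n2, t2, rng)
        stat = abs_log_odds(x1, n1, x2, n2, g, l, rng)
        (null_stats if i < n_null else alt_stats).append(stat)
    null_stats.sort()
    # 100(1 - alpha)th percentile of |Delta_i| among the true nulls.
    k = max(0, math.ceil((1.0 - alpha) * len(null_stats)) - 1)
    delta = null_stats[k]
    level = sum(s > delta for s in null_stats) / len(null_stats)
    power = sum(s > delta for s in alt_stats) / len(alt_stats)
    return delta, level, power
```

Because the statistic is discrete, the realised level can fall below the nominal α; the sketch reports it so that it can be checked alongside the power estimate.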

Two versions of the test are conducted on each of the simulated data sets. The first uses the combined estimates of the hyper-parameters, while the second uses the separate estimates of the hyper-parameters for each of the two data sets. The results in this case indicate that it makes little or no difference which way the priors are specified. This is not unexpected since the two prior distributions were so similar. However, this might not generalize to all cases if two specimens are markedly different in their methylation patterns. Table 3 shows the approximate critical values for level 0.1 and 0.05 tests, and the estimated power. Results for Fisher’s exact test are also given for comparison. It should be cautioned that the critical values given here depend on both the hyper-parameters and the distribution of read counts, and hence are specific to this data set and should not be taken to be generally applicable.

Table 3.

Test of Equality

                  Simulation 1                 Simulation 2
Test              Critical   Level   Power     Critical   Level   Power
1 prior           2.5        0.1     0.518     2.5        0.1     0.522
1 prior           3.15       0.05    0.438     3.15       0.05    0.442
2 priors          2.5        0.1     0.524     2.5        0.1     0.522
2 priors          3.15       0.05    0.452     3.21       0.05    0.441
Fisher’s Exact    n/a        0.1     0.356     n/a        0.1     0.355
Fisher’s Exact    n/a        0.05    0.317     n/a        0.05    0.316

Simulation 1 used a single prior from the combined data set to generate the methylation rates. Simulation 2 used the prior from the combined data set to generate methylation rates for the subset of the simulated data where the rates were set equal across tissue samples, and used each sample’s individually calculated prior for the remaining data points. The designations "1 prior" and "2 priors" refer to whether the combined data estimates of the parameters or the individual tissue sample estimates were used in calculating the log odds.

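For the test of difference, we define the posterior log odds that |θi1 − θi2| > c as follows,

$$\Delta_i^{c} = \log\left[\frac{\pi_{\theta \mid X,N}(|\theta_{i1} - \theta_{i2}| > c)}{1 - \pi_{\theta \mid X,N}(|\theta_{i1} - \theta_{i2}| > c)}\right].$$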
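
The log odds for the test of difference can be approximated in the same spirit. The sketch below is an assumption about one workable evaluation method (Monte Carlo draws from the two Beta posteriors under a single prior), not a description of the original implementation; `difference_log_odds` and its defaults are hypothetical.

```python
import math
import random


def difference_log_odds(x1, n1, x2, n2, g, l, c=0.2, draws=20000, rng=random):
    """Monte Carlo estimate of the posterior log odds that |theta_i1 - theta_i2| > c.

    g and l are the Beta prior parameters (gamma, lambda) shared by both samples.
    """
    a1, b1 = g + x1, l + n1 - x1
    a2, b2 = g + x2, l + n2 - x2
    hits = sum(abs(rng.betavariate(a1, b1) - rng.betavariate(a2, b2)) > c
               for _ in range(draws))
    # Assumption: add-half smoothing keeps the probability inside (0, 1).
    p = (hits + 0.5) / (draws + 1.0)
    return math.log(p / (1.0 - p))
```

Unlike the equality statistic, this quantity is not symmetric about zero: a large positive value supports a difference greater than c, and the test rejects when it exceeds the calibrated critical value.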

As with the test of equality, a critical value for a given level α, and the corresponding power, can be determined by simulations. A difference threshold of c = 0.2 was chosen for the test, which corresponds to the bin width for categorizing methylation rates used in other studies (e.g., Laurent et al. [11]). Simulations indicate greater power for the test of differences than for the test of equality at a given level. Results are summarized in Table 4. The power of all three tests is plotted against the level in Figure 4.

Table 4.

Test of Differences

Critical   Level   Power
0.857      0.1     0.741
1.658      0.05    0.620

Critical values, level, and power for the test of differences of methylation rates are reported with c = 0.2 as the largest absolute difference under the null hypothesis. Tests were done with a single prior on simulated data using the prior fitted to the combined data.

Figure 4.

Power versus Level.

Power is shown on the y-axis and level on the x-axis. Values are estimates based on simulated data.

2.3. False Discovery Rate

Using the estimates of level, α, and power, β, from the simulations, the false discovery rate (FDR) can be estimated for the original data set. Let M▪▪ be the total number of sites, and let M0▪ and M1▪ be the total numbers of true null and true alternative hypotheses, respectively. Let M▪0 and M▪1 be the numbers of claimed negatives and positives. Table 5 tabulates the four possible outcomes of hypothesis testing: true negatives (M00), false negatives (M10), true positives (M11), and false positives (M01). Then

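$$E[M_{\bullet 1}] = \alpha M_{0\bullet} + \beta M_{1\bullet} = \alpha M_{0\bullet} + \beta\,(M_{\bullet\bullet} - M_{0\bullet}).$$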

This implies that M0▪ can be estimated by

$$\hat{M}_{0\bullet} = \frac{M_{\bullet 1} - \beta M_{\bullet\bullet}}{\alpha - \beta}.$$

The FDR is then estimated by

$$\mathrm{FDR} = \frac{\alpha\, \hat{M}_{0\bullet}}{M_{\bullet 1}}.$$
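
The two estimators translate directly into code. The sketch below is illustrative only; the function names are hypothetical, and the argument `power` corresponds to β in the text (which here denotes power, not the type II error rate). Passing `power=None` substitutes the total number of sites for the estimated number of true nulls, which gives the most conservative estimate.

```python
def estimate_true_nulls(m_positive, m_total, alpha, power):
    """Estimate M0 from the number of rejections, the level, and the power."""
    if alpha == power:
        raise ValueError("alpha and power must differ")
    return (m_positive - power * m_total) / (alpha - power)


def estimate_fdr(m_positive, m_total, alpha, power=None):
    """Estimate FDR = alpha * M0_hat / M_positive.

    With power=None, M0_hat is replaced by m_total (conservative bound).
    """
    if power is None:
        m0 = m_total
    else:
        m0 = estimate_true_nulls(m_positive, m_total, alpha, power)
    return alpha * m0 / m_positive
```

As a synthetic check, with 1,000 sites of which 800 are true nulls, α = 0.05 and power 0.5, the expected number of rejections is 0.05 × 800 + 0.5 × 200 = 140, and `estimate_true_nulls(140, 1000, 0.05, 0.5)` recovers 800.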

Table 5.

Hypotheses and True Values

         Test Negative   Test Positive   Total
H0       M00             M01             M0▪
H1       M10             M11             M1▪
Total    M▪0             M▪1             M▪▪

The numbers of true null hypotheses, M0▪, and true alternatives, M1▪, are compared with the numbers testing negative, M▪0, and testing positive, M▪1. Only M▪0, M▪1, and M▪▪ are directly observed.

Since M00, M01, M10, and M11 are all functions of the specified level α, estimation of M0▪ and FDR requires an appropriate choice of α. For the real data, using the estimated α and β from the simulation studies presented in Figure 4, we calculated M^0▪ for α ranging from 0.1 to 0.0001. Interestingly, M^0▪ increased monotonically from around 740,000 to over 870,000 as the type I error level decreased from 0.1 to less than 0.0001. To determine which value of α leads to the most accurate estimate of M0▪, we simulated data sets containing 800,000 true nulls and 192,000 true alternatives in which the methylation rate θij followed the prior distribution fitted by the eB approach. In these simulations the monotonic trend was not observed, and M0▪ was estimated very accurately for every α value in the same range. This likely indicates some violation of the model assumptions in the real data. We leave this as an open question for future investigation.

In the absence of a reliable estimate of M0▪, a precise estimate of FDR cannot be calculated. The most conservative estimate of FDR can be obtained by substituting M▪▪ for M^0▪ in the FDR formula. An FDR of 0.05 is achieved by setting the level to α = 0.00092, at which M▪1 = 16,976 sites were identified as differentially methylated. In contrast, only 5,003 sites were identified as differentially methylated at the same FDR using Fisher’s exact test (the FDR was controlled by requiring the q-value of each individual hypothesis to be ≤ 0.05 using the QVALUE R package downloaded from http://www.bioconductor.org). The eB test clearly shows improved power over Fisher’s test; however, the majority of the true positive sites remain unidentified because of the small sample sizes (nij).

The same method was applied to the test of difference (H0: |θi1 − θi2| ≤ c vs. HA: |θi1 − θi2| > c at c = 0.20). An FDR ≤ 0.05 was achieved at level 0.000088. A total of 1,630 sites were identified as having a significantly pronounced difference (exceeding 0.2) in methylation rates between the two samples.
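
As a quick arithmetic check of the two reported levels, substituting the total number of sites, 920,441, for the estimated number of true nulls in the FDR formula gives the following conservative values (rounded; the symbols with bullet subscripts mirror M▪▪ and M▪1 in the text):

```latex
\begin{aligned}
\text{test of equality:}\quad
  \frac{\alpha\, M_{\bullet\bullet}}{M_{\bullet 1}}
  &= \frac{0.00092 \times 920{,}441}{16{,}976}
  \approx \frac{846.8}{16{,}976} \approx 0.0499, \\
\text{test of difference:}\quad
  \frac{\alpha\, M_{\bullet\bullet}}{M_{\bullet 1}}
  &= \frac{0.000088 \times 920{,}441}{1{,}630}
  \approx \frac{81.0}{1{,}630} \approx 0.0497.
\end{aligned}
```

Both values fall just below the 0.05 target, consistent with the levels having been chosen as the largest ones that keep the conservative FDR estimate at or under 0.05.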

To gain further insights into the FDR behavior, in Figure 5, we plotted the FDR as a function of the level summarized from the simulation studies described above. The proposed eB method has a lower FDR than Fisher’s exact test at all levels less than 0.1. Since the actual level of Fisher’s exact test is typically lower than the nominal level, we also plotted the FDR vs actual level of Fisher’s exact test. The actual level was assessed in the same way as for the eB approach, by finding the actual type I error rate in the simulated data at each given nominal level. The posterior log odds test has a uniformly lower FDR than the Fisher’s exact test even after the adjustment for the difference in nominal and actual levels. In practice, as Fisher’s test is always performed under the nominal level while the true level is never known, a comparison of the power or FDR under the nominal level is more meaningful. A spreadsheet with the locations on the genome that tested as positive is available at http://bioinfo.stats.northwestern.edu/~jzwang/.

Figure 5.

FDR versus Level.

FDR is shown on the y-axis and level on the x-axis. The two curves for Fisher’s exact test differ due to the overly conservative nature of the test. Values are estimates based on simulated data. The abbreviation “eB” stands for empirical Bayes.

3. Discussion

In this paper, we showed two advantages of the proposed empirical Bayes approach over Fisher’s exact test in methylation differentiation studies. This method is particularly useful when the number of reads at each site (or sample size) is small, while genome-wide data can provide rich information regarding the methylation rates across sites. Indeed, as shown in Figure 2, the fitted beta distribution has the majority of its probability mass concentrated near 0 and 1. This suggests that most sites have either very high or very low methylation rates. This prior information tends to shrink the posterior distribution of θij towards the two ends. For example, if the observed xij = 0 and nij = 2, then it is highly likely that this site has a low methylation rate despite the small sample size; likewise, a site with xij = nij = 2 is highly likely to have a high rate. This strong prior information forms the basis for the power improvement when using the posterior log odds as the test statistic.
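
The xij = 0, nij = 2 example can be made concrete with a small sketch. It uses the combined-data estimates from Table 2 and, purely as an illustrative comparison that is not part of the original analysis, a flat Beta(1, 1) prior. The function names and the 0.2 cutoff are hypothetical choices.

```python
import random


def posterior_params(x, n, gamma, lam):
    """Parameters of the Beta posterior after x methylated reads out of n."""
    return gamma + x, lam + n - x


def posterior_mean(x, n, gamma, lam):
    """Mean of the Beta posterior."""
    a, b = posterior_params(x, n, gamma, lam)
    return a / (a + b)


def prob_below(x, n, gamma, lam, cutoff, draws=20000, rng=random):
    """Monte Carlo estimate of P(theta < cutoff | x, n)."""
    a, b = posterior_params(x, n, gamma, lam)
    return sum(rng.betavariate(a, b) < cutoff for _ in range(draws)) / draws


# Combined-data prior from Table 2 versus a flat prior (illustrative only).
eb_mean = posterior_mean(0, 2, 0.362389, 0.582426)   # about 0.12
flat_mean = posterior_mean(0, 2, 1.0, 1.0)           # exactly 0.25
```

Under the fitted prior the posterior mean after two unmethylated reads is roughly half of what a flat prior would give, and `prob_below` can be used to compare how much posterior mass each prior places below a chosen cutoff.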

Several possibilities exist for generalizations or refinements of this approach. Firstly, the bias in the estimation of M^0▪ needs further investigation. It is not clear to us whether there is a causal relationship between the type I error level α and the bias. Secondly, all sites are currently treated as independent. In a real genome, it is possible that nearby sites are correlated in methylation rate. Characterizing such dependence may help further improve the power of the eB approach. Finally, only two tissue samples were used to generate the data for this study; however, it will often be desirable to incorporate multiple specimens for each condition. If methylation rates across specimens can be considered to be independent, then the density of the vector of methylation rates will be a product of beta densities. Similarly, the vector of positive reads will have probability mass given by the product of independent binomial pmfs. Because of independence, the posterior density for the vector of methylation rates will then be the product of beta densities, with the beta densities being the same as the posteriors would be if each specimen were treated separately. Once this distribution is known, the distribution of weighted averages of the methylation rates can be easily obtained.

Acknowledgments

This work is supported by NCI grant U54CA143869.

References

  • 1. Tost J. DNA Methylation: An Introduction to the Biology and Disease-Associated Changes of a Promising Biomarker. Molecular Biotechnology. 2010;44:71–81. doi: 10.1007/s12033-009-9216-2.
  • 2. Bird A. DNA methylation patterns in epigenetic memory. Genes and Development. 2002;16:6–21. doi: 10.1101/gad.947102.
  • 3. Guibert S, Forne T, Weber M. Dynamic regulation of DNA methylation during mammalian development. Epigenetics. 2009;1:81–98. doi: 10.2217/epi.09.5.
  • 4. Reik W, Dean W, Walter J. Epigenetic reprogramming in mammalian development. Science. 2001;293:1089–1093. doi: 10.1126/science.1063443.
  • 5. Robertson KD. DNA methylation and human disease. Nature Reviews Genetics. 2005;6:597–610. doi: 10.1038/nrg1655.
  • 6. Heller G, Zielinski CC, Zöchbauer-Müller S. Lung Cancer: From single-gene methylation to methylome profiling. Cancer Metastasis Review. 2010;29:95–107. doi: 10.1007/s10555-010-9203-x.
  • 7. Christensen BC, Marsit CJ, Houseman AE, Godleski JJ, Longacker JL, Zeng S, Yeh RF, Wrensch MR, Wiemels JL, Karagas MR, Bueno R, Sugarbaker DJ, Nelson HH, Wiencke JK, Kelsey KT. Differentiation of Lung Adenocarcinoma, Pleural Mesothelioma, and Nonmalignant Pulmonary Tissues Using DNA Methylation Profiles. Cancer Research. 2009;69:6315–6321. doi: 10.1158/0008-5472.CAN-09-1073.
  • 8. Hsiung DT, Marsit CJ, Houseman EA, Eddy K, Furniss CS, McClean MD, Kelsey KT. Global DNA Methylation Level in Whole Blood as a Biomarker in Head and Neck Squamous Cell Carcinoma. Cancer Epidemiology, Biomarkers and Prevention. 2007;16:108–114. doi: 10.1158/1055-9965.EPI-06-0636.
  • 9. Han W, Wang T, Reilly AA, Keller SM, Spivack SD. Gene promoter methylation assayed in exhaled breath, with differences in smokers and lung cancer patients. Respiratory Research. 2009;10:86. doi: 10.1186/1465-9921-10-86.
  • 10. Bibikova M, Lin Z, Zhou L, Chudin E, Garcia EW, Wu B, Doucet D, Thomas NJ, Wang Y, Vollmer E, Goldmann T, Seifart C, Jiang W, Barker DL, Chee MS, Floros J, Fan Jian-Bing. High-throughput DNA methylation profiling using universal bead arrays. Genome Research. 2006;16:383–393. doi: 10.1101/gr.4410706.
  • 11. Laurent L, Wong E, Li G, Huynh T, Tsirigos A, Ong CT, Low HM, Sung KWK, Rigoutsos I, Loring J, Wei CL. Dynamic changes in the human methylome during differentiation. Genome Research. 2010;20:320–331. doi: 10.1101/gr.101907.109.
  • 12. Gu H, Bock C, Mikkelsen TS, Jäger N, Smith ZD, Tomazou E, Gnirke A, Lander ES, Meissner A. Genome-scale DNA methylation mapping of clinical samples at single-nucleotide resolution. Nature Methods. 2010;7:133–136. doi: 10.1038/nmeth.1414.
  • 13. Houseman EA, Christensen BC, Yeh RF, Marsit CJ, Karagas MR, Wrensch M, Nelson HH, Wiemels J, Zeng S, Wiencke JK, Kelsey KT. Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics. 2008;9:365. doi: 10.1186/1471-2105-9-365.
