PLOS ONE. 2014 May 13;9(5):e97349. doi: 10.1371/journal.pone.0097349

A Note on Exact Differences between Beta Distributions in Genomic (Methylation) Studies

Emanuele Raineri 1,*, Marc Dabad 1, Simon Heath 1
Editor: Dajun Deng 2
PMCID: PMC4019606  PMID: 24824426

Abstract

We apply a known algorithm for computing exactly inequalities between Beta distributions to assess whether a given position in a genome is differentially methylated across samples. We discuss the advantages brought by the adoption of this solution with respect to two approximations (Fisher's test and Z score). The same formalism presented here can be applied in a similar way to variant calling.

Introduction

Average DNA methylation at a locus can be measured by Whole Genome Bisulfite Sequencing (WGBS), which determines the fraction of DNA strands methylated at any given genomic position in a population of cells (this definition is likely to sound obvious to those who already know about WGBS and too terse to those who do not: a good introduction to this kind of measurement is contained in chapter 11 of [1]). In what follows we will call this fraction $\theta$; when we distinguish between different samples we will write $\theta_1$ and $\theta_2$. WGBS experiments estimate these numbers by measuring the methylation state of a random (i.e. selected in some unpredictable way) set of reads sequenced from the sample. Since one can only analyze a finite number of reads per sample, the value of $\theta$ will be known only up to some variability.

In this paper we propose an answer to the basic question: how does one assess whether two cell populations have different methylation levels at a genomic position? Researchers in the field have already dealt with this issue in a variety of ways: for example, [2] uses Fisher's test. In [3] Sun et al. compute a confidence interval for Inline graphic starting from some reasonable choice of a probabilistic model. Bsmooth [4] (which tackles the slightly different problem of defining differentially methylated regions as opposed to positions) ultimately relies on a t-test. The authors of [5] use a hierarchical model to estimate the parameters needed for a Gaussian hypothesis test. Here we would like to bring attention to another possible approach, based on properties of the Beta distribution which are explained in [6], [7]. Similarly to e.g. [3], we do not test a hypothesis and output a p-value; rather, we compute the probability distribution of the parameter of a Bayesian model.

Beta Distribution to Model Methylation Probabilities

The Beta probability distribution (over $0 \le \theta \le 1$) with parameters $a, b > 0$ is defined by

$$\mathrm{Beta}(\theta; a, b) = \frac{\theta^{a-1} (1-\theta)^{b-1}}{B(a,b)}$$

where $B(a,b)$ is the Beta function

$$B(a,b) = \int_0^1 t^{a-1} (1-t)^{b-1} \, dt$$

The Beta distribution appears very naturally in many studies of genomic data: typically such analyses also entail the comparison between different samples, which in turn means that different Betas have to be combined. Here, for concreteness, we describe the case of measuring DNA methylation differences across samples via whole genome bisulfite sequencing, but the same concepts apply with almost no change to genotyping.

To appreciate how this variability can be quantified, consider a set of reads out of a WGBS experiment covering a certain genomic coordinate $x$ with read depth $n$. Since not all the strands in the sample being sequenced will, in general, have the same bases methylated at the same time, this will be a collection of heterogeneous reads: some will indicate methylation at position $x$ (these are the so-called non converted reads), others (the converted reads) will correspond to molecules that are not methylated. Now, if $\theta$ were known a priori, the probability of obtaining $k$ non converted reads would be given by a binomial distribution (which is closely related to the Beta distribution):

$$P(k \mid \theta, n) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k}$$

If one assumes a uniform prior on $\theta$ ($P(\theta) = 1$), the expression for $P(\theta \mid k, n)$ is very similar (the factor $\binom{n}{k}$ cancels out when applying Bayes' theorem):

$$P(\theta \mid k, n) = \mathrm{Beta}(\theta; k+1, n-k+1) = \frac{\theta^{k} (1-\theta)^{n-k}}{B(k+1, n-k+1)}$$

Therefore, to assess whether a position is differentially methylated across two samples with non converted reads respectively $k_1, k_2$ and read depths $n_1, n_2$, one has to compute

$$P(\theta_1 > \theta_2)$$

where

$$\theta_i \sim \mathrm{Beta}(k_i + 1, \; n_i - k_i + 1), \quad i = 1, 2 \qquad (1)$$

The purpose of the software we will discuss in this note is to estimate $P(\theta_1 > \theta_2)$ given the result of a WGBS experiment.
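As a rough illustration of the quantity being estimated, the sketch below draws samples from the two Beta posteriors obtained under a uniform prior and counts how often the first draw exceeds the second. This is a Monte Carlo approximation, not the exact method discussed in this note; the function names, the argument order (non converted reads `k` and read depth `n` for each sample) and the number of draws are our own assumptions.

```python
import random


def posterior_params(k, n):
    """Beta posterior parameters for k non converted reads out of n,
    assuming a uniform prior on the methylation fraction."""
    return k + 1, n - k + 1


def prob_greater_mc(k1, n1, k2, n2, draws=100_000, seed=0):
    """Monte Carlo estimate of the probability that sample 1 is more
    methylated than sample 2 (hypothetical helper, for illustration)."""
    rng = random.Random(seed)
    a1, b1 = posterior_params(k1, n1)
    a2, b2 = posterior_params(k2, n2)
    hits = sum(
        rng.betavariate(a1, b1) > rng.betavariate(a2, b2)
        for _ in range(draws)
    )
    return hits / draws


if __name__ == "__main__":
    # Same counts in both samples: the estimate should be close to 0.5.
    print(prob_greater_mc(5, 10, 5, 10))
    # Sample 1 almost fully methylated, sample 2 almost unmethylated.
    print(prob_greater_mc(18, 20, 2, 20))
```

Sampling is simple but slow and noisy for probabilities very close to 0 or 1, which is one reason an exact computation is attractive.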

Exact Computation of Beta Differences

A method for computing

$$P(\theta_1 > \theta_2)$$

which turns out to be efficient enough for our purposes is presented in full detail in [6], [7]. We will summarize its derivation here for the sake of completeness, and advise interested readers to study those papers for a more detailed discussion. We start with some preliminary definitions: let $g(a,b,c,d) = P(X > Y)$ where $X$ and $Y$ are distributed respectively as $\mathrm{Beta}(a,b)$ and $\mathrm{Beta}(c,d)$. Besides, we will use the notation $I_x(a,b)$ for the cumulative distribution function of the Beta distribution (also known as the regularized incomplete Beta function).

Now, by definition one has

$$g(a,b,c,d) = \int_0^1 \frac{x^{a-1} (1-x)^{b-1}}{B(a,b)} \, I_x(c,d) \, dx$$

But then, using the identity ([8])

$$I_x(c, d+1) = I_x(c,d) + \frac{x^{c} (1-x)^{d}}{d \, B(c,d)}$$

one finds that

$$g(a,b,c,d+1) = g(a,b,c,d) + \frac{h(a,b,c,d)}{d} \qquad (2)$$

where

$$h(a,b,c,d) = \frac{B(a+c, \, b+d)}{B(a,b) \, B(c,d)}$$

Furthermore, one can prove that $g$ possesses a number of symmetries. An obvious one is $g(a,b,c,d) = 1 - g(c,d,a,b)$. Also true are

$$g(a,b,c,d) = g(d,c,b,a), \qquad g(a,b,c,d) = 1 - g(b,a,d,c) \qquad (3)$$

Using (2) and (3) one can design a nice recursive scheme

$$\begin{aligned}
g(a+1,b,c,d) &= g(a,b,c,d) + h(a,b,c,d)/a \\
g(a,b+1,c,d) &= g(a,b,c,d) - h(a,b,c,d)/b \\
g(a,b,c+1,d) &= g(a,b,c,d) - h(a,b,c,d)/c \\
g(a,b,c,d+1) &= g(a,b,c,d) + h(a,b,c,d)/d
\end{aligned}$$

where the base case is provided by $g(1,1,1,1) = 1/2$ (this is because if $X$ and $Y$ have exactly the same distribution, $P(X > Y) = 1/2$).
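The recursion can be sketched in a few lines of Python. The version below is our reading of the scheme in [6], [7] for positive integer parameters, not the authors' C implementation: it starts from two uniform distributions, for which the probability is 1/2, and raises one parameter at a time, adding or subtracting a ratio of Beta functions divided by the parameter being raised. The names `g`, `h`, `log_beta` and `prob_more_methylated` are assumptions made for this example.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def h(a, b, c, d):
    """B(a + c, b + d) / (B(a, b) * B(c, d)), computed in log space."""
    return exp(log_beta(a + c, b + d) - log_beta(a, b) - log_beta(c, d))


def g(a, b, c, d):
    """P(X > Y) for X ~ Beta(a, b) and Y ~ Beta(c, d),
    with positive integer a, b, c, d."""
    ca, cb, cc, cd = 1, 1, 1, 1
    result = 0.5  # two uniform distributions: P(X > Y) = 1/2
    while ca < a:  # raising a shifts X upwards
        result += h(ca, cb, cc, cd) / ca
        ca += 1
    while cb < b:  # raising b shifts X downwards
        result -= h(ca, cb, cc, cd) / cb
        cb += 1
    while cc < c:  # raising c shifts Y upwards
        result -= h(ca, cb, cc, cd) / cc
        cc += 1
    while cd < d:  # raising d shifts Y downwards
        result += h(ca, cb, cc, cd) / cd
        cd += 1
    return result


def prob_more_methylated(k1, n1, k2, n2):
    """Probability that sample 1 is more methylated than sample 2,
    given non converted reads k and read depth n (uniform priors)."""
    return g(k1 + 1, n1 - k1 + 1, k2 + 1, n2 - k2 + 1)


if __name__ == "__main__":
    print(prob_more_methylated(18, 20, 2, 20))
```

The number of steps grows linearly with the total read count. The order in which the parameters are raised is arbitrary here; a production implementation might choose it to limit cancellation at very high coverage.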

Approximate Computation

Even if methylation data are well modelled by a Beta distribution, the comparison presented above is never (to our knowledge) used in the literature. As (hopefully fair) representatives of the methods which we have found are used instead, we will analyze the performance of Fisher's test and that of a test based on a Gaussian approximation.

To do a Fisher's test, one builds a contingency table with the number of non converted and converted reads in the two samples (note that this kind of test breaks down when one of the rows (or columns) of the contingency table is zero). In the Gaussian approximation, one models $\theta$ for each sample with a Gaussian with the same mean and variance as the corresponding Beta distribution, and then uses the two Gaussians to test for differences between $\theta_1$ and $\theta_2$. In both cases we will consider one-tailed tests.
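The two approximations can be sketched as follows, using only the Python standard library. The one-tailed Fisher p-value is computed directly from the hypergeometric distribution, and the Gaussian test matches each Beta posterior (under a uniform prior) by its mean and variance. Function names and the direction of the alternative (sample 1 more methylated than sample 2) are our assumptions; the original analysis may differ in detail.

```python
from math import comb, erfc, sqrt


def fisher_one_tailed(k1, n1, k2, n2):
    """One-tailed Fisher p-value for the 2x2 table of non converted and
    converted reads; alternative: sample 1 is more methylated."""
    total = n1 + n2
    non_converted = k1 + k2
    upper = min(n1, non_converted)
    tail = sum(
        comb(non_converted, i) * comb(total - non_converted, n1 - i)
        for i in range(k1, upper + 1)
    )
    return tail / comb(total, n1)


def beta_mean_var(k, n):
    """Mean and variance of the Beta(k + 1, n - k + 1) posterior."""
    a, b = k + 1, n - k + 1
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var


def gaussian_one_tailed(k1, n1, k2, n2):
    """One-tailed p-value from a Z score built on moment-matched Gaussians."""
    m1, v1 = beta_mean_var(k1, n1)
    m2, v2 = beta_mean_var(k2, n2)
    z = (m1 - m2) / sqrt(v1 + v2)
    return 0.5 * erfc(z / sqrt(2))  # upper tail of the standard normal


if __name__ == "__main__":
    print(fisher_one_tailed(8, 10, 2, 10))
    print(gaussian_one_tailed(8, 10, 2, 10))
```

Small p-values here play the role that probabilities close to 1 play in the Beta comparison, so the two kinds of output are not directly interchangeable.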

Results and Discussion

Comparison with Approximate Results

We organized the comparison between the exact and approximate solution in two steps. First, we looked at the behaviour of the two tests on a pair of real samples (see below for instructions on how to access the data we used).

The results are shown in figure 1. On the $x$ axis we plotted $P(\theta_1 > \theta_2)$; on the $y$ axis we plotted the corresponding p-value obtained by approximating the Beta comparison respectively with a Fisher's test (on the left) and with a Gaussian (on the right). We did the comparison over Inline graphic positions: the plot is in fact a two dimensional histogram, in which different shades of blue indicate how many times the two values fall into a certain region of the plane. There is not much to comment on there, except to note that, as expected, there is a broad correspondence between the different methods. Also, at such a scale the Beta probabilities seem more similar to the Z score test than to the Fisher's p-values (the right hand side plot looks more like a diagonal).

Figure 1. Comparing beta distribution with Fisher's test and Z score test.


Each plot contains an enlarged version around p-value Inline graphic. Notice that in these magnified plots the Inline graphic axis is Inline graphic, because exact powers of Inline graphic take less space in the labels than strings of 9s.

Next, we simulated a pair of samples whose counts are generated by the same underlying binomial process (i.e. $\theta_1 = \theta_2$) at different coverages. These constitute a negative control, in the sense that none of the methods should report a significant difference between the samples. Furthermore, we generated a pair of samples such that their underlying binomial probabilities are markedly different Inline graphic; those are the true positives, i.e. cases for which the tests should detect that Inline graphic. We then compare the receiver operating characteristic (ROC) curves of the three methods for different values of the samples' coverages, Inline graphic. The results are depicted in figure 2. That plot justifies the usage of the Beta distribution: the number of false negatives accumulated by the other two methods considered stops them from reaching a high enough true positive rate (even when the threshold for computing it is very permissive). Note, for example, that the blue line is not even visible in the leftmost panels of figure 2. This effect is also shown in figure 3, where we depict the distribution of the outputs for the three methods at read depth Inline graphic.
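A simulation of this kind can be sketched as below. The methylation probabilities, read depth, number of sites and threshold are illustrative values chosen for the example, not those used for figure 2, and `fraction_difference` is only a stand-in score: in practice one would plug in the Beta probability or one of the two p-values (with the comparison direction adapted accordingly).

```python
import random


def simulate_counts(theta, depth, rng):
    """Non converted reads out of `depth`, each read being non converted
    with probability theta (a binomial draw)."""
    return sum(rng.random() < theta for _ in range(depth))


def simulate_pairs(theta1, theta2, depth, n_sites, seed=0):
    """Pairs of non converted read counts for two samples at n_sites positions."""
    rng = random.Random(seed)
    return [
        (simulate_counts(theta1, depth, rng), simulate_counts(theta2, depth, rng))
        for _ in range(n_sites)
    ]


def roc_point(score, positives, negatives, depth, threshold):
    """(FPR, TPR) when a site is called differential if score >= threshold."""
    def rate(pairs):
        calls = sum(score(k1, depth, k2, depth) >= threshold for k1, k2 in pairs)
        return calls / len(pairs)
    return rate(negatives), rate(positives)


def fraction_difference(k1, n1, k2, n2):
    """Hypothetical placeholder score: difference of observed fractions."""
    return k1 / n1 - k2 / n2


if __name__ == "__main__":
    depth = 20
    positives = simulate_pairs(0.8, 0.2, depth, 500, seed=1)  # truly different
    negatives = simulate_pairs(0.5, 0.5, depth, 500, seed=2)  # negative control
    for threshold in (0.1, 0.3, 0.5):
        print(threshold, roc_point(fraction_difference, positives, negatives, depth, threshold))
```

Sweeping the threshold traces one ROC curve; repeating the exercise at several read depths shows how coverage affects each method.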

Figure 2. ROC curves for the three methods under comparison.


Each point in the ROC curve is obtained by choosing a different threshold for calling differential methylation. For the Z score test and the Fisher's test the p-values are: Inline graphic. For the Beta distributions the threshold probabilities are: Inline graphic. TPR means true positive rate; FPR means false positive rate.

Figure 3. Distribution of p-values (for the hypothesis tests discussed) and of $P(\theta_1 > \theta_2)$ computed with the Beta model.


The first row depicts the truly different samples (Inline graphic). The bottom row refers to the control samples. For all the plots Inline graphic.

Differentially Methylated Regions and Effects of Coverage

Using the above concepts, we can compute differentially methylated regions (DMRs) along the genome: these are uninterrupted blocks of nucleotides where the two samples have different methylation. One possible technique to find such blocks is to conjoin a number of adjacent nucleotides into a DMR, disregarding their exact methylation probabilities, and to assign hard boundaries. This usually implies that a number of ad hoc rules must be established to control the minimum distance between two neighbouring DMRs, the minimum length of a DMR, how exactly to count the intersection of DMRs with annotated regions, and so on. Using our method, though, one can simply assign to each nucleotide the probability computed by the algorithm presented here; any further analysis can be conducted without imposing arbitrary thresholds or boundaries. For instance, one can ask what the average value of this probability is over some specific regions (introns, enhancers) with respect to randomly chosen regions of the genome. Often it is not clear a priori what the correct scale is when looking at methylation: if this is the case, one can smooth the probability per nucleotide by computing a kernel density estimation at various bandwidths, or simply clump together the values of a number of nearby bases in a single (average) value. Note that smoothing is justified by the fact that methylation levels are correlated in space (the strength and persistence of the correlation differ from sample to sample, reflecting technical and biological variability); in fact, as hinted at in [4], analyzing nearby positions together could provide a way of correcting measurement errors.
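Both smoothing options can be sketched in a few lines. The first function below is a Gaussian kernel-weighted average of the per-nucleotide probabilities (a Nadaraya-Watson style smoother, used here as a simple stand-in for the kernel approach mentioned above); the second clumps consecutive values into fixed-size blocks. Function names, the choice of a Gaussian kernel and the example coordinates are our assumptions.

```python
from math import exp


def kernel_smooth(positions, probs, bandwidth):
    """Gaussian kernel-weighted average of probs, evaluated at each position.
    positions are genomic coordinates; bandwidth is in the same units."""
    smoothed = []
    for x in positions:
        weights = [exp(-0.5 * ((x - y) / bandwidth) ** 2) for y in positions]
        total = sum(weights)
        smoothed.append(sum(w * p for w, p in zip(weights, probs)) / total)
    return smoothed


def block_average(probs, size):
    """Average of consecutive, non-overlapping blocks of `size` values
    (the last block may be shorter)."""
    blocks = [probs[i:i + size] for i in range(0, len(probs), size)]
    return [sum(block) / len(block) for block in blocks]


if __name__ == "__main__":
    positions = [100, 105, 130, 131, 400, 410]
    probs = [0.95, 0.90, 0.99, 0.10, 0.50, 0.55]
    for bandwidth in (10, 50, 200):
        print(bandwidth, [round(v, 3) for v in kernel_smooth(positions, probs, bandwidth)])
    print(block_average(probs, 2))
```

The kernel version costs quadratic time in the number of positions as written, so for whole chromosomes one would restrict each sum to a window of a few bandwidths.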

We would also like to comment on the fact that the different coverage of the samples does have an effect on the estimation of differential methylation. The main idea to understand here is that low coverage means uncertainty, and uncertainty can give rise to results which, while correct, are slightly counterintuitive. For example, in figure 4 we show that a sample with low methylation and low coverage can be (maybe, one cannot say for sure) more methylated than a sample with high, certain methylation. The right panel of the same figure suggests that a good way of filtering for certainty is to select positions with low estimated variability (rather than to select based on read counts): this is because the same read depth can correspond to different variances depending on how many reads are non converted or converted.
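The last point is easy to check numerically. Assuming the Beta posterior obtained from a uniform prior, the sketch below computes the posterior variance for every split of a fixed read depth into non converted and converted reads; the function name and the depth of 10 are chosen for illustration.

```python
def posterior_variance(non_converted, converted):
    """Variance of the Beta(non_converted + 1, converted + 1) posterior."""
    a, b = non_converted + 1, converted + 1
    return a * b / ((a + b) ** 2 * (a + b + 1))


if __name__ == "__main__":
    depth = 10
    for k in range(depth + 1):
        print(k, depth - k, round(posterior_variance(k, depth - k), 5))
```

At a fixed depth the variance is largest when the reads are evenly split and smallest when they are all of one kind, so a filter on variance keeps some positions that a filter on read depth alone would treat identically.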

Figure 4. Effects of coverage on $P(\theta_1 > \theta_2)$.


In the left panel we show that a sample for which the methylation is estimated (with high uncertainty) to be low can be (with some probability) more methylated than a sample for which the methylation level is higher, and certain. In the right panel we show that, even if the total coverage is the same, the uncertainty over $\theta$ varies according to the count of non converted (NC) and converted (C) reads.

Finally, once one has the estimates for $\theta_1$ and $\theta_2$ (as obtained via the ratio of non converted reads over the coverage) and $P(\theta_1 > \theta_2)$ (i.e. the output of the algorithm explained in this paper), one can take an informed decision on a locus, taking into account both the size of the difference in methylation and its variability.

Implementation and Data Availability

The algorithm described above is implemented in a C program, called methyl_diff, available from the GitHub page of one of the authors: http://emanueleraineri.github.io/. The program takes as input (from stdin) four integers, i.e. the number of non converted and converted reads for the first and the second sample respectively, and prints $P(\theta_1 > \theta_2)$ on stdout. It takes Inline graphic to process Inline graphic lines on off-the-shelf hardware (MacBookPro with Intel i7@2.66 GHz). Note that the data used to produce figure 1 are publicly available (they were generated for BLUEPRINT, a consortium studying epigenetic marks in immune system cells) in at least two ways (also corresponding to two different formats):

  1. First of all, they can be downloaded from the same web page where the source code of our implementation is stored. The file G199.G202 contains the methylation levels of Inline graphic random positions from chromosome 1 of samples G199 and G202 (first we determined which positions had been sequenced in both samples; then we extracted a random subset of those). One can feed columns Inline graphic directly to the methyl_diff executable (those columns are the non converted and converted reads from the two samples).

  2. Secondly, they can be downloaded from the BLUEPRINT project ftp site ftp.ebi.ac.uk/pub/databases/blueprint/data/homo_sapiens/.

Funding Statement

The authors would like to acknowledge the support provided by the Spanish Ministerio de Ciencia e Innovación (grant SAF2011-30391). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Calladine CR, Drew H, Luisi B, Travers A (2004) Understanding DNA: The Molecule and How it Works. Academic Press.
  • 2. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, et al. (2009) Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 462: 315–322.
  • 3. Sun D, Xi Y, Rodriguez B, Park HJ, Tong P, et al. (2014) MOABS: model based analysis of bisulfite sequencing data. Genome Biology 15: R38.
  • 4. Hansen KD, Langmead B, Irizarry RA (2012) BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biology 13: R83.
  • 5. Feng H, Conneely KN, Wu H (2014) A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data. Nucleic Acids Research: gku154.
  • 6. Cook JD (2008) Numerical computation of stochastic inequality probabilities. Technical report, UT MD Anderson Cancer Center Department of Biostatistics.
  • 7. Cook JD (2005) Exact calculation of beta inequalities. Technical report, UT MD Anderson Cancer Center Department of Biostatistics.
  • 8. Abramowitz M, Stegun I (1965) Handbook of Mathematical Functions. Dover Publications Inc.
