Abstract
Motivation
Both β-value and M-value have been used as metrics to measure methylation levels. The M-value is more statistically valid for the differential analysis of methylation levels. However, the β-value is much more biologically interpretable and needs to be reported when the M-value method is used for differential methylation analysis. There is an urgent need to know how to interpret the degree of differential methylation from the M-value. In the M-value linear regression model, the differential methylation M-value, ΔM, can be easily obtained from the coefficient estimate, but it is not straightforward to get the differential methylation β-value, Δβ, since it cannot be obtained from the coefficient alone.
Results
To fill the gap, we have built a bridge to connect the statistically sound M-value linear regression model and the biologically interpretable Δβ. In this article, three methods are proposed to calculate the differential methylation value, Δβ, from the M-value linear regression model, and they are compared with the Δβ obtained directly from the β-value linear regression model. We show that, under the condition that the M-value linear regression model is correct, the M-model-coef method is the best among the four methods. The M-model-M-mean method works very well too. If the coefficients are not given (as in the ‘MethLAB’ package), the M-model-M-mean method should be used. The Δβ directly obtained from the β-value linear regression model can be very biased, especially when M-values are not in (−2, 2) or β-values are not in (0.2, 0.8).
Availability and implementation
The example dataset is available from the National Center for Biotechnology Information Gene Expression Omnibus repository under accession GSE104778.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Changes in DNA methylation patterns play a critical role in organ development, aging and diseases such as multiple sclerosis, diabetes, schizophrenia and cancer (Laird, 2010). Advances in the high-throughput assessment of DNA methylation have enabled quantitative profiling of DNA methylation of CpG loci throughout the genome, which is crucial to understanding the role of epigenetics in regulating gene expression. The microarray-based Infinium HumanMethylation27 BeadChip, the Infinium HumanMethylation450 BeadChip and the newly developed MethylationEPIC BeadChip (Infinium) microarray (850k) (Moran et al., 2016; Sandoval et al., 2011; Thirlwell et al., 2010) are widely used commercial platforms for low-cost high-throughput methylation profiling. Both β-value and M-value have been used as metrics to measure methylation levels. The β-value is defined as:
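$$\beta = \frac{\max(y_{\mathrm{meth}}, 0)}{\max(y_{\mathrm{meth}}, 0) + \max(y_{\mathrm{unmeth}}, 0) + \alpha}$$
where $y_{\mathrm{meth}}$ and $y_{\mathrm{unmeth}}$ are the intensities measured by the methylated and unmethylated probes for an interrogated CpG site, and a constant offset α (by default, α = 100) is added to regularize the β-value when both methylated and unmethylated probe intensities are low (Du et al., 2010). The M-value, the log2 ratio of the methylated to the unmethylated intensity, is defined as
$$M = \log_2\left(\frac{\max(y_{\mathrm{meth}}, 0) + \alpha}{\max(y_{\mathrm{unmeth}}, 0) + \alpha}\right)$$
where, for the M-value, the offset α is 1 by default (Du et al., 2010).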
While the M-value is more statistically valid for the differential analysis of methylation levels because it is approximately homoscedastic (the β-value has severe heteroscedasticity outside the middle methylation range, which imposes serious challenges in applying many statistical models) (Du et al., 2010), the β-value is much more biologically interpretable, because it corresponds roughly to the percentage of a site that is methylated, which makes the β-value very attractive when modelling the underlying biological effect (Du et al., 2010).
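As an illustration of the two definitions above, the following minimal Python sketch computes both metrics from a pair of probe intensities. The function names are hypothetical, and the default offsets (100 for the β-value, 1 for the M-value) are assumed to follow Du et al. (2010); they may differ in a given pipeline.

```python
import math


def beta_value(y_meth, y_unmeth, alpha=100.0):
    """Beta-value: methylated share of the total intensity, regularised by alpha."""
    meth = max(y_meth, 0.0)      # negative (background-corrected) intensities are clipped
    unmeth = max(y_unmeth, 0.0)
    return meth / (meth + unmeth + alpha)


def m_value(y_meth, y_unmeth, alpha=1.0):
    """M-value: log2 ratio of the offset-regularised intensities."""
    meth = max(y_meth, 0.0)
    unmeth = max(y_unmeth, 0.0)
    return math.log2((meth + alpha) / (unmeth + alpha))


# Toy intensities, made up for illustration only.
example_beta = beta_value(900.0, 0.0)   # 900 / (900 + 0 + 100)
example_m = m_value(3.0, 1.0)           # log2((3 + 1) / (1 + 1))
```

The offset keeps the β-value away from unstable ratios when both intensities are close to zero, which is the regularisation role described in the text.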
Saadati and Benner (2014) examined parametric methods, such as linear and beta regression, and nonparametric methods, such as rank-based regression. They found that the use of β-values in a beta regression setting may be of benefit, but only if the underlying distribution of the β-values is indeed the beta distribution, which requires that the methylated and unmethylated signal intensities are independently gamma distributed with the same scale parameter. The beta regression model seems very susceptible to violation of the beta distribution assumption and may show an uncontrolled false discovery rate. By allowing for possible correlations between the methylated and unmethylated signal intensities, Weinhold et al. (2016) proposed the Ratio of Correlated Gammas (RCG) model and showed the large benefit of the RCG model when the correlation is high. However, when the correlation is low (ρ = 0.2), the Type I error exceeds the nominal significance level of 0.05 in all of their simulations. Currently, the M-value linear regression model is one of the most popular models in the analysis of DNA methylation data. In this article, we focus on the M-value linear regression model.
Du et al. (2010) compared the β-value and M-value approaches and demonstrated that the relationship between the β-value and the M-value is a logit transformation:
$$\beta = \frac{2^{M}}{2^{M} + 1}, \qquad M = \log_2\left(\frac{\beta}{1 - \beta}\right).$$
They showed that the β-value method has severe heteroscedasticity for highly methylated or unmethylated CpG sites and recommended using the M-value method for conducting differential methylation analysis and including the β-value statistics when reporting the results to investigators.
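The transformation between the two scales can be sketched in a few lines of Python. The helper names `m_to_beta` and `beta_to_m` are hypothetical; the sketch only assumes the relationship β = 2^M / (2^M + 1) from Du et al. (2010), and it shows that the same shift on the M scale corresponds to very different shifts on the β scale depending on where it starts.

```python
import math


def m_to_beta(m):
    """Map an M-value to a beta-value: beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def beta_to_m(beta):
    """Map a beta-value in (0, 1) to an M-value: M = log2(beta / (1 - beta))."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return math.log2(beta / (1.0 - beta))


# The same one-unit increase in M gives different changes in beta.
shift_mid = m_to_beta(1.0) - m_to_beta(0.0)    # baseline beta = 0.5
shift_high = m_to_beta(5.0) - m_to_beta(4.0)   # baseline beta = 16/17
```

Here `shift_high` is several times smaller than `shift_mid`, which is why a coefficient on the M scale cannot be turned into a change in β-value without also fixing a baseline.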
In the M-value linear regression model (see the next section for details), the differential methylation value, ΔM, can be easily obtained from the coefficient estimate. However, it is not straightforward to get the differential methylation β-value, Δβ, since it cannot be obtained from the coefficient alone (see the methods below). For this reason, some investigators run both the M-value linear regression model and the β-value linear regression model. They use the M-value linear regression model to select the CpG sites and report the p-values, but use the β-value linear regression model to report the differential methylation β-value, Δβ. This can cause inconsistent results. First, the two models have different assumptions. Second, Δβ can be out of the [0, 1] interval (see below for details). In this article, we suggest different methods to calculate the differential methylation value, Δβ, from the M-value linear regression model and compare them with the Δβ directly obtained from the β-value linear regression model.
The outline of this article is as follows. The four methods (three from the M-value linear regression model and one from the β-value linear regression model) to obtain Δβ are presented in Section 2. In Section 3, simulations are conducted to compare the proposed methods. In Section 4, a simple method is proposed to quickly estimate Δβ when M-values are in (−2, 2) or β-values are in (0.2, 0.8). Examples are given in Section 5 to illustrate the proposed methods. Section 6 discusses the implications and provides concluding remarks.
2 Materials and methods
Consider the following linear regression model:
$$Y_i = a_0 + a_1 X_i + \sum_{j=1}^{p} b_j Z_{ij} + \varepsilon_i \tag{1}$$
where $Y_i$ is the methylation value (M-value or β-value), $X_i$ is the variable of interest and $Z_{i1}, \ldots, Z_{ip}$ are the adjusting variables (confounders) for individual $i$; $a_0$, $a_1$ and $b_1, \ldots, b_p$ are the regression coefficients and $\varepsilon_i$ is the error term. If the M-value (β-value) is used in (1), the model is called the M-value (β-value) linear regression model. The variable of interest and the covariates can be categorical or continuous. This model has been implemented in the R package ‘MethLAB’, although the coefficient estimates for the covariates are not available there (only the coefficient estimate for the variable of interest is available; this is also the case for most common methylation packages).
2.1. M-value linear regression model
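The M-value linear regression model is model (1) with the M-value as the response. For this model, the differential methylation value, ΔM, is the coefficient estimate of the variable of interest, i.e. the change in M-value for a 1-unit increase of that variable. Given a value for each covariate (e.g. the mean of that covariate), Δβ can be obtained by
$$\Delta\beta = \frac{2^{M_0 + \Delta M}}{2^{M_0 + \Delta M} + 1} - \frac{2^{M_0}}{2^{M_0} + 1} \tag{2}$$
where $M_0$ is the baseline M-value given by the fitted model at the chosen covariate values. This method will be called the ‘M-model-coef’ method for the rest of the article.
If the coefficients are not given (as in the ‘MethLAB’ package (Kilaru et al., 2012)), $M_0$ can be chosen as the mean of the methylation M-values, which will be called the ‘M-model-M-mean’ method. $M_0$ might also be chosen as
$$M_0 = \log_2\left(\frac{\bar{\beta}}{1 - \bar{\beta}}\right)$$
where $\bar{\beta}$ is the mean of the methylation β-values. This method will be called the ‘M-model-β-mean’ method.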
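The three ways of choosing the baseline M-value can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the inputs are toy values, and `baseline_from_coefficients` assumes that the baseline is the fitted linear predictor evaluated at user-chosen values of the predictors.

```python
import math
from statistics import fmean


def m_to_beta(m):
    """beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


def beta_to_m(beta):
    """M = log2(beta / (1 - beta))."""
    return math.log2(beta / (1.0 - beta))


def delta_beta(m0, delta_m):
    """Change in beta-value when the M-value moves from m0 to m0 + delta_m."""
    return m_to_beta(m0 + delta_m) - m_to_beta(m0)


def baseline_from_coefficients(intercept, coefs, values):
    """M-model-coef (assumed form): fitted linear predictor at the chosen values."""
    return intercept + sum(c * v for c, v in zip(coefs, values))


def baseline_from_m_mean(m_values):
    """M-model-M-mean: mean of the observed M-values."""
    return fmean(m_values)


def baseline_from_beta_mean(beta_values):
    """M-model-beta-mean: logit (base 2) of the mean observed beta-value."""
    return beta_to_m(fmean(beta_values))


# Toy inputs, made up for illustration only.
delta_m = 0.5                      # coefficient of the variable of interest
m_values = [2.0, 3.0, 4.0]
beta_values = [m_to_beta(m) for m in m_values]

estimates = {
    "M-model-coef": delta_beta(
        baseline_from_coefficients(1.0, [0.5, 0.2], [2.0, 5.0]), delta_m
    ),
    "M-model-M-mean": delta_beta(baseline_from_m_mean(m_values), delta_m),
    "M-model-beta-mean": delta_beta(baseline_from_beta_mean(beta_values), delta_m),
}
```

In this toy case the first two baselines coincide, while the β-mean baseline is lower because the mean of the β-values is not the transform of the mean M-value, so its Δβ differs.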
2.2. β-value linear regression model
The β-value linear regression model is model (1) with the β-value as the response. For this model, the differential methylation value, Δβ, is the coefficient estimate of the variable of interest, i.e. the change in β-value for a 1-unit increase of that variable. This method will be called the ‘β-model-coef’ method.
Note that the β-value has a lower limit of 0 and an upper limit of 1, but the right side of (1) has no limits. For this reason, the differential methylation β-value can fall outside the limits if we consider an increase of many units in the variable of interest.
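A small numeric sketch makes the boundary problem concrete. The baseline and slopes below are made-up values, chosen only to show that extrapolating a slope on the β scale over several units can leave [0, 1], whereas a shift applied on the M scale and transformed back cannot.

```python
def m_to_beta(m):
    """beta = 2^M / (2^M + 1), always strictly between 0 and 1."""
    return 2.0 ** m / (2.0 ** m + 1.0)


units = 3                     # increase in the variable of interest

# Beta scale: hypothetical baseline and per-unit coefficient.
baseline_beta = 0.90
beta_slope = 0.04
linear_prediction = baseline_beta + beta_slope * units   # exceeds the upper limit of 1

# M scale: hypothetical baseline and per-unit coefficient.
baseline_m = 3.0
m_slope = 1.0
m_scale_prediction = m_to_beta(baseline_m + m_slope * units)  # stays inside (0, 1)
```

The values are illustrative only; the point is the structural difference between an unbounded linear predictor and a predictor that is mapped back through the bounded transformation.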
3 Results
Simulations were conducted to compare the methods proposed in Section 2 under the condition that the M-value linear regression model is correct. These simulations are not intended to show that the M-value linear regression model is better than the β-value linear regression model (this has been done by Du et al. (2010)). Linear regression model (1) was used to generate the methylation M-values with sample size = 200. In the first simulation, we assume β-value (β_value) = 0.05, 0.5 and 0.85 and Δβ = 0.1. The covariate X was generated from a normal distribution with mean = 2 and standard deviation = 1. The M-values corresponding to these β-values were calculated by
$$M = \log_2\left(\frac{\beta}{1 - \beta}\right)$$
In the second simulation, we assume β_value = 0.05, 0.5 and 0.75 and Δβ = 0.2. After the M-values were generated, the β_values were calculated by
$$\beta = \frac{2^{M}}{2^{M} + 1}$$
Four methods (M-model-coef, M-model-M-mean, M-model-β-mean and β-model-coef) were applied to the generated data to estimate Δβ separately. The simulation was repeated 10 000 times. The bias and standard deviation (SD) are summarized in Tables 1 and 2 below.
Table 1.
β_value | M-model-coef Bias | M-model-coef SD | M-model-M-mean Bias | M-model-M-mean SD | M-model-β-mean Bias | M-model-β-mean SD | β-model-coef Bias | β-model-coef SD |
---|---|---|---|---|---|---|---|---|
0.05 | 0.0002 | 0.0075 | 0.0002 | 0.0100 | 0.0640 | 0.0159 | −0.0138 | 0.0094 |
0.5 | −0.0002 | 0.0118 | −0.0002 | 0.0118 | −0.0002 | 0.0118 | −0.0113 | 0.0103 |
0.85 | −0.0001 | 0.0046 | 0.0001 | 0.0082 | 0.0396 | 0.0094 | 0.0581 | 0.0103 |
Table 2.
β_value | M-model-coef Bias | M-model-coef SD | M-model-M-mean Bias | M-model-M-mean SD | M-model-β-mean Bias | M-model-β-mean SD | β-model-coef Bias | β-model-coef SD |
---|---|---|---|---|---|---|---|---|
0.05 | 0.0001 | 0.0117 | 0.0006 | 0.0216 | 0.1574 | 0.0222 | −0.0532 | 0.0134 |
0.5 | −0.0002 | 0.0106 | −0.0003 | 0.0109 | −0.0002 | 0.0107 | −0.0297 | 0.0087 |
0.75 | 0.0001 | 0.0073 | 0.0005 | 0.0193 | 0.0627 | 0.0148 | 0.0577 | 0.0112 |
Based on the simulations (Tables 1 and 2), the M-model-coef method is the best among the four methods. The M-model-M-mean method works very well, although it has a slightly larger bias and SD than the M-model-coef method. It has much less bias than the M-model-β-mean method (except when β_value = 0.5; we discuss this situation in the next section) and the β-model-coef method. If the coefficients are not given (as in the ‘MethLAB’ package, Kilaru et al., 2012), the M-model-M-mean method should be used.
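A comparison of this kind can be sketched with a small Monte Carlo experiment in Python. This is a sketch under stated assumptions, not the authors' simulation code: it assumes a single continuous predictor X ~ N(2, 1), Gaussian errors with a made-up standard deviation `sigma`, a baseline taken at x = 2, and far fewer replicates than the 10 000 used in the article. All function names are hypothetical.

```python
import math
import random
from statistics import fmean, stdev


def m_to_beta(m):
    return 2.0 ** m / (2.0 ** m + 1.0)


def beta_to_m(beta):
    return math.log2(beta / (1.0 - beta))


def delta_beta(m0, delta_m):
    return m_to_beta(m0 + delta_m) - m_to_beta(m0)


def ols(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    x_bar, y_bar = fmean(x), fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def simulate(beta0=0.85, true_delta=0.1, x0=2.0, sigma=0.5,
             n=200, reps=200, seed=1):
    """Return {method: (bias, SD)} for the four estimators of delta-beta."""
    rng = random.Random(seed)
    m0_true = beta_to_m(beta0)
    slope_true = beta_to_m(beta0 + true_delta) - m0_true
    intercept_true = m0_true - slope_true * x0   # baseline beta0 is reached at x = x0
    estimates = {"M-model-coef": [], "M-model-M-mean": [],
                 "M-model-beta-mean": [], "beta-model-coef": []}
    for _ in range(reps):
        x = [rng.gauss(x0, 1.0) for _ in range(n)]
        m = [intercept_true + slope_true * xi + rng.gauss(0.0, sigma) for xi in x]
        b = [m_to_beta(mi) for mi in m]
        a0, a1 = ols(x, m)            # M-value linear regression
        _, beta_slope = ols(x, b)     # beta-value linear regression
        estimates["M-model-coef"].append(delta_beta(a0 + a1 * x0, a1))
        estimates["M-model-M-mean"].append(delta_beta(fmean(m), a1))
        estimates["M-model-beta-mean"].append(delta_beta(beta_to_m(fmean(b)), a1))
        estimates["beta-model-coef"].append(beta_slope)
    return {name: (fmean(v) - true_delta, stdev(v)) for name, v in estimates.items()}


summary = simulate()
```

Under these assumptions the two M-model methods with a coefficient-based or M-mean baseline stay close to the true Δβ, while the β-mean baseline and the β-scale slope drift away from it when the baseline β-value is far from 0.5. The exact numbers depend on the assumed `sigma` and are not expected to reproduce Tables 1 and 2.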
4 A simple method when M-values are in (−2, 2) or β-values are in (0.2, 0.8)
As shown in the simulations above, the M-model-β-mean method has a large bias compared with the M-model-coef and M-model-M-mean methods when β_value ≠ 0.5. However, when β_value = 0.5, the M-model-β-mean method works very well. In fact, from Figure 1 above, we can see that there is a roughly linear relationship between M-value and β_value when M-values are in (−2, 2) or β-values are in (0.2, 0.8). Based on this approximately linear relationship, we can roughly estimate Δβ from the following simple formula:
$$\Delta\beta \approx 0.15 \, \Delta M \tag{3}$$
However, most of the methylation sites have β-values outside of (0.2, 0.8), limiting the utility and applicability of this formula.
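The limits of such a linear shortcut can be checked numerically. The sketch below assumes that the linear approximation uses the slope of the chord joining (M, β) = (−2, 0.2) and (2, 0.8), i.e. 0.6 / 4 = 0.15 per M-unit; this slope, the function names and the test points are assumptions made for illustration.

```python
def m_to_beta(m):
    """beta = 2^M / (2^M + 1)."""
    return 2.0 ** m / (2.0 ** m + 1.0)


# Chord slope over the roughly linear range (assumed form of the shortcut).
SLOPE = (0.8 - 0.2) / (2.0 - (-2.0))


def delta_beta_exact(m0, delta_m):
    """Exact change in beta-value for a shift delta_m starting at m0."""
    return m_to_beta(m0 + delta_m) - m_to_beta(m0)


def delta_beta_linear(delta_m):
    """Linear shortcut: ignores the baseline entirely."""
    return SLOPE * delta_m


delta_m = 0.1
comparison = [
    (m0, delta_beta_exact(m0, delta_m), delta_beta_linear(delta_m))
    for m0 in (0.0, 1.5, 5.0)   # mid-range, edge of the range, far outside
]
```

Inside (−2, 2) the two values are of the same order, but at a baseline M-value of 5 the shortcut overstates the change several-fold, which is the limitation noted in the text.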
5 Example
In this section, we will use two real studies as examples to illustrate the methods introduced above.
The first study was to determine whether maternal, postnatal, and early childhood lead exposure can alter the differentially methylated regions (DMRs) that control the monoallelic expression of imprinted genes involved in metabolism, growth, and development (Li et al., 2016). In this study, we reported that mean blood lead concentration from birth to 78 months was associated with a significant decrease in PEG3 DMR methylation. For a 1 μg/L increase of the mean blood lead concentration, Δβ = −0.0014117 if the ‘β-model-coef’ method is used; Δβ = −0.0014489 if the ‘M-model-coef’ method is used; Δβ = −0.0014491 if the ‘M-model-M-mean’ method is used; Δβ = −0.0014495 if the ‘M-model-β-mean’ method is used; and Δβ = −0.0012782 if the simple method is used. There were no dramatic differences among the methods in this example (note that the mean β_value for PEG3 DMR methylation is 0.43, which is in (0.2, 0.8)).
The second study was to determine the association of CpG site changes with the concentration of methylmercury (MeHg), major polychlorinated biphenyls (PCBs) and other organochlorine compounds (Leung et al., 2018). For this example, we only consider the association between PCB congener 105 (PCB105) and CpG site cg20619296. For a 1 μg/g increase of PCB105, Δβ = −0.099 if the ‘β-model-coef’ method is used; Δβ = −0.195 if the ‘M-model-coef’ method is used; Δβ = −0.195 if the ‘M-model-M-mean’ method is used; and Δβ = −0.203 if the ‘M-model-β-mean’ method is used. The simple method is not suitable for this case since the mean β_value for CpG site cg20619296 is 0.9736, which is not in (0.2, 0.8). In this example there is a large difference between the ‘β-model-coef’ method and the ‘M-model-coef’ method (or the ‘M-model-M-mean’ method): the Δβ given by the ‘M-model-coef’ method (or the ‘M-model-M-mean’ method) is almost double the Δβ given by the ‘β-model-coef’ method.
6 Discussion
Both β-value and M-value are commonly used to measure methylation levels. The M-value is more statistically valid for the differential analysis of methylation levels in relation to exposure x. However, the β-value is much more biologically interpretable for showing how much methylation has changed. In this article, we proposed three different methods to calculate the differential methylation value, Δβ, from the M-value linear regression model. Under the condition that the M-value linear regression model is correct, we showed that the M-model-coef method is the best among the four methods. The M-model-M-mean method works very well too. If the coefficients are not given (as in the ‘MethLAB’ package, Kilaru et al., 2012), the M-model-M-mean method should be used. The Δβ directly obtained from the β-value linear regression model can be very biased, especially when M-values are not in (−2, 2) or β-values are not in (0.2, 0.8). Note that the β-value distribution across the methylome is not uniform, but rather ‘U-shaped’ between 0 and 1. This means that most of the methylation sites have β-values outside of (0.2, 0.8), so the conclusions of this article can provide valuable guidance for better estimating the change in methylation level.
Acknowledgements
The author wishes to thank Dr Philippe Grandjean and the reviewers for their insightful and constructive comments that have greatly improved this article.
Funding
This work was in part supported by the National Institute of Environmental Health Sciences of the National Institutes of Health [P30-ES006096].
Conflict of Interest: none declared.
References
- Du P. et al. (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587.
- Kilaru V. et al. (2012) MethLAB: a graphical user interface package for the analysis of array-based DNA methylation data. Epigenetics, 7, 225–229.
- Laird P.W. (2010) Principles and challenges of genome-wide DNA methylation analysis. Nat. Rev. Genet., 11, 191–203.
- Leung Y.K. et al. (2018) Identification of sex-specific DNA methylation changes driven by specific chemicals in cord blood in a Faroese birth cohort. Epigenetics, 13, 290–300.
- Li Y. et al. (2016) Lead exposure during early human development and DNA methylation of imprinted gene regulatory elements in adulthood. Environ. Health Perspect., 124, 666–673.
- Moran S. et al. (2016) Validation of a DNA methylation microarray for 850,000 CpG sites of the human genome enriched in enhancer sequences. Epigenomics, 8, 389–399.
- Saadati M., Benner A. (2014) Statistical challenges of high-dimensional methylation data. Stat. Med., 33, 5347–5357.
- Sandoval J. et al. (2011) Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics, 6, 692–702.
- Thirlwell C. et al. (2010) Genome-wide DNA methylation analysis of archival formalin-fixed paraffin-embedded tissue using the Illumina Infinium HumanMethylation27 BeadChip. Methods, 52, 248–254.
- Weinhold L. et al. (2016) A statistical model for the analysis of beta values in DNA methylation studies. BMC Bioinformatics, 17, 480.