Bioinformatics. 2017 Jul 20;33(23):3701–3708. doi: 10.1093/bioinformatics/btx467

Detect differentially methylated regions using non-homogeneous hidden Markov model for methylation array data

Linghao Shen 1, Jun Zhu 2, Shuo-Yen Robert Li 3, Xiaodan Fan 4
Editor: John Hancock
PMCID: PMC6355111  PMID: 29036320

Abstract

Motivation

DNA methylation is an important epigenetic mechanism in gene regulation, and the detection of differentially methylated regions (DMRs) is of great interest in many disease studies. Existing DMR detection methods can be improved in several respects: (i) methylation statuses of nearby CpG sites are highly correlated, but this fact has seldom been modelled rigorously because of the uneven spacing between sites; (ii) it is practically important to be able to handle both paired and unpaired samples; and (iii) the capability to detect DMRs from a single pair of samples is in demand.

Results

We present DMRMark (DMR detection based on non-homogeneous hidden Markov model), a novel Bayesian framework for detecting DMRs from methylation array data. It combines the constrained Gaussian mixture model that incorporates the biological knowledge with the non-homogeneous hidden Markov model that models spatial correlation. Unlike existing methods, our DMR detection is achieved without predefined boundaries or decision windows. Furthermore, our method can detect DMRs from a single pair of samples and can also incorporate unpaired samples. Both simulation studies and real datasets from The Cancer Genome Atlas showed the significant improvement of DMRMark over other methods.

Availability and implementation

DMRMark is freely available as an R package at the CRAN R package repository.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

Methylation is one of the most informative and widely studied epigenetic modifications (Kelly et al., 2015; Sanchez-Mut et al., 2016; Stelzer et al., 2015). Many efforts have been devoted to detecting DMRs for biological inference (Bonin et al., 2016; Kretzmer et al., 2015). A DMR is a region where multiple adjacent CpG sites show different methylation statuses between phenotypes (Rakyan et al., 2011), and DMRs may occur throughout the whole genome (Suzuki and Bird, 2008). Microarrays are a cost-effective approach for investigating methylation status. Several microarray platforms are designed for probing the methylation status at pre-selected CpG sites, such as the widely used Infinium HumanMethylation450 BeadChip (450K array).

One special characteristic of methylation data is its strong spatial correlation. Hodges et al. (2009) found that nearby CpG sites tend to share the same methylation status. Chen et al. (2016) also suggested that differential methylation measured regionally should be more biologically interpretable and statistically powerful than that measured at individual sites. Thus people often pool neighbouring information to determine the methylation status, using methods such as adjacent site clustering (Aclust by Sofer et al., 2013), LOESS smoothing (Bumphunter by Jaffe et al., 2012), Stouffer’s method (Probe Lasso by Butcher and Beck, 2015) and Gaussian kernel smoothing (DMRcate by Peters et al., 2015). More importantly, CpG sites are not uniformly distributed throughout the genome (Miranda and Jones, 2007), and neither are the probes of methylation arrays. The spatial correlations of neighbouring probes should therefore be weighted according to their relative distance. Aclust merges neighbouring similar CpG sites into clusters. Bumphunter uses linear regression to determine differential methylation between phenotypes. However, neither of them considers the varying probe distances of methylation arrays. Probe Lasso is designed for DMR detection from the 450K array; it exhaustively customizes decision window sizes for different region types according to the 450K array annotation, and it weights and combines neighbouring methylation measurements according to the local correlation structures. This approach may be unreliable when the sample size is small, which is common in many clinical studies. DMRcate is a recently proposed DMR detection method that performs well in both simulations and real applications. It incorporates spatial correlation via Gaussian kernel smoothing. This approach is appropriate, but its performance is sensitive to a smoothing parameter that must be provided by the user. In short, existing methods still lack a rigorous approach to weighting and combining neighbouring methylation statuses.

Calling the DMRs from individual differentially methylated CpG sites (DMCs) is also non-trivial. Some widely used methods like COHCAP (Warden et al., 2013) only operate on predefined regions. Newer methods like WFMM (Lee and Morris, 2016) can operate on user-defined regions. However, Lay et al. (2015) showed that methylation changes at CpG shores or intergenic regions may also have important regulatory functions. These regions can be easily neglected by user-defined criteria. Methylation arrays also contain many unannotated but useful probes, thus DMR calling methods that can automatically define regions and utilize all array probes should be preferable.

In this paper, we propose DMRMark, a novel method based on the non-homogeneous hidden Markov model (NHMM) to detect DMRs from methylation array data. The spatial correlations are modelled by the transition probabilities of the NHMM. We extend the exponential transition function in OncoSNP (Yau et al., 2010) to reflect the varying distances between array probes. The NHMM provides automatic DMR calling via either the Viterbi algorithm or posterior sampling, and also allows DMR detection without replicates. To model the methylation measurements, we use a Constrained Gaussian Mixture model (CGM) with constraints based on the biological meaning of DMRs. The CGM restricts parameters to a smaller space such that the results are biologically reasonable. By combining NHMM and CGM, DMRMark can systematically pool the information from individual array probes for better DMR detection.

The remainder of the paper is organized as follows. In Section 2, we introduce the NHMM model. A Bayesian approach for parameter estimation is provided in Section 3. We evaluate our method by both synthetic and empirical data in Section 4. Finally, we discuss and conclude the paper in Section 5.

2 Model

Methylation status can be measured by β-value (Bibikova et al., 2006) or M-value (Irizarry et al., 2008). Du et al. (2010) provided a rigorous comparison of the two measures and showed that the M-value is more statistically convenient, so we use it in our study. Figure 1 illustrates the empirical distributions of M-values of paired samples from two datasets, where a pair of samples consists of the tumor sample and the corresponding normal sample from the same patient. We assume that the methylation data are generated by an NHMM, where the true methylation statuses are the hidden states and the observed M-values are the responses (Fig. 2). By inspecting the empirical data, we found that the paired M-values showed strong positive correlation. Thus we model pairs of M-values simultaneously. Let $M = (X, Y)$ be the paired M-values from the control and case groups, respectively, and $M_{ti} = (X_{ti}, Y_{ti})$ be the pair of M-values observed at the $t$-th locus of the $i$-th sample. We assume that there are $T$ loci and $n$ pairs of samples in total. Let $S_t$ be the hidden methylation status at the $t$-th locus, and $L_t$ be the distance (in bp) from locus $t$ to locus $t+1$. Denote $S = (S_1, \ldots, S_T)$ and $L = (L_1, \ldots, L_{T-1})$. The notation $L_{j:t}$ indicates the segment $(L_j, \ldots, L_t)$, and likewise for $M_{j:t}$ and $S_{j:t}$. $\Theta$ denotes all the parameters involved in the model. In NHMM, we assume the following conditional independence:

$$P(S_t \mid S_{1:(t-1)}, L, \Theta) = P(S_t \mid S_{t-1}, L_{t-1}, \Theta), \tag{1}$$
$$P(M_t \mid M_{1:(t-1)}, S, L, \Theta) = P(M_t \mid S_t, \Theta). \tag{2}$$

NHMM adds the covariate $L$ to the transition probabilities, which reflects the different distances between loci. Equations (1) and (2) are the transition and response models, respectively. We studied the exponential transition model, and used CGM as the response model. The traditional hidden Markov model (HMM) has also been applied to DMR detection by Saito et al. (2014). However, they modelled the changes of methylation counts as the hidden states, and used constant transition probabilities. In comparison, our model uses a simpler and more flexible transition model, and applies a novel response model.

Fig. 1.


Scatter plots of M-values from normal and tumor tissues of two TCGA datasets: (A) BLCA and (B) UCEC (details in Section 4.2). Circles indicate benchmark non-DMCs, and daggers indicate DMCs. For clear illustration, each figure randomly plots 10 000 loci

Fig. 2.


Illustration of the DMR detection scheme. (A) Paired M-values are modelled simultaneously. The horizontal line indicates zero M-value, and vertical lines indicate the probe positions. (B) Four methylation statuses. The transitions within the same status (solid lines) have higher probabilities than those between different statuses (broken lines). As the distance between loci grows, the transition probabilities approach uniform. (C) The Viterbi algorithm performs automatic DMR calling. The stacked bars plot the marginal probabilities of each status at each locus, which may be rugged. By balancing over the neighbourhood with non-homogeneous transitions, reasonable regions (indicated by the Viterbi path) can still be detected.

2.1 Transition model

Transition models in exponential form are common in NHMM and desirable for array data, since a specific methylation array only uses a subset of CpG sites as probes. Our transition model is an exponential function of $L_t$, with a form similar to the one introduced in OncoSNP (Yau et al., 2010) for modelling SNP data. We make some modifications to fit the methylation data. Specifically, we consider the methylation status index $k = 1, 2, 3, 4$ to represent both low methylation, both high methylation, hypermethylation and hypomethylation, respectively:

$$P(S_t = i \mid S_{t-1} = i, L_{t-1} = l) = \frac{1}{4} + \frac{3}{4}\exp\left(-l\exp(-c_i)\right), \tag{3}$$
$$P(S_t = j \mid S_{t-1} = i, L_{t-1} = l) = \frac{1}{4} - \frac{1}{4}\exp\left(-l\exp(-c_i)\right), \tag{4}$$

where $j \neq i$, and $c_k$ models the speed at which the correlation decreases with distance. We take an additional exponential to scale $c_k$ for ease of estimation. We allow different $c_k$ for different $k$ since different methylation statuses may show uneven proportions, which is more realistic than the transition model in OncoSNP. According to Kolde et al. (2016), consecutive loci tend to be uncorrelated when they are more than 10 000 bp apart. Thus, we suggest that an input sequence of probes be broken into multiple input sequences wherever adjacent probes are more than 10 000 bp apart. This transition model mimics the fact that neighbouring loci tend to have the same methylation status, and that the shorter the distance, the higher the probability of sharing the same status. Our transition model is flexible enough to model any stationary distribution of the methylation status (see Supplementary Section S7).
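The transition model can be illustrated with a minimal Python sketch. It assumes that Equations (3) and (4) use the decay term exp(−l·exp(−c_i)), so that the probability of keeping the same status falls from 1 towards 1/4 as the gap l grows. The function names `transition_matrix` and `split_chains` are hypothetical and not part of the DMRMark package, and the values of c are simply the simulation truth listed in Table 2.

```python
import math

def transition_matrix(l, c):
    """4x4 transition matrix for a gap of l bp; c holds c_1..c_4.

    Assumed form: stay = 1/4 + 3/4 * exp(-l * exp(-c_i)),
                  move = 1/4 - 1/4 * exp(-l * exp(-c_i)) to each other status.
    """
    rows = []
    for i in range(4):
        decay = math.exp(-l * math.exp(-c[i]))
        rows.append([0.25 + 0.75 * decay if j == i else 0.25 - 0.25 * decay
                     for j in range(4)])
    return rows

def split_chains(positions, max_gap=10_000):
    """Break sorted, non-empty probe positions into chains at gaps above max_gap."""
    chains, current = [], [positions[0]]
    for prev, pos in zip(positions, positions[1:]):
        if pos - prev > max_gap:
            chains.append(current)
            current = []
        current.append(pos)
    chains.append(current)
    return chains

c = [11.0, 11.0, 7.0, 6.0]           # illustrative values only
near = transition_matrix(100, c)     # probes 100 bp apart
far = transition_matrix(10_000, c)   # probes 10 000 bp apart
print(near[0][0], far[0][0])         # stay probability shrinks with distance
print(split_chains([100, 600, 20_000, 20_300]))
```

Each row sums to one by construction, and a larger c_k keeps the status persistent over longer distances.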

2.2 Response model

We used CGM as the response model. CGM is widely used in classification (Yukinawa et al., 2009) and image analysis (Ji et al., 2017). We design constraints on its parameter space to fit the biological features of DMRs. Previous methods used the bimodal Beta mixture to model β-values from one group of samples (Molaro et al., 2011). We extended it to a 4-component CGM to model paired M-values. As illustrated by the empirical data (Fig. 1), the M-values of non-DMCs spread along and stay close to the diagonal, whereas the M-values of DMCs tend to cluster together away from the diagonal. Depending on the purity of the tumor samples, the distance between the M-values of DMCs and the diagonal may vary. For non-DMCs, we model the differences between paired M-values by zero-mean normal distributions, an idea similar to that of Smyth (2004). For the two types of DMCs, we use a separate bivariate normal distribution to model the paired M-values of each type. A DMR should be a consecutive region where the corresponding methylation statuses $S_t$ are either all 3 or all 4.

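In summary, we assume the distributions for non-DMCs and DMCs as follows. For non-DMCs ($k = 1, 2$), we first model $X_t$ with normal distributions, then model $(X_t - Y_t)$ with zero-mean normal distributions:

$$P(M_t \mid S_t = k, \theta, \sigma^2, \sigma_N^2) = P(X_t \mid S_t = k, \theta_k, \sigma_k^2)\, P(Y_t \mid X_t, S_t = k, \sigma_N^2), \tag{5}$$
$$P(X_t \mid S_t = k, \theta_k, \sigma_k^2) = \prod_{i=1}^{n} \mathrm{Normal}(X_{ti}; \theta_k, \sigma_k^2), \tag{6}$$
$$P(Y_t \mid X_t, S_t = k, \sigma_N^2) = \prod_{i=1}^{n} \mathrm{Normal}(Y_{ti}; X_{ti}, \sigma_N^2), \tag{7}$$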

where $\theta = (\theta_1, \theta_2)$ and $\sigma^2 = (\sigma_1^2, \sigma_2^2)$ capture the positions and spreads of the non-DMC clusters. $\sigma_N^2$ is shared by both non-DMC clusters since they show similar spread in the direction perpendicular to the diagonal.

For DMCs ($k = 3, 4$), we model the paired M-values with bivariate normal distributions:

$$P(M_t \mid S_t = k, \mu, \Sigma) = \prod_{i=1}^{n} \mathrm{Normal}(X_{ti}, Y_{ti}; \mu_k, \Sigma_k). \tag{8}$$
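The response model of Equations (5) to (8) can be sketched in Python as a per-locus log-likelihood. This is an illustrative sketch rather than the package implementation: the helper names are hypothetical, parameters are passed as dictionaries keyed by the status index k, and the numerical values are the simulation truth from Table 2.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_bvn(x, y, mu, cov):
    """Log density of a bivariate normal; cov = [[a, b], [b, d]]."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x - mu[0], y - mu[1]
    quad = (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det
    return -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad

def log_emission(pairs, k, theta, sigma2, sigma2_n, mu, cov):
    """Log P(M_t | S_t = k) for the n pairs (x, y) observed at one locus."""
    if k in (1, 2):   # non-DMC: Equations (5)-(7)
        return sum(log_normal(x, theta[k], sigma2[k]) + log_normal(y, x, sigma2_n)
                   for x, y in pairs)
    return sum(log_bvn(x, y, mu[k], cov[k]) for x, y in pairs)   # DMC: Equation (8)

# Illustrative parameter values (simulation truth in Table 2)
theta = {1: -3.5, 2: 3.0}
sigma2 = {1: 1.0, 2: 1.5}
mu = {3: (-2.0, 1.0), 4: (2.2, -1.0)}
cov = {3: [[2.0, 1.0], [1.0, 1.5]], 4: [[2.0, 1.0], [1.0, 1.5]]}
params = (theta, sigma2, 1.21, mu, cov)

def most_likely_status(pairs):
    return max((1, 2, 3, 4), key=lambda k: log_emission(pairs, k, *params))

print(most_likely_status([(-3.4, -3.6)]), most_likely_status([(-2.0, 1.0)]))
```

Looking at one locus in isolation like this ignores the neighbouring probes; the NHMM transition model is what combines these per-locus terms along the genome.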

To enforce the biological definition of DMCs, which should typically lie away from the diagonal, we add constraints to the prior distributions of both the mean and the variance parameters of the M-values. More specifically, for the prior distributions of the $\mu_k$, we use truncated normal distributions to restrict their positions away from the diagonal:

$$P(\mu_3) \propto \mathrm{Normal}(\mu_3; \mu_{30}, 1/\kappa_{30}) \cdot I(\mu_{32} - \mu_{31} > D), \tag{9}$$
$$P(\mu_4) \propto \mathrm{Normal}(\mu_4; \mu_{40}, 1/\kappa_{40}) \cdot I(\mu_{41} - \mu_{42} > D), \tag{10}$$

where $I(\cdot)$ is the indicator function, which equals 1 if its argument is satisfied and 0 otherwise, $\mu_{k0}$ and $\kappa_{k0}$ are the hyperparameters of the normal priors, and $D$ is the hyperparameter quantifying the minimum distance between the mean M-values of the control and case groups. For the prior distributions of the $\Sigma_k$, we use the following truncated Inverse-Wishart (IW) distributions to ensure that the majority of each DMC population stays consistently on one side of the diagonal:

$$P(\Sigma_k \mid \mu_k) \propto \mathrm{IW}(A_{k0}, \nu_{k0}) \cdot I(\Delta(\mu_k, \Sigma_k) < 0). \tag{11}$$

$\Delta(\mu_k, \Sigma_k)$ is the discriminant (Cattani et al., 2005) of the quadratic polynomial

$$\frac{[(x - \mu_{k1})\cos\phi + (x - \mu_{k2})\sin\phi]^2}{\lambda_{k1}} + \frac{[(x - \mu_{k1})\sin\phi - (x - \mu_{k2})\cos\phi]^2}{\lambda_{k2}} - \chi^2_{2,\alpha}, \tag{12}$$

where $\lambda_{k1}$ and $\lambda_{k2}$ are the two eigenvalues of $\Sigma_k$, $\phi$ is the rotation angle of the first eigenvector of $\Sigma_k$, and $\chi^2_{2,\alpha}$ is the value of the $\chi^2$ distribution with 2 degrees of freedom at p-value $\alpha$. Details about Equation (12) can be found in Supplementary Section S1. In short, the real roots of Equation (12) represent the intersections of the $(1-\alpha)$-contour of a bivariate normal distribution with the diagonal. Enforcing $\Delta(\mu_k, \Sigma_k) < 0$ results in no intersection and thus keeps at least the $(1-\alpha)$ highest-density area of the bivariate normal distribution on one side of the diagonal. CGM restricts the positions and shapes of the DMC clusters such that the M-value differences of DMCs within a cluster are mostly consistent.
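The truncation in Equation (11) can be checked numerically with the following Python sketch. It makes two assumptions: the contour is written in the original coordinates as (z − μ)ᵀ Σ⁻¹ (z − μ) = χ²₍₂,α₎ with z = (x, x), which describes the same ellipse as the rotated form in Equation (12), and χ²₍₂,α₎ is read as the upper-α quantile, which for 2 degrees of freedom equals −2 ln α. The function name is hypothetical.

```python
import math

def contour_crosses_diagonal(mu, cov, alpha=0.2):
    """True if the (1 - alpha) contour of Normal(mu, cov) meets the line y = x."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    inv = [[d / det, -b / det], [-b / det, a / det]]
    chi2 = -2.0 * math.log(alpha)   # upper-alpha quantile of chi-square, 2 df
    # Substituting z = (x, x) gives the quadratic q2*x^2 + q1*x + q0 = 0.
    q2 = inv[0][0] + 2.0 * inv[0][1] + inv[1][1]
    w0 = inv[0][0] * mu[0] + inv[0][1] * mu[1]
    w1 = inv[1][0] * mu[0] + inv[1][1] * mu[1]
    q1 = -2.0 * (w0 + w1)
    q0 = mu[0] * w0 + mu[1] * w1 - chi2
    discriminant = q1 * q1 - 4.0 * q2 * q0
    return discriminant >= 0.0

cov = [[2.0, 1.0], [1.0, 1.5]]
print(contour_crosses_diagonal((-2.0, 1.0), cov))   # cluster away from the diagonal
print(contour_crosses_diagonal((0.1, 0.2), cov))    # cluster sitting on the diagonal
```

A proposed pair (μ_k, Σ_k) would be admissible under the prior only when this function returns False.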

In our experiments, we choose $D$ as the 90th percentile of the empirical distribution of $|X_{ti} - Y_{ti}|$ across all $t$ and $i$, and choose $\alpha = 20\%$. We also conducted simulation experiments to assess the sensitivity of the results to the choices of $D$ and $\alpha$ (see Supplementary Table S1). The results were stable when the hyperparameter values were not extremely high or low.
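As a small illustration, the choice of D and the indicator terms of Equations (9) and (10) could be computed as follows. The helper names are hypothetical, and the linear-interpolation percentile is an assumption, since the text does not state which percentile convention was used.

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def choose_d(x, y, q=90):
    """D = q-th percentile of |X_ti - Y_ti| pooled over all loci and samples."""
    return percentile([abs(a - b) for a, b in zip(x, y)], q)

def means_allowed(mu3, mu4, d):
    """Indicator terms of Equations (9) and (10): both DMC means clear the gap D."""
    return (mu3[1] - mu3[0] > d) and (mu4[0] - mu4[1] > d)

control = [float(v) for v in range(11)]   # toy M-values, not real data
case = [0.0] * 11
d = choose_d(control, case)
print(d, means_allowed((-2.0, 1.0), (2.2, -1.0), 2.5))
```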

3 Materials and methods

In this section, we describe the methods for parameter estimation and DMR calling. Traditionally the Expectation Maximization algorithm was used to estimate the parameters of HMM, but it cannot easily handle the non-standard distributions of the transition parameter $c_k$ and the truncated parameters of CGM. Thus we designed a Metropolis-within-Gibbs algorithm (Gelman et al., 2003) to perform the estimation by posterior sampling. Our algorithm iteratively samples each parameter from its posterior distribution conditional on all other parameters. To sample from non-standard conditional distributions, the Metropolis-Hastings (MH) method (Gelman et al., 2003) is adopted. Conditional on point estimates of the parameters, both posterior sampling and the Viterbi algorithm can be used to infer methylation statuses. In the following, we outline the parameter estimation algorithm; the details can be found in Supplementary Section S2.

3.1 Prior distributions

Priors of the DMC parameters were given in Equations (9), (10) and (11), which incorporate prior knowledge via location and shape restrictions. Normal priors are assumed for all $c_k$, and conventional conjugate priors are assumed for the other parameters (see Supplementary Section S2.1).

3.2 Posterior sampling with paired samples

We first describe the inference when all samples are paired, leaving the unpaired case to the next subsection. Starting from random values sampled from the corresponding priors, our algorithm iteratively updates the parameters by sampling from their conditional distributions as outlined below (see details in Supplementary Section S2.2) until convergence:

  • Step 1: Conditional on the transition and response parameters, sample the hidden states $S_1, \ldots, S_T$ with the procedures in Rydén (2008).

  • Step 2: Conditional on the methylation statuses of the first probes of all chains, sample the initial distribution parameter of the chains from its conjugate posterior Dirichlet distribution.

  • Step 3: Conditional on $L$ and $S$, sample $c_k$ for all $k$. The difficulty in sampling $c_k$ lies in computing the normalizing constant of $P(c_k \mid \cdot) \propto P(c_k) \prod_{t=2}^{T} [P(S_t \mid S_{t-1}, L_{t-1}, c_k)]^{I(S_{t-1} = k)}$. Unlike in the ordinary HMM, the transition probabilities here vary with $L$. We used an MH algorithm to sample a new $c_k$ conditional on the old $c_k$.

  • Step 4: For non-DMC ($k = 1, 2$) parameters, sample $\sigma_N^2$, $\sigma_k^2$ and $\theta_k$ conditional on $X$, $Y$ and $S$. With conjugate priors, their posteriors are of classical forms as in Murphy (2007).

  • Step 5: For DMC ($k = 3, 4$) parameters, sample $\Sigma_k$ and $\mu_k$ conditional on $X$, $Y$ and $S$, using MH algorithms to handle the truncation.
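Step 3 can be sketched as a single random-walk Metropolis-Hastings update, which avoids the normalizing constant because only a ratio of unnormalized posteriors is needed. The sketch rests on several assumptions: the Gaussian random-walk proposal, the step size and the prior hyperparameters are illustrative choices that the text does not specify, the transition form exp(−l·exp(−c_k)) is assumed for Equations (3) and (4), and all function names are hypothetical.

```python
import math
import random

def log_transition(s_prev, s_next, l, c_k):
    """Log transition probability leaving status s_prev over a gap of l bp."""
    decay = math.exp(-l * math.exp(-c_k))
    p = 0.25 + 0.75 * decay if s_next == s_prev else 0.25 - 0.25 * decay
    return math.log(max(p, 1e-300))   # guard against log(0)

def log_target(c_k, k, states, gaps, prior_mean, prior_var):
    """Unnormalized log posterior of c_k: normal prior times transitions out of k."""
    lp = -0.5 * (c_k - prior_mean) ** 2 / prior_var
    for t in range(1, len(states)):
        if states[t - 1] == k:
            lp += log_transition(states[t - 1], states[t], gaps[t - 1], c_k)
    return lp

def mh_update_c(c_k, k, states, gaps, prior_mean=8.0, prior_var=9.0,
                step=0.3, rng=random):
    """One random-walk MH update of c_k (symmetric proposal, so no Hastings term)."""
    proposal = c_k + rng.gauss(0.0, step)
    log_ratio = (log_target(proposal, k, states, gaps, prior_mean, prior_var)
                 - log_target(c_k, k, states, gaps, prior_mean, prior_var))
    if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
        return proposal
    return c_k

rng = random.Random(0)
states = [1, 1, 1, 1, 2, 2]           # toy hidden statuses along one chain
gaps = [100, 100, 100, 100, 100]      # toy distances in bp
c1 = 8.0
for _ in range(200):
    c1 = mh_update_c(c1, 1, states, gaps, rng=rng)
print(c1)
```

In a full sampler this update would be repeated for each k inside every Gibbs sweep, alongside the other steps.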

After discarding early iterations as burn-in, we collect later posterior samples for inference. One way to obtain point estimates of the parameters is to calculate the means of the posterior samples. Conditional on these point estimates, we can use the Viterbi algorithm to compute the most likely sequence of methylation statuses and the corresponding probability. In this way, the resulting DMCs are automatically aggregated into DMRs.
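The Viterbi step and the aggregation of DMCs into DMRs can be sketched as follows. This is a generic log-space Viterbi recursion with distance-dependent transitions, not the package code. Statuses are indexed 0 to 3 here (0 and 1 for non-DMCs, 2 and 3 for the paper's k = 3 and 4), the transition form exp(−l·exp(−c_i)) is assumed, and the function names and toy emission values are hypothetical.

```python
import math

def viterbi(log_emis, gaps, c, log_init):
    """Most likely status path for one chain.

    log_emis[t][k] is log P(M_t | S_t = k); gaps[t] is the distance from locus t to t + 1.
    """
    n_states = len(log_init)
    score = [log_init[k] + log_emis[0][k] for k in range(n_states)]
    back = []
    for t in range(1, len(log_emis)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i, best = 0, -math.inf
            for i in range(n_states):
                decay = math.exp(-gaps[t - 1] * math.exp(-c[i]))
                p = 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay
                cand = score[i] + math.log(max(p, 1e-300))
                if cand > best:
                    best_i, best = i, cand
            new_score.append(best + log_emis[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    path = [max(range(n_states), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

def call_dmrs(path, dmc_states=(2, 3)):
    """Group consecutive loci sharing one DMC status into (start, end, status) runs."""
    regions, start = [], None
    for t, s in enumerate(path + [None]):
        if start is not None and s != path[start]:
            regions.append((start, t - 1, path[start]))
            start = None
        if start is None and s in dmc_states:
            start = t
    return regions

# Toy chain: the middle locus marginally prefers status 0, its neighbours prefer 2.
strong = [-5.0, -5.0, 0.0, -5.0]
weak = [0.0, -5.0, -1.0, -5.0]
log_emis = [strong, strong, weak, strong, strong]
path = viterbi(log_emis, [50] * 4, [11.0, 11.0, 7.0, 6.0], [math.log(0.25)] * 4)
print(path, call_dmrs(path))
```

In the toy chain, closely spaced neighbours override the weak marginal preference of the middle locus, so the whole stretch is reported as one region.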

3.3 Utilization of unpaired samples

In reality, the control and case groups can be unpaired. For example, The Cancer Genome Atlas (TCGA) has more tumor samples than normal ones. The estimation of parameters for DMC clusters is based on computing the joint likelihood $P(X_{ti}, Y_{ti} \mid \mu_k, \Sigma_k)$ for assessing acceptance in the MH algorithm. When data are unpaired, we treat the missing data as random variables and integrate them out of the joint likelihood, resulting in a univariate normal distribution as the marginalized likelihood. All MH steps remain similar. For the estimation of parameters for non-DMC clusters in the case of unpaired samples, we can write the individual distributions of $X_{ti}$ and $Y_{ti}$ by integrating out the other from their joint distribution:

$$P(X_{ti} \mid \theta_k, \sigma_k^2, \sigma_N^2) = \mathrm{Normal}(X_{ti}; \theta_k, \sigma_k^2), \tag{13}$$
$$P(Y_{ti} \mid \theta_k, \sigma_k^2, \sigma_N^2) = \mathrm{Normal}(Y_{ti}; \theta_k, \sigma_N^2 + \sigma_k^2). \tag{14}$$

In this case, the conditional posterior distributions of $\sigma_N^2$ and $\sigma_k^2$ are no longer standard distributions. Accordingly, we make the following modifications to Step 4 in Section 3.2 (see Supplementary Section S2.3).

  • For $k = 1, 2$, sample $\theta_k$ from normal distributions conditional on $(X, Y)$ and the latest $(S, \sigma_k^2, \sigma_N^2)$.

  • Sample $\sigma_1^2$, $\sigma_2^2$ and $\sigma_N^2$ as a whole conditional on $(X, Y)$ and the latest $(\theta_k, S)$ using the MH algorithm.
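The marginalized likelihoods for unpaired samples can be sketched in Python as below. Missing values are represented by `None`, which is purely a convention of this sketch, and the function names are hypothetical. The final lines check Equation (14) numerically by integrating the paired density of Equations (6) and (7) over the unobserved control value.

```python
import math

def log_normal(x, mean, var):
    """Log density of a univariate normal with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_lik_non_dmc(x, y, theta_k, sigma2_k, sigma2_n):
    """Log-likelihood of one sample at a non-DMC locus; x or y may be None."""
    if x is not None and y is not None:      # paired: Equations (6) and (7)
        return log_normal(x, theta_k, sigma2_k) + log_normal(y, x, sigma2_n)
    if x is not None:                        # control only: Equation (13)
        return log_normal(x, theta_k, sigma2_k)
    return log_normal(y, theta_k, sigma2_n + sigma2_k)   # case only: Equation (14)

def log_lik_dmc_unpaired(value, which, mu_k, cov_k):
    """Univariate marginal of Equation (8) for a lone control (0) or case (1) value."""
    return log_normal(value, mu_k[which], cov_k[which][which])

# Numerical check of Equation (14): integrate the paired density over x.
theta_k, sigma2_k, sigma2_n, y = -3.5, 1.0, 1.21, -3.0
step = 0.01
grid = [theta_k - 15.0 + step * i for i in range(3001)]
integral = sum(math.exp(log_lik_non_dmc(x, y, theta_k, sigma2_k, sigma2_n))
               for x in grid) * step
closed_form = math.exp(log_lik_non_dmc(None, y, theta_k, sigma2_k, sigma2_n))
print(integral, closed_form)
```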

4 Results

We conducted comparisons between our method and existing DMR detection methods that are able to assess all probes of the methylation array as candidates for DMR calling. The methods we compared were Bumphunter, Probe Lasso and DMRcate. Since we would like to utilize all probes and detect DMRs even in unannotated regions, we did not consider methods such as COHCAP and WFMM. The three existing methods tested cover popular approaches including linear regression and different variations of moderated-t statistics, but they all have shortcomings in dealing with the spatial correlation of methylation data, which is systematically modelled in DMRMark. We summarize their parameter settings in Table 1. For reference, we also show the expected performance of random guessing, which randomly classifies each locus as DMC or non-DMC with a given probability, repeated over all probabilities.

Table 1.

Parameter setting for all tested methods

| Method | Version | Parameter setting | Tuning parameters |
|---|---|---|---|
| DMRMark | v1.0 | initHeuristic = FALSE | VitP |
| Bumphunter | v1.12.0 | maxGap = 1000, smooth = TRUE, smoothFunction = loessByCluster | cutoff |
| DMRcate | v1.8.6 | All default | pcutoff |
| Probe Lasso | ChAMP v1.6.2 | filterXY = F, mafPol.upper = 1, lassoRadius = 1000, adjPVal = 0.95, minSigProbesLasso = 1 | DMRpval |

Note: Any omission implies the use of default setting. The parameter initHeuristic of DMRMark is set to FALSE for random initialization. The parameter adjPVal of Probe Lasso is set to 0.95 to filter out extremely noisy probes while retaining most of its operation region. The tuning parameters are the thresholds determining the balance between recall and precision. Different values for these tuning parameters were tried to get the whole performance curves.

We also retrieved publicly available methylation array (mostly 450K array) datasets from the TCGA Research Network (http://cancergenome.nih.gov/). In the absence of ground-truth DMR annotations in the empirical data, we followed Peters et al. (2015) and treated DMRs called from the paired whole-genome bisulfite sequencing (WGBS) data as the benchmark for evaluation. WGBS can be treated as a replicate of the methylation array with much higher resolution, so DMR calling from WGBS is expected to be more accurate. Note that WGBS data are much more expensive, and thus still relatively rare compared to methylation array data. We searched the TCGA database for paired normal and tumor samples which have both WGBS and methylation array data. We found one such pair of samples for each of Bladder Urothelial Carcinoma (BLCA) and Uterine Corpus Endometrial Carcinoma (UCEC). Since other methods require multiple pairs of samples, we retrieved more pairs of methylation array data for each tumor type by selecting paired samples from the same batch and the same tissue source. This results in 6 pairs for BLCA and 4 pairs for UCEC.

For WGBS data, we used the R package DSS (Wu et al., 2015) to call the DMRs at different significance levels (the DSS-p parameter), then the called DMRs were mapped to the methylation array probes as the benchmark DMRs. For methylation array data, we first performed between-array and within-array normalization using the Illumina and SWAN methods provided by Minfi (Aryee et al., 2014). Following the procedure of Peters et al. (2015), we also removed the cross-hybridizing probes (Chen et al., 2013) and the probes within 2 bp of SNPs or from the Y chromosome in order to reduce potential confounding. To show the concordance between WGBS data and 450K array data, we calculated the Pearson correlation coefficients between the methylation percentages of WGBS and the β-values of their paired 450K arrays for matched CpG sites (see Supplementary Section S4). The correlation turned out to be high, which also justified the use of DMRs from WGBS data to validate the results from array data. The DMRs called from WGBS data are shown in Figure 1.

We measured the performance of DMR detection via the area under the precision-recall curve (AUCPR) instead of the area under the receiver operating characteristic (ROC) curve. Since DMRs are rare along the genome compared to non-DMRs, and ROC-based measures can be misleading under such class skew (Keilwagen et al., 2014), AUCPR is a better fit here.
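For illustration, one common estimator of the area under the precision-recall curve is the step-wise average precision, sketched below in Python. This is an assumption about the estimator, since the text does not say how the area was computed, and the function name and toy inputs are hypothetical. The sketch assumes at least one positive label.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve (average precision).

    scores: higher means more likely DMC; labels: 1 for benchmark DMC, else 0.
    Tied scores are processed together as a single threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    tp = fp = 0
    area, prev_recall = 0.0, 0.0
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and scores[order[j]] == scores[order[i]]:
            tp += labels[order[j]]
            fp += 1 - labels[order[j]]
            j += 1
        recall = tp / n_pos
        precision = tp / (tp + fp)
        area += (recall - prev_recall) * precision
        prev_recall = recall
        i = j
    return area

print(average_precision([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))   # perfect ranking
print(average_precision([0.5, 0.5, 0.5, 0.5], [1, 0, 0, 0]))   # uninformative scores
```

With uninformative (fully tied) scores the estimate equals the fraction of positives, which is why the random-guessing baseline depends on how rare the benchmark DMCs are.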

4.1 Simulation studies

For simulation studies, we used two different methods to synthesize the data, namely the NHMM-based method and the TSS-based method. Both methods used the same probe layout of 450 K array. The NHMM-based method simulated methylation statuses for all loci and corresponding M-values by exactly following our model assumption. For the TSS-based method, we first set the methylation statuses using the procedure from Peters et al. (2015), which randomly selects some transcription start site (TSS) regions to be DMRs; then we generate M-values in DMR regions and non-DMR regions by randomly sampling from the pool of M-values of benchmark DMRs and benchmark non-DMRs, respectively; finally we add further Gaussian noise to the M-values. For each method, we generated 5 pairs of samples based on the first 10 000 probes of 450 K array. Details can be found in Supplementary Section S3.
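The NHMM-based scheme can be illustrated with a short Python sketch that samples hidden statuses along a chain and then draws paired M-values from the response model. It is not the procedure of Supplementary Section S3: statuses are indexed 0 to 3, the transition form exp(−l·exp(−c_i)) is assumed, the parameter values are the truth values of Table 2, and the constant 200 bp spacing and the function names are hypothetical.

```python
import math
import random

def simulate_states(gaps, c, init, rng):
    """Sample hidden statuses 0..3 along one chain with distance-dependent transitions."""
    states = [rng.choices(range(4), weights=init)[0]]
    for l in gaps:
        i = states[-1]
        decay = math.exp(-l * math.exp(-c[i]))
        probs = [0.25 + 0.75 * decay if j == i else 0.25 - 0.25 * decay
                 for j in range(4)]
        states.append(rng.choices(range(4), weights=probs)[0])
    return states

def simulate_pair(state, rng):
    """Draw one (control, case) M-value pair given the hidden status."""
    if state in (0, 1):                       # non-DMC: Equations (6) and (7)
        theta, var = [(-3.5, 1.0), (3.0, 1.5)][state]
        x = rng.gauss(theta, math.sqrt(var))
        return x, rng.gauss(x, math.sqrt(1.21))
    mean = [(-2.0, 1.0), (2.2, -1.0)][state - 2]
    # Cholesky factor of [[2, 1], [1, 1.5]] gives correlated bivariate normal draws.
    l11 = math.sqrt(2.0)
    l21 = 1.0 / l11
    l22 = math.sqrt(1.5 - l21 ** 2)
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return mean[0] + l11 * z1, mean[1] + l21 * z1 + l22 * z2

rng = random.Random(1)
gaps = [200] * 999                             # toy layout, not the 450K probe layout
states = simulate_states(gaps, [11.0, 11.0, 7.0, 6.0], [0.25] * 4, rng)
m_values = [simulate_pair(s, rng) for s in states]
changes = sum(a != b for a, b in zip(states, states[1:]))
print(len(m_values), changes)
```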

For performance comparison with existing methods, we applied the aforementioned DMR detection methods to the synthetic data. Their performances are shown in Figure 3, in which DMRMark showed better accuracy than the other methods on both synthetic datasets.

Fig. 3.


Evaluation of different DMR finding methods. The box plots mark the median and quartiles of AUCPR from 100 independent simulation tests with 5 pairs of samples. The black horizontal lines indicate the expected performance of random guessing. The data were simulated by (A) NHMM-based method and (B) TSS-based method

To show the ability to handle a single pair of samples and unpaired samples, we applied our method to two other cases: (i) the first pair of the NHMM-based data; (ii) the first pair plus the control sample from the second pair and the case sample from the third pair of the NHMM-based data. According to convergence diagnosis plots and statistics (see Supplementary Section S2.4), the parameters generally converged within 100 iterations. Table 2 presents the parameter estimation and DMR detection accuracy, including the means and standard deviations (in brackets) of 100 independent simulations. It shows that the estimated parameters were consistent with the truth with small estimation errors. When using additional unpaired data, the AUCPR was significantly increased (Wilcoxon signed-rank test p-value $= 1.98 \times 10^{-18}$), and the estimation of almost all parameters was significantly improved (see Supplementary Section S8).

Table 2.

Parameter estimation results

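| Parameter | Truth value | Single paired | With additional unpaired |
|---|---|---|---|
| $\theta_1$ | −3.5 | −3.49 (0.01) | −3.50 (0.008) |
| $\theta_2$ | 3 | 2.98 (0.02) | 2.99 (0.01) |
| $\mu_3$ | −2, 1 | −2.05 (0.09), 1.08 (0.07) | −1.99 (0.06), 1.02 (0.05) |
| $\mu_4$ | 2.2, −1 | 2.32 (0.10), −1.15 (0.07) | 2.22 (0.06), −1.06 (0.05) |
| $\sigma_1^2$ | 1 | 1.02 (0.02) | 1.00 (0.01) |
| $\sigma_2^2$ | 1.5 | 1.52 (0.03) | 1.51 (0.02) |
| $\sigma_N^2$ | 1.21 | 1.25 (0.02) | 1.22 (0.01) |
| $\Sigma_3$ | $\begin{pmatrix} 2 & 1 \\ 1 & 1.5 \end{pmatrix}$ | $\begin{pmatrix} 1.91\,(0.17) & 0.94\,(0.12) \\ 0.94\,(0.12) & 1.39\,(0.12) \end{pmatrix}$ | $\begin{pmatrix} 2.00\,(0.12) & 0.99\,(0.09) \\ 0.99\,(0.09) & 1.48\,(0.08) \end{pmatrix}$ |
| $\Sigma_4$ | $\begin{pmatrix} 2 & 1 \\ 1 & 1.5 \end{pmatrix}$ | $\begin{pmatrix} 1.87\,(0.24) & 0.95\,(0.19) \\ 0.95\,(0.19) & 1.36\,(0.18) \end{pmatrix}$ | $\begin{pmatrix} 1.99\,(0.16) & 0.99\,(0.13) \\ 0.99\,(0.13) & 1.46\,(0.11) \end{pmatrix}$ |
| $c$ | 11, 11, 7, 6 | 11.04 (0.12), 11.08 (0.12), 7.13 (0.19), 6.16 (0.20) | 11.01 (0.11), 11.06 (0.11), 7.10 (0.18), 6.14 (0.19) |
| AUCPR | | 0.976 (0.006) | 0.996 (0.001) |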

Note: The table presents the parameter estimation accuracy using the first pair only (Single paired) and using additionally the control sample of the second pair and the case sample of the third pair (With additional unpaired). The numbers are means (standard deviations in parentheses) of parameters estimated from 100 independent datasets generated with the same true parameters. All parameter estimation results except $c_4$ were significantly improved with additional unpaired data.

4.2 Real data experiments

We compared the performance of DMRMark with other methods using the BLCA and UCEC 450K array datasets from TCGA. The performances of the different methods are shown in Figure 4. From Figure 4A and B, we observed that Bumphunter, which uses regression, performed the worst, while DMRcate and Probe Lasso, which are based on moderated-t statistics, performed similarly. In all situations, DMRMark showed the best performance, which indicates that our approach can be more powerful than existing methods. The performance gain of DMRMark in BLCA is smaller than that in UCEC, which may be due to the fact that BLCA has larger between-sample variation (lower correlation, as shown in Supplementary Fig. S3). In Figure 4C and D, Bumphunter cannot reach high recall even when the cut-off is set to 0, which may be due to its internal filtering. Bumphunter also only performed well in the low recall region, which makes it less favourable. Probe Lasso performs an internal P-value adjustment that transforms all P-values above a threshold to 1, which sometimes causes a sharp decrease as in Figure 4D. Otherwise its performance was similar to that of DMRcate. Overall, we observed that DMRMark outperformed the other methods in almost the whole region.

Fig. 4.


The performance of different methods for BLCA (A and C) and UCEC (B and D) datasets. In (A) and (B), we evaluated their AUCPR according to the DMRs called from the paired WGBS data by different DSS-p, where Random represents the performance of random guessing. In (C) and (D), we inspected the precision-recall curves of different methods at default DSS-p = 0.001

To check the biological relevance of the detected DMRs, we conducted enrichment tests based on the lists of marker genes reported for BLCA (Chung et al., 2011; Reinert et al., 2011) and UCEC (Wentzensen et al., 2014). We ranked the detected DMRs by the maximal marginal differential methylation probabilities of their component probes, then selected the top 500 DMR genes according to the 450K array annotations. The marker genes were significantly enriched in the top 500 DMR genes (P-value $= 1.14 \times 10^{-8}$ for BLCA and P-value $= 1.00 \times 10^{-5}$ for UCEC). Interestingly, the DMR ranked highest in BLCA corresponds to the gene OTX1, which is a well-known biomarker for both breast cancer (Acton, 2012) and lung cancer (Rauch et al., 2012). This coincides with the finding of Cancer Genome Atlas Research Network et al. (2014) that some bladder, breast and lung cancers share common pathways. Recently, OTX1 has also been shown to be a biomarker of bladder cancer (Beukers et al., 2017). Further supporting literature is provided in Supplementary Section S9. These findings indicate that our method can give biologically relevant results.
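The text does not state which enrichment test was used. As one plausible illustration only, a one-sided hypergeometric test asks how likely it is to see at least the observed number of marker genes among the selected genes if the selection were random. The function name and the toy numbers below are hypothetical and unrelated to the reported P-values.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_markers, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn without replacement
    from n_universe genes, of which n_markers are marker genes."""
    total = comb(n_universe, n_selected)
    upper = min(n_markers, n_selected)
    tail = sum(comb(n_markers, k) * comb(n_universe - n_markers, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Toy example: 10 genes, 3 markers, 3 selected, at least 1 marker among them.
print(hypergeom_enrichment_p(10, 3, 3, 1))
```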

To evaluate the performance of DMRMark in detecting DMRs from a single pair of samples, we applied it to the one WGBS-matched pair of 450K array data for each tumor type. Since moderated-t-based methods cannot handle the single-pair case, we compared the performance of DMRMark with that of Bumphunter and of a clustering approach using CGM only (i.e. DMRMark without NHMM). The results are shown in Figure 5A and B. Bumphunter performed poorly except in the low recall region, while DMRMark gave satisfactory results in the whole region. The clustering approach using CGM only also showed acceptable performance, which indicates that the CGM response model in DMRMark is appropriate. Owing to the transition model, which makes use of the spatial correlation, DMRMark consistently outperformed the approach using CGM only. We expect the performance gain to be even larger when the probes are denser. The methylation statuses of all probes inferred by DMRMark are shown in Figure 5C and D. As a comparison, we also clustered the probes using Mclust (Fraley et al., 2012) (see Supplementary Section S6). The benchmark grouping shown in Figure 1 is similar to that of DMRMark but not to that of Mclust. This is because DMRMark incorporates spatial correlation through NHMM and biological knowledge about DMRs through the priors of CGM.

Fig. 5.


The performance on single-paired data for both BLCA (A and C) and UCEC (B and D). (A) and (B) show the precision-recall curves. (C) and (D) demonstrate the methylation statuses inferred by DMRMark (randomly plot 10 000 probes for clarity). Dots, circles, daggers and crosses represent the methylation status k=1,2,3,4, respectively

5 Discussion

We developed DMRMark, a new approach to model methylation statuses and detect DMRs from methylation array data. In DMRMark, NHMM is used to capture the spatial correlation, and CGM is designed specifically for modelling the biological meaning of different methylation statuses of paired M-values. The Bayesian approach is used to infer the parameters. The Viterbi algorithm is used to report the most likely sequence of methylation statuses, which produces DMR calling automatically. We demonstrated the improved performance of DMRMark over existing methods using both synthetic and real TCGA array data.

One important contribution of DMRMark is automatic DMR calling. Previous methods that rely on biological features (e.g. CpG islands, CpG shores) for DMR calling may ignore important functional signals in intergenic regions. The methods we compared in Section 4 used predefined boundaries or decision windows to call DMRs: they first analyzed individual methylation statuses and then aggregated similar individual sites within a certain distance. This approach may not be flexible or accurate enough. Kolde et al. (2016) used the minimum description length method to find region boundaries. Their approach is effective, but it has two small drawbacks: it requires an initial segmentation using a constant cut-off, and it does not consider the varying probe distances of the 450K array. HMM-based region definitions have also been studied before. For example, Sun et al. (2014) applied HMM to aggregate individually detected DMCs into regions. However, such methods usually separate DMC detection and DMR calling into two steps, and thus do not fully exploit the potential of the HMM framework. Our approach conducts methylation status analysis and DMR calling simultaneously. Regional information improves the analysis of individual loci, and the improved individual statuses help call better DMRs. Another advantage of DMRMark is its capability to detect DMRs without replicates, which can be useful in identifying personal methylation differences. The need for this capability is growing with the rise of personalized medicine, so we expect our approach to benefit a wide range of new applications.

The probabilistic framework provided by DMRMark can be easily extended to handle more of the complexity of real scenarios. For example, normal cell contamination in tumor samples often degrades the comparison between tumor and normal samples. Takahashi et al. (2013) illustrated the potential of using known completely methylated or unmethylated loci to estimate the fraction of cancer cells in tumor samples. Zhang et al. (2015) modelled the β-values at DMCs as linear combinations of β-values from pure methylated and pure unmethylated loci. Zheng et al. (2017) treated the mode of the β-values of impure samples at chosen DMCs as the purity. However, these methods are not efficient enough because they do not account for the fact that DMR detection and purity estimation are essentially intertwined problems. One possible solution is to integrate our probabilistic model with cell purity estimation. Joint inference of DMRs and cell purity in the same hierarchical probabilistic model should provide much greater statistical power, although at the cost of a more complicated model.
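The linear-mixing idea behind these purity estimators can be illustrated with a short Python sketch. It assumes that the observed β-value at each informative locus is purity × (pure tumor value) + (1 − purity) × (pure normal value) and solves for purity by least squares. The function name and the toy values are hypothetical, and this is a simplification for illustration rather than the procedure of any of the cited papers.

```python
def estimate_purity(observed, pure_tumor, pure_normal):
    """Least-squares purity under the assumed mixing model
    observed = purity * pure_tumor + (1 - purity) * pure_normal."""
    num = 0.0
    den = 0.0
    for obs, tum, nor in zip(observed, pure_tumor, pure_normal):
        diff = tum - nor
        num += (obs - nor) * diff
        den += diff * diff
    if den == 0:
        raise ValueError("loci must differ between pure tumor and pure normal")
    # Clip to the valid range of a proportion.
    return min(1.0, max(0.0, num / den))


# Toy loci that are fully methylated in one cell type and unmethylated in the other.
pure_tumor = [1.0, 0.0, 1.0]
pure_normal = [0.0, 1.0, 0.0]
observed = [0.7 * t + 0.3 * n for t, n in zip(pure_tumor, pure_normal)]
purity = estimate_purity(observed, pure_tumor, pure_normal)  # close to 0.7
```

The sketch also shows why the two problems are coupled: the estimate needs loci that are known to differ between tumor and normal cells, which is exactly what DMR detection provides.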

One limitation of our method is that it can currently handle only two-group comparisons. For diseases that have multiple subtypes, we can extend our model by increasing the number of transitions and the dimensionality of the response model, though the curse of dimensionality is unavoidable. DMRMark may also be applied to WGBS data by treating methylation ratios as β-values. However, redesigning the response model based on a binomial or Poisson distribution should capture the characteristics of sequencing data more faithfully. As WGBS data are much denser than microarray data, this extension of our approach might perform even better.
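As a rough indication of what a count-based response model could look like, the sketch below evaluates a binomial log-likelihood for the number of methylated reads at a CpG site given its coverage and a state-specific methylation level. The function names, the per-state levels and the read counts are all hypothetical; the paper does not specify such a model.

```python
import math


def binomial_log_likelihood(methylated, coverage, theta):
    """Log-probability of observing `methylated` methylated reads out of
    `coverage` reads when the underlying methylation level is theta."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    if not 0 <= methylated <= coverage:
        raise ValueError("methylated must lie between 0 and coverage")
    log_choose = (math.lgamma(coverage + 1)
                  - math.lgamma(methylated + 1)
                  - math.lgamma(coverage - methylated + 1))
    return (log_choose
            + methylated * math.log(theta)
            + (coverage - methylated) * math.log(1.0 - theta))


def paired_log_likelihood(counts_a, counts_b, theta_a, theta_b):
    """Response for a pair of samples, assuming independence given the state.
    counts_* are (methylated, coverage) tuples."""
    return (binomial_log_likelihood(counts_a[0], counts_a[1], theta_a)
            + binomial_log_likelihood(counts_b[0], counts_b[1], theta_b))


# Toy site: 18 of 20 reads methylated in sample A, 2 of 20 in sample B.
differential = paired_log_likelihood((18, 20), (2, 20), 0.9, 0.1)
both_methylated = paired_log_likelihood((18, 20), (2, 20), 0.9, 0.9)
```

Unlike a Gaussian model on ratios, this form weights each site by its coverage, so a site with few reads contributes less evidence than a deeply sequenced one.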

Another way to improve our current implementation of DMRMark is parallelization. Without parallelization, it took 6.4 and 5.7 hours to run DMRMark on the full BLCA and UCEC datasets, respectively, on a PC with Windows 8.1, an Intel i3-4130 CPU (3M Cache, 3.40 GHz) and 8 GB of memory. Most of the computational burden comes from updating the hidden states of NHMM, which can be greatly accelerated by parallelization since the state updates of disconnected segments, such as different chromosomes, are conditionally independent.
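The structure of such a parallel update can be sketched in Python as follows. Probes are grouped by chromosome and each segment is processed independently. Everything here is an assumption made for illustration: `update_segment` is a placeholder rather than the NHMM state update, and a thread pool is used only to keep the sketch self-contained. For CPU-bound work in Python, `ProcessPoolExecutor` from the same module would be passed as `executor_factory` instead (called from under an `if __name__ == "__main__":` guard on Windows).

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chromosome(probes):
    """probes: iterable of (chromosome, position, value) tuples.
    Returns {chromosome: [(position, value), ...]} sorted by position."""
    segments = defaultdict(list)
    for chrom, pos, value in probes:
        segments[chrom].append((pos, value))
    return {chrom: sorted(items) for chrom, items in segments.items()}


def update_segment(segment):
    """Placeholder for the per-segment hidden-state update."""
    return [1 if value > 0 else 0 for _, value in segment]


def update_all(probes, executor_factory=ThreadPoolExecutor, max_workers=4):
    """Run the update on every chromosome independently and collect results."""
    segments = group_by_chromosome(probes)
    chroms = sorted(segments)
    with executor_factory(max_workers=max_workers) as pool:
        results = list(pool.map(update_segment, [segments[c] for c in chroms]))
    return dict(zip(chroms, results))


# Toy input with two chromosomes.
probes = [("chr2", 50, -1.0), ("chr1", 200, 2.0), ("chr1", 100, -0.5)]
states = update_all(probes)
```

Because no segment reads another segment's states, the results do not depend on the order in which the workers finish.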

Supplementary Material

Supplementary Data

Acknowledgements

We thank Professor Suet Yi Leung from the University of Hong Kong for engaging discussions.

Funding

This research is partially supported by three grants from the Research Grants Council of the Hong Kong SAR, China (CUHK400913, CUHK14203915, T12-710/16-R) and two direct grants from the Chinese University of Hong Kong (CUHK4053135, CUHK3132753).

Conflict of Interest: none declared.

References

  1. Acton Q.A. (2012) Breast Cancer: New Insights for the Healthcare Professional. ScholarlyEditions, Atlanta, Georgia, USA.
  2. Aryee M.J. et al. (2014) Minfi: a flexible and comprehensive bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics, 30, 1363–1369.
  3. Beukers W. et al. (2017) FGFR3, TERT and OTX1 as a Urinary Biomarker Combination for Surveillance of Patients with Bladder Cancer in a Large Prospective Multicenter Study. J. Urol., 197, 1410–1418.
  4. Bibikova M. et al. (2006) High-throughput DNA methylation profiling using universal bead arrays. Genome Res., 16, 383–393.
  5. Bonin C.A. et al. (2016) Identification of differentially methylated regions in new genes associated with knee osteoarthritis. Gene, 576, 312–318.
  6. Butcher L.M., Beck S. (2015) Probe Lasso: a novel method to rope in differentially methylated regions with 450k DNA methylation data. Methods, 72, 21–28.
  7. Cancer Genome Atlas Research Network et al. (2014) Comprehensive molecular characterization of urothelial bladder carcinoma. Nature, 507, 315–322.
  8. Cattani E. et al. (2005) Solving Polynomial Equations: Foundations, Algorithms, and Applications (Algorithms and Computation in Mathematics). Springer-Verlag New York, Inc, Secaucus, NJ, USA.
  9. Chen D.-P. et al. (2016) Methods for identifying differentially methylated regions for sequence- and array-based data. Brief. Funct. Genomics, elw018.
  10. Chen Y.A. et al. (2013) Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray. Epigenetics, 8, 203–209.
  11. Chung W. et al. (2011) Detection of bladder cancer using novel DNA methylation biomarkers in urine sediments. Cancer Epidemiol. Prevent. Biomark., 20, 1483–1491.
  12. Du P. et al. (2010) Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics, 11, 587.
  13. Fraley C. et al. (2012) mclust Version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report no. 597, Department of Statistics, University of Washington.
  14. Gelman A. et al. (2003) Bayesian Data Analysis. Chapman & Hall/CRC, London, UK.
  15. Hodges E. et al. (2009) High definition profiling of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. Genome Res., 19, 1593–1605.
  16. Irizarry R.A. et al. (2008) Comprehensive high-throughput arrays for relative methylation (CHARM). Genome Res., 18, 780–790.
  17. Jaffe A.E. et al. (2012) Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies. Int. J. Epidemiol., 41, 200–209.
  18. Ji Z. et al. (2017) A rough set bounded spatially constrained asymmetric Gaussian mixture model for image segmentation. PLoS ONE, 12, e0168449.
  19. Keilwagen J. et al. (2014) Area under precision-recall curves for weighted and unweighted data. PLoS ONE, 9, e92209.
  20. Kelly A.D. et al. (2015) Abstract B22: Genome-wide methylation analysis reveals an independently validated CpG island methylator phenotype associated with favorable prognosis in acute myeloid leukemia. Clin. Cancer Res., 21, B22–B22.
  21. Kolde R. et al. (2016) seqlm: an MDL based method for identifying differentially methylated regions in high density methylation array data. Bioinformatics, btw304.
  22. Kretzmer H. et al. (2015) DNA methylome analysis in Burkitt and follicular lymphomas identifies differentially methylated regions linked to somatic mutation and transcriptional control. Nat. Genet., 47, 1316–1325.
  23. Lay F.D. et al. (2015) The role of DNA methylation in directing the functional organization of the cancer epigenome. Genome Res., 25, 467–477.
  24. Lee W., Morris J.S. (2016) Identification of differentially methylated loci using wavelet-based functional mixed models. Bioinformatics, 32, 664–672.
  25. Miranda T.B., Jones P.A. (2007) DNA methylation: the nuts and bolts of repression. J. Cell. Physiol., 213, 384–390.
  26. Molaro A. et al. (2011) Sperm methylation profiles reveal features of epigenetic inheritance and evolution in primates. Cell, 146, 1029–1041.
  27. Murphy K.P. (2007) Conjugate Bayesian analysis of the Gaussian distribution. Technical report, University of British Columbia.
  28. Peters T.J. et al. (2015) De novo identification of differentially methylated regions in the human genome. Epigenet. Chromatin, 8, 1–16.
  29. Rakyan V.K. et al. (2011) Epigenome-wide association studies for common human diseases. Nat. Rev. Genet., 12, 529–541.
  30. Rauch T.A. et al. (2012) DNA methylation biomarkers for lung cancer. Tumor Biol., 33, 287–296.
  31. Reinert T. et al. (2011) Comprehensive genome methylation analysis in bladder cancer: Identification and validation of novel methylated genes and application of these as urinary tumor markers. Clin. Cancer Res., 17, 5582–5592.
  32. Rydén T. (2008) EM versus Markov chain Monte Carlo for estimation of hidden Markov models: a computational perspective. Bayesian Anal., 3, 659–688.
  33. Saito Y. et al. (2014) Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions. Nucleic Acids Res., 42, e45.
  34. Sanchez-Mut J.V. et al. (2016) Human DNA methylomes of neurodegenerative diseases show common epigenomic patterns. Transl. Psychiatry, 6, e718.
  35. Smyth G.K. (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol., 3, Article3.
  36. Sofer T. et al. (2013) A-clustering: a novel method for the detection of co-regulated methylation regions, and regions associated with exposure. Bioinformatics, 29, 2884–2891.
  37. Stelzer Y. et al. (2015) Tracing dynamic changes of DNA methylation at single-cell resolution. Cell, 163, 218–229.
  38. Sun D. et al. (2014) MOABS: model based analysis of bisulfite sequencing data. Genome Biol., 15, R38.
  39. Suzuki M.M., Bird A. (2008) DNA methylation landscapes: provocative insights from epigenomics. Nat. Rev. Genet., 9, 465–476.
  40. Takahashi T. et al. (2013) Estimation of the fraction of cancer cells in a tumor DNA sample using DNA methylation. PLoS ONE, 8, 1–10.
  41. Warden C.D. et al. (2013) COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis. Nucleic Acids Res., 41, e117.
  42. Wentzensen N. et al. (2014) Discovery and validation of methylation markers for endometrial cancer. Int. J. Cancer, 135, 1860–1868.
  43. Wu H. et al. (2015) Detection of differentially methylated regions from whole-genome bisulfite sequencing data without replicates. Nucleic Acids Res., 43, e141.
  44. Yau C. et al. (2010) A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol., 11, R92.
  45. Yukinawa N. et al. (2009) Optimal aggregation of binary classifiers for multiclass cancer diagnosis using gene expression profiles. IEEE/ACM Trans. Comput. Biol. Bioinf., 6, 333–343.
  46. Zhang N. et al. (2015) Predicting tumor purity from methylation microarray data. Bioinformatics, 31, 3401–3405.
  47. Zheng X. et al. (2017) Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol., 18, 17.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
