Journal of Applied Statistics. 2021 Jan 30;49(7):1677–1691. doi: 10.1080/02664763.2021.1877636

Testing differentially methylated regions through functional principal component analysis

Mohamed Milad, Gayla R Olbricht
PMCID: PMC9042039  PMID: 35707559

Abstract

DNA methylation is an epigenetic modification that plays an important role in many biological processes and diseases. Several statistical methods have been proposed to test for DNA methylation differences between conditions at individual cytosine sites, followed by a post hoc aggregation procedure to explore regional differences. While there are benefits to analyzing CpGs individually, there are both biological and statistical reasons to test entire genomic regions for differential methylation. Variability in methylation levels measured by Next-Generation Sequencing (NGS) is often observed across CpG sites in a genomic region. Evaluating meaningful changes in regional level methylation profiles between conditions over noisy site-level measurements is often difficult to implement with parametric models. To overcome these limitations, this study develops a nonparametric approach to detect predefined differentially methylated regions (DMR) based on functional principal component analysis (FPCA). The performance of this approach is compared with two alternative methods (GIFT and M3D), using real and simulated data.

KEYWORDS: Functional principal component, epigenetics, DNA methylation, next-generation sequencing

Introduction

DNA methylation is an epigenetic modification involved in gene silencing and tissue differentiation [17]. The role of DNA methylation in human health has been heavily researched in cancer studies, as specific methylation patterns are associated with cancer [15]. Methylation can alter the function of genes by adding a methyl (CH3) group to DNA at cytosine sites [12]. In mammals, DNA methylation almost always occurs when a methyl group attaches to a cytosine (C) that is followed by a guanine (G) on the DNA sequence (i.e. CpG dinucleotides) [12]. A number of biological processes in mammals (e.g. the silencing of transposable elements, gene expression regulation, genomic imprinting, and X-chromosome inactivation) involve methylation [11]. Although the methylation of CpG locations in promoter regions is linked to gene silencing, recent research indicates that CpG methylation within gene bodies correlates with gene expression in a more complex manner [23].

To obtain quantitative methylation data with base pair resolution across the genome, a bisulfite treatment of DNA is followed by Next-Generation Sequencing (NGS). The bisulfite treatment transforms unmethylated cytosine (C) nucleotides into uracils (U), which amplify as thymine (T) during Polymerase Chain Reaction (PCR) [10], while methylated cytosines remain unchanged. The bisulfite-treated sample is then sequenced via NGS to obtain a library of sequencing reads. After sequencing reads derived from bisulfite-treated DNA are aligned to a reference genome, the methylation status of a cytosine in the reference can be assessed by observing nucleotide differences in the aligned reads. When a C in a bisulfite-treated read remains unchanged compared to a C in the reference, the reference C is methylated at that particular location [7]. However, if a T is present in a bisulfite-treated read in place of a C in the reference, then the reference cytosine is unmethylated for that read [7]. This approach can be applied to the whole genome using methylC-seq [21] or BS-seq [14] methods. However, such studies are often costly for organisms with large genome sizes or for case–control studies where large sample sizes are needed. An alternative way to pair bisulfite sequencing with NGS is to perform Reduced Representation Bisulfite Sequencing (RRBS) [14], which focuses on capturing a subset of the genome. RRBS utilizes restriction enzymes, such as MspI or TaqI, to cleave at CCGG loci so as to select an informative subset of short reads to sequence [7]. This process allows for more accurate and specific results, with greater coverage of CpG-dense regions, including promoters, CpG islands, and repetitive sequences. It reduces the number of nucleotides to be sequenced to 1% of the genome and thus has a lower cost than sequencing all cytosines genome wide [7]. Many statistical issues are shared between whole genome methods and RRBS, but the following discussion is in the context of RRBS for illustrative purposes.
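As a minimal sketch of the read-level logic described above, the following Python snippet estimates the methylation proportion at a single reference cytosine from the bases observed in aligned bisulfite-treated reads. The function name and input format are hypothetical, and real pipelines handle strand, alignment, and base-quality issues that are ignored here.

```python
def methylation_level(read_bases):
    """Proportion of reads supporting methylation at one reference cytosine.

    read_bases: bases observed at the site in aligned bisulfite-treated reads.
    'C' means the cytosine was not converted (methylated in that read);
    'T' means it was converted (unmethylated in that read).
    Any other base is treated as uninformative (an assumption of this sketch).
    """
    bases = list(read_bases)
    methylated = sum(1 for b in bases if b == "C")
    unmethylated = sum(1 for b in bases if b == "T")
    total = methylated + unmethylated
    if total == 0:
        return None  # no informative coverage at this site
    return methylated / total


# Ten hypothetical reads covering one CpG: 7 C, 2 T, 1 uninformative base.
level = methylation_level(["C", "C", "T", "C", "T", "C", "C", "C", "A", "C"])
```

The ratio of methylated reads to informative reads is the site-level quantity that the regional methods discussed below take as input.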

An essential issue in DNA methylation analysis is identifying genomic loci or regions with varying methylation levels related to distinct biological conditions. The individual loci may be noisy (especially in heterochromatin), but the overall regions tend to be informative [3]. Region-level conclusions are also often more meaningful biologically, making it desirable to consider summarizing information across individual loci in a region. Recently, new statistical methods and software tools have been created to identify differentially methylated regions (DMRs) from RRBS data [1, 4, 14]. Most of these methods search for DMRs by first testing each cytosine site, then applying a post-hoc aggregation procedure. Post-hoc aggregation reflects the fact that it is unknown which regions are of interest before testing, thus a procedure is needed to control the type I error rate while also letting the data guide the search for locations of informative regions. One of the first methods developed, BSmooth [8], uses a smoothing process across the genome within each sample to improve the accuracy when estimating the methylation level for any single CpG site. This smoothing process is beneficial since methylation levels of neighboring cytosines are known to be highly correlated [8]. BSmooth distinguishes differentially methylated regions by combining neighboring differentially methylated cytosines (DMCs), which are found using a t-statistic approach, with either a quantile or direct t-statistic cutoff [14]. A majority of the newer methods, such as BiSeq [9] and methylSig [16], also use local smoothing, along with a beta binomial model of methylation at individual cytosine sites; these two methods then combine the results of tests at individual loci to compute a measure of significance for an estimated DMR. Another method called MethylKit [1] uses annotation to provide a statistical test that pools the sequencing reads across an annotated unit (e.g. gene) by group. 
The MethylKit approach is able to test at both the site-level and for predefined regions based on annotation. With multiple samples, a logistic regression with a binary predictor corresponding to condition status is used, which can be expressed as a binomial-based test [1].

Two newer alternative approaches, the generalized integrated functional test (GIFT) [19] and the Maximum Mean Methylation Discrepancy (M3D) method [13], perform tests at the region level based on pre-defined regions. GIFT [19] is a general method that uses functional data analysis to test for regional differential methylation. It estimates the functional relationship between the methylation proportion and the CpG site location using wavelet functions, yielding an estimated methylation profile. Subject-specific functional profiles are first estimated using wavelets and then averaged within groups. An ANOVA-like test is then used to test for group differences in a pre-defined region by comparing the overall functional relationship to the average curve within each group. This method mainly focuses on testing for differential methylation of a pre-defined region.

M3D [13] relies on the Maximum Mean Methylation Discrepancy to assess changes in the shapes of methylation profiles within the local predefined regions being tested. It applies a machine learning technique called Maximum Mean Discrepancy (MMD) [6] to test the homogeneity of the underlying methylation-generating distributions. The method uses a radial basis function (RBF) kernel to construct the MMD between data sets under different conditions in each region being tested; this quantity is then adjusted for changes in coverage profiles. The M3D statistics are compared to a null distribution of observed M3D statistics between replicate pairs [13]. It has been suggested that the shape of the methylation profile is a crucial factor in predicting gene expression, supporting the notion of a functional role for the methylation pattern [22]. In a review of the literature, it appears that only the M3D and GIFT methods consider differences in the shape of the methylation profile over the region. The advantage of the M3D method is that it is sensitive to spatially correlated changes in methylation profiles. GIFT does not account for dependence within and between CpG sites; however, it captures the spike-like features evident in the observed methylation within a region.
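To illustrate the kernel idea behind MMD, the sketch below computes a generic (biased) estimate of the squared MMD between two small samples using an RBF kernel. It is not the M3D implementation: the coverage adjustment and the replicate-based null distribution are omitted, and the function names, the bandwidth `sigma`, and the toy (position, methylation level) observations are all assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, sigma=1.0):
    """RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical (scaled position, methylation level) observations in one region.
condition_a = [(0.1, 0.90), (0.5, 0.80), (0.9, 0.85)]
condition_b = [(0.1, 0.20), (0.5, 0.10), (0.9, 0.15)]

mmd_between = mmd_squared(condition_a, condition_b, sigma=0.5)
mmd_within = mmd_squared(condition_a, condition_a, sigma=0.5)
```

A sample compared with itself gives zero, while samples whose profiles differ give a positive value; the larger the discrepancy in shape, the larger the statistic.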

Building on the strengths of M3D and GIFT, this research explores the use of functional data analysis (FDA) techniques to characterize additional properties of the curve shapes of methylation profiles in genomic regions. Since previous studies have indicated the importance of methylation profile shape in predicting gene expression, FDA techniques could be advantageous in detecting differential profiles that M3D and GIFT are unable to find. Specifically, in this research, a nonparametric approach based on functional principal component analysis (FPCA) is introduced to detect differentially methylated regions (DMRs) from predefined regions, which explicitly accounts for spatial correlations between cytosine sites. FPCA allows investigation of dominant modes of variation in the RRBS data using the eigenfunctions of the methylation profile covariance function. This method can be employed to test for changes in the shape of methylation profiles across regions, as opposed to testing only at individual cytosine sites. This study compares the performance of FPCA to the other existing methods (M3D and GIFT) that test for region-level shape differences, using a simulation based on real RRBS data.

Data Source

To evaluate the performance of the FPCA method, a simulation study based on real RRBS data was performed. Methylation data from bisulfite-sequenced DNA were obtained from 4 patients with acute promyelocytic leukemia (APL) and 12 APL control samples. This data set was obtained under accession number GSE42119 (National Center for Biotechnology Information) [20]. The RRBS data were preprocessed using Bismark version 0.5, a reference genome alignment tool that maps bisulfite-treated sequencing reads to a genome of interest and performs methylation calls in a single step [10]. Additional file 1 shows the average Pearson correlation at three different coverage depths between a reference cytosine and up to 10 downstream cytosines. The correlation between neighboring sites is very high in the controls and moderately high in the APL samples.

Methodology

The methylation profile function is defined as follows. Let $t$ be a genomic position within a genomic region and $T$ be the length of the genomic region being considered. Two conditions (or two groups of samples) were considered: case and control. Assume that $n_A$ random case samples and $n_B$ random control samples are sequenced. Let $x_i(t)$ denote the methylation proportion profile function, covering the genomic position $t$ from the $i$th case sample. Similarly, $y_i(t)$ relates to the $i$th control sample.

First, as a brief review of FPCA [18], consider the following. Let $X(t)$ be a centered, square-integrable function, in this case describing the methylation proportion profile over the predefined region. Let $\phi_1, \phi_2, \ldots$ be the orthonormal eigenfunctions. By the Karhunen-Loève theorem, the centered process in the eigenbasis functions can be expressed as $X(t)=\sum_{k=1}^{\infty}\xi_k\phi_k(t)$, where $\xi_k=\int X(t)\phi_k(t)\,dt$ is the principal component coefficient associated with the $k$th eigenfunction $\phi_k(t)$, with $E(\xi_k)=0$ and $E(\xi_k\xi_l)=0$ for $k\neq l$. Furthermore, the covariance function $R(s,t)$ can be written as $R(s,t)=\mathrm{Cov}(X(s),X(t))=\sum_{k=1}^{\infty}\lambda_k\phi_k(s)\phi_k(t)$ with $\lambda_k=\mathrm{Var}(\xi_k)$.

The first eigenfunction $\phi_1$ represents the principal mode of variation in $X(t)$. That is, $\phi_1(t)$ is the $\phi$ that maximizes the variance of $\xi=\int X(t)\phi(t)\,dt$ [18].

$$\mathrm{Var}(\xi)=\mathrm{Var}\left[\int X(t)\phi(t)\,dt\right]=\iint\phi(s)R(s,t)\phi(t)\,ds\,dt.$$

Similarly, $\phi_k$ is the function that maximizes $\mathrm{Var}(\xi)$ in the functional space that is orthogonal to $\phi_1,\ldots,\phi_{k-1}$. Since $X(t)=\sum_{k=1}^{\infty}\xi_k\phi_k(t)$, the centered process $X(t)$ is equivalent to the vector $(\xi_1,\xi_2,\ldots)$. Also, because $R(s,t)=\mathrm{Cov}(X(s),X(t))=\sum_{k=1}^{\infty}\lambda_k\phi_k(s)\phi_k(t)$, the eigenfunctions $\phi_1, \phi_2, \ldots$ should satisfy $\lambda_1\geq\lambda_2\geq\lambda_3\geq\cdots$, and

$$\int R(s,t)\phi_k(s)\,ds=\lambda_k\phi_k(t). \qquad (1)$$

For any integer $k\geq 1$, solving this equation also finds $\phi_1, \phi_2, \ldots$ [18].
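To make the link between the variance-maximization property and Equation (1) explicit, the following is a brief sketch of the standard Lagrange-multiplier argument. It assumes the normalization $\int\phi(t)^2\,dt=1$ (implied by orthonormality) and the symmetry $R(s,t)=R(t,s)$; the symbol $\mu$ for the multiplier is introduced here only for illustration.

```latex
\begin{align*}
J(\phi) &= \iint \phi(s)\,R(s,t)\,\phi(t)\,ds\,dt
           - \mu\left(\int \phi(t)^2\,dt - 1\right),\\
\frac{\delta J}{\delta \phi(t)} &= 2\int R(s,t)\,\phi(s)\,ds - 2\mu\,\phi(t) = 0
  \;\Longrightarrow\; \int R(s,t)\,\phi(s)\,ds = \mu\,\phi(t),\\
\mathrm{Var}(\xi) &= \int \phi(t)\left[\int R(s,t)\,\phi(s)\,ds\right]dt
  = \mu\int \phi(t)^2\,dt = \mu .
\end{align*}
```

Any stationary point therefore satisfies the eigen-equation, and the variance attained equals the corresponding eigenvalue, so the maximizer is the eigenfunction with the largest eigenvalue, $\mu=\lambda_1$.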

Performing FPCA on DNA methylation data

In the context of this study, FPCA can be performed to find the eigenfunctions and corresponding principal components as follows [2,5]. Let $X(t)=[X_1(t),X_2(t),\ldots,X_N(t)]^T$ be a vector-valued function, with $X_i(t)$ denoting the methylation proportion profile function for the $i$th sample among $N$ replicates in the predefined region. A set of orthonormal basis functions is selected using either a Fourier or B-spline basis. Note that the Fourier basis is typically used for periodic data, while the B-spline basis is used for non-periodic data. Both will be explored since it is unclear which will work best for methylation data. The chosen basis has $P$ functions $\Delta(t)=[\delta_1(t),\delta_2(t),\ldots,\delta_P(t)]^T$, where it is assumed that the methylation functions $X_1(t),\ldots,X_N(t)$ in the predefined region and the eigenfunctions $\{\phi_1, \phi_2, \ldots\}$ can be expressed as linear combinations of $\delta_1(t),\delta_2(t),\ldots,\delta_P(t)$. Now the methylation profile function can be expressed as $X(t)=C\Delta(t)$, where the $ij$th element of the matrix $C$ is $C_{ij}=\int X_i(t)\delta_j(t)\,dt$, with $i=1,\ldots,N$ and $j=1,\ldots,P$. Similarly, $\phi(t)$ can be expressed as $\phi(t)=\Delta^T(t)\beta$, where $\beta=[\beta_1,\ldots,\beta_P]^T$ with $\beta_j=\int\phi(t)\delta_j(t)\,dt$. To find the eigenfunctions $\phi$, or equivalently, to determine $\beta$, use Equation (1), which has the following equivalent expression [18,24]:

$$E\left[\begin{pmatrix}\xi_1\\ \vdots\\ \xi_P\end{pmatrix}\begin{pmatrix}\xi_1 & \cdots & \xi_P\end{pmatrix}\right]\begin{pmatrix}\beta_1\\ \vdots\\ \beta_P\end{pmatrix}=\lambda\begin{pmatrix}\beta_1\\ \vdots\\ \beta_P\end{pmatrix} \qquad (2)$$

Replace $E(\xi_i\xi_j)$ with its empirical estimate from the sample methylation region functions $X_1(t),\ldots,X_N(t)$ to obtain an empirical version of Equation (2):

$$\frac{1}{N}C^TC\beta=\lambda\beta \qquad (3)$$

The eigenfunctions can be found by solving the above multivariate eigenproblem for the eigenvalue ($\lambda$) and eigenvector ($\beta$). The number of eigenfunctions can be chosen based on the percentage of variance explained. In this study, 90% was used, but this can be modified to allow different function approximation accuracies [24].
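The basis-expansion steps above can be sketched in a few lines of NumPy. This is an illustration under simplifying assumptions, not the authors' implementation: positions are rescaled to [0, 1) on a uniform grid, the integrals defining the coefficient matrix are approximated by Riemann sums, only a Fourier basis is shown, and the curves are synthetic. The names `fourier_basis` and `fpca_basis_expansion` are hypothetical.

```python
import numpy as np


def fourier_basis(n_basis, grid):
    """Orthonormal Fourier basis on [0, 1); shape (n_basis, len(grid))."""
    rows = [np.ones_like(grid)]
    freq = 1
    while len(rows) < n_basis:
        rows.append(np.sqrt(2.0) * np.sin(2.0 * np.pi * freq * grid))
        if len(rows) < n_basis:
            rows.append(np.sqrt(2.0) * np.cos(2.0 * np.pi * freq * grid))
        freq += 1
    return np.vstack(rows)


def fpca_basis_expansion(curves, grid, n_basis=5, var_explained=0.90):
    """Solve (1/N) C^T C beta = lambda beta and return the leading components."""
    n_samples = curves.shape[0]
    dt = grid[1] - grid[0]
    delta = fourier_basis(n_basis, grid)             # P x m basis values
    centered = curves - curves.mean(axis=0)          # centre the profiles
    C = (centered @ delta.T) * dt                    # C_ij ~ integral of X_i * delta_j
    eigvals, eigvecs = np.linalg.eigh(C.T @ C / n_samples)
    order = np.argsort(eigvals)[::-1]                # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cum_ratio = np.cumsum(eigvals) / eigvals.sum()
    k = int(np.searchsorted(cum_ratio, var_explained)) + 1  # smallest k reaching the threshold
    scores = C @ eigvecs[:, :k]                      # principal component scores, N x k
    return eigvals[:k], eigvecs[:, :k], scores


# Synthetic profiles: one dominant mode of variation plus a much smaller one.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 100, endpoint=False)
amp = rng.normal(0.0, 0.15, size=(8, 1))
wobble = rng.normal(0.0, 0.005, size=(8, 1))
curves = 0.5 + amp * np.sin(2 * np.pi * grid) + wobble * np.cos(2 * np.pi * grid)

eigvals, betas, scores = fpca_basis_expansion(curves, grid)
```

Each column of `betas` holds the basis coefficients of one eigenfunction, and each row of `scores` holds one sample's principal component scores, which are the inputs to a score-based test.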

Test statistic

The pooled empirical methylation profile $x_i(t)$ of cases and $y_i(t)$ of controls was used to estimate the orthonormal principal component function $\phi_j(t)$, $j=1,\ldots,k$ (eigenfunctions), employing the basis expansion method [24]. Let the corresponding principal components associated with $\phi_j(t)$ be $\xi_{ij}$ and $\eta_{ij}$, for $x_i(t)$ and $y_i(t)$, respectively. The test statistic was defined using the $\xi_{ij}$ and $\eta_{ij}$ to evaluate the difference in average principal component scores between the case and control samples. Vectors of averages of the functional principal component scores in cases and controls are denoted by $\bar{\xi}=[\bar{\xi}_1,\ldots,\bar{\xi}_k]^T$ and $\bar{\eta}=[\bar{\eta}_1,\ldots,\bar{\eta}_k]^T$, where $\bar{\xi}_j=\frac{1}{n_A}\sum_{i=1}^{n_A}\xi_{ij}$ and $\bar{\eta}_j=\frac{1}{n_B}\sum_{i=1}^{n_B}\eta_{ij}$, $j=1,\ldots,k$. The pooled covariance matrix $S$ is defined as follows:

$$S=\frac{1}{n_A+n_B-2}\left(\sum_{i=1}^{n_A}(\xi_i-\bar{\xi})(\xi_i-\bar{\xi})^T+\sum_{i=1}^{n_B}(\eta_i-\bar{\eta})(\eta_i-\bar{\eta})^T\right)$$

where $\xi_i=[\xi_{i1},\ldots,\xi_{ik}]^T$ and $\eta_i=[\eta_{i1},\ldots,\eta_{ik}]^T$. Let $\Lambda=\left(\frac{1}{n_A}+\frac{1}{n_B}\right)S$. Then, the test statistic is defined as $T^2=(\bar{\xi}-\bar{\eta})^T\Lambda^{-1}(\bar{\xi}-\bar{\eta})$. Note that this is a form of a Hotelling’s $T^2$ statistic. Under the null hypothesis of no differential methylation between the case and control group in a specific region, $T^2$ asymptotically follows a central $\chi^2_{(k)}$ distribution, where $k$ is the number of functional principal components. To accurately estimate the p-value, it is best to use a large number of replicates in each group [24]. Note that since the number of regions is determined prior to testing, the false discovery rate can be controlled across the entire set of region level tests.
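As a hedged illustration of the test statistic, the sketch below computes the pooled covariance, the scaled matrix, the $T^2$ value, and an asymptotic chi-square p-value from two small sets of principal component scores. The function names and the toy scores are hypothetical; the chi-square tail probability is evaluated with a closed-form recurrence for integer degrees of freedom so that only `math` and NumPy are needed.

```python
import math

import numpy as np


def chi2_sf(x, k):
    """Upper tail P(chi2_k > x) for integer k >= 1 and x >= 0."""
    y = x / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-y)                 # k = 2 starting value
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))      # k = 1 starting value
    while a < k / 2.0:                           # add two degrees of freedom per step
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def fpca_t2_test(case_scores, control_scores):
    """Hotelling-type T^2 on principal component scores with a chi-square p-value."""
    xi = np.asarray(case_scores, dtype=float)      # n_A x k
    eta = np.asarray(control_scores, dtype=float)  # n_B x k
    n_a, n_b = len(xi), len(eta)
    diff = xi.mean(axis=0) - eta.mean(axis=0)
    dev_a = xi - xi.mean(axis=0)
    dev_b = eta - eta.mean(axis=0)
    S = (dev_a.T @ dev_a + dev_b.T @ dev_b) / (n_a + n_b - 2)  # pooled covariance
    Lam = (1.0 / n_a + 1.0 / n_b) * S
    t2 = float(diff @ np.linalg.solve(Lam, diff))
    return t2, chi2_sf(t2, xi.shape[1])


# Toy scores for k = 2 components, four samples per group.
case = [[1.0, 0.0], [3.0, 0.0], [2.0, 1.0], [2.0, -1.0]]
control = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
t2, p_value = fpca_t2_test(case, control)  # t2 is 12 for these toy scores
```

With only four samples per group the chi-square approximation is rough, which is consistent with the remark above that larger numbers of replicates give more accurate p-values.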

Simulation study

To mimic methylation profile changes accurately, a simulation was constructed from the RRBS data set described above following the same approach as in M3D [13]. The regions (CpG clusters) were defined as follows: (1) CpG sites covered in at least 75% of samples were defined as frequently covered CpG sites and (2) a maximum distance of 100 base pairs to the nearest neighbor within a region was accepted. Using these criteria, only regions with at least 20 frequently covered CpG sites were used in the analysis [9]. The simulation study focused on the first 1000 regions on chromosome 1. Out of the 12 control samples in the RRBS data, 4 were randomly selected to serve as controls in the simulation study. Four more replicates were simulated 100 times to be the testing group (i.e. cases). Of the 1000 regions, 250, 100, or 50 CpG clusters (predefined regions), depending on the simulation setting, were randomly selected to receive differential methylation changes. The replicates that acted as the testing group (cases) were simulated by first adding or subtracting random Poisson ($\lambda=1$) noise to the total number of reads at each cytosine. Uniform [−0.1, 0.1] random noise was added to cytosine methylation levels. The methylation level $L_i$, defined as the ratio of methylated reads to total reads mapped to a particular cytosine, was adjusted within the selected predefined regions [13]. The degree of methylation level change was controlled by the parameter $\alpha\in[0,1]$; new methylation levels were simulated by $L_i^{new}=(1-\alpha)L_i^{old}+\alpha$ when $L_i^{old}<0.5$ for hypermethylation (methylation higher in case than control) and $L_i^{new}=(1-\alpha)L_i^{old}$ when $L_i^{old}>0.5$ for hypomethylation (methylation lower in case than control) [13].
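The level-shift rule at the end of the paragraph above can be written as a small function. This sketch covers only the deterministic adjustment; the Poisson and uniform noise steps are left out, the function name is hypothetical, and the handling of a level of exactly 0.5 (left unchanged) is an assumption, since the text does not specify it.

```python
def simulate_level(old_level, alpha):
    """Shift a methylation level by strength alpha in [0, 1].

    Levels below 0.5 are pushed towards 1 (hypermethylation);
    levels above 0.5 are pushed towards 0 (hypomethylation).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if old_level < 0.5:
        return (1.0 - alpha) * old_level + alpha
    if old_level > 0.5:
        return (1.0 - alpha) * old_level
    return old_level  # exactly 0.5: not covered by the rule, left unchanged here


hyper = simulate_level(0.2, 0.4)   # (1 - 0.4) * 0.2 + 0.4
hypo = simulate_level(0.9, 0.4)    # (1 - 0.4) * 0.9
```

With alpha = 0 the levels are unchanged, and with alpha = 1 every affected site becomes fully methylated or fully unmethylated, which is why larger alpha values correspond to stronger, easier-to-detect changes.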

FPCA, M3D and GIFT were applied to all 100 simulated data sets under various settings. For FPCA, both Fourier and B-spline bases were investigated. In general, 15–37 knots, with polynomial order 4, seemed to be a reasonable model for the FPCA-B-spline. To investigate the performance of the methods under different degrees of differential methylation, the alpha parameter was varied as α = {0.4, 0.6, 0.8, 1}. To examine the robustness of the methods to various experimental design features, two different minimum sequencing depths (at least 5 and 20 reads) and three different replicate numbers per group (3, 8, 12) were simulated. Methods were compared by calculating the average type I and type II error rates across 100 data sets as well as the true positive rate (TPR). The false discovery rate (FDR) was controlled at 0.05 for all analyses [20].
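The text states that the FDR was controlled at 0.05 but does not name the procedure here. Purely as an assumption, the sketch below shows the Benjamini-Hochberg step-up procedure, one common way to control the FDR across a fixed set of region-level p-values; the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans marking which hypotheses are rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p-value
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank          # largest rank meeting its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected


# Five hypothetical region-level p-values.
calls = benjamini_hochberg([0.001, 0.045, 0.029, 0.5, 0.012], fdr=0.05)
```

Because the regions are predefined, the number of tests is known in advance, which is what allows a procedure of this kind to be applied across the whole set of region-level tests.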

Results

Simulation results

The results using FPCA were compared with the results using M3D and GIFT. Table 1 summarizes the results obtained for different values of the methylation change strength parameter α and different basis expansions, based on an average sequencing depth of at least 20 reads, where the proportion of true differentially methylated regions is 25% of 1000 regions simulated. The average and standard deviation for the correct number of DMRs is given along with type I and type II errors for each method. Of the 250 true DMRs, FPCA under the Fourier expansion approach identified 229.85 on average, with 3.93 falsely called DMRs when α = 100%. The FPCA under the B-spline expansion approach identified 229.03 true DMRs on average, with 4.08 falsely called DMRs when α = 100%. M3D identified 224.51 true DMRs on average, with no falsely called DMRs and GIFT identified 227.99 true DMRs on average with 4.12 falsely called DMRs at α = 100%.

Table 1.

Results for average of 100 simulations based on FPCA-Fourier, FPCA-BSpline, M3D, and GIFT (wavelet smoothing approach) on average sequencing depth (at least 20 reads), with various levels of strength of methylation change (α). Proportion of true differentially methylated regions is 25% of 1000 regions simulated.

Strength (α)  Method        Correct  S.D.   Type-1  S.D.   Type-2  S.D.
100%          FPCA-Fourier  229.85   0.796  3.93    0.794  20.15   0.796
100%          FPCA-Bspline  229.03   0.784  4.08    0.977  20.97   0.78
100%          M3D           224.51   0.502  0       0      25.49   0.502
100%          GIFT          227.99   0.781  4.12    0.966  22.01   0.68
80%           FPCA-Fourier  229.02   0.84   3.07    0.877  20.98   0.84
80%           FPCA-Bspline  227.03   0.809  3.41    1.09   22.97   0.809
80%           M3D           222.94   0.502  0       0      27.06   0.502
80%           GIFT          226.2    0.781  4.98    0.966  23.8    0.68
60%           FPCA-Fourier  219.82   0.783  2.97    0.892  30.18   0.783
60%           FPCA-Bspline  219.05   0.845  3.03    0.934  30.95   0.845
60%           M3D           202.95   0.757  0       0      47.05   0.757
60%           GIFT          217.9    0.613  3.12    0.966  32.1    0.68
40%           FPCA-Fourier  212.5    0.833  2.47    0.501  37.5    0.833
40%           FPCA-Bspline  207.88   0.794  2.46    0.9    42.12   0.794
40%           M3D           190.07   0.781  0       0      59.93   0.781
40%           GIFT          206.33   0.781  2.12    0.966  42.67   0.65

FPCA-Fourier correctly identified 229.02 DMRs on average, while FPCA-Bspline correctly identified 227.03 at a methylation level difference of 80%, with 3.07 and 3.41 falsely called DMRs on average respectively at α = 80%. GIFT identified 226.20 true DMRs on average with 4.98 falsely called DMRs and M3D called 222.94 true DMRs on average, with no falsely called DMRs when α = 80%. At α = 60%, the FPCA-Fourier and FPCA-B-spline correctly identified 219.82 and 219.05 DMRs on average, respectively, with 2.97 and 3.03 falsely called DMRs on average; whereas M3D and GIFT called 202.95 and 217.9 correct DMRs on average respectively, with no falsely called DMRs for M3D and 3.12 falsely called DMRS for GIFT. At α = 40% the FPCA-Fourier and FPCA-Bspline correctly identified 212.5 and 207.88 DMRs on average, respectively, with 2.47 and 2.46 falsely called DMRs on average; whereas M3D and GIFT correctly called 190.07 and 206.33 DMRs on average, with no falsely called DMRs for M3D and 2.12 falsely called DMRS for GIFT on average.

In conclusion, all methods had a low average type I error rate with the maximum being 0.0066 in GIFT when α = 80%. It should be noted that M3D did not produce any type I errors, making it the most conservative of the methods but at the sacrifice of higher type II errors (i.e. lower power). M3D had higher type II errors across all values of α than both the FPCA and GIFT methods. Results were similar for FPCA-Fourier and FPCA-Bspline, with FPCA-Fourier having slightly lower type II errors and thus giving it a slight advantage.
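The rates quoted above follow directly from the average counts in Table 1. As a worked example, the snippet below reproduces them for GIFT at α = 80%. The function name is hypothetical, and the denominator for the type I error rate (the 750 regions that are not true DMRs) is inferred from the reported value of 0.0066 rather than stated explicitly.

```python
def error_rates(correct, false_calls, n_true=250, n_regions=1000):
    """Convert average counts into TPR, type I and type II error rates."""
    tpr = correct / n_true                         # true positive rate
    type2 = (n_true - correct) / n_true            # missed true DMRs
    type1 = false_calls / (n_regions - n_true)     # false calls among non-DMRs
    return tpr, type1, type2


# GIFT at alpha = 80% (Table 1): 226.2 correct calls, 4.98 false calls on average.
tpr, type1, type2 = error_rates(correct=226.2, false_calls=4.98)
```

The same arithmetic applied to the M3D column gives a type I error rate of zero and the largest type II error rates, in line with its description as the most conservative method.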

Across all methods, there were fewer type II errors as α increased from 40% to 100%, which is expected since it is easier to detect more extreme differences. However, it is notable that for small α values, α = 0.40, 0.60, there are more extreme differences between M3D and the FPCA methods. This indicates FPCA and GIFT can improve DMR detection in more difficult situations when ‘the signal’ is low.

In contrast, Table 2 displays the results based on an average sequencing depth of at least 5 reads, where the proportion of true differentially methylated regions is 25% of the 1000 regions simulated. At methylation strength 100%, FPCA-Fourier and FPCA-B-spline called 223.88 and 222.5 DMRs on average, respectively, out of the 250 true DMRs, with 5.97 and 6.2 false DMRs. M3D called only 200.04 true DMRs on average, with no false DMRs. GIFT called 221.19 true DMRs on average, with 6.07 false DMRs. The number of correctly identified DMRs decreased for FPCA-Fourier, FPCA-Bspline, M3D and GIFT as the strength of methylation change decreased from α = 80% to 40%, as was also observed in Table 1.

Table 2.

Results for average of 100 simulations based on FPCA-Fourier, FPCA-BSpline, M3D, and GIFT (wavelet smoothing approach) on average sequencing depth (at least 5 reads), with various levels of strength of methylation change (α). Proportion of true differentially methylated regions is 25% of 1000 regions simulated.

Strength (α)  Method        Correct  S.D.   Type-1  S.D.    Type-2  S.D.
100%          FPCA-Fourier  223.88   0.819  5.97    1.041   26.12   0.819
100%          FPCA-Bspline  222.5    0.885  6.2     1.47    27.5    0.885
100%          M3D           200.04   0.815  0       0       49.96   0.8155
100%          GIFT          221.19   0.823  6.07    1.054   28.81   0.829
80%           FPCA-Fourier  219.15   0.832  6.05    1.426   30.85   0.821
80%           FPCA-Bspline  219.08   0.812  6.15    1.431   30.92   0.812
80%           M3D           197.13   0.824  0       0       52.87   0.824
80%           GIFT          218.11   0.813  5.05    1.055   31.89   0.828
60%           FPCA-Fourier  211.06   0.826  3.94    0.887   38.94   0.826
60%           FPCA-Bspline  210.09   0.829  3.71    1.112   39.91   0.829
60%           M3D           178      0.804  0       0       72      0.804
60%           GIFT          207.18   0.803  3.91    1.054   42.82   0.819
40%           FPCA-Fourier  202.02   0.843  4.54    0.8946  47.98   0.843
40%           FPCA-Bspline  197.93   0.791  5.07    1.029   52.07   0.791
40%           M3D           170.05   0.808  0       0       79.95   0.808
40%           GIFT          196.03   0.813  3.07    1.004   53.97   0.849

Overall, similar trends were observed for at least 5 reads as were found for at least 20 reads. All the methods had higher type II errors for at least 5 reads than for at least 20 reads while still maintaining a low type I error rate on average. It should be noted that when the sequencing depth is at least 5 reads there are more drastic differences between the FPCA, GIFT and M3D methods, even for the largest α = 100% (i.e. across all α). Results for simulations where the true proportion of differentially methylated regions is 5% and 10% of the 1000 regions are presented in Additional file 2. Results similar to those in Table 1 are provided to compare all four methods based on an average sequencing depth of at least 20 reads. We also evaluated the performance of the methods when none (0%) of the regions were differentially methylated, to determine the type I error under the null. None of the methods detected any false positive DMRs.

Figure 1 shows the average true positive rates (TPRs) over the 100 simulated data sets for varying degrees of differential methylation (α values) for each of the four methods (FPCA-Fourier, FPCA-Bspline, M3D, and GIFT) and two coverage depths (at least 5 and 20 reads), where the true proportion of differentially methylated regions is 25%. The FPCA-Fourier method had the highest average TPR at average sequencing depths of at least 5 and at least 20 reads across all α values, with FPCA-Bspline yielding similar but slightly lower TPRs. GIFT had a lower average TPR than both FPCA-Fourier and FPCA-Bspline. Overall, FPCA-Fourier and FPCA-Bspline outperformed GIFT and M3D with regard to TPR at both average sequencing depths (at least 5 and 20 reads), across all levels of differential methylation strength. The coverage is also important to investigate, since low coverage can lead to less stable methylation estimates and prevent statistical significance, while high coverage costs more to obtain. Figure 1 shows that a sequencing depth of at least 20 reads yields a higher average TPR than a depth of at least 5 reads. However, this difference is more drastic for M3D than it is for the FPCA and GIFT methods. The FPCA and GIFT methods maintain an average TPR between 79% and 90% for a depth of 5 reads, whereas the M3D TPR ranges from 68% to 80%.

Figure 1.


True positive rates based on the average over 100 simulations at depths of at least 5 reads (left graph) and at least 20 reads (right graph). The proportion of true differentially methylated regions is 25% of the 1000 regions simulated. TPR versus the α level controlling the degree of differential methylation for each of four methods: FPCA (Fourier expansion approach), FPCA (B-spline expansion approach), M3D and GIFT (wavelet smoothing approach).

Robustness in replications

To examine the robustness of the FPCA method to changes in replication number, simulated data sets were created for differing numbers of replicates per group. Control samples from the real RRBS data set were used as the control groups for 3, 8 and 12 replicates per group. This was possible since the data set contained 12 control samples. A set of 3, 8, or 12 replicates was simulated as previously described to act as the case groups. As before, the same 250 regions were simulated to be true DMRs using α = 80% and a coverage of 20 reads. The FPCA-Fourier basis function method was used to identify DMRs with 3, 8 and 12 replicates per group, since this method performed best for at least 20 reads, and the results were compared. The false discovery rate (FDR) was controlled at 5%. The FPCA-Fourier method identified 179, 193, and 216 true DMRs out of the total of 250, with 3, 8, and 12 replicates per group, respectively.

As shown in Figure 2, the overlap between the three sets of true DMRs identified accounts for 70% of the total. As expected, the test lost power with lower replication: 12 replicates per group identified the most unique true DMRs and had the lowest number of type II errors, while the highest number of type II errors occurred for three replicates per group. More similarity was observed between the simulations with eight and 12 replicates, as they shared 15 true DMRs uniquely, whereas the simulation with three replicates had no unique overlap with eight or 12 replicates. Overall, the type II error rates ranged from 13.6% in the 12-replicate case to 28.4% in the three-replicate case. Type I error was low for all three cases, with the lowest being 0.13% for 12 replicates and the highest being 0.53% for three replicates. This shows that while more replicates are better, the FPCA-Fourier method exhibits a reasonable amount of robustness to smaller replicate numbers per group.

Figure 2.


Venn diagram of true DMRs detected with FPCA-Fourier, for 3, 8, and 12 replicates per group. The number and percentage of Type I and Type II errors are also given for each replicate number.

DMRs detected in real data

An analysis was completed using the real RRBS data with four samples from bone marrow patients with acute promyelocytic leukemia (APL) and four control samples (APL in remission). All CpG sites (with at least 20 reads) across all samples were used, including all regions with start and stop locations. Since this data set provided a coverage of at least 20 reads, the FPCA-Fourier method was applied because it performed best under that setting, and its results were compared to those of GIFT and M3D.

Out of 14,000 CpG regions selected for testing, FPCA-Fourier and GIFT identified 3897 and 3488 DMRs and M3D identified 2603 DMRs total, with 1222 DMRs in common. Figure 3 confirms that the FPCA-Fourier method identified a unique group of changed profiles between the two conditions in the real data sets. The false discovery rate (FDR) was controlled at 0.05 for all analyses [20].

Figure 3.


Venn diagram comparing the number of significant differentially methylated regions (DMRs) identified by the FPCA-Fourier, M3D, and GIFT methods in the Real APL RRBS data.

Discussion

This research demonstrates that information from reduced representation bisulfite sequencing (RRBS) datasets can be analyzed using higher-order mathematics, specifically a functional data analysis approach. Here, a dimension reduction approach is presented, based on the Karhunen-Loève transform, to create a hypothesis test for differentially methylated regions (DMRs) using functional principal components based on the spatial features of methylation profiles. This allows the investigation of dominant modes of variation in the methylation profile over a region using eigenfunctions of the covariance function. The FPCA in this study employs a few principal components, which increases the power and reduces the degrees of freedom in testing, making the underlying biological signals stable. An FPCA based on Fourier and B-spline functions was developed that successfully detects information from the shapes of the methylation curves that cannot be identified by traditional multivariate statistics, and that tests for differentially methylated regions between case and control groups. An empirical comparison, using a simulation based on real data, showed a substantial increase in the true positive rate for FPCA in comparison with the M3D approach, as well as considerable robustness with respect to coverage depth and replication. In general, the simulation results were similar for FPCA-Fourier and FPCA-Bspline, with FPCA-Fourier having slightly lower type II errors across most of the simulation settings, thus giving it a slight advantage.

The good performance of the FPCA method is attributable to two main factors. First, the method takes spatial correlation into account in analyzing the methylation profile. Second, the FPCA translates high-dimensional DNA methylation data into a few principal components, which greatly reduces the degrees of freedom in testing while preserving most of the underlying biological signal. Overall, FPCA-Fourier and FPCA-B-spline substantially outperformed GIFT and M3D with regard to TPR at both average sequencing depths (at least 5 and 20 reads), across all levels of differential methylation strength.
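To make the second point concrete, the sketch below tests for a group difference using only a few principal component scores per sample. It is a simple stand-in, an exact permutation test on the squared distance between group mean score vectors, and is not the test statistic used in this study. The function name `permutation_test_scores` and the example scores are hypothetical.

```python
from itertools import combinations
import numpy as np

def permutation_test_scores(scores, n_case):
    """Exact two-group permutation test on principal component scores.

    scores: (n_samples, n_components); the first n_case rows are cases.
    Statistic: squared Euclidean distance between the group mean score vectors.
    Enumerates every relabelling, so it is only practical for small samples.
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]

    def stat(case_idx):
        mask = np.zeros(n, dtype=bool)
        mask[list(case_idx)] = True
        diff = scores[mask].mean(axis=0) - scores[~mask].mean(axis=0)
        return float(diff @ diff)

    observed = stat(range(n_case))
    perm_stats = np.array([stat(c) for c in combinations(range(n), n_case)])
    # Small tolerance so labellings tied with the observed one count as extreme
    p_value = float(np.mean(perm_stats >= observed - 1e-12))
    return observed, p_value

# Hypothetical scores on two components for three cases and three controls
demo_scores = [[-1.0, 0.0], [-1.1, 0.1], [-0.9, -0.1],
               [1.0, 0.0], [1.1, 0.1], [0.9, -0.1]]
observed, p_value = permutation_test_scores(demo_scores, n_case=3)
```

Because the test operates on a handful of scores rather than on every CpG site, the comparison involves far fewer quantities than a site-by-site multivariate test would. With very small groups the smallest attainable p-value is limited by the number of possible relabellings.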

The methodology proposed and illustrated here builds on the interpretation of next-generation sequencing data. The FPCA method can be applied in cancer research as well as in the pursuit of therapies to combat or prevent lupus, muscular dystrophy, and other diseases. For example, because hypermethylation occurs early in colon cancer, it could be an important indicator of potential health problems, and the FPCA method might be used to detect it. Future studies are needed to investigate the use of other functional data analysis techniques, such as functional linear regression or functional canonical correlation analysis, as well as the incorporation of smoothing penalties into the analysis. Although the FPCA framework was investigated using RRBS data, it should scale well to whole genome bisulfite sequencing studies, but this should be investigated more fully. Finally, although the FPCA method was robust in detecting DMRs under low coverage and low replication in two-group comparisons, it is of interest to extend the method to experiments that require testing for differences among more than two groups or that include covariate information.

Supplementary Material

Supplemental Material

Acknowledgements

We thank two anonymous reviewers for the opportunity to revise and improve our work. We have strengthened our paper based on the constructive comments from the reviewers by (1) expanding our simulation study by adding different proportions of true DMRs to our simulation settings, (2) comparing our method to one additional method (GIFT) for pre-defined DMR detection, (3) investigating the correlation between methylation levels of neighboring CpG sites in the real data, and (4) clarifying the writing to address points that may have caused misunderstanding.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The data set was obtained from the National Center for Biotechnology Information (https://www.ncbi.nlm.nih.gov/) under accession number GSE42119.

References

  • 1. Akalin A., Garrett-Bakelman F.E., Kormaksson M., Busuttil J., Zhang L., Khrebtukova I., Milne T.A., Huang Y., Biswas D., Hess J.L., Allis C.D., Roeder R.G., Valk P.J.M., Löwenberg B., Delwel R., Fernandez H.F., Paietta E., Tallman M.S., Schroth G.P., Mason C.E., Melnick A., and Figueroa M.E., Base-pair resolution DNA methylation sequencing reveals profoundly divergent epigenetic landscapes in acute myeloid leukemia. PLoS Genet. 8 (2012), pp. 1002781. Available at 10.1371/journal.pgen.1002781.
  • 2. Aguilera A., Aguilera-Morillo M.C., Escabias M., and Valderrama M., Penalized spline approaches for functional principal component logit regression, in Recent Advances in Functional Data Analysis and Related Topics, Ferraty F., ed., Springer, Berlin, 2011. pp. 1–8.
  • 3. Aryee M.J., Jaffe A.E., Corrada-Bravo H., Ladd-Acosta C., Feinberg A.P., Hansen K.D., and Irizarry R.A., Minfi: A flexible and comprehensive bioconductor package for the analysis of infinium DNA methylation microarrays. Bioinformatics. 30 (2014), pp. 1363–1369.
  • 4. Benoukraf T., Wongphayak S., Hadi L.H., Wu M., and Soong R., GBSA: A comprehensive software for analyzing whole genome bisulfite sequencing data. Nucleic Acids Res. 41 (2013), pp. e55.
  • 5. Ferraty F. and Romain Y. (eds.), The Oxford Handbook of Functional Data Analysis, Oxford University Press, New York, 2011.
  • 6. Gretton A., Borgwardt K.M., Rasch M., Scholkopf B., and Smola A.J., A kernel method for the two-sample-problem. Adv. Neural. Inf. Process. Syst. 19 (2007), pp. 513.
  • 7. Gu H., Smith Z.D., Bock C., Boyle P., Gnirke A., and Meissner A., Preparation of reduced representation bisulfite sequencing libraries for genome-scale DNA methylation profiling. Nat. Protoc. 6 (2011), pp. 468–481.
  • 8. Hansen K.D., Langmead B., and Irizarry R.A., BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome Biol. 13 (2012), pp. R83.
  • 9. Hebestreit K., Dugas M., and Klein H.-U., Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics. 29 (2013), pp. 1647–1653.
  • 10. Krueger F., Kreck B., Franke A., and Andrews S.R., DNA methylome analysis using short bisulfite sequencing data. Nat. Methods 9 (2011), pp. 145–151.
  • 11. Li E. and Zhang Y., DNA methylation in mammals. Cold. Spring. Harb. Perspect. Biol. 6 (2014), pp. 019133.
  • 12. Lim D.H.K. and Maher E.R., DNA methylation: A form of epigenetic control of gene expression. TOG 12 (2010), pp. 37–42.
  • 13. Mayo T., Schweikert G., and Sanguinetti G., M3d: A kernel-based test for spatial correlated changes in methylation profiles. Bioinformatics. 31 (2015), pp. 809–816.
  • 14. Meissner A., Gnirke A., Bell G.W., Ramsahoye B., Lander E.S., and Jaenisch R., Reduced representation bisulfite sequencing for comparative high-resolution DNA methylation analysis. Nucleic Acids Res. 33 (2005), pp. 5868–5877.
  • 15. Merlo A., Herman J.G., Mao L., Lee D.J., Gabrielson E., Burger P.C., Baylin S.B., and Sidransky D., 5′ CpG island methylation is associated with transcriptional silencing of the tumor suppressor p16/CDKN2/MTS1 in human cancers. Nature Med. 1 (1995), pp. 686–692.
  • 16. Park Y., Figueroa M.E., Rozek L.S., and Sartor M.A., Methylsig: A whole genome DNA methylation analysis pipeline. Bioinformatics. 30 (2014), pp. 2414–2422.
  • 17. Phillips T., The role of methylation in gene expression. Nat. Educ. 1 (2008), pp. 116.
  • 18. Ramsay J.O. and Silverman B.W., Functional Data Analysis, Springer, New York, 2005.
  • 19. Ryu D., Xu H., George V., Su S., Wang X., Shi H., and Podolsky R.H., Differential methylation tests of regulatory regions. Stat. Appl. Genet. Mol. Biol. 15 (2015), pp. 237–251.
  • 20. Benjamini Y. and Hochberg Y., Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 57 (1995), pp. 289–300.
  • 21. Urich M.A., Nery J.R., Lister R., Schmitz R.J., and Ecker J.R., MethylC-seq library preparation for base-resolution whole-genome bisulfite sequencing. Nat. Protoc. 10 (2015), pp. 475–483.
  • 22. VanderKraats N.D., Hiken J.F., Decker K.F., and Edwards J.R., Discovering high-resolution patterns of differential DNA methylation that correlate with gene expression changes. Nucleic Acids Res. 41 (2013), pp. 6816–6827.
  • 23. Varley K.E., Gertz J., Bowling K.M., Parker S.L., Reddy T.E., Pauli-Behn F., Cross M.K., Williams B.A., Stamatoyannopoulos J.A., Crawford G.E., and Absher D.M., Dynamic DNA methylation across diverse human cell lines and tissues. Genome Res. 23 (2013), pp. 555–567.
  • 24. Xiong H., Brown J.B., Boley N., Bickel P.J., and Huang H., DE-FPCA: Testing gene differential expression and exon usage through functional principal component analysis, in Statistical Analysis of Next Generation Sequencing Data, Datta S. and Nettleton D., eds., Springer International Publishing, Switzerland, 2014. pp. 129–143.


Articles from Journal of Applied Statistics are provided here courtesy of Taylor & Francis
