Nature Communications. 2019 Jul 15;10:3113. doi: 10.1038/s41467-019-10864-z

Detection of cell-type-specific risk-CpG sites in epigenome-wide association studies

Xiangyu Luo 1,2, Can Yang 3, Yingying Wei 2
PMCID: PMC6629651  PMID: 31308366

Abstract

In epigenome-wide association studies, the measured signals for each sample are a mixture of methylation profiles from different cell types. Current approaches to association detection only claim whether a cytosine-phosphate-guanine (CpG) site is associated with the phenotype at the aggregate level and can suffer from low statistical power. Here, we propose a statistical method, HIgh REsolution (HIRE), which not only improves the power of association detection at the aggregate level compared with existing methods but also enables the detection of risk-CpG sites for individual cell types.

Subject terms: Computational biology and bioinformatics, Biomarkers, Mathematics and computing, Epigenomics, DNA methylation


Cellular heterogeneity is one of the major confounding factors in EWAS studies. Here the authors present a statistical method, HIgh REsolution (HIRE), which enables the detection of risk-CpG sites for individual cell types.

Introduction

Epigenome-wide association studies (EWAS) aim to identify cytosine-phosphate-guanine (CpG) sites associated with phenotypes of interest, such as disease status1–3, smoking history4,5, body mass index6, and age7,8. However, because the samples in EWAS are measured at the bulk level rather than at the single-cell level, the obtained methylome for each sample shows the signals aggregated from distinct cell types3,9,10, which leads to two main challenges in the analysis of EWAS data. On the one hand, the cell type compositions differ among samples and can be associated with phenotypes3,10. Both binary phenotypes, such as the diseased or normal status3, and continuous phenotypes, such as age10, have been found to affect the cell type compositions. As a result, ignoring the cellular heterogeneity in EWAS can lead to many spurious associations10–13. On the other hand, the phenotype may change the methylation level of a CpG site in some but not all of the cell types. Identification of the exact cell types that carry the risk-CpG sites can deepen our understanding of disease mechanisms. However, such identification is challenging because only the aggregated-level signals can be observed.

To the best of our knowledge, no existing statistical method for EWAS can detect cell-type-specific associations despite active research to account for cell-type heterogeneity. The existing approaches can be categorized into two schools14: reference-based and reference-free methods. The reference-based methods9,15 require the reference methylation profiles for each cell type to be known a priori, and they regress the aggregated methylation levels observed from each sample on the same set of references to learn the sample’s cellular compositions. However, because samples have different attributes, such as age and gender, the methylation levels of a given cell type can vary among samples. It is thus problematic to assume that all of the samples have the same set of reference profiles10,14. Furthermore, high-quality references are difficult to obtain for most EWAS due to the existence of unknown cell types, the high cost of cell sorting, and confounding effects14. Consequently, a large amount of recent EWAS literature was devoted to identification of risk-CpG sites without the need for the reference methylation profiles.

The reference-free methods can generally be further divided into two classes according to whether they estimate the cell-type mixing proportions directly. The direct-decomposition-based procedures consist of two stages. In the first stage, they simultaneously estimate the cellular compositions of each sample and the cell-type-specific reference methylomes via quadratic programming16; in the second stage, they treat the estimated cell-type proportions as covariates with additive effects in the linear models to conduct association tests. However, when estimating cellular compositions during the first stage, the direct-decomposition-based methods also do not consider samples’ phenotype information and thus suffer from the same bias in the cellular composition estimates as the reference-based approaches9. Moreover, similar to tumor purity17, we argue that the estimated cellular composition has a multiplicative rather than an additive effect on the observed methylation level (Methods). The second class of methods, which is exemplified by SVA18, RefFreeEWAS19, and ReFACTor13, does not carry out cell-type decompositions. These methods resort to singular value decomposition, which includes principal component analysis, to construct surrogates for the underlying cell-type composition. EWASher, a linear mixed model, also belongs to this class because it is equivalent to the use of principal components as fixed-effect covariates11. However, the use of principal components as covariates in the regression suffers from the same issue of additive effects as the direct-decomposition-based methods. Therefore, the existing reference-free methods have low power in detecting risk-CpG sites12.

Although the existing methods aim to address the cellular heterogeneity problem in EWAS and claim whether a CpG site is associated with phenotypes at the aggregate level, none of them can identify the risk-CpG sites for each individual cell type, thus missing the opportunity to obtain finer-grained results in EWAS.

Here, we propose a method, HIRE, to identify the association in EWAS at a HIgh REsolution: detecting whether a CpG site has any associations with the phenotypes in each cell type (Methods). The keys to HIRE’s success are twofold. First, HIRE links the underlying cell-type-specific methylation profiles for each sample to the sample’s phenotypes, thus avoiding the bias in estimating the cellular composition by the reference-based and direct-decomposition-based methods. Second, HIRE correctly characterizes the cellular compositions as the multiplicative effects, whereas the existing methods inappropriately treat the cell proportions as additive effects (Methods). HIRE is applicable to EWAS with binary phenotypes, continuous phenotypes, or both. By helping researchers understand in which cell types the CpG sites are affected by a disease, HIRE can ultimately facilitate the development of epigenetic therapies by targeting the specifically affected cell types.

Results

Method overview

HIRE is a hierarchical model that closely follows the data generation process. Its elaborate modeling depicts how phenotypes affect the methylation levels of each sample. Here, we briefly introduce the method. The technical details are provided in the Methods section and the Supplementary Methods.

Let us first review the cornerstone of most EWAS approaches. These methods model the observed methylation levels of the m CpG sites for sample i, $O_i = (O_{1i}, O_{2i}, \ldots, O_{mi})^T$, as the weighted average of the methylation profiles of K cell types, $u_i = (u_{i1}, u_{i2}, \ldots, u_{iK})$. The weights are the cellular compositions $p_i = (p_{1i}, p_{2i}, \ldots, p_{Ki})^T$ of sample i (see the top panel of Fig. 1a). However, regardless of whether the reference is known a priori or not, the existing methods assume that the cell-type-specific methylation profiles $u_i$ remain the same for all samples: $u_i = M$, for i = 1, …, n. Unfortunately, because the methylation levels can actually change with covariates such as age and disease status, ignoring the covariates’ effects and enforcing static reference methylomes can bias the estimation of $p_i$ and thus affect all downstream analyses14. More importantly, the assumption that cell-type-specific methylation profiles are the same for each sample prevents the detection of cell-type-specific risk-CpG sites.

Fig. 1.

A simple cartoon illustration of the HIRE model with three cell types (K = 3) and two phenotypes (disease status and age; q = 2). a Data generation procedure for the observed methylation vector $O_i$ for sample i (i = 1, …, n). In the top panel, $O_i$ is the convolution of cell-type-specific methylation profiles $u_i$ with cellular compositions $p_i$. Both $u_i$ and $p_i$ depend on the attributes of sample i. The bottom panel describes how sample i’s phenotypes affect $u_i$ via two phenotype-effect matrices $B_1$ and $B_2$. In $B_1$ and $B_2$, the white square represents zero, which indicates that the phenotype exerts no influence on the corresponding methylation level in $u_i$. b Inputs and outputs of HIRE. We input the observed methylation matrix O, the phenotype data matrix X, and a predetermined cell type number K into HIRE, and HIRE outputs the estimates for the cellular compositions $\hat{p}$, the baseline methylation profiles $\hat{\mu}$, the phenotype effects $\hat{B}$, and the penalized BIC value. In addition, HIRE tests whether there is any association between CpG site j and phenotype $\ell$ in cell type k ($H_0: \beta_{jk\ell} = 0$ vs $H_1: \beta_{jk\ell} \neq 0$) and provides the p-values

For association detection at the aggregate level, after estimation of $p_i$ using the deconvolution-based approach or its surrogates from principal component-based methods, the existing methods examine a linear model in which the phenotypes $x_i = (x_{i1}, \ldots, x_{i\ell}, \ldots, x_{iq})^T$ and the cellular proportions $p_i$ exert additive effects on the methylation level $O_i$:

$$\mathrm{E}(O_{ji}) = \sum_{k=1}^{K} M_{jk}\, p_{ki} + \sum_{\ell=1}^{q} T_{j\ell}\, x_{i\ell} \qquad (1)$$

A CpG site j is then associated with phenotype $\ell$ if we reject the null hypothesis that the covariate coefficient $T_{j\ell}$ equals zero.

In contrast, HIRE further models the effect of each phenotype on each cell type as shown in the bottom panel of Fig. 1a. In cell type k, sample i’s cell-type-specific methylation profile, $u_{ik}$, is the sum of the corresponding baseline cell-type-specific methylation levels, $\mu_k$, and the phenotype effects $B_{k\ell} x_{i\ell}$ on sample i from all the $\ell = 1, \ldots, q$ phenotypes: $u_{ik} = \mu_k + \sum_{\ell=1}^{q} B_{k\ell}\, x_{i\ell}$, where $x_{i\ell}$ is phenotype $\ell$ of sample i and $B_{k\ell} = (\beta_{1k\ell}, \ldots, \beta_{mk\ell})^T$, the kth column of $B_\ell$, reflects the association of phenotype $\ell$ with each of the m CpG sites in cell type k. Thus, by collecting the baseline cell-type-specific methylation profiles into $\mu = (\mu_1, \ldots, \mu_K)$ and denoting the m by K phenotype coefficient matrix $(\beta_{jk\ell}: 1 \le j \le m, 1 \le k \le K)$ by $B_\ell$, we now have:

$$\mathrm{E}(O_i) = u_i\, p_i = \mu\, p_i + \sum_{\ell=1}^{q} B_\ell\, (x_{i\ell}\, p_i) \qquad (2)$$

A comparison of $x_{i\ell}$ in Eq. (1) and $x_{i\ell}\, p_i$, $\ell = 1, \ldots, q$, in Eq. (2) reveals that, via the two-layer hierarchical model, HIRE correctly captures the multiplicative effects of the cellular compositions on the phenotype effects (see also Methods and the Supplementary Methods). As a result, HIRE achieves greater statistical power for association detection at the aggregate level and enables the fine-scale resolutions that were previously infeasible. We mathematically prove that the HIRE model is identifiable under mild conditions that are easily met in reality (see Theorem 1 and its proof in Methods).
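To make the contrast between the additive and the multiplicative specification concrete, the following minimal Python sketch simulates noise-free data from the hierarchical model and fits two least-squares designs with the cellular compositions treated as known. It is an illustration only and not the HIRE estimation algorithm, which must also estimate the compositions. The toy sizes, the Dirichlet compositions, the single non-zero effect, and all variable names are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K, q = 200, 5, 3, 2  # samples, CpG sites, cell types, phenotypes (toy sizes)

# Assumed toy inputs: compositions on the simplex, one binary and one continuous phenotype.
P = rng.dirichlet(np.ones(K), size=n)                 # (n, K), rows sum to one
X = np.column_stack([rng.integers(0, 2, n),           # binary phenotype, e.g. disease status
                     rng.uniform(20, 50, n)])         # continuous phenotype, e.g. age
mu = rng.uniform(0.2, 0.8, size=(m, K))               # baseline cell-type-specific methylation
B = np.zeros((q, m, K))
B[0, 0, 1] = 0.1                                      # phenotype 0 affects CpG 0 in cell type 1 only

# u_i = mu + sum_l B_l x_il, and E(O_i) = u_i p_i (no noise added in this sketch).
U = mu[None] + np.einsum("il,lmk->imk", X, B)         # (n, m, K)
O = np.einsum("imk,ik->mi", U, P)                     # (m, n)

# Multiplicative design: columns p_ki followed by x_il * p_ki (l-major order).
inter = (X[:, :, None] * P[:, None, :]).reshape(n, q * K)
D_mult = np.hstack([P, inter])                        # (n, K + qK)
coef_mult, *_ = np.linalg.lstsq(D_mult, O.T, rcond=None)
mu_hat = coef_mult[:K].T                              # (m, K)
B_hat = coef_mult[K:].T.reshape(m, q, K).transpose(1, 0, 2)  # (q, m, K)
resid_mult = O.T - D_mult @ coef_mult

# Additive design: phenotypes and compositions enter as separate covariates.
D_add = np.hstack([X, P])                             # (n, q + K)
coef_add, *_ = np.linalg.lstsq(D_add, O.T, rcond=None)
resid_add = O.T - D_add @ coef_add

print("max |residual|, multiplicative design:", np.abs(resid_mult).max())
print("max |residual|, additive design, CpG 0:", np.abs(resid_add[:, 0]).max())
```

In this toy setting the multiplicative design reproduces the data and localizes the effect to a single cell type, whereas the additive design leaves a systematic residual at the affected CpG site and can only report one coefficient per phenotype.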

Figure 1b summarizes the inputs and outputs of HIRE. Given the methylation measurements at the aggregate level of n samples, HIRE can estimate all parameters of interest: $p_i$ (i = 1, …, n), $\mu$, and $B_\ell$ ($\ell = 1, \ldots, q$). HIRE then determines whether any association exists between CpG site j and phenotype $\ell$ in each individual cell type by testing the hypotheses $H_0: \beta_{jk\ell} = 0$ versus $H_1: \beta_{jk\ell} \neq 0$. When the null hypothesis $H_0: \beta_{jk\ell} = 0$ is rejected, HIRE calls CpG site j a risk-CpG site for phenotype $\ell$ in cell type k. The detection of cell-type-specific risk-CpG sites cannot be performed with any of the existing state-of-the-art methods.

Moreover, HIRE allows users to prespecify the number of cell types K. When K is unknown, HIRE selects the number of cell types according to the penalized Bayesian information criterion (pBIC)20 (Supplementary Methods).

Simulation

As the definition of the gold standard for real data is debatable21,22, we designed extensive simulation studies to evaluate the performance of HIRE and compared it with commonly used methods: unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTor (Methods). We generated datasets in which the observed methylation was a mixture of several cell types and each sample was accompanied by a diseased or normal status and a continuous age attribute. We deliberately designed some cell types to have similar baseline methylation profiles to mimic cell types from the same cell lineage. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated two scenarios in which (1) all phenotype effects $\beta_{jk\ell}$ are zero (the true null case), to compare the ability of each method to control false positives; and (2) a small portion of the $\beta_{jk\ell}$ are non-zero (the true alternative case), to study each method’s power to detect risk-CpG sites. Under the true alternative, both the binary and the continuous phenotypes were assumed to have cell-type-specific risk-CpG sites and to affect the cell-type proportions among the samples10. We further simulated phenotype effects with various directions and magnitudes.

Under the true null, HIRE, EWASHer, and ReFACTor control the false positive rates (FPRs) very well: none are greater than 0.05% (Table 1 and Supplementary Figs. 1–9). In comparison, RefFreeEWAS often has FPRs greater than 0.1% and thus does not perform as well as HIRE, and the unadjusted analysis and SVA further suffer from the dramatic inflation of false positives. For the true alternative settings, given that the FPRs are well-controlled, with FPRs below 0.05%, HIRE achieves the highest true positive rates (TPR) of all methods in every simulation setting (see also Fig. 2a and Supplementary Figs. 10–17). As expected, as the sample size increases, HIRE’s power increases. For example, when the data include five cell types, HIRE can identify 89.6% of the risk-CpG sites with 300 samples, and HIRE can detect almost all risk-CpG sites when the sample size reaches 600, which is a typical sample size for EWAS. Although EWASHer and ReFACTor have low FPRs, they miss a large proportion of risk-CpG sites. EWASHer’s maximum TPR is only 35.33%, and ReFACTor’s maximum TPR is slightly over 60%. However, in those cases, HIRE’s power is greater than 95%. Consistent with the true null scenario, in the true alternative, RefFreeEWAS has inflated FPRs compared to HIRE, and the unadjusted analysis and SVA always have huge false positives. Therefore, HIRE substantially improves the power of association detection at the aggregate level compared with existing methods.

Table 1.

Performance of HIRE and other competing methods in simulation studies

| Scenario | Cell type number | Sample size | Metric | HIRE | SVA | Unadjusted | RefFreeEWAS | EWASHer | ReFACTor |
|---|---|---|---|---|---|---|---|---|---|
| True null | K = 3 | n = 180 | FPR | 0 | 0.35% | 21.72% | 0.12% | 0 | 0 |
| True null | K = 3 | n = 300 | FPR | 0.00% | 3.13% | 53.40% | 0.12% | 0.00% | 0 |
| True null | K = 3 | n = 600 | FPR | 0 | 22.06% | 77.2% | 0.1% | 0.00% | 0 |
| True null | K = 5 | n = 180 | FPR | 0.05% | 0.3% | 18.5% | 0.13% | 0 | 0 |
| True null | K = 5 | n = 300 | FPR | 0.01% | 1.54% | 36.89% | 0.08% | 0 | 0 |
| True null | K = 5 | n = 600 | FPR | 0 | 8.09% | 55.66% | 0.12% | 0 | 0 |
| True null | K = 7 | n = 180 | FPR | 0.00% | 0.04% | 5.13% | 0.11% | 0 | 0 |
| True null | K = 7 | n = 300 | FPR | 0.00% | 0.22% | 22.84% | 0.12% | 0 | 0 |
| True null | K = 7 | n = 600 | FPR | 0 | 6.80% | 48.12% | 0.11% | 0 | 0 |
| True alternative | K = 3 | n = 180 | FPR | 0.01% | 1.24% | 32.20% | 0.11% | 0 | 0 |
| True alternative | K = 3 | n = 180 | TPR | 98.67% | 87.33% | 60.67% | 79.33% | 21.33% | 60.67% |
| True alternative | K = 3 | n = 300 | FPR | 0.00% | 7.60% | 65.03% | 0.09% | 0 | 2.21% |
| True alternative | K = 3 | n = 300 | TPR | 96.67% | 86% | 79.33% | 63.33% | 24.67% | 56.67% |
| True alternative | K = 3 | n = 600 | FPR | 0.03% | 16.91% | 73.55% | 0.11% | 0.00% | 1.34% |
| True alternative | K = 3 | n = 600 | TPR | 100% | 84% | 88.67% | 90% | 35.33% | 45.33% |
| True alternative | K = 5 | n = 180 | FPR | 0.00% | 2.70% | 20.06% | 0.14% | 0 | 0 |
| True alternative | K = 5 | n = 180 | TPR | 66% | 80.8% | 35.2% | 71.2% | 6.4% | 41.6% |
| True alternative | K = 5 | n = 300 | FPR | 0.01% | 1.76% | 36.11% | 0.17% | 0 | 0.01% |
| True alternative | K = 5 | n = 300 | TPR | 89.6% | 86.4% | 69.2% | 76.4% | 11.2% | 53.6% |
| True alternative | K = 5 | n = 600 | FPR | 0.01% | 15.31% | 56.25% | 0.09% | 0 | 0.11% |
| True alternative | K = 5 | n = 600 | TPR | 98.4% | 82.8% | 62% | 84.4% | 18% | 62.8% |
| True alternative | K = 7 | n = 180 | FPR | 0 | 0.37% | 9.30% | 0.11% | 0 | 0 |
| True alternative | K = 7 | n = 180 | TPR | 43% | 58.67% | 35% | 54.67% | 5% | 26% |
| True alternative | K = 7 | n = 300 | FPR | 0.00% | 1.08% | 26.67% | 0.11% | 0 | 0 |
| True alternative | K = 7 | n = 300 | TPR | 63.33% | 73% | 45% | 76.33% | 5% | 35.67% |
| True alternative | K = 7 | n = 600 | FPR | 0.04% | 25.80% | 56.76% | 0.10% | 0 | 0.00% |
| True alternative | K = 7 | n = 600 | TPR | 82.67% | 83% | 66% | 74.33% | 5% | 51.67% |

Performance of HIRE and other competing methods in simulation studies in detecting risk-CpG sites at the aggregate level. For the true null cases in which no CpG site is at risk, the average of the false positive rates (FPRs) based on five replicates is reported. For the true alternative cases, the averages of the FPRs and the true positive rates (TPRs) based on five replicates are reported. The number of CpG sites at risk is 30, 50, and 60 for the cell type number K = 3, 5, and 7, respectively. HIRE calls a CpG site as significant at the aggregate level if it is at risk in at least one cell type. We used Bonferroni correction for each method to control the family-wise error rate (FWER) below α = 0.01. Because HIRE can provide the p-values of CpG sites for all cell types and phenotypes, the p-value threshold for significance is α/(mKq), where m = 10,000 is the number of CpG sites and q = 2 is the phenotype number. For the other five methods, the p-value threshold is set to α/m. Notice that “0” represents exact zero, and “0.00%” indicates a very small positive number that is rounded down to zero using four decimal places
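The significance thresholds and the aggregate-level calling rule described in this table note can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and the choice to aggregate over cell types separately for each phenotype is an assumption made here, since the note only states that a site is called at the aggregate level if it is at risk in at least one cell type.

```python
import numpy as np

def bonferroni_thresholds(alpha, m, K, q):
    """Return (alpha / (m*K*q), alpha / m): the cell-type-specific threshold
    and the aggregate-level threshold used for the other methods."""
    return alpha / (m * K * q), alpha / m

def aggregate_calls(pvals, alpha=0.01):
    """pvals: array of shape (m, K, q) with one p-value per CpG site,
    cell type, and phenotype.

    Returns the cell-type-specific calls (m, K, q) and the aggregate calls
    (m, q), where a site is called for a phenotype if it is significant in
    at least one cell type (assumption: aggregation is done per phenotype)."""
    pvals = np.asarray(pvals, dtype=float)
    m, K, q = pvals.shape
    threshold = alpha / (m * K * q)
    by_cell_type = pvals < threshold
    return by_cell_type, by_cell_type.any(axis=1)

def fpr_tpr(called, truth):
    """False and true positive rates of boolean calls against boolean truth.
    A rate is NaN when its denominator is zero (e.g. TPR under the true null)."""
    called = np.asarray(called, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    n_pos = truth.sum()
    n_neg = (~truth).sum()
    fpr = (called & ~truth).sum() / n_neg if n_neg else float("nan")
    tpr = (called & truth).sum() / n_pos if n_pos else float("nan")
    return fpr, tpr

# Thresholds for alpha = 0.01, m = 10,000 CpG sites, K = 3 cell types, q = 2 phenotypes.
thr_cell_type, thr_aggregate = bonferroni_thresholds(0.01, 10_000, 3, 2)
print(thr_cell_type, thr_aggregate)
```

Because the cell-type-specific threshold divides by mKq rather than m, it becomes K·q times stricter than the aggregate threshold, which is one reason power drops as the number of cell types grows.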

Fig. 2.

Association detection performance of HIRE and commonly used methods in the true alternative setting with K = 3 and n = 180. Source data are provided as a Source Data file. In all figures, red corresponds to HIRE; yellow indicates the unadjusted analysis; brown represents SVA; purple refers to RefFreeEWAS; dark blue indicates EWASher; and light blue corresponds to ReFACTor. a ROC curves of HIRE and commonly used methods. HIRE has the largest area under the curve among all of the methods. b True cell-type-specific association pattern with disease status for 10,000 simulated CpG sites; columns correspond to cell types, and the rows represent the CpG sites. Dark cells correspond to risk-CpG sites, and grey cells are CpG sites not associated with the disease status. c Detected cell-type-specific association pattern with disease status by HIRE. Darkness represents −log10(p-value). d–i The p-value density plots for association with disease status in the simulation dataset for d HIRE, e unadjusted analysis, f SVA, g RefFreeEWAS, h EWASHer, and i ReFACTor. j–o The Q-Q plots for association with disease status for j HIRE, k unadjusted analysis, l SVA, m RefFreeEWAS, n EWASHer, and o ReFACTor

In multiple hypothesis testing, the p-values from the truly null features should follow a uniform distribution on (0, 1), whereas those for the truly alternative features are concentrated near zero23. Both the histograms (Fig. 2d–i) and the Q-Q plots (Fig. 2j–o) show that the p-value distribution of HIRE best fits the underlying truth, in which only a small proportion of CpG sites carry signals, followed by RefFreeEWAS and ReFACTor. EWASHer tends to overcorrect the signals, with its p-value density showing a dip near zero (Fig. 2h), and thus fails to detect the true associations. In contrast, the unadjusted analysis and SVA generate many very small p-values clustered near zero, resulting in inflated type I errors.
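The Q-Q diagnostic used in this comparison can be sketched as follows. This is a generic construction, not the plotting code behind Fig. 2: the function name and the (i − 0.5)/m plotting positions are assumptions chosen here for illustration.

```python
import numpy as np

def qq_points(pvals):
    """Return (expected, observed) -log10 p-values for a Q-Q plot.

    Expected quantiles come from the uniform distribution on (0, 1), using
    the plotting positions (i - 0.5) / m for the i-th smallest p-value."""
    p = np.sort(np.asarray(pvals, dtype=float))
    m = p.size
    expected = (np.arange(1, m + 1) - 0.5) / m
    return -np.log10(expected), -np.log10(p)

# Under the null, points should track the diagonal; inflation shows up as
# observed values above the expected ones, overcorrection as values below.
rng = np.random.default_rng(0)
null_p = rng.uniform(size=10_000)
expected, observed = qq_points(null_p)
print(np.median(expected), np.median(observed))
```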

In addition to the traditional association detection at the aggregate level, HIRE can identify the association of each CpG site with the phenotypes in each cell type. Table 2 shows the FPR and TPR of HIRE for each cell type in various simulation settings. Such fine analysis is not possible with the other methods. Consistent with association detection at the aggregate level, HIRE always controls the FPR well. When K = 3 and n = 180, HIRE accurately detects the risk-CpG sites associated with disease status, with a TPR of 83% or greater and an FPR of 0.01% or less in all three cell types. Similarly, most of the CpG sites affected by age are also correctly identified in each cell type. HIRE’s learned cell-type-specific association patterns closely match the underlying true associations (see Fig. 2b, c and Supplementary Figs. 18–26). Once again, HIRE’s power decreases with the number of cell types and increases with the sample size. When the samples consist of seven cell types and the proportion of the least abundant cell type is as low as 4.2%, given a typical current EWAS with around 600 samples, HIRE can detect most cell-type-specific risk-CpG sites reasonably well. Moreover, HIRE’s estimates for the baseline methylation profiles, cellular compositions, and phenotype effects have little bias (Supplementary Figs. 27–62); therefore, HIRE can provide accurate estimates and is powerful in detecting cell-type-specific risk-CpG sites.

Table 2.

Performance of HIRE in detecting cell-type-specific risk-CpG sites

| Phenotype | Cell type number | Sample size | Metric | Cell type 1 | Cell type 2 | Cell type 3 | Cell type 4 | Cell type 5 | Cell type 6 | Cell type 7 |
|---|---|---|---|---|---|---|---|---|---|---|
| Disease status | K = 3 | n = 180 | FPR | 0.01% | 0.00% | 0.01% | | | | |
| Disease status | K = 3 | n = 180 | TPR | 83% | 85% | 92% | | | | |
| Disease status | K = 3 | n = 300 | FPR | 0.02% | 0.02% | 0.04% | | | | |
| Disease status | K = 3 | n = 300 | TPR | 74% | 85% | 95% | | | | |
| Disease status | K = 3 | n = 600 | FPR | 0.03% | 0.03% | 0.05% | | | | |
| Disease status | K = 3 | n = 600 | TPR | 99% | 98% | 100% | | | | |
| Disease status | K = 5 | n = 180 | FPR | 0.01% | 0 | 0.00% | 0.01% | 0.02% | | |
| Disease status | K = 5 | n = 180 | TPR | 35% | 46% | 44% | 39% | 75% | | |
| Disease status | K = 5 | n = 300 | FPR | 0.02% | 0.02% | 0.02% | 0.06% | 0.10% | | |
| Disease status | K = 5 | n = 300 | TPR | 66% | 73% | 67% | 43% | 43% | | |
| Disease status | K = 5 | n = 600 | FPR | 0.02% | 0.02% | 0.01% | 0.10% | 0.12% | | |
| Disease status | K = 5 | n = 600 | TPR | 81% | 77% | 92% | 52% | 56% | | |
| Disease status | K = 7 | n = 180 | FPR | 0 | 0 | 0.01% | 0 | 0.00% | 0 | 0.00% |
| Disease status | K = 7 | n = 180 | TPR | 13% | 28% | 32% | 20% | 21% | 15% | 69% |
| Disease status | K = 7 | n = 300 | FPR | 0.01% | 0.01% | 0.01% | 0.00% | 0.01% | 0.01% | 0.02% |
| Disease status | K = 7 | n = 300 | TPR | 20% | 48% | 60% | 52% | 40% | 23% | 78% |
| Disease status | K = 7 | n = 600 | FPR | 0.02% | 0.02% | 0.01% | 0.02% | 0.01% | 0.01% | 0.07% |
| Disease status | K = 7 | n = 600 | TPR | 37% | 79% | 90% | 52% | 71% | 66% | 98% |
| Age | K = 3 | n = 180 | FPR | 0.01% | 0.01% | 0.06% | | | | |
| Age | K = 3 | n = 180 | TPR | 68% | 76% | 96% | | | | |
| Age | K = 3 | n = 300 | FPR | 0.05% | 0.03% | 0.08% | | | | |
| Age | K = 3 | n = 300 | TPR | 95% | 95% | 90% | | | | |
| Age | K = 3 | n = 600 | FPR | 0.06% | 0.06% | 0.08% | | | | |
| Age | K = 3 | n = 600 | TPR | 94% | 99% | 95% | | | | |
| Age | K = 5 | n = 180 | FPR | 0.05% | 0.05% | 0.01% | 0.04% | 0.06% | | |
| Age | K = 5 | n = 180 | TPR | 67% | 61% | 82% | 69% | 97% | | |
| Age | K = 5 | n = 300 | FPR | 0.09% | 0.03% | 0.04% | 0.04% | 0.08% | | |
| Age | K = 5 | n = 300 | TPR | 78% | 85% | 97% | 85% | 91% | | |
| Age | K = 5 | n = 600 | FPR | 0.07% | 0.06% | 0.07% | 0.08% | 0.08% | | |
| Age | K = 5 | n = 600 | TPR | 88% | 84% | 94% | 83% | 94% | | |
| Age | K = 7 | n = 180 | FPR | 0.02% | 0.01% | 0.01% | 0 | 0.02% | 0.01% | 0 |
| Age | K = 7 | n = 180 | TPR | 39% | 62% | 58% | 68% | 38% | 54% | 85% |
| Age | K = 7 | n = 300 | FPR | 0.08% | 0.01% | 0.04% | 0.01% | 0.03% | 0.03% | 0.02% |
| Age | K = 7 | n = 300 | TPR | 46% | 62% | 79% | 84% | 79% | 73% | 93% |
| Age | K = 7 | n = 600 | FPR | 0.09% | 0.08% | 0.04% | 0.10% | 0.03% | 0.05% | 0.07% |
| Age | K = 7 | n = 600 | TPR | 52% | 77% | 85% | 84% | 77% | 79% | 84% |

Performance of HIRE in detecting cell-type-specific risk-CpG sites in the true alternative cases. The results are based on five replicates for each setting. A CpG site is claimed to be significant in a given cell type if its p-value is less than α/(mKq)

In the HIRE model, we assume that different CpG sites are independent. We therefore investigate the performance of HIRE when this assumption is violated and dependence exists among nearby CpG sites. Specifically, we assume that every 50 consecutive CpG sites belong to a block. For CpG sites within the same block, their random noises ϵ follow a multivariate normal distribution with mean zero and 50 × 50 covariance matrix Σ, and Σ’s corresponding correlation matrix has its (i, j) entry equal to $\rho^{|i-j|}$. We set ρ to 0.8, 0.6, and 0.4. A comparison of Supplementary Tables 1–3 with Supplementary Table 4 shows that even when strong correlations exist among nearby CpG sites, HIRE still performs well in controlling the FPR and detecting the risk-CpG sites under this model misspecification.
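The block-correlated noise described above can be generated as in the following Python sketch. It assumes a common noise standard deviation for every CpG site (the text specifies only the correlation structure, so `sd` is an assumption), and the function name and defaults are hypothetical.

```python
import numpy as np

def block_noise(m, n, rho, block=50, sd=1.0, seed=0):
    """Draw an (m, n) noise matrix in which every `block` consecutive CpG
    sites are jointly normal with correlation rho**|i - j| and sites in
    different blocks are independent.

    Assumption: all sites share the same standard deviation `sd`."""
    if m % block != 0:
        raise ValueError("m must be a multiple of the block size")
    rng = np.random.default_rng(seed)
    idx = np.arange(block)
    corr = rho ** np.abs(idx[:, None] - idx[None, :])   # (block, block)
    chol = np.linalg.cholesky(corr)                      # corr = chol @ chol.T
    z = rng.standard_normal((m // block, block, n))      # independent N(0, 1) draws
    eps = sd * np.einsum("ab,kbn->kan", chol, z)         # correlate within each block
    return corr, eps.reshape(m, n)

corr, eps = block_noise(m=100, n=20_000, rho=0.8)
print(np.corrcoef(eps[0], eps[1])[0, 1])    # neighbours in the same block
print(np.corrcoef(eps[49], eps[50])[0, 1])  # neighbours in different blocks
```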

To further evaluate HIRE’s performance on experimentally mixed samples, we analyzed a semi-simulated dataset that includes six samples mixed from six purified cell types in predetermined proportions24. Once again, HIRE successfully recovers the six underlying reference cell types and estimates the cellular compositions well (see Methods).

Real data analysis

HIRE also provides greater insight into real data than previous studies. The rheumatoid arthritis (RA) dataset3 contains methylation profiles collected from the whole blood of 354 patients with RA and 335 normal participants. In addition to the RA status, other attributes such as gender, smoking history, age, and batch information are available. We first corrected the batch effects and then applied HIRE to the dataset (Methods). Figure 3a displays the p-values regarding the association with the RA status for each CpG site in each cell type, in which HIRE selected six cell types (Supplementary Fig. 63a), consistent with the number of cell types in the previous study13. Despite potential batch effects and biological variability, three of the six cell types can be matched to known blood cell references: cell type 1 was matched to CD4+ T cells, cell types 2 and 4 were matched to neutrophils, and the remaining three cell types cannot be aligned to the references (Methods and Supplementary Fig. 64). HIRE detected 63 risk-CpG sites in cell type 3 (the largest number of associations across all cell types) but no risk-CpG sites in cell type 1 (Supplementary Table 5). Therefore, the disease status affected some but not necessarily all cell types. Note that the significant CpG site cg06373940 called by HIRE is located on gene ERCC3. The level of ERCC3’s corresponding protein has been reported to increase in RA synovium25. Moreover, we found that five CpG sites had a significant association with smoking history (Supplementary Fig. 65 and Supplementary Table 6). One of them is cg05575921, which was recently linked to smoking in two other independent studies of blood samples26,27. However, these findings were missed by the association detection at the aggregate level in previous analyses of the same dataset11,13. The p-value density plots and Q-Q plots for the commonly used methods are also displayed in Fig. 3c–n; they present patterns similar to those observed in the simulation study except for an obvious overcorrection by ReFACTor.

Fig. 3.

Application of HIRE and commonly used methods to two real methylation datasets: RA and GALA II. Source data are provided as a Source Data file. a Cell-type-specific association pattern with RA status detected by HIRE in the RA dataset. Darkness represents the −log10(p-value). b Cell-type-specific association pattern with gender detected by HIRE in the GALA II dataset. The darkness represents the −log10(p-value). c–h The p-value density plots for association with RA status in the RA dataset for c HIRE, d unadjusted analysis, e SVA, f RefFreeEWAS, g EWASHer, and h ReFACTor. i–n Q-Q plots for association with RA status in the RA dataset for i HIRE, j unadjusted analysis, k SVA, l RefFreeEWAS, m EWASHer, and n ReFACTor

The high resolution provided by HIRE makes it a powerful tool for EWAS studies. Rahmani et al. used ReFACTor13 to analyze the GALA II blood methylation dataset28, which consists of 573 samples collected from a pediatric Latino population. Each sample includes the gender information and belongs to one of the following four populations: Mexican, Mixed Latino, Puerto Rican, and Other Latino. We applied HIRE to the dataset to investigate whether any cell-type-specific CpG sites were associated with gender and ethnicity. We created three dummy variables to represent the four ethnic groups. By taking the indicators of ethnicity as phenotypes in the model, HIRE automatically and simultaneously accounts for the population differences in cell composition and cell-type-specific methylation levels. HIRE correctly selected the number of cell types as six as reported in the previous study13 (Supplementary Fig. 63b). According to cell-type alignment, cell types 1 and 5 can be annotated as CD4+ T cells; cell types 2, 3, and 4 belong to neutrophils; and cell type 6 was annotated as CD56+ natural killer cells (CD56+ NK) using the references (Supplementary Fig. 66). HIRE found that 1936 CpG sites were associated with ethnicity across all cell types (Supplementary Fig. 67) and identified 14, 52, 155, 15, 18, and 14 risk-CpG sites for gender in cell types 1–6, respectively (Fig. 3b). Gene set enrichment analysis showed that the genes that harbored risk-CpG sites for gender were significantly enriched in seven canonical pathways (Supplementary Table 7), of which the PID_CMYB_PATHWAY was ranked the highest. The transcription factor c-MYB in the PID_CMYB_PATHWAY enhances the progression of breast cancer29; therefore, the different occurrence rates of breast cancers in men and women may be linked to the differences at the epigenome level. In comparison, only one pathway was found to be enriched with the genes that host the risk-CpG sites claimed by ReFACTor at the aggregate level (Supplementary Table 8). All of these observations highlight the importance of the finer-scale resolutions of HIRE.
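The dummy coding of the four ethnic groups mentioned above can be illustrated with a short Python sketch. The choice of reference group is not stated in the text, so the reference used below is an arbitrary assumption, and the helper name is hypothetical.

```python
import numpy as np

def dummy_code(labels, reference):
    """Encode a categorical phenotype with G observed groups as G - 1
    indicator columns; samples in the reference group get all zeros."""
    levels = [lv for lv in sorted(set(labels)) if lv != reference]
    X = np.array([[1.0 if lab == lv else 0.0 for lv in levels] for lab in labels])
    return levels, X

# Toy labels; "Mexican" is used as the reference group purely for illustration.
ethnicity = ["Mexican", "Puerto Rican", "Mixed Latino", "Other Latino", "Mexican"]
levels, X_eth = dummy_code(ethnicity, reference="Mexican")
print(levels)
print(X_eth)
```

Each of the three indicator columns then enters the model as its own phenotype, alongside gender.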

Discussion

In reality, the phenotype may affect a risk-CpG site in some but not all of the cell types. HIRE can detect the cell-type-specific association pattern with each phenotype for EWAS. The identification of cell-type-specific risk-CpG sites will help epigenetic therapies to target the affected cell types in a more effective manner.

Statistically, instead of assuming fixed reference methylomes for all samples as the existing methods do9,13,16, HIRE allows each sample’s cell-type-specific methylation profiles to depend on its phenotypes. As a result, HIRE correctly models the multiplicative effects of the cellular compositions on the observed methylation levels, whereas the existing approaches all misspecify the cellular compositions as additive effects (Methods). This enables HIRE to detect cell-type-specific risk-CpG sites that cannot feasibly be detected with existing state-of-the-art methods. As a byproduct, HIRE also improves the statistical power of association detection at the aggregate level relative to existing state-of-the-art methods. Computationally, the time complexity of one iteration of HIRE is $O(nmKp + nK^3)$, which keeps the computation fast when K is moderate. The statistical and computational advantages equip HIRE to be scaled up for large-cohort EWAS.

So far, in the EWAS community, no gold-standard exists for the comparison of various methods. Ideally, we would like to have epigenetic spike-in experiments in which purified cell types are isolated, CpG-sites are epigenetically edited on a per-cell-type basis, and cell types are finally mixed in predetermined proportions. Given such experiments, the underlying knowledge of which CpGs are differentially methylated in each cell type and the cell mixing proportions for each sample are known. However, biotechnologies for epigenetic editing, such as CRISPR-Cas, are still not mature at this stage, with many off-target modifications30. Therefore, most computational EWAS studies refer to numerical simulation studies rather than to experimental studies when evaluating the performance of their algorithms12,13. Here, we follow the example of previous comparative studies and design our simulation studies to serve as the computational counterpart of experimental spike-in studies. With the rapid advances in epigenetic editing, we hope the community can devote greater effort in the near future to the creation of a gold-standard dataset, such as those generated in the early years for gene expression microarray studies31.

The beta-values that represent methylation levels always lie between zero and one. As previous approaches to EWAS often assume a normal distribution for the beta-values and perform well in real applications9,13, in HIRE, we also assume that the beta-values follow a normal distribution. Consequently, the fitted methylation level may lie outside the range of [0, 1]. Nevertheless, we do in fact constrain the baseline methylation profiles $\mu_{jk}$ to the closed interval [0, 1] and force the cellular compositions $p_{ki}$ to be non-negative and to add up to one: $\sum_{k=1}^{K} p_{ki} = 1$. As a result, because the phenotypes have no effect on most CpG sites, most observations $O_{ji}$ have their means $\sum_{k=1}^{K} \mu_{jk}\, p_{ki}$ in [0, 1]. In fact, for both the RA dataset and the GALA II dataset, more than 99.99% of the fitted methylation values $\hat{O}_{ji}$ based on HIRE estimates lie between zero and one. Therefore, the normal assumption fits the data reasonably well and does not have a large effect on the performance of HIRE.

One major issue for all of the cell-type deconvolution methods is that deconvolution cannot be achieved if the cellular compositions do not vary among samples. For example, assuming that the samples are mixtures of two cell types and $p_i = p = (p_1, p_2)$ for all of the samples, then the observed methylation profile $O_i$ equals $u_{i1}p_1 + u_{i2}p_2 = (u_{i1} + p_2 C)p_1 + (u_{i2} - p_1 C)p_2 := \tilde{u}_{i1}p_1 + \tilde{u}_{i2}p_2$ for any constant $C$. As a result, $u_{i1}$ and $u_{i2}$ are not estimable. In our paper, we show mathematically that HIRE is identifiable under mild conditions in Theorem 1 and that condition (b) of Theorem 1 formulates the requirement for the variability of the cellular compositions (Methods). HIRE can accurately estimate cellular compositions of tissues with great cellular heterogeneity, such as blood. Although the mild conditions in Theorem 1 are easily met for real DNA methylation data, identification of both sufficient and necessary conditions for model identifiability is a theoretically interesting and challenging statistical problem that we will investigate in a future study.
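The two-cell-type argument above can be checked numerically. The following is a minimal sketch, not part of HIRE itself; the proportions, the constant C, and the signal values are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two cell types whose composition (p1, p2) is identical in every sample.
p1, p2 = 0.3, 0.7
n_samples = 5

# Hypothetical cell-type-specific signals u_i1 and u_i2 for each sample.
u1 = rng.uniform(0.2, 0.8, size=n_samples)
u2 = rng.uniform(0.2, 0.8, size=n_samples)
observed = u1 * p1 + u2 * p2

# Shift the two signals in opposite directions by an arbitrary constant C.
C = 0.25
u1_alt = u1 + p2 * C
u2_alt = u2 - p1 * C
observed_alt = u1_alt * p1 + u2_alt * p2

# Both parameter sets reproduce the same mixtures, so u_i1 and u_i2 cannot
# be recovered from the observations when the composition never varies.
same_mixture = bool(np.allclose(observed, observed_alt))
print(same_mixture)
```

The shifted signals differ from the original ones by p2·C and p1·C respectively, yet the mixtures coincide because the two shifts cancel after weighting.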

HIRE requires a moderate sample size to obtain precise estimates because it needs to learn $(1 + 2K + qK)m + (K - 1)n$ parameters from a total of $mn$ observed values. Our simulation studies show that HIRE performs very well at the aggregate level with 180 samples (Table 1). If the sample size drops below 150, say to 120, HIRE can still control the FPR well but begins to lose power (Supplementary Table 9). For small sample sizes, we have also developed a special case of HIRE by reparameterizing all $\sigma_{jk}^2$ as one single parameter $\sigma^2$, and we found that such a variance-stabilized approach can achieve even better inflation control (see Supplementary Figs. 71–76) and power comparable to HIRE (see Supplementary Table 10). Like the two datasets analyzed in the real application, a typical sample size for a current EWAS exceeds 500, thus guaranteeing a high TPR for HIRE. Given the decreasing cost of EWAS, we recommend that researchers collect at least 200 samples for association detection at the aggregate level and 600 samples for identification of cell-type-specific risk-CpG sites. A larger sample size can further boost the power.
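To make the parameter count concrete, the sketch below evaluates the formula for a few sample sizes. The breakdown in the comments (K baseline means, qK phenotype effects, K cell-type variances, and one error variance per CpG site, plus K − 1 free proportions per sample) is our reading of the model in Methods; the function names and the chosen values of m, K, and q are illustrative assumptions.

```python
def hire_parameter_count(m, n, K, q):
    """Free parameters of the model: (1 + 2K + qK) * m + (K - 1) * n.

    Per CpG site: K baseline means, q*K phenotype effects,
    K cell-type variances, and 1 error variance.
    Per sample: K - 1 free proportions (they sum to one).
    """
    return (1 + 2 * K + q * K) * m + (K - 1) * n


def observations_per_parameter(m, n, K, q):
    """Ratio of observed values (m * n) to free parameters."""
    return (m * n) / hire_parameter_count(m, n, K, q)


# Illustrative setting: 10,000 CpG sites, K = 5 cell types, q = 2 phenotypes.
for n in (120, 180, 600):
    count = hire_parameter_count(10_000, n, 5, 2)
    ratio = observations_per_parameter(10_000, n, 5, 2)
    print(f"n = {n}: {count} parameters, {ratio:.2f} observations per parameter")
```

Because the per-site term dominates, the number of observations available per parameter grows almost linearly with the sample size n.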

With the popularity of EWAS, we believe that HIRE will be widely applied, and we hope that HIRE can motivate more researchers to mine out finer-scale results from EWAS.

Methods

Multiplicative effects of cellular composition on methylation

In this section, we illustrate that the effects of the cell-type composition are actually multiplicative. Let us assume that the beta-values that represent the methylation levels are observed across $m$ CpG sites for $n$ samples. As the measured sample comprises cells of various types, the observed beta-value is a weighted average of the mean methylation levels of distinct cell types, and the weights correspond to the proportions of each cell type. Let $O_{ji}$ denote the measurement at CpG site $j$ for sample $i$. If we assume that there exist $K$ cell types in all samples and that the mean methylation level for CpG site $j$ in cell type $k$ is $\mu_{jk}$, then

$$O_{ji} = \sum_{k=1}^{K} \mu_{jk} p_{ki} + \epsilon_{ji},$$

where $p_{ki}$ is the proportion of cell type $k$ in sample $i$ with the natural constraint $\sum_{k=1}^{K} p_{ki} = 1$, and $\epsilon_{ji}$ is a random error.

Let us consider a case-control EWAS. Without loss of generality, we assume that CpG site $j$ is differentially methylated between cases and controls in cell type 1 with a mean shift $\delta_{j1}$ and that it is not differentially methylated in the remaining cell types. As a result, for case samples,

$$O_{ji} = (\mu_{j1} + \delta_{j1}) p_{1i} + \sum_{k=2}^{K} \mu_{jk} p_{ki} + \epsilon_{ji} = \delta_{j1} p_{1i} + \sum_{k=1}^{K} \mu_{jk} p_{ki} + \epsilon_{ji}.$$

If we then use $Z_i$ to indicate the case-control status of sample $i$, the observed methylation level becomes

$$O_{ji} = \delta_{j1} p_{1i} Z_i + \sum_{k=1}^{K} \mu_{jk} p_{ki} + \epsilon_{ji}. \tag{3}$$

Therefore, the proportions of cell type 1, $p_{1i}$ ($i = 1, \ldots, n$), have multiplicative effects rather than additive effects on the mean difference between the case and control samples.

The existing methods, which either estimate the cell type proportions explicitly or approximate them implicitly with surrogate variables, add the estimated proportions and the case-control indicator $Z_i$ as covariates to the regression as follows:

$$O_{ji} = \alpha_j + \tau_j Z_i + \sum_{k=1}^{K-1} b_{jk} \hat{p}_{ki} + \epsilon_{ji}, \tag{4}$$

where the $b_{jk}$ are the regression coefficients. As a result, CpG site $j$ is called differentially methylated on the basis of hypothesis testing for $\tau_j = 0$. In general, $\tau_j$ in Eq. (4) is not equal to $\delta_{j1}$ in Eq. (3). Please see the Supplementary Notes for a numerical example. Moreover, testing for $\tau_j = 0$ loses the information regarding the cell type in which CpG site $j$ may be at risk. To account for the multiplicative effects, we propose the HIRE model, which conserves the individual cell-type level information and is introduced in the next section.
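The gap between the coefficient of the case-control indicator in Eq. (4) and the cell-type-specific shift in Eq. (3) can be illustrated with a small simulation. This is a hedged sketch of our own and not the numerical example from the Supplementary Notes; the baseline means, the shift of 0.10, the Beta(2, 6) distribution of the proportions, and the noise level are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
mu = np.array([0.6, 0.3])   # assumed baseline methylation of cell types 1 and 2
delta = 0.10                # assumed case-control shift, cell type 1 only

z = rng.integers(0, 2, size=n)      # case-control indicator Z_i
p1 = rng.beta(2.0, 6.0, size=n)     # proportion of cell type 1
p2 = 1.0 - p1
o = delta * p1 * z + mu[0] * p1 + mu[1] * p2 + rng.normal(0, 0.01, size=n)

# Additive adjustment as in Eq. (4): intercept, Z_i, and K - 1 proportions.
X_add = np.column_stack([np.ones(n), z, p1])
tau_hat = np.linalg.lstsq(X_add, o, rcond=None)[0][1]

# Multiplicative term as in Eq. (3): the shift enters through p_1i * Z_i.
X_mult = np.column_stack([np.ones(n), p1, p1 * z])
delta_hat = np.linalg.lstsq(X_mult, o, rcond=None)[0][2]

print(f"tau_hat = {tau_hat:.3f}, delta_hat = {delta_hat:.3f}, delta = {delta}")
```

In this setup the additive coefficient is pulled toward the shift scaled by the average proportion of cell type 1, whereas the regression with the product term recovers a value close to the true shift.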

The HIRE model

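HIRE uses a hierarchical model to closely follow the data generation process for the EWAS data. To begin, we assume that the baseline methylation level for CpG site $j$ in cell type $k$ is $\mu_{jk}$. For sample $i$ with phenotypes $x_i = (x_{i1}, \ldots, x_{iq})$, the mean methylation value for CpG site $j$ in cell type $k$ is assumed to be $\mu_{jk} + \sum_{\ell=1}^{q} \beta_{jk\ell} x_{i\ell}$. In other words, the phenotypes have linear effects, where $\beta_{jk\ell}$ characterizes the influence of phenotype $\ell$ on CpG site $j$ in cell type $k$. Let $u_{ijk}$ represent the signal from CpG site $j$ in cell type $k$ for sample $i$ with $x_i$. We assume that $u_{ijk}$ follows a normal distribution with mean $\mu_{jk} + \sum_{\ell=1}^{q} \beta_{jk\ell} x_{i\ell}$ and standard deviation $\sigma_{jk}$,

$$u_{ijk} \sim N\left(\mu_{jk} + \sum_{\ell=1}^{q} \beta_{jk\ell} x_{i\ell},\ \sigma_{jk}^2\right). \tag{5}$$

After the $u_{ijk}$ are generated for all of the $K$ cell types, the observed methylation value $O_{ji}$ is sampled as follows:

$$O_{ji} \sim N\left(\sum_{k=1}^{K} u_{ijk} p_{ki},\ \sigma_{\epsilon j}^2\right). \tag{6}$$

Collectively, $O = \{O_{ji} : 1 \le j \le m, 1 \le i \le n\}$ denotes the observed data; $u = \{u_{ij} = (u_{ij1}, \ldots, u_{ijK})^T : 1 \le i \le n, 1 \le j \le m\}$ are the missing data; and $\mu_j = (\mu_{j1}, \ldots, \mu_{jK})^T$, $B^{(j)} = (\beta_{jk\ell})_{K \times q}$, $\sigma_{\epsilon j}^2$, the diagonal matrix $\Sigma_j = \mathrm{diag}(\sigma_{j1}^2, \ldots, \sigma_{jK}^2)$ for $j = 1, \ldots, m$, and $p_i = (p_{1i}, \ldots, p_{Ki})^T$ for $i = 1, \ldots, n$ are the parameters. With $\Theta = \{p_i, \mu_j, B^{(j)}, \Sigma_j, \sigma_{\epsilon j}^2 : 1 \le j \le m, 1 \le i \le n\}$, the complete data log-likelihood function, $l_c$, can be expressed as follows:

$$l_c(\Theta \mid O, u) = \sum_{i=1}^{n} \sum_{j=1}^{m} \left[ -\frac{1}{2} \log \sigma_{\epsilon j}^2 - \frac{(O_{ji} - u_{ij}^T p_i)^2}{2 \sigma_{\epsilon j}^2} - \frac{1}{2} \sum_{k=1}^{K} \log \sigma_{jk}^2 - \frac{1}{2} (u_{ij} - \mu_j - B^{(j)} x_i)^T \Sigma_j^{-1} (u_{ij} - \mu_j - B^{(j)} x_i) \right] + \text{Constant}.$$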
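As a reading aid, the complete data log-likelihood above can be transcribed into a short function. This is a sketch under our own conventions and is not taken from the HIREewas package; the function name, argument names, and array layouts are assumptions.

```python
import numpy as np


def complete_data_loglik(O, u, P, mu, B, X, sigma2, sigma2_eps):
    """Complete-data log-likelihood up to an additive constant.

    O: (m, n) observed values O_ji      u: (n, m, K) latent signals u_ijk
    P: (K, n) proportions p_ki          mu: (m, K) baseline means mu_jk
    B: (m, K, q) effects beta_jkl       X: (n, q) phenotypes x_il
    sigma2: (m, K) variances sigma_jk^2
    sigma2_eps: (m,) error variances sigma_eps,j^2
    """
    # Mean of u_ijk: mu_jk + sum_l beta_jkl * x_il, shape (n, m, K).
    mean_u = mu[None, :, :] + np.einsum("jkl,il->ijk", B, X)
    # Mixture sum_k u_ijk * p_ki, shape (n, m).
    mix = np.einsum("ijk,ki->ij", u, P)
    resid_o = O.T - mix
    # Terms that involve the observed data.
    term_obs = (-0.5 * np.log(sigma2_eps)[None, :]
                - resid_o ** 2 / (2.0 * sigma2_eps[None, :]))
    # Terms that involve the latent cell-type-specific signals.
    term_lat = (-0.5 * np.log(sigma2).sum(axis=1)[None, :]
                - 0.5 * ((u - mean_u) ** 2 / sigma2[None, :, :]).sum(axis=2))
    return float((term_obs + term_lat).sum())
```

Because each covariance matrix is diagonal, the quadratic form reduces to a sum of squared residuals divided by the cell-type-specific variances, which is what the last term computes.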

Accordingly, we develop a generalized expectation-maximization algorithm32 to estimate the parameters. In the expectation-maximization algorithm, a good initialization can lead to faster convergence than random starts. We adopt the cellular composition estimations from the methylation matrix decomposition algorithm16 with slight modifications as the initializations. The initial values for the baseline methylation profiles μjk are accordingly estimated by simple linear regressions. As the number of risk-CpG sites is often small, all of the phenotype effects βjk are set to zero at the beginning. For the standard deviations, the initial values are randomly sampled from inverse gamma distributions with small means. We choose the number of cell types K by using a variant of the penalized Bayesian information criterion (pBIC)20 (see details in Supplementary Methods).

For each phenotype $\ell$, we can conduct the hypothesis test $H_0: \beta_{jk\ell} = 0$ versus $H_1: \beta_{jk\ell} \neq 0$ for any cell type $k$ and any CpG site $j$. Combining Eqs. (5) and (6) and using $p_{1i} = 1 - \sum_{k=2}^{K} p_{ki}$, we obtain the following equations:

$$E\,O_{ji} = \mu_{j1} + \sum_{k=2}^{K} (\mu_{jk} - \mu_{j1}) p_{ki} + \sum_{k=1}^{K} \sum_{\ell=1}^{q} \beta_{jk\ell} x_{i\ell} p_{ki}, \quad i = 1, \ldots, n. \tag{7}$$

We can then take $(O_{j1}, \ldots, O_{jn})$ as the response vector and concatenate $1_n$, $(p_{k1}, \ldots, p_{kn})$ ($k = 2, \ldots, K$), and $(x_{1\ell} p_{k1}, \ldots, x_{n\ell} p_{kn})$ ($\ell = 1, \ldots, q$; $k = 1, \ldots, K$) into an $n \times (q + 1)K$ design matrix for the linear regression; the $1 + (K - 1) + qK$ columns add up to $(q + 1)K$. We plug in the estimated cellular compositions $\hat{p}_{ki}$ and conduct the hypothesis test for $\beta_{jk\ell} = 0$ using two-sided t-tests in the linear models. We claim that CpG site $j$ has an association with phenotype $\ell$ at the aggregate level if phenotype $\ell$ affects CpG site $j$ in at least one of the $K$ cell types. Note that in the regression we incorporate the estimated cellular compositions into the linear model as multiplicative effects rather than additive effects.
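The regression step can be sketched as follows. This is an illustrative reconstruction under stated assumptions rather than the package implementation: the helper names are hypothetical, the proportions are simulated instead of estimated, and only t-statistics are computed (two-sided p-values would follow from a t distribution with the returned degrees of freedom).

```python
import numpy as np


def hire_design_matrix(P_hat, X):
    """Build the n x (q + 1)K design matrix implied by Eq. (7).

    P_hat: (K, n) estimated proportions; X: (n, q) phenotypes.
    Columns: 1_n, p_k for k = 2..K, then x_l * p_k for l = 1..q, k = 1..K.
    """
    K, n = P_hat.shape
    cols = [np.ones(n)] + [P_hat[k] for k in range(1, K)]
    for l in range(X.shape[1]):
        cols += [X[:, l] * P_hat[k] for k in range(K)]
    return np.column_stack(cols)


def ols_t_statistics(D, y):
    """Ordinary least squares estimates, t-statistics, and degrees of freedom."""
    coef, *_ = np.linalg.lstsq(D, y, rcond=None)
    resid = y - D @ coef
    dof = D.shape[0] - D.shape[1]
    sigma2 = resid @ resid / dof
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(D.T @ D)))
    return coef, coef / se, dof


# Toy example: one binary phenotype that affects cell type 2 only.
rng = np.random.default_rng(2)
n, K, q = 300, 3, 1
P_hat = rng.dirichlet([4, 4, 3], size=n).T
X = rng.integers(0, 2, size=(n, q)).astype(float)
mu_j = np.array([0.3, 0.5, 0.7])
beta_j = np.array([0.0, 0.1, 0.0])
y = mu_j @ P_hat + (beta_j @ P_hat) * X[:, 0] + rng.normal(0, 0.01, size=n)

D = hire_design_matrix(P_hat, X)
coef, t_stat, dof = ols_t_statistics(D, y)
print(D.shape, np.round(coef[K:], 3), np.round(t_stat[K:], 1))
```

The last K coefficients for each phenotype are the cell-type-specific effects, so a large absolute t-statistic in only one of them points to the cell type that carries the association.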

More technical details of the method and the algorithm are available in the Supplementary Methods.

Data simulation

We compared the performance of HIRE with five previous methods—unadjusted analysis, SVA, RefFreeEWAS, EWASHer, and ReFACTor—in 18 simulation settings. We set the sample size n to 180, 300, and 600 and let the underlying cell type number K be 3, 5, and 7. For each pair of (n, K), we investigated the true null case and the true alternative case. As a result, we have in total 3 (the number of sample sizes) × 3 (the number of cell types) × 2 (the true null case and the true alternative case) = 18 simulation settings. For each setting, we considered 10,000 CpG sites and simultaneously accounted for the following factors.

Cell lineage. We first constructed the baseline methylation matrix $\mu = (\mu_{jk})_{m \times K}$, in which each column corresponds to the baseline methylation levels of a cell type. To mimic the phenomenon in which cell types from the same lineage have similar methylation profiles, we assumed that $K_{sim}$ of the total $K$ cell types were similar. Specifically, without loss of generality, we assumed that the first $K_{sim}$ cell types came from the same cell lineage and that the remaining $K - K_{sim}$ cell types were unrelated to one another. We set $K_{sim}$ to 2, 2, and 3 for $K$ = 3, 5, and 7, respectively. We generated $\mu_{jk}$ for cell types $k = 1, K_{sim} + 1, \ldots, K$ from the beta distribution beta(3, 6) on each CpG site $j$ independently. For each of the remaining cell types $k' = 2, \ldots, K_{sim}$, we randomly selected 20% of the CpG sites and drew their $\mu_{jk'}$ independently from beta(3, 6); for the remaining 80% of CpG sites, we let $\mu_{jk'}$ be $\mu_{j1}$ plus a very small random perturbation, thus inducing the similarities among cell types 1 to $K_{sim}$.

Discrete and continuous phenotypes. We further generated a discrete and a continuous phenotype $x_i = (x_{i1}, x_{i2})^T$ for each individual $i$ ($i = 1, \ldots, n$). We let the first $n/3$ individuals be the control samples with $x_{i1} = 0$ for $i = 1, \ldots, n/3$ and the remaining $2n/3$ individuals serve as cases with $x_{i1} = 1$ for $i = n/3 + 1, \ldots, n$. The continuous phenotype values $x_{12}, \ldots, x_{i2}, \ldots, x_{n2}$ were independently drawn from Unif(20, 50) to act as age.

Phenotype effects with different magnitudes and directions. We then simulated the phenotype effect $\beta_{jk\ell}$ of each phenotype $\ell$ on CpG site $j$ in cell type $k$. For the true null cases, all of the $\beta_{jk\ell}$ are zero. For a true alternative setting, we set nonzero phenotype effects as follows.

For phenotype 1, the case/control status, we let it affect the first 10 CpG sites in all of the cell types: $\beta_{jk1} \neq 0$ for $j = 1, \ldots, 10$ and $k = 1, \ldots, K$. We then assumed that the next 10 CpG sites were influenced by the disease status in the first $K_{sim}$ cell types, which come from the same lineage, but not in the other cell types: $\beta_{jk1} \neq 0$ ($k = 1, \ldots, K_{sim}$) and $\beta_{jk1} = 0$ ($k = K_{sim} + 1, \ldots, K$) for any $j = 11, \ldots, 20$. Furthermore, for cell type $k \in \{K_{sim} + 1, \ldots, K\}$, we let the disease status affect CpG sites $j = 20 + 10(k - K_{sim} - 1) + 1, \ldots, 20 + 10(k - K_{sim})$ only in cell type $k$. We generated the cell-type-specific effects of age in a similar fashion for CpG sites 21 to $40 + 10(K - K_{sim})$.

For each nonzero $\beta_{jk1}$, we let $\beta_{jk1} = r_{jk} \cdot \omega_{jk}$, where $\omega_{jk} \sim \mathrm{Unif}(0.07, 0.15)$ and $r_{jk}$ takes values of 1 and −1 with equal probabilities. Thus, the $\beta_{jk1}$ can have both positive and negative effects. In the same spirit, we generated nonzero $\beta_{jk2}$ with $r_{jk}$ and $\omega_{jk}$, where $\omega_{jk} \sim \mathrm{Unif}(0.007, 0.015)$.

Association between phenotypes and cellular compositions. Notice that the phenotypes may be associated with the cellular composition. Therefore, when $K = 3$, we drew $p_i = (p_{1i}, \ldots, p_{Ki})$ from a Dirichlet distribution $\mathrm{Dir}(4, 4, 2 + 0.1x_{i2})$ if sample $i$ is a control and $p_i \sim \mathrm{Dir}(4, 4, 5 + 0.1x_{i2})$ if it is a case; when $K = 5$, we let $p_i \sim \mathrm{Dir}(3, 3, 3, 3, 2 + 0.1x_{i2})$ for a control sample and $p_i \sim \mathrm{Dir}(3, 3, 3, 3, 5 + 0.1x_{i2})$ for a case sample; and when $K = 7$, we sampled $p_i \sim \mathrm{Dir}(1, 3, 3, 3, 2, 2, 2 + 0.1x_{i2})$ for controls and $p_i \sim \mathrm{Dir}(1, 3, 3, 3, 2, 2, 5 + 0.1x_{i2})$ for cases.

Finally, we generated the observed value $O_{ji}$ for CpG site $j$ of sample $i$ as follows: sample $u_{ijk}$ from $N(\mu_{jk} + \beta_{jk1}x_{i1} + \beta_{jk2}x_{i2},\ 0.01^2)$ for $k = 1, \ldots, K$; and sample $O_{ji}$ from $N(\sum_{k=1}^{K} u_{ijk}p_{ki},\ 0.01^2)$. If $O_{ji}$ lies outside the interval (0, 1), we truncate it to zero if $O_{ji}$ is lower than zero and to one if $O_{ji}$ is greater than one.
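The simulation steps described in this section can be sketched in code as follows. This is an approximate re-implementation for illustration and not the authors' simulation script. Assumed details are marked in the comments: the scale of the "very small" perturbation, the clipping of perturbed baselines to [0, 1], the 20-site offset used for the age effects, and the function and argument names.

```python
import numpy as np


def simulate_hire_data(n=180, m=1000, K=3, K_sim=2, null_case=False, seed=0):
    """Return (O, X, P, mu, beta) with O of shape (m, n)."""
    rng = np.random.default_rng(seed)

    # Cell lineage: Beta(3, 6) baselines; cell types 2..K_sim copy cell
    # type 1 on 80% of the sites, up to a small perturbation.
    mu = rng.beta(3, 6, size=(m, K))
    for k in range(1, K_sim):
        fresh = rng.choice(m, size=m // 5, replace=False)
        similar = np.ones(m, dtype=bool)
        similar[fresh] = False
        # Assumption: perturbation N(0, 0.01^2), clipped to [0, 1].
        noisy = mu[similar, 0] + rng.normal(0, 0.01, similar.sum())
        mu[similar, k] = np.clip(noisy, 0, 1)

    # Phenotypes: first n/3 samples are controls, the rest are cases.
    x1 = (np.arange(n) >= n // 3).astype(float)
    x2 = rng.uniform(20, 50, size=n)
    X = np.column_stack([x1, x2])

    # Phenotype effects: signed magnitudes r * omega on blocks of 10 sites.
    beta = np.zeros((m, K, 2))

    def fill(l, offset, low, high):
        def draw(size):
            signs = rng.choice([-1.0, 1.0], size=size)
            return signs * rng.uniform(low, high, size=size)
        beta[offset:offset + 10, :, l] = draw((10, K))                # all types
        beta[offset + 10:offset + 20, :K_sim, l] = draw((10, K_sim))  # lineage
        for k in range(K_sim, K):                                     # one type
            start = offset + 20 + 10 * (k - K_sim)
            beta[start:start + 10, k, l] = draw(10)

    if not null_case:
        fill(0, 0, 0.07, 0.15)      # case/control status, sites 1 onward
        fill(1, 20, 0.007, 0.015)   # age, sites 21 onward (assumed offset)

    # Compositions: the last Dirichlet parameter depends on status and age.
    base = {3: [4, 4, 2], 5: [3, 3, 3, 3, 2], 7: [1, 3, 3, 3, 2, 2, 2]}[K]
    P = np.empty((K, n))
    for i in range(n):
        alpha = np.array(base, dtype=float)
        alpha[-1] += 3.0 * x1[i] + 0.1 * x2[i]   # 2 becomes 5 for cases
        P[:, i] = rng.dirichlet(alpha)

    # Latent signals and observed values, truncated to [0, 1].
    mean_u = mu[None] + np.einsum("jkl,il->ijk", beta, X)
    u = rng.normal(mean_u, 0.01)
    O = rng.normal(np.einsum("ijk,ki->ji", u, P), 0.01)
    return np.clip(O, 0, 1), X, P, mu, beta


O, X, P, mu, beta = simulate_hire_data(n=180, m=1000, K=3, K_sim=2)
print(O.shape, P.shape, int(beta[:, :, 0].any(axis=1).sum()))
```

The defaults use fewer CpG sites than the 10,000 considered in the paper so that the sketch runs quickly; m must be at least 40 + 10(K − K_sim) for the effect blocks to fit.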

Semi-simulated dataset including samples with known cell mix proportions

The GEO dataset GSE110554 (ref. 24) contains purified cell-type-specific methylation profiles for six cell types: neutrophils, monocytes, B cells, CD4+ T cells, CD8+ T cells, and NK cells. Moreover, GSE110554 includes mixed samples whose methylation signals were aggregated from the six cell types with predetermined cell mix proportions. Therefore, because both the cell types and the cellular proportions are known, GSE110554 is an ideal dataset with which to test HIRE’s performance.

In GSE110554, the number of mixed samples is much smaller than the typical size of an EWAS and, as discussed in the manuscript, HIRE usually requires hundreds of samples to obtain accurate and stable results. Therefore, to increase the sample size, we first generated a simulated methylation dataset with 600 samples using the purified methylation profiles. We focused on 10k CpG sites, including the 450 IDOL CpG sites, which were previously identified as the optimal library of CpG sites for estimation of leukocyte subtype proportions24, and another 9550 CpG sites whose methylation values across the purified cell types fell within the range of [0.2, 0.8] and had large standard deviations11. We then combined the 600 samples and six mixed samples (generated by method A)24 available in GSE110554 to compose a semi-simulated dataset.

After applying HIRE to the semi-simulated data, we annotated the estimated cell types based on the methylation profiles from GSE110554. Supplementary Figure 69 shows the heatmap for the Pearson correlation matrix between inferred cell types and the underlying truth. The correlation signals on the diagonal are the strongest in each row. HIRE successfully recovers the six underlying cell types. We also compared the estimated cellular compositions with the underlying true proportions for the six mixed samples. Each panel in Supplementary Fig. 70 displays a scatter plot between the cellular proportion estimates and the true mix proportions for a given cell type; they all indicate that HIRE obtains good estimates for cellular compositions.

Cell type matching protocol

Assume that we have the reference methylation profiles for $H$ annotated cell types. We first denote the methylation profile for cell type $h$ as $\phi_h = (\phi_{1h}, \ldots, \phi_{mh})$. We aim to annotate the estimated profile $\mu_k$ using the references. Following the previous study33, first, we calculate the cosine similarity, the Pearson correlation, and the Spearman correlation between $\mu_k$ and $\phi_h$ for each cell type $h \in \{1, \ldots, H\}$. Notice that the three similarity measures lie between −1 and 1, and a high positive value indicates great similarity between two vectors. Second, for each similarity measure $\ell$ ($\ell = 1, 2, 3$), we identify the cell type $h_\ell$ that has the maximal degree of similarity with $\mu_k$. If at least two out of the three similarity measures identify the same reference cell type $\tilde{h}$ and their corresponding similarity values are greater than 0.5, then we annotate $\mu_k$ with the reference cell type $\tilde{h}$. Otherwise, $\mu_k$ is believed to belong to a new cell type that is not included in the references. We repeat the above process for each methylation profile $\mu_k$ estimated from HIRE.
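The matching protocol can be sketched as follows. This is one possible reading of the rule, in which a reference is accepted when at least two measures both select it and exceed 0.5; the function names are hypothetical, and the simple rank helper ignores ties, which is adequate for continuous methylation values.

```python
import numpy as np


def _ranks(v):
    """Ranks 0..m-1 without tie handling (fine for continuous values)."""
    r = np.empty(len(v))
    r[np.argsort(v)] = np.arange(len(v))
    return r


def similarities(a, b):
    """Cosine similarity, Pearson correlation, and Spearman correlation."""
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pearson = np.corrcoef(a, b)[0, 1]
    spearman = np.corrcoef(_ranks(a), _ranks(b))[0, 1]
    return np.array([cosine, pearson, spearman])


def annotate(mu_k, references, names, threshold=0.5):
    """Return the matched reference name, or None for a putative new type.

    mu_k: (m,) estimated profile; references: (H, m) reference profiles.
    """
    # sims[h, l]: similarity measure l between mu_k and reference h.
    sims = np.array([similarities(mu_k, phi) for phi in references])
    best = sims.argmax(axis=0)   # best-matching reference for each measure
    for h in np.unique(best):
        agree = (best == h) & (sims[h] > threshold)
        if agree.sum() >= 2:
            return names[h]
    return None


# Toy example with random reference profiles.
rng = np.random.default_rng(0)
refs = rng.uniform(0, 1, size=(4, 500))
names = ["A", "B", "C", "D"]
print(annotate(refs[1] + rng.normal(0, 0.02, 500), refs, names))
print(annotate(rng.uniform(0, 1, 500), refs, names))
```

Requiring agreement between measures matters here because the cosine similarity between two non-negative vectors is often above 0.5 even when the vectors are unrelated, whereas the two correlations are then close to zero.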

Blood cell references

The two real data sets analyzed in our applications were obtained from whole blood. Therefore, we prepared the references from a whole blood methylation study34 with GEO accession code GSE35069. The study included seven isolated blood cell subpopulations (CD4+ T cells, CD8+ T cells, CD14+ monocytes, CD19+ B cells, CD56+ NK cells, neutrophils, and eosinophils) for six individuals. Accordingly, we define the reference profile $\phi_h$ for cell type $h$ as the average methylation profile of these individuals, i.e., $\phi_h := \frac{1}{6}\sum_{i=1}^{6} \phi_{hi}$.

Data preprocessing

The RA dataset is publicly available in GEO with accession number GSE42861. The dataset measures the methylation levels of whole blood. The methylation data have been normalized by Illumina’s control probe scaling procedure (see Liu et al.3 “Illumina 450K microarray data preprocessing” section for details). The dataset includes 689 samples, and the RA status, age, gender, smoking history, and batch information are available for each sample. We removed two samples, GSM1051535 and GSM1051691, because their smoking information is missing. CpG sites with a high methylation mean (>0.8) or a low methylation mean (<0.2) were discarded11,13. We adjusted the data for batch effects using COMBAT35. The correction process was justified because we did not observe a high degree of collinearity between the RA status and the batches (Supplementary Fig. 68). The 10,000 most variable CpG sites were kept. For the RA status, we denoted RA patients with 1 and the normal control subjects with 0; we represented men with 1 and women with 0; for the smoking history, we used (0, 0, 0) to refer to “never,” (1, 0, 0) to “ex,” (0, 1, 0) to “current,” and (0, 0, 1) to “occasional” smokers.

We downloaded the GALA II dataset from Gene Expression Omnibus (GEO) with accession number GSE77716. The dataset contains the whole-blood DNA methylation beta-values from 573 samples. The beta-values have been normalized by SWAN36 and corrected for batch effects by COMBAT35. There are two types of covariates: gender and ethnicity. Ethnicity includes Mexican, Mixed Latino, Puerto Rican, and Other Latino. Out of the 573 samples, one sample “GSM2057284” has no gender information, so we removed it. As suggested by previous studies11,13, CpG sites with a mean methylation value of less than 0.2 or higher than 0.8 were filtered out. We selected the 10,000 most variable of the remaining CpG sites. For gender, we denoted men with 1 and women with 0. For the ethnicity variables, we used three dummy variables to represent the four ethnicity categories. In particular, (0, 0, 0), (1, 0, 0), (0, 1, 0), and (0, 0, 1) corresponded to Mexican, Mixed Latino, Puerto Rican, and Other Latino, respectively.
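The site filtering and covariate coding used for both datasets can be sketched as below. This is an illustration under assumptions, not the authors' preprocessing script: the function names are hypothetical, variability is measured by the per-site variance, and normalization and batch correction are assumed to have been done beforehand.

```python
import numpy as np


def filter_cpg_sites(beta_values, low=0.2, high=0.8, n_keep=10_000):
    """Select CpG sites from an (m, n) matrix of beta-values.

    Drops sites whose mean is below `low` or above `high`, then keeps the
    `n_keep` most variable of the remaining sites. Returns sorted row indices.
    """
    means = beta_values.mean(axis=1)
    eligible = np.flatnonzero((means >= low) & (means <= high))
    variances = beta_values[eligible].var(axis=1)
    top = eligible[np.argsort(variances)[::-1][:n_keep]]
    return np.sort(top)


def dummy_code(labels, categories):
    """Dummy variables with the first category as the all-zero reference."""
    labels = np.asarray(labels)
    return np.column_stack([(labels == c).astype(int) for c in categories[1:]])


# Smoking history coded as in the RA dataset description.
levels = ["never", "ex", "current", "occasional"]
codes = dummy_code(["never", "ex", "current", "occasional"], levels)
print(codes.tolist())
```

The same dummy coding applies to the four ethnicity categories in the GALA II dataset, with Mexican as the reference level.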

For ReFACTor and EWASHer, according to their rules, we first filtered out CpG sites that were consistently hypomethylated or consistently hypermethylated and then regressed out the known covariates. We finally used the residuals to perform their analysis. Note that in their software these steps are processed automatically. For RefFreeEWAS, SVA, and the unadjusted analysis, the phenotypes and the covariates were regarded as the fixed effects in the regression model. In detail, for ReFACTor, in both GALA II and RA datasets, the cell type number “K” was specified to be six, which was the same as in their paper13. For RefFreeEWAS, we fixed the dimensionality of latent space “d” at six in the real data. For SVA, we also fixed the number of surrogate variables to six.

Gene enrichment analysis was carried out on the Broad Institute website http://software.broadinstitute.org/gsea/msigdb/annotate.jsp. The canonical pathways were selected as the basis gene sets, and only pathways with a false discovery rate of less than 0.05 were reported.

Identifiability of HIRE

Although the non-negative matrix factorization (NNMF) $O = \mu P$ has been widely applied in cell type deconvolution16, where $O$ is the observed methylation matrix, $\mu$ is the unknown methylation profile, and $P$ is the unknown cellular compositions, model identifiability is rarely discussed. During the review period of our paper, Rahmani et al.37 provided a setting under which the NNMF model is not identifiable.

Why then does NNMF always provide satisfactory cell type deconvolution results in real practice, and why can HIRE estimate all those parameters well? Here, we show mathematically that the HIRE model is identifiable under mild conditions that are easily met in reality.

Let us first introduce some notations and definitions. In the HIRE model, the whole parameter set is denoted by $\Theta := \{P_i, \mu_j, B_\ell^{(j)}, \sigma_{jk}^2, \sigma_{\epsilon j}^2 : 1 \le j \le m, 1 \le i \le n, 1 \le k \le K, 1 \le \ell \le q\}$, where $P_i$ is the cellular composition vector of sample $i$, $\mu_j$ is the baseline methylation vector of CpG site $j$, $B_\ell^{(j)}$ is the phenotype-$\ell$ effect vector on CpG site $j$, $\sigma_{jk}^2$ is the cell-type-$k$ noise variance on CpG site $j$, and $\sigma_{\epsilon j}^2$ is the overall noise variance on CpG site $j$.

The observed data in our study are the methylation matrix $O = \{O_{ji} : 1 \le i \le n, 1 \le j \le m\}$ and the covariate matrix $X = (x_1, \ldots, x_\ell, \ldots, x_q)$, where $x_\ell$ is the column vector that indicates phenotype $\ell$ for the $n$ samples. The observed likelihood function is

$$L(\Theta \mid O) = \prod_{i=1}^{n} \prod_{j=1}^{m} N\left(O_{ji} : P_i^T \mu_j + \sum_{\ell=1}^{q} x_{i\ell} P_i^T B_\ell^{(j)},\ \sum_{k=1}^{K} \sigma_{jk}^2 P_{ki}^2 + \sigma_{\epsilon j}^2\right)$$

(see Eq. (S7) in the Supplementary Methods), where $N(O : \eta, \tau^2)$ indicates the normal density with mean $\eta$ and variance $\tau^2$ at value $O$.

We further define $1_K = (1, 1, \ldots, 1)^T$ as a $K$-dimensional column vector with all entries being one, an $n$ by $K$ matrix $J_1$ as $1_n 1_K^T$, and an $n$ by $K$ matrix $J_{x_\ell}$ as $x_\ell 1_K^T$ for each $1 \le \ell \le q$. We use $\odot$ to represent the entry-wise matrix product for two matrices $M$ and $N$ with the same dimension, i.e., $(M \odot N)_{ij} := M_{ij} N_{ij}$.

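Theorem 1. If (a) for each cell type $k$, there exists a CpG site $r_k$ such that $B_\ell^{(r_k)} = 0$ for any phenotype $\ell$ and $\mu_{r_k k} = 1$ while $\mu_{r_k k'} = 0$ for $k' \neq k$, and (b) the cellular compositions $P$ satisfy $\mathrm{rank}\left((J_1 \odot P^T, J_{x_1} \odot P^T, \ldots, J_{x_\ell} \odot P^T, \ldots, J_{x_q} \odot P^T)\right) = (q + 1)K$ and $\mathrm{rank}\left((1_n, P^T) \odot (1_n, P^T)\right) = K + 1$, then the HIRE model is identifiable. In other words, $L(\Theta \mid O) = L(\tilde{\Theta} \mid O)$ for any $O$ implies $\Theta = \tilde{\Theta}$.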

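Proof: First, by integrating out all $O$ elements except $O_{ji}$, $L(\Theta \mid O) = L(\tilde{\Theta} \mid O)$ implies

$$N\left(O_{ji} : P_i^T \mu_j + \sum_{\ell=1}^{q} x_{i\ell} P_i^T B_\ell^{(j)},\ \sum_{k=1}^{K} \sigma_{jk}^2 P_{ki}^2 + \sigma_{\epsilon j}^2\right) = N\left(O_{ji} : \tilde{P}_i^T \tilde{\mu}_j + \sum_{\ell=1}^{q} x_{i\ell} \tilde{P}_i^T \tilde{B}_\ell^{(j)},\ \sum_{k=1}^{K} \tilde{\sigma}_{jk}^2 \tilde{P}_{ki}^2 + \tilde{\sigma}_{\epsilon j}^2\right).$$

Because the univariate normal distribution is identifiable, we have

$$P_i^T \mu_j + \sum_{\ell=1}^{q} x_{i\ell} P_i^T B_\ell^{(j)} = \tilde{P}_i^T \tilde{\mu}_j + \sum_{\ell=1}^{q} x_{i\ell} \tilde{P}_i^T \tilde{B}_\ell^{(j)}, \tag{8}$$

$$\sum_{k=1}^{K} \sigma_{jk}^2 P_{ki}^2 + \sigma_{\epsilon j}^2 = \sum_{k=1}^{K} \tilde{\sigma}_{jk}^2 \tilde{P}_{ki}^2 + \tilde{\sigma}_{\epsilon j}^2. \tag{9}$$

Taking $j = r_k$ in Eq. (8), we have $\mathrm{LHS} = P_i^T \mu_{r_k} + \sum_{\ell=1}^{q} x_{i\ell} P_i^T B_\ell^{(r_k)} = P_i^T \mu_{r_k} = 0 + P_{ki} \cdot 1 + 0 = P_{ki}$ and similarly $\mathrm{RHS} = \tilde{P}_{ki}$, so $P_{ki} = \tilde{P}_{ki}$, which holds for any $i$ and $k$. Hence, we obtain $P = \tilde{P}$. Next, we rewrite Eq. (8) into a matrix form.

$$\left(P_i^T, x_{i1} P_i^T, \ldots, x_{iq} P_i^T\right) \begin{pmatrix} \mu_j \\ B_1^{(j)} \\ \vdots \\ B_q^{(j)} \end{pmatrix} = \left(P_i^T, x_{i1} P_i^T, \ldots, x_{iq} P_i^T\right) \begin{pmatrix} \tilde{\mu}_j \\ \tilde{B}_1^{(j)} \\ \vdots \\ \tilde{B}_q^{(j)} \end{pmatrix}, \quad i = 1, \ldots, n.$$

By combining these $n$ equations, it follows that

$$\left(J_1 \odot P^T, J_{x_1} \odot P^T, \ldots, J_{x_\ell} \odot P^T, \ldots, J_{x_q} \odot P^T\right) \begin{pmatrix} \mu_j \\ B_1^{(j)} \\ \vdots \\ B_q^{(j)} \end{pmatrix} = \left(J_1 \odot P^T, J_{x_1} \odot P^T, \ldots, J_{x_\ell} \odot P^T, \ldots, J_{x_q} \odot P^T\right) \begin{pmatrix} \tilde{\mu}_j \\ \tilde{B}_1^{(j)} \\ \vdots \\ \tilde{B}_q^{(j)} \end{pmatrix}. \tag{10}$$

Because the rank of $A := (J_1 \odot P^T, J_{x_1} \odot P^T, \ldots, J_{x_\ell} \odot P^T, \ldots, J_{x_q} \odot P^T)$ is $(q + 1)K$ (full column rank), $A$ has a left inverse $A^{-1}$. Multiplying Eq. (10) by $A^{-1}$ from the left on both sides, we obtain $\mu_j = \tilde{\mu}_j$ and $B_\ell^{(j)} = \tilde{B}_\ell^{(j)}$ for $1 \le \ell \le q$. Therefore, we have $\mu = \tilde{\mu}$ and $B = \tilde{B}$.

In addition, because Eq. (9) holds for any $i$, we can also rewrite it into a matrix form.

$$\begin{pmatrix} 1 & P_{11}^2 & \cdots & P_{K1}^2 \\ 1 & P_{12}^2 & \cdots & P_{K2}^2 \\ \vdots & \vdots & & \vdots \\ 1 & P_{1n}^2 & \cdots & P_{Kn}^2 \end{pmatrix} \begin{pmatrix} \sigma_{\epsilon j}^2 \\ \sigma_{j1}^2 \\ \vdots \\ \sigma_{jK}^2 \end{pmatrix} = \begin{pmatrix} 1 & P_{11}^2 & \cdots & P_{K1}^2 \\ 1 & P_{12}^2 & \cdots & P_{K2}^2 \\ \vdots & \vdots & & \vdots \\ 1 & P_{1n}^2 & \cdots & P_{Kn}^2 \end{pmatrix} \begin{pmatrix} \tilde{\sigma}_{\epsilon j}^2 \\ \tilde{\sigma}_{j1}^2 \\ \vdots \\ \tilde{\sigma}_{jK}^2 \end{pmatrix}$$

The left matrix is equal to $(1_n, P^T) \odot (1_n, P^T)$, which has full column rank; therefore, it has a left inverse. Consequently, $\sigma_{\epsilon j}^2 = \tilde{\sigma}_{\epsilon j}^2$ and $\sigma_{jk}^2 = \tilde{\sigma}_{jk}^2$. As a result, $\Theta = \tilde{\Theta}$, and we have proven the identifiability of HIRE.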

Conditions (a) and (b) are easily met for DNA methylation data. Condition (a) requires that for each cell type $k$, there exists a CpG site that is not associated with any phenotype and is methylated only in cell type $k$ and not in any other cell type. Given the 450K CpG sites assayed by the microarray, we can expect such CpG sites to exist. Moreover, condition (a) can also be relaxed to the condition that for each cell type $k$, either there exists a CpG site $r_k$ such that $B_\ell^{(r_k)} = 0$ for any phenotype $\ell$ and $\mu_{r_k k} = 1$ while $\mu_{r_k k'} = 0$ for $k' \neq k$, or there exists a CpG site $r_k$ such that $B_\ell^{(r_k)} = 0$ for any phenotype $\ell$ and $\mu_{r_k k} = 0$ while $\mu_{r_k k'} = 1$ for $k' \neq k$. The proof follows in a similar manner.

For condition (b), intuitively, the rank requirement on $(1_n, P^T) \odot (1_n, P^T)$ asks the cellular compositions to vary across subjects, which guards against the case in which all the subjects have the same cellular compositions and hence no cell type deconvolution is possible; the rank requirement on $(J_1 \odot P^T, J_{x_1} \odot P^T, \ldots, J_{x_\ell} \odot P^T, \ldots, J_{x_q} \odot P^T)$ is the same as that in a standard linear regression, which requires that no collinearity exists among the covariates. Because the sample size $n$ is much larger than the underlying cell type number $K$ and the phenotype number $q$, the two rank requirements can commonly be satisfied in reality.
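Condition (b) can be checked numerically for a given composition matrix and set of phenotypes. The sketch below is an illustration with a hypothetical function name; it relies on the default tolerance of NumPy's matrix_rank, so nearly collinear inputs may need a closer look.

```python
import numpy as np


def check_condition_b(P, X):
    """Check the two rank requirements of condition (b).

    P: (K, n) cellular compositions; X: (n, q) phenotypes.
    Returns (rank of A, rank of the squared matrix, whether both hold).
    """
    K, n = P.shape
    q = X.shape[1]
    Pt = P.T   # n x K
    # A = (J_1 ⊙ P^T, J_{x_1} ⊙ P^T, ..., J_{x_q} ⊙ P^T); the block for
    # phenotype l scales row i of P^T by x_il, and J_1 ⊙ P^T is P^T itself.
    A = np.hstack([Pt] + [X[:, [l]] * Pt for l in range(q)])
    # (1_n, P^T) ⊙ (1_n, P^T): a column of ones and the squared proportions.
    S = np.hstack([np.ones((n, 1)), Pt]) ** 2
    rank_A = int(np.linalg.matrix_rank(A))
    rank_S = int(np.linalg.matrix_rank(S))
    return rank_A, rank_S, (rank_A == (q + 1) * K and rank_S == K + 1)


rng = np.random.default_rng(0)
n, K = 50, 3
X = np.column_stack([rng.integers(0, 2, n), rng.uniform(20, 50, n)])

P_varying = rng.dirichlet([4, 4, 3], size=n).T
P_constant = np.tile(np.array([[0.5], [0.3], [0.2]]), (1, n))

print(check_condition_b(P_varying, X))
print(check_condition_b(P_constant, X))
```

When every sample has the same composition, both matrices lose rank, which corresponds to the situation in which no deconvolution is possible.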

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Acknowledgements

X.L. was supported in part by Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (19XNLG08), and the fund for building world-class universities (disciplines) of Renmin University of China. X.L. is grateful for the Hong Kong Ph.D. Fellowship (PF13-11656) from the Hong Kong Research Grants Council when X.L. was a Ph.D. student at the Chinese University of Hong Kong. C.Y. was supported in part by the National Science Funding of China [61501389]; the Hong Kong Research Grants Council [22302815, 12316116, 12301417 and 16307818]; and The Hong Kong University of Science and Technology [startup grant R9405 and IGN17SC02]. Y.W. was supported in part by the Early Career Scheme 24301416 and General Research Fund 14306417 from the Research Grants Council of the Hong Kong Special Administrative Region and Direct Grants from the Research Committee of the Chinese University of Hong Kong. We acknowledge Mingxuan Cai for his contribution to part of the HIREewas code. We are grateful to the High-Performance Computing Platform of Renmin University of China and the Department of Statistics at the Chinese University of Hong Kong for providing computing resources.

Author contributions

Y.W. and C.Y. conceived the study. X.L. and Y.W. developed the method. Y.W. and X.L. proved the model identifiability. X.L. implemented the algorithm and prepared the software package. X.L. and C.Y. analyzed the data. X.L., Y.W., and C.Y. wrote the paper.

Data availability

The RA whole blood methylation dataset is available in the Gene Expression Omnibus (GEO) with the accession number GSE42861. The GALA II whole blood methylation dataset can be downloaded from GEO with the accession number GSE77716. The accession number for the blood cell references is GSE35069. The purified methylation data and mixed samples used to generate the semisimulated dataset are taken from GSE110554.

Code availability

The software and detailed documentation are available on Bioconductor on the HIREewas package page [http://www.bioconductor.org/packages/release/bioc/html/HIREewas.html].

Competing interests

The authors declare no competing interests.

Footnotes

Peer review information: Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Can Yang, Email: macyang@ust.hk.

Yingying Wei, Email: yweicuhk@gmail.com.

Supplementary information

Supplementary Information accompanies this paper at 10.1038/s41467-019-10864-z.

References

  • 1.Rakyan VK, Down TA, Balding DJ, Beck S. Epigenome-wide association studies for common human diseases. Nat. Rev. Genet. 2011;12:529–541. doi: 10.1038/nrg3000. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Verma M. Epigenome-wide association studies (EWAS) in cancer. Curr. Genom. 2012;13:308–313. doi: 10.2174/138920212800793294. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Liu Y, et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol. 2013;31:142–147. doi: 10.1038/nbt.2487. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Gao X, Jia M, Zhang Y, Breitling LP, Brenner H. DNA methylation changes of whole blood cells in response to active smoking exposure in adults: a systematic review of DNA methylation studies. Clin. Epigenetics. 2015;7:113. doi: 10.1186/s13148-015-0148-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Joehanes R, et al. Epigenetic signatures of cigarette smoking. Circ. Cardiovasc. Genet. 2016;9:436–447. doi: 10.1161/CIRCGENETICS.116.001506. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Wahl S, et al. Epigenome-wide association study of body mass index, and the adverse outcomes of adiposity. Nature. 2017;541:81–86. doi: 10.1038/nature20784. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Teschendorff AE, et al. Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 2010;20:440–446. doi: 10.1101/gr.103606.109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Horvath S. DNA methylation age of human tissues and cell types. Genome Biol. 2013;14:3156. doi: 10.1186/gb-2013-14-10-r115. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Houseman EA, et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics. 2012;13:86. doi: 10.1186/1471-2105-13-86. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Jaffe AE, Irizarry RA. Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol. 2014;15:R31. doi: 10.1186/gb-2014-15-2-r31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Zou J, Lippert C, Heckerman D, Aryee M, Listgarten J. Epigenome-wide association studies without the need for cell-type composition. Nat. Methods. 2014;11:309–311. doi: 10.1038/nmeth.2815. [DOI] [PubMed] [Google Scholar]
  • 12.McGregor K, et al. An evaluation of methods correcting for cell-type heterogeneity in DNA methylation studies. Genome Biol. 2016;17:84. doi: 10.1186/s13059-016-0935-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Rahmani E, et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods. 2016;13:443–445. doi: 10.1038/nmeth.3809. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Teschendorff AE, Relton CL. Statistical and integrative system-level analysis of DNA methylation data. Nat. Rev. Genet. 2017;19:129–147. doi: 10.1038/nrg.2017.86. [DOI] [PubMed] [Google Scholar]
  • 15.Accomando WP, Wiencke JK, Houseman EA, Nelson HH, Kelsey KT. Quantitative reconstruction of leukocyte subsets using DNA methylation. Genome Biol. 2014;15:R50. doi: 10.1186/gb-2014-15-3-r50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Houseman EA, et al. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics. 2016;17:259. doi: 10.1186/s12859-016-1140-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Zheng X, Zhang N, Wu H-J, Wu H. Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol. 2017;18:17. doi: 10.1186/s13059-016-1143-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Leek JT, Storey JD. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 2007;3:e161. doi: 10.1371/journal.pgen.0030161. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Houseman EA, Molitor J, Marsit CJ. Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics. 2014;30:1431–1439. doi: 10.1093/bioinformatics/btu029. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Pan W, Shen X. Penalized model-based clustering with application to variable selection. J. Mach. Learn. Res. 2007;8:1145–1164. [Google Scholar]
  • 21.Zheng SC, et al. Correcting for cell-type heterogeneity in epigenome-wide association studies: revisiting previous analyses. Nat. Methods. 2017;14:216–217. doi: 10.1038/nmeth.4187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Rahmani E, et al. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nat. Methods. 2017;14:218–219. doi: 10.1038/nmeth.4190. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci. USA. 2003;100:9440–9445. doi: 10.1073/pnas.1530509100. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Salas LA, et al. An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray. Genome Biol. 2018;19:64. doi: 10.1186/s13059-018-1448-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Neumann E, et al. Identification of differentially expressed genes in rheumatoid arthritis by a combination of complementary DNA array and RNA arbitrarily primed-polymerase chain reaction. Arthritis Rheumatol. 2002;46:52–63. doi: 10.1002/1529-0131(200201)46:1<52::AID-ART10048>3.0.CO;2-1. [DOI] [PubMed] [Google Scholar]
  • 26.Fasanelli F, et al. Hypomethylation of smoking-related genes is associated with future lung cancer in four prospective cohorts. Nat. Commun. 2015;6:10192. doi: 10.1038/ncomms10192. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Ambatipudi S, et al. Tobacco smoking-associated genome-wide DNA methylation changes in the EPIC study. Epigenomics. 2016;8:599–618. doi: 10.2217/epi-2016-0001. [DOI] [PubMed] [Google Scholar]
  • 28.Pino-Yanes M, et al. Genetic ancestry influences asthma susceptibility and lung function among Latinos. J. Allergy Clin. Immunol. 2015;135:228–235. doi: 10.1016/j.jaci.2014.07.053. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Li Y, et al. c-Myb enhances breast cancer invasion and metastasis through the Wnt/β-catenin/Axin2 pathway. Cancer Res. 2016;76:3364–3375. doi: 10.1158/0008-5472.CAN-15-2302. [DOI] [PubMed] [Google Scholar]
  • 30.Zhang X-H, Tee LY, Wang X-G, Huang Q-S, Yang S-H. Off-target effects in CRISPR/Cas9-mediated genome engineering. Mol. Ther. Nucleic Acids. 2015;4:e264. doi: 10.1038/mtna.2015.37. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Bolstad BM, Irizarry RA, Åstrand M, Speed TP. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics. 2003;19:185–193. doi: 10.1093/bioinformatics/19.2.185. [DOI] [PubMed] [Google Scholar]
  • 32.Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B (Methodol.) 1977;39:1–38. [Google Scholar]
  • 33.Kiselev VYu, Yiu A, Hemberg M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods. 2018;15:359. doi: 10.1038/nmeth.4644. [DOI] [PubMed] [Google Scholar]
  • 34.Reinius LE, et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS ONE. 2012;7:e41361. doi: 10.1371/journal.pone.0041361. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Johnson WE, Li C, Rabinovic A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics. 2007;8:118–127. doi: 10.1093/biostatistics/kxj037. [DOI] [PubMed] [Google Scholar]
  • 36.Maksimovic J, Gordon L, Oshlack A. SWAN: Subset-quantile within array normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biol. 2012;13:R44. doi: 10.1186/gb-2012-13-6-r44. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Rahmani E, et al. BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference. Genome Biol. 2018;19:141. doi: 10.1186/s13059-018-1513-2. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Peer Review File (4.6MB, pdf)
Reporting Summary (67.6KB, pdf)
Source Data (38.9MB, zip)

Data Availability Statement

The RA whole blood methylation dataset is available in the Gene Expression Omnibus (GEO) with the accession number GSE42861. The GALA II whole blood methylation dataset can be downloaded from GEO with the accession number GSE77716. The accession number for the blood cell references is GSE35069. The purified methylation data and mixed samples used to generate the semisimulated dataset are taken from GSE110554.

The software and detailed documentation are available on Bioconductor at the HIREewas package page [http://www.bioconductor.org/packages/release/bioc/html/HIREewas.html].


Articles from Nature Communications are provided here courtesy of Nature Publishing Group
