Bioinformatics. 2020 Dec 11;37(8):1052–1059. doi: 10.1093/bioinformatics/btaa930

Complete deconvolution of DNA methylation signals from complex tissues: a geometric approach

Weiwei Zhang 1, Hao Wu 2, Ziyi Li 3
Editor: Elofsson Arne
PMCID: PMC8150138  PMID: 33135072

Abstract

Motivation

It is common practice in epigenetics research to profile DNA methylation in tissue samples, which are usually mixtures of different cell types. To properly account for the mixture, estimating cell compositions has been recognized as an important first step. Many methods have been developed for quantifying cell compositions from DNA methylation data, but most have limited applicability because reference panels or prior information are often lacking.

Results

We develop Tsisal, a novel complete deconvolution method that accurately estimates cell compositions from DNA methylation data without any prior knowledge of cell types or their proportions. Tsisal is a full pipeline that estimates the number of cell types and the cell compositions, and identifies cell-type-specific CpG sites. It can also assign cell type labels when a full or partial reference panel is available. Extensive simulation studies and analyses of seven real datasets demonstrate the favorable performance of our proposed method compared with existing deconvolution methods serving a similar purpose.

Availability and implementation

The proposed method Tsisal is implemented as part of the R/Bioconductor package TOAST at https://bioconductor.org/packages/TOAST.

Supplementary information

Supplementary data are available at Bioinformatics online.

1 Introduction

DNA methylation plays important roles in cell development and gene regulation (Bird, 2002; Smith et al., 2013). Aberration of DNA methylation is closely associated with human disease (Robertson, 2005). A majority of DNA methylation studies are performed on samples from humans or model animals. The samples, including solid tissue and blood, are usually mixtures of many distinct cell types. This mixing complicates the data analysis. For example, it has been discovered that the mixing proportions are sometimes confounded with the experimental factor of interest (e.g. age), resulting in many false positives in Epigenome-wide Association Studies (Jaffe et al., 2014). On the other hand, cell composition has been reported to be associated with disease progression and treatment response (Giannakopoulos et al., 2003; Ino et al., 2013), hence the composition is itself of great interest in clinical research. Thus, estimating and accounting for cell composition are important tasks in DNA methylation data analyses.

Experimental techniques, such as Fluorescence-Activated Cell Sorting (FACS) and single cell RNA sequencing, have been developed to quantify cell compositions. However, they are too laborious and expensive to be applied in large-scale studies. As an alternative, a number of computational methods, referred to as ‘deconvolution’ methods, have been developed for DNA methylation data.

The existing deconvolution methods can be classified as ‘reference-based’ (Houseman et al., 2012; Liu et al., 2013), ‘reference-free’ (Houseman et al., 2014; Rahmani et al., 2016; Zou et al., 2014) and ‘partial reference-free’ (Rahmani et al., 2018). Reference-based methods require a reference panel, i.e. cell-type-specific DNA methylation profiles, while reference-free and partial reference-free methods don’t require such data. It was reported that the reference-based methods in general provide more accurate and robust estimation than reference-free methods (Teschendorff et al., 2017). However, the application of reference-based methods is very limited due to the availability of the reference panel. Existing reference data are only available for a few tissue types (blood, breast and brain) from a small number of individuals. Even for the well-studied blood sample, if the population under study does not match the reference individuals in their methylation-altering factors, such as age, gender and genetics, the results could be impacted by potential biases and become unreliable.

Reference-free methods have the advantage that they do not require reference panels and are therefore applicable to any tissue type. However, determining the correct number of cell types in the mixture and assigning cell type labels to the decomposed components are often hurdles in real applications of reference-free methods. Rahmani et al. (2018) developed a promising Bayesian model that incorporates prior knowledge of cell composition in deconvolution. However, such prior knowledge only exists for a small number of well-studied tissues, which limits its application to real data. Another seminal work, Linseed (Zaitsev et al., 2019), adopts a geometric approach to completely deconvolve gene expression data, aiming at the complete spectrum of cell types present in the mixture rather than a few major cell types. Their idea provides novel insights into estimating cell composition from gene expression, but has not been applied to DNA methylation data because of the following technical difficulties. First, the number of features in DNA methylation data is much larger than that in gene expression data. As a result, feature selection in DNA methylation cannot simply mimic the mutual-linearity method used by Linseed for gene expression. Second, Linseed estimates cell composition using DSA (Zhong et al., 2013), a marker-guided deconvolution algorithm designed solely for gene expression; no counterpart has been developed for DNA methylation. Last but not least, cell-type-specific DNA methylation sites are relatively under-studied compared with the well-understood cell-type-specific genes in gene expression data. This increases the difficulty of identifying cell-type-specific features in DNA methylation.

In this article, we develop a complete deconvolution method for DNA methylation data. To address the problems described above, our method provides a full pipeline that chooses the most plausible number of cell types, estimates cell compositions and identifies cell-type-specific CpG sites without the need for any prior information. We use our previously developed method TOAST (Li et al., 2019) to resolve the difficulty in feature selection. The deconvolution step uses SISAL (Bioucasdias, 2009), a geometric algorithm designed for image processing, to solve the simplex constructed from the DNA methylation data. When reference panels are available, we provide a data-driven algorithm to automatically assign cell type labels as the last step of our pipeline. This step can assign cell labels accurately even if the reference panels come from a different population or are very noisy. We call our method ‘Tsisal’, a complete deconvolution method based on TOAST and SISAL for DNA methylation data. Extensive simulation studies and seven large methylation datasets are used to evaluate the proposed method.

2 Materials and methods

We assume the input data are beta values from DNA methylation microarrays, which are more affordable and more widely applied in large-scale studies than bisulfite sequencing. Let Y be an m×n matrix representing DNA methylation data for n samples and m CpG sites. The values in Y are constrained to the unit interval [0,1]. We assume that the samples are mixtures of K cell types. Most deconvolution methods seek the optimal solution of the matrix factorization Y=WH. Here W is an m×K matrix giving the methylation levels of the m CpG sites in each of the K cell types, and the elements of W are bounded by 0 and 1. H is a K×n matrix giving the mixing proportions of the K cell types in each of the n samples. We require the entries of H to be non-negative and every column to sum to one.
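
As a minimal illustration of the mixing model Y=WH and the constraints on W and H, the following Python sketch builds a toy example. The numbers and the helper names `matmul` and `check_constraints` are hypothetical and are not part of Tsisal or TOAST.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def check_constraints(w, h, tol=1e-9):
    """Return True if W lies in [0, 1] and every column of H is a proportion vector."""
    w_ok = all(0.0 <= x <= 1.0 for row in w for x in row)
    h_nonneg = all(x >= 0.0 for row in h for x in row)
    col_sums = [sum(row[j] for row in h) for j in range(len(h[0]))]
    h_sums_ok = all(abs(s - 1.0) <= tol for s in col_sums)
    return w_ok and h_nonneg and h_sums_ok


# Toy example (made-up values): m = 3 CpG sites, K = 2 cell types, n = 2 samples.
W = [[0.9, 0.1],
     [0.2, 0.8],
     [0.5, 0.5]]
H = [[0.3, 0.6],
     [0.7, 0.4]]
Y = matmul(W, H)  # each column of Y is a convex combination of the columns of W
```

Because every column of H is a set of convex weights, each mixed value in Y stays inside [0,1] whenever W does.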

Figure 1 shows the workflow of Tsisal, which consists of four main steps. Given the raw data matrix, the first step is to choose the plausible number of cell types K. With the estimated K, we adopt our previously developed cross-cell type differential analysis to select cell-type-specific features. The third step transforms the raw data of the selected features to construct a (K-1)-simplex, and then estimates the cell-type proportions through identifying the corners of this simplex geometrically. Finally, if a reference panel is available, even imperfect as in circumstances where the data is collected from one population but the reference panel is from a different one, our method assigns labels to the estimated anonymous cell types through a data-driven approach. The four steps are described in detail below.

Fig. 1.


Overview of Tsisal. Tsisal takes the DNA methylation data from complex samples as input, and outputs cell-type proportions. Letters indicate analysis steps: (a) estimating the number of cell types; (b) feature selection; (c) simplex corner identification; (d) cell type label assignment

2.1 Estimating the number of cell types

The number of cell types present in complex samples is usually not known. Current methods mostly rely on prior knowledge and specify the number of cell types in an ad hoc way. Here we develop a data-driven method based on the Akaike information criterion (AIC) to estimate the number of cell types. AIC has previously been used in epigenomic deconvolution of breast tumors and has shown reliable results (Onuchic et al., 2016). Given the raw data matrix Y, for a range of possible cell type numbers k, we apply our deconvolution algorithm (described later) to estimate the cell-type-specific methylation profile $\hat{W}$ and the proportion matrix $\hat{H}$. With the estimated $\hat{W}$ and $\hat{H}$, we measure the deconvolution accuracy by the sum of squared residuals between the observed and reconstructed data, i.e. $\mathrm{SSR}(k) = \lVert Y - \hat{W}\hat{H} \rVert_F^2$. The AIC is defined as

$$\mathrm{AIC}(k) = l \ln\!\left(\frac{\mathrm{SSR}(k)}{l}\right) + 2p(k) + \frac{2p(k)\,\bigl(p(k)+1\bigr)}{l - p(k) - 1},$$

where l=m×n is the number of observations in Y, p(k)=k(m+n) is the number of parameters to be estimated assuming there are k distinct cell types. The first term of AIC represents the model accuracy while the second and third terms are the penalties for increasing the cell type numbers. Such penalty is commonly imposed to avoid overfitting. We choose k that minimizes AIC to be the number of cell types in the mixture, since the most plausible k should achieve a balance between the estimated error and number of model parameters. In this work we assume that the number of cell types is between 2 and 15. We allow users to specify or change the search range of the number of cell types in our software.
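
The selection rule can be sketched in a few lines of Python, reading the criterion as the small-sample corrected AIC, l ln(SSR(k)/l) + 2p(k) + 2p(k)(p(k)+1)/(l - p(k) - 1). The SSR values below are made up for illustration, and the function names are hypothetical rather than taken from the TOAST package.

```python
import math


def n_params(k, m, n):
    """Number of free parameters p(k) = k(m + n) for k cell types."""
    return k * (m + n)


def aic(ssr, k, m, n):
    """Corrected AIC for a k-cell-type fit with sum of squared residuals `ssr`."""
    l = m * n
    p = n_params(k, m, n)
    if l - p - 1 <= 0:
        raise ValueError("too many parameters for the number of observations")
    return l * math.log(ssr / l) + 2 * p + 2 * p * (p + 1) / (l - p - 1)


def choose_k(ssr_by_k, m, n):
    """Return the candidate k with the smallest AIC."""
    return min(ssr_by_k, key=lambda k: aic(ssr_by_k[k], k, m, n))


# Hypothetical SSR values for m = 1000 selected CpGs and n = 100 samples.
ssr_by_k = {2: 60.0, 3: 35.0, 4: 12.0, 5: 11.8, 6: 11.7}
best_k = choose_k(ssr_by_k, m=1000, n=100)
```

In this toy setting the fit improves sharply up to k=4 and only marginally afterwards, so the penalty terms outweigh the small gain in SSR and k=4 is selected.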

2.2 Feature selection

Like most deconvolution methods, ours needs to select a list of informative CpG sites. Selecting the features with the largest variances has been widely used by deconvolution methods. However, feature variance contains both within- and across-cell-type variance (Wang et al., 2019), so selection based solely on variance cannot identify the cell-type-specific features that we desire. Another possible solution is to take advantage of the mutual-linearity property described in Zaitsev et al. (2019) to filter out uninformative features. The difficulty of applying this method to DNA methylation is that methylation data contain far more CpG sites than there are genes in gene expression data, so evaluating pairwise correlations of CpG sites is computationally prohibitive.

Recognizing the above difficulties, we apply our previously developed feature selection method TOAST (Li et al., 2019). It improves feature selection by iteratively conducting cross-cell type differential analysis, so that the features selected at the end are those demonstrating the highest cross-cell type differences. Unlike Li et al. (2019), which uses all selected features in the deconvolution analysis, our deconvolution step (described in Section 2.3) relies more on the features that demonstrate stronger cell-type specificity, i.e. the CpGs located nearest to the corners of the simplex. As a result, Tsisal could be more robust than TOAST to noisy features that are not cell type specific. This step selects a list of around 1000 CpG sites for subsequent deconvolution.

2.3 Simplex corner identification

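We denote the observed data for the selected features by $Y_M$. $Y_M$ is a submatrix of $Y$ and $Y_M = W_M H$. Each row of $Y_M$, denoted by $y_{i\cdot}$, is a linear combination of the rows $h_{k\cdot}$ of $H$, i.e. of the proportions of the $k$th cell type across all samples, for $k = 1, \ldots, K$:

$$y_{i\cdot} = w_{i1} h_{1\cdot} + \cdots + w_{iK} h_{K\cdot}.$$

After row-normalizing $Y_M$ and transposing it to $Y_M^T$, the $w_{ik}$'s are normalized to $\tilde{w}_{ik} = w_{ik} / \sum_{k'=1}^{K} w_{ik'}$, and solving for $H$ reduces to a standard $(K-1)$-simplex problem:

$$Y_M^T = H_p^T W_M^T, \quad W_M^T \in \mathcal{W}_K^{n}, \quad \mathcal{W}_K = \{ w \in \mathbb{R}^K : w \succeq 0,\ \mathbf{1}_K^T w = 1 \}. \quad (*)$$

Here

$$H_p = \begin{pmatrix} h^p_{1,1} & \cdots & h^p_{1,n} \\ \vdots & & \vdots \\ h^p_{K,1} & \cdots & h^p_{K,n} \end{pmatrix}$$

denotes the simplex vertices, i.e. the row-normalized proportion matrix $H$, with one vertex per row. The proportion matrix $H$ is calculated as

$$H = \begin{pmatrix} \alpha_1 h^p_{1,1} & \cdots & \alpha_1 h^p_{1,n} \\ \vdots & & \vdots \\ \alpha_K h^p_{K,1} & \cdots & \alpha_K h^p_{K,n} \end{pmatrix},$$

where the coefficients $\alpha_1, \ldots, \alpha_K$ are the solution of $(\alpha_1, \ldots, \alpha_K) \times H_p = (1, 1, \ldots, 1, 1)$.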
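
To make the rescaling step concrete, the sketch below recovers the coefficients α and the proportion matrix H from a set of vertices Hp by solving the linear system in the least-squares sense. It assumes Hp is stored as a K×n list of rows (one vertex per row); the solver and the function names are illustrative choices, not the routine used in Tsisal.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    k = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            f = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * k
    for i in reversed(range(k)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, k))
        x[i] = (aug[i][k] - s) / aug[i][i]
    return x


def rescale_vertices(hp):
    """Find alpha with sum_k alpha_k * hp[k][j] = 1 for every sample j
    (least squares via the normal equations) and return (alpha, H)."""
    k, n = len(hp), len(hp[0])
    gram = [[sum(hp[a][j] * hp[b][j] for j in range(n)) for b in range(k)]
            for a in range(k)]
    rhs = [sum(hp[a]) for a in range(k)]
    alpha = solve_linear(gram, rhs)
    h = [[alpha[a] * hp[a][j] for j in range(n)] for a in range(k)]
    return alpha, h


# Toy check: row-normalize a known H to get Hp, then recover H from Hp alone.
H_true = [[0.2, 0.5, 0.9],
          [0.8, 0.5, 0.1]]
Hp = [[x / sum(row) for x in row] for row in H_true]
alpha, H_rec = rescale_vertices(Hp)
```

In this toy case the recovered α equals the row sums of the original H (1.6 and 1.4), which is what undoing the row normalization requires.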

There is an extensive literature on methods for identifying the corners of a standard simplex, including vertex component analysis (VCA) (Nascimento et al., 2005), the pixel purity index (PPI) algorithm (Boardman, 1993), simplex identification via split augmented Lagrangian (SISAL) (Bioucasdias, 2009) and many others. Among these methods, SISAL has been reported to be one of the most robust and accurate in a comparison with more than 10 other methods (Bioucasdias et al., 2012). In general, identifying the vertices of a given simplex geometrically is a constrained optimization problem. SISAL replaces the hard positivity constraints commonly used in other methods with hinge-type soft constraints, which makes it more robust to noise and faster to compute.

We apply SISAL to identify vertices of the simplex described in (*). The vertices of this simplex Hp are used to compute the cell-type proportions H. We then geometrically evaluate whether a CpG site is cell type-specific by calculating the Euclidean distance between CpG sites and each vertex. The closest CpG sites of every vertex are deemed cell type-specific and will be used in labeling cell types. The cell type-specific CpG sites will also be output for other downstream analysis.
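
The distance-based marker selection can be illustrated as follows. This is a sketch under the assumption that both the CpG rows and the vertices live in the row-normalized sample space; the toy matrices and the function names are hypothetical.

```python
import math


def row_normalize(rows):
    """Scale each row so that its entries sum to one."""
    return [[x / sum(row) for x in row] for row in rows]


def closest_sites(points, vertices, top=1):
    """For each vertex, return the indices of the `top` points with the
    smallest Euclidean distance to that vertex."""
    result = []
    for v in vertices:
        dists = [math.dist(p, v) for p in points]
        order = sorted(range(len(points)), key=lambda i: dists[i])
        result.append(order[:top])
    return result


# Toy data: K = 2 cell types, n = 3 samples, 4 CpG sites.
H = [[0.2, 0.5, 0.9],
     [0.8, 0.5, 0.1]]
W = [[1.0, 0.0],   # perfect marker of cell type 1
     [0.0, 1.0],   # perfect marker of cell type 2
     [0.5, 0.5],   # uninformative site
     [0.9, 0.1]]   # near-marker of cell type 1
Y = [[sum(w[k] * H[k][j] for k in range(2)) for j in range(3)] for w in W]

points = row_normalize(Y)    # CpG sites after row normalization
vertices = row_normalize(H)  # simplex corners (here known; SISAL estimates them)
markers = closest_sites(points, vertices, top=2)
```

Sites whose methylation is concentrated in one cell type land at or near the corresponding corner, while sites shared across cell types fall in the interior of the simplex.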

2.4 Cell type label assignment

We develop a data-driven approach to assign cell type labels to the estimations from the reference-free deconvolution. Our method takes an external reference panel, the estimated cell compositions and the top cell-type-specific CpG sites as inputs. It can robustly assign cell labels even when the reference panel is from a different population than the DNA methylation data analyzed. For example, the reference is from adults but the DNA methylation data are collected from children as shown in our later application to the neuroblastoma data (Section 3.2.2).

Specifically, our method pools the posterior probability from a Naïve-Bayes classifier and the Pearson correlation coefficient of the reference methylation profiles versus the estimated cell type profiles. Denote the reference methylation profiles as a C by L matrix $W$ and the estimated cell type profiles as a C by K matrix $\hat{W}$. For cell type k and CpG site c, assume the estimated methylation level $\hat{w}_{kc}$ follows a normal distribution given that the label of the kth cell type is l:

$$\hat{w}_{kc} \mid Z_k = l \sim N(\mu_{lc}, \sigma_{lc}^2).$$

Assume the prior probability of a cell type matching any particular known label is uniform. The Naïve-Bayes classifier outputs the posterior probability

$$q_{l,k} = \Pr(Z_k = l \mid \hat{w}_{\cdot k}) = \frac{\Pr(\hat{w}_{\cdot k} \mid Z_k = l)}{\sum_{l'=1}^{L} \Pr(\hat{w}_{\cdot k} \mid Z_k = l')}.$$

Denote the posterior probability matrix by $Q = (q_{l,k})$. A Naïve-Bayes classifier was previously used to assign cell type labels in single cell RNA-seq data (Grabski et al., 2020), which is a similar problem but with a more complex data structure. The Pearson correlation coefficient has also often been used to quantify the similarity between the reference and the estimated profiles (Li et al., 2019). Let $P = (p_{l,k})$ be the L by K matrix of Pearson correlation coefficients $p_{l,k} = r(w_{\cdot l}, \hat{w}_{\cdot k})$. Of these two metrics, the posterior probability is very sensitive in distinguishing different cell types but does not quantify similarity: even when the correlations are low for all cell types, the one with the largest correlation can have a very large posterior probability. On the other hand, the Pearson correlation quantifies similarity but has low sensitivity in distinguishing cell types. To balance the distinction among different cell types and the similarity to the reference, we define the similarity score between an estimated and a reference cell type as the mean of the correlation coefficient P and the posterior probability Q, and use this value to allocate the predicted cell type. This averaged metric retains the information from the correlations of estimated versus reference cell types, and at the same time quantifies how confident the matching is through the posterior probability. The metric, albeit ad hoc, produces better results in our simulation and real data analyses.
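
The similarity score can be sketched in Python as below. The sketch assumes that per-label means and standard deviations (`mu[l][c]`, `sigma[l][c]`) are already available, for instance from reference replicates; how they are estimated is not specified here, and all names and numbers are hypothetical.

```python
import math


def log_normal_pdf(x, mu, sigma):
    """Log density of N(mu, sigma^2) at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def posterior(w_hat_k, mu, sigma):
    """Posterior over the L labels for one estimated profile, assuming a
    uniform prior and independent normal likelihoods across CpG sites."""
    loglik = [sum(log_normal_pdf(x, m, s) for x, m, s in zip(w_hat_k, mu[l], sigma[l]))
              for l in range(len(mu))]
    top = max(loglik)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in loglik]
    total = sum(weights)
    return [w / total for w in weights]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def similarity_matrix(w_hat, mu, sigma):
    """L x K matrix with entries (q + p) / 2; w_hat[k] is the k-th estimated profile."""
    q_cols = [posterior(w_hat[k], mu, sigma) for k in range(len(w_hat))]
    return [[0.5 * (q_cols[k][l] + pearson(mu[l], w_hat[k])) for k in range(len(w_hat))]
            for l in range(len(mu))]


# Toy reference (L = 2 labels, C = 4 CpGs) and K = 2 estimated profiles.
mu = [[0.9, 0.1, 0.5, 0.2],
      [0.1, 0.8, 0.4, 0.9]]
sigma = [[0.1] * 4, [0.1] * 4]
w_hat = [[0.15, 0.75, 0.45, 0.85],   # resembles reference label 1
         [0.85, 0.15, 0.50, 0.25]]   # resembles reference label 0
S = similarity_matrix(w_hat, mu, sigma)
```

Working on the log scale and subtracting the maximum before exponentiating avoids underflow when the likelihood is a product over many CpG sites.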

Assuming there are K cell types in the mixture and L cell types (L ≤ K) in the reference, we obtain an L×K similarity score matrix. The (l,k)th entry of the matrix represents the similarity score for assigning the estimated cell type k to the reference cell type l. We iteratively search for the largest element in the similarity score matrix and assign the predicted cell type in the corresponding column to the reference cell type in the corresponding row. Predicted cell types that do not match any reference cell type, or whose maximum similarity score is below a threshold t, are designated as ‘unassigned’. This assignment scheme allows partial label annotation and works especially well when the reference panel differs substantially from the data of interest. A higher threshold t forces the matching to have high confidence, but could result in fewer cell types being labeled. By default, we set the threshold to 0.5. When external information on cell proportions is available, we recommend that users vary the threshold to obtain results consistent with that information. For example, in tumor data deconvolution, the sum of immune cell type proportions should be roughly one minus the tumor purity. The tumor purities can be estimated using existing tools (Carter et al., 2012; Zheng et al., 2017).
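
One way to read the iterative assignment is as a greedy matching, sketched below. The sketch assumes that each reference label and each estimated cell type is used at most once (the matched row and column are removed after every assignment), which is an interpretation rather than a documented detail; the scores and names are made up.

```python
def assign_labels(score, ref_names, threshold=0.5):
    """Greedy matching on an L x K similarity score matrix.
    Returns a list of K labels; unmatched cell types are 'unassigned'."""
    n_ref, n_est = len(score), len(score[0])
    labels = ["unassigned"] * n_est
    free_rows, free_cols = set(range(n_ref)), set(range(n_est))
    while free_rows and free_cols:
        candidates = [(r, c) for r in sorted(free_rows) for c in sorted(free_cols)]
        l, k = max(candidates, key=lambda rc: score[rc[0]][rc[1]])
        if score[l][k] < threshold:
            break  # every remaining pair is below the threshold
        labels[k] = ref_names[l]
        free_rows.remove(l)
        free_cols.remove(k)
    return labels


# Toy scores: L = 2 reference cell types, K = 3 estimated cell types.
score = [[0.95, 0.10, 0.30],
         [0.20, 0.40, 0.85]]
names = ["Mono", "Gran"]
default_labels = assign_labels(score, names)                # threshold 0.5
strict_labels = assign_labels(score, names, threshold=0.9)  # stricter matching
```

Raising the threshold from 0.5 to 0.9 in this toy example drops the second match, illustrating the trade-off between confidence and the number of labeled cell types.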

2.5 Comparisons with other deconvolution methods

We compared Tsisal to three state-of-the-art reference-free or partial reference-free deconvolution methods: RefFreeEWAS (Houseman et al., 2014), DSA (Zhong et al., 2013) and TOAST (Li et al., 2019). RefFreeEWAS was applied using the default setup of the RefFreeCellMix function from the R/CRAN package RefFreeEWAS with the top 1000 most variable sites selected from the dataset. The implementation of DSA was obtained from GitHub (https://github.com/zhandong/DSA). DSA needs marker features of the cell types as input, so we fed it the Tsisal-identified top 50 cell-type-specific CpG sites for each cell type as markers. The TOAST-estimated cell-type proportions were computed using the default setup of the csDeconv function from the R/Bioconductor package TOAST. In simulations, the true proportions of all cell types were known. To assign predicted cell types to the known cell types, we calculated the Pearson correlation coefficients between estimated and true proportions, and iteratively assigned each predicted cell type based on the highest correlation coefficient. This procedure was used for Tsisal and for all existing methods under comparison to ensure fairness.

2.6 Datasets

We evaluated the performance of our proposed method using a total of seven datasets with details presented in Table 1. After the data were downloaded, we performed a quantile normalization of these datasets. For European Prospective Investigation into Cancer and Nutrition (EPIC) data, we only selected 424 normal samples for analysis.
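
For readers unfamiliar with quantile normalization, a textbook version is sketched below. It is a generic illustration, not the implementation used for these datasets: ties are broken by row order rather than averaged, and the function name is hypothetical.

```python
def quantile_normalize(matrix):
    """Quantile-normalize the columns (samples) of a matrix given as a list of rows.
    Every column ends up with the same set of values: the across-sample mean
    of the r-th smallest value, placed at the position of rank r."""
    m, n = len(matrix), len(matrix[0])
    columns = [[matrix[i][j] for i in range(m)] for j in range(n)]
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [sum(sc[r] for sc in sorted_cols) / n for r in range(m)]
    out = [[0.0] * n for _ in range(m)]
    for j, col in enumerate(columns):
        order = sorted(range(m), key=lambda i: col[i])
        for rank, i in enumerate(order):
            out[i][j] = rank_means[rank]
    return out


# Toy matrix: 3 CpG sites (rows) by 2 samples (columns).
X = [[0.1, 0.4],
     [0.5, 0.2],
     [0.9, 0.6]]
X_qn = quantile_normalize(X)
```

After normalization both samples share the same empirical distribution, while the ranking of CpG sites within each sample is preserved.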

Table 1.

Summarization of real datasets used in this study

| Dataset name | Accession | Sample size | Reference |
|---|---|---|---|
| Aging | GSE40279 | 656 | Hannum et al. (2013) |
| RA | GSE42861 | 689 | Liu et al. (2013) |
| EPIC | GSE51032 | 424 | Riboli et al. (2002) |
| Hannon et al. I | GSE80417 | 675 | Hannon et al. (2016) |
| Hannon et al. II | GSE84727 | 847 | Hannon et al. (2016) |
| KIRC | TCGA | 325 | Tomczak et al. (2015) |
| Neuroblastoma | GSE40279 | 35 | Gomez et al. (2015) |

RA, rheumatoid arthritis; EPIC, European Prospective Investigation into Cancer and Nutrition.

3 Results

3.1 Simulation

To examine the performance of Tsisal in a controlled setting, we analyzed synthetic DNA methylation mixtures generated by simulation. The simulated DNA methylation data were generated from two matrices: the cell-type-specific methylation profile matrix (W) and the cell-type proportion matrix (H). We simulated W based on DNA methylation 450K array data of purified human blood cells from GEO (accession GSE35069). This dataset contains the DNA methylation profiles of six types of blood cells, namely CD4 T, CD8 T, CD56 Natural Killer (CD56NK), B cell, monocyte (Mono) and granulocyte (Gran), with measurements from six replicate samples for each cell type (Reinius et al., 2012). For our simulation study, we combined CD4 T, CD8 T and CD56NK into one pseudo-cell-type (referred to as CD_T_NK hereafter) when estimating the cell-type-specific mean and variance of each feature, and used the methylation data of the resulting four cell types as the template to generate W. Specifically, we assumed that the methylation levels of each cell type followed a beta distribution and estimated its parameters for each CpG site. The individualized cell-type-specific methylation profile matrix W was then randomly generated from beta distributions with the estimated parameters. The proportion matrix H was randomly generated from the uniform distribution on [0.05, 0.95], and the cell-type proportions were then re-normalized to sum to one for each sample. Finally, the matrix Y was simulated as the product of these two matrices plus small Gaussian noise with zero mean and standard deviation 0.01, and then trimmed to the range [0,1]. For all simulation settings, results from 100 Monte Carlo experiments are summarized and presented.
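
The simulation recipe can be sketched as follows. Two points are assumptions of this sketch rather than statements about the original study: the beta parameters are estimated by the method of moments, and a single W is shared by all samples (a simplification if one profile was drawn per individual). The replicate values are made up.

```python
import random


def beta_moments(values):
    """Method-of-moments estimates (a, b) of a beta distribution from replicates."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common


def simulate_mixture(replicates, n_samples, noise_sd=0.01, seed=0):
    """replicates[c][k] holds replicate beta values of CpG c in cell type k.
    Returns (W, H, Y) with Y = W H + Gaussian noise, clipped to [0, 1]."""
    rng = random.Random(seed)
    m, K = len(replicates), len(replicates[0])
    params = [[beta_moments(replicates[c][k]) for k in range(K)] for c in range(m)]
    W = [[rng.betavariate(a, b) for (a, b) in params[c]] for c in range(m)]
    H = [[0.0] * n_samples for _ in range(K)]
    for j in range(n_samples):
        raw = [rng.uniform(0.05, 0.95) for _ in range(K)]
        total = sum(raw)
        for k in range(K):
            H[k][j] = raw[k] / total  # re-normalize so each sample sums to one
    Y = [[min(1.0, max(0.0, sum(W[c][k] * H[k][j] for k in range(K))
                       + rng.gauss(0.0, noise_sd)))
          for j in range(n_samples)] for c in range(m)]
    return W, H, Y


# Hypothetical replicates: 2 CpG sites x 2 cell types x 3 replicates each.
replicates = [[[0.88, 0.90, 0.92], [0.10, 0.12, 0.08]],
              [[0.20, 0.25, 0.22], [0.70, 0.75, 0.72]]]
W, H, Y = simulate_mixture(replicates, n_samples=5)
```

Fixing the seed makes a single run reproducible; varying it across runs gives independent Monte Carlo replicates.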

We first evaluate the performance of Tsisal in choosing the optimal number of cell types from simulated data. A total of 100 samples are generated using the above simulation steps. For each assumed cell type number k (2 ≤ k ≤ 15), we use Tsisal to calculate the AIC value. As a comparison, we also use the function EstDimIC in the RefFreeEWAS package to compute the AIC and BIC, and the function RefFreeCellMixArrayDevianceBoots in the RefFreeEWAS package to calculate bootstrapped deviances. For each of the four methods, Tsisal, RefFreeEWAS_AIC (RF_AIC), RefFreeEWAS_BIC (RF_BIC) and RefFreeEWAS_bootstrap (RF_BS), the most plausible K is chosen by minimizing the corresponding criterion. Figure 2A shows the number of times that the four methods correctly estimated (estimates equal to 4, represented by ‘Correct’), underestimated (estimates less than 4, represented by ‘Under’) and overestimated (estimates greater than 4, represented by ‘Over’) the number of cell types in 100 simulation experiments. Clearly, Tsisal has the highest accuracy in estimating the number of cell types. In 100 simulations, Tsisal has 50 correct estimates, 27 underestimates (the estimates are all 3) and 23 overestimates (the estimates are all 5). In contrast, RefFreeEWAS performs poorly and always underestimates: its estimates are all 2, regardless of whether AIC, BIC or bootstrapped deviance is used.

Fig. 2.


Performance of Tsisal in estimating the number of cell types and cell-type proportions on synthetic mixtures. (A) Barplots of the number of times correctly estimated (estimates equal to 4), underestimated (estimates less than 4) and overestimated (estimates greater than 4) the number of cell types by Tsisal, RefFreeEWAS_AIC (RF_AIC), RefFreeEWAS_BIC (RF_BIC) and RefFreeEWAS_bootstrap (RF_BS) in 100 simulation experiments. Boxplots of mean Pearson correlation coefficients (B) and mean absolute errors (C) from RefFreeEWAS, DSA, TOAST and Tsisal under different sample sizes 30, 50, 100, 200. The presented results are summarized over 100 Monte Carlo simulation experiments

Next, we assess the performance of Tsisal in estimating cell-type proportions under different sample sizes. To fairly evaluate the performance of the different methods (RefFreeEWAS, DSA, TOAST, Tsisal), we compute the mean Pearson correlation coefficient (MC) and the mean absolute error (MAE) of estimated versus true proportions over the four cell types. A better method is expected to have a higher MC and a lower MAE. As shown in Figure 2B and C, Tsisal consistently achieves higher correlations and lower mean absolute errors than TOAST, DSA and RefFreeEWAS. We also observe that the correlations of Tsisal increase and its errors decrease when the sample size increases from 30 to 200. DSA performs better than RefFreeEWAS, which demonstrates that the markers identified by Tsisal are cell-type-specific and improve the accuracy of deconvolution.
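
The two evaluation metrics can be written out as below, assuming the estimated cell types have already been matched to the true ones so that row k of both matrices refers to the same cell type. The function names and toy proportions are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mean_correlation(h_true, h_est):
    """Average over cell types of the correlation between true and estimated proportions."""
    return sum(pearson(t, e) for t, e in zip(h_true, h_est)) / len(h_true)


def mean_absolute_error(h_true, h_est):
    """Average absolute difference over all cell types and samples."""
    diffs = [abs(a - b) for t, e in zip(h_true, h_est) for a, b in zip(t, e)]
    return sum(diffs) / len(diffs)


# Toy K x n matrices (K = 2 cell types, n = 3 samples).
h_true = [[0.2, 0.5, 0.9],
          [0.8, 0.5, 0.1]]
h_est = [[0.26, 0.5, 0.82],
         [0.74, 0.5, 0.18]]
mc = mean_correlation(h_true, h_est)
mae = mean_absolute_error(h_true, h_est)
```

The toy estimate is a shrunken version of the truth, so the correlation is perfect while the absolute error is not zero, which is why the two metrics are reported together.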

We further evaluate the performance of Tsisal in cell type label assignment. We add different levels of noise to the reference panel in order to understand how Tsisal performs when the reference panel deviates from the truth. Supplementary Figure S1 summarizes the similarity scores of the four cell type assignments. It shows that the similarity scores assigned to the four cell types drop as the noise level increases, but overall the accuracy of assignment remains very high. This indicates that our proposed method performs robustly in assigning cell types even if the reference data are imperfect. Take Mono for example: when the standard deviation of the added noise increases to 5 and 10 times, the similarity scores assigned to Mono are 0.96 and 0.91. Even if the standard deviation increases to 20 times, the similarity score is still more than 0.86.

Finally, we examine whether the CpG sites selected for each cell type are truly cell-type-specific. We conduct an additional simulation study in which the 10 CpG sites closest to each vertex are identified as cell type-specific features. Supplementary Figure S2 shows the heatmap of the methylation profiles of the predicted markers in the pure reference panel. The selected CpG sites are hyper-methylated in the corresponding cell type and hypo-methylated in the other cell types, demonstrating that they are cell-type-specific.

We also compare Tsisal with EDec (Onuchic et al., 2016) and BayesCCE (Rahmani et al., 2018). EDec is downloaded from https://github.com/BRL-BCM/EDec. EDec needs marker features of the cell types as input, and we select these features in two ways. The first is to feed the Tsisal-identified top 50 cell-type-specific CpG sites for each cell type to EDec as markers (termed EDec+T hereafter). The second is to assume that some of the reference data are known, use them to estimate cell-type-specific markers and then feed these markers into EDec (termed EDec hereafter). The BayesCCE Matlab toolbox is downloaded from GitHub (https://github.com/cozygene/BayesCCE). Supplementary Figure S3 shows the violin plots of mean Pearson correlation coefficients and mean absolute errors between estimated and true proportions from Tsisal, EDec+T, EDec and BayesCCE under sample sizes 30, 50, 100 and 200. The presented results are summarized over 100 Monte Carlo simulations. In all simulation scenarios, Tsisal provides the best results, followed by EDec. We notice that EDec performs better than EDec+T because EDec uses part of the reference data, which reduces the difficulty of decomposition. We also observe that the performance of Tsisal, EDec+T and EDec improves greatly when the sample size increases from 30 to 50, and that the performance differences among these three methods become smaller as the sample size increases. Among the four methods, BayesCCE has the worst performance (mean correlation ∼0.5 and mean absolute error ∼0.18), and its performance remains almost unchanged as the sample size increases. Therefore, we do not consider BayesCCE in the subsequent real data analysis.

Overall, the simulation studies show that Tsisal provides more accurate results than existing methods in choosing the number of cell types and estimating cell-type proportions. In addition, our cell label assignment step provides robust and accurate cell type labeling even when the reference panel deviates substantially from the true reference.

3.2 Real data analyses

In this section, we evaluate the performance of Tsisal using seven real DNA methylation datasets. In Section 3.2.1, we compare the accuracy of the proportions estimated by Tsisal with that of state-of-the-art methods. In Section 3.2.2, we focus on evaluating the cell type number estimation and cell label assignment of Tsisal by relating the results to known sample phenotypes.

3.2.1 Decomposition of five whole-blood methylation datasets

Five large whole-blood methylation datasets are used to evaluate Tsisal and three existing methods, RefFreeEWAS, DSA and TOAST (Table 1). Unlike the synthetic data, the five datasets do not have true cell-type proportions to serve as benchmarks. We therefore obtain blood reference panels from the R package FlowSorted.Blood.450k (Jaffe et al., 2014), which provides methylation profiles of six cell types: CD8T, CD4T, natural killer cell (NK), B-cell, monocytes (Mono) and granulocyte (Gran). After removing the batch effect between the mixture and reference data using ComBat (Johnson et al., 2007), we apply the reference-based deconvolution method EpiDISH (Teschendorff et al., 2017). These proportion estimates are used as a silver standard to benchmark the methods.

Figure 3 shows the Pearson correlations between the reference-based proportions and the proportions estimated by the different methods for each cell type in the five datasets. Overall, Tsisal outperforms RefFreeEWAS and DSA, showing larger correlations with the reference-based proportions on the five datasets. For some datasets, the improvements in proportion estimation are substantial. For example, in the Riboli et al. dataset, the correlations between estimated and reference-based proportions for CD4T are 0.094 for RefFreeEWAS and -0.082 for DSA, but 0.824 for Tsisal. Overall, TOAST has estimation accuracy comparable to that of Tsisal. However, Tsisal has three important advantages over TOAST. First, TOAST only outputs proportion estimates, whereas Tsisal provides both proportion estimates and cell-type-specific CpG sites for each constituent cell type. These cell-type-specific CpG sites could be used to specify cell types as well as to conduct downstream enrichment analysis and disease diagnosis. Second, Tsisal is a full pipeline for obtaining cell compositions with its unique cell label assignment step, for which we present more results in Section 3.2.2. Lastly, Tsisal is computationally faster than TOAST. For a simulation with 459226 CpG sites and 100 samples, one run takes about 631 s with Tsisal and 3528 s with TOAST on a MacBook Air laptop with an i5 1.4 GHz CPU and 4 GB RAM.

Fig. 3.


Barplots of Pearson correlations between reference-based solved and estimated proportions by different methods for each cell type in the five datasets under the assumption of six constituting cell types in blood: CD8T, CD4T, natural killer cell (NK), B-cell, monocytes (Mono), Granulocyte (Gran)

We further compare Tsisal with EDec and evaluate the performance in the five datasets. Supplementary Figure S4 shows the Pearson correlation coefficients between reference-based solved and estimated proportions by Tsisal, EDec+T and EDec for each cell type in the five datasets.

Overall, Tsisal outperforms EDec+T and EDec, showing larger correlation between the reference-based solved and estimated proportions on the five datasets.

3.2.2 Decomposition of KIRC and neuroblastoma data

Lastly, we apply Tsisal to two cancer datasets: kidney renal clear cell carcinoma (KIRC) data and neuroblastoma data. Details of both datasets are shown in Table 1. Here we associate the estimates with observed survival and other clinical phenotypes. These two analyses epitomize the application of Tsisal in a realistic scenario, where the data of interest are from a completely different population than the reference panel in use. The KIRC data are obtained from kidney renal tumor samples and the neuroblastoma data are from young children with neuroblastoma, whereas the immune reference used to guide cell type assignment is obtained from six healthy adults (Reinius et al., 2012). In such circumstances, reference-based methods could lead to inaccurate estimates, as reported in a previous study (Yousefi et al., 2015), while reference-free methods are able to provide better estimates (Rahmani et al., 2018).

The DNA methylation 450K data for 325 KIRC tumor samples are downloaded from the TCGA database (https://portal.gdc.cancer.gov/). Tsisal chooses 12 as the number of cell types and is applied to estimate cell-type proportions of all samples. The immune reference panel of Reinius et al. (2012) is provided to Tsisal for inferring cell type labels, and the threshold for cell type label assignment is set to 0.55. CD4T and Gran were found above the threshold and thus had matched proportions. This is consistent with the finding of Zhang et al. (2019), who analyzed the RNA-seq data of TCGA KIRC samples and reported three major immune cell types, identified as CD4T, Gran and CD8T. These immune cells were also identified as relatively more abundant immune cell types in tumors by a previous review (Whiteside, 2008). Figure 4A shows the distribution of the estimated proportions. Among all cell types, CD4T accounts for 17% and Gran for 15%, while the ten unknown/cancer cell types together account for 68%. Moreover, Supplementary Figure S5 shows scatter plots of Tsisal-estimated tumor purities (sums of proportions for the unknown cancer cell types) against the purity estimates from LUMP (Aran et al., 2015) and InfiniumPurify (Zheng et al., 2017). The Pearson correlation coefficient is 0.75 between Tsisal and LUMP and 0.8 between Tsisal and InfiniumPurify. These results show that our proposed method is highly consistent with other methods and support the reliability of our deconvolution algorithm.
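The thresholded label assignment and the purity calculation described above can be sketched as follows. This is a hedged illustration rather than the TOAST implementation: the similarity scores are taken as given (how they are computed is not reproduced here), the rule of matching each estimated component to its best-scoring reference cell type is an assumption, and all names and numbers are hypothetical.

```python
def assign_labels(similarity, threshold):
    """Match each estimated component to its best-scoring reference cell type.

    `similarity` maps an estimated component to {reference cell type: score}.
    A component whose best score does not exceed `threshold` stays unassigned
    (None). This simple rule does not prevent two components from matching
    the same reference cell type.
    """
    labels = {}
    for component, scores in similarity.items():
        best_type = max(scores, key=scores.get)
        labels[component] = best_type if scores[best_type] > threshold else None
    return labels


def tumor_purity(proportions, labels):
    """Sum one sample's proportions over the unassigned components."""
    return sum(p for component, p in proportions.items()
               if labels[component] is None)


# Made-up similarity scores for three estimated components.
similarity = {
    "component_1": {"CD4T": 0.71, "Gran": 0.20},
    "component_2": {"CD4T": 0.15, "Gran": 0.62},
    "component_3": {"CD4T": 0.30, "Gran": 0.41},
}
labels = assign_labels(similarity, threshold=0.55)

# Made-up proportions for a single sample; they sum to 1.
sample = {"component_1": 0.17, "component_2": 0.15, "component_3": 0.68}
purity = tumor_purity(sample, labels)
```

With these toy inputs, the first two components are labeled CD4T and Gran, the third stays unassigned, and the purity of the sample is the proportion of that unassigned component.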

Fig. 4.

Decomposition results of KIRC and neuroblastoma data. (A) Distribution of the estimated proportions of 12 cell types in KIRC tumor samples. (B) Functional enrichment of KIRC unknown/cancer-specific CpG sites. (C) Kaplan–Meier survival curves for KIRC, stratified by estimated proportions of Gran and Cancer cells 1, 4 and 6. Tumors in the top 20th percentile of Gran and Cancer cells 1, 4 and 6 are compared with those in the bottom 20th percentile. P-values are obtained by the log-rank test. (D) Estimated proportions of CD56NK are associated with diagnosis time (left panel), tumor stage (middle panel) and MYCN status (right panel) of neuroblastoma samples. For two-group comparisons, P-values are obtained from the Wilcoxon rank-sum test; for multiple-group comparisons, P-values are obtained from the ANOVA F-test.

We next explore the functional enrichment of the identified cancer-specific CpG sites. For each unknown/cancer cell type, we identify the top 50 cell-type-specific CpG sites using Tsisal, resulting in a total of 500 specific CpG sites for the ten unknown/cancer cell types. Enrichment analysis by EnrichR returns many terms directly related to cancer (Fig. 4B). For example, regulation of glycolytic process affects the sensitivity of tumor cells (Pitroda et al., 2009), and altered glycolysis is a metabolic hallmark found in many cancer cells (Le, 2019; Xiong et al., 2011). The important role of regulating coenzyme metabolic process has also been reported previously (Thapa et al., 2020).

We further investigate whether the estimated cell-type proportions are associated with the survival of KIRC patients. Survival of the samples in the top and bottom 20% ranked by the proportions of Gran and Cancer cells 1, 4 and 6 is examined using Cox proportional hazards regression. We observe that patients in the top 20% of Cancer cells 1, 4 and 6 have significantly shorter survival (P = 0.002, 4e-04 and 0.003, respectively) than those in the bottom 20%, whereas patients in the top 20% of Gran survive significantly longer (P = 0.007) than those in the bottom 20% (Fig. 4C).
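The stratified comparison can be sketched in a few lines. The sketch below splits samples into the bottom and top 20% by an estimated proportion and compares the two groups with a two-group log-rank test, the test named in the caption of Figure 4; it is not a Cox regression and not the code used in the paper. The function names are hypothetical, and the P-value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, sqrt


def split_by_fraction(values, fraction=0.2):
    """Return sample indices of the bottom and top `fraction` by value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    k = max(1, int(len(values) * fraction))
    return order[:k], order[-k:]


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    `events_*` are 1 if death was observed and 0 if the sample was censored.
    Returns the chi-square statistic and its P-value (1 degree of freedom).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e == 1})
    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers at risk just before t and deaths at t, per group.
        n_a = sum(1 for s, _, g in data if s >= t and g == 0)
        n_b = sum(1 for s, _, g in data if s >= t and g == 1)
        d_a = sum(1 for s, e, g in data if s == t and e == 1 and g == 0)
        d_b = sum(1 for s, e, g in data if s == t and e == 1 and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_minus_expected += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of the chi-square distribution with 1 d.f.
    return chi2, erfc(sqrt(chi2 / 2))


# Made-up proportions and survival times for ten samples.
proportions = [0.1, 0.5, 0.3, 0.9, 0.7, 0.2, 0.4, 0.6, 0.8, 0.0]
bottom, top = split_by_fraction(proportions, fraction=0.2)
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                             [6, 7, 8, 9, 10], [1] * 5)
```

In practice the survival times and event indicators of the `top` and `bottom` samples would be passed to `logrank_test`; the last call above uses two artificial, well-separated groups only to exercise the function.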

The neuroblastoma data are downloaded from GEO (accession GSE54719). This dataset contains the DNA methylation 450K profiles of 35 neuroblastoma tumor samples (Gomez et al., 2015). Tsisal chooses 11 as the optimal number of cell types and estimates the associated cell-type proportions. Immune cell profiles from Reinius et al. (2012) and three neuroblastoma cancer cell lines (Be2c, Sknsh, Sknshra) from ENCODE are combined as the reference panel for cell type label assignment. Supplementary Figure S6 shows the similarity scores of the Tsisal-identified cell types compared with the actual cell types in the reference panel. The similarity scores of the estimated cell types assigned to six immune cell types all exceed 0.6, and the scores are even higher for the cancer cell lines (∼0.8). We set the threshold for cell type label assignment to 0.6. CD8T, CD56NK, Mono, Be2c, Sknsh and Sknshra were found above the threshold and thus had matched proportions. Supplementary Figure S7 shows the distribution of the estimated proportions. CD56NK accounts for the largest proportion (∼27%) among the three immune cell types. The total proportion of the eight cancer cell types is 49%.

We also conduct functional enrichment analysis of the 400 identified cancer-specific CpG sites (Supplementary Fig. S8). Enrichment analysis by EnrichR returns many terms directly related to cancer. For example, glucocorticoids play a fundamental role in the maintenance of both resting and stress-related homeostasis and are therefore essential for survival (Nicolaides et al., 2015). Glycosaminoglycans play a key role in cancer cell biology and treatment (Afratis et al., 2012), and central nervous system development plays an important role in neurodegenerative diseases (Palubinsky et al., 2012).

Next, we study the association between the estimated cell-type proportions and subject phenotypes. We consider three phenotypes for each sample: diagnosis time, tumor stage and MYCN status. Samples are first grouped into ‘<18 month’ (diagnosed earlier than 18 months of age) and ‘>18 month’. It has been reported that children diagnosed at an earlier age (<18 months) are more likely to be cured than those diagnosed later (>18 months) (Cheung, 2012). By tumor stage, samples can be classified into four groups: Stage 1, Stage 3, Stage 4 and Stage 4S. Tumor stage reflects the severity of the disease, and later-stage neuroblastoma is associated with shorter survival time (Franks et al., 1997). Stage 4S is a special stage in which the subject is younger than 1 year old; the survival time at this stage is shorter than at Stage 1, but longer than at Stages 3 and 4 (Tonini et al., 1997). MYCN is an oncogene that helps regulate cell growth. Based on MYCN status, the samples are divided into amplified and non-amplified. Subjects with MYCN amplification tend to have a shorter survival time than those without amplification (Brodeur, 2003).

Figure 4D shows the association between the estimated proportions of CD56NK and the diagnosis time, tumor stage and MYCN status of neuroblastoma samples. We observe significant differences in the estimated proportions of CD56NK between samples grouped by each of these phenotypes. The estimated proportion of CD56NK in the ‘<18 month’ group is significantly higher than that in the ‘>18 month’ group (P = 6.5e-05). From Stage 1 to Stage 4, the estimated proportion of CD56NK decreases gradually, but the proportion in Stage 4S is higher than that in Stage 3 and Stage 4. The estimated proportion of CD56NK in the amplified group is significantly lower than that in the non-amplified group (P = 3.8e-06). These findings are consistent with existing knowledge that CD56NK cells have a positive role in disease control and that higher proportions of CD56NK are associated with better survival outcomes among neuroblastoma patients (Castriconi et al., 2004; Schleinitz et al., 2010).
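A two-group comparison of this kind can be sketched with a Wilcoxon rank-sum test, the test named in the caption of Figure 4 for two-group comparisons. The version below uses the normal approximation without tie or continuity correction, so its P-values will generally differ from those of an exact test; it is an illustration with made-up numbers and hypothetical function names, not the procedure that produced the reported P-values.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided Wilcoxon rank-sum test using the normal approximation.

    Returns the z statistic for the rank sum of `group_a` and the P-value.
    """
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std_dev = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std_dev
    return z, erfc(abs(z) / sqrt(2))


# Made-up CD56NK proportions for two phenotype groups.
early_diagnosis = [0.30, 0.35, 0.28, 0.40, 0.33]
late_diagnosis = [0.10, 0.12, 0.15, 0.08, 0.11]
z, p_value = rank_sum_test(early_diagnosis, late_diagnosis)
```

A positive `z` means the first group tends to have larger values than the second. For comparisons across more than two groups, such as tumor stage, the caption names the ANOVA F-test instead.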

4 Discussion

In this article, we present Tsisal, a complete deconvolution algorithm for estimating cell compositions from DNA methylation data. Without relying on prior knowledge, Tsisal provides comprehensive solutions for choosing the optimal number of cell types, estimating cell compositions and identifying cell-type-specific CpG sites. If a reference panel is available, even one that is incomplete or measured from a different population, Tsisal can accurately assign cell type labels to the anonymous cell composition estimates.

Similar to other reference-free deconvolution methods, our proposed algorithm is better suited to datasets with relatively large sample sizes. With the development of technology, the price of the DNA methylation 450K array has dropped significantly. This has made large-scale studies using DNA methylation arrays possible, and many such datasets have been produced, including TCGA, ROSMAP and EPIC. Moreover, it has been recognized that a reasonably large sample size is crucial for identifying reliable biomarkers of complex diseases. As a result, our proposed method has good potential to play an important role in analyzing heterogeneous DNA methylation data that exist or will be obtained in the future. For datasets with smaller sample sizes, e.g. fewer than 20 samples, especially those obtained from model animals, our recent work using a marker-guided algorithm for gene expression (Li et al., 2020) provides a promising solution. However, such a method requires better knowledge of cell-type-specific markers. We are working in this direction by gaining more marker information through the method proposed in this work.

The similarity scores used in assigning cell type labels need a cut-off value to distinguish ‘assigned’ from ‘unassigned’ cell types. We acknowledge that choosing an appropriate cut-off value should take more information into consideration. For example, extensive studies are needed to evaluate the variance of DNA methylation profiles for each cell type within the same population and across populations. This is beyond the scope of the current work and will need to be investigated in future studies. Intuitively, users could use a higher threshold when the reference panel comes from a population similar to that of the data of interest, and a lower threshold if larger differences in DNA methylation are expected between the two populations.

Lastly, the idea and principle of the current method are also applicable to other epigenetic data modalities, for example, bisulfite-sequencing data and m6A data. Those data have different structures that require additional data processing and modeling to extend the current method. Nevertheless, these technologies are still relatively expensive at the current stage. We will extend our method to solve the deconvolution problem for these modalities when more data become available.

Supplementary Material

btaa930_Supplementary_Data

Acknowledgement

The authors thank the two anonymous reviewers and the editor for their constructive comments to improve this work.

Funding

This project was partly supported by the National Natural Science Foundation of China [61902061 to W.Z.] and the National Institutes of Health [R01GM122083, P01NS097206 and U01MH116441 to H.W.].

Conflict of Interest: none declared.

Data availability

All real datasets analyzed are publicly available through Gene Expression Omnibus (Aging data, GSE40279; RA, GSE42861; EPIC, GSE51032; Hannon et al. I and II, GSE80417 and GSE84727; Neuroblastoma, GSE54719) and the TCGA data portal (https://portal.gdc.cancer.gov/).

Contributor Information

Weiwei Zhang, School of Science, East China University of Technology, Nanchang, Jiangxi 330013, China.

Hao Wu, Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA 30322, USA.

Ziyi Li, Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.

References

  1. Afratis N. et al. (2012) Glycosaminoglycans: key players in cancer cell biology and treatment. FEBS J., 279, 1177–1197.
  2. Aran D. et al. (2015) Systematic pan-cancer analysis of tumour purity. Nat. Commun., 6, 8971.
  3. Bioucasdias J.M. (2009) A variable splitting augmented Lagrangian approach to linear spectral unmixing. In Proceedings of the IEEE GRSS Workshop Hyperspectral Image Signal Process: Evolution in Remote Sens, pp. 1–4.
  4. Bioucasdias J.M. et al. (2012) Hyperspectral unmixing overview: geometrical, statistical, and sparse regression-based approaches. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens, 5, 354–379.
  5. Bird A. (2002) DNA methylation patterns and epigenetic memory. Genes Dev., 16, 6–21.
  6. Boardman J. (1993) Automating spectral unmixing of AVIRIS data using convex geometry concepts. In: Summaries 4th Annual JPL Airborne Geoscience Workshop, Vol. 1, pp. 11–14.
  7. Brodeur G.M. (2003) Neuroblastoma: biological insights into a clinical enigma. Nat. Rev. Cancer, 3, 203–216.
  8. Castriconi R. et al. (2004) Natural killer cell-mediated killing of freshly isolated neuroblastoma cells: critical role of DNAX accessory molecule-1–poliovirus receptor interaction. Cancer Res., 64, 9180–9184.
  9. Carter S.L. et al. (2012) Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol., 30, 413–421.
  10. Cheung N.V. (2012) Association of age at diagnosis and genetic mutations in patients with neuroblastoma. JAMA, 307, 1062–1071.
  11. Franks M.M. et al. (1997) Neuroblastoma in adults and adolescents: an indolent course with poor survival. Cancer, 79, 2028–2035.
  12. Giannakopoulos P. et al. (2003) Tangle and neuron numbers, but not amyloid load, predict cognitive status in Alzheimer’s disease. Neurology, 60, 1495–1500.
  13. Gomez S. et al. (2015) DNA methylation fingerprint of neuroblastoma reveals new biological and clinical insights. Epigenomics, 5, 1137–1153.
  14. Grabski I.N. et al. (2020) Probabilistic gene expression signatures identify cell-types from single cell RNA-seq data. bioRxiv.
  15. Hannon E. et al. (2016) An integrated genetic-epigenetic analysis of schizophrenia: evidence for co-localization of genetic associations and differential DNA methylation. Genome Biol., 17, 176.
  16. Hannum G. et al. (2013) Genome-wide methylation profiles reveal quantitative views of human aging rates. Mol. Cell, 49, 359–367.
  17. Houseman E.A. et al. (2012) DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics, 13, 86.
  18. Houseman E.A. et al. (2014) Reference-free cell mixture adjustments in analysis of DNA methylation data. Bioinformatics, 30, 1431–1439.
  19. Ino Y. et al. (2013) Immune cell infiltration as an indicator of the immune microenvironment of pancreatic cancer. Br. J. Cancer, 108, 914–923.
  20. Jaffe A.E. et al. (2014) Accounting for cellular heterogeneity is critical in epigenome-wide association studies. Genome Biol., 15, R31–9.
  21. Johnson W.E. et al. (2007) Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics, 8, 118–127.
  22. Le W. et al. (2019) Detection of cancer cells based on glycolytic-regulated surface electrical charges. Biophys. Rep., 5, 10–18.
  23. Li Z. et al. (2019) TOAST: improving reference-free cell composition estimation by cross-cell type differential analysis. Genome Biol., 20, 190.
  24. Li Z. et al. (2020) Robust partial reference-free cell composition estimation from tissue expression. Bioinformatics, 36, 3431–3438.
  25. Liu Y. et al. (2013) Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nat. Biotechnol., 31, 142–147.
  26. Nascimento J.M. et al. (2005) Does independent component analysis play a role in unmixing hyperspectral data. IEEE Trans. Geosci. Remote Sensing, 43, 175–187.
  27. Nicolaides N.C. et al. (2015) Stress, the stress system and the role of glucocorticoids. Neuroimmunomodulation, 22, 6–19.
  28. Onuchic V. et al. (2016) Epigenomic deconvolution of breast tumors reveals metabolic coupling between constituent cell types. Cell Rep., 17, 2075–2086.
  29. Palubinsky A.M. et al. (2012) The role of central nervous system development in late-onset neurodegenerative disorders. Dev. Neurosci., 34, 129–139.
  30. Pitroda S.P. et al. (2009) STAT1-dependent expression of energy metabolic pathways links tumour growth and radioresistance to the Warburg effect. BMC Medicine, 7, 68.
  31. Rahmani E. et al. (2016) Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat. Methods, 13, 443–445.
  32. Rahmani E. et al. (2018) BayesCCE: a Bayesian framework for estimating cell-type composition from DNA methylation without the need for methylation reference. Genome Biol., 19, 141.
  33. Reinius L.E. et al. (2012) Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One, 7, e41361.
  34. Riboli E. et al. (2002) European Prospective Investigation into Cancer and Nutrition (EPIC): study populations and data collection. Public Health Nutr., 5, 1113–1124.
  35. Robertson K.D. (2005) DNA methylation and human disease. Nat. Rev. Genet., 6, 597–610.
  36. Schleinitz N. et al. (2010) Natural killer cells in human autoimmune diseases. Immunology, 131, 451–458.
  37. Smith Z.D. et al. (2013) DNA methylation: roles in mammalian development. Nat. Rev. Genet., 14, 204–220.
  38. Teschendorff A.E. et al. (2017) A comparison of reference-based algorithms for correcting cell-type heterogeneity in Epigenome-Wide Association Studies. BMC Bioinformatics, 18, 105.
  39. Thapa M. et al. (2020) Role of coenzymes in cancer metabolism. Semin. Cell Dev. Biol., 98, 44–53.
  40. Tomczak K. et al. (2015) Review The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. (Pozn), 1A, 68–77.
  41. Tonini G. et al. (1997) MYCN oncogene amplification in neuroblastoma is associated with worse prognosis, except in stage 4s: the Italian experience with 295 children. J. Clin. Oncol., 15, 85–93.
  42. Wang X. et al. (2019) Bulk tissue cell type deconvolution with multi-subject single-cell expression reference. Nat. Commun., 10, 380.
  43. Whiteside T.L. (2008) The tumor microenvironment and its role in promoting tumor growth. Oncogene, 27, 5904–5912.
  44. Xiong Y. et al. (2011) Regulation of glycolysis and gluconeogenesis by acetylation of PKM and PEPCK. Cold Spring Harb. Quant. Biol., 76, 285–289.
  45. Yousefi P. et al. (2015) Sex differences in DNA methylation assessed by 450 K BeadChip in newborns. BMC Genomics, 16, 911.
  46. Zaitsev K. et al. (2019) Complete deconvolution of cellular mixtures based on linearity of transcriptional signatures. Nat. Commun., 10, 2209.
  47. Zhang S. et al. (2019) Immune infiltration in renal cell carcinoma. Cancer Sci., 110, 1564–1572.
  48. Zheng X. et al. (2017) Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol., 18, 17–17.
  49. Zhong Y. et al. (2013) Digital sorting of complex tissues for cell type-specific gene expression profiles. BMC Bioinformatics, 14, 89.
  50. Zou J. et al. (2014) Epigenome-wide association studies without the need for cell-type composition. Nat. Methods, 11, 309–311.

Articles from Bioinformatics are provided here courtesy of Oxford University Press
