Bioinformatics. 2017 May 4;33(17):2651–2657. doi: 10.1093/bioinformatics/btx303

Accounting for tumor purity improves cancer subtype classification from DNA methylation data

Weiwei Zhang 1,2, Hao Feng 3, Hao Wu 3,*, Xiaoqi Zheng 1,*
PMCID: PMC6410888  PMID: 28472248

Abstract

Motivation: Tumor sample classification has long been an important task in cancer research. Classifying tumors into different subtypes greatly benefits therapeutic development and facilitates application of precision medicine on patients. In practice, solid tumor tissue samples obtained from clinical settings are always mixtures of cancer and normal cells. Thus, the data obtained from these samples are mixed signals. The ‘tumor purity’, or the percentage of cancer cells in cancer tissue sample, will bias the clustering results if not properly accounted for.

Results: In this article, we developed a model-based clustering method and an R function which uses DNA methylation microarray data to infer tumor subtypes with the consideration of tumor purity. Simulation studies and the analyses of The Cancer Genome Atlas data demonstrate improved results compared with existing methods.

Availability and implementation: InfiniumClust is part of R package InfiniumPurify, which is freely available from CRAN (https://cran.r-project.org/web/packages/InfiniumPurify/index.html).

Contact: hao.wu@emory.edu or xqzheng@shnu.edu.cn

Supplementary information: Supplementary data are available at Bioinformatics online.

1 Introduction

Classifying tumor samples into subtypes based on different types of clinical or molecular data is a key step in understanding cancer etiology and designing personalized treatment for cancer patients (Chung et al. 2002; Hoadley et al. 2014; Ogino et al. 2012). Originally, classification of cancer subtypes was mostly based on clinical histological information. For example, according to the size and the appearance of malignant cells under a microscope, lung carcinomas are categorized into two main classes: non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC) (Leong et al. 2014). With the advances of high-throughput technologies, tumor subtype classification has been performed more frequently using molecular signals such as DNA sequence variants, gene expression or DNA methylation. For example, the PAM50 gene expression assay was used to categorize breast tumors into five intrinsic subtypes: luminal A, luminal B, human epidermal growth factor receptor 2 (HER2) enriched, basal-like and normal-like (Parker et al. 2009). Similarly, glioblastoma multiforme was classified into four molecular subtypes: classical, neural, proneural and mesenchymal, where the former two were characterized by higher expression of epidermal growth factor receptor (EGFR) and neuron marker genes, respectively (Verhaak et al. 2010).

DNA methylation is an important epigenetic modification of the DNA molecule, and plays a crucial role in many biological processes, including repression of gene transcription, maintenance of gene imprinting and X-chromosome inactivation (Bird 2002; Hackett and Surani 2013; Li et al. 1993). Aberrant DNA methylation pattern has been identified as a hallmark in different types of cancers (Das and Singal 2004; Hansen et al. 2011). DNA methylation profile has widely been used to perform categorization on clinical presentation and patient prognosis (Stefansson et al. 2015; Zhuang et al. 2012). Clustering of lung cancer cell lines using DNA methylation markers showed that NSCLC and SCLC cell lines had different DNA methylation patterns (Virmani et al. 2002). DNA methylation profiling was also used in clustering of acute lymphoblastic leukemia (ALL) patients and served as a complementary method for diagnosis of ALL (Nordlund et al. 2015). These results suggest that each cancer subtype carries unique DNA methylation signature that can help to identify the subtypes.

A number of methods have been applied for clustering tumor samples based on high-throughput data, including nonparametric (K-means, agglomerative hierarchical clustering, etc.) and model-based methods (Houseman et al. 2008; Kuan et al. 2010). In particular, Non-negative Matrix Factorization (NMF) is a popular method for sample clustering based on gene expression data (Brunet et al. 2004). The method is based on matrix factorization with non-negative constraints, and was shown to have good performance. To systematically compare these methods, a recently developed tool ClustEval evaluated the currently available clustering methods by using different datasets, varying parameters, and quality metrics. It suggested that no method performed the best in all settings (Wiwie et al. 2015).

Among all published results for cancer type classification, one important aspect is consistently ignored: the clinical tumor samples contain different types of cells as well as their adjacent normal cells. Due to the inclusion of normal cells in the tumor samples, the clinical tumor samples cannot be regarded as ‘pure’ cancer cells. Previous studies have shown that tumor purities (the percentages of cancer cells in solid tumor samples) have a strong influence on the analysis of genomic data in cancer studies, and may bias the biological interpretation of results (Aran et al. 2015). Our exploratory analyses also show that applying traditional clustering methods such as K-means or NMF directly on the methylation profiles from tumor samples gives biased results (more details are provided in Section 3): samples with similar tumor purities tend to be clustered together. This is undesirable since there is no evidence showing associations between tumor purity and cancer subtypes. Thus, we believe it is of great necessity to consider tumor purity in the clustering procedure.

The importance of accounting for tumor purity in data analysis has been well recognized. For example, it is recommended to include purity in differential expression analysis (Aran et al. 2015). We recently developed InfiniumPurify, which incorporates purity in differential methylation (DM) analysis (Zheng et al. 2017). However, to date there is no clustering method available that takes tumor purity into account. In this study, we developed a rigorous statistical method, InfiniumClust, to perform sample clustering on DNA methylation data with the consideration of tumor purity. InfiniumClust models the DNA methylation levels of a tumor sample as a mixture of normal and cancer data, where the mixing proportion is the tumor purity. The ‘pure’ cancer data are further assumed to be from a mixture of different cancer subtypes. When tumor purities are known, the parameter estimation and sample clustering are performed through an Expectation-Maximization (EM)-based algorithm. We performed extensive real-data based simulations and demonstrated good performance of InfiniumClust. We further applied InfiniumClust to DNA methylation data for 23 cancer types from The Cancer Genome Atlas (TCGA). Compared with existing clustering methods that ignore tumor purity, InfiniumClust provides less biased and more meaningful results. To the best of our knowledge, InfiniumClust is the first available tool for unsupervised clustering that takes tumor purity into account. InfiniumClust is currently available from the R package InfiniumPurify, which can be obtained at https://cran.r-project.org/web/packages/InfiniumPurify/index.html.

2 Materials and methods

Sample clustering based on high-throughput data usually starts with feature (CpG site, gene, etc.) selection. It is a common practice to select a small number (such as 1000) of features with the largest variances, and use their data for clustering. Those features are highly heterogeneous, thus they contain information for subtypes. In contrast, using data for all features is not ideal because a large portion of them show no variation among samples and therefore only add noise to the clustering procedure. Following this reasoning, we first selected highly variable CpG sites and then performed clustering based on these feature-reduced data.

The raw data for the clustering procedure are from Illumina Infinium DNA methylation 450k arrays, which report methylation beta values of more than 480,000 CpG sites. Methylation beta values range from 0 to 1, hence they cannot be considered as following a normal distribution. We first transformed the beta values using an arcsine transformation: $f(x) = \arcsin(2x - 1)$. Such a transformation has previously been used in DM analysis (Park and Wu 2016). The transformed data follow the normal distribution better than the raw data, thus fitting our model assumption well. In addition, compared with the commonly used logit transform, the arcsine transformation is more linear (especially at the boundaries). This is important since the methylation level from a tumor sample is a mixture of the cancer and normal methylation levels, and the signal mixing is at the original scale. A more linear transformation allows one to use a linear model for the transformed data with a better approximation.
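As a minimal sketch of the transformation described above, the following plain Python compares the arcsine and logit transforms at a few beta values. The function names are illustrative and are not part of InfiniumPurify.

```python
import math

def arcsine_transform(beta):
    """Map a methylation beta value in [0, 1] to arcsin(2*beta - 1)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return math.asin(2.0 * beta - 1.0)

def logit_transform(beta):
    """Logit transform, shown only for comparison; undefined at 0 and 1."""
    return math.log(beta / (1.0 - beta))

# The arcsine transform stays bounded near 0 and 1, while the logit diverges.
for b in (0.001, 0.25, 0.5, 0.75, 0.999):
    print(b, round(arcsine_transform(b), 3), round(logit_transform(b), 3))
```

The arcsine transform maps [0, 1] onto a bounded interval, which is one way to see why it behaves more linearly than the logit near the boundaries.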

2.1 The data model

Let $X, Y$ be $C \times N$ matrices of transformed methylation levels for normal and pure tumor cells, where $i = 1, 2, \ldots, C$ indexes CpG sites and $j = 1, 2, \ldots, N$ indexes samples. Assume that tumor samples have $K$ subtypes with proportions $p_k$ satisfying $\sum_{k=1}^{K} p_k = 1$. Define a latent indicator variable $Z$ as the membership of samples, i.e. $Z_{jk} = 1$ means that the $j$th pure tumor sample comes from subtype $k$. Each sample can belong to only one cancer subtype, so $\sum_{k=1}^{K} Z_{jk} = 1$. We assume that the transformed methylation level of CpG site $i$ in normal cells $j$ follows the normal distribution: $X_{ij} \sim N(\mu_{i0}, \sigma_{i0}^2)$. The transformed methylation level at CpG site $i$ in tumor cells $j$ clustering into subtype $k$ is assumed to follow a mixture of normal distributions: $Y_{ij} \mid Z_{jk} = 1 \sim N(\mu_{ik}, \sigma_{ik}^2)$, where different subtypes have different means and variances. In practice, clinical tumor samples are affected by tumor purity, so methylation for pure tumor cells is unobserved. Instead, the observed data from clinical tumor samples, denoted by $Y'_{ij}$, are from mixed cancer-normal tissues. For tumor sample $j$, let $\lambda_j$ be the tumor purity; we have

$$Y'_{ij} = \lambda_j Y_{ij} + (1 - \lambda_j) X_{ij}.$$

Assuming that $X_{ij}$ and $Y_{ij}$ are independent, we have the distribution for the mixed signal as

$$Y'_{ij} \mid Z_{jk} = 1 \sim N\big(\lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big).$$

The data model shows that due to the presence of cancer/normal sample mixing, directly clustering $Y'_{ij}$ could lead to biased results. In the next section, we present a model-based clustering algorithm where tumor purities are considered.
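To make the effect of mixing concrete, here is a small sketch that computes the mean and variance of the mixed signal from the formula above. The function name and the numbers in the usage note are hypothetical and chosen only for illustration.

```python
def mixed_signal_moments(purity, mu_k, var_k, mu_0, var_0):
    """Mean and variance of Y' = purity*Y + (1 - purity)*X.

    Assumes Y ~ N(mu_k, var_k) and X ~ N(mu_0, var_0) are independent,
    as in the data model.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must lie in [0, 1]")
    mean = purity * mu_k + (1.0 - purity) * mu_0
    var = purity ** 2 * var_k + (1.0 - purity) ** 2 * var_0
    return mean, var
```

For example, two subtypes with means 1.0 and -1.0 and a normal mean of 0 are separated by 2.0 at purity 1, but only by 0.4 at purity 0.2, so low-purity samples from different subtypes look alike.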

2.2 Model-based clustering method

For our method presented below, the tumor purities λj are assumed to be known. There are a number of methods available for purity estimation (Ahn et al. 2013; Bao et al. 2014; Carter et al. 2012; Yoshihara et al. 2013), and an informative review is presented by Wang et al. (2016). After obtaining the tumor purities λj from existing methods, the clustering problem model transforms into a K-component normal mixture model.

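We develop the following method, termed InfiniumClust, to cluster methylation beta values from 450k arrays. In the clustering problem, the input data are $Y'_{ij}$ and $\lambda_j$. Denote the parameter set to be estimated as

$$\Theta = \{p_1, p_2, \ldots, p_{K-1};\ \mu_{11}, \ldots, \mu_{1K}, \mu_{21}, \ldots, \mu_{2K}, \ldots, \mu_{C1}, \ldots, \mu_{CK};\ \mu_{10}, \ldots, \mu_{C0};\ \sigma_{11}^2, \ldots, \sigma_{1K}^2, \sigma_{21}^2, \ldots, \sigma_{2K}^2, \ldots, \sigma_{C1}^2, \ldots, \sigma_{CK}^2;\ \sigma_{10}^2, \ldots, \sigma_{C0}^2\}.$$

In detail, $p_1, p_2, \ldots, p_{K-1}$ are mixing proportions, $(\mu_{11}, \ldots, \mu_{1K}, \mu_{21}, \ldots, \mu_{2K}, \ldots, \mu_{C1}, \ldots, \mu_{CK};\ \sigma_{11}^2, \ldots, \sigma_{1K}^2, \sigma_{21}^2, \ldots, \sigma_{2K}^2, \ldots, \sigma_{C1}^2, \ldots, \sigma_{CK}^2)$ are means and variances of each mixing cancer component and $(\mu_{10}, \ldots, \mu_{C0};\ \sigma_{10}^2, \ldots, \sigma_{C0}^2)$ are means and variances of the normal cells. Under this setup, the clustering problem can be solved through the following EM algorithm.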

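First, the conditional likelihood for observing the methylation status of sample $j$ is

$$p(Y'_{ij} \mid Z_{jk} = 1) = p_k\, \phi\big(Y'_{ij};\ \lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big),$$

where $\phi$ is the probability density function of the normal distribution. Treating $Z_{jk}$ as missing data, the joint likelihood of the observed and missing data for CpG site $i$ and sample $j$ is

$$p(Y'_{ij}; Z_j) = \prod_{k=1}^{K} \Big\{ p_k\, \phi\big(Y'_{ij};\ \lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big) \Big\}^{Z_{jk}}.$$

So the complete data log-likelihood for the parameters is

$$l(\Theta; Y', Z) = \sum_{i=1}^{C} \sum_{j=1}^{N} \sum_{k=1}^{K} Z_{jk} \Big( \log \phi\big(Y'_{ij};\ \lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big) + \log p_k \Big).$$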

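In the EM algorithm, the E-step involves calculating the conditional expectation of the complete data log-likelihood, which gives the objective Q function as

$$Q(\Theta \mid \Theta^{(t)}) = E_{\Theta^{(t)}}\{ l(\Theta; Y', Z) \mid Y' \} = \sum_{i=1}^{C} \sum_{j=1}^{N} \sum_{k=1}^{K} E_{\Theta^{(t)}}(Z_{jk} \mid Y') \Big( \log \phi\big(Y'_{ij};\ \lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big) + \log p_k \Big).$$

The E-step calculates the expected value of $Z_{jk}$ conditional on the observed data and the parameter values at the current step, denoted by $\Theta^{(t)}$. At the current iteration step $t$, denote the expected value of $Z_{jk}$ as $Z_{jk}^{(t)}$. The E-step gives

$$Z_{jk}^{(t)} \equiv E_{\Theta^{(t)}}(Z_{jk} \mid Y') = \Pr\nolimits_{\Theta^{(t)}}(Z_{jk} = 1 \mid Y') = \frac{\Pr_{\Theta^{(t)}}(Z_{jk} = 1, Y')}{\Pr_{\Theta^{(t)}}(Y')} = \frac{\Pr_{\Theta^{(t)}}(Y' \mid Z_{jk} = 1)\, \Pr_{\Theta^{(t)}}(Z_{jk} = 1)}{\Pr_{\Theta^{(t)}}(Y')} = \frac{p_k^{(t)} \prod_{i=1}^{C} \phi\big(Y'_{ij};\ \lambda_j \mu_{ik} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{ik}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big)}{\sum_{l=1}^{K} p_l^{(t)} \prod_{i=1}^{C} \phi\big(Y'_{ij};\ \lambda_j \mu_{il} + (1 - \lambda_j)\mu_{i0},\ \lambda_j^2 \sigma_{il}^2 + (1 - \lambda_j)^2 \sigma_{i0}^2\big)}.$$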
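The E-step for a single sample can be sketched as follows. This is an illustrative plain-Python version, not the InfiniumClust source; the function names and argument layout are assumptions, and the products over CpG sites are accumulated in log space to avoid numerical underflow.

```python
import math

def log_normal_pdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def e_step(y_obs, purity, p, mu, var, mu0, var0):
    """Posterior subtype probabilities for one tumor sample.

    y_obs[i]  : observed transformed methylation at CpG i
    purity    : tumor purity lambda_j of this sample
    p[k]      : current mixing proportions
    mu[i][k], var[i][k] : current subtype means and variances
    mu0[i], var0[i]     : current normal means and variances
    """
    log_w = []
    for k in range(len(p)):
        total = math.log(p[k])
        for i, y in enumerate(y_obs):
            m = purity * mu[i][k] + (1.0 - purity) * mu0[i]
            v = purity ** 2 * var[i][k] + (1.0 - purity) ** 2 * var0[i]
            total += log_normal_pdf(y, m, v)
        log_w.append(total)
    top = max(log_w)  # subtract the maximum before exponentiating
    w = [math.exp(x - top) for x in log_w]
    s = sum(w)
    return [x / s for x in w]
```

Subtracting the largest log weight before exponentiating leaves the normalized probabilities unchanged while keeping the computation stable when many CpG sites are used.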

The M-step maximizes the conditional expectation of the objective function $Q(\Theta \mid \Theta^{(t)})$ with respect to current-step parameters.

For the updates of $\mu_{ik}$ and $\mu_{i0}$, we have

$$A - B\,\mu_{ik}^{(t+1)} - C\,\mu_{i0}^{(t+1)} = 0$$

and

Aμi0(t+1)=B,

where

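$$A = \sum_{j=1}^{N} \frac{Z_{jk}^{(t)} \lambda_j Y'_{ij}}{\lambda_j^2 \sigma_{ik}^{2(t)} + (1 - \lambda_j)^2 \sigma_{i0}^{2(t)}}, \quad B = \sum_{j=1}^{N} \frac{Z_{jk}^{(t)} \lambda_j^2}{\lambda_j^2 \sigma_{ik}^{2(t)} + (1 - \lambda_j)^2 \sigma_{i0}^{2(t)}}, \quad C = \sum_{j=1}^{N} \frac{Z_{jk}^{(t)} \lambda_j (1 - \lambda_j)}{\lambda_j^2 \sigma_{ik}^{2(t)} + (1 - \lambda_j)^2 \sigma_{i0}^{2(t)}}.$$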

Combining the $K + 1$ equations, we can obtain the updates for $\mu_{ik}$ and $\mu_{i0}$. In practice, we update $\mu_{ik}$ and $\mu_{i0}$ one by one for each CpG site and for each group, by solving one equation at a time. So the maximization of the $\mu$'s is essentially a conditional maximization step in the Expectation Conditional Maximization (ECM) algorithm (Meng and Rubin 1993).
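A conditional update of a single subtype mean can be sketched by solving the first stationarity equation for the subtype mean while holding the normal mean and the variances at their current values. The function below is an illustrative assumption about how such a step might be coded, not the package implementation.

```python
def update_mu_ik(y_obs, purity, z_k, var_ik, mu_i0, var_i0):
    """Conditional update of mu_ik at one CpG site i and subtype k.

    y_obs[j], purity[j], z_k[j] : observed value, purity and current
                                  posterior weight of sample j
    var_ik, mu_i0, var_i0       : current values held fixed
    Solves A - B*mu_ik - C*mu_i0 = 0 for mu_ik.
    """
    A = B = C = 0.0
    for y, lam, z in zip(y_obs, purity, z_k):
        v = lam ** 2 * var_ik + (1.0 - lam) ** 2 * var_i0
        A += z * lam * y / v
        B += z * lam ** 2 / v
        C += z * lam * (1.0 - lam) / v
    return (A - C * mu_i0) / B
```

With noise-free data generated from a known subtype mean, this update recovers that mean exactly, which is a convenient sanity check.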

For the updates of $\sigma_{ik}^2$ and $\sigma_{i0}^2$, the M-step is more challenging. The partial derivatives of the Q function with respect to $\sigma_{ik}^2$ or $\sigma_{i0}^2$ show that these variances appear in both the numerator and the denominator of each term in a summation over samples $j$ from 1 to $N$. Therefore, closed-form solutions for updating $\sigma_{ik}^2$ and $\sigma_{i0}^2$ do not exist and the updates have to be obtained numerically. We applied the ‘optimize’ function in R directly to the objective function $Q(\Theta \mid \Theta^{(t)})$ to update $\sigma_{ik}^2$ and $\sigma_{i0}^2$.
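The one-dimensional numerical update can be illustrated with a plain golden-section search standing in for R's optimize (which combines golden-section search with successive parabolic interpolation). The sketch below is an assumption-laden illustration: the function names, the search interval and the tolerance are hypothetical, and only the terms of Q that depend on one subtype variance are evaluated.

```python
import math

def q_sigma_ik(var_ik, y_obs, purity, z_k, mu_ik, mu_i0, var_i0):
    """Terms of Q that depend on sigma_ik^2 at one CpG site and subtype."""
    total = 0.0
    for y, lam, z in zip(y_obs, purity, z_k):
        m = lam * mu_ik + (1.0 - lam) * mu_i0
        v = lam ** 2 * var_ik + (1.0 - lam) ** 2 * var_i0
        total += z * -0.5 * (math.log(2.0 * math.pi * v) + (y - m) ** 2 / v)
    return total

def golden_section_max(f, lo, hi, tol=1e-8):
    """Maximize a unimodal function f on the interval [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            # The maximum lies in [a, d]; reuse c as the new right probe.
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:
            # The maximum lies in [c, b]; reuse d as the new left probe.
            a, c = c, d
            d = a + inv_phi * (b - a)
    return (a + b) / 2.0
```

A variance update would then be something like `golden_section_max(lambda v: q_sigma_ik(v, ys, lams, z, mu_ik, mu_i0, var_i0), 1e-6, 10.0)`, where the bounds are a hypothetical choice for the search interval.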

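The update for $p_k$ can be obtained by setting the following partial derivative to zero:

$$\frac{\partial Q(\Theta \mid \Theta^{(t)})}{\partial p_k} = \sum_{i=1}^{C} \sum_{j=1}^{N} \left\{ Z_{jk}^{(t)} \frac{1}{p_k} - Z_{jK}^{(t)} \frac{1}{1 - \sum_{l=1}^{K-1} p_l} \right\} = 0,$$

thus

$$p_k^{(t+1)} = \frac{\sum_{j=1}^{N} Z_{jk}^{(t)}}{\sum_{l=1}^{K} \sum_{j=1}^{N} Z_{jl}^{(t)}}.$$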
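The proportion update is simply the average posterior weight of each subtype. A minimal sketch, with an illustrative function name and a toy posterior matrix:

```python
def update_proportions(z):
    """Mixing proportion update from posterior weights.

    z[j][k] is the current posterior probability that sample j
    belongs to subtype k.
    """
    n_subtypes = len(z[0])
    col_sums = [sum(row[k] for row in z) for k in range(n_subtypes)]
    total = sum(col_sums)  # equals the number of samples when rows sum to 1
    return [s / total for s in col_sums]

print(update_proportions([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]))
```

Because each row of the posterior matrix sums to 1, the denominator equals the number of samples.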

The EM algorithm starts with initial values obtained from the K-means clustering directly performed on Yij'. The results from the EM procedure provide the posterior probabilities of each sample being in each subtype, which can be used to determine the subtype assignments.

3 Results

For all the results presented in this section, the estimated purities are provided by InfiniumPurify (Zhang et al. 2015, 2017) and obtained from https://zenodo.org/record/253193.

3.1 DNA methylation subtypes are biased by tumor purities

We first explored the real data to check whether tumor purity tends to bias the clustering results by comparing the purity distributions among clusters. We focus our attention on breast cancer (BRCA) since it has mature clinical subtypes known as Luminal A, Luminal B, HER2-enriched, Basal-like and Normal-like, which are characterized by expressions of ER, PR and Her2. We also downloaded the consensus clustering results by NMF (cNMF) using DNA methylation 450k array data, and performed K-means clustering on the same data set. We tried two sets of estimated tumor purities: the InfiniumPurify purities, which are estimated from DNA methylation 450k array data, and the ABSOLUTE purities, which are based on SNP array data. The latter ABSOLUTE purity estimates are the de facto gold standard provided by TCGA. These two types of tumor purities have been shown to be highly correlated (Zhang et al. 2015). In the following analysis, we examined the distributions of tumor purities from both InfiniumPurify and ABSOLUTE across the clusters from K-means, cNMF and PAM50, respectively.

Figure 1 presents the purity distributions of different clusters obtained from the above three methods. We observed significant purity differences using InfiniumPurify among different subtypes from K-means, PAM50 and especially cNMF (Fig. 1B). For example, the third cluster in cNMF has an average purity of 0.5, well below the other groups. Figure 1C shows the results from PAM50 subtypes, where the purity differences are significant (P-value = 4.8e−4); for example, the two luminal subtypes, especially luminal B, have much higher purities than basal-like samples. However, this P-value is far less significant than those for K-means and cNMF (P-values are 1.02e−51 and 5.04e−98, respectively).

Fig. 1.

Purity distribution for different BRCA subtypes. (A–C) InfiniumPurify purity distribution on K-means clustering subtypes, cNMF subtypes and PAM50 molecular subtypes. (D–F) ABSOLUTE purity distribution on K-means clustering subtypes, cNMF subtypes and PAM50 molecular subtypes. P-values are from linear regression with ANOVA F-test

Even though the number of tumor samples with ABSOLUTE purities is much less than that with InfiniumPurify purities (264 versus 746 samples), we still detected significant discrepancies in purities among different subtypes in Figure 1D and E. For example, the ABSOLUTE purities of the third cluster are still way below other groups (Fig. 1E), and the PAM50 subtypes still have the smallest purity differences compared with K-means, cNMF (Fig. 1F). Overall, we consistently observe significant purity differences among different subtypes in typically used K-means and cNMF methods by purities from both InfiniumPurify and ABSOLUTE. These results demonstrate that tumor purities will bias the clustering results if not correctly accounted for, and a clustering method with consideration of purity is therefore needed.

3.2 Simulation

To evaluate InfiniumClust in cancer sample clustering, we conducted comprehensive simulation studies to compare the performance of our method with other available methods. In the simulation presented below, we used data from BRCA as template. Since methylation level ranges from 0 to 1, it is a natural choice to simulate methylation level from beta distribution. In detail, data are generated using the following scheme:

  1. Pure normal samples: For sample $j$ at CpG $i$, we generated its methylation level as $X_{ij} \sim \mathrm{Beta}(\alpha_{i0}, \beta_{i0})$. Here $\alpha_{i0}$ and $\beta_{i0}$ are the method of moments estimates (MME) from the beta values of a total of 96 normal BRCA samples.

  2. Pure tumor samples: For sample $j$ at CpG $i$, let $Y_{ij} \sim \mathrm{Beta}(\alpha_{ik}, \beta_{ik})$ in subtype $k$, where $\alpha_{ik}$ and $\beta_{ik}$ are estimated from the following procedure. We first convert $\alpha_{i0}$ and $\beta_{i0}$ to mean and dispersion (denoted by $m_i$ and $d_i$, respectively). The dispersion parameter in the beta distribution represents the variance that is independent of the mean. Because the pure tumor samples are more heterogeneous than normal, we multiply the above dispersions by 2 to obtain the tumor dispersions (Neve et al. 2006; Zheng et al. 2014). For the mean of pure tumor samples, we randomly selected $K$ normal samples from BRCA and used their beta values as the means. With means and dispersions, we converted them back to $\alpha$ and $\beta$ by the formulas $\alpha_{ik} = m_{ik}(1/d_i - 1)$ and $\beta_{ik} = (1 - m_{ik})(1/d_i - 1)$, and then generated beta values for $K$ subtypes.

  3. Observed samples: We generated tumor purity values $\lambda_j$, $j = 1, 2, \ldots, N$, uniformly from $[0.05, 0.95]$. Substituting $X_{ij}$ (from (1)), $Y_{ij}$ (from (2)) and $\lambda_j$ into the formula $Y'_{ij} = \lambda_j Y_{ij} + (1 - \lambda_j) X_{ij}$ gives $Y'_{ij}$, the observed methylation level, which is a mixture of methylation levels from pure cancer and normal samples.
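The three-step scheme above can be sketched for a single CpG site and sample as follows. This is a hedged illustration in plain Python: the function names are hypothetical, the dispersion is taken to be $1/(\alpha + \beta + 1)$ (which is what the conversion formulas imply), and the check that the doubled dispersion stays below 1 is an added assumption.

```python
import random

def beta_mme(values):
    """Method-of-moments estimates (alpha, beta) from values in (0, 1)."""
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    common = m * (1.0 - m) / v - 1.0  # estimate of alpha + beta
    return m * common, (1.0 - m) * common

def to_mean_dispersion(alpha, beta):
    """Convert (alpha, beta) to mean and dispersion 1/(alpha + beta + 1)."""
    return alpha / (alpha + beta), 1.0 / (alpha + beta + 1.0)

def from_mean_dispersion(m, d):
    """Inverse conversion: alpha = m(1/d - 1), beta = (1 - m)(1/d - 1)."""
    scale = 1.0 / d - 1.0
    return m * scale, (1.0 - m) * scale

def simulate_observed(alpha0, beta0, tumor_mean, rng):
    """Return (purity, observed beta value) for one CpG and one sample."""
    _, d0 = to_mean_dispersion(alpha0, beta0)
    d_tumor = 2.0 * d0  # tumor dispersion is twice the normal dispersion
    if d_tumor >= 1.0:
        raise ValueError("doubled dispersion must stay below 1")
    alpha_k, beta_k = from_mean_dispersion(tumor_mean, d_tumor)
    x = rng.betavariate(alpha0, beta0)    # pure normal value
    y = rng.betavariate(alpha_k, beta_k)  # pure tumor value
    lam = rng.uniform(0.05, 0.95)         # tumor purity
    return lam, lam * y + (1.0 - lam) * x
```

Passing a seeded `random.Random` instance as `rng` makes a simulation run reproducible.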

We applied InfiniumClust and K-means to the simulated data, and compared their clustering performances. Because the group assignment of each sample is known, the accuracy is defined as the percentage of correctly clustered samples. Since the group labels from all clustering methods are arbitrary, we use the following procedure to match the clustering results with the truth. Assuming there are $K$ clusters, we first tabulate the group assignments for all samples from the truth and clustering results into a $K \times K$ table. Entry $(i, j)$ in the table represents the number of samples belonging to the $i$th group in truth and predicted as the $j$th group by the clustering method. We then permute the rows and columns of the table so that the sum of the diagonal elements achieves the maximum. Finally, the sum of the diagonal elements over the total number of samples is defined as the accuracy. For these simulations the data are generated from a 3-subtype mixture with mixing proportions ratio 0.2:0.5:0.3, and all simulations are repeated 20 times.
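The label-matching accuracy described above can be sketched with a brute-force search over relabelings, which is practical for the small numbers of clusters considered here. The function name is illustrative, and labels are assumed to be integers starting at 0.

```python
from itertools import permutations

def clustering_accuracy(truth, predicted, n_groups):
    """Best agreement between true and predicted labels over all relabelings.

    truth, predicted : lists of integer labels in 0..n_groups-1
    """
    # Contingency table: rows are true groups, columns are predicted groups.
    table = [[0] * n_groups for _ in range(n_groups)]
    for t, p in zip(truth, predicted):
        table[t][p] += 1
    # Try every assignment of predicted labels to true labels.
    best = max(
        sum(table[i][perm[i]] for i in range(n_groups))
        for perm in permutations(range(n_groups))
    )
    return best / len(truth)
```

For larger numbers of clusters, an assignment algorithm such as the Hungarian method would avoid the factorial cost, but that is beyond what this sketch needs.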

First, we evaluated the effect of CpG selection on the accuracy of InfiniumClust. Instead of selecting CpG sites with the top 1000 largest variances, we randomly selected 1000 CpG sites to run InfiniumClust. Results show that the accuracy could be significantly worse compared with choosing 1000 CpG sites with the highest variances (Fig. 2A). This demonstrates that probes with larger variances due to different normal cell contaminations are more informative for clustering.

Fig. 2.

(A) Predicting accuracy on InfiniumClust by selecting 1000 CpG sites with the largest variance and 1000 randomly selected CpG sites. (B,C) Predicting accuracy in different number of CpG sites and sample sizes on InfiniumClust and K-means

Based on the above analysis, we selected CpG sites with largest variances in tumor tissues and used their data as the input for clustering in the following analysis. First, we compared InfiniumClust and K-means at different numbers of selected CpG sites from 50, 100, 200, 300, 500, 800 and 1000, respectively. As shown in Figure 2B, overall the accuracies of InfiniumClust are much higher than those by K-means (around 0.94 versus 0.81) regardless of the number of CpG sites used. More importantly, InfiniumClust is robust against the numbers of CpG sites, but the accuracy of K-means gradually decreases with more CpG sites used.

We also tested the performance of InfiniumClust with varied sample sizes. If we have only 10 samples, the accuracies by InfiniumClust and K-means are almost the same (∼0.8). But the accuracy of InfiniumClust increases to 0.9 when sample size increases from 30 to 500, while the accuracy of K-means roughly remains the same (∼0.76) (Fig. 2C).

Compared with traditional clustering methods, InfiniumClust takes purity into consideration, so we further tested the effect of purity on the algorithm in the following ways. First, we divided tumor samples into two groups (high purity and low purity) using the median purity among samples as the cutoff. Figure 3A shows that tumor samples with higher purities have a higher chance of being clustered correctly. This is expected because samples with lower purities tend to be incorrectly clustered due to their higher normal cell contamination; they are likely to be clustered together since they are more similar to normal samples. Second, we examined the influence of the accuracy of purity estimation on the algorithm. We randomly shuffled the purities of all tumor samples and used them as input to InfiniumClust. As shown in Figure 3B, clustering accuracies decreased significantly to around 0.75, which is similar to the performance of K-means. This is not surprising because running InfiniumClust with shuffled purities is comparable to ignoring tumor purities, as K-means does. Next, to test the robustness of our model against errors in purity estimation, we added different levels of Gaussian random noise to the purities of tumor samples. Specifically, we added to each tumor purity a Gaussian noise term with mean 0 and standard deviation ranging from 0 to 1 in steps of 0.02. Because the resulting purities could fall outside [0, 1] after adding noise, we set them to 0.01 if lower than 0 and to 0.99 if larger than 1. As expected, the accuracy of InfiniumClust decreases as the standard deviation increases, but it remains over 0.8 (K-means ∼0.75, Fig. 3C). This indicates that InfiniumClust still performs better than K-means even if the estimated tumor purities are inaccurate.
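The noise-injection step in this robustness test can be sketched as below. The function name is hypothetical; the truncation rule follows the description in the text (values below 0 become 0.01, values above 1 become 0.99).

```python
import random

def perturb_purities(purities, sd, rng):
    """Add Gaussian noise with mean 0 and standard deviation sd to each purity."""
    noisy = []
    for lam in purities:
        value = lam + rng.gauss(0.0, sd)
        if value < 0.0:
            value = 0.01
        elif value > 1.0:
            value = 0.99
        noisy.append(value)
    return noisy

rng = random.Random(1)
print(perturb_purities([0.2, 0.5, 0.8], 0.1, rng))
```

Sweeping `sd` from 0 to 1 in steps of 0.02 and rerunning the clustering at each level would reproduce the design of the experiment, under the assumption that the rest of the pipeline is unchanged.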

Fig. 3.

(A) Predicting accuracy of InfiniumClust on samples with high purity and samples with low purity. (B) Comparing the accuracy of InfiniumClust for the samples with the precise purity and samples with the shuffled purity. (C) Scatter plot of the noise of purity versus the accuracy of InfiniumClust, polynomial regression curve is displayed

We then evaluated the performance of InfiniumClust under different subtype proportions. We selected 200 samples, 1000 CpG sites with the largest variance to run InfiniumClust and K-means, with different proportion ratios of three subtypes. As shown in Figure 4A, InfiniumClust always performs well, on average about 0.9 for most scenarios. Even if proportions of subtypes are very unbalanced, e.g. 0.05:0.05:0.9, InfiniumClust still has good accuracy (>0.8). We also examined several proportions at four and five subtypes. InfiniumClust achieves accuracies as high as 0.9, whereas the accuracies of K-means are only around 0.7 (Fig. 4B and C).

Fig. 4.

(A) Heatmap of different proportion of subtypes of K = 3 on InfiniumClust, where row indexes the proportion of the first subtype, column indexes the proportion of the second subtype. (B,C) Barplot of predicting accuracy in different proportion ratios of subtypes of K = 4 and K = 5

Another possible procedure to account for the purity effect is to estimate the pure cancer methylome and perform clustering (such as K-means) on the purified data directly. Given the methylation levels of a normal-adjacent sample and the estimated purity, the pure cancer methylome can be inferred by subtracting the normal signal from the tumor data based on a linear equation that accounts for purity. One caveat of this approach is that there are far fewer normal samples than tumor samples in TCGA (only 674 normal-adjacent samples from 12 cancer types), i.e. only a small proportion of the tumor samples have corresponding normal controls. In this case, one can compute the average normal methylome and subtract that to obtain the purified cancer methylome. We conducted simulations to evaluate the performance of this approach (termed ‘puKmeans’ hereafter). Overall, the accuracies of puKmeans are higher than those of K-means, but lower than those of InfiniumClust (Supplementary Fig. S4). This is expected, because puKmeans takes the point estimates of the pure cancer methylome as inputs but ignores the variance, thus the information in the data is not used in an optimal way. Furthermore, we found that puKmeans is sensitive to the number of CpG sites used in the clustering. In our simulation, using 100 CpG sites provides the best results, and using more CpG sites leads to lower accuracy. In contrast, InfiniumClust is very robust against the number of CpG sites. These results demonstrate the advantage of the proposed model-based clustering method over a simplified approach that clusters purified data.
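The purification step behind puKmeans amounts to inverting the linear mixing equation for each CpG site. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the average normal methylome is supplied by the caller, and clipping the estimates to [0, 1] is an added choice for beta-scale data that the text does not specify.

```python
def purify(tumor, purity, normal_mean):
    """Point estimate of the pure cancer methylation profile for one sample.

    tumor[i]       : observed methylation at CpG i in the tumor sample
    purity         : estimated tumor purity of the sample (must be > 0)
    normal_mean[i] : average methylation at CpG i over normal samples
    """
    if purity <= 0.0:
        raise ValueError("purity must be positive")
    pure = []
    for y, x in zip(tumor, normal_mean):
        est = (y - (1.0 - purity) * x) / purity
        pure.append(min(max(est, 0.0), 1.0))  # assumed clipping to [0, 1]
    return pure
```

Dividing by a small purity amplifies measurement noise, which is consistent with the observation that a point-estimate approach uses the data less efficiently than the model-based method.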

Under all simulation scenarios, InfiniumClust achieves better accuracies. Moreover, InfiniumClust is robust against CpG site selections, sample sizes, biases in purity estimation and proportions of clusters. These results demonstrate the advantages of InfiniumClust in clustering tumor samples while considering tumor purity, as well as a well-constructed model.

3.3 Application of InfiniumClust to TCGA data

With the success of InfiniumClust on simulation data, we next tested InfiniumClust on real tumor samples. We analyzed all samples with both NMF clustering results and 450k array data (23 cancer types from TCGA). Data from 1000 CpG sites with the largest variance among tumor samples were used for InfiniumClust and K-means. For each cancer type, the number of clusters used for InfiniumClust and K-means was set to the number of clusters reported by NMF.

First, we explored purity distributions between correctly clustered and incorrectly clustered samples. Since the true clusters are unknown in real data, we used the ‘consensus’ samples of the three methods (NMF, K-means and InfiniumClust) in each cancer type as a proxy for truth under the assumption that samples clustered into the same group by different methods tend to form a true cluster. The ‘consensus’ sample is defined as follows. First consider two methods A and B that both cluster samples into N groups. The group indices from the clustering methods are dummy indicator variables that bear no biological meaning. To look at the agreements of the two methods, we first fix the group indices for method A, and then shuffle the group indices of method B to get the maximum overlap from all groups between the two methods. The overlapped samples in each cluster of the two methods are termed as ‘consensus’ samples, while the rest are ‘non-consensus’ samples. Similarly, for ‘consensus’ samples among three methods, we first get consensus samples between any two methods, then compare the consensus samples with the results from the third method using the same procedure. The consensus samples obtained from this comparison are defined as ‘consensus’ samples among three methods. All others are deemed ‘non-consensus’ samples. For all 23 cancer types in TCGA, we found that 50% of the samples belong to the consensus group on the average. In general, we observed significant differences in purity levels between the two groups in most cancer types (Fig. 5A, results for other cancer types are shown in Supplementary Fig. S1). Samples with higher purity tend to be in the consensus group. The result is consistent with the simulation result that samples with higher purities tend to be clustered correctly.
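The pairwise consensus procedure described above can be sketched as follows, again with a brute-force search over relabelings of method B. The function name is illustrative and labels are assumed to be integers starting at 0.

```python
from itertools import permutations

def consensus_samples(labels_a, labels_b, n_groups):
    """Indices of samples on which methods A and B agree.

    Method A's group indices are held fixed; method B's indices are
    relabeled to maximize the overlap with A.
    """
    best_perm, best_overlap = None, -1
    for perm in permutations(range(n_groups)):
        overlap = sum(1 for a, b in zip(labels_a, labels_b) if a == perm[b])
        if overlap > best_overlap:
            best_perm, best_overlap = perm, overlap
    return [j for j, (a, b) in enumerate(zip(labels_a, labels_b))
            if a == best_perm[b]]
```

A three-method consensus could then be approximated by restricting all label lists to the returned indices and applying the same function against the third method.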

Fig. 5.

(A) The distribution of InfiniumPurify Purity in consensus samples versus non-consensus samples in THCA and BRCA. (B) Testing purity differences among clusters from three methods for 23 cancer types. P-values are from linear regression and F-test

Next, we examined the purity difference among different clustering results in these three methods (Supplementary Fig. S5). We computed the P-values by testing purity difference among clusters from different methods. As shown in Figure 5B after -log10 transformation, InfiniumClust gives much less significant P-values compared with the other two methods. These results indicate that InfiniumClust is less affected by the purities due to the inclusion of purities in clustering. They also show that compared with K-means, purity has stronger influences on NMF. Overall, these results emphasize the risk of ignoring tumor purity when applying unsupervised clustering, and demonstrate good performances from InfiniumClust.

We also checked the overlap of clusters in these three methods. Supplementary Figure S2 shows the pairwise overlaps from the three methods. The average overlap between clusters of InfiniumClust and K-means is 0.72 for 23 cancer types. In contrast, the overlaps between the clustering results from NMF and other two methods are much lower: the average overlap of NMF and InfiniumClust is 0.54. These results are expected because algorithm-wise, InfiniumClust and K-means are very similar (K-means and the normal mixture model perform similarly when data are normally distributed). The only difference is the consideration of purities in InfiniumClust. In contrast, NMF is based on a different method, thus it tends to produce different results.

To test the robustness of our method, we also used ABSOLUTE purities to repeat the above analysis. We selected eight cancer types with ABSOLUTE purities to compute purity distributions in consensus and non-consensus samples, the purity difference among different clusters, and the overlap of clusters in these three methods, respectively. The results are consistent with those using InfiniumPurify purities. In particular, we again observed significant differences in purity levels between the consensus and non-consensus samples in most cancer types, and InfiniumClust gives much less significant results compared with the other two methods (K-means and NMF). In BLCA and LUAD in particular, the cluster overlap between InfiniumClust and K-means is even over 0.9 (Supplementary Fig. S3). These results further support the conclusion that InfiniumClust is less influenced by tumor purities. Therefore, we believe InfiniumClust provides robust results in real data.

We further applied puKmeans to TCGA tumor data (12 cancer types, 5185 tumor samples and 674 normal samples). For comparison, we also ran InfiniumClust and K-means clustering on these cancer-normal mixtures. Since it is difficult to evaluate performance without a gold standard, we performed only exploratory analyses. As shown in Supplementary Table S1, the average overlap between the clusters of InfiniumClust and puKmeans is slightly higher than that between InfiniumClust and K-means across the 12 cancer types (0.712 versus 0.696).

4 Discussion

In this study, we systematically investigated the impact of tumor purity as a confounding factor in unsupervised clustering of tumor samples, and proposed a statistical model to adjust for the effect of purity in tumor sample clustering. We first found that under the traditional K-means and NMF approaches, tumor purities bias the clustering results (samples with similar purities are likely to cluster together), and that tumor samples with low purities tend to be misclassified. We then designed a model-based statistical method, InfiniumClust, for subtype classification based on DNA methylation data. In InfiniumClust, methylation levels from tumor samples at each CpG site are modeled as a mixture of normal distributions. Parameter estimation and sample clustering are conducted by an EM-type algorithm. In simulations, InfiniumClust achieved more robust and accurate results than the K-means algorithm. When applied to real TCGA tumor samples, InfiniumClust produced the least biased clusters compared with K-means and the well-established NMF method. These results reinforce our claim that purity differences may confound genomic analyses if not correctly accounted for. To the best of our knowledge, InfiniumClust is the first method for unsupervised clustering of cancer subtypes that adjusts for tumor purity.

In our model, we assume a Gaussian distribution for the transformed methylation level at each CpG site: data from normal samples follow a single Gaussian distribution, and data from tumor samples follow a mixture of Gaussian distributions. We validated these assumptions in real data and found that they approximately hold, with only mild violations (Supplementary Material S1). Moreover, according to our simulation results, the clustering algorithm still performs well even if some CpG sites do not satisfy the normality assumption.
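
To make the role of purity in such a mixture model concrete, the following Python sketch computes subtype membership probabilities for a single tumor sample, in the spirit of an E-step. It rests on simplifying assumptions that are not taken from the article: the expected methylation at each CpG site is a purity-weighted average of a subtype mean and a normal-tissue mean, all sites share one known standard deviation, and sites are independent. All names and numbers are hypothetical.

```python
import math


def log_normal_pdf(x, mean, sd):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def subtype_posteriors(y, purity, subtype_means, normal_means, sd, weights):
    """Posterior probability of each subtype for one tumor sample.

    Assumed mean structure (illustrative only):
        E[y_j | subtype k] = purity * mu_kj + (1 - purity) * m_j
    """
    log_post = []
    for k, mu_k in enumerate(subtype_means):
        ll = math.log(weights[k])
        for y_j, mu_kj, m_j in zip(y, mu_k, normal_means):
            expected = purity * mu_kj + (1.0 - purity) * m_j
            ll += log_normal_pdf(y_j, expected, sd)
        log_post.append(ll)
    # Normalize on the log scale for numerical stability.
    top = max(log_post)
    unnorm = [math.exp(v - top) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical setting: three CpG sites, two subtypes, one low-purity sample.
normal_means = [0.0, 0.0, 0.0]
subtype_means = [[1.0, 1.0, 0.0],   # subtype 0
                 [0.0, 1.0, 1.0]]   # subtype 1
sample = [0.3, 0.3, 0.0]            # subtype 0 signal diluted to 30% purity

posteriors = subtype_posteriors(
    sample, purity=0.3, subtype_means=subtype_means,
    normal_means=normal_means, sd=0.2, weights=[0.5, 0.5],
)
```

In a full EM-type procedure these probabilities would be alternated with updates of the subtype means, variances and mixing weights. The point of the sketch is only that the sample is compared against purity-diluted subtype profiles rather than against pure ones, so a low-purity sample is not pulled toward whichever cluster happens to resemble normal tissue.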

The current version of InfiniumClust is specifically designed for the Infinium 450k methylation array, which is the most widely used platform for DNA methylation. It is conceivable that the same principles and methods can be applied to data from other platforms, or even to other types of genomics data. For example, gene expression or copy number variation may play more direct roles in tumorigenesis, and such data from tumor samples are influenced by purity as well. In future work, we will model gene expression and copy number data with consideration of purity. It may even be possible to integrate all these data into a unified model to further improve clustering accuracy.

Supplementary Material

Supplementary Data

Acknowledgement

The results here are in whole or part based upon data generated by The Cancer Genome Atlas (TCGA) Research Network: http://cancergenome.nih.gov/.

Funding

This project was partially supported by the National Natural Science Foundation of China [61572327 to X.Z.], the National Institutes of Health [R01GM122083 to H.W.] and the Jiangxi province humanities fund [TJ1301 to W.Z.].

Conflict of Interest: none declared.

References

  1. Ahn J. et al. (2013) DeMix: deconvolution for mixed cancer transcriptomes using raw measured data. Bioinformatics, 29, 1865–1871.
  2. Aran D. et al. (2015) Systematic pan-cancer analysis of tumour purity. Nat. Commun., 6, 8971.
  3. Bao L. et al. (2014) AbsCN-seq: a statistical method to estimate tumor purity, ploidy and absolute copy numbers from next-generation sequencing data. Bioinformatics, 30, 1056–1063.
  4. Bird A. (2002) DNA methylation patterns and epigenetic memory. Genes Dev., 16, 6–21.
  5. Brunet J.P. et al. (2004) Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. U. S. A., 101, 4164–4169.
  6. Carter S.L. et al. (2012) Absolute quantification of somatic DNA alterations in human cancer. Nat. Biotechnol., 30, 413–421.
  7. Chung C.H. et al. (2002) Molecular portraits and the family tree of cancer. Nat. Genet., 32, 533–540.
  8. Das P.M., Singal R. (2004) DNA methylation and cancer. J. Clin. Oncol., 22, 4632–4642.
  9. Hackett J.A., Surani M.A. (2013) DNA methylation dynamics during the mammalian life cycle. Philos. Trans. R. Soc. Lond. B Biol. Sci., 368, 20110328.
  10. Hansen K.D. et al. (2011) Increased methylation variation in epigenetic domains across cancer types. Nat. Genet., 43, 768–775.
  11. Hoadley K.A. et al. (2014) Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell, 158, 929–944.
  12. Houseman E.A. et al. (2008) Model-based clustering of DNA methylation array data: a recursive-partitioning algorithm for high-dimensional data arising as a mixture of beta distributions. BMC Bioinformatics, 9, 365.
  13. Kuan P.F. et al. (2010) A statistical framework for Illumina DNA methylation arrays. Bioinformatics, 26, 2849–2855.
  14. Leong D. et al. (2014) Advances in adjuvant systemic therapy for non-small-cell lung cancer. World J. Clin. Oncol., 5, 633–645.
  15. Li E. et al. (1993) Role for DNA methylation in genomic imprinting. Nature, 366, 362–365.
  16. Meng X., Rubin D.B. (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika, 80, 267–278.
  17. Neve R.M. et al. (2006) A collection of breast cancer cell lines for the study of functionally distinct cancer subtypes. Cancer Cell, 10, 515–527.
  18. Nordlund J. et al. (2015) DNA methylation-based subtype prediction for pediatric acute lymphoblastic leukemia. Clin. Epigenetics, 7, 11.
  19. Ogino S. et al. (2012) How many molecular subtypes? Implications of the unique tumor principle in personalized medicine. Expert. Rev. Mol. Diagn., 12, 621–628.
  20. Park Y., Wu H. (2016) Differential methylation analysis for BS-seq data under general experimental design. Bioinformatics, 32, 1446–1453.
  21. Parker J.S. et al. (2009) Supervised risk predictor of breast cancer based on intrinsic subtypes. J. Clin. Oncol., 27, 1160–1167.
  22. Stefansson O.A. et al. (2015) A DNA methylation-based definition of biologically distinct breast cancer subtypes. Mol. Oncol., 9, 555–568.
  23. Verhaak R.G. et al. (2010) Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell, 17, 98–110.
  24. Virmani A.K. et al. (2002) Hierarchical clustering of lung cancer cell lines using DNA methylation markers. Cancer Epidemiol. Biomarkers Prev., 11, 291–297.
  25. Wang F. et al. (2016) Tumor purity and differential methylation in cancer epigenomics. Brief Funct Genomics, 15, 408–419.
  26. Wiwie C. et al. (2015) Comparing the performance of biomedical clustering methods. Nat. Methods, 12, 1033–1038.
  27. Yoshihara K. et al. (2013) Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun., 4, 2612.
  28. Zhang N. et al. (2015) Predicting tumor purity from methylation microarray data. Bioinformatics, 31, 3401–3405.
  29. Zheng X. et al. (2017) Estimating and accounting for tumor purity in the analysis of DNA methylation data from cancer studies. Genome Biol., 18, 17.
  30. Zheng X. et al. (2014) MethylPurify: tumor purity deconvolution and differential methylation detection from single tumor DNA methylomes. Genome Biol., 15, 419.
  31. Zhuang J. et al. (2012) The dynamics and prognostic potential of DNA methylation changes at stem cell gene loci in women's cancer. PLoS Genet., 8, e1002517.


Articles from Bioinformatics are provided here courtesy of Oxford University Press
