Briefings in Bioinformatics. 2022 Dec 28;24(1):bbac568. doi: 10.1093/bib/bbac568

A Bayesian model for identifying cancer subtypes from paired methylation profiles

Yetian Fan 1,2, April S Chan 3, Jun Zhu 4,5, Suet Yi Leung 6, Xiaodan Fan 7
PMCID: PMC9851340  PMID: 36575828

Abstract

Aberrant DNA methylation is the most common molecular lesion that is crucial for the occurrence and development of cancer, but has thus far been underappreciated as a clinical tool for cancer classification, diagnosis or as a guide for therapeutic decisions. Partly, this has been due to a lack of proven algorithms that can use methylation data to stratify patients into clinically relevant risk groups and subtypes that are of prognostic importance. Here, we proposed a novel Bayesian model to capture the methylation signatures of different subtypes from paired normal and tumor methylation array data. Application of our model to synthetic and empirical data showed high clustering accuracy, and was able to identify the possible epigenetic cause of a cancer subtype.

Keywords: cancer subtyping, Bayesian method, paired methylation profiles, clustering

Introduction

Cancers are highly complex and heterogeneous diseases. According to certain characteristics, including different histologies, molecular profiles and specific mutations, cancers can be grouped into different subtypes, which play a crucial role in providing proper treatment and prognosis [1–3]. For example, based on visual characteristics of tumor tissues under a microscope, a cancer can be divided into several clinically meaningful subtypes, a procedure called histological evaluation [4, 5]. As another example, analyzing gene expression profiles is a fundamental and validated procedure for identifying cancer subtypes, which can serve as predictive markers for designing personalized treatments [6–8]. Since genetic and epigenetic alterations are commonly known causal factors of cancers [9–12], grouping cancers according to different DNA methylation aberration patterns could promote the understanding of the critical role of epigenetic mechanisms in cellular processes and improve the effectiveness of cancer detection, diagnosis and prognosis [13–15].

Specifically, differential methylation analysis has shown that alterations in DNA methylation are closely associated with the occurrence and development of tumors [16–19]. Hypermethylation can result in gene silencing by repressing transcription of the promoter regions of tumor suppressor genes [20], whereas global hypomethylation can increase genomic and chromosomal instability [21, 22], both leading to cancer development. As some CpG islands are only methylated in specific tumor types, analysis of cancer type-specific differentially methylated regions revealed stochastic methylation variation that can be identified as signatures to distinguish different types of cancers [23]. Moreover, analysis of the distinct methylation patterns may distinguish different histological subtypes of lung cancers [24]. Although clinical progression is difficult to predict from current prognostic factors and treatments, DNA methylation profiles can be used to identify tumor subtypes and provide a better understanding of individual tumor biology, which will help clinicians personalize patient treatment strategies [25]. All these studies suggest that cancer subtypes can be identified by their unique DNA methylation signatures.

Recently, promoted in particular by high-throughput measurement technologies, a variety of computational methods have been applied to cluster cancers into subtypes. We review some of the most closely related work here. Auwera et al. [25] developed a statistical method to investigate some specific DNA methylation patterns, which was applied to distinguish subtypes of breast cancers. Rhee et al. [26] introduced an integrated approach combining DNA methylation and gene expression data to analyze breast cancers, where the results showed that methylation status may contribute to the inference of cancer subtypes. By analyzing the cancer cell-intrinsic signals from bulk tumor transcriptomic data, the computational deconvolution method DeClust stratifies patients into subtypes and provides the tumor-type-specific stromal profiles [27]. Zhang et al. [28] proposed InfiniumClust, which applies a mixture of Gaussian distributions to cluster cancer subtypes with consideration of tumor purity. Siegmund et al. [29] introduced a Bernoulli-lognormal mixture model to cluster cancers based on DNA methylation data obtained by the quantitative MethyLight technology. Holm et al. [13] used K-means and hierarchical clustering to cluster cancers, and their results demonstrate that there are three groups of breast cancers with specific methylation profiles. A nonnegative matrix factorization method was applied to decompose DNA methylation profiles and identify five consensus clusters, which were correlated with specific genetic abnormalities [30]. Furthermore, some methods integrated different types of data for clustering. For example, based on DNA methylation, gene expression, copy number, mutational and clinical data from pancreatic cancer patients, Mishra and Guda [31] investigated the correlation between DNA methylation and differential gene expression profiles. For a more comprehensive review of molecular subtyping methods, one can refer to Zhao et al. [32].

Unlike subtyping methods that rely on predefined genes or regions, de novo subtyping methods automatically identify biomarkers for patient clustering in a data-driven fashion. Thus de novo methods are more useful in scenarios where the molecular mechanism is less clear, such as the effect of differential methylation. Some de novo subtyping methods are supervised learning methods, which need a training dataset in which the patients are already labeled by group. Such supervised subtyping methods cannot help with complex diseases that lack sufficient labeling. A major problem of the existing de novo unsupervised subtyping methods is that the subtypes they infer may not be related to any disease mechanism, but may purely correspond to different values of confounding variables such as age or race. To remove such confounding variables, more and more biomedical studies collect paired samples from tumor tissue and adjacent normal tissue, in the hope that the comparison between the paired samples can reveal how normal cells changed into cancer cells. A principled statistical method is needed to explicitly model such changes in order to efficiently detect the typical cancerization paths, where each path corresponds to a cancer subtype. In other words, traditional subtyping methods produce clusters of cancer status, whereas we are interested in detecting clusters of aberrant changes.

In this paper, we propose a Bayesian mixture model for Subtyping (BaySub), which can provide the aberrant DNA methylation patterns of the cancer subtypes. Next, we describe our novel algorithm and demonstrate its performance in both synthetic and empirical data. The Bayesian estimation procedure for this model is provided in the Supplementary Information.

Proposed method

Data

We analyze paired DNA methylation array data from cancer patients. The original DNA methylation $\beta$-value is the ratio of the methylated probe intensity to the overall intensity, which ranges from 0 to 1. To better fit the Gaussian assumption of statistical modeling, the M-value is more widely used to analyze differential methylation, as it is a real number and approximately homoscedastic [33]; it is defined as $M = \log_2\left(\frac{\beta}{1-\beta}\right)$.
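The transform between the two scales can be sketched in a few lines. This is a minimal illustration, assuming the usual base-2 logit definition of the M-value; the function names and the clipping constant `eps` are our own choices and are not part of the original method.

```python
import math


def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta-value in [0, 1] to an M-value (base-2 logit)."""
    # Clip away from 0 and 1 so the logarithm stays finite.
    # The guard value eps is an assumption made for this sketch.
    b = min(max(beta, eps), 1.0 - eps)
    return math.log2(b / (1.0 - b))


def m_to_beta(m):
    """Inverse transform: map an M-value back to a beta-value."""
    return 2.0 ** m / (2.0 ** m + 1.0)
```

A beta-value of 0.5 maps to an M-value of 0, so positive M-values indicate that the methylated signal dominates and negative values indicate the opposite.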

Model

The BaySub algorithm is illustrated in Figure 1. Suppose we have Inline graphic pairs of M-value vectors for matched tumor and normal samples, denoted by Inline graphic, where Inline graphic and Inline graphic are both M-value vectors for Inline graphic CpG sites. We assume the Inline graphic normal samples have evolved from a reference normal cell with only non-disease-causing changes at a small percentage of CpG sites. As a CpG site is either methylated or unmethylated, a binary variable Inline graphic following a Bernoulli distribution is used to model the methylation status of the reference normal cell at the Inline graphic-th CpG site:

graphic file with name DmEquation1.gif

Figure 1.

A flowchart to illustrate the BaySub algorithm. The methylation status of the reference normal cell is represented by a binary vector Inline graphic, and the rate of methylation is set to be Inline graphic. Based on this reference methylation status of normal tissue, a binary underlying methylation profile Inline graphic could be generated with mutation probability Inline graphic. Suppose there are Inline graphic different paths Inline graphic changing the methylation status from normal tissues to become tumor tissues, and the mutation rate is fixed at Inline graphic. After obtaining the membership of the cancer subtypes Inline graphic, the tumor methylation profile can be generated from the normal methylation profile Inline graphic, the membership Inline graphic and the mutation paths Inline graphic. Therefore, the M-values of normal tissues follow a normal distribution according to their methylation profile Inline graphic, and the corresponding M-values of tumor tissues follow a normal distribution according to their methylation profile Inline graphic.

which means the variable Inline graphic takes the value 1 with methylation probability Inline graphic.

For every normal tissue, we assume its binary underlying methylation profile Inline graphic is generated from the reference normal cell profile Inline graphic with random turning over probability Inline graphic:

graphic file with name DmEquation2.gif

where ‘Inline graphic’ stands for the exclusive-or operator.

For a normal tissue to become a tumor tissue, we assume that there are Inline graphic different paths from the corresponding normal tissue, each originated from a binary modification profile Inline graphic and leading to a different subtype of the cancer of interest. More specifically, we assume the probability of turning over an original status in Inline graphic is Inline graphic, i.e.

graphic file with name DmEquation3.gif

We assume each tumor sample contains one and only one subtype. We use a binary vector Inline graphic to denote the signature membership of the Inline graphic-th tumor, where Inline graphic 0 or 1, Inline graphic. More specifically, we assume the signature follows a multinomial distribution:

graphic file with name DmEquation4.gif

The binary tumor methylation profile is assumed to be generated from the corresponding normal tissue profile with the perturbation at a subset of sites specified by the corresponding subtype signature:

graphic file with name DmEquation5.gif

Due to the methylation measurement error or the cell heterogeneity of a tissue, we assume the observation M-value follows a normal distribution centered around a typical value of the corresponding methylation status.

For the Inline graphic-th normal tissue at the Inline graphic-th CpG site, the measured M-value Inline graphic is assumed to follow Inline graphic if the corresponding Inline graphic says it is unmethylated (i.e. Inline graphic), and to follow Inline graphic otherwise:

graphic file with name DmEquation6.gif

Similarly, we assume for the Inline graphic-th tumor tissue at the Inline graphic-th CpG site, the measured M-value Inline graphic:

graphic file with name DmEquation7.gif

In the above model, each row of the modification matrix Inline graphic provides a unique path from a normal cell to a cancer cell, thus corresponding to one cancer subtype. If a cell in a row of Inline graphic takes value 1, it means the methylation change of the corresponding CpG site is associated with the cancerization of that subtype. For an individual patient, the corresponding Inline graphic specifies the subtype membership, which is estimated from its posterior distribution using a maximum a posteriori (MAP) approach.
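To make the generative story concrete, the following is a minimal simulation sketch of the hierarchy described above: a reference normal cell, per-patient normal profiles obtained by rare flips, one binary modification path per subtype, and Gaussian M-values. All names (`simulate_pairs`, `p_meth`, `p_flip`, `p_modify`) and the numerical defaults are illustrative assumptions; uniform subtype weights are also assumed, and the sketch does not reproduce the authors' estimation procedure.

```python
import random


def simulate_pairs(n_pairs, n_sites, n_subtypes, p_meth=0.5, p_flip=0.01,
                   p_modify=0.2, seed=0):
    """Draw paired normal/tumor M-value vectors from a BaySub-like hierarchy."""
    rng = random.Random(seed)
    # Assumed emission parameters: (mean, sd) of the M-value for status 0 and 1.
    normal_emit = {0: (-2.6, 1.5), 1: (1.3, 1.4)}
    tumor_emit = {0: (-2.9, 1.1), 1: (1.1, 1.7)}
    # Reference normal cell: one Bernoulli draw per CpG site.
    ref = [int(rng.random() < p_meth) for _ in range(n_sites)]
    # One binary modification profile (path) per subtype.
    paths = [[int(rng.random() < p_modify) for _ in range(n_sites)]
             for _ in range(n_subtypes)]
    data = []
    for _ in range(n_pairs):
        # Normal profile: reference status XOR rare random flips.
        normal = [r ^ int(rng.random() < p_flip) for r in ref]
        # Each tumor belongs to exactly one subtype (uniform weights assumed).
        k = rng.randrange(n_subtypes)
        # Tumor profile: normal profile flipped at the sites marked by path k.
        tumor = [x ^ w for x, w in zip(normal, paths[k])]
        # Observed M-values: Gaussian noise around a status-specific mean.
        m_normal = [rng.gauss(*normal_emit[s]) for s in normal]
        m_tumor = [rng.gauss(*tumor_emit[s]) for s in tumor]
        data.append({"subtype": k, "normal": normal, "tumor": tumor,
                     "m_normal": m_normal, "m_tumor": m_tumor})
    return ref, paths, data
```

Because the tumor status is defined relative to the patient's own normal status, patient-specific background variation cancels out and only the subtype path separates the two profiles.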

For Bayesian inference, we introduced uniform priors and truncated uniform priors to ensure a proper posterior determined purely by the data. Markov chain Monte Carlo (MCMC) algorithms are utilized to iteratively update all the parameters of our method until convergence. Detailed distributions are given in the Supplementary Materials.

Results

Synthetic data experiments

In this section, we evaluate our method BaySub on synthetic datasets. To check the effect of different numbers of true subtypes, we perform three experiments following our model based on the settings in Table 1. Each experiment is repeated 20 times independently, i.e. we generate 20 independent and identically distributed datasets and estimate each separately, to evaluate both the mean performance for this experiment and the performance variation. For the Bayesian inference on each dataset, we run 10 independent chains of our MCMC algorithm. The number of iterations for each run of the MCMC algorithm is set to 200, which turns out to be enough for convergence and posterior sample collection.

Table 1.

Simulation settings

Parameters Experiment 1 Experiment 2 Experiment 3
# Instance 100 100 100
# Subtypes 2 3 4
# CpG Sites 8000 8000 8000

For convergence diagnosis, the key variables Inline graphic, Inline graphic and Inline graphic are selected to show the convergence of our method. One of the 20 trials is randomly selected from each experiment. Figure 2 shows how these variables change as the number of iterations increases. The results reveal that our method converges rapidly and stably. In addition, the Gelman–Rubin convergence diagnostic ($\hat{R}$) is calculated to evaluate the convergence of the MCMC. The values of each variable from the last 60 iterations are divided into two sequences of length 30. All the $\hat{R}$ values are smaller than 1.1, which indicates adequate convergence. Thus, both Figure 2 and Table 2 show rapid convergence of our algorithm.
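The convergence check described above can be sketched as follows. This is a minimal illustration using the classical Gelman–Rubin formula on two half-sequences taken from the tail of a single trace; the function names are our own, and the exact variant used by the authors is not stated, so the formula choice is an assumption.

```python
import math
from statistics import mean, variance


def gelman_rubin(chains):
    """Classical potential scale reduction factor for equal-length sequences."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    # W: average within-sequence sample variance.
    w = mean([variance(c) for c in chains])
    # B: between-sequence variance, scaled by the sequence length.
    b = n * variance(chain_means)
    # Pooled estimate of the posterior variance.
    var_hat = (n - 1) / n * w + b / n
    return math.sqrt(var_hat / w)


def split_tail_rhat(trace, tail=60):
    """Split the last `tail` draws of one trace into two halves and compare them."""
    last = trace[-tail:]
    half = len(last) // 2
    return gelman_rubin([last[:half], last[half:2 * half]])
```

With this formula, values marginally below 1 can occur for short sequences, because the pooled variance carries the factor (n - 1)/n.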

Figure 2.

The traceplots of key variables for simulation datasets. The whole MCMC samples are shown in the left figures, and the burn-in periods are illustrated in the right figures.

Table 2.

The $\hat{R}$ values of key parameters for the three simulation experiments. A $\hat{R}$ value closer to 1 indicates good convergence of the MCMC algorithm

Parameter/experiment 1 2 3
Inline graphic 1.038 1.024 1.044
Inline graphic 1.002 0.993 1.067
Inline graphic 1.090 0.994 0.991

We adopt two popular clustering performance measures, the adjusted Rand index (ARI) and normalized mutual information (NMI), to evaluate the clustering performance of our method [34, 35]. Moreover, we use two measures to quantify the inference accuracy of differential methylation. Based on the true value of the aberrant methylation signatures Inline graphic, we calculate the inference accuracy of all elements in the variable Inline graphic (denoted by AE), defined as

graphic file with name DmEquation8.gif

Alternatively, we measure the site-wise inference accuracy by the percentage of CpG sites whose statuses are all correctly inferred (denoted by AS), i.e.

graphic file with name DmEquation9.gif
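As a rough illustration of the two accuracy measures, the sketch below computes an element-wise accuracy (AE) and a site-wise accuracy (AS) from a true and an estimated binary signature matrix, each stored as a list of rows (one row per subtype, one column per CpG site). It assumes the rows of the two matrices have already been matched to the same subtype order; the function names are hypothetical.

```python
def accuracy_elements(w_true, w_est):
    """AE: percentage of matrix entries that are inferred correctly."""
    total = sum(len(row) for row in w_true)
    hits = sum(t == e
               for row_t, row_e in zip(w_true, w_est)
               for t, e in zip(row_t, row_e))
    return 100.0 * hits / total


def accuracy_sites(w_true, w_est):
    """AS: percentage of CpG sites whose entries are correct in every row."""
    n_sites = len(w_true[0])
    correct = sum(
        all(row_t[j] == row_e[j] for row_t, row_e in zip(w_true, w_est))
        for j in range(n_sites)
    )
    return 100.0 * correct / n_sites
```

AS can never exceed AE, since a single wrong entry at a site makes the whole site count as incorrect.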

Based on the above measures, three synthetic datasets are used to evaluate the performance of the BaySub method. As each experiment is repeated 20 times independently, the mean performance and standard deviation are summarized in Table 3, which shows that our method achieves high accuracy in both detecting cancer subtypes and identifying the methylation patterns.
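For readers who wish to reproduce the clustering scores, the adjusted Rand index can be computed directly from the contingency counts of the true and predicted labels. The sketch below is a plain standard-library version written for illustration; it is not the code used to produce Table 3.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same samples."""
    n = len(labels_true)
    # Contingency counts and the two sets of marginal counts.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())
    # Expected index under random labeling with fixed marginals.
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. both partitions put all samples in one cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

The index is invariant to relabeling of the clusters, which is why no matching between true and inferred subtype labels is needed for ARI.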

Table 3.

Performance of BaySub on simulation datasets. The numbers in the brackets are the standard deviation of corresponding performance measures based on 20 independent trials

Experiment Subtypes ARI NMI AE(%) AS(%)
1 2 1 (0.00) 1 (0.00) 99.91 (0.18) 99.90 (0.18)
2 3 1 (0.00) 1 (0.00) 99.99 (0.01) 99.98 (0.03)
3 4 1 (0.00) 1 (0.00) 99.98 (0.02) 99.93 (0.07)

For visualization, we plot the heatmap in Figure 3 to show the selected CpG sites, which capture the specific methylation signatures associated with different cancer subtypes. For clarity, we only plot those differentially methylated CpG sites on which the value of corresponding elements in Inline graphic should be 1 (i.e. differentially methylated between normal and tumor tissues). The colors in the left bar represent the true subtypes. From the figure, it is easy to see block structures in the heatmap, which indicate that these data include several methylation patterns.

Figure 3.

Heatmap of methylation signatures captured by BaySub algorithm on simulation datasets.

Each experiment is repeated 20 times on 20 independent datasets synthesized from the same parameter setting. In every trial, after burn-in, the estimate of each variable is calculated as the mean of the last 30 iterations. The accuracy of the estimates can then be checked through the mean and the corresponding standard deviation of these point estimates over the 20 independent trials. Table 4 presents the variables estimated by our method. The results show that our estimation method performs well on these simulation datasets.
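The summary just described amounts to two small steps, sketched below: a per-trial point estimate from the tail of the trace, followed by a mean and standard deviation across trials. The function names are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev


def point_estimate(trace, keep=30):
    """Posterior-mean estimate from the last `keep` MCMC iterations."""
    return mean(trace[-keep:])


def summarize_trials(traces, keep=30):
    """Mean and sample standard deviation of point estimates across trials."""
    estimates = [point_estimate(t, keep) for t in traces]
    return mean(estimates), stdev(estimates)
```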

Table 4.

Results of parameters estimation on simulation datasets. The numbers in the brackets are the standard deviation of corresponding estimates based on 20 independent trials

Parameters True Value Experiment 1 Experiment 2 Experiment 3
Inline graphic –2.6 –2.600 (0.002) –2.600 (0.002) –2.600 (0.002)
Inline graphic 1.5 1.494 (0.012) 1.500 (0.002) 1.499 (0.002)
Inline graphic 1.3 1.299 (0.003) 1.300 (0.003) 1.301 (0.003)
Inline graphic 1.4 1.422 (0.039) 1.401 (0.003) 1.400 (0.003)
Inline graphic –2.9 –2.903 (0.01) –2.902 (0.007) –2.897 (0.008)
Inline graphic 1.1 1.101 (0.012) 1.103 (0.007) 1.101 (0.009)
Inline graphic 1.1 1.099 (0.002) 1.100 (0.003) 1.102 (0.003)
Inline graphic 1.7 1.703 (0.006) 1.699 (0.005) 1.701 (0.003)
Inline graphic 0.5 0.501 (0.006) 0.500 (0.006) 0.501 (0.006)
Inline graphic 0.99 0.990 (0.000) 0.990 (0.000) 0.990 (0.000)
Inline graphic 0.2 0.202 (0.003) 0.201 (0.003) 0.200 (0.003)

In our method, the number of subtypes Inline graphic must be given in advance, but it might be difficult to obtain the accurate value for real datasets. Here, the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are used to select an appropriate model by balancing goodness of fit against model complexity. We calculate the AIC and BIC values of the model with Inline graphic values ranging from 2 to 6 on the Experiment 2 and Experiment 3 datasets. The results are shown in Figure 4: for both AIC and BIC, the minimum value corresponds to the true number of subtypes. Hence, Inline graphic can be readily learnt from the data.
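The selection rule can be sketched with the textbook definitions of the two criteria. How the log-likelihood and the number of free parameters are obtained for each fitted model is not specified here, so the sketch simply takes them as inputs; the function names and the `fits` dictionary layout are assumptions made for illustration.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2.0 * n_params - 2.0 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln n - 2 ln L."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


def select_num_subtypes(fits, criterion="bic"):
    """Pick the candidate number of subtypes with the smallest criterion value.

    `fits` maps each candidate number of subtypes to a tuple
    (log_likelihood, number_of_free_parameters, number_of_observations).
    """
    scores = {}
    for k, (log_lik, n_params, n_obs) in fits.items():
        if criterion == "aic":
            scores[k] = aic(log_lik, n_params)
        else:
            scores[k] = bic(log_lik, n_params, n_obs)
    best = min(scores, key=scores.get)
    return best, scores
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes more strongly once the number of observations exceeds about seven.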

Figure 4.

The plots of AIC (left) and BIC (right) for different numbers of subtypes in Experiment 2 (dashed line) and Experiment 3 (solid line). The horizontal axis represents the assumed value of the number of subtypes Inline graphic, and the vertical axis represents the corresponding values of AIC or BIC. The true value of Inline graphic for red dashed line is 3, and that for green solid line is 4.

One may still worry that a wrongly specified number of subtypes may lead to meaningless results. To address this concern, we also test our method with a wrong number of cancer subtypes. We randomly select two datasets from Experiment 2 and Experiment 3 separately, and set the value of Inline graphic to be the true value plus or minus 1. The experimental results are shown in Table 5. For example, when the Inline graphic value is set to be 2 but the true value is 3, the first and third true subtypes are merged into one inferred group, which is marked by Inline graphic. As another example, when the Inline graphic value is set to be 4 but the true value is 3, our algorithm may lead to two prediction results. In the first result, the inferred subtype Inline graphic is actually an empty class (thus purely dominated by the prior without support from the data), while the other three inferred subtypes correspond to the three true subtypes perfectly. In the second result, the third true subtype is divided into the two inferred subtypes Inline graphic and Inline graphic. All the results demonstrate that our method BaySub can deal reasonably with a slightly wrong number of cancer subtypes.

Table 5.

Performance of BaySub with a wrongly specified number of cancer subtypes. The true number of subtypes is shown in the first column, and the used values of Inline graphic by our method for inference are displayed in the second column. The relations between true membership and predicted membership are provided in the last column. The numbers on the left and right of the symbol ‘Inline graphic’ represent the true and inferred membership, respectively. The percentage in the bracket indicates the proportion of the predicted membership involved in the corresponding matching. For example, ‘Inline graphic(65.2%)’ means 65.2% of the patients in the inferred subtype 1 are actually from the true subtype 3

Subtypes Used Inline graphic Matching results
3 2 Inline graphic (34.8%), Inline graphic(100%), Inline graphic(65.2%)
3 4 Inline graphic (100%), Inline graphic(100%), Inline graphic(100%)
or Inline graphic(100%), Inline graphic(100%), Inline graphic(100%), Inline graphic(100%)
4 3 Inline graphic (39.6%), Inline graphic(60.4%), Inline graphic(100%), Inline graphic(100%)
4 5 Inline graphic (100%), Inline graphic(100%), Inline graphic(100%), Inline graphic(100%)
or Inline graphic(100%), Inline graphic(100%), Inline graphic(100%), Inline graphic(100%), Inline graphic(100%)
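The matching entries in Table 5 can be derived from a cross-tabulation of true and inferred labels. The sketch below follows the caption's convention that the percentage is taken relative to the size of the inferred subtype; the function name and the output layout are our own.

```python
from collections import Counter


def matching_table(true_labels, inferred_labels):
    """List (true, inferred, percent) triples for every observed label pair.

    `percent` is the share of the inferred subtype's members that come
    from the given true subtype.
    """
    pair_counts = Counter(zip(true_labels, inferred_labels))
    inferred_sizes = Counter(inferred_labels)
    rows = []
    for (t, g), count in sorted(pair_counts.items()):
        rows.append((t, g, 100.0 * count / inferred_sizes[g]))
    return rows
```

For instance, if two true subtypes are merged into one inferred group, that group appears in two rows whose percentages sum to 100.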

As tumor heterogeneity affects the assignment of subtypes in the clinic, a common concern during data analysis is that tumor tissues may contain multiple cancer subtypes or be contaminated by normal cells during dissection. If every tumor contains more than one subtype, it would be more challenging for our method BaySub to deal with such a mixture of different subtypes and normal data. To check the performance of our method on ‘contaminated’ datasets, we run 9 extra experiments with different numbers of true subtypes and different component proportions. Suppose there are normal tissues and Inline graphic subtypes of tumor tissues; each of them has Inline graphic pairs of methylation vectors for Inline graphic CpG sites, denoted by Inline graphic, where Inline graphic; Inline graphic represents normal tissues, and Inline graphic for the different tumor tissues. Then a mixture dataset can be generated by integrating different subtypes and normal data in the following steps. For the Inline graphic-th pair of mixture methylation vectors Inline graphic, first, two subtypes of cancer (denoted by subtype Inline graphic and Inline graphic) are randomly selected from the subtypes Inline graphic. Second, the Inline graphic-th pair of methylation vectors is formed as a linear combination of tumor tissues from these two selected subtypes and normal tissues, whose proportions are represented by a vector Inline graphic, i.e. Inline graphic. The prediction target is set to be the major subtype, i.e. the one with the largest proportion. Last, we repeat these two steps until the number of pairs in the ‘contaminated’ dataset equals 100. The results are shown in Table 6, which illustrates that our method still achieves high accuracy on these ‘contaminated’ data when one subtype holds an outright majority, whereas the prediction accuracy decreases as this majority advantage gradually disappears.
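The mixing procedure can be sketched as below. For brevity the sketch mixes a single M-value vector per sample rather than a tumor/normal pair, and it assumes the proportion vector is ordered as (major subtype, second subtype, normal tissue); both simplifications, along with all names, are assumptions made for illustration.

```python
import random


def mix_vectors(major, minor, normal, proportions):
    """Linear combination of three M-value vectors with the given weights."""
    a, b, c = proportions
    return [a * x + b * y + c * z for x, y, z in zip(major, minor, normal)]


def make_contaminated(tumor_by_subtype, normal_samples, proportions,
                      n_samples=100, seed=0):
    """Build mixed samples; the label is the subtype with the largest weight."""
    rng = random.Random(seed)
    subtypes = sorted(tumor_by_subtype)
    mixed = []
    for _ in range(n_samples):
        # Step 1: draw two distinct subtypes at random.
        major_k, minor_k = rng.sample(subtypes, 2)
        # Step 2: combine one sample from each with one normal sample.
        major = rng.choice(tumor_by_subtype[major_k])
        minor = rng.choice(tumor_by_subtype[minor_k])
        normal = rng.choice(normal_samples)
        mixed.append((major_k, mix_vectors(major, minor, normal, proportions)))
    return mixed
```

With proportions such as (0.4, 0.3, 0.3), the major component contributes less than half of the signal, which is the regime where Table 6 reports a clear drop in accuracy.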

Table 6.

Performance of BaySub on ‘contaminated’ datasets. The true number of subtypes is reported in the first column. The proportions of Inline graphic are shown in the second column. The biggest number is the proportion of major subtype, which is the predicting target. All the accuracies are displayed in the third to sixth columns

Subtypes Proportions ARI NMI AE(%) AS(%)
2 (0.8,0.1,0.1) 1 1 99.56 99.54
2 (0.6,0.2,0.2) 1 1 100 100
2 (0.4,0.3,0.3) 0.2981 0.2914 91.73 83.65
3 (0.8,0.1,0.1) 1 1 100 99.99
3 (0.6,0.2,0.2) 1 1 99.73 99.20
3 (0.4,0.3,0.3) 0.2417 0.3622 83.81 51.44
4 (0.8,0.1,0.1) 1 1 99.98 99.91
4 (0.6,0.2,0.2) 1 1 99.73 98.93
4 (0.4,0.3,0.3) 0.2776 0.4213 84.32 41.86

In our model, we assume that the methylation values of the CpG sites are independent and identically distributed (i.i.d.), and that the methylation profiles of both tumor and normal tissues follow Gaussian distributions. In this part, we evaluate the performance of our method BaySub on simulation datasets that violate these assumptions: the values are generated from a t distribution, and the methylation statuses of the tumor and normal tissues are dependent. First, the vector Inline graphic, representing the methylation status of the reference normal cell, is separated into 100 regions randomly. For every CpG site, the methylation status takes the value 1 with a probability of 0.7 or 0.3, and the probabilities for two adjacent regions are different. Second, every row of the modification matrix Inline graphic is also separated into 100 regions randomly. The values within one region are the same, and they take the value 1 with a probability of 0.2. Therefore, the elements in both vector Inline graphic and matrix Inline graphic are dependent, so the generated methylation profiles for tumor and normal tissues are dependent. The other parameters are the same as the values shown in Table 4. Although the BaySub algorithm is based on the i.i.d. and normality assumptions, the experimental results in Table 7 show that it can still obtain high prediction accuracy and extract meaningful methylation patterns.
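One way to generate such region-dependent profiles and heavy-tailed noise is sketched below. The alternating 0.7/0.3 pattern, the shared value within each region of a modification row, and the probability 0.2 follow the description above; the function names and the way region boundaries are drawn are assumptions, and the degrees of freedom of the t distribution are left as a parameter because they are not stated here.

```python
import random


def random_bounds(n_sites, n_regions, rng):
    """Cut the site range into contiguous regions with random boundaries."""
    cuts = sorted(rng.sample(range(1, n_sites), n_regions - 1))
    return [0] + cuts + [n_sites]


def blockwise_reference(n_sites, n_regions, rng, probs=(0.7, 0.3)):
    """Reference status whose methylation probability alternates by region."""
    bounds = random_bounds(n_sites, n_regions, rng)
    ref = []
    for r in range(n_regions):
        p = probs[r % 2]  # adjacent regions use different probabilities
        ref.extend(int(rng.random() < p) for _ in range(bounds[r], bounds[r + 1]))
    return ref


def blockwise_path(n_sites, n_regions, rng, p_one=0.2):
    """Modification row in which all sites of a region share one value."""
    bounds = random_bounds(n_sites, n_regions, rng)
    path = []
    for r in range(n_regions):
        value = int(rng.random() < p_one)  # one draw per region
        path.extend([value] * (bounds[r + 1] - bounds[r]))
    return path


def student_t(df, rng):
    """Draw from a Student t distribution as normal / sqrt(chi-square / df)."""
    z = rng.gauss(0.0, 1.0)
    chi2 = rng.gammavariate(df / 2.0, 2.0)  # chi-square with df degrees of freedom
    return z / (chi2 / df) ** 0.5
```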

Table 7.

Performance of BaySub on t distribution and correlated methylation levels

Experiment # Instance # Subtypes # CpG Sites ARI NMI AE(%) AS(%)
1 100 2 8000 1 1 99.91 99.81
2 100 3 8000 1 1 98.47 95.46
3 100 4 8000 1 1 98.28 93.27

Empirical data experiments

In this section, we evaluate the performance of our method BaySub on real human cancer data. We downloaded methylation data from both a whole-genome sequencing study of gastric cancer and the TCGA database, which contain methylation array data from both tumor and corresponding normal tissues [36]. All the DNA methylation profiles are measured with the same probe layout of the 450K array, which is the most widely used platform for DNA methylation. According to that study, gastric cancer is a heterogeneous disease that can be separated into three molecular subgroups: microsatellite instability (denoted by MSI), Epstein–Barr virus (denoted by EBV) and others. For the TCGA database, datasets selected by primary site are used to compose the second trial. Since these methylation profiles have some missing data, only the CpG sites shared by all datasets in the corresponding trial are used to infer the cancer subtypes. More details of each trial are given in Table 8. The number of iterations is 200 and the number of chains is 10.

Table 8.

Specification of real datasets

Trial index | Primary site | Data project | Subtypes | Pairs | # CpG sites
1 | Gastric | Gastric Cancer | MSI / EBV / others | 6+3+24 | 459,158
2 | Esophagus | ESCA | Squamous Cell Neoplasms / Adenomas and Adenocarcinomas | 3+12 | 390,481

To evaluate the prediction accuracy, the true molecular subgroup of each cancer can be used to assess the estimated variable Inline graphic. To evaluate the inferred methylation signatures Inline graphic, we summarize all the CpG sites for every cancer subtype to generate a standard value Inline graphic. First, we calculate the mean M-value vectors of every cancer subtype on all Inline graphic CpG sites, denoted by Inline graphic, where Inline graphic. Then, based on the definition of the M-value, 0 is selected as the threshold to divide all CpG sites for every cancer subtype into hypermethylation and hypomethylation. For example, for the Inline graphic-th CpG site of the Inline graphic subtype, the average methylation status for tumor is represented by Inline graphic, if Inline graphic, otherwise Inline graphic, and the average methylation status for normal data Inline graphic is defined in the same way. Therefore, the average methylation status on all CpG sites for every subtype can be represented by Inline graphic, where Inline graphic. Finally, we can calculate the target variable Inline graphic, where Inline graphic. Based on the obtained Inline graphic, we can calculate both the accuracy of elements and the accuracy of CpG sites in variable Inline graphic. The comparison results are shown in Table 9, which illustrates the performance of our method BaySub in both predicting subtypes and identifying methylation patterns of different cancer subtypes. The posterior estimates of key parameters are listed in Table 10, including the posterior mean and standard deviation.
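The construction of the reference signature can be sketched as follows for one subtype: average the M-values per site, threshold the averages at 0, and mark the sites where the tumor and normal statuses disagree. Reading the final step as an exclusive-or of the two binary statuses is our interpretation of the description above, and the function names are hypothetical.

```python
def mean_vector(samples):
    """Site-wise mean over a list of equal-length M-value vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def methylation_status(mean_m):
    """Binary status per site: 1 if the mean M-value is above 0, else 0."""
    return [1 if m > 0 else 0 for m in mean_m]


def target_signature(tumor_samples, normal_samples):
    """Sites where the average tumor status differs from the normal status."""
    tumor_status = methylation_status(mean_vector(tumor_samples))
    normal_status = methylation_status(mean_vector(normal_samples))
    # XOR marks a site as changed when exactly one of the two is methylated.
    return [t ^ s for t, s in zip(tumor_status, normal_status)]
```

Applying `target_signature` to the tumor and normal samples of each known subtype yields one row of the reference matrix against which the inferred signatures are scored.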

Table 9.

Performance of BaySub on real datasets

Trial index ARI NMI AE(%) AS(%)
1 0.1922 0.1477 88.86 78.21
2 0.1791 0.2630 86.06 78.18

Table 10.

Results of parameters estimation on real datasets. The numbers in the brackets are the standard deviation

Parameter/Trial 1 2
Inline graphic –3.2087 (0.0011) –2.7211 (0.0010)
Inline graphic 1.9599 (0.0004) 1.3372 (0.0008)
Inline graphic 6.4645 (0.0039) 1.4978 (0.0028)
Inline graphic 1.5536 (0.0006) 1.5689 (0.0014)
Inline graphic –3.5837 (0.0026) –2.9001 (0.0132)
Inline graphic 1.4812 (0.0019) 0.9904 (0.0000)
Inline graphic 7.3066 (0.0046) 1.2662 (0.0124)
Inline graphic 1.9583 (0.0011) 1.7434 (0.0061)
Inline graphic 0.5530 (0.0008) 0.5297 (0.0008)
Inline graphic 0.9996 (0.0000) 0.9927 (0.0001)
Inline graphic 0.0708 (0.0003) 0.0920 (0.0011)

Last, we also analyze the convergence of our method on the two real datasets. Both Figure 5 and Table 11 show that our method on real datasets converges rapidly.

Figure 5.

The traceplots of key variables for real datasets. The whole MCMC samples are shown in the left figures, and the burn-in periods are illustrated in the right figures.

Table 11.

The $\hat{R}$ values of key parameters for real datasets

Parameter/Experiment 1 2
Inline graphic 1.001 1.044
Inline graphic 1.015 1.067
Inline graphic 1.021 0.991

Performance comparison

In this section, we compare the performance of BaySub with other popular clustering methods: K-means [37], K-medoids (PAM, CLARA) [38], Hierarchical Clustering (denoted by HC) [39] and Non-negative Matrix Factorization (denoted by NMF) [40]. As the clustering results of K-means and NMF are unstable due to different initial values, the average accuracy over 100 trials is reported for comparison. The values of Inline graphic for all methods are set to be 4. First, we compare the accuracy of these methods on ‘pure’ datasets (every tumor contains only one subtype). Most parameters are the same as the values of the simulation datasets, which are given in Table 4. To increase the diversity of methylation statuses for tumor and normal tissues, the mean and the variance for normal tissue (Inline graphic, Inline graphic) and tumor tissue (Inline graphic, Inline graphic) are increased, respectively. The value of Inline graphic is decreased from 0.2 to 0.05 to reduce the difference between cancer subtypes. Based on Table 12, these factors have little influence on our method BaySub, while the performance of the other methods declines dramatically. Moreover, we also compare the performance of BaySub with other clustering methods on ‘contaminated’ datasets. We change the purity of the tumor by reducing the dominance of the major subtype, and alter the probability Inline graphic of turning over on the modification profile Inline graphic. The comparison results in Table 13 show that the purity and the difference between subtypes affect the performance of both BaySub and the other clustering methods, but the prediction accuracy of BaySub is generally higher than that of the other methods. Last, we compare the performance of BaySub with other methods on empirical data; the results are shown in Table 14. In conclusion, based on Tables 12–14, our method BaySub obtains better performance than some other clustering methods on both synthetic and empirical datasets.

Table 12.

Performance comparison on ‘pure’ datasets

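| Subtypes | Modified parameters | ARI: BaySub | ARI: K-means | ARI: PAM | ARI: CLARA | ARI: HC | ARI: NMF | NMI: BaySub | NMI: K-means | NMI: PAM | NMI: CLARA | NMI: HC | NMI: NMF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | Inline graphic | 1 | 0.8582 | 0.2858 | 0.4851 | 0.9224 | 0.0353 | 1 | 0.8848 | 0.4151 | 0.5811 | 0.8689 | 0.0854 |
| 4 | Inline graphic | 1 | 0.9080 | 1 | 1 | 1 | 0.9678 | 1 | 0.9228 | 1 | 1 | 1 | 0.9648 |
| 4 | Inline graphic | 1 | 0.7246 | 0.0686 | 0.0246 | 0.1822 | 0.0109 | 1 | 0.7009 | 0.1388 | 0.0448 | 0.1370 | 0.0524 |
| 4 | Inline graphic | 1 | 0.9284 | 1 | 1 | 1 | 0.8492 | 1 | 0.9360 | 1 | 1 | 1 | 0.8318 |
| 4 | Inline graphic | 1 | 0.0233 | –0.0076 | –0.0076 | 0.0048 | 0.0018 | 1 | 0.0575 | 0.0255 | 0.0255 | 0.0346 | 0.0385 |
| 4 | Inline graphic | 1 | 0.8559 | 1 | 1 | 1 | 0.6921 | 1 | 0.8802 | 1 | 1 | 1 | 0.7219 |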

Table 13.

Performance comparison on ‘contaminated’ datasets

| Subtypes | Proportion | Modified parameters | ARI: BaySub | ARI: K-means | ARI: PAM | ARI: CLARA | ARI: HC | ARI: NMF | NMI: BaySub | NMI: K-means | NMI: PAM | NMI: CLARA | NMI: HC | NMI: NMF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4 | (0.8,0.1,0.1) | Inline graphic | 1 | 0.8876 | 1 | 1 | 1 | 0.4496 | 1 | 0.9227 | 1 | 1 | 1 | 0.5309 |
| 4 | (0.6,0.2,0.2) | Inline graphic | 1 | 0.8924 | 1 | 1 | 1 | 0.2867 | 1 | 0.9212 | 1 | 1 | 1 | 0.3945 |
| 4 | (0.4,0.3,0.3) | Inline graphic | 0.7338 | 0.2951 | 0.3055 | 0.2531 | 0.2266 | 0.2050 | 0.7643 | 0.4058 | 0.4215 | 0.3948 | 0.3497 | 0.3006 |
| 4 | (0.8,0.1,0.1) | Inline graphic | 0.9734 | 0.9439 | 0.9862 | 1 | 1 | 0.1805 | 0.9442 | 0.9584 | 0.9667 | 1 | 1 | 0.2584 |
| 4 | (0.6,0.2,0.2) | Inline graphic | 1 | 0.9421 | 0.9409 | 0.9554 | 1 | 0.1674 | 1 | 0.9538 | 0.9209 | 0.9323 | 1 | 0.2405 |
| 4 | (0.4,0.3,0.3) | Inline graphic | 0.6386 | 0.3067 | 0.2784 | 0.1865 | 0.1955 | 0.0934 | 0.6421 | 0.4074 | 0.2931 | 0.2329 | 0.2864 | 0.1474 |
| 4 | (0.8,0.1,0.1) | Inline graphic | 0.8088 | 0.8247 | 0.1276 | 0.1276 | 0.2776 | 0.0265 | 0.7916 | 0.8483 | 0.1645 | 0.1645 | 0.3664 | 0.0661 |
| 4 | (0.6,0.2,0.2) | Inline graphic | 0.7821 | 0.5975 | 0.0734 | 0.0718 | 0.2742 | 0.0166 | 0.7515 | 0.5956 | 0.1315 | 0.1048 | 0.3001 | 0.0563 |
| 4 | (0.4,0.3,0.3) | Inline graphic | 0.0417 | 0.0668 | 0.0208 | 0.0118 | 0.0339 | 0.0063 | 0.2059 | 0.1159 | 0.0787 | 0.0548 | 0.0533 | 0.0437 |

Table 14.

Performance comparison on empirical datasets

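| Trial index | Subtypes | ARI: BaySub | ARI: K-means | ARI: PAM | ARI: CLARA | ARI: HC | ARI: NMF | NMI: BaySub | NMI: K-means | NMI: PAM | NMI: CLARA | NMI: HC | NMI: NMF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 0.1922 | 0.0213 | –0.0823 | –0.0823 | –0.1042 | 0.1193 | 0.1477 | 0.2009 | 0.0264 | 0.0264 | 0.0838 | 0.1415 |
| 2 | 2 | 0.1791 | 0.1652 | –0.1320 | –0.1320 | –0.1320 | 0.0272 | 0.2630 | 0.0907 | 0.0644 | 0.0644 | 0.0644 | 0.0438 |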
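The NMI columns in Tables 12–14 can be illustrated in the same spirit. The sketch below computes normalized mutual information with the geometric-mean normalization, I(X;Y) / sqrt(H(X) H(Y)), as in Strehl and Ghosh (reference 35). The exact normalization used for the tables is not restated in this section, so this choice is an assumption, and the function name is illustrative.

```python
from collections import Counter
from math import log, sqrt


def normalized_mutual_information(labels_true, labels_pred):
    """NMI = I(X;Y) / sqrt(H(X) * H(Y)) for two partitions of the same samples."""
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("need two non-empty label vectors of equal length")

    joint = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    def entropy(counts):
        return -sum((c / n) * log(c / n) for c in counts.values())

    # Mutual information from the contingency table (only non-empty cells).
    mutual_info = sum(
        (c / n) * log(c * n / (rows[a] * cols[b]))
        for (a, b), c in joint.items()
    )

    h_true, h_pred = entropy(rows), entropy(cols)
    if h_true == 0.0 or h_pred == 0.0:
        # At least one partition is a single cluster; define NMI as 1 only
        # when both are, and 0 otherwise (a convention, assumed here).
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / sqrt(h_true * h_pred)


if __name__ == "__main__":
    truth = [0, 0, 1, 1, 2, 2]
    print(normalized_mutual_information(truth, [2, 2, 0, 0, 1, 1]))
    print(normalized_mutual_information(truth, [0, 1, 0, 1, 0, 1]))
```

Unlike the ARI, this quantity is not corrected for chance and is never negative, which is one reason the two measures can rank methods differently on the same dataset.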

Conclusions

In this paper, we propose a Bayesian mixture model named BaySub for predicting cancer subtypes from paired methylation array data. Our method can capture the methylation signatures of different cancer subtypes. We evaluate its performance in both synthetic and empirical data experiments. In the synthetic data experiments, our method achieves high prediction accuracy, and the results show that the proposed algorithm is robust and converges well. Moreover, our method can not only select an appropriate number of subtypes using AIC and BIC values, but also cope with a misspecified number of subtypes. In the real data experiments, the performance of our method is evaluated on two datasets, which come from the gastric cancer project and the TCGA database. The results illustrate the good performance of our method on real datasets, and the estimated model parameters converge rapidly and stably. Furthermore, in the performance comparison, BaySub performs better than some other clustering methods on both synthetic and empirical datasets. Note that BaySub can in principle be applied to any disease in which methylation changes play an important role, as long as each patient provides both a normal and a disease sample for methylation profiling.
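As a rough illustration of selecting the number of subtypes with AIC and BIC, the sketch below scores candidate values of K using the textbook definitions AIC = 2p − 2 log L and BIC = p log n − 2 log L. How the log-likelihood and the parameter count p are obtained for BaySub is not restated here, so the `fits` dictionary, its numbers and the function names are purely hypothetical.

```python
from math import log


def aic(log_likelihood, n_params):
    """Akaike information criterion (smaller is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion (smaller is better)."""
    return n_params * log(n_obs) - 2 * log_likelihood


def select_num_subtypes(fits, n_obs):
    """Pick K by AIC and by BIC.

    fits maps each candidate K to (log-likelihood, number of free parameters).
    """
    scores = {
        k: (aic(ll, p), bic(ll, p, n_obs)) for k, (ll, p) in fits.items()
    }
    best_by_aic = min(scores, key=lambda k: scores[k][0])
    best_by_bic = min(scores, key=lambda k: scores[k][1])
    return best_by_aic, best_by_bic, scores


if __name__ == "__main__":
    # Hypothetical values, for illustration only (not results from the paper).
    fits = {2: (-1200.0, 10), 3: (-1100.0, 15), 4: (-1098.0, 20)}
    best_aic, best_bic, _ = select_num_subtypes(fits, n_obs=100)
    print(best_aic, best_bic)
```

Both criteria trade goodness of fit against model size; BIC penalizes extra parameters more heavily once n exceeds about 7, so it tends to favor fewer subtypes than AIC.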

There are several directions in which our method could be improved. First, we assume that the methylation values of the CpG sites are independent and identically distributed (i.i.d.) to reduce the complexity of the model. Simulation experiments show that, even when the simulated datasets are generated from heavy-tailed distributions with correlated methylation levels, BaySub under the i.i.d. assumption can still extract meaningful information from the overall picture of the data. In reality, however, the methylation statuses of neighboring sites appear to be positively correlated rather than independent. Thus, future work on relaxing this i.i.d. assumption may improve the performance of cancer subtyping. Second, the current version of BaySub is designed for methylation array data, but the binary methylation profiles used in BaySub are essentially well suited to single-cell methylation data. One may modify the Gaussian observation model to adapt BaySub to single-cell methylation data analyses. Finally, BaySub is currently designed for paired methylation data from tumors and adjacent normal tissues. In real tumor studies, not all patients can provide paired samples. Thus, it is important to extend BaySub to integrate unpaired and paired samples.

Key Points

  • We propose a new Bayesian method named BaySub for clustering cancer patients based on paired methylation array data, which can simultaneously identify the methylation changes associated with different cancer subtypes.

  • Our method can not only seek an appropriate number of clusters through AIC and BIC, but also properly handle the wrong specification of the number of subtypes.

  • We evaluated the performance of our method based on both synthetic and empirical data, which showed that our method could achieve high clustering accuracy.

Acknowledgments

This work was supported partially by two grants from the Research Grants Council of the Hong Kong Special Administrative Region, China (Theme-based Research Scheme T12-710/16-R; General Research Fund 14303819). The results published in this paper are in whole or part based upon data generated by the TCGA Research Network: https://www.cancer.gov/tcga.

A Appendix

A.1 Bayesian inference

In this section, we introduce the Bayesian approach used to estimate the parameters of our proposed model. For Bayesian inference, we assign uniform priors to most parameters, the exceptions being Inline graphic, Inline graphic, Inline graphic and Inline graphic. Truncated uniform priors are used for these four parameters: the truncation boundaries for Inline graphic and Inline graphic are (-50, 50), and those for Inline graphic and Inline graphic are (0, 100). With all the above assumptions and uninformative priors [41], we can get the conditional distributions of the variables:

graphic file with name DmEquation10.gif

Therefore, the full posterior distribution Inline graphic is proportional to

graphic file with name DmEquation11.gif

All the parameters of our method are iteratively updated from their corresponding posterior distributions by the Metropolis-within-Gibbs algorithm. The algorithm is summarized as follows:

  1. Initialize all the parameters Inline graphic,Inline graphic,Inline graphic,Inline graphic,Inline graphic,Inline graphic,Inline graphic, Inline graphic,Inline graphic,Inline graphic,Inline graphic,Inline graphic, and compute Inline graphic.

  2. Sample
    graphic file with name DmEquation12.gif
  3. Sample
    graphic file with name DmEquation13.gif
  4. Sample
    graphic file with name DmEquation14.gif
  5. graphic file with name DmEquation15.gif
    Sample
    graphic file with name DmEquation16.gif
  6. graphic file with name DmEquation17.gif
    Sample
    graphic file with name DmEquation18.gif
  7. graphic file with name DmEquation19.gif
    Sample
    graphic file with name DmEquation21.gif
  8. graphic file with name DmEquation22.gif
    Sample
    graphic file with name DmEquation23.gif
  9. graphic file with name DmEquation24.gif
    Sample
    graphic file with name DmEquation25.gif
  10. Sample
    graphic file with name DmEquation26.gif
  11. Sample
    graphic file with name DmEquation27.gif
  12. Compute Inline graphic

  13. Sample
    graphic file with name DmEquation28.gif
  14. Sample
    graphic file with name DmEquation29.gif
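The general pattern behind the steps above, updating one block of parameters at a time with a Metropolis step while holding the others fixed, can be sketched on a deliberately simple toy model. The example below is not the BaySub sampler: it assumes Gaussian data with an unknown mean and variance, reuses the truncation boundaries (-50, 50) and (0, 100) from A.1 as truncated uniform priors, and uses symmetric random-walk proposals. All names, step sizes and the simulated data are illustrative assumptions.

```python
import math
import random


def log_posterior(mu, sigma2, data,
                  mu_bounds=(-50.0, 50.0), var_bounds=(0.0, 100.0)):
    """Unnormalized log posterior: Gaussian likelihood, truncated uniform priors."""
    if not (mu_bounds[0] < mu < mu_bounds[1]):
        return -math.inf
    if not (var_bounds[0] < sigma2 < var_bounds[1]):
        return -math.inf
    n = len(data)
    sq_err = sum((y - mu) ** 2 for y in data)
    return -0.5 * n * math.log(sigma2) - 0.5 * sq_err / sigma2


def accept(log_ratio, rng):
    """Metropolis acceptance test for a symmetric proposal."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def metropolis_within_gibbs(data, n_iter=3000, step_mu=0.5, step_var=0.5, seed=0):
    rng = random.Random(seed)
    mu, sigma2 = 0.0, 1.0  # starting values inside the prior support
    samples = []
    for _ in range(n_iter):
        # Block 1: update mu with sigma2 held fixed.
        proposal = mu + rng.gauss(0.0, step_mu)
        log_ratio = (log_posterior(proposal, sigma2, data)
                     - log_posterior(mu, sigma2, data))
        if accept(log_ratio, rng):
            mu = proposal
        # Block 2: update sigma2 with the (possibly new) mu held fixed.
        proposal = sigma2 + rng.gauss(0.0, step_var)
        log_ratio = (log_posterior(mu, proposal, data)
                     - log_posterior(mu, sigma2, data))
        if accept(log_ratio, rng):
            sigma2 = proposal
        samples.append((mu, sigma2))
    return samples


if __name__ == "__main__":
    data_rng = random.Random(1)
    data = [data_rng.gauss(3.0, 2.0) for _ in range(200)]  # simulated toy data
    draws = metropolis_within_gibbs(data)[500:]  # drop burn-in
    print(sum(m for m, _ in draws) / len(draws))
    print(sum(v for _, v in draws) / len(draws))
```

Proposals that fall outside the truncation boundaries receive a log posterior of minus infinity and are always rejected, which is how the truncated uniform priors enter the sampler. In the full model, blocks with a closed-form conditional distribution can be drawn directly instead of through a Metropolis step.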

A.2 Sensitivity analysis

We conduct a sensitivity analysis to investigate the influence of the truncated uniform priors. The truncation boundaries for Inline graphic and Inline graphic are set to (-50, 50), (-100, 100) and (-200, 200), respectively, and those for Inline graphic and Inline graphic are set to (0, 100), (0, 200) and (0, 400). The experimental results show that the choice of truncation boundaries has little effect on the performance of BaySub.
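A short argument suggests why this insensitivity is to be expected, under the assumption that the likelihood is concentrated well inside the narrowest interval. For a single parameter θ with a uniform prior on (a, b), all other parameters held fixed and L denoting the likelihood as a function of θ (generic symbols introduced here for illustration), the conditional posterior is

```latex
p(\theta \mid y) \;=\; \frac{L(\theta)\,\mathbf{1}\{a < \theta < b\}}{\int_a^b L(t)\,\mathrm{d}t},
\qquad
\int_a^b L(t)\,\mathrm{d}t \;\approx\; \int_{a'}^{b'} L(t)\,\mathrm{d}t
\quad \text{whenever } L \text{ is negligible outside } (a, b) \subset (a', b').
```

Widening the interval from (a, b) to (a', b') therefore changes the posterior only through the likelihood mass in the added region, which is negligible when the data are informative.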

A.3 Comparison on computational cost

For Simulation 1 to Simulation 3 in this paper, the K-means, K-medoids and hierarchical clustering methods take only about 1 min, whereas BaySub takes about 10 iterations (about 90 s) to converge and 200 iterations (about 28 min) to collect posterior samples, according to Figure 5. Performing Bayesian inference by MCMC is time-consuming for large datasets and high-dimensional models. However, considering that the datasets are analyzed offline and that MCMC methods can provide the full picture of the posterior distribution, this cost is acceptable.

Author Biographies

Yetian Fan is a lecturer at School of Mathematics and Statistics, Liaoning University, and a postdoctoral fellow at Department of Statistics, The Chinese University of Hong Kong.

April S Chan is a research associate at Department of Pathology, School of Clinical Medicine, The University of Hong Kong.

Jun Zhu is a professor at Icahn School of Medicine at Mount Sinai and Sema4.

Suet Yi Leung is a professor at Department of Pathology, School of Clinical Medicine, The University of Hong Kong.

Xiaodan Fan is a professor at Department of Statistics, The Chinese University of Hong Kong.

Contributor Information

Yetian Fan, School of Mathematics and Statistics, Liaoning University, Shenyang, 110036, China; Department of Statistics, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong SAR, China.

April S Chan, Department of Pathology, School of Clinical Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China.

Jun Zhu, Sema4, Stamford, CT, 06902, USA; Icahn School of Medicine at Mount Sinai, New York, NY, USA.

Suet Yi Leung, Department of Pathology, School of Clinical Medicine, The University of Hong Kong, Pokfulam, Hong Kong SAR, China.

Xiaodan Fan, Department of Statistics, The Chinese University of Hong Kong, Sha Tin, New Territories, Hong Kong SAR, China.

References

  • 1. Bailey P, Chang DK, Nones J, et al. Genomic analyses identify molecular subtypes of pancreatic cancer. Nature 2016;531(7592):47–52.
  • 2. Felipe De Sousa EM, Wang X, Jansen M, et al. Poor-prognosis colon cancer is defined by a molecularly distinct subtype and develops from serrated precursor lesions. Nat Med 2013;19(5):614–8.
  • 3. Yu F, Quan F, Xu J, et al. Breast cancer prognosis signature: linking risk stratification to disease subtypes. Brief Bioinform 2019;20(6):2130–40.
  • 4. Dai X, Li T, Bai Z, et al. Breast cancer intrinsic subtype classification, clinical use and future trends. Am J Cancer Res 2015;5(10):2929.
  • 5. Arnold M, Soerjomataram I, Ferlay J, et al. Global incidence of oesophageal cancer by histological subtype in 2012. Gut 2015;64(3):381–7.
  • 6. Shen R, Olshen AB, Ladanyi M. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics 2009;25(22):2906–12.
  • 7. Blenkiron C, Goldstein LD, Thorne NP, et al. MicroRNA expression profiling of human breast cancer identifies new markers of tumor subtype. Genome Biol 2007;8(10):1–16.
  • 8. Vadapalli S, Abdelhalim H, Zeeshan S, et al. Artificial intelligence and machine learning approaches using gene expression and variant data for personalized medicine. Brief Bioinform 2022;23(5):bbac191.
  • 9. You JS, Jones PA. Cancer genetics and epigenetics: two sides of the same coin? Cancer Cell 2012;22(1):9–20.
  • 10. Wu CT, Morris JR. Genes, genetics, and epigenetics: a correspondence. Science 2001;293(5532):1103–5.
  • 11. Banerji S, Cibulskis K, Rangel-Escareno C, et al. Sequence analysis of mutations and translocations across breast cancer subtypes. Nature 2012;486(7403):405–9.
  • 12. Fang R, Yang H, Gao Y, et al. Gene-based mediation analysis in epigenetic studies. Brief Bioinform 2021;22(3):bbaa113.
  • 13. Holm K, Hegardt C, Staaf J, et al. Molecular subtypes of breast cancer are associated with characteristic DNA methylation patterns. Breast Cancer Res 2010;12(3):1–16.
  • 14. Heyn H, Esteller M. DNA methylation profiling in the clinic: applications and challenges. Nat Rev Genet 2012;13(10):679–92.
  • 15. Zhang C, Zhao N, Zhang X, et al. Survivalmeth: a web server to investigate the effect of DNA methylation-related functional elements on prognosis. Brief Bioinform 2021;22(3):bbaa162.
  • 16. Robertson KD. DNA methylation and human disease. Nat Rev Genet 2005;6(8):597–610.
  • 17. Gopalakrishnan S, Van Emburgh BO, Robertson KD. DNA methylation in development and human disease. Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis 2008;647(1–2):30–8.
  • 18. Jin Z, Liu Y. DNA methylation in human diseases. Genes & Diseases 2018;5(1):1–8.
  • 19. Yang X, Gao L, Zhang S. Comparative pan-cancer DNA methylation analysis reveals cancer common and specific patterns. Brief Bioinform 2017;18(5):761–73.
  • 20. Kulis M, Esteller M. DNA methylation and cancer. Adv Genet 2010;70:27–56.
  • 21. Davis CD, Uthus EO. DNA methylation, cancer susceptibility, and nutrient interactions. Exp Biol Med 2004;229(10):988–95.
  • 22. Kawano H, Saeki H, Kitao H, et al. Chromosomal instability associated with global DNA hypomethylation is associated with the initiation and progression of esophageal squamous cell carcinoma. Ann Surg Oncol 2014;21(4):696–702.
  • 23. Hansen KD, Timp W, Bravo HC, et al. Increased methylation variation in epigenetic domains across cancer types. Nat Genet 2011;43(8):768.
  • 24. Virmani AK, Tsou JA, Siegmund KD, et al. Hierarchical clustering of lung cancer cell lines using DNA methylation markers. Cancer Epidemiology and Prevention Biomarkers 2002;11(3):291–7.
  • 25. Van der Auwera I, Yu W, Suo L, et al. Array-based DNA methylation profiling for breast cancer subtype discrimination. PloS One 2010;5(9):e12616.
  • 26. Rhee JK, Kim K, Chae H, et al. Integrated analysis of genome-wide DNA methylation and gene expression profiles in molecular subtypes of breast cancer. Nucleic Acids Res 2013;41(18):8464–74.
  • 27. Wang L, Sebra RP, Sfakianos JP, et al. A reference profile-free deconvolution method to infer cancer cell-intrinsic subtypes and tumor-type-specific stromal profiles. Genome Med 2020;12(1):1–22.
  • 28. Zhang W, Feng H, Wu H, et al. Accounting for tumor purity improves cancer subtype classification from DNA methylation data. Bioinformatics 2017;33(17):2651–7.
  • 29. Siegmund KD, Laird PW, Laird-Offringa IA. A comparison of cluster analysis methods using DNA methylation data. Bioinformatics 2004;20(12):1896–904.
  • 30. Reilly B, Tanaka TN, Diep D, et al. DNA methylation identifies genetically and prognostically distinct subtypes of myelodysplastic syndromes. Blood Adv 2019;3(19):2845–58.
  • 31. Mishra NK, Guda C. Genome-wide DNA methylation analysis reveals molecular subtypes of pancreatic cancer. Oncotarget 2017;8(17):28990.
  • 32. Zhao L, Lee VH, Ng MK, et al. Molecular subtyping of cancer: current status and moving toward clinical applications. Brief Bioinform 2019;20(2):572–84.
  • 33. Du P, Zhang X, Huang CC, et al. Comparison of beta-value and m-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 2010;11(1):1–9.
  • 34. Hubert L, Arabie P. Comparing partitions. Journal of Classification 1985;2(1):193–218.
  • 35. Strehl A, Ghosh J. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research 2002;3(Dec):583–617.
  • 36. Wang K, Yuen ST, Xu J, et al. Whole-genome sequencing and comprehensive molecular profiling identify new driver mutations in gastric cancer. Nat Genet 2014;46(6):573–82.
  • 37. Hartigan JA, Wong MA. Algorithm AS 136: a K-means clustering algorithm. J R Stat Soc Ser C Appl Stat 1979;28(1):100–8.
  • 38. Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, Vol. 344. Hoboken, New Jersey: John Wiley & Sons, 2009.
  • 39. Xu R, Wunsch D. Survey of clustering algorithms. IEEE Trans Neural Netw 2005;16(3):645–78.
  • 40. Lee DD, Seung HS. Learning the parts of objects by non-negative matrix factorization. Nature 1999;401(6755):788–91.
  • 41. Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal 2006;1(3):515–34.

Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
