Abstract
DNA methylation datasets in cancer studies are comprised of measurements on a large number of genomic locations called cytosine-phosphate-guanine (CpG) sites with complex correlation structures. A fundamental goal of these studies is the development of statistical techniques that can identify disease genomic signatures across multiple patient groups defined by different experimental or biological conditions. We propose BayesDiff, a nonparametric Bayesian approach for differential analysis relying on a novel class of first order mixture models called the Sticky Pitman-Yor process or two-restaurant two-cuisine franchise (2R2CF). The BayesDiff methodology flexibly utilizes information from all CpG sites or biomarker probes, adaptively accommodates any serial dependence due to the widely varying inter-probe distances, and makes posterior inferences about the differential genomic signature of patient groups. Using simulation studies, we demonstrate the effectiveness of the BayesDiff procedure relative to existing statistical techniques for differential DNA methylation. The methodology is applied to analyze a gastrointestinal (GI) cancer dataset exhibiting serial correlation and complex interaction patterns. The results support and complement known aspects of DNA methylation and gene association in upper GI cancers.
Keywords: 2R2CF, First order models, Mixture models, Sticky Pitman-Yor process, Two-restaurant two-cuisine franchise
1. Introduction
Recent advances in array-based and next-generation sequencing (NGS) technologies have revolutionized biomedical research, especially in cancer. The rapid decline in the cost of genome technologies has facilitated the availability of datasets involving intrinsically different sizes and scales of high-throughput data and provided genome-wide, high resolution information about the biology of cancer. A common analytical goal is the identification of differential genomic signatures between groups of samples corresponding to different treatments or biological conditions, e.g., treatment arms, response to adjuvant chemotherapy, tumor subtypes, or cancer stages. Challenges include the high dimensionality of genomic biomarkers or probes, usually in the hundreds of thousands, and the relatively small number of patient samples, usually no more than a few hundred. This “small $n$, large $p$” setting results in unstable inferences due to collinearity. Further, the data exhibit complex interaction patterns, such as signaling or functional pathway-based interactions, and location-based serial correlation for high-throughput sequencing data. These characteristics challenge statistical techniques for detecting differential signatures.
Differential DNA methylation in cancer studies
DNA methylation is an epigenetic mechanism that involves the addition of a methyl (CH3) group to DNA, resulting in the modification of gene functions. It typically occurs at genomic locations called cytosine-phosphate-guanine (CpG) sites. Alterations in DNA methylation, e.g., hypomethylation of oncogenes and hypermethylation of tumor suppressor genes, are often associated with the development and progression of cancer (Feinberg and Tycko, 2004). It was previously believed that these alterations occur almost exclusively at promoter regions known as CpG islands, i.e., chromosomal regions with high concentrations of CpG sites. However, with the advent of high-throughput technologies, it has been shown that a significant proportion of cancer-related alterations do not occur in promoters or CpG islands (Irizarry et al., 2009), prompting higher resolution, epigenome-wide investigations.
Gastrointestinal (GI) cancer, the most common form of cancer in the U.S. (Siegel et al., 2017), refers to malignant conditions affecting the digestive system associated with epigenetic alterations (Vedeld et al., 2017). Molecular characterization of different cancer types, facilitated by the identification of differentially methylated CpG sites, is therefore key to better understanding GI cancer. In the motivating application, we analyze methylation profiles publicly available from The Cancer Genome Atlas (TCGA) project and comprising 1,224 tumor samples belonging to four GI cancers of the upper digestive tract: stomach adenocarcinoma (STAD), liver hepatocellular carcinoma (LIHC), esophageal carcinoma (ESCA) and pancreatic adenocarcinoma (PAAD). For 485,577 probes, where each probe is mapped to a CpG site, DNA methylation levels or Beta-values ranging from 0 (no methylation) to 1 (full methylation) were measured using the Illumina Human Methylation 450 platform.
Figure 1 displays the methylation levels for CpG sites near TP53, a tumor suppressor gene located on chromosome 17. A random subset of the tumor samples was chosen to facilitate an informal visual evaluation. Each plotted point represents the methylation level of a tumor sample at a CpG site. As indicated in the figure legend, the four sets of colors and shapes of the points represent the four upper GI cancers. The vertical dashed lines indicate the boundaries of the TP53 gene. Although differential methylation is clearly visible at some CpG sites, the differences are generally subtle, demonstrating the need for sophisticated statistical analyses. An obvious feature is the correlation of the apparent methylation statuses of nearby CpG sites (Eckhardt et al., 2006; Irizarry et al., 2008; Leek et al., 2010). The dependence of proximal CpG sites is also seen in Figure 6 of Supplementary Material, where tests for serial correlation are highly significant. Furthermore, the highly variable inter-probe spacings in Figure 1 suggest the need to model distance-based dependencies.
Figure 1: Methylation levels of CpG sites near gene TP53 for randomly chosen tumor samples of the TCGA upper GI dataset. Each plotted point represents the methylation level at a CpG site, with the shapes and colors corresponding to the different GI cancers indicated in the legend. The vertical dashed lines demarcate the TP53 gene boundaries.
Existing statistical approaches for differential DNA methylation and limitations
Numerous frequentist and Bayesian methods have been developed for differential DNA methylation, and can be broadly classified into four categories: (i) Testing-based methods, such as Illumina Methylation Analyzer (IMA) (Wang et al., 2012), City of Hope CpG Island Analysis Pipeline (COHCAP) (Warden et al., 2013), and BSmooth (Hansen et al., 2012). These methods rely on two-sample or multiple-sample tests for the mean group differences at each CpG site. (ii) Regression-based models, such as methylKit (Akalin et al., 2012), bump hunting (Jaffe et al., 2012), BiSeq (Hebestreit et al., 2013), and RADMeth (Dolzhenko and Smith, 2014). After applying smoothing or other adjustments, these methods fit individual regression models for each CpG site and test for significance. (iii) Beta-binomial model-based methods, such as MOABS (Sun et al., 2014), DSS (Feng et al., 2014), and methylSig (Park et al., 2014). These methods fit separate models to each CpG site. (iv) Hidden Markov models (HMMs), such as MethPipe (Song et al., 2013), Bisulfighter (Saito et al., 2014), and HMM-DM (Yu and Sun, 2016). These methods detect differentially methylated sites using inferred hidden states.
The aforementioned methods have several deficiencies. Because they fit separate models to each probe, most methods ignore the strong correlations between neighboring probes, reducing detection power. Additionally, beta-binomial, HMM, and most testing-based methods are able to accommodate only two treatments and rely on inefficient adjustments for multiple treatments. The methods that account for serial dependence (e.g., HMMs) do not adjust for the widely varying inter-probe distances, and instead, assume uniform inter-probe dependencies. The few methods that account for inter-probe distances (e.g., Hansen et al., 2012; Jaffe et al., 2012; Hebestreit et al., 2013) rely on ad hoc parameter-tuning procedures that do not adjust for the data characteristics.
Motivated by these challenges, we propose a general and flexible methodology for differential analysis in DNA methylation data, referred to as BayesDiff. Rather than fitting a separate model to each CpG site or probe, BayesDiff relies on a common analytical framework for simultaneous inferences that adapts to the unique data attributes. To diminish collinearity effects and achieve dimension reduction, the probes are allocated to a smaller, unknown number of latent clusters based on the similarities of probe-specific multivariate parameters. Posterior inferences are made on differential state variables to delineate the disease genomic signature of multiple treatments.
For realistically modeling the probe-cluster allocation mechanism of DNA methylation profiles, we devise an extension of Pitman-Yor processes (PYPs) (Perman et al., 1992) called the Sticky PYP (equivalently, the two-restaurant two-cuisine franchise). In addition to accounting for long-range biological interactions, this nonparametric process accommodates distance-based serial dependencies. Separately for the differential and non-differential probes, the data flexibly direct the choice between PYPs, and their special case, Dirichlet processes, in finding the best-fitting allocation schemes.
We implement an inferential procedure for Sticky PYPs using a Markov chain Monte Carlo (MCMC) algorithm specifically designed for posterior inferences in the typically large methylation datasets. Simulation results show that our approach significantly outperforms existing methods for multigroup comparisons in data with or without serial correlation. For the motivating TCGA dataset, in addition to confirming known features of DNA methylation and disease-gene associations, the analysis reveals interesting aspects of the biological mechanisms of upper GI cancers.
The rest of the paper is organized as follows. Section 2 describes the BayesDiff approach, with Section 2.1 introducing the Sticky PYP or two-restaurant two-cuisine franchise (2R2CF) for differential DNA methylation. Section 3 outlines an effective inference procedure for detecting differential probes. Section 4 uses artificial datasets with varying noise and correlation levels to assess the accuracy of BayesDiff in detecting disease genomic signatures and compares BayesDiff with established techniques for DNA methylation data. The motivating upper GI dataset is analyzed in Section 5. Finally, conclusions and future work are discussed in Section 6.
2. The BayesDiff Model
Sequencing technologies measure DNA methylation levels of $p$ biomarkers represented by CpG sites (“probes”) and $n$ matched patient or tissue samples (“individuals”). Usually, $p$ is much larger than $n$. The methylation levels, which belong to the interval [0, 1], are arranged in an $n \times p$ matrix of proportions, $X = ((x_{ij}))$, for individuals $i = 1, \ldots, n$ and probes $j = 1, \ldots, p$, with the probes sequentially indexed by their genomic locations. The distances between adjacent probes are denoted by $e_1, \ldots, e_{p-1}$, and typically exhibit high variability. For instance, in the upper GI TCGA dataset, the inter-probe distances range from 2 base pairs to a million base pairs; a base pair is a unit of DNA length consisting of two nucleobases bound to each other by hydrogen bonds (e.g., Baker et al., 2008).
Each individual $i$ is associated with a known experimental or biological condition (“treatment”) denoted by $t_i$ and taking values in $\{1, \ldots, T\}$ with $T \ge 2$. In the motivating TCGA data, there are $T = 4$ upper GI cancer types. We model the logit transformation of the methylation levels, $z_{ij} = \log\{x_{ij}/(1 - x_{ij})\}$, as follows:
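$$z_{ij} \sim N\left(\xi_i + \chi_j + \theta_{t_i j},\ \sigma^2\right), \tag{2.1}$$
where $z_{ij}$ is the logit methylation level of individual $i$ at probe $j$, $t_i$ is the treatment of individual $i$, $\xi_i$ represents the $i$th subject’s random effect, $\chi_j$ represents the $j$th probe’s random effect, $\theta_{tj}$ is the interaction random effect of treatment $t$ and probe $j$, and $\sigma^2$ is the error variance; refer to the directed acyclic graph (DAG) in Figure 11 of Supplementary Material. Logit methylation levels differ from M-values (Du et al., 2010), commonly used in differential analyses, by a constant factor of $\log_e 2$; however, the key results are identical for both transformations.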
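The relation between the two transformations can be checked numerically. The sketch below assumes the usual definition of an M-value as the base-2 log ratio of methylated to unmethylated proportion (Du et al., 2010); the function names are illustrative only.

```python
import math

def logit(beta):
    """Natural-log logit of a methylation proportion beta in (0, 1)."""
    return math.log(beta / (1.0 - beta))

def m_value(beta):
    """M-value, assumed here to be the base-2 log ratio beta / (1 - beta)."""
    return math.log2(beta / (1.0 - beta))

# For any beta in (0, 1), logit(beta) = log(2) * m_value(beta), so the two
# scales differ only by a constant multiplicative factor.
for beta in (0.05, 0.3, 0.8):
    assert abs(logit(beta) - math.log(2.0) * m_value(beta)) < 1e-12
```

Because the factor is constant, any statement about equality of treatment effects on one scale carries over to the other.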
The main inferential goal is the detection of differential probes, i.e., probes $j$ for which the vector $\theta_j = (\theta_{1j}, \ldots, \theta_{Tj})'$ does not have all identical elements. Consequently, we define a binary differential state variable, $s_j$, with $s_j = 1$ indicating that probe $j$ is not differential and $s_j = 2$ indicating that it is differential:
$$s_j = \begin{cases} 1 & \text{if } \theta_{1j} = \theta_{2j} = \cdots = \theta_{Tj}, \\ 2 & \text{otherwise,} \end{cases} \tag{2.2}$$
for $j = 1, \ldots, p$. The parameters of interest are $s_1, \ldots, s_p$, with the differential genomic signature consisting of probes with $s_j = 2$. Figure 11 of Supplementary Material positions the differential state variables in the BayesDiff model hierarchy. Motivated by the distance-dependent correlations of DNA methylation data and the deficiencies of existing statistical approaches, this paper develops a Bayesian nonparametric framework for random effects underlying the differential state variables.
Modeling probe clusters
In addition to high dimensionality, the analytical challenges include pervasive collinearity caused by dependencies between physically proximal probes. Additionally, non-adjacent probes may have long-range dependencies due to biological interactions, e.g., signaling or functional pathways. To accommodate these complex dependence structures and extract information from the large number of probes, we allocate the $p$ probes to a much smaller number, $q$, of latent clusters based on the similarities of their random effects $\theta_1, \ldots, \theta_p$. We prefer clustering to dimension reduction methods such as principal components analysis (PCA): because each principal component is a linear combination of all biomarkers, PCA cannot select sparse features, i.e., probes. By contrast, the proposed approach facilitates biological interpretation by identifying CpG sites relevant to the differential genomic signatures between multiple treatments.
Suppose an allocation variable, $c_j$, assigns probe $j$ to one of $q$ latent clusters, where $q$ is unknown. The event $c_j = k$ indicates that the $j$th probe is assigned to the $k$th latent cluster, $k = 1, \ldots, q$. We assume that the $q$ clusters are associated with latent vectors, $\phi_1, \ldots, \phi_q$, with $\phi_k = (\phi_{1k}, \ldots, \phi_{Tk})'$, where the probe-specific random effects and cluster-specific latent vectors have the relation:
$$\theta_j = \phi_k \quad \text{if } c_j = k. \tag{2.3}$$
That is, all probes in a cluster are assumed to have identical random effects equal to that cluster’s latent vector. The differential state variables, defined in equation (2.2), then become a shared attribute of their parent cluster, and clusters as a whole are either differentially or non-differentially methylated. Further, if probe $j$ belongs to cluster $k$ (i.e., $c_j = k$), then the condition $\theta_{1j} = \cdots = \theta_{Tj}$ in equation (2.2) is equivalent to $\phi_{1k} = \cdots = \phi_{Tk}$, and the differential cluster indexes comprise
$$\mathcal{D} = \left\{k : \phi_{tk} \neq \phi_{t'k} \text{ for some } t \neq t',\ k = 1, \ldots, q\right\}. \tag{2.4}$$
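The link between cluster allocations, probe-specific random effects, and differential states can be illustrated with a small sketch. The array names (`phi`, `c`, `theta`, `s`) and the toy values are hypothetical, and cluster labels are 0-based here purely for indexing convenience.

```python
import numpy as np

def differential_state(vec):
    """Return 1 if all elements of vec are equal (non-differential), else 2."""
    vec = np.asarray(vec, dtype=float)
    return 1 if np.ptp(vec) == 0.0 else 2

# Hypothetical latent vectors for q = 3 clusters and T = 4 treatments.
phi = np.array([[0.3, 0.3, 0.3, 0.3],      # all equal: non-differential
                [0.1, 0.1, 0.9, 0.1],      # unequal elements: differential
                [-0.5, -0.5, -0.5, -0.5]])  # all equal: non-differential

# Hypothetical allocation of p = 6 probes to clusters (0-based labels).
c = np.array([0, 0, 1, 2, 1, 0])

# Every probe inherits its cluster's latent vector, as in relation (2.3).
theta = phi[c]

# Probe-level differential states and the set of differential clusters.
s = np.array([differential_state(row) for row in theta])
differential_clusters = {k for k in range(len(phi)) if differential_state(phi[k]) == 2}
```

In this toy example, the probes allocated to the second cluster are exactly the probes flagged as differential, which is the sense in which the differential state is a shared attribute of the parent cluster.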
Mixture models for allocation
Bayesian infinite mixture models are a natural choice for allocating probes to a smaller, unknown number of latent clusters. Dirichlet processes (Ferguson, 1973) are arguably the most frequently used infinite mixture models; see Müller and Mitra (2013, chap. 4) for a comprehensive review. The use of Dirichlet processes to achieve dimension reduction has precedent in the literature, albeit in unrelated applications (see Medvedovic et al., 2004; Kim et al., 2006; Dunson et al., 2008; Dunson and Park, 2008; Guha and Baladandayuthapani, 2016). Lijoi, Mena, and Prünster (2007a) advocated the use of Gibbs-type priors (Gnedin and Pitman, 2006) for accommodating more flexible clustering mechanisms and demonstrated the utility of Pitman-Yor processes (PYPs) in genomic applications. An overview of Gibbs-type priors and characterization of the learning mechanism is provided by De Blasi et al. (2015).
Formally, the PYP (Perman et al., 1992) relies on a discount parameter $d \in [0, 1)$, positive mass parameter $\alpha$, and $T$-variate base distribution $G_0$, and is denoted by $\mathrm{PYP}(d, \alpha, G_0)$. The value $d = 0$ yields a Dirichlet process with mass parameter $\alpha$ and base distribution $G_0$. Suppose $\theta_1, \ldots, \theta_p$ are distributed as $\mathrm{PYP}(d, \alpha, G_0)$. The stick-breaking representation of $\mathrm{PYP}(d, \alpha, G_0)$ is $\theta_j \overset{iid}{\sim} \mathcal{P}$, where random distribution $\mathcal{P}$ is the discrete mixture $\sum_{v=1}^{\infty} \pi_v \delta_{\phi_v}$, with $\delta_{\phi_v}$ denoting a point mass located at the atom $\phi_v \overset{iid}{\sim} G_0$. The random stick-breaking probabilities have the form $\pi_1 = V_1$, and $\pi_h = V_h \prod_{v=1}^{h-1} (1 - V_v)$ for $h > 1$, where $V_h \overset{indep}{\sim} \text{beta}(1 - d, \alpha + hd)$.
Guha and Baladandayuthapani (2016) introduced VariScan, a technique that utilizes PYPs and Dirichlet processes for clustering, variable selection, and prediction in high-dimensional regression problems in general, and in gene expression datasets in particular. They also demonstrated that PYPs are overwhelmingly favored over Dirichlet processes in gene expression datasets, which typically exhibit no serial correlation.
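As a rough illustration of the stick-breaking construction, the following sketch draws a truncated set of Pitman-Yor weights using the standard recipe in which the $h$th stick fraction is Beta$(1 - d, \alpha + hd)$. The truncation level, parameter values, and function name are assumptions made for the example and are not part of the BayesDiff inference procedure.

```python
import numpy as np

def pyp_stick_breaking(d, alpha, n_atoms, rng):
    """Draw the first n_atoms stick-breaking weights of a Pitman-Yor process.

    d is the discount parameter in [0, 1), alpha the positive mass parameter.
    Setting d = 0 gives the Dirichlet process special case.
    """
    h = np.arange(1, n_atoms + 1)
    v = rng.beta(1.0 - d, alpha + h * d)            # stick fractions V_h
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                            # pi_h = V_h * prod_{v<h}(1 - V_v)

rng = np.random.default_rng(0)
weights_dp = pyp_stick_breaking(d=0.0, alpha=20.0, n_atoms=500, rng=rng)
weights_pyp = pyp_stick_breaking(d=0.33, alpha=20.0, n_atoms=500, rng=rng)

# The truncated weights are non-negative and sum to at most one; the missing
# mass belongs to the atoms beyond the truncation level.
leftover_dp = 1.0 - weights_dp.sum()
leftover_pyp = 1.0 - weights_pyp.sum()
```

With a positive discount, the weights decay more slowly, so more mass is typically left beyond a fixed truncation level; this reflects the heavier tail of small clusters that PYPs can accommodate.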
Limitations of existing mixture models
Although the aforementioned mixture models achieve dimension reduction and account for long-range biological interactions between non-adjacent probes, a potential drawback is their implicit assumption of a priori exchangeability of the probes. Consequently, these techniques cannot accommodate serial correlation in methylation data. Infinite HMMs, such as the hierarchical Dirichlet process hidden Markov model (HDP-HMM) (Teh et al., 2006) and Sticky HDP-HMM (Fox et al., 2011), could be utilized to fill this gap. Although these models are a step in the right direction, they have several undesirable features for differential analysis. First, the degree of first order dependence is uniform irrespective of the inter-probe distances. This is unrealistic in methylation datasets where the correlation typically decreases with inter-probe distance (Hansen et al., 2012; Jaffe et al., 2012; Hebestreit et al., 2013). Second, an ad hoc exploratory analysis of the GI cancer dataset reveals that the serial correlation in the treatment-probe effects is weaker than the serial dependence between the differential state variables in equation (2.2). Although there may not be a biological explanation for this phenomenon, this makes sense from a statistical perspective because the differential states are binary functions of the treatment-probe interactions; the differential states are more sensitive in detecting first order dependence even when the higher-dimensional (and noisier) treatment-probe interactions show negligible correlation. This suggests that a hypothetical two-group Markov model, rather than an infinite-group Markov model such as HDP-HMM or Sticky HDP-HMM, would provide a better fit for the data. Third, the range of allocation patterns supported by infinite HMMs is relatively limited. 
In particular, realistic allocation patterns, such as power law decays in the cluster sizes and large numbers of small-sized clusters, a common feature of cancer datasets (Lijoi et al., 2007b), are assigned relatively small prior probabilities by infinite HMMs.
2.1. Sticky PYP: A Two-restaurant, Two-cuisine Franchise (2R2CF) for Differential Analysis
The proposed Sticky PYP comprises a cohort of regular PYPs producing the probe-specific random effects by switching the generative PYP at random locations along the probe sequence. Alternatively, the Chinese restaurant franchise (CRF) metaphor for HDP-HMMs and Sticky HDP-HMMs can be generalized to the two-restaurant two-cuisine franchise (2R2CF) to give an equivalent representation of Sticky PYPs appropriate for differential analysis. We first present a descriptive overview of 2R2CF.
Imagine a franchise with two restaurants labeled 1 and 2. Each restaurant consists of two sections, labeled section 1 and 2. Each section serves a single cuisine and the section-cuisine menu consists of infinite dishes. Section 1 of both restaurants exclusively serves cuisine 1. The cuisine 1 menus of restaurant 1 and 2, along with the selection probabilities of the dishes, are identical. Similarly, section 2 of both restaurants exclusively serves cuisine 2, and the cuisine 2 menus of the two restaurants are identical.
A succession of customers, representing the CpG sites or probes, arrive at the franchise. The waiting times between successive customers correspond to the inter-probe distances, . Each customer first selects a restaurant and then a section (equivalently, cuisine) in that restaurant. Each restaurant section has an infinite number of tables, and a customer either sits at a table already occupied by the previous customers or sits at a new table. All customers at a table are served the same dish chosen from the section-cuisine menu by the first customer who sat at that table. The first customer at a table independently picks a dish from the infinite cuisine menu with a cuisine-specific probability associated with each dish. As a result, multiple tables at a restaurant section may serve the same dish.
Restaurant 1 specializes in cuisine 1. Consequently, section 1 is more popular with the restaurant 1 patrons. Similarly, restaurant 2 specializes in cuisine 2, and so, restaurant 2 customers tend to favor section 2 over section 1. By design, if a customer has eaten a cuisine 1 (2) dish, then the next customer is more likely to visit restaurant 1 (2), where cuisine 1 (2) is more popular. In this manner, each customer tends to select the same cuisine as the previous customer.
In the metaphor, cuisine 1 symbolizes the non-differential state and cuisine 2 symbolizes the differential state. The dish that franchise customer $j$ eats represents the probe-specific random effect, $\theta_j$. Since cuisine 1 represents the non-differential state, its dishes are characterized by $T$-variate random vectors with all equal elements; see equation (2.2). In contrast, cuisine 2 (differential state) dishes are characterized by $T$-variate random vectors with at least two unequal elements.
The dependence in the restaurant and cuisine choices of consecutive customers accounts for the long runs of differential or non-differential states seen in DNA methylation data. However, a customer’s influence on the next customer diminishes as the time interval between the two customers increases; the differential statuses of two adjacent probes are statistically independent in the limit as the inter-probe distance grows.
The 2R2CF process is illustrated in Figure 2 and discussed below in greater detail. The following specification conditions on $W$, an unknown distribution in $\mathbb{R}$ that is assigned a Dirichlet process prior with mass parameter $\beta$ and univariate normal base distribution, $N(\mu, \tau^2)$. The stick-breaking representation of the Dirichlet process implies that distribution $W$ is almost surely discrete because it has the infinite mixture distribution:
$$W = \sum_{v=1}^{\infty} \varpi_v\, \delta_{\zeta_v}, \qquad \zeta_v \overset{iid}{\sim} N(\mu, \tau^2). \tag{2.5}$$
Figure 2: Cartoon representation of the two-restaurant two-cuisine franchise for differential analysis, showing the progressive choice of restaurant, cuisine section, and table by customer $j$, for $j > 1$. The numbered circles represent table numbers. See the text in Section 2.1 for a detailed description of the 2R2CF process.
The distribution of the random probabilities, $\varpi_1, \varpi_2, \ldots$, which depend on mass parameter $\beta$, was derived in Sethuraman (1994); see also Ishwaran and James (2003) and Lijoi and Prünster (2010). In the sequel, we condition on distribution $W$; equivalently, on the probabilities, $\varpi_v$, and univariate atoms, $\zeta_v$, for $v \in \mathbb{N}$, the natural numbers.
Cuisine 1 menu.
Recall that cuisine 1 represents the non-differential state, for which the $T$-variate random vectors (i.e., dishes in the metaphor) have all equal elements. The cuisine 1 menu, with its countably infinite dishes and their associated probabilities, is modeled as a discrete menu distribution, $W_1$, in $\mathbb{R}^T$. With $\mathbf{1}_T$ denoting the column vector of $T$ ones, let
$$\psi \mid W \sim W. \tag{2.6}$$
Cuisine 1 menu distribution $W_1$ is defined as the law of random vector $\psi \mathbf{1}_T$. Then $\{\zeta_v \mathbf{1}_T : v \in \mathbb{N}\}$ represents the available cuisine 1 dishes and the support of $W_1$, where $\zeta_v$ and $\varpi_v$ are the atoms and probabilities of distribution $W$ in equation (2.5). The selection probability associated with dish $\zeta_v \mathbf{1}_T$ is $\varpi_v$.
The continuity of the base distribution in equation (2.5) guarantees that the menu dishes are unique. On the other hand, the discreteness of distribution $W$ has practical implications for 2R2CF: (a) cuisine 1 consists of discrete dishes, as required, rather than a continuous spectrum, and (b) since section 1 at both restaurants serves the same menu, two section 1 customers may eat the same dish even if they select different restaurants.
Cuisine 2 menu.
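As mentioned, cuisine 2 depicts the differential state and its dishes represent $T$-variate random vectors with at least two unequal elements. Its menu comprises countably infinite cuisine 2 dishes along with associated probabilities. The menu is therefore modeled by a $T$-variate distribution, $W_2$, satisfying two conditions: (i) it has a countably infinite support, and (ii) each $T$-variate atom of $W_2$ has at least two unequal elements. For any given $W$, a probability mass function for $W_2$ that satisfies these two conditions is
$$W_2(\phi) = \begin{cases} \dfrac{\prod_{t=1}^{T} W(\phi_t)}{1 - \sum_{v=1}^{\infty} \varpi_v^{T}} & \text{if } \phi_t \neq \phi_{t'} \text{ for some } t \neq t', \\[2ex] 0 & \text{otherwise,} \end{cases} \tag{2.7}$$
where $\phi = (\phi_1, \ldots, \phi_T)'$ and $W(\phi_t)$ denotes the mass function of distribution $W$ evaluated at $\phi_t$. For a graphical summary of the parameters in equations (2.5)-(2.7), refer to the “Cuisine menus” block of Figure 11 of Supplementary Material. In line 1 of expression (2.7), normalizing constant $1 - \sum_{v=1}^{\infty} \varpi_v^{T}$ is the total probability that a $T$-variate random vector whose elements are i.i.d. $W$ has at least two distinct elements. Referring back to the atoms, $\zeta_v$, of distribution $W$ in (2.5), the set of vectors $(\zeta_{v_1}, \ldots, \zeta_{v_T})'$ with $v_t \neq v_{t'}$ for some $t \neq t'$ is the list of cuisine 2 dishes and the support of $W_2$. The selection probability associated with dish $(\zeta_{v_1}, \ldots, \zeta_{v_T})'$ is $\prod_{t=1}^{T} \varpi_{v_t} \big/ \left(1 - \sum_{v=1}^{\infty} \varpi_v^{T}\right)$.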
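The normalizing constant of the cuisine 2 menu is the probability that $T$ independent draws from a discrete distribution are not all equal. The sketch below verifies the closed form, one minus the sum of the $T$th powers of the atom probabilities, against brute-force enumeration. It assumes a finite toy distribution with three distinct atoms rather than the infinite mixture, and the function names are illustrative.

```python
import itertools
import numpy as np

def prob_not_all_equal(weights, T):
    """Closed form: 1 - sum_v w_v^T for T i.i.d. draws from a discrete law."""
    w = np.asarray(weights, dtype=float)
    return 1.0 - np.sum(w ** T)

def prob_not_all_equal_bruteforce(weights, T):
    """Sum the probabilities of all index tuples with at least two distinct atoms."""
    total = 0.0
    for idx in itertools.product(range(len(weights)), repeat=T):
        if len(set(idx)) > 1:
            total += float(np.prod([weights[i] for i in idx]))
    return total

toy_weights = [0.5, 0.3, 0.2]   # hypothetical atom probabilities summing to 1
T = 3                           # hypothetical number of treatments

closed_form = prob_not_all_equal(toy_weights, T)
enumerated = prob_not_all_equal_bruteforce(toy_weights, T)
assert abs(closed_form - enumerated) < 1e-12
```

The enumeration relies on the atoms being distinct, which in the model is guaranteed by the continuity of the normal base distribution.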
Restaurant, section, table, and dish choices of the 2R2CF customers
Let the restaurant chosen by franchise customer $j$ be denoted by $g_j$ and the chosen cuisine (i.e., section) be denoted by $s_j$. Suppose the customer sits at table $v_j$ in that restaurant section and eats dish $\theta_j$.
Customer 1.
At time 0, suppose the first customer selects restaurant $g_1 = 1$ with probability $\rho_1$ and selects restaurant $g_1 = 2$ with probability $\rho_2 = 1 - \rho_1$. For reasons that will become clear, we refer to $\rho_1$ as the baseline non-differential proportion and $\rho_2$ as the baseline differential proportion. Typically, the differential state is less frequent, and so $\rho_1 > \rho_2$ (i.e., $\rho_1 > 1/2$). Proportion $\rho_1$ is given a uniform prior on the interval (1/2, 1).
Choice of cuisine
Next, customer 1 selects a section in restaurant . Since each restaurant specializes in its namesake cuisine, the cuisine is more popular with the restaurant customers. This is modeled as follows. Within restaurant (with for customer 1), customer selects cuisine 1 with probability
| (2.8) |
for a speciality cuisine popularity parameter, , determining the degree to which a restaurant’s patrons favor its namesake cuisine. For instance, if is nearly 1, then a restaurant 1 (2) customer almost always (never) chooses cuisine 1. At the other extreme, if is nearly 0, then the customer chooses cuisine 1 with approximate probability irrespective of the restaurant. Parameter is assigned an independent uniform prior on the unit interval. The probability that a restaurant customer chooses cuisine 2 is then .
Choice of table and dish
Within section of restaurant , we assume without loss of generality (since the table identifiers are arbitrary) that customer 1 sits at table . At table 1, customer 1 randomly orders a cuisine dish from menu distribution . The dish he or she eats represents the random effect of the first probe. That is, . As the 2R2CF process evolves as more patrons arrive, the tables in a restaurant’s section are sequentially assigned new labels as they are occupied.
Table 1: True parameter values used to generate the artificial datasets.
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| 20 | 20 | 0.33 | 20 | 0.9 | 0.1 | 0 | 1 | 0.1225 |
Customer $j$, for $j > 1$.
The restaurant choice of a subsequent customer is influenced by the previous customer’s cuisine and waiting time. Suppose customer $j$ arrives at the franchise after a time interval of $e_{j-1}$ following the $(j-1)$th customer. Without loss of generality, the waiting times $e_1, \ldots, e_{p-1}$ can be scaled so that their total equals 1. Because the probes in a differential analysis typically represent CpG sites on a single chromosome or gene, that chromosome or gene then has a scaled length of 1.
To model the dependencies between the franchise customers, we define a non-negative dependence parameter that transforms waiting time to an affinity measure between customer and customer :
| (2.9) |
The affinity measure belongs to the unit interval when . If is defined as 0 irrespective of waiting time . The affinity influences the restaurant choice through assumption (2.10) below.
Choice of restaurant
The cuisine of the th customer influences the restaurant choice of the th customer through affinity and popularity parameter :
Specifically, the probability that customer selects restaurant 1 is assumed to be
| (2.10) |
and . The idea is illustrated in the top panel of Figure 2, where customer chooses restaurant 1 with probability and restaurant 2 with probability . If dependence parameter , then the restaurant choices of the customers are independent; specifically, irrespective of the cuisine .
It can be verified that is a probability mass function if and only if . Since the scaled waiting times are bounded above by 1, a globally sufficient condition is . We therefore assume a mixture prior for dependence parameter :
| (2.11) |
where the second mixture component involves a continuous distribution, , restricted to the interval , enforcing the globally sufficient condition. In our experience, posterior inferences on are relatively robust to the continuous prior provided the prior is not highly concentrated on a small part of interval . Refer to the “Restaurant and cuisine of customer ” block of Figure 11 of Supplementary Material for a DAG depicting the relationships of these parameters.
When , we have a zero-order Sticky PYP; when , we obtain a first order Sticky PYP. Some interesting consequences of specification (2.10) are:
Zero-order Sticky PYP: When , each customer independently chooses restaurant 1 (or 2) with a baseline probability of (or ). The customers act identically.
First order Sticky PYP with large: At large relative distances, customer acts approximately independently of the history. Somewhat similarly to customer 1, customer chooses restaurant 1 (2) with a probability approximately, but not exactly, equal to .
First order Sticky PYP with small: In the limit as (e.g., for a small inter-probe distance ), the restaurant choice of customer follows a hidden Markov model.
Since it drives the dependence characteristics of DNA methylation data, parameter is of interest. Prior specification (2.11) allows the data to direct the model order through posterior probability, , an MCMC estimate of which is readily available; see Section 3.
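The qualitative behavior described above can be mimicked with a small simulation of the restaurant and cuisine choices. The functional forms used here are assumptions chosen only to reproduce the stated properties (independence when the dependence parameter is zero, stickiness that decays with waiting time, and a namesake cuisine that is more popular); they are simplified stand-ins and should not be read as expressions (2.8)-(2.10). All function and parameter names are hypothetical.

```python
import numpy as np

def affinity(e, eta):
    """Assumed affinity: exp(-e / eta) for eta > 0, and 0 when eta == 0."""
    return 0.0 if eta == 0 else float(np.exp(-e / eta))

def prob_restaurant_1(prev_cuisine, r, rho1):
    """Assumed restaurant rule: pulled toward the previous cuisine by affinity r."""
    return rho1 + (1.0 - rho1) * r if prev_cuisine == 1 else rho1 * (1.0 - r)

def prob_cuisine_1(restaurant, gamma, rho1):
    """Assumed cuisine rule: namesake cuisine favored to degree gamma."""
    return rho1 + (1.0 - rho1) * gamma if restaurant == 1 else rho1 * (1.0 - gamma)

def simulate_states(gaps, eta, gamma, rho1, rng):
    """Simulate differential states (1 or 2) for len(gaps) + 1 customers."""
    p = len(gaps) + 1
    s = np.empty(p, dtype=int)
    g = 1 if rng.random() < rho1 else 2                      # customer 1
    s[0] = 1 if rng.random() < prob_cuisine_1(g, gamma, rho1) else 2
    for j in range(1, p):
        r = affinity(gaps[j - 1], eta)
        g = 1 if rng.random() < prob_restaurant_1(s[j - 1], r, rho1) else 2
        s[j] = 1 if rng.random() < prob_cuisine_1(g, gamma, rho1) else 2
    return s

rng = np.random.default_rng(0)
gaps = rng.dirichlet(np.ones(199))       # hypothetical scaled waiting times, sum to 1
states_zero_order = simulate_states(gaps, eta=0.0, gamma=0.9, rho1=0.9, rng=rng)
states_first_order = simulate_states(gaps, eta=0.05, gamma=0.9, rho1=0.9, rng=rng)
```

Under these assumed forms, setting the dependence parameter to zero makes every restaurant choice follow the baseline proportion, whereas a positive value makes short waiting times favor the previous customer's cuisine and so tends to produce longer runs of the same state.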
Choice of cuisine
In restaurant , customer selects cuisine-section with distribution defined in expression (2.8). For bookkeeping purposes, among franchise customers , let be the number of customers that choose section in restaurant ; that is, for .
For a graphical depiction of cuisine selection by the th customer, see the middle panel of Figure 2, where . That is, customer , having already chosen restaurant 1, now chooses a cuisine-section. Restaurant 2 has been greyed out because it is no longer accessible to this customer. In the lower panel of Figure 2, we find that the customer picked cuisine-section 1, and so .
Choice of table
Applying the above notation, among customers $1, \ldots, j-1$, suppose there are $m$ customers in the same restaurant and section as the $j$th customer. Suppose these customers have occupied tables $1, \ldots, K$, and that there are $m_k$ customers seated at the $k$th table, so that $m_1 + \cdots + m_K = m$. Let $\mathbf{m} = (m_1, \ldots, m_K)$ comprise these aggregated table occupancies, with the restaurant and section subscripts suppressed for simplicity.
Recall that the newly arrived $j$th customer may sit at any of the $K$ occupied tables or a new $(K+1)$th table. Two possibilities are illustrated in the lower panel of Figure 2. For a PYP with mass parameter $\alpha$ and cuisine-specific discount parameter $d_s$, the predictive distribution of table $v_j$ of customer $j$ is related to the table occupancies as follows:
$$P(v_j = k \mid \mathbf{m}) \propto \begin{cases} m_k - d_s & \text{if } k = 1, \ldots, K, \\ \alpha + K d_s & \text{if } k = K + 1, \end{cases} \tag{2.12}$$
where the second line corresponds to customer $j$ sitting at a new table, in which case the new number of occupied tables is $K + 1$ and table index $v_j = K + 1$. Otherwise, if customer $j$ sits at a previously occupied table, then table index $v_j \le K$ and the number of occupied tables remains unchanged at $K$. See the DAG in Figure 11 of Supplementary Material.
The above predictive distribution implies that customer is more likely to choose tables with several occupants, positively reinforcing that table’s popularity for future customers. The number of occupied tables stochastically increases with the PYP mass and discount parameter.
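The reinforcement effect can be seen directly from the standard Pitman-Yor seating rule, in which an occupied table is chosen with probability proportional to its occupancy minus the discount, and a new table with probability proportional to the mass plus the discount times the number of occupied tables. The sketch below is a minimal illustration for a single restaurant section; the function name and the example values are hypothetical.

```python
import numpy as np

def table_probabilities(occupancies, alpha, d):
    """Seating probabilities under the standard Pitman-Yor predictive rule.

    Returns a vector of length K + 1: entries 0..K-1 are the occupied tables,
    and the last entry is the probability of opening a new table.
    """
    m_k = np.asarray(occupancies, dtype=float)
    K = len(m_k)
    unnormalized = np.append(m_k - d, alpha + K * d)
    return unnormalized / (alpha + m_k.sum())

# Three occupied tables with 5, 2, and 1 customers; alpha and d are example values.
probs = table_probabilities([5, 2, 1], alpha=1.0, d=0.5)
```

In this example the most crowded table receives the largest probability, and the weights sum to one because the discounts subtracted from the occupied tables are added back to the new-table term.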
For section , if the PYP discount parameter , we obtain the well-known Pòlya urn scheme for Dirichlet processes (?). PYPs act as effective dimension reduction devices because the random number of occupied tables is much smaller than the number of customers. In general, as the number of patrons in section of restaurant grows as more customers arrive at the franchise, that is, as , the number of occupied tables, , is asymptotically equivalent to
| (2.13) |
for a positive random variable (Lijoi and Prünster, 2010). The asymptotic order of the number of occupied tables increases with discount parameter .
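The effect of the discount parameter on the number of occupied tables can be checked with a small simulation. The sketch below seats customers sequentially in a single restaurant section under the standard Pitman-Yor seating rule and counts the occupied tables. It is an illustration only: it ignores the restaurant and cuisine choices of the 2R2CF, and the function name, the argument names, and the values used (2,000 customers, mass 1, discounts 0 and 0.5) are assumptions made for the example.

```python
import random


def simulate_occupied_tables(n_customers, mass, discount, seed=0):
    """Seat customers one at a time and return the number of occupied tables."""
    rng = random.Random(seed)
    occupancies = []
    for n in range(n_customers):          # n customers are already seated
        k = len(occupancies)
        # Probability that the arriving customer opens a new table.
        p_new = 1.0 if n == 0 else (mass + k * discount) / (n + mass)
        if rng.random() < p_new:
            occupancies.append(1)
        else:
            # Existing tables are weighted by occupancy minus discount.
            weights = [n_k - discount for n_k in occupancies]
            table = rng.choices(range(k), weights=weights)[0]
            occupancies[table] += 1
    return len(occupancies)


if __name__ == "__main__":
    for d in (0.0, 0.5):
        tables = simulate_occupied_tables(2000, mass=1.0, discount=d)
        print(f"discount={d}: {tables} occupied tables")
```

In runs of this kind, the Dirichlet process case (discount 0) produces a number of tables that grows roughly logarithmically with the number of customers, whereas a positive discount produces polynomial growth. In both cases the number of tables remains far smaller than the number of customers.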
Choice of dish
As discussed, all customers seated at a given table of section are served the same dish, chosen from the cuisine menu by the first customer to sit at that table. Let denote the common dish eaten by customers at the kth table, . The dish that customer eats represents the probe-specific random effect , and
| (2.14) |
In the latter case (line 2), the dish randomly selected by customer is registered as , where , and is served to all future customers who sit at table . Assumptions (2.12) and (2.14) imply that although the restaurants serve the same menus, the overall relative popularity of each dish is restaurant-specific. Refer to the block entitled “Cuisine menus” in Figure 11 of Supplementary Material.
The aforementioned process continues for the remaining 2R2CF customers. Expressions (2.8) and (2.10) guarantee that a cuisine is more popular at its namesake restaurant and the cuisine selected by a customer influences the restaurant choice of the next customer, making the next customer likely to select the same cuisine. This accounts for the lengthy runs of differential or non-differential probes seen in methylation data. In addition to achieving dimension reduction, the proposed Sticky PYP models the serial dependencies of adjacent probes as a decreasing function of the inter-probe distances.
Latent clusters and their differential states
Latent clusters, introduced earlier in expression (2.3), comprise probes with identical random effects and form the basis of the dimension reduction strategy. Returning to the 2R2CF metaphor, we identify a cluster as the set of customers who eat the same dish. A cluster may span more than one table, however: because of the shared cuisine menu, multiple tables in both restaurants may serve the same dish. Therefore, irrespective of the restaurant, aggregating customers eating the same dish, we obtain the probe-cluster allocation variables , and the number of latent clusters, . The collection of customers eating the same cuisine 2 (differential state) dish corresponds to a distinct differential cluster in , defined in equation (2.4).
From expression (2.13), we expect the number of occupied tables in the franchise to be much smaller than the number of customers, . Furthermore, since multiple tables may serve the same dish, we expect the number of latent clusters, , to be smaller than the number of occupied tables. With high probability, this implies that is much smaller than .
PYP discount parameter .
Consider the differential state cuisine menu, , defined in (2.7). It can be shown that as the number of treatments, , and the number of probes, , increase, the differential clusters are not only asymptotically identifiable but consistently detectable in the posterior; refer to Section 4 of Guha and Baladandayuthapani (2016) for a detailed discussion of this remarkable phenomenon in standard PYP settings. Since the differential clusters can be inferred with high accuracy when and are large, discount parameter is given the mixture prior:
| (2.15) |
where corresponds to a Dirichlet process. This provides 2R2CF the posterior flexibility to choose between a Dirichlet process and a more general PYP for a suitable clustering pattern of the differential probes. An allocation pattern typical of Dirichlet processes, such as exponentially decaying cluster sizes dominated by a few large clusters, results in a high posterior probability that equals 0. By contrast, an allocation pattern characteristic of non-Dirichlet PYPs, such as slower-than-exponential power law decays in the cluster sizes and relatively large numbers of smaller-sized clusters, causes the posterior of discount parameter to concentrate near 1 and exclude 0. A proof of the intrinsically different cluster patterns of Dirichlet processes and PYPs is given in Theorem 2.1 of Guha and Baladandayuthapani (2016).
Since distribution is discrete, the atoms of -variate distribution need not all be unique. Indeed, this is common for treatments. However, as grows, and provided the number of probes, , grows at a slower-than-exponential rate as , the probability that two atoms allocated to the probes are identical rapidly decays to 0. In regression problems unrelated to differential analysis, Section 2.3 of Guha and Baladandayuthapani (2016) derived a similar result for a simpler zero-order stochastic process. We have verified this phenomenon in simulation studies on differential analysis datasets. In several hundred artificial datasets generated from the Sticky PYP, for probes and as small as four, no two allocated atoms of were identical.
PYP discount parameter .
Consider again the (non-differential) cuisine menu defined in (2.6). In general, the flexibility provided by PYP allocation patterns is not necessary for non-differential probes. This is because the allocation patterns of are driven by univariate parameter in (2.6) and mixture allocations of univariate objects are unidentifiable (e.g., Frühwirth-Schnatter, 2006). Consequently, we set PYP discount parameter , reducing the two PYPs associated with the non-differential state (i.e., section 1 in both restaurants) to Dirichlet processes.
Other model parameters
Depending on the specifics of the application, an appropriate model is assumed for the subject-specific parameters . For example, we may assume . In other applications, it may be more appropriate to assume non-zero means and flexible error distributions: , where represents lane or batch effects in methylation data, and the i.i.d. follow a random distribution with a Dirichlet process prior. Similarly, appropriate models for the probe-specific parameters may include i.i.d. zero-mean normal distributions, and finite mixtures or HMMs with state-specific normal distributions. Inverse-gamma priors are assigned to and . Suitable priors are assumed for mass parameters , and in expressions (2.6) and (2.12). Mean and variance of base distribution in expression (2.6) are given a joint normal-inverse gamma prior. The DAG in Figure 11 of Supplementary Material summarizes the complex relationships between the different model parameters.
3. Posterior Inference
Due to the analytical intractability of the BayesDiff model, we rely on MCMC methods for posterior inference and detection of differential probes.
3.1. MCMC Strategy
The model parameters are initialized using naïve estimation techniques and iteratively updated by MCMC techniques until the chain converges. We split the MCMC updates into three blocks. An outline of the MCMC procedure is as follows. Further details can be found in Section 7 of Supplementary Material.
Restaurant-cuisine-table-dish of customer : For each probe , we sample vector given the vectors of the other probes. This is achieved by proposing a new value of from a carefully constructed approximation to its full conditional, and by accepting or rejecting the move in a Metropolis-Hastings step. As discussed in Section 2.1, the probe-cluster allocations are immediately available from the restaurant-cuisine-table allocations of the probes, as are the latent clusters with their allocated probes and the set of differential clusters, .
Latent vectors : The latent vector elements are not necessarily distinct because of the Dirichlet process prior on distribution . Although the latent vector elements are known from the aforementioned block 1 updates, MCMC mixing is considerably improved by updating the latent vector elements conditional on the probe-cluster allocations. As the calculation in Supplementary Material shows, this is achieved by Gibbs sampling.
Remaining model parameters: Generated by standard MCMC techniques.
We discarded an initial burn-in of 10,000 MCMC samples and used the subsequent 50,000 draws for posterior inferences. Convergence was informally assessed by trace plots of various hyperparameters to validate the MCMC sample sizes. For the proposed moves (in discrete parameter space) described in block 1, the average Metropolis-Hastings acceptance rate exceeded 90% in all our analyses.
3.2. Detection of Differential Probes with FDR Control
Post-processing the MCMC sample, a Bayesian approach for controlling the false discovery rate (FDR) (Newton et al., 2004) is applied to detect the probes with differential state . Specifically, let be the nominal FDR level and be the posterior probability that probe is differential, so that . An empirical average estimate, , is available from the MCMC sample. To achieve the desired FDR level in calling the differential probes, we first rank all the probes in decreasing order of . Let denote the ordered posterior probability estimates. For each , we evaluate the posterior expected FDR of calling differential the first probes in the sorted sequence:
| (3.1) |
where the simplification occurs because the are sorted. Finally, we pick the largest value of , denoted by , for which . A nominal FDR level of is achieved by labeling the first probes, arranged in decreasing order of , as differential.
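The selection rule described above can be sketched in a few lines. The function below takes the estimated posterior probabilities that each probe is differential, ranks the probes in decreasing order, and returns the largest leading set whose posterior expected FDR (the average of one minus the posterior probability over the called probes) does not exceed the nominal level. The function name, the argument names, and the use of a non-strict inequality are assumptions made for this illustration.

```python
def call_differential(post_probs, nominal_fdr):
    """Return the sorted indices of probes called differential.

    post_probs  : estimated posterior probabilities of being differential
    nominal_fdr : nominal Bayesian FDR level, e.g. 0.05
    """
    # Rank probes in decreasing order of posterior probability.
    order = sorted(range(len(post_probs)),
                   key=lambda j: post_probs[j], reverse=True)
    best = 0
    expected_false = 0.0
    for b, j in enumerate(order, start=1):
        # Expected number of false discoveries among the first b probes.
        expected_false += 1.0 - post_probs[j]
        # Posterior expected FDR of calling the first b probes differential.
        if expected_false / b <= nominal_fdr:
            best = b
    return sorted(order[:best])


if __name__ == "__main__":
    probs = [0.99, 0.20, 0.95, 0.60, 0.97]
    print(call_differential(probs, nominal_fdr=0.05))
```

For the five probabilities in the usage example, the posterior expected FDR of the top one, two, three, and four probes is 0.01, 0.02, 0.03, and 0.1225, so the first three ranked probes (indices 0, 2, and 4) are called differential at a nominal level of 0.05. Because the probabilities are sorted, the expected FDR is non-decreasing in the number of probes called.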
4. Simulation Studies
Using artificial datasets with treatments, we analyzed the accuracy of BayesDiff in detecting differentially methylated probes. We compared the results with established differential methylation procedures and general statistical techniques for multiple treatment comparisons. We also evaluated the ability of the BayesDiff procedure to discover the complex dependence structures of DNA methylation data.
Generation strategy
Proportions representing DNA methylation data were generated using the logit transformation as in equation (2.1). The inter-probe distances were the actual distances from the motivating TCGA dataset, scaled to add to 1. In order to capture the complexity of methylation data, such as the existence of multiple latent methylation states (e.g., CpG islands and shores), different read depths across CpGs, and the incomplete conversion of bisulfite sequencing, the generation strategy was partly based on techniques implemented in WGBSSuite, a flexible stochastic simulation tool for generating single-base resolution methylation data (Rackham et al., 2015). However, the generation procedure differed from WGBSSuite in some respects. Specifically, it allowed more than two treatments . Additionally, as in actual methylation datasets, the generation procedure incorporated serial dependence not only in the methylation levels but also in the differential states of the probes.
The probe-specific read depths were generated as . Unlike assumption (2.1), there were no subject-specific random effects in the generation mechanism. Instead, the normal means incorporated additive probe-specific random effects, , that were generated using the following steps:
Generate the true methylation status of the probes, denoted by , using the 4-state distance-based HMM of Rackham et al. (2015), with the states respectively representing the methylated, first transit, demethylated, and second transit states.
Set the baseline methylation levels for the methylated, (first or second) transit, and demethylated states as , and .
For , compute the probe-specific means:
Generate for .
Noise and dependence levels
We investigated four scenarios corresponding to the combinations of two noise levels and two dependence levels. For each scenario, 20 datasets were independently generated, with each dataset consisting of probes and four samples associated with each of treatments, i.e., a total of samples. The low noise setting corresponded to true variance parameter ; equivalently, to a signal-to-noise ratio of . The high noise setting corresponded to or . The true between-probe dependencies comprised two levels: no serial correlation (i.e., a zero-order Sticky PYP) with , and positive serial correlation (i.e., a first order Sticky PYP) with . Although may appear to be small, its value is calibrated to the scaled inter-probe distances and represents fairly high inter-probe dependence. For example, when the distance between two adjacent probes is equal to the standardized average distance of gives an affiliation of in equation (2.9). For convenience, we will refer to the two dependence levels as “no-correlation” and “high correlation.” True values of the other model parameters were common to the four scenarios and are displayed in Table 1. Setting a true baseline differential proportion of resulted in approximately 10% true differentially methylated CpGs in each dataset.
Posterior inferences
Assuming all model parameters to be unknown, each artificial dataset was analyzed using a BayesDiff model that differed in key respects from the true generation mechanism. For example, unlike the 4-state HMM of the generation strategy, the probe-specific random effects were analyzed using a BayesDiff model that ignored the first order dependence, and instead, relied on a 3-state finite mixture model representing the methylated, transit, and unmethylated states. Additionally, in contrast to the zeroed-out subject-specific random effects during data generation, BayesDiff assumed i.i.d. normal random effects with zero means.
To assess BayesDiff’s accuracy in detecting the absence or presence of inter-probe serial correlation, in the no-correlation situation, we evaluated , the log-Bayes factor comparing zero order to first order Sticky PYPs. In the high correlation situation, we evaluated , the log-Bayes factor comparing first order to zero order Sticky PYPs. Thus, in any scenario, a large positive value of this measure provides strong evidence that BayesDiff detects the correct model order.
Although conceptually straightforward, the estimation of Bayes factors requires multiple MCMC runs even for relatively simple parametric models (Chib, 1995). Basu and Chib (2003) extended the estimation strategy to infinite dimensional models such as Dirichlet processes. However, the computational costs are prohibitively high for big datasets, and multiple MCMC runs stretch present-day computational resources beyond their limits. Faced with these challenges, we relied on an alternative strategy for estimating the lower bounds of log-Bayes factors as by-products of the Section 3.1 algorithm. As it turns out, this is often sufficient to infer Sticky PYP model orders. Let denote all BayesDiff model parameters except . In the high correlation situation, applying Jensen’s inequality, a lower bound for the corresponding log-Bayes factor is . Unlike log-Bayes factors, this lower bound can be easily estimated using a single MCMC run. In the no-correlation situation, a lower bound for the log-Bayes factor, , is similarly estimated.
In the four simulation scenarios, box plots of these estimated lower bounds for the 20 datasets are presented in Figure 3. Except for the high noise–no-correlation scenario, for which the results were inconclusive, the estimated lower bounds of the log-Bayes factors in favor of the true correlation structure were all positive and large. In the low noise–no-correlation scenario, BayesDiff decisively favored zero-order models, and the smallest lower bound among the 20 datasets was 13.9, corresponding to Bayes factors exceeding . The 25th percentile of these lower bounds was 30.9, corresponding to Bayes factors exceeding . This is strong evidence that the BayesDiff approach is reliable in this scenario. For the high-correlation scenarios, the estimated lower bounds were even higher, indicating that BayesDiff overwhelmingly favors first order models when the data are serially correlated.
Figure 3: Simulation study box plots for estimated lower bounds of the log-Bayes factors in favor of the true model order.
Comparisons with other methods
We evaluated the success of the BayesDiff procedure in detecting disease genomic signatures relative to six well-known statistical methods. These included generic multiple comparison techniques, namely, one-way analysis of variance (ANOVA) and the Kruskal-Wallis test. Also included were specially developed methods for detecting differential methylation in more than two treatments: COHCAP (Warden et al., 2013), methylKit (Akalin et al., 2012), BiSeq (Hebestreit et al., 2013), and RADMeth (Dolzhenko and Smith, 2014). The ANOVA and Kruskal-Wallis test procedures were performed separately on each probe after applying the inverse-logit transform to the data. The COHCAP method was directly applied to the synthetic data. The remaining three methods are designed for bisulfite sequencing, which consists of total methylation reads for each CpG site. For these methods, the methylation reads were calculated by multiplying the methylation proportions by the total reads. As recommended, the bandwidth smoothing parameter of the BiSeq method was tuned to optimize overall detection. For all six competing methods, probe-specific p-values were evaluated and adjusted for multiplicity using the FDR control procedure of Benjamini and Hochberg (1995).
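For readers unfamiliar with the multiplicity adjustment applied to the competing methods, the following sketch computes Benjamini-Hochberg adjusted p-values in plain Python. It is a generic illustration of the step-up adjustment, not the code used in the study; the function name is hypothetical, and in practice the same adjustment is available in standard statistical software.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A probe is then declared differential when its adjusted p-value does not exceed the nominal FDR level. For the four p-values in the usage example, the adjusted values are approximately 0.02, 0.04, 0.04, and 0.02.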
Like most nonparametric Bayes techniques, BayesDiff has considerably higher computational times than frequentist methods, but these are negligible compared to the time frames over which the experimental data are collected. Furthermore, as we demonstrate, the substantially greater accuracy of BayesDiff more than compensates for its computational costs. On a personal computer with an Intel Core i7–4770 processor with 3.40 GHz frequency and 8 GB RAM, the average run time for the Section 3.1 MCMC algorithm, applied to the synthetic datasets with samples, treatments, and probes, was 0.60 seconds per iteration. However, the computational times are greatly reduced by running the datasets in parallel across multiple cores of a research computing cluster. Analyzing datasets of various sizes, we found that the computational cost is but does not appreciably depend on . This is reasonable because the mixture model primarily focuses on . Due to the intensive nature of the one-parameter-at-a-time Gibbs sampling updates in Block 2, the Metropolis-Hastings algorithm of Guha (2010) can be applied to significantly speed up the updates and make the calculations more scalable. As part of ongoing work developing a fast R package, we find that ten- to hundred-fold speedups are possible with this fast MCMC strategy, which can also accelerate the block 1 parameter updates of Section 3.1.
We computed the receiver operating characteristic (ROC) curves for differential probe detection for all seven methods. For a quantitative assessment, we calculated the area under curve (AUC), declaring the method with the largest AUC as the most reliable in each scenario. The ROC curves, averaged over the 20 datasets under each simulation scenario, are shown with the AUCs in Figure 7 of Supplementary Material. In all except the high-noise–no-correlation scenario, BayesDiff uniformly outperformed the other methods. Even in the high-noise–no-correlation scenario, BayesDiff performed better in the low FPR region. As expected, all seven methods had lower accuracies for higher noise levels. BayesDiff did significantly better than the competing methods in the high correlation scenarios, suggesting that the incorporation of between-probe dependencies improves its accuracy in situations typical of DNA methylation data.
Since researchers typically focus on small false positive rates (FPRs), that is, small significance levels, we also calculated the measures, AUC20 and AUC10. AUC20 (AUC10) is defined as the area under the ROC curve multiplied by 5 (10) when the FPR does not exceed 0.2 (0.1). The multiplicative factors ensure that the areas potentially vary between 0 and 1. The three versions of AUC are presented in Table 3 in Supplementary Material. As also seen in Figure 7, Table 3 reveals that in three of the four scenarios, BayesDiff had the largest overall AUC. Furthermore, BayesDiff had vastly improved reliability for low FPRs. For example, consider the low noise–high correlation scenario. The overall AUC for BayesDiff was 0.035 greater than that for ANOVA. In contrast, the gains for BayesDiff, relative to ANOVA, were +0.107 for AUC20 and +0.146 for AUC10. The advantages of BayesDiff were even greater relative to the other competing methods. In the high noise–no-correlation scenario, BayesDiff had a relatively low AUC, as mentioned. However, even in this scenario, it had the greatest AUC20 and AUC10 among all the methods. Additionally, for a nominal FDR of , the achieved FDR of BayesDiff was between 0 and 0.03 in every dataset and simulation scenario. These results demonstrate the ability of BayesDiff to accurately detect differential probes, even in challenging situations in which the FPR is small.
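The truncated measures AUC20 and AUC10 can be computed from any set of probe scores and true differential labels. The sketch below builds an empirical ROC curve, integrates it by the trapezoidal rule up to a maximum FPR, and divides by that maximum, which is equivalent to multiplying by 5 for a cutoff of 0.2 and by 10 for a cutoff of 0.1. The function names and the linear interpolation at the cutoff are choices made for this illustration. The code assumes binary 0/1 labels with at least one probe of each kind.

```python
def roc_points(scores, labels):
    """Empirical ROC curve as a list of (FPR, TPR) points."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(pairs):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last member of a group of tied scores.
        if idx + 1 == len(pairs) or pairs[idx + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points


def scaled_partial_auc(scores, labels, max_fpr=1.0):
    """Area under the ROC curve for FPR <= max_fpr, divided by max_fpr."""
    points = roc_points(scores, labels)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Truncate the segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area / max_fpr


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.2, 0.1]
    labels = [1, 1, 0, 0, 0]
    print(scaled_partial_auc(scores, labels, max_fpr=0.2))  # AUC20
    print(scaled_partial_auc(scores, labels, max_fpr=0.1))  # AUC10
```

A method that ranks every differential probe above every non-differential probe attains a value of 1 for the overall AUC, AUC20, and AUC10, which is why the rescaling makes the three measures directly comparable.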
5. Data Analysis
We returned to the motivating DNA methylation data consisting of the upper GI cancers: stomach adenocarcinoma (STAD), liver hepatocellular carcinoma (LIHC), esophageal carcinoma (ESCA), and pancreatic adenocarcinoma (PAAD). Applying the BayesDiff procedure, we detected the differentially methylated CpG loci among the cancer types.
Data processing
The dataset was obtained from The Cancer Genome Atlas project, publicly available through The Genomic Data Commons (GDC) portal (Grossman et al., 2016). The measurements on 485,577 probes located at CpG sites were made using the Illumina Human Methylation 450 platform. At the time of analysis, the dataset consisted of 1,224 tumor samples. We analyzed the data on a gene-by-gene basis, selecting 443 genes with mutations in at least 5% of the samples. To ensure that all CpG sites potentially linked to a gene were included in the analysis, we selected sites located within 50K base pairs outside the gene body, upstream from the 5’ end as well as downstream from the 3’ end. The number of gene-specific CpG sites ranged from 1 to 769 and is displayed in Figure 8(a) of Supplementary Material. As a final preprocessing step, since the methylation patterns of short genes are less informative in cancer investigations, we eliminated the 25 genes mapped to 20 or fewer CpG sites.
Inference procedure
The data were analyzed using the proposed BayesDiff approach. Exploratory analyses indicated that eliminating the probe-specific random effects in expression (2.1) produces a satisfactory model fit. Since experimental batch information is not available in the TCGA dataset, we assumed that the parameters in (2.1) are i.i.d. with a random distribution having a Dirichlet process prior. The MCMC procedure of Section 3.1 was applied to obtain posterior samples for each gene. For detecting differential CpG sites, we applied the Section 3.2 procedure with a nominal FDR of .
Results
Among the differentially methylated CpG sites detected by our approach, approximately 40.6% of the sites were located outside the gene body. Figure 4 displays the associations between detected methylation status and position of the CpG sites. We defined “near the 5’ (3’) end” as CpG sites located within one-fourth length of the gene body, either inside or outside the gene boundary, and closer to the transcription start (termination) site. Our results indicate that the proportion of differential methylation is higher for CpG sites inside the gene body and most differentially methylated loci are situated within the gene body, as is well known from numerous previous studies. However, our analysis also revealed significant amounts of differential methylation outside the gene body. Despite the common belief that DNA methylation analysis should focus on the 5’ end region, we found that CpG sites near the 3’ ends also displayed considerable differential methylation. These findings support the recommendations of Irizarry et al. (2009) that studies of DNA methylation alteration should be conducted on a higher resolution, epigenome-wide basis.
Figure 4: Associations of detected methylation status and position of CpG sites.
Among the differentially methylated sites detected by BayesDiff, we estimated the pairwise differences between random effects associated with the four cancer types. Site-wise summaries of the largest pairwise differences of the cancer-specific effects are displayed in Figure 5. None of the four cancer types displayed consistent hypermethylation or hypomethylation across all genes or over entire chromosomes. However, we found that LIHC is frequently differentially methylated relative to one of the other cancer types, implying that it is the most volatile disease with respect to DNA methylation.
Figure 5: Site-wise summary of the largest pairwise differences of differentially methylated loci among the four upper GI cancer types.
For each gene, Figure 8(b) of Supplementary Material displays 95% credible intervals for the lower bounds of log-Bayes factors of a first versus zero-order 2R2CF model, i.e., versus in expression (2.9). Models with first order dependence are overwhelmingly favored for a majority of the genes, suggesting that statistical techniques that fail to account for dependence between neighboring CpG sites are less effective for these data. Figure 9 of Supplementary Material displays the detailed differential methylation pattern for the top two mutated genes, TP53 and TTN. An obvious feature of both genes is that the differential methylation patterns are strongly serially correlated. For gene TP53, there are almost no differentially methylated loci within the gene body. The 3’ end region outside the gene body has a cluster of differentially methylated loci, for which cancer type STAD is mostly hypermethylated. The results for gene TTN tell a quite different story: most of the differentially methylated loci are inside the gene body and near the 5’ end. Cancer type LIHC is hypomethylated compared to PAAD around the 5’ end region, but it is hypermethylated compared to STAD near the 3’ end. Genes with at least 90% differentially methylated sites detected by BayesDiff are listed in Table 4 of Supplementary Material, along with the largest pairwise difference between the four cancer types among the differentially methylated loci. The number of CpG sites within each segment is listed in Table 5 of Supplementary Material.
Existing medical literature both supports and complements our findings. For example, hypermethylation of the EDNRB and SLIT2 genes has been found in STAD (Tao et al., 2012). Gene FBN2 was hypermethylated in ESCA (Tsunoda et al., 2009). While several studies have found that the gene and protein expressions of ABC transporter genes, such as ABCC9, are useful for understanding the prognosis of esophageal cancer (Vrana et al., 2018), we find that hypermethylation of ABCC9 is a major difference between cancer types ESCA and LIHC. Gene FLRT2 is a potential tumor suppressor that is hypermethylated and downregulated in breast cancer (Bae et al., 2017). Our results indicate that this gene is also hypermethylated in cancer type STAD versus LIHC. Mutations in the SPTA1 gene have been linked with PAAD (Murphy et al., 2013); our results indicate that hypermethylation of this gene distinguishes PAAD from LIHC.
Finally, we compared our findings with those of ANOVA for multiple treatment comparisons. Table 6 lists the common set of genes with at least 90% differentially methylated sites identified by both BayesDiff and ANOVA. Table 7 displays the genes identified by only ANOVA, whereas Table 8 displays the large number of genes detected by only BayesDiff. Cross-referencing with the medical literature, we find that genes FLRT2 and FBN2 were detected by both methods. However, genes EDNRB, SLIT2, ABCC9, and SPTA1 were only identified by BayesDiff, revealing the benefits of the proposed Bayesian nonparametric method.
Accounting for data characteristics
To avoid making misleading biological interpretations, a statistical model must account for the observed biomarker means and variances, especially in multiple-testing approaches where the first two sample moments are important (Subramaniam and Hsiao, 2012). From this perspective, certain aspects of the BayesDiff model, such as variance being a priori unrelated to the mean in expression (2.1), may appear to be unduly restrictive. However, even though it was not specifically designed to match data summaries such as sample moments, in practice, the nonparametric nature of the Sticky PYP allows the posterior to flexibly adapt to unique data characteristics, such as sample moments, and account for mean-variance relationships in a robust manner. For example, consider again the top mutated genes, TP53 and TTN, discussed in Figure 9 of Supplementary Material. The ability of BayesDiff to match the sample moments of the gene-specific probes can be demonstrated as follows. Given the inter-probe distances, the joint posterior of the BayesDiff parameters induces predictive distributions on the measurements for each probe. Functionals of these predictive distributions, such as probe-specific sample moments, are easily estimated by post-processing the MCMC sample. For these two genes, Figure 10 of Supplementary Material reveals that the sample moments predicted by BayesDiff are a close match to the actual first and second sample moments with correlations exceeding 99% in each plot. Similar results were observed in other datasets.
6. Discussion
DNA methylation data exhibit complex structures due to biological mechanisms and distance-dependent correlations between adjacent CpG sites or probes. The identification of the differential signatures of multiple sets of tumor samples is crucial for developing targeted treatments for disease. This paper formulates a flexible approach applicable to multiple treatments called BayesDiff. The technique relies on a novel Bayesian mixture model called the Sticky PYP or the two-restaurant two-cuisine franchise. In addition to facilitating simultaneous inferences on the probes, the model accommodates distance-based serial dependence and accounts for the complex interaction patterns commonly observed in cancer data. An effective MCMC strategy for detecting the differential probes is developed. The success of the BayesDiff procedure in differential DNA methylation, relative to well-established methodologies, is exhibited via simulation studies. The new technique is applied to the motivating TCGA dataset to detect the differential genomic signatures of four upper GI cancers. The results both support and complement known facts about epigenomic differences between these cancer types, while identifying genes with high proportions of differentially methylated CpG sites.
In addition to providing a good fit to the data, a statistical model must be able to account for features such as sample moments. The success of the BayesDiff model in this regard is demonstrated in Section 5 using the upper GI dataset. It must be emphasized, however, that BayesDiff may be less successful in accounting for the characteristics of some other datasets, possibly due to slow asymptotic convergence of the posterior to the underlying generative process. In such situations, more flexible global transformations (Li et al., 2016) or variance-stabilizing transformations (Durbin et al., 2002) may be utilized. Alternatively, local Laplace approximations of exponential family likelihoods through link functions (Zeger and Karim, 1991; Chib and Winkelmann, 2001) may extend the BayesDiff model to better explain the data characteristics.
Like most Bayesian models comprising several latent parameters, the proposed 2R2CF may be marginalized over different parameter sets to obtain equivalent versions of the same model. For example, we could marginalize over restaurants to obtain an equivalent “sticky cuisine” version in which there is just one restaurant with two cuisine-sections and a customer more likely to favor the cuisine selected by the previous customer. Alternatively, we could marginalize over sections to obtain an equivalent “sticky restaurant franchise” in which each restaurant comprises a single section with restaurant-specific probabilities ensuring that Cuisine 1 or 2 dishes are more popular at their namesake restaurant; a customer is then more likely to favor the restaurant selected by the previous customer. In all equivalent versions, however, a probe’s differential state is determined by the customer’s dish in the metaphor.
The 2R2CF perspective offers the twin advantages of parameter interpretability and generalizability. Section 9 of the Supplementary Material presents the generalized form of the Sticky PYP, revealing the full potential of the proposed method in analyzing not only DNA methylation datasets, but also other types of omics datasets, such as gene expression, RNA-Seq, and copy-number alteration data. Beyond biomedical applications, the generalized formulation offers a diverse palette of parametric and nonparametric models for capturing the distinctive features of datasets. These Bayesian mixture models are special cases of Sticky PYPs for particular choices of a countable group parameter (e.g., two “restaurants” in the 2R2CF metaphor for differential methylation problems) and a countable state parameter (e.g., two “cuisines” in 2R2CF), with the state of a customer influencing the group of the next customer. In addition to extending PYPs to discrete time series-type data, the range of models includes Dirichlet processes, PYPs, infinite HMMs, hierarchical Dirichlet processes (Teh et al., 2006), hierarchical Pitman-Yor processes (Teh et al., 2006; Camerlenghi et al., 2019), finite HMMs, nested Chinese restaurant processes (Blei and Jordan, 2005), nested Dirichlet processes (Rodriguez et al., 2008), and analysis of densities models (Tomlinson and Escobar, 2003).
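As background for the special cases listed above, the following minimal Python sketch simulates the standard Pitman-Yor predictive (Chinese restaurant) seating rule. It covers only a plain PYP, without the two restaurants, the two cuisines, or the stickiness of the 2R2CF. After n customers occupy K tables of sizes n_k, the next customer joins table k with probability proportional to n_k - d and opens a new table with probability proportional to θ + dK. Setting the discount d to 0 recovers the Dirichlet process. The function name `pitman_yor_seating` and the argument names `discount` and `strength` are hypothetical labels chosen for this illustration.

```python
import random


def pitman_yor_seating(n_customers, discount=0.5, strength=1.0, seed=0):
    """Return the table sizes after seating customers by the PYP predictive rule.

    discount (d) and strength (theta) are illustrative argument names;
    discount = 0 gives the Dirichlet process as a special case.
    """
    if not (0.0 <= discount < 1.0) or strength <= -discount:
        raise ValueError("need 0 <= discount < 1 and strength > -discount")
    rng = random.Random(seed)
    table_sizes = []
    for _ in range(n_customers):
        if not table_sizes:
            # The first customer always opens the first table.
            table_sizes.append(1)
            continue
        # Unnormalized weights: (size - d) for each occupied table,
        # (theta + d * K) for a new table.
        weights = [size - discount for size in table_sizes]
        weights.append(strength + discount * len(table_sizes))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
    return table_sizes


if __name__ == "__main__":
    sizes = pitman_yor_seating(500, discount=0.5, strength=1.0, seed=1)
    print(len(sizes), "tables; largest sizes:", sorted(sizes, reverse=True)[:5])
```

Larger discounts generally yield more tables and heavier-tailed table sizes, which is one reason PYPs are often preferred to Dirichlet processes when many small clusters are expected.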
Ongoing work involves extending the correlation structure to model more sophisticated forms of inter-probe dependence in DNA methylation data. Commented R code implementing the BayesDiff method is available on GitHub at https://github.com/cgz59/BayesDiff. Using high-performance Rcpp subroutines, we are developing a fast R package for detecting differential genomic signatures in a wide variety of omics datasets. Initial results indicate that order-of-magnitude speedups will allow fast analysis of high-dimensional datasets on personal computers.
Acknowledgments
This work was supported by the National Science Foundation and National Institutes of Health under Grants DMS-1854003, R01 CA269398, and U01 CA209414 to SG, and Grants DMS-1463233 and R01 CA160736 to VB. The authors thank the anonymous Editor, Associate Editor, and two referees for many insightful comments that improved the content and presentation of the paper.
References
- Akalin A, Kormaksson M, Li S, Garrett-Bakelman FE, Figueroa ME, Melnick A, and Mason CE (2012). “methylKit: a comprehensive R package for the analysis of genome-wide DNA methylation profiles.” Genome Biology, 13(10): R87.
- Bae H, Kim B, Lee H, Lee S, Kang H-S, and Kim SJ (2017). “Epigenetically regulated fibronectin leucine rich transmembrane protein 2 (FLRT2) shows tumor suppressor activity in breast cancer cells.” Scientific Reports, 7(1): 272.
- Baker TA, Bell SP, Gann A, Levine M, Losick R, and Inglis C (2008). Molecular biology of the gene. San Francisco, CA, USA: Pearson/Benjamin Cummings.
- Basu S and Chib S (2003). “Marginal likelihood and Bayes factors for Dirichlet process mixture models.” Journal of the American Statistical Association, 98(461): 224–235.
- Benjamini Y and Hochberg Y (1995). “Controlling the false discovery rate: a practical and powerful approach to multiple testing.” Journal of the Royal Statistical Society: Series B (Methodological), 289–300.
- Blei DM and Jordan MI (2005). “Variational inference for Dirichlet process mixtures.” Bayesian Analysis, 1: 1–23.
- Camerlenghi F, Lijoi A, Orbanz P, and Prünster I (2019). “Distribution theory for hierarchical processes.” The Annals of Statistics, 47(1): 67–92.
- Chib S (1995). “Marginal likelihood from the Gibbs output.” Journal of the American Statistical Association, 90(432): 1313–1321.
- Chib S and Winkelmann R (2001). “Markov chain Monte Carlo analysis of correlated count data.” Journal of Business & Economic Statistics, 19(4): 428–435.
- De Blasi P, Favaro S, Lijoi A, Mena RH, Prünster I, and Ruggiero M (2015). “Are Gibbs-type priors the most natural generalization of the Dirichlet process?” IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2): 212–229.
- Dolzhenko E and Smith AD (2014). “Using beta-binomial regression for high-precision differential methylation analysis in multifactor whole-genome bisulfite sequencing experiments.” BMC Bioinformatics, 15(1): 215.
- Du P, Zhang X, Huang C-C, Jafari N, Kibbe WA, Hou L, and Lin SM (2010). “Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis.” BMC Bioinformatics, 11: 1–9.
- Dunson DB, Herring AH, and Engel SM (2008). “Bayesian selection and clustering of polymorphisms in functionally-related genes.” Journal of the American Statistical Association, 103: 534–546.
- Dunson DB and Park J-H (2008). “Kernel stick-breaking processes.” Biometrika, 95: 307–323.
- Durbin B, Hardin J, Hawkins D, and Rocke D (2002). “A variance-stabilizing transformation for gene-expression microarray data.” Bioinformatics, 18: S105–S110.
- Eckhardt F, Lewin J, Cortese R, Rakyan VK, Attwood J, Burger M, Burton J, Cox TV, Davies R, Down TA, et al. (2006). “DNA methylation profiling of human chromosomes 6, 20 and 22.” Nature Genetics, 38(12): 1378.
- Feinberg AP and Tycko B (2004). “The history of cancer epigenetics.” Nature Reviews Cancer, 4(2): 143.
- Feng H, Conneely KN, and Wu H (2014). “A Bayesian hierarchical model to detect differentially methylated loci from single nucleotide resolution sequencing data.” Nucleic Acids Research, 42(8): e69–e69.
- Ferguson TS (1973). “A Bayesian analysis of some nonparametric problems.” The Annals of Statistics, 1: 209–230.
- Fox E, Sudderth E, Jordan M, and Willsky A (2011). “The sticky HDP-HMM: Bayesian nonparametric hidden Markov models with persistent states.” Annals of Applied Statistics, 5: 1020–1056.
- Frühwirth-Schnatter S (2006). Finite Mixture and Markov Switching Models. New York: Springer.
- Gnedin A and Pitman J (2006). “Exchangeable Gibbs partitions and Stirling triangles.” Journal of Mathematical Sciences, 138: 5674–5685.
- Grossman RL, Heath AP, Ferretti V, Varmus HE, Lowy DR, Kibbe WA, and Staudt LM (2016). “Toward a shared vision for cancer genomic data.” New England Journal of Medicine, 375(12): 1109–1112.
- Guha S (2010). “Posterior simulation in countable mixture models for large datasets.” Journal of the American Statistical Association, 105(490): 775–786.
- Guha S and Baladandayuthapani V (2016). “A nonparametric Bayesian technique for high-dimensional regression.” Electronic Journal of Statistics, 10: 3374–3424.
- Hansen KD, Langmead B, and Irizarry RA (2012). “BSmooth: from whole genome bisulfite sequencing reads to differentially methylated regions.” Genome Biology, 13(10): R83.
- Hebestreit K, Dugas M, and Klein H-U (2013). “Detection of significantly differentially methylated regions in targeted bisulfite sequencing data.” Bioinformatics, 29(13): 1647–1653.
- Irizarry RA, Ladd-Acosta C, Carvalho B, Wu H, Brandenburg SA, Jeddeloh JA, Wen B, and Feinberg AP (2008). “Comprehensive high-throughput arrays for relative methylation (CHARM).” Genome Research, 18(5): 780–790.
- Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M, et al. (2009). “Genome-wide methylation analysis of human colon cancer reveals similar hypo-and hypermethylation at conserved tissue-specific CpG island shores.” Nature Genetics, 41(2): 178.
- Ishwaran H and James LF (2003). “Generalized weighted Chinese restaurant processes for species sampling mixture models.” Statistica Sinica, 13: 1211–1235.
- Jaffe AE, Murakami P, Lee H, Leek JT, Fallin MD, Feinberg AP, and Irizarry RA (2012). “Bump hunting to identify differentially methylated regions in epigenetic epidemiology studies.” International Journal of Epidemiology, 41(1): 200–209.
- Kim S, Tadesse MG, and Vannucci M (2006). “Variable selection in clustering via Dirichlet process mixture models.” Biometrika, 93: 877–893.
- Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, and Irizarry RA (2010). “Tackling the widespread and critical impact of batch effects in high-throughput data.” Nature Reviews Genetics, 11(10).
- Li D, Wang X, Lin L, and Dey DK (2016). “Flexible link functions in nonparametric binary regression with Gaussian process priors.” Biometrics, 72(3): 707–719.
- Lijoi A, Mena R, and Prünster I (2007a). “Bayesian nonparametric estimation of the probability of discovering new species.” Biometrika, 94: 769–786.
- Lijoi A, Mena R, and Prünster I (2007b). “Controlling the reinforcement in Bayesian nonparametric mixture models.” Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69: 715–740.
- Lijoi A and Prünster I (2010). Models beyond the Dirichlet process, 80–136. Cambridge Series in Statistical and Probabilistic Mathematics.
- Medvedovic M, Yeung KY, and Bumgarner RE (2004). “Bayesian mixture model based clustering of replicated microarray data.” Bioinformatics, 20: 1222–1232.
- Müller P and Mitra R (2013). “Bayesian nonparametric inference–why and how.” Bayesian Analysis (Online), 8(2).
- Murphy SJ, Hart SN, Lima JF, Kipp BR, Klebig M, Winters JL, Szabo C, Zhang L, Eckloff BW, Petersen GM, et al. (2013). “Genetic alterations associated with progression from pancreatic intraepithelial neoplasia to invasive pancreatic tumor.” Gastroenterology, 145(5): 1098–1109.
- Newton MA, Noueiry A, Sarkar D, and Ahlquist P (2004). “Detecting differential gene expression with a semiparametric hierarchical mixture method.” Biostatistics, 5(2): 155–176.
- Park Y, Figueroa ME, Rozek LS, and Sartor MA (2014). “MethylSig: a whole genome DNA methylation analysis pipeline.” Bioinformatics, 30(17): 2414–2422.
- Perman M, Pitman J, and Yor M (1992). “Size-biased sampling of Poisson point processes and excursions.” Probability Theory and Related Fields, 92(1): 21–39.
- Rackham OJ, Dellaportas P, Petretto E, and Bottolo L (2015). “WGBSSuite: simulating whole-genome bisulphite sequencing data and benchmarking differential DNA methylation analysis tools.” Bioinformatics, 31(14): 2371–2373.
- Rodriguez A, Dunson DB, and Gelfand AE (2008). “The nested Dirichlet process (with discussion).” Journal of the American Statistical Association, 103: 1131–1144.
- Saito Y, Tsuji J, and Mituyama T (2014). “Bisulfighter: accurate detection of methylated cytosines and differentially methylated regions.” Nucleic Acids Research, gkt1373.
- Sethuraman J (1994). “A constructive definition of Dirichlet priors.” Statistica Sinica, 639–650.
- Siegel RL, Miller KD, and Jemal A (2017). “Cancer statistics, 2017.” CA: A Cancer Journal for Clinicians, 67(1): 7–30.
- Song Q, Decato B, Hong EE, Zhou M, Fang F, Qu J, Garvin T, Kessler M, Zhou J, and Smith AD (2013). “A reference methylome database and analysis pipeline to facilitate integrative and comparative epigenomics.” PloS One, 8(12): e81148.
- Subramaniam S and Hsiao G (2012). “Gene-expression measurement: variance-modeling considerations for robust data analysis.” Nature Immunology, 13(3): 199–203.
- Sun D, Xi Y, Rodriguez B, Park HJ, Tong P, Meong M, Goodell MA, and Li W (2014). “MOABS: model based analysis of bisulfite sequencing data.” Genome Biology, 15(2): R38.
- Tao K, Wu C, Wu K, Li W, Han G, Shuai X, and Wang G (2012). “Quantitative analysis of promoter methylation of the EDNRB gene in gastric cancer.” Medical Oncology, 29(1): 107–112.
- Teh YW, Jordan MI, Beal MJ, and Blei DM (2006). “Hierarchical Dirichlet processes.” Journal of the American Statistical Association, 101: 1566–1581.
- Tomlinson G and Escobar M (2003). “Analysis of Densities.” Talk given at the Joint Statistical Meeting, 103: 1131–1144.
- Tsunoda S, Smith E, De Young NJ, Wang X, Tian Z-Q, Liu J-F, Jamieson GG, and Drew PA (2009). “Methylation of CLDN6, FBN2, RBP1, RBP4, TFPI2, and TMEFF2 in esophageal squamous cell carcinoma.” Oncology Reports, 21(4): 1067–1073.
- Vedeld HM, Goel A, and Lind GE (2017). “Epigenetic biomarkers in gastrointestinal cancers: The current state and clinical perspectives.” In Seminars in Cancer Biology. Elsevier.
- Vrana D, Hlavac V, Brynychova V, Vaclavikova R, Neoral C, Vrba J, Aujesky R, Matzenauer M, Melichar B, and Soucek P (2018). “ABC Transporters and Their Role in the Neoadjuvant Treatment of Esophageal Cancer.” International Journal of Molecular Sciences, 19(3): 868.
- Wang D, Yan L, Hu Q, Sucheston LE, Higgins MJ, Ambrosone CB, Johnson CS, Smiraglia DJ, and Liu S (2012). “IMA: an R package for high-throughput analysis of Illumina’s 450K Infinium methylation data.” Bioinformatics, 28(5): 729–730.
- Warden CD, Lee H, Tompkins JD, Li X, Wang C, Riggs AD, Yu H, Jove R, and Yuan Y-C (2013). “COHCAP: an integrative genomic pipeline for single-nucleotide resolution DNA methylation analysis.” Nucleic Acids Research, 41(11): e117–e117.
- Yu X and Sun S (2016). “HMM-DM: identifying differentially methylated regions using a hidden Markov model.” Statistical Applications in Genetics and Molecular Biology, 15(1): 69–81.
- Zeger SL and Karim MR (1991). “Generalized linear models with random effects: A Gibbs sampling approach.” Journal of the American Statistical Association, 86: 79–86.
