Scalable Bayesian nonparametric measures for exploring pairwise dependence via Dirichlet Process Mixtures

Sarah Filippi; Chris C Holmes; Luis E Nieto-Barajas

doi:10.1214/16-ejs1171

Author manuscript; available in PMC: 2018 Apr 24.

Published in final edited form as: Electron J Stat. 2016 Nov 16;10(2):3338–3354. doi: 10.1214/16-ejs1171

PMCID: PMC5915294 EMSID: EMS76651 PMID: 29707100

Abstract

In this article we propose novel Bayesian nonparametric methods using Dirichlet Process Mixture (DPM) models for detecting pairwise dependence between random variables while accounting for uncertainty in the form of the underlying distributions. A key criterion is that the procedures should scale to large data sets. In this regard we find that the formal calculation of the Bayes factor for a dependent-vs.-independent DPM joint probability measure is not feasible computationally. To address this we present Bayesian diagnostic measures for characterising evidence against a “null model” of pairwise independence. In simulation studies, as well as for a real data analysis, we show that our approach provides a useful tool for the exploratory nonparametric Bayesian analysis of large multivariate data sets.

Keywords: Bayes nonparametrics, contingency table, dependence measure, hypothesis testing, mixture model, mutual information

1. Introduction

Identifying dependences among pairs of random variables measured on the same sample, producing datasets of the form D = {(x_i, y_i); i = 1, … , n}, is an important task in modern exploratory data analysis, where historically the Pearson correlation coefficient and Spearman’s rank correlation have been used. More recently there has been a move to the use of non-linear or distribution-free methods such as those based on Mutual Information (MI) [Cover and Thomas, 2012, Kinney and Atwal, 2014]. In this paper we present Bayesian non-parametric methods for screening large data sets for possible pairwise associations (dependencies). Having an explicit probability measure of dependence has numerous advantages both in terms of interpretability and for integration across different experimental conditions and/or within a formal decision theoretic analysis. As data sets become ever larger and more complex we increasingly require Bayesian procedures that can scale to modern applications, and this will be a key design criterion here. The main building block of our procedures will be the Dirichlet Process Mixture (DPM) model, which is the most popular Bayesian nonparametric model.

We frame the problem of screening for evidence of pairwise dependence as a nonparametric model choice problem with alternatives:

\begin{matrix} ℳ_{0} : X \text{ and } Y \text{ are independent random variables} \\ ℳ_{1} : X \text{ and } Y \text{ are dependent random variables.} \end{matrix}

(1)

Given a set of measurement pairs D, for n exchangeable observations one could then evaluate the posterior probability for competing models P(ℳ₁|D) = 1 – P(ℳ₀|D) or consider the Bayes factor P(D | ℳ₀)/P(D | ℳ₁), which measures the strength of evidence for independence between the two samples against dependence. However, with p measurement variables under study there are $\approx \frac{1}{2} p^{2}$ such pairwise Bayes factors to compute, and even one such evaluation might be computationally problematic. This motivates us to explore scalable alternatives to a formal Bayesian testing approach, by deriving summary statistics and functionals of the posterior that can provide a strong indication in favour of or against independence.

Bayesian nonparametric hypothesis testing via Polya tree priors has been the focus of a couple of recent research papers [Holmes et al., 2015, Filippi and Holmes, 2015]. Here, however, we specify model uncertainty in the distribution of X and Y via DPMs of Gaussians. This provides flexibility while also encompassing smoothness assumptions on the underlying joint distributions. Another advantage is that DPMs have been widely studied in the Bayesian nonparametric literature with excellent open source implementation packages available [e.g. Jara et al., 2011]. Moreover, although not explored here, the use of DPMs makes our approach readily extendable to situations when X and Y are themselves collections of multivariate measurements. Here we consider pairwise dependence between univariate measurements where for ℳ₀, independence, the joint distribution factorises into a product of two univariate DPMs on X and Y, while for ℳ₁ we can define a joint DPM model on the bivariate measurement space (X, Y).

In theory, given a DPM prior on the unknown densities, the Bayes factor can be calculated via the marginal likelihood. However, this requires integrating over an infinite dimensional parameter space that does not have a tractable form. Moreover, using computational approaches to approximate the marginal likelihood is highly non-trivial, particularly when considering the need to scale to many thousands of comparisons with large p. To overcome this issue we present two new approaches to deriving scalable diagnostic measures corresponding to probabilistic measures of dependence, bypassing the need to calculate Bayes factors that might not be feasible or desirable. Our methods are motivated by two recent proposals in the literature [Lock and Dunson, 2013, Kamary et al., 2014], although neither of these papers considers the problem we address here, as outlined below.

Our first approach utilises the well known latent allocation, or clustering, structure of the DPM model to induce a partition of the two-dimensional data space. By running a Gibbs sampler under the independence model, the cluster allocation of observations to specific mixture components at each iteration can then be used to define a latent contingency table given by the mixture component memberships. For each of these contingency tables we perform a parametric Bayesian independence-vs.-dependence test using conjugate multinomial-Dirichlet priors that lead to explicit analytic forms for the conditional marginal likelihoods. This proposal follows a similar idea considered in Lock and Dunson [2013], who studied the two-sample testing problem. Beyond the fact that we consider pairwise dependence rather than two-sample testing, a key difference is that Lock and Dunson [2013] use a finite mixture model to induce a partition, whereas we use an infinite nonparametric mixture model.

In our second approach, we adapt a recent procedure of Kamary et al. [2014], turning the model choice problem into an estimation problem by writing the competing models under a hierarchy that incorporates both models, ℳ* = πℳ₁ + (1 – π)ℳ₀. We investigate the specification of ℳ* either as a mixture model with mixing component 0 ≤ π ≤ 1, or as a predictive linear ensemble of the two sub-models with constraints on the weights. We then estimate π, which becomes a measure of the evidence for dependence. DPMs are used to obtain the likelihood associated with each of the competing models in ℳ*, requiring a separate MCMC run for each potential pair of random variables.

We compare and contrast the two procedures with particular regard to their scalability to large data sets. This latter feature naturally includes the amenability of the methods to simulation with modern parallel computation. We demonstrate that our association measures are scalable and successfully detect some highly non-linear dependences with equivalent performance to the current best conventional methods using mutual information, with the added advantages that fully probabilistic Bayesian methods enjoy. As mentioned above, some of these key advantages include the ability to integrate results within a formal decision analysis framework, or within optimal experimental design, and the combination of results with other sources of information, or across studies such as arise in meta-analysis.

The rest of the paper is organised as follows. In Section 2 we review the Dirichlet Process and the DPM of Gaussians. In Section 3 we describe the two approaches to quantify the evidence for dependence using Dirichlet Process Mixtures. In Section 4 we illustrate our approach on the exploratory analysis of a real-world example from the World Health Organisation data set of country statistics and also on simulated data generated from simple models. We conclude the paper with a short discussion in Section 5.

2. Dirichlet Process Mixtures

The Dirichlet process [Ferguson, 1973] is the most important process prior in Bayesian non-parametric statistics. It is flexible enough to approximate (in the sense of weak convergence) any probability law, although the paths of the process are almost surely discrete [Blackwell and MacQueen, 1973]. Many years ago this discreteness was considered a drawback but nowadays it is simply a feature that characterises the Dirichlet process. This feature has recently been highly exploited in clustering applications (e.g. [Dahl, 2006]).

The Dirichlet process is defined as follows. Let G₀ be a probability measure defined on $(𝒳, ℬ)$, where $𝒳$ ⊂ ℝ^p and ℬ is the corresponding Borel σ-algebra. Let G be a stochastic process indexed by the elements of ℬ. G is a Dirichlet process with parameters c and G₀ if for every measurable partition (B₁, … , B_k) of $𝒳$,

(G (B_{1}), \dots, G (B_{k})) \sim Dir (c G_{0} (B_{1}), \dots, c G_{0} (B_{k})) .

From here we can see that, for every B ∈ ℬ, E{G(B)} = G₀(B) and Var{G(B)} = G₀(B) {1 – G₀(B)}/(c + 1). Therefore the parameter c is known as the precision parameter and G₀ as the centering measure.
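Both moments follow by applying the definition to the two-set partition (B, Bᶜ), for which the Dirichlet distribution reduces to a Beta distribution. A short derivation, using only the standard Beta mean and variance, is:

```latex
G(B) \sim \mathrm{Beta}\big(c\,G_{0}(B),\; c\,\{1 - G_{0}(B)\}\big)
\;\Longrightarrow\;
E\{G(B)\} = \frac{c\,G_{0}(B)}{c} = G_{0}(B),
\qquad
\mathrm{Var}\{G(B)\}
  = \frac{c\,G_{0}(B)\cdot c\,\{1 - G_{0}(B)\}}{c^{2}\,(c + 1)}
  = \frac{G_{0}(B)\,\{1 - G_{0}(B)\}}{c + 1}.
```

Larger values of c therefore concentrate G more tightly around G₀.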

The Dirichlet process when used as a prior induces exchangeability in the data. In notation, let X₁, … , X_n be a sample of random variables such that, conditional on G, $X_{i} | G \overset{iid}{\sim} G$. If we further take $G \sim 𝒟𝒫 (c, G_{0})$ then the marginal distribution of the data (X₁, … , X_n), once the process G has been integrated out, is characterised by what is known as the Pólya urn [Blackwell and MacQueen, 1973]. We start with X₁ ~ G₀ then

X_{n} | X_{1}, \dots, X_{n - 1} \sim \frac{c G_{0} + \sum_{j = 1}^{n - 1} δ_{X_{j}}}{c + n - 1}

(2)
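The urn scheme in (2) translates directly into a sequential sampler. The following is a minimal Python sketch, not the authors' code: the function name, the standard normal choice for G₀ and the values of n and c are illustrative assumptions.

```python
import random


def polya_urn_sample(n, c, draw_from_g0, rng):
    """Draw X_1, ..., X_n sequentially from the Polya urn in (2)."""
    xs = []
    for i in range(n):
        # With i previous draws, a fresh value from G0 is taken with
        # probability c / (c + i); otherwise one of the previous draws
        # is repeated, chosen uniformly at random.
        if rng.random() < c / (c + i):
            xs.append(draw_from_g0(rng))
        else:
            xs.append(rng.choice(xs))
    return xs


rng = random.Random(0)
# Assumption for illustration only: G0 is a standard normal.
sample = polya_urn_sample(200, 2.0, lambda r: r.gauss(0.0, 1.0), rng)
n_distinct = len(set(sample))  # ties make this much smaller than 200
```

The repeated values are what later produces the clustering structure of the DPM.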

Instead of placing the Dirichlet process prior directly on the observable data, it can be used as the law of the parameters of another model (kernel) that generated the data. In notation, let us assume that for each i = 1, … , n,

X_{i} | θ_{i} \overset{ind}{\sim} f (x_{i} | θ_{i}),

with f a parametric density function. We can further take

θ_{i} | G \overset{iid}{\sim} G

with

G \sim 𝒟𝒫 (c, G_{0}) .

This hierarchical specification can be seen as a mixture of density kernels f (x | θ) with mixing distribution coming from a Dirichlet process, i.e., ∫ f(x | θ)G(dθ). This model is known as the Dirichlet process mixture (DPM) model and was first introduced by Lo et al. [1984] in the context of density estimation and written in hierarchical form by Ferguson [1983].

The most typical choice of kernel f is the (multivariate) normal, in which case $θ_{i} = (μ_{i}, σ_{i}^{2})$, with scalar mean and variance, in the univariate case, and θ_i = (μ_i, Σ_i), with mean vector and variance-covariance matrix, in the multivariate case. We will work with this specific kernel throughout this paper.

As can be seen by construction (2), in the mixture case, the Dirichlet process induces a joint distribution on the set (θ₁, … , θ_n) that allows for ties in the θ_i’s. This in turn induces a clustering structure in the θ_i’s (and X_i’s). Posterior inference of the DPM model usually relies on a Gibbs sampler [Smith and Roberts, 1993]. At each iteration of the Gibbs sampler the model produces a different clustering structure. The number of clusters is a function of the sample size n and the precision parameter c of the underlying Dirichlet process. The larger the value of c, the larger the number of clusters induced. This clustering structure and parameter c will play a central role in one of the independence test procedures that will be described later.
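The dependence of the number of clusters on n and c can be made concrete. Under the urn scheme (2), the i-th draw starts a new cluster with probability c/(c + i − 1), so the prior expected number of clusters is the sum of these probabilities. A minimal Python sketch follows; the function name is illustrative, and n = 194 and the values of c are chosen only as examples.

```python
def expected_num_clusters(n, c):
    """Prior mean of the number of clusters: sum_{i=1}^{n} c / (c + i - 1)."""
    return sum(c / (c + i) for i in range(n))


# Larger c gives more clusters for the same sample size.
for c in (0.1, 1.0, 10.0):
    print(c, round(expected_num_clusters(194, c), 2))
```

This is a statement about the prior on partitions only; the posterior number of clusters also depends on the kernel and the data.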

3. Two approaches for measuring dependence

As noted in Section 1, the calculation or approximation of the formal Bayes factor under ℳ₀ and ℳ₁ is not feasible when considering a large number of model comparisons. Indeed it may not even be desirable given that our objective is to highlight potential departures from independence rather than answer a formal model choice question. In this section we describe two distinct approaches for comparing models ℳ₀ and ℳ₁ defined in (1) based on DPM models that are computable and scalable to large data.

3.1. Contingency tables approach

The first approach is motivated by the paper from Lock and Dunson [2013] who turned a two-sample testing problem into a discrete test on the clustered data. Recall that the two-sample testing problem considers the same measurement variable recorded on separate subjects under two different conditions; whereas we are considering different measurement variables recorded on the same subject. Similar to Lock and Dunson [2013], our procedure consists in marginally discretising the data into ordered categories and performing a Dirichlet-multinomial independence test on the induced contingency table. This amounts to first clustering the data under ℳ₀ and then exploring for evidence of departure from ℳ₀, toward ℳ₁, by testing for statistical association between the cluster memberships in X and Y. Uncertainty in the cluster memberships is accounted for by the DPM defined under ℳ₀, as outlined below.

To begin, assume that the data are marginally clustered in K_X and K_Y clusters and denote by ξ_X,i ∈ {1, … , K_X} and ξ_Y,i ∈ {1, … , K_Y} the cluster indicators for the data points x_i and y_i respectively, for i = 1, … , n. Using these cluster indicators, we can construct a contingency table M_{ξ_X,ξ_Y} = {m_kl} of size K_X × K_Y, such that $m_{k l} = \sum_{i = 1}^{n} I (ξ_{X, i} = k, ξ_{Y, i} = l)$, for k = 1, … , K_X and l = 1, … , K_Y. The contingency table M_{ξ_X,ξ_Y} represents a discretised version of the (unnormalised) marginals and joint distribution of the continuous vector (X, Y). We can then apply Bayesian independence tests for discrete / categorical variables following Gunel and Dickey [1974] and Good and Crook [1987], who proposed a conjugate multinomial-Dirichlet independence test which is described as follows. Let M_{ξ_X,ξ_Y} ~ Mult(n, p) with p = {p_kl} the matrix of cell probabilities of dimension K_X × K_Y. Consider a conjugate prior distribution p ~ Dir(α), with α = {α_kl} such that ∑_kl α_kl = a. In practice we suggest using α_kl = a(K_XK_Y)⁻¹ or α_kl = 1/2 for all 1 ≤ k ≤ K_X and 1 ≤ l ≤ K_Y. Under model ℳ₁ the probability of having observed the counts in M_{ξ_X,ξ_Y} is

P (M_{ξ_{X}, ξ_{Y}} | ℳ_{1}, ξ_{X}, ξ_{Y}) = \int P (M_{ξ_{X}, ξ_{Y}} | p) f (p) d p = \frac{Γ (a)}{Γ (a + n)} \prod_{k, l} \frac{Γ (α_{k l} + m_{k l})}{Γ (α_{k l})} .

(3)

Under the independent model ℳ₀ the observed counts M_{ξ_X,ξ_Y} can be expressed in terms of the marginal counts m_X = {m_k·} and m_Y = {m_·l} whose implied distributions are again multinomial with probability vectors p_X = {p_k·} and p_Y = {p_·l}, respectively, with p_k· = ∑_l p_kl and p_·l = ∑_k p_kl. The induced prior distributions are also Dirichlet with parameters α_X = {α_k·} and α_Y = {α_·l}. Then, the probability of M_{ξ_X,ξ_Y} under ℳ₀ becomes

\begin{array}{l} P (M_{ξ_{X}, ξ_{Y}} | ℳ_{0}, ξ_{X}, ξ_{Y}) = \int P (m_{X} | p_{X}) f (p_{X}) d p_{X} \int P (m_{Y} | p_{Y}) f (p_{Y}) d p_{Y} \\ = \frac{Γ^{2} (a)}{Γ^{2} (a + n)} \prod_{k} \frac{Γ (α_{k \cdot} + m_{k \cdot})}{Γ (α_{k \cdot})} \prod_{l} \frac{Γ (α_{\cdot l} + m_{\cdot l})}{Γ (α_{\cdot l})}, \end{array}

(4)

where α_k· = ∑_l α_kl and α_·l = ∑_k α_kl.

To compare the evidence in favour of each model, we use expressions (3) and (4) to compute the Bayes factor BF_{ξ_X,ξ_Y} = P(M_{ξ_X,ξ_Y} | ℳ₀, ξ_X, ξ_Y)/P(M_{ξ_X,ξ_Y} | ℳ₁, ξ_X, ξ_Y). Using equal prior probabilities for both models, i.e. P(ℳ₀) = P(ℳ₁) = 0.5, we obtain that the posterior probabilities for the dependence and independence models are P(ℳ₁ | M_{ξ_X,ξ_Y}) = 1/(1 + BF_{ξ_X,ξ_Y}) = 1 − P(ℳ₀ | M_{ξ_X,ξ_Y}), where

B F_{ξ_{X}, ξ_{Y}} = \frac{Γ (a)}{Γ (a + n)} \prod_{k} \frac{Γ (α_{k \cdot} + m_{k \cdot})}{Γ (α_{k \cdot})} \prod_{l} \frac{Γ (α_{\cdot l} + m_{\cdot l})}{Γ (α_{\cdot l})} \prod_{k, l} \frac{Γ (α_{k l})}{Γ (α_{k l} + m_{k l})} .

(5)
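Because (5) is a ratio of Gamma functions, it is best evaluated on the log scale. The sketch below is a minimal Python illustration under the assumption of a constant prior weight α_kl = α in every cell, so that a = αK_XK_Y; the function names are hypothetical and this is not the authors' R implementation.

```python
from math import exp, lgamma


def log_bayes_factor(table, alpha=0.5):
    """Log of BF in (5), independence over dependence, for counts m_kl."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[k][l] for k in range(n_rows)) for l in range(n_cols)]
    n = sum(row_tot)
    a = alpha * n_rows * n_cols  # a = sum of all alpha_kl
    a_row = alpha * n_cols       # alpha_{k.} = sum over l of alpha_kl
    a_col = alpha * n_rows       # alpha_{.l} = sum over k of alpha_kl

    log_bf = lgamma(a) - lgamma(a + n)
    log_bf += sum(lgamma(a_row + m) - lgamma(a_row) for m in row_tot)
    log_bf += sum(lgamma(a_col + m) - lgamma(a_col) for m in col_tot)
    log_bf += sum(lgamma(alpha) - lgamma(alpha + m) for row in table for m in row)
    return log_bf


def prob_dependence(table, alpha=0.5):
    """P(M1 | table) = 1 / (1 + BF) under equal prior model probabilities."""
    log_bf = log_bayes_factor(table, alpha)
    if log_bf > 0:  # evaluate 1 / (1 + exp(log_bf)) without overflow
        e = exp(-log_bf)
        return e / (1.0 + e)
    return 1.0 / (1.0 + exp(log_bf))


p_diag = prob_dependence([[50, 0], [0, 50]])    # perfectly associated clusters
p_flat = prob_dependence([[25, 25], [25, 25]])  # counts match independence
```

With the diagonal table the posterior probability of dependence is essentially one, whereas the balanced table gives a Bayes factor above one, i.e. evidence for independence.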

It should also be noted that this contingency table approach would also afford a conditional frequentist test. For example, consider Pearson’s chi-squared test of independence [Pearson, 1922]. Denote by m_k· = ∑_l m_kl and m_·l = ∑_k m_kl the number of individuals classified in cluster k of X and cluster l of Y, respectively. Then, the well known test statistic is

T = \sum_{k = 1}^{K_{X}} \sum_{l = 1}^{K_{Y}} \frac{{(m_{k l} - m_{k \cdot} m_{\cdot l} / n)}^{2}}{m_{k \cdot} m_{\cdot l} / n} .

(6)

Under the null hypothesis ℳ₀ of independence, statistic T follows a χ² distribution with (K_X – 1)(K_Y – 1) degrees of freedom. If the test statistic is improbably large according to that chi-square distribution, then one rejects the null hypothesis ℳ₀ in favour of the dependence hypothesis ℳ₁.
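For comparison, the statistic in (6) and its degrees of freedom can be computed from the same table of counts. This is a minimal Python sketch with a hypothetical function name; it assumes every row and column of the table is non-empty, which holds when the table is built from observed cluster labels.

```python
def chi_squared_statistic(table):
    """Return (T, degrees of freedom) for Pearson's statistic in (6)."""
    n_rows, n_cols = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[k][l] for k in range(n_rows)) for l in range(n_cols)]
    n = sum(row_tot)
    t = 0.0
    for k in range(n_rows):
        for l in range(n_cols):
            expected = row_tot[k] * col_tot[l] / n  # count expected under M0
            t += (table[k][l] - expected) ** 2 / expected
    return t, (n_rows - 1) * (n_cols - 1)


t_stat, dof = chi_squared_statistic([[50, 0], [0, 50]])
```

The resulting T would then be compared with the upper tail of the χ² distribution with the returned degrees of freedom.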

The hypothesis testing approach described in this section assumes that the data are marginally clustered. However, these clusters are not known a priori. A Bayesian approach for data clustering is to define a prior distribution over the clustering and then update the posterior based on the evidence provided by the data. Here we make use of the DPM model structure to create an empirical partition of the two-dimensional data space, taking into account the uncertainty on the allocation process. More precisely, we consider two independent DPM prior models for each of the marginal densities with the following specifications:

f_{0, X} (x) \sim \int N (x | θ_{X}) G_{X} (d θ_{X}) \quad \text{and} \quad f_{0, Y} (y) \sim \int N (y | θ_{Y}) G_{Y} (d θ_{Y}),

(7)

where $θ_{X} = (μ_{X}, σ_{X}^{2})$ and $θ_{Y} = (μ_{Y}, σ_{Y}^{2})$ with

G_{X} \sim 𝒟𝒫 (c_{0}, G_{0}) \quad \text{and} \quad G_{Y} \sim 𝒟𝒫 (c_{0}, G_{0})

(8)

and G₀ = N(μ | μ₀, σ²/k₀) IGa(σ² | ν/2 – 1, ψ/2). The latent clustering structure induced by the DPM models defined by (7) and (8), can then be used to construct a contingency table as described above. Note that in an ideal world one would carefully specify subjective beliefs on the prior marginals for X and Y. However, when the number of variables is large this is not feasible and we require some default specification as done here, by assuming a common prior after suitable transformation of the data.

Although it is clear from the properties of the DP that it induces a partition, in practice it is not easy to determine an optimal one. Fitting a DPM model via a Gibbs sampler provides a partition at each iteration. We can proceed in two different ways. One is to use all potential partitions coming from the MCMC, and for each of them perform the Bayesian independence test and report the expected posterior probability. More precisely, the functional we consider is

p_{dep} = \int \frac{1}{(1 + B F_{ξ_{X}, ξ_{Y}})} p (ξ_{X}, ξ_{Y}) d ξ_{X} d ξ_{Y} .

(9)

This is the procedure we recommend and develop below. An alternative approach would be to consider the selection of one of the partitions using an appropriate optimization criterion, for example using the criterion of Dahl [2006] who proposes to choose the partition that minimises the squared deviations with respect to the average pairwise clustering matrix, and use that single partition to perform the test, ignoring the uncertainty in the partition structure as in Lock and Dunson [2013] for the two-sample test. In Supplementary Material we provide an empirical comparison between both procedures.
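The Monte Carlo version of (9) has two ingredients: turning a pair of label vectors into a table of counts, and averaging 1/(1 + BF) over the sampled partitions. The Python sketch below illustrates both steps under stated assumptions: the label vectors and the log Bayes factors are taken as given inputs (the DPM Gibbs sampler and the evaluation of (5) are not shown), and the function names are hypothetical.

```python
from math import exp


def contingency_table(xi_x, xi_y):
    """Counts m_kl of observations with X-cluster k and Y-cluster l."""
    rows = {lab: k for k, lab in enumerate(sorted(set(xi_x)))}
    cols = {lab: l for l, lab in enumerate(sorted(set(xi_y)))}
    table = [[0] * len(cols) for _ in rows]
    for lab_x, lab_y in zip(xi_x, xi_y):
        table[rows[lab_x]][cols[lab_y]] += 1
    return table


def p_dep(log_bfs):
    """Estimate (9) by averaging 1 / (1 + BF) over the sampled partitions."""
    def prob(log_bf):
        if log_bf > 0:  # stable evaluation for large Bayes factors
            e = exp(-log_bf)
            return e / (1.0 + e)
        return 1.0 / (1.0 + exp(log_bf))
    return sum(prob(b) for b in log_bfs) / len(log_bfs)


# One sampled pair of label vectors for five observations (illustrative values).
table = contingency_table([1, 1, 2, 2, 3], [1, 1, 1, 2, 2])
```

In a full run there would be one table, and hence one log Bayes factor, per retained Gibbs iteration, and `p_dep` would be applied to that list.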

In the rest of the paper we will focus on the first alternative that considers all potential partitions; we will refer to this procedure as CT-BF.

Algorithm 1 Independence measure based on Contingency table (CT-BF).

Require: Data $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$

Require: Prior parameters a

Require: Prior parameters for the DPM and number of iterations N_it

Ensure: Probability of dependence p_dep

DPM inference:

Infer a DPM model for the distribution f_0,X (x) using a Gibbs sampler with N_it iterations → for each iteration 1 ≤ j ≤ N_it, record a vector of cluster indicators $ξ_{X}^{(j)}$

Infer a DPM model for the distribution f_0,Y (y) using a Gibbs sampler with N_it iterations → for each iteration 1 ≤ j ≤ N_it, record a vector of cluster indicators $ξ_{Y}^{(j)}$

for 1 ≤ j ≤ N_it do

Construct a contingency table M^(j) of size $K_{X}^{(j)} \times K_{Y}^{(j)}$ based on $ξ_{X}^{(j)}$ and $ξ_{Y}^{(j)}$

p^(j) ← 1/(1 + BF) where BF is defined in (5).

end for

$p_{dep} \leftarrow \frac{1}{N_{it}} \sum_{j = 1}^{N_{it}} p^{(j)}$

3.2. Mixture model predictive approach

In this section we consider an alternative approach for testing between the hypotheses in (1). Motivated by Kamary et al. [2014] we replace the testing problem with an estimation one by defining a predictive ensemble model ℳ* whose components are the competing models ℳ₀ and ℳ₁. To be precise, let f₀ and f₁ denote the densities of (X, Y) defined by models ℳ₀ and ℳ₁, respectively. Then we define a predictive mixture model as a linear combination of sub-models of the form

f^{*} (x, y) = π f_{1} (x, y) + (1 - π) f_{0} (x, y),

(10)

where π is a free regression parameter with constraint 0 ≤ π ≤ 1 and f₀(x, y) = f_0,X (x)f_0,Y (y). This model embeds both ℳ₀ and ℳ₁, which are recovered when π equals 0 and 1 respectively. The main idea of this method is to estimate from the data the mixture parameter π, which indicates the preference of the data for the dependence model ℳ₁. In contrast to the latent contingency table procedure, this approach requires the explicit construction of a joint model under hypothesis ℳ₁.

Since f₀ and f₁ are unknown densities, we assume Bayesian nonparametric prior distributions. For f_0,X (x) and f_0,Y (y) we consider the DPM model defined by equations (7) and (8). For f₁ we take a bivariate DPM model defined as

f_{1} (x, y) \sim \int N (x, y | θ_{X, Y}) G_{X, Y} (d θ_{X, Y}),

(11)

where θ_X,Y = (μ, Σ), with

G_{X, Y} \sim 𝒟 𝒫 (c_{1}, G_{1})

(12)

and G₁ = N(μ | μ₀, (1/k₀)Σ) IW(Σ | ν, Ψ). The parameter π also has to be estimated, so we take a prior of the form π ~ Be(a₀, b₀). We ensure that the centring measures G₀ and G₁ are comparable by setting their hyper-parameters as follows: we have G_d–1 = N(μ | μ₀, (1/k₀)Σ) IW(Σ | ν, Ψ) for d = 1 and 2 with ν = d + 2, the d-dimensional vector μ₀ ~ N(0_d, c_μ I_d), the d × d-matrix Ψ ~ IW(ν, c_Ψ I_d) where I_d is the identity matrix of dimension d. The hyper-parameters c_μ, c_Ψ and k₀ are set to be equal for G₀ and G₁.

Our objective is to highlight pairwise dependence across many pairs of variables, and order the pairs into those showing evidence from strongest to weakest association. This motivates us to consider a simplified method by assessing the relative posterior predictive evidence under ℳ₀ to that of ℳ₁, by calculating an ensemble model using the posterior predictive probability of the observed data f₁(x_new, y_new|D) and f₀(x_new, y_new|D) separately. In the following we will use the notations f̂_j (x_new, y_new) = f_j (x_new, y_new|D), j = 0, 1 to denote the posterior predictive distribution. It is important to note that for all [p × (p – 1)/2] X, Y pairs we use the same prior, and hence same model complexity across all pairs, so ranking by the improvement in posterior predictive likelihood under ℳ₁ relative to ℳ₀ should not a priori favour certain pairs over others. This procedure significantly simplifies the inference as we can infer the posterior models by first fitting the three DPM models separately each using the entire sample data, and then updating the ensemble parameter π from its posterior conditional distribution

f (π | D) \propto f (π) \prod_{i} (π {\hat{f}}_{1} (x_{i}, y_{i}) + (1 - π) {\hat{f}}_{0} (x_{i}, y_{i})),

which is a simple line search on [0, 1]. We will refer to this inference procedure as MixMod-ensemble – see Algorithm 2.

Algorithm 2 Independence test MixMod-ensemble.

Require: Data $D = \{(x_{i}, y_{i})\}_{i = 1}^{n}$; prior parameters a₀ and b₀; prior parameters for the DPMs

Ensure: Estimate π̂ of the mixture parameter π

DPMs inference:

f̂_0,X $⟵$ posterior prediction of a DPM for distribution of {x_i}_i averaged over all Gibbs sampler iterations

f̂_0,Y $⟵$ posterior prediction of a DPM for distribution of {y_i}_i averaged over all Gibbs sampler iterations

f̂₁ $⟵$ posterior prediction of a DPM for distribution of {x_i, y_i}_i averaged over all Gibbs sampler iterations

Estimation of π̂:

Define a fine grid on [0, 1] with intervals of length η = 10⁻⁴

for j = 0, … , η⁻¹ do

π^(j) $⟵$ j × η

$L_{j} ⟵ \sum_{i = 1}^{n} log (π^{(j)} {\hat{f}}_{1} (x_{i}, y_{i}) + (1 - π^{(j)}) {\hat{f}}_{0, X} (x_{i}) {\hat{f}}_{0, Y} (y_{i})) + log (Be (π^{(j)} | a_{0}, b_{0}))$

end for

$\hat{π} ⟵ \frac{1}{\sum_{j} exp (L_{j})} \sum_{j} π^{(j)} exp (L_{j})$
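The grid step of Algorithm 2 can be sketched in a few lines of Python once the predictive densities have been evaluated at the data points. The sketch below is illustrative and makes two assumptions that are not in the algorithm as stated: the grid uses cell midpoints (j + 1/2)η rather than jη, so that the Beta log-density stays finite at every grid point when a₀ or b₀ is below 1, and the weights are shifted by their maximum before exponentiation to avoid underflow. The function and argument names are hypothetical.

```python
from math import exp, log


def pi_hat(f1_hat, f0_hat, a0=0.5, b0=0.5, eta=1e-4):
    """Posterior mean of pi on a grid.

    f1_hat[i] and f0_hat[i] are assumed to be precomputed, strictly positive
    values of the joint predictive density and of the product of the marginal
    predictive densities at (x_i, y_i).
    """
    n_cells = round(1 / eta)
    # Assumption: midpoints keep log(p) and log(1 - p) finite.
    grid = [(j + 0.5) * eta for j in range(n_cells)]
    log_post = []
    for p in grid:
        loglik = sum(log(p * f1 + (1 - p) * f0) for f1, f0 in zip(f1_hat, f0_hat))
        # Beta(a0, b0) kernel; the normalising constant cancels in the ratio.
        log_prior = (a0 - 1) * log(p) + (b0 - 1) * log(1 - p)
        log_post.append(loglik + log_prior)
    top = max(log_post)
    weights = [exp(v - top) for v in log_post]
    return sum(p * w for p, w in zip(grid, weights)) / sum(weights)


# Toy inputs: the joint model predicts every point far better than the product.
estimate = pi_hat([1.0] * 50, [0.01] * 50, eta=1e-3)
```

When the two predictive densities coincide at every data point, the likelihood is flat in π and the estimate returns the prior mean.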

An alternative approach, more closely resembling Kamary et al. [2014], is to consider ℳ* as a mixture model rather than an ensemble model, where with probability π the data arise from f₁ and with probability 1 − π from f₀. Diebolt and Robert [1994] show that posterior sampling in a mixture model is simplified if we introduce latent variable indicators ζ_i ~ Ber(π) that determine whether observation i comes from f₁, when ζ_i = 1, or from f₀, when ζ_i = 0. Conditional on these latent indicators, the mixture components f₀ and f₁ can be updated using only the data points allocated to each model. As noted by Kamary et al. [2014], the Gibbs sampler implemented in this way can become quite inefficient if the parameter π approaches the boundaries {0, 1}, especially for large sample sizes. We refer to this method as MixMod. For our purposes this requires specifying a Gibbs sampler for the mixture model utilising three DPM models {f₁(x, y), f_0,X (x), f_0,Y (y)} and the mixture allocations for points across all p × (p – 1)/2 pairs.
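The latent-indicator scheme alternates between sampling each ζ_i given π and sampling π given the indicators, the latter being a conjugate Beta update under the Be(a₀, b₀) prior. The Python sketch below shows only these two steps and is a deliberate simplification: the component densities are held fixed at precomputed values, whereas in MixMod the three DPMs would also be updated at every sweep using the points allocated to them. The function and argument names are hypothetical.

```python
import random


def gibbs_pi(f1_vals, f0_vals, a0=0.5, b0=0.5, n_iter=2000, seed=0):
    """Sample pi with latent indicators, keeping f1 and f0 fixed (assumption)."""
    rng = random.Random(seed)
    n = len(f1_vals)
    pi = 0.5
    draws = []
    for _ in range(n_iter):
        # zeta_i = 1 with probability pi f1 / (pi f1 + (1 - pi) f0)
        n_dep = 0
        for f1, f0 in zip(f1_vals, f0_vals):
            w1 = pi * f1
            w0 = (1.0 - pi) * f0
            if rng.random() < w1 / (w1 + w0):
                n_dep += 1
        # pi | zeta ~ Beta(a0 + #{zeta_i = 1}, b0 + #{zeta_i = 0})
        pi = rng.betavariate(a0 + n_dep, b0 + n - n_dep)
        draws.append(pi)
    return draws


draws = gibbs_pi([1.0] * 50, [0.01] * 50, n_iter=500)
posterior_mean = sum(draws) / len(draws)
```

The slow mixing near the boundaries mentioned above arises because, once π is close to 0 or 1, almost all indicators take the same value and the Beta update keeps π near that boundary.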

In the paper we will illustrate the performance using MixMod-ensemble, and in the Supplementary Material we provide a comparison between MixMod and MixMod-ensemble.

Regardless of the posterior inference procedure, different estimators of π could be obtained from its posterior distribution. We chose to select the expected value as a statistic of dependence, that is,

\hat{π} = E (π | D) = \int_{0}^{1} π f (π | D) d π .

(13)

3.3. Computational tractability

Both of the Bayesian non-parametric approaches proposed here are motivated by the increasing necessity of screening large data sets for possible pairwise dependencies where calculation of the formal Bayes factor under ℳ₀ and ℳ₁ is unfeasible or undesirable. In this section, we discuss some computational advantages of our two methods, including their amenability to implementation on modern computing architectures exploiting parallelisation on multi-core standalone machines, clusters of multi-core and many-core machines, or cloud based computing environments.

In relation to parallelisation we see that both methods are divided into two steps: one starts by inferring DPMs using a Gibbs sampler and then performs a dependence test using every iteration of the Gibbs sampler. This decoupling of the inference step and the model comparison step significantly reduces the computational cost of the procedure. In particular, only a couple of thousand Gibbs sampling iterations are necessary to estimate the predictive posterior densities and posterior distributions over the latent allocation variables. In the R environment for statistical computing [R Core Team, 2014], the parallelisation of both approaches is very simple and only consists of replacing the command apply by the command parLapply from the package parallel, which is included in versions of R following 2.14.0. The R code to run CT-BF and MixMod-ensemble independence tests is available in the Supplementary Material.

The CT-BF approach based on the construction of a contingency table is particularly attractive as it is trivially parallelizable and does not involve an explicit DPM model for the joint f₁(x, y) under ℳ₁. With p measurement variables under study, this approach only needs to infer p independent marginal DPMs, recording information from N_it Gibbs sampling iterations for each of them independently in parallel. The MCMC output from the p models is then combined and we perform N_it × p × (p – 1)/2 independent tests, each of which, following (5), only involves computing ratios of Gamma functions. As an illustration, in the example described in more detail in Section 4, for p = 562 measurement variables, the first stage of inference on the DPMs takes less than 3 minutes on a 48-core machine, and then the resulting 1.5 × 10⁸ pairwise tests of dependence for all pairs of variables are performed in one hour.

In comparison the MixMod-ensemble approach incurs a greater computational overhead as we require bivariate DPMs, f₁(x, y), to be fit for all pairs. In the illustration below the MixMod-ensemble procedure for the 1.5 × 10⁸ pairs takes approximately 36 hours on the same 48-core machine.
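The number of variable pairs grows quadratically in p, and each pair is tested once per retained Gibbs iteration. A small Python sketch of the count and of a lazy enumeration of the pairs, with illustrative names, is:

```python
from itertools import combinations


def n_pairs(p):
    """Number of unordered pairs among p variables, p (p - 1) / 2."""
    return p * (p - 1) // 2


pairs_562 = n_pairs(562)  # 157641

# Lazy iterator over the index pairs (i, j) with i < j; each pair can be
# processed independently, which is what makes the second stage easy to
# distribute across cores.
pair_iter = combinations(range(562), 2)
first_pair = next(pair_iter)  # (0, 1)
```

Multiplying the pair count by the number of retained iterations N_it gives the total number of contingency-table tests.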

4. Numerical Analysis

4.1. World Health Organisation dataset

In this section, we apply the two approaches described in Section 3 to detect dependencies in economic, social and health indicators from the World Health Organisation (WHO). The WHO Statistical Information System (WHOSIS) has recently been incorporated into the Global Health Observatory (GHO) that contains a data repository (http://www.who.int/gho/database/en/) with mortality and global health estimates, demographic and socioeconomic statistics as well as information regarding health service coverage and risk factors for 194 countries. We combined these datasets to obtain a set of 562 statistics per country. We aim at highlighting potential dependencies between these indicators. Scatterplots of some of these indicators are represented in Figure 1, where for example we see, unsurprisingly, strong dependencies between indicators such as life expectancy at birth and increased life expectancy at age 60 (Pair E).

Figure 1: Examples of the relationship between economic, social and health indicators provided by the WHO Statistical Information System. Each dot corresponds to one country.

We applied both the CT-BF and the MixMod-ensemble test to compute the probability of dependence for all the 157,641 pairs of indicators. The two proposed methods require the specification of several parameters of the prior distributions. The impact of these choices is discussed in Supplementary Material. For the approach based on contingency tables the prior specifications for models (7) and (8) are set as follows: c₀ = 10, μ₀ ~ N(0, 1), k₀ ~ Ga(1/2, 100/2), ν = 3 and ψ ~ IGa(1/2, 5). Note that c₀ controls the number of clusters induced, so in order to avoid having partitions with only one cluster we set this parameter at a relatively large value. To specify the Dirichlet prior for the cell probabilities in the contingency table we took α_kl = 1/2, which is the Jeffreys prior in a multinomial model. In experimentation we found that the contingency table can be sensitive to the choice of the parameter c₀. This parameter influences the number of clusters in the DPM model and therefore the size of the contingency tables, and it is important to specify a value that induces a reasonable number of clusters. We would recommend exploring several values. Results seem fairly insensitive to the choice of the parameters α_kl in the Dirichlet priors.

For the approach considering an ensemble mixture model, the parameters c₀ and c₁ are not fixed but specified by c₀, c₁ ~ Ga(1, 1) and μ₀ ~ N(0, 100). This change was introduced to allow the model to determine the best fit without constraining the number of clusters. In addition, the prior processes G₀ and G₁ are defined as follows: G_d–1 = N(μ | μ₀, (1/k₀)Σ) IW(Σ | ν, Ψ) for d = 1 and 2 with ν = d + 2, the d-dimensional vector μ₀ ~ N(0_d, 100 I_d), the d × d-matrix Ψ ~ IW(ν, 0.1 I_d) and k₀ ~ Ga(1/2, 50), where I_d is the identity matrix of dimension d. The prior distribution of the mixing proportion π was specified by taking a₀ = b₀ = 1/2. Our experience is that results are fairly robust to the prior parameter settings (see Supplementary Material).

The procedures were implemented in the R environment for statistical computing [R Core Team, 2014] and make use of the package DPpackage [Jara et al., 2011]. Chains were run for 10,000 iterations with a burn-in of 1,000, keeping every 5th draw for computing estimates.
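
To make the chain settings concrete, the short Python sketch below counts the draws retained for estimation. It assumes that the 10,000 iterations include the burn-in; if the burn-in were run in addition to them, the count would change accordingly. The variable names are illustrative only.

```python
n_iter, burn_in, thin = 10_000, 1_000, 5

# Indices of the iterations kept after discarding the burn-in and thinning
kept_iterations = range(burn_in, n_iter, thin)
n_kept = len(kept_iterations)
print(n_kept)  # 1800
```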

For both approaches the tests were performed only for pairs containing measurements for at least 10 countries. For the CT-BF approach, the 562 DPMs are inferred using all the available data; however, the contingency tables were constructed taking into account only the countries for which both indicators (in the pair) are available. For the MixMod-ensemble approach, in order to avoid any bias towards one of the two models ℳ₀ or ℳ₁, both the DPMs on the marginals and the DPM on the joint space are inferred only on the countries for which measurements are available for both indicators. Extending the method to handle missing data is a future objective.

The measures of dependence obtained from our two approaches, i.e. p_dep for CT-BF and π̂ for MixMod-ensemble, defined respectively in equations (9) and (13), are compared for each pair of variables in Figure 2 (left panel). Strong dependences (defined as p_dep > 0.8) are detected for 5% of the pairs, and credible independence (i.e. p_dep < 0.2) for 30% of the pairs. We observe that the two probabilistic measures of dependence agree for most of the pairs, with the probability value obtained from the MixMod-ensemble method being generally higher than that obtained from the CT-BF approach. This elevation in the evidence for dependence is perhaps to be expected, as MixMod-ensemble uses the conditional posterior predictive likelihood, which will favour the more complex joint model f₁(x, y). However, the two methods disagree (defined as the probability value from one method being lower than 0.2 while that from the other is larger than 0.8) for less than 0.36% of the pairs, and these differences mainly occur when one of the (X, Y) variables is equal to 0 for more than 20% of the countries (see for example pairs C and D).
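
The thresholds used in this comparison can be written down as a small Python sketch. The function names and the "undecided" label for probabilities between 0.2 and 0.8 are introduced here for illustration; the cut-offs themselves are the ones stated above.

```python
def classify(p, low=0.2, high=0.8):
    """Label a posterior probability of dependence using the stated cut-offs."""
    if p > high:
        return "dependent"
    if p < low:
        return "independent"
    return "undecided"

def methods_disagree(p_dep, pi_hat, low=0.2, high=0.8):
    """True when one method is below `low` while the other is above `high`."""
    return (p_dep < low and pi_hat > high) or (pi_hat < low and p_dep > high)

# Hypothetical (p_dep, pi_hat) values for three pairs of indicators
examples = [(0.95, 0.99), (0.10, 0.85), (0.50, 0.70)]
labels = [(classify(p), classify(q), methods_disagree(p, q)) for p, q in examples]
```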

Figure 2. Performance comparison between the CT-BF and the MixMod-ensemble approaches (left) and the mutual information (right) for every pair of indicators in the WHO dataset. The probabilities of dependence obtained from CT-BF and MixMod-ensemble are respectively p_dep and π̂, defined in equations (9) and (13) and approximated following Algorithms 1 and 2. The letters A to F correspond to the 6 pairs of indicators illustrated in Figure 1.

On balance we prefer the CT-BF approach because of its computational scalability: 1 hour of run-time on a 48-core computer, compared with 36 hours for MixMod-ensemble in this example. We compared the analysis from CT-BF with that from a mutual information approach computed using the 20-nearest neighbours method, as in Kinney and Atwal [2014] (see Figure 2, right panel, where the labelled points correspond to the plots in Figure 1). We remark that some pairs of variables with strong dependences under CT-BF have a wide spread of mutual information; in particular, pairs D and F have a probability of dependence close to 1 for CT-BF but relatively low MI values. Visually at least, one could argue that associations of the form seen for pairs D and F (Figure 1) may be of potential interest for the analyst to follow up.
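
For readers unfamiliar with nearest-neighbour estimation of mutual information, the Python sketch below implements a Kraskov-Stögbauer-Grassberger style estimator with k = 20. We assume, without confirmation from the text, that this is the kind of estimator meant by the "20-nearest neighbours method"; the function names are ours, and the quadratic-time loops are written for clarity rather than speed.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))

def knn_mutual_information(xs, ys, k=20):
    """kNN estimate of I(X; Y) in nats (first KSG estimator)."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        # Max-norm distances from point i to all other points in the joint space
        dists = sorted(max(abs(xs[i] - xs[j]), abs(ys[i] - ys[j]))
                       for j in range(n) if j != i)
        eps = dists[k - 1]  # distance to the k-th nearest neighbour
        # Neighbours strictly within eps in each marginal space
        n_x = sum(1 for j in range(n) if j != i and abs(xs[i] - xs[j]) < eps)
        n_y = sum(1 for j in range(n) if j != i and abs(ys[i] - ys[j]) < eps)
        total += digamma_int(n_x + 1) + digamma_int(n_y + 1)
    return digamma_int(k) + digamma_int(n) - total / n

# Illustration on synthetic data (not the WHO indicators)
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(250)]
y_dep = [xi + 0.1 * rng.gauss(0, 1) for xi in x]   # strongly dependent on x
y_ind = [rng.gauss(0, 1) for _ in range(250)]      # independent of x
mi_dep = knn_mutual_information(x, y_dep)
mi_ind = knn_mutual_information(x, y_ind)
```

Unlike p_dep, the resulting value is not a probability: it is an estimate on the scale of nats and needs a reference distribution, for example from permutations, before it can be thresholded.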

4.2. Simulation Study for frequentist power analysis

In this section we perform a simulation study to examine the frequentist performance of the two proposed tests on some controlled scenarios. The objective is to verify that we are not losing much power against a popular non-probabilistic method based on mutual information, which is optimised for frequentist power. Simulated datasets are generated under the following four different scenarios:

(a) A bivariate normal model: (X, Y) ~ N₂(0, Σ) with $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$
(b) A sinusoidal model: Y = 2 sin(X) + η, with η ~ N(0, ϕ²), and X ~ Un[0, 5π]
(c) A parabolic model: Y = 2X²/3 + η, with η ~ N(0, ϕ²), and X ~ N(0, 1)
(d) A circular model: X = 10 cos(θ) + η and Y = 10 sin(θ) + η, with θ ~ Un[0, 2π] and η ~ N(0, ϕ²).
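
The four scenarios can be simulated with a few lines of Python, as sketched below. Two points are assumptions of this sketch rather than statements from the text: the noise terms added to X and Y in the circular model are drawn independently, and the parabolic mean is read literally as 2x²/3. The function name and arguments are illustrative.

```python
import math
import random

def simulate(model, n=250, rho=0.5, phi=1.0, rng=None):
    """Draw n pairs (x, y) from one of the four simulation scenarios."""
    rng = rng or random.Random()
    data = []
    for _ in range(n):
        if model == "normal":
            # Correlated standard normals built from two independent draws
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x, y = z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2
        elif model == "sinusoidal":
            x = rng.uniform(0, 5 * math.pi)
            y = 2 * math.sin(x) + rng.gauss(0, phi)
        elif model == "parabolic":
            x = rng.gauss(0, 1)
            y = 2 * x ** 2 / 3 + rng.gauss(0, phi)
        elif model == "circular":
            theta = rng.uniform(0, 2 * math.pi)
            # Assumption: independent noise on each coordinate
            x = 10 * math.cos(theta) + rng.gauss(0, phi)
            y = 10 * math.sin(theta) + rng.gauss(0, phi)
        else:
            raise ValueError(f"unknown model: {model}")
        data.append((x, y))
    return data

rng = random.Random(0)
sample_sin = simulate("sinusoidal", phi=1.0, rng=rng)
sample_circle = simulate("circular", phi=0.0, rng=rng)   # noise-free circle
sample_normal = simulate("normal", rho=1.0, rng=rng)     # perfectly correlated
```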

For the sinusoidal, parabolic and circular models, the parameter ϕ controls the level of noise, whereas for the normal model the correlation ρ controls the degree of dependence between the two samples. We generated fifty independent datasets of sample size n = 250 from each model, with correlations ρ ∈ {0, 0.1, 0.3, 0.5, 0.9} for model (a) and noise levels ϕ ∈ {1, 2, 3, 4, 5} for models (b)–(d). Figure 3 shows one of the fifty simulated datasets as an illustration.

Figure 3. Samples of size 250 generated from the four scenarios for two levels of correlation ρ in the normal model and two levels of noise ϕ in the sinusoidal, parabolic and circular models.

For all the simulated datasets we apply our different procedures for testing hypothesis (1). We use the same prior specifications as described in Section 4.1.

To investigate the power of the two approaches, we create ROC curves that compare the rate of true positives (percentage of times the procedure detects dependence among the fifty datasets generated from a dependent model) with the rate of false positives (percentage of times the procedure detects dependence among fifty null datasets, generated by randomly permuting the indices of the two samples to destroy any dependence) for different threshold values. We also compare the performance of the proposed methods with the current state-of-the-art conventional method, which is based on mutual information (using the 20 nearest neighbours). The ROC curves are reported in Figure 4; see also the Supplementary Material, which contains additional, more extensive comparisons.
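
The ROC construction described above can be sketched in Python as follows. To keep the example self-contained, the dependence score is the absolute Pearson correlation, which is only a stand-in for p_dep, π̂ or the mutual information; the data are fifty bivariate normal datasets with ρ = 0.5, and all function names are illustrative.

```python
import math
import random

def abs_pearson(pairs):
    """Stand-in dependence score: absolute sample correlation."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return abs(sxy) / math.sqrt(sxx * syy)

def permuted(pairs, rng):
    """Null dataset: shuffle the y sample to destroy any dependence."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    rng.shuffle(ys)
    return list(zip(xs, ys))

def roc_points(dep_scores, null_scores):
    """(false positive rate, true positive rate) for decreasing thresholds."""
    points = [(0.0, 0.0)]
    for t in sorted(set(dep_scores) | set(null_scores), reverse=True):
        fpr = sum(s >= t for s in null_scores) / len(null_scores)
        tpr = sum(s >= t for s in dep_scores) / len(dep_scores)
        points.append((fpr, tpr))
    return points

def bivariate_normal(n, rho, rng):
    out = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        out.append((z1, rho * z1 + math.sqrt(1 - rho ** 2) * z2))
    return out

rng = random.Random(0)
datasets = [bivariate_normal(250, 0.5, rng) for _ in range(50)]
dep_scores = [abs_pearson(d) for d in datasets]
null_scores = [abs_pearson(permuted(d, rng)) for d in datasets]
roc = roc_points(dep_scores, null_scores)

# Area under the curve by the trapezoidal rule
auc = sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(roc, roc[1:]))
```

Replacing `abs_pearson` with any of the three measures compared in the study gives the corresponding curve; the permutation step and the threshold sweep stay the same.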

Figure 4. ROC curves for competing methods as a function of correlation and noise level for models (a)–(d). CT-BF (blue line), MixMod-ensemble (red line) and Mutual Information approximated using the 20 nearest neighbours (black dotted line).

We observe that the proposed methods perform similarly to the current leading conventional method for data coming from a sinusoidal or a parabolic model. For data generated from the circular model, however, the mutual information method outperforms our approaches.

5. Conclusion

We presented two Bayesian nonparametric procedures for highlighting pairwise dependencies between random variables that are scalable to large data sets. The methods make use of standard software in R for implementing DPMs of Gaussians and are designed to exploit modern computer architectures. As such they are readily accessible to applied statisticians interested in exploratory analysis of large data sets. A power analysis shows that the power of the procedures is comparable with that of current non-Bayesian methods based on mutual information, while having the advantage of being probabilistic in their measurement.

Supplementary Material

Supplementary Information

NIHMS76651-supplement-Supplementary_Information.pdf (286.2KB, pdf)

References

Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. Annals of Statistics. 1973:353–355.
Cover TM, Thomas JA. Elements of Information Theory. John Wiley & Sons; 2012.
Dahl DB. Model-based clustering for expression data via a Dirichlet process mixture model. Bayesian Inference for Gene Expression and Proteomics. 2006:201–218.
Diebolt J, Robert C. Estimation of finite mixture distributions by Bayesian sampling. Journal of the Royal Statistical Society, Series B. 1994:363–375.
Ferguson TS. A Bayesian analysis of some nonparametric problems. Annals of Statistics. 1973;1:209–230.
Ferguson TS. Recent Advances in Statistics. Academic Press; New York: 1983. Bayesian density estimation by mixtures of normal distributions; pp. 287–302.
Filippi S, Holmes C. A Bayesian nonparametric approach to quantifying dependence between random variables. arXiv preprint arXiv:1506.00829. 2015.
Good IJ, Crook JF. The robustness and sensitivity of the mixed-Dirichlet Bayesian test for independence in contingency tables. Annals of Statistics. 1987;15:670–693.
Gunel E, Dickey J. Bayes factors for independence in contingency tables. Biometrika. 1974;61:545–557.
Holmes CC, Caron F, Griffin JE, Stephens DA, et al. Two-sample Bayesian nonparametric hypothesis testing. Bayesian Analysis. 2015;10(2):297–320.
Jara A, Hanson T, Quintana F, Mueller P, Rosner G. DPpackage: Bayesian semi- and nonparametric modeling in R. Journal of Statistical Software. 2011:1–30. URL http://www.jstatsoft.org/v40/i05/.
Kamary K, Mengersen K, Robert CP, Rousseau J. Testing hypotheses via a mixture estimation model. arXiv preprint arXiv:1412.2044. 2014.
Kinney JB, Atwal GS. Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences. 2014;111(9):3354–3359. doi: 10.1073/pnas.1309933111.
Lo AY, et al. On a class of Bayesian nonparametric estimates: I. Density estimates. Annals of Statistics. 1984;12(1):351–357.
Lock EF, Dunson DB. Two-sample testing with Dirichlet mixtures. arXiv preprint arXiv:1311.0307. 2013.
Pearson K. On the χ² test of goodness of fit. Biometrika. 1922;14:186–191.
R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2014. URL http://www.R-project.org/.
Smith AF, Roberts GO. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B (Methodological). 1993:3–23.
