A Bayesian Approach for Identifying Multivariate Differences Between Groups

Yuriy Sverchkov; Gregory F Cooper

doi:10.1007/978-3-319-24465-5_24

Author manuscript; available in PMC: 2016 Oct 1.

Published in final edited form as: Adv Intell Data Anal. 2015 Nov 22;9385:275–285. doi: 10.1007/978-3-319-24465-5_24

PMCID: PMC4825814 NIHMSID: NIHMS742516 PMID: 27069983

Abstract

We present a novel approach to the problem of detecting multivariate statistical differences across groups of data. The need to compare data in a multivariate manner arises naturally in observational studies, randomized trials, comparative effectiveness research, abnormality and anomaly detection scenarios, and other application areas. In such comparisons, it is of interest to identify statistical differences across the groups being compared. The approach we present in this paper addresses this issue by constructing statistical models that describe the groups being compared and using a decomposable Bayesian Dirichlet score of the models to identify variables that behave statistically differently between the groups. In our evaluation, the new method performed significantly better than logistic lasso regression in identifying differences in a variety of datasets under a variety of conditions.

1 Introduction

There are many circumstances in which data collected from different sources are similar in some respects, but nonetheless differ in ways that are interesting to report. Such circumstances arise naturally in observational studies, where, for example, a clinical researcher may observe a difference in the prevalence of a condition between two groups of patients and would like to explore the reasons behind the difference; in randomized trials, where we might be interested not only in the effectiveness of a treatment but also whether its effects are particular to subgroups of the subjects and if so, what the relevant contextual relationships are; in comparative effectiveness research, where an observed difference between two clinical treatment approaches is to be explained; and other application areas. Identifying patterns of differences is also useful in abnormality and anomaly detection scenarios, where data on a potentially anomalous population of samples are compared to a “normal” baseline population.

We approach the problem of identifying interesting patterns of differences from a statistical standpoint, where given a pair of data groups over a vector of discrete random variables we would like to identify variables that exhibit statistical differences. A variable might have a different marginal distribution in the two groups and/or a different conditional distribution when conditioning on the values of some of the other variables. We present and evaluate a method for identifying differences in both of those categories.

The method accomplishes this task by building models for each of the groups and for both groups, scoring local differences in distribution by comparing how well these alternative parameterizations fit the data locally, and using these local scores to obtain a score of how different the two groups are as a whole. In this paper, the performance of our method for identifying differences between groups at the variable-level is evaluated using data based on four UCI Machine Learning Repository data sets [1].

2 Background

We review some general background literature about distribution comparison followed by background relevant to Bayesian networks, which is the model that our method uses, with a particular focus on learning models from data and the Bayesian Dirichlet score which we use.

2.1 Comparing Distributions

There are various statistical methods that are applicable to the problem of identifying differences across a pair of groups. The statistical approach that most closely relates is that of contrast set mining. Bay and Pazzani [2] present contrast-set mining as the discovery of joint variable-value assignments that have different levels of support in different groups. The approach taken parallels association-rule mining in that the space of possible joint variable-value assignments is searched to maximize a score (in association-rule mining, this score is the lift of a rule, while in contrast-set mining a chi-square test is used). The main challenge in contrast-set mining is the search of the exponentially large space of possible sets (joint variable-value assignments), and much of the literature is dedicated to discussing heuristics and pruning rules to make the search feasible. The output of contrast-set mining is the list of joint variable-value assignments (the sets) which have differing support across groups, ranked by the extent of that difference and tested for significance. Novak et al. [11] summarize further literature on contrast-set mining and its relation to association-rule mining, emerging pattern mining, and subgroup mining. While these approaches address a similar task to that of our method, these are all value-based approaches. Their task is to identify specific value ranges in which the differences between the groups are most pronounced. In contrast, our approach is variable-based, meaning that we identify variables the distributions of which are different across the groups.

The variable-based nature of our approach bears some similarity to traditional statistical methods. There are multiple traditional statistical tests that are designed to compare distributions. For categorical variables, the chi-square test is applicable: it tests whether two categorical variables are independent. It can be used to determine whether a variable has different distributions across two groups by testing whether the variable is dependent on the group variable. For continuous variables, the Kolmogorov-Smirnov test is often used to determine equality of distributions. Note that these tests are univariate, and therefore cannot be used to compare two multivariate groups of data directly. There are other measures of distribution differences that are multivariate in nature, such as Hotelling's T-squared test, mutual information, or Kullback-Leibler divergence. These measures are multivariate, but they do not allow for examining the contributions of differences in individual variables to the overall measure of difference across the groups. The approach we present bridges this gap by providing both a measure of overall difference and a breakdown into the contributions of differences in the distributions of individual variables.

2.2 Bayesian Networks

As mentioned above, the approach we present relies on building statistical models for the data groups to be compared. In particular, the model we construct is a Bayesian Network (BN). A BN over the variables X = (X₁, …, X_n), where each variable X_i is discrete and takes K_i values, consists of a directed acyclic graph (DAG) where each node represents a variable X_i and each node is associated with a conditional probability table (CPT) defined by a set of parameters

θ_{i j k} = P (X_{i} = x_{i k} | Π_{i} = π_{i j})

(1)

where x_ik represents the k-th value X_i takes and π_ij represents the j-th configuration of X_i's set of parents Π_i [8].
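As a minimal sketch of Eq. (1), assume a hypothetical two-variable network X1 → X2 with made-up probabilities (the variable names, the 0/1 value coding, and the numbers are illustrative and not taken from the evaluation). A CPT can then be stored as a mapping from parent configurations to probability vectors, and the joint probability of a full assignment is the product of the local conditional probabilities:

```python
# Hypothetical two-node BN: X1 -> X2, both binary (values coded 0 and 1).
# cpt[var][parent_config][k] plays the role of theta_ijk in Eq. (1).
parents = {"X1": (), "X2": ("X1",)}
cpt = {
    "X1": {(): [0.7, 0.3]},
    "X2": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
}

def joint_probability(assignment):
    """P(X = x) as the product of one CPT entry per node."""
    prob = 1.0
    for var, pa in parents.items():
        config = tuple(assignment[p] for p in pa)
        prob *= cpt[var][config][assignment[var]]
    return prob

# Summing over all four assignments should give 1.
total = sum(joint_probability({"X1": a, "X2": b}) for a in (0, 1) for b in (0, 1))
```

Each row of a CPT is one conditional distribution X_i|π_ij, which is the unit that the comparisons in Section 3 operate on.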

In order to obtain a BN from data, the DAG structure is needed. In some cases the structure or elements of the structure for a given domain may be known, but often the structure must be learned from data. Daly et al. [5] provide an extensive review of BN structure learning and divide existing methods into constraint-based methods, where conditional independencies (CI) in the data are used to constrain the structure; and score search, where the space of BN structures is searched for a structure that has the best score according to some scoring criterion. Constraint-based methods, such as the PC algorithm by Spirtes and Glymour [13], use CIs obtained from statistical tests on the data to eliminate possible arcs in the network structure. Score-based techniques seek to optimize some score function of the graph based on the data. The space over which many methods in this category search is that of possible DAGs, which is combinatorial in the number of variables, and the task of optimizing the score is NP-hard in general [3]. Algorithms that feasibly search the entire space of DAGs in the case of up to approximately 30 variables include dynamic programming approaches [10,12] and an application of A* search to the space of DAGs [14]. For data with more variables, many search methods apply various heuristics and do not perform an exhaustive search of the space; most commonly these methods employ some sort of greedy search strategy [5].

In our implementation we used greedy-thick-thinning, an algorithm described but not named in [8], which maximizes the K2 score [4] in a greedy fashion: starting with an empty graph, it adds the arc that most increases the score until no arc addition increases the score, and then removes the arc whose removal most increases the score until no removal increases the score. Any score search strategy can be used with our method.
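The add-then-remove search described above can be sketched as follows. This is an illustrative reading of the description, not the implementation used in the evaluation: the function names, the data layout (rows as tuples of integer-coded values), and the small tolerance on score gains are all assumptions.

```python
from itertools import product
from math import lgamma

def local_k2(data, child, parents, arity):
    """Log K2 score of one node (Dirichlet-multinomial with all alpha_ijk = 1)."""
    counts = {}
    for row in data:
        key = tuple(row[p] for p in parents)
        counts.setdefault(key, [0] * arity[child])[row[child]] += 1
    k = arity[child]
    score = 0.0
    for n_ijk in counts.values():  # unseen parent configurations contribute 0
        score += lgamma(k) - lgamma(k + sum(n_ijk))
        score += sum(lgamma(1 + n) for n in n_ijk)
    return score

def creates_cycle(parent_sets, src, dst):
    """True if adding src -> dst would close a cycle (dst is an ancestor of src)."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parent_sets[node])
    return False

def greedy_thick_thinning(data, arity, tol=1e-9):
    n = len(arity)
    parent_sets = {v: [] for v in range(n)}
    score = {v: local_k2(data, v, [], arity) for v in range(n)}
    # Thickening: repeatedly add the single arc with the largest score gain.
    while True:
        best = None
        for src, dst in product(range(n), repeat=2):
            if src in parent_sets[dst] or creates_cycle(parent_sets, src, dst):
                continue
            gain = local_k2(data, dst, parent_sets[dst] + [src], arity) - score[dst]
            if gain > tol and (best is None or gain > best[0]):
                best = (gain, src, dst)
        if best is None:
            break
        gain, src, dst = best
        parent_sets[dst].append(src)
        score[dst] += gain
    # Thinning: repeatedly remove the single arc whose removal helps most.
    while True:
        best = None
        for dst in range(n):
            for src in parent_sets[dst]:
                rest = [p for p in parent_sets[dst] if p != src]
                gain = local_k2(data, dst, rest, arity) - score[dst]
                if gain > tol and (best is None or gain > best[0]):
                    best = (gain, src, dst)
        if best is None:
            break
        gain, src, dst = best
        parent_sets[dst].remove(src)
        score[dst] += gain
    return parent_sets

# Hypothetical data over three binary variables: X1 copies X0, X2 is independent.
data = [(a, a, c) for a in (0, 1) for c in (0, 1)] * 10
structure = greedy_thick_thinning(data, arity=[2, 2, 2])
```

On this toy data the search should keep a single arc between X0 and X1 and leave X2 unconnected. Because the score decomposes by node, each candidate move only requires rescoring the node whose parent set changes.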

2.3 Bayesian Dirichlet Scores for Bayesian Networks

In this work we use Bayesian Dirichlet (BD) scoring in order to leverage both the mathematical properties of the score and its statistical interpretation. The BD score is motivated by the search for a maximum a posteriori (MAP) model, a graph structure ℳ that is most probable given the data 𝒟 and prior belief. Directly computing a posterior P(ℳ|𝒟) for the structure is an intractable task; however, we can show that it is proportional to an easily computable quantity. From Bayes' rule we have that P(ℳ|𝒟) ∝ P(ℳ)P(𝒟|ℳ), where P(ℳ) is a prior for the graph structure. Often the graph structure prior is assumed to be uniform, an assumption that we make in this paper, but one that can be easily relaxed, and the goal becomes to maximize P(𝒟|ℳ), which is a marginal likelihood. Under the assumptions of global and local parameter independence, and parameter modularity [9], the marginal likelihood for the full model P(𝒟|ℳ) is the product of local marginal likelihoods:

P (D | M) = E_{Θ | M} \prod_{i = 1}^{n} \prod_{j = 1}^{J_{i}} P (D | Θ_{i j}, M) = \prod_{i = 1}^{n} \prod_{j = 1}^{J_{i}} E_{Θ_{i j} | M} P (D | Θ_{i j}, M) .

(2)

Here we treat BN parameters as random variables with a prior distribution rather than as point values; that is, the particular value for a network parameter θ_ijk is just a point in the continuum of possible values that a random variable Θ_ijk takes. In the context of a BD score such as K2 [4] or the BDeu score [9], the prior distribution of Θ_ij = (Θ_ij₁, …, Θ_{ijK_i}) is Dirichlet with parameters α_ij = (α_ij₁, …, α_{ijK_i}). For a given structure, the distribution of a variable X_i given a parent configuration π_ij is Dirichlet-multinomial, with a closed-form marginal likelihood

E_{Θ_{i j} | M} P (D | Θ_{i j}, M) = \frac{Γ (α_{i j \cdot})}{Γ (α_{i j \cdot} + N_{i j \cdot})} \prod_{k = 1}^{K_{i}} \frac{Γ (α_{i j k} + N_{i j k})}{Γ (α_{i j k})}

(3)

where J_i is the number of configurations of the parent set Π_i, $α_{i j \cdot} : = \sum_{k = 1}^{K_{i}} α_{i j k}$ , N_ijk is the number of samples in the data for which X_i = x_ik and Π_i = π_ij, and $N_{i j \cdot} : = \sum_{k = 1}^{K_{i}} N_{i j k}$ . Different choices of the Dirichlet parameter priors lead to different BD scores: for example, the K2 score is obtained from using uniform priors (all α_ijk = 1), and the BDeu score is obtained from using priors with $α_{i j k} = \frac{α^{*}}{J_{i} K_{i}}$ where α* is the Equivalent Sample Size (ESS) hyperparameter.
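Eq. (3) is usually evaluated in log space to avoid overflow in the gamma functions. A minimal sketch, assuming counts and Dirichlet parameters are passed as plain lists (the function names are illustrative):

```python
from math import exp, lgamma

def log_dirichlet_multinomial(counts, alphas):
    """Log of Eq. (3) for one (i, j): counts are N_ijk, alphas are alpha_ijk."""
    a_sum, n_sum = sum(alphas), sum(counts)
    value = lgamma(a_sum) - lgamma(a_sum + n_sum)
    for n, a in zip(counts, alphas):
        value += lgamma(a + n) - lgamma(a)
    return value

def bdeu_alphas(ess, num_parent_configs, num_values):
    """BDeu prior: alpha_ijk = alpha* / (J_i * K_i)."""
    return [ess / (num_parent_configs * num_values)] * num_values

counts = [3, 1]  # a binary variable observed 3 times at one value, once at the other
k2 = log_dirichlet_multinomial(counts, [1.0, 1.0])               # K2: all alpha_ijk = 1
bdeu = log_dirichlet_multinomial(counts, bdeu_alphas(1.0, 1, 2)) # BDeu with ESS = 1
```

For these counts the K2 marginal likelihood is Γ(2)Γ(4)Γ(2)/Γ(6) = 6/120 = 0.05, which `exp(k2)` reproduces.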

Having outlined how the proposed method differs from common approaches to the statistical comparison of data, and having reviewed the relevant background on BNs and the BD score, we next describe our method.

3 Method

Consider two groups of data 𝒟₁ and 𝒟₂ over the same set of variables X = (X₁, …, X_n), and denote the concatenation of 𝒟₁ and 𝒟₂ by 𝒟_⋃. If 𝒟₁ and 𝒟₂ are not different in a statistical sense, they follow the same distribution, which is therefore the distribution of 𝒟_⋃. Let ℳ₁, ℳ₂, ℳ_⋃ denote maximum a posteriori (MAP) models within some space of models for the data in 𝒟₁, 𝒟₂, and 𝒟_⋃ respectively. In the case where 𝒟₁ and 𝒟₂ are the same, we expect that P(𝒟₁|ℳ₁) × P(𝒟₂|ℳ₂) ≤ P(𝒟_⋃|ℳ_⋃) in the large sample limit, since modeling the two groups as governed by independent distributions does not yield a better fitting model than when the groups are modeled as coming from the same distribution. In the case where 𝒟₁ and 𝒟₂ are statistically different, we expect P(𝒟₁|ℳ₁) × P(𝒟₂|ℳ₂) > P(𝒟_⋃|ℳ_⋃) in the large sample limit.

Let us extend this idea from the model level to the parameters of the models, an extension that can be applied when the models have the following properties: the distribution of a variable X_i is defined by a vector of parameters θ_i, the parameters θ_i are drawn from a distribution Θ_i, and parameter independence holds, such that Θ_i ⊥ Θ_i′ for i ≠ i′. BNs with Dirichlet parameter priors have these properties. In order to compare parameters across models, the parameters compared must match in meaning. First we will consider the case where ℳ₁, ℳ₂, and ℳ_⋃ have the same structure, and therefore have parameters that can be perfectly matched across models; next we will extend the approach to the more general case of structures that have consistent ordering, where matching happens between sets of parameters.

In the case where ℳ₁, ℳ₂, and ℳ_⋃ have the same structure, we can consider each Dirichlet-multinomial component of the full model in isolation. Consider comparing the marginal likelihood of modeling θ_ij = P(X_i|π_ij) independently across the two groups of data

S_{i j} = P (D_{1}, D_{2} | Θ_{i j}^{(1)} ⊥ Θ_{i j}^{(2)}) = (E_{Θ_{i j} | M_{1}} P (D_{1} | Θ_{i j}, M_{1})) (E_{Θ_{i j} | M_{2}} P (D_{2} | Θ_{i j}, M_{2}))

(4)

to the marginal likelihood of modeling θ_ij as being the same for both groups

T_{i j} = P (D_{1}, D_{2} | Θ_{i j}^{(1)} = Θ_{i j}^{(2)}) = E_{Θ_{i j} | M_{\cup}} P (D_{\cup} | Θ_{i j}, M_{\cup}) .

(5)

The ratio S_ij/T_ij of these quantities is a Bayes factor that we can use to quantify the difference in the distribution X_i|π_ij across the two groups of data.
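To illustrate the Bayes factor for a single conditional distribution under identical structures, the sketch below compares the independent parameterization (each group scored separately) with the shared one (pooled counts). It assumes a symmetric Dirichlet prior with all parameters equal to 1, as in K2, applied to all three models; the helper names and the count vectors are made up for illustration.

```python
from math import lgamma

def log_ml(counts, alpha=1.0):
    """Log Dirichlet-multinomial marginal likelihood with a symmetric prior."""
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

def log_bayes_factor(counts_1, counts_2, alpha=1.0):
    """log of (independent-parameter likelihood / shared-parameter likelihood)."""
    pooled = [a + b for a, b in zip(counts_1, counts_2)]
    log_independent = log_ml(counts_1, alpha) + log_ml(counts_2, alpha)
    log_shared = log_ml(pooled, alpha)
    return log_independent - log_shared

# Hypothetical counts N_ijk for one parent configuration in each group.
similar = log_bayes_factor([50, 50], [48, 52])    # near-identical distributions
different = log_bayes_factor([90, 10], [10, 90])  # clearly different distributions
```

With near-identical counts the log Bayes factor is negative (the shared parameterization is favored), and with clearly different counts it is positive.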

Next, let us consider the more general case where the structures of ℳ₁, ℳ₂, and ℳ_⋃ differ, but have consistent ordering, meaning that if X_i is an ancestor of X_j in any one of the networks, X_j cannot be an ancestor of X_i in any other network. Note that constraining the ordering of the variables in a BN does not constrain the space of joint probability distributions that can be represented. In our evaluation we enforce that constraint by learning ℳ_⋃ without order constraints and use the topological order of the learned network to constrain ℳ₁ and ℳ₂. There are many other possible approaches to enforcing these constraints, ranging from obtaining an order a priori to minimizing the number of explicit constraints using an iterative process. Exploring these alternative approaches is outside of the scope of this paper.

In this more general setting, the parent sets of a variable X_i can turn out to be different in the three models, and may have some partial overlap. To handle such overlap, we introduce a new index η as follows: Denote the parent sets of X_i in ℳ₁, ℳ₂, and ℳ_⋃ by $Π_{i}^{(1)}, Π_{i}^{(2)}, Π_{i}^{(\cup)}$ respectively. Let $J_{i}^{(\cdot)}$ be the number of possible configurations of $Π_{i}^{(\cdot)}$ , and enumerate those configurations by j = 1, …, $J_{i}^{(\cdot)}$ . Let $Π_{i}^{\cap}$ denote $Π_{i}^{(1)} \cap Π_{i}^{(2)} \cap Π_{i}^{(\cup)}$ . Let H_i be the number of possible configurations of $Π_{i}^{\cap}$ and enumerate those configurations by η = 1, …, H_i. For example, suppose that in data where all variables are binary, for a variable X₁ we have $Π_{1}^{(\cup)} = \{X_{2}, X_{3}, X_{4}\}, Π_{1}^{(1)} = \{X_{2}, X_{3}, X_{5}\}$ , and $Π_{1}^{(2)} = \{X_{2}, X_{4}, X_{5}\}$ . Then we have that $Π_{1}^{\cap} = \{X_{2}\}$ , and there are two possible configurations η = 1 and η = 2 for this set, corresponding to x₂₁ and x₂₂. Let $J_{i}^{(\cdot)} (η)$ indicate the subset of parent configurations j ∈ {1, …, $J_{i}^{(\cdot)}$ } that are consistent with configuration η. That is, for example, if η = 1 represents x₂₁ in our example, then $J_{1}^{(\cup)} (1)$ is the set of j-values that correspond to the set of parent assignments {(x₂₁, x₃₁, x₄₁), (x₂₁, x₃₁, x₄₂), (x₂₁, x₃₂, x₄₁), (x₂₁, x₃₂, x₄₂)}.
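The grouping of parent configurations by η can be sketched in a few lines. The function name and the 0/1 coding of values are assumptions made for illustration (value 0 below stands for the first value of each variable, such as x₂₁); the parent sets are the ones from the example above.

```python
from itertools import product

def configs_by_eta(parent_set, shared_parents, arity):
    """Group the configurations of parent_set by the configuration (eta)
    of the shared parents that they are consistent with."""
    parent_set = sorted(parent_set)
    shared = sorted(shared_parents)
    groups = {}
    for values in product(*(range(arity[p]) for p in parent_set)):
        config = dict(zip(parent_set, values))
        eta = tuple(config[p] for p in shared)
        groups.setdefault(eta, []).append(values)
    return groups

arity = {"X2": 2, "X3": 2, "X4": 2, "X5": 2}
pa_union = {"X2", "X3", "X4"}
pa_1 = {"X2", "X3", "X5"}
pa_2 = {"X2", "X4", "X5"}
pa_cap = pa_union & pa_1 & pa_2          # the shared parents
j_union = configs_by_eta(pa_union, pa_cap, arity)
```

Here `pa_cap` contains only X2, so there are two values of η, and `j_union[(0,)]` lists the four configurations of (X2, X3, X4) in which X2 takes its first value.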

We can then compare the marginal likelihood of modeling the entire parameter set indexed by η as independent

S_{i η} = P (D_{1}, D_{2} | Θ_{i η}^{(1)} ⊥ Θ_{i η}^{(2)}) = (\prod_{j \in J_{i}^{(1)} (η)} E_{Θ_{i j} | M_{1}} P (D_{1} | Θ_{i j}, M_{1})) (\prod_{j \in J_{i}^{(2)} (η)} E_{Θ_{i j} | M_{2}} P (D_{2} | Θ_{i j}, M_{2}))

(6)

to the marginal likelihood of modeling the parameter set indexed by η as identical

T_{i η} = P (D_{1}, D_{2} | Θ_{i η}^{(1)} = Θ_{i η}^{(2)}) = \prod_{j \in J_{i}^{(\cup)} (η)} E_{Θ_{i j} | M_{\cup}} P (D_{\cup} | Θ_{i j}, M_{\cup}) .

(7)

In the case of identical structures, S_iη and T_iη are equivalent to S_ij and T_ij.
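A sketch of how S_iη and T_iη in Eqs. (6) and (7) might be computed from data follows, assuming K2 priors, rows stored as dictionaries, and a hypothetical toy example in which X1 has parents {X2, X3} in the first group's model and {X2} in the other two models. None of the names or numbers come from the evaluation.

```python
from collections import Counter
from math import lgamma

def log_ml(counts, alpha=1.0):
    """Log of Eq. (3) with a symmetric Dirichlet prior (alpha = 1 gives K2)."""
    k, n = len(counts), sum(counts)
    return (lgamma(k * alpha) - lgamma(k * alpha + n)
            + sum(lgamma(alpha + c) - lgamma(alpha) for c in counts))

def log_eta_score(rows, child, parents, shared, eta, arity):
    """Sum of log marginal likelihoods over the parent configurations j
    that agree with configuration eta of the shared parents."""
    tables = {}
    for r in rows:
        if tuple(r[p] for p in shared) != eta:
            continue
        key = tuple(r[p] for p in parents)
        tables.setdefault(key, Counter())[r[child]] += 1
    # Configurations with no data contribute log(1) = 0 and are skipped.
    return sum(log_ml([t[v] for v in range(arity[child])]) for t in tables.values())

# Hypothetical data: in group 1, X1 copies X3; in group 2, X1 is always 0.
d1 = [{"X1": v, "X2": 0, "X3": v} for v in (0, 1) for _ in range(20)]
d2 = [{"X1": 0, "X2": 0, "X3": v} for v in (0, 1) for _ in range(20)]
arity = {"X1": 2, "X2": 2, "X3": 2}
shared, eta = ["X2"], (0,)

log_S = (log_eta_score(d1, "X1", ["X2", "X3"], shared, eta, arity)
         + log_eta_score(d2, "X1", ["X2"], shared, eta, arity))   # Eq. (6)
log_T = log_eta_score(d1 + d2, "X1", ["X2"], shared, eta, arity)  # Eq. (7)
```

Because the two toy groups differ in how X1 behaves, `log_S` comes out well above `log_T`, so the independent parameterization is favored for this η.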

One interesting and useful task is the detection of differences in the distributions when only a few parameters (out of many) differ between the two groups. We can use the marginal likelihoods derived above to obtain a measure that is sensitive to the presence of changes in only some conditional distributions of a variable X_i, while other conditional distributions may indeed be identical across groups. Particularly, we can compute the posterior odds of seeing a difference in X_i as follows:

O_{i} = \frac{1 - P (Θ_{i}^{(1)} = Θ_{i}^{(2)} | D_{1}, D_{2})}{P (Θ_{i}^{(1)} = Θ_{i}^{(2)} | D_{1}, D_{2})} = (\prod_{η = 1}^{H_{i}} \frac{1}{P (Θ_{i η}^{(1)} = Θ_{i η}^{(2)} | D_{1}, D_{2})}) - 1 .

(8)

Since the η-level is defined to be the finest level at which parameters can be compared across the two groups, we consider only the two cases of Θ_iη either being independent for the two groups or being identical for the two groups. By introducing priors for these two cases we are able to compute Eq. (8). Let $p_{i η} = P (Θ_{i η}^{(1)} = Θ_{i η}^{(2)})$ denote the prior probability that the distribution of X_i|π_iη is the same across the two groups. Then we have that

\frac{1}{P (Θ_{i η}^{(1)} = Θ_{i η}^{(2)} | D_{1}, D_{2})} = \frac{S_{i η} (1 - p_{i η}) + T_{i η} p_{i η}}{T_{i η} p_{i η}} .

(9)
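As a sketch of the intermediate step behind Eq. (9), write A for the event $Θ_{i η}^{(1)} = Θ_{i η}^{(2)}$ (the symbol A is introduced here only for illustration). With the two cases considered above, Bayes' rule gives

```latex
P(A \mid D_{1}, D_{2})
  = \frac{P(D_{1}, D_{2} \mid A)\, P(A)}
         {P(D_{1}, D_{2} \mid A)\, P(A) + P(D_{1}, D_{2} \mid \neg A)\, P(\neg A)}
  = \frac{T_{i η}\, p_{i η}}{T_{i η}\, p_{i η} + S_{i η}\, (1 - p_{i η})}
```

and taking the reciprocal yields Eq. (9).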

Plugging Eq. (9) into Eq. (8) gives

O_{i} = (\prod_{η = 1}^{H_{i}} (\frac{S_{i η} (1 - p_{i η})}{T_{i η} p_{i η}} + 1)) - 1 .

(10)

In the absence of information that would lead one to expect differences in some parameters more than in others, the priors p_iη can be related to the prior probability p_i of seeing no difference in the conditional distribution of variable X_i by the relation $p_{i η} = p_{i}^{1 / H_{i}}$ .
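Since the marginal likelihoods are typically available only as logarithms, Eq. (10) is convenient to evaluate in log space. The sketch below returns log(1 + O_i), which is monotone in O_i and therefore ranks variables identically; the function names are illustrative, and the prior p_i is assumed to lie strictly between 0 and 1.

```python
from math import exp, expm1, log, log1p

def log1pexp(x):
    """Numerically stable log(1 + exp(x))."""
    return x + log1p(exp(-x)) if x > 0 else log1p(exp(x))

def log1p_posterior_odds(log_s, log_t, p_i):
    """log(1 + O_i) from Eq. (10), with p_i_eta = p_i ** (1 / H_i).
    log_s and log_t hold log S_i_eta and log T_i_eta for eta = 1..H_i."""
    h = len(log_s)
    p_eta = p_i ** (1.0 / h)
    log_prior_odds = log(1.0 - p_eta) - log(p_eta)
    return sum(log1pexp(ls - lt + log_prior_odds) for ls, lt in zip(log_s, log_t))

# With S = T for every eta, the data are uninformative and the posterior odds
# should equal the prior odds (1 - p_i) / p_i, here 1 for p_i = 0.5.
odds_one_eta = expm1(log1p_posterior_odds([0.0], [0.0], 0.5))
odds_two_eta = expm1(log1p_posterior_odds([0.0, 0.0], [0.0, 0.0], 0.5))
```

The second call illustrates why the relation $p_{i η} = p_{i}^{1 / H_{i}}$ is a natural choice: splitting a variable into more conditional distributions leaves its prior odds of a difference unchanged.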

The same approach can be extended to obtain posterior odds of observing a difference in any parameter of the model, expressed as

O = \frac{1 - P (Θ^{(1)} = Θ^{(2)} | D_{1}, D_{2})}{P (Θ^{(1)} = Θ^{(2)} | D_{1}, D_{2})} = (\prod_{i = 1}^{n} \prod_{η = 1}^{H_{i}} (\frac{S_{i η} (1 - p_{i η})}{T_{i η} p_{i η}} + 1)) - 1 .

(11)

Using Eq. (11) entails that the prior for seeing no difference between the two groups is $p = \prod_{i = 1}^{n} \prod_{η = 1}^{H_{i}} p_{i η}$ . Given such an overall prior p, a natural choice for non-informative priors is $p_{i η} = p^{1 / (n H_{i})}$ : this choice of priors assumes that we are equally and independently likely to see a difference in each variable, and equally and independently likely to see a difference in each conditional probability distribution of each variable.
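As a quick check, under the choice $p_{i η} = p^{1 / (n H_{i})}$ the product of the local priors recovers the overall prior:

```latex
\prod_{i = 1}^{n} \prod_{η = 1}^{H_{i}} p^{1 / (n H_{i})}
  = \prod_{i = 1}^{n} \left( p^{1 / (n H_{i})} \right)^{H_{i}}
  = \prod_{i = 1}^{n} p^{1 / n}
  = p .
```

The implied per-variable prior is $p_{i} = p^{1 / n}$, which agrees with the relation $p_{i η} = p_{i}^{1 / H_{i}}$.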

4 Evaluation

We evaluated the performance of the posterior odds O_i in Eq. (10) as a score for detecting variable-level differences. Next we describe the baseline method against which we compared our method and the experimental setup, followed by the experimental results.

4.1 Baseline Method

As mentioned in the introduction, to our knowledge there is no prior work that addresses the difference detection problem in the same manner as our approach: a variable-based analysis, accounting for multivariate relationships, identifying variable-level differences, and requiring no informative prior knowledge. As a result, we chose to simulate a process often followed by analysts and researchers, where logistic regression models with interactions are constructed to predict a variable X_i using candidate predictors, and the researcher would judge a predictor's relevance based on the strength of its corresponding weight.

For this purpose, we use lasso-regularized logistic regression [7], which maximizes an ℒ₁-regularized log-likelihood of a logistic model, where the strength of the regularization is modulated by a parameter λ. The effect of regularization is that as λ decreases from +∞, predictors enter the model (their coefficients in the logistic model become nonzero). To detect variable-wise differences across two pre-defined groups using lasso-regularized logistic regression, we take the data from the two groups and add a group-indicator variable Z to the data. The group indicator Z is a binary variable that takes the value 1 for cases coming from one of the groups, and the value 0 for cases coming from the other group. A regression model is built for predicting each variable X_i given all the other data variables X_j : 1 ≤ j < i that precede it in the variable ordering (we provide an ordering from a true generating model for the purposes of this evaluation), the group indicator Z, and interactions of Z with each of the data variables X_j. Non-binary variables were handled by using multinomial logistic regression for the target and binary coding for input variables. The largest value of λ at which a given predictor becomes nonzero can then be used as a score of how useful that predictor is for predicting X_i. Hence, for each X_i we can use the largest λ that corresponds to a nonzero coefficient in the logistic model for Z or an interaction with Z as the score for seeing a difference in the distribution of X_i across groups.
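The bookkeeping of the baseline can be sketched as follows for binary variables coded 0/1. Fitting the regularization path itself is left to a lasso solver and is not shown; the `path` dictionary below, which maps each λ to a fitted coefficient vector, is a hypothetical stand-in with made-up values, and the function names are illustrative.

```python
def design_row(x, i, z):
    """Predictors for target X_i (0-based index i): the preceding variables,
    the group indicator Z, and the Z-by-X_j interactions."""
    preceding = list(x[:i])
    return preceding + [z] + [z * v for v in preceding]

def difference_score(path, i):
    """Largest lambda at which Z or a Z-interaction has a nonzero coefficient.
    Coefficient vectors are aligned with the columns of design_row."""
    group_terms = range(i, 2 * i + 1)  # positions of Z and the Z*X_j terms
    active = [lam for lam, coefs in path.items()
              if any(coefs[t] != 0.0 for t in group_terms)]
    return max(active, default=0.0)

row = design_row([1, 0, 1], i=2, z=1)  # columns: X_0, X_1, Z, Z*X_0, Z*X_1

# Hypothetical path: predictors enter the model as lambda decreases.
path = {
    0.50: [0.0, 0.0, 0.0, 0.0, 0.0],
    0.20: [0.3, 0.0, 0.0, 0.0, 0.0],
    0.10: [0.3, 0.0, 0.2, 0.0, 0.0],  # Z enters here
    0.05: [0.3, 0.1, 0.2, 0.0, 0.4],
}
score = difference_score(path, i=2)
```

In this made-up path the group indicator first becomes nonzero at λ = 0.10, so that value is the difference score for the target variable.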

4.2 Data and Experimental Setup

Since in real-world data the differences between groups of data are not known in advance, for the evaluation we generated pairs of data groups from known distributions that are based on real-world data. We chose to learn networks from which to generate data because publicly available BN models are overwhelmingly diagnostic, meaning that they often contain many hidden variables, whereas we would like to have a ground-truth model that directly relates observed variables to each other. We picked data where all variables are categorical, since the BD score is designed for BNs that represent multinomial distributions. In this evaluation we used the balance-scale, car, hayes-roth, and nursery datasets available from the UCI Machine Learning Repository [1]. We learned a BN from the data for each of these sets, which is referred to as the “original BN” in the following description of the data-generation process.

We ran 72 blocks of tests, where each block is characterized by a data source (one of the UCI datasets), a type of perturbation, the number of perturbations, and the number of samples per group. Each block consists of 20 group pairs, where each pair consists of a group of samples generated from the original BN of the data source and a group of samples generated from a perturbed BN of the data source (a different perturbed BN is obtained for each group pair). The perturbed BN was obtained by applying perturbations to the original BN. There were two categories of perturbations: parametric perturbations and structural perturbations. A parametric perturbation was performed by uniformly randomly selecting a variable X_i to perturb, selecting for it a conditional distribution X_i|π_ij to perturb, and then replacing its probability mass vector with a permutation of itself. A structural perturbation was performed by randomly (with probability 1/2) deciding whether to add or remove an arc, and then selecting a random arc to add (or remove) from the absent (or existing) arcs in the network. A node (variable) is considered perturbed by a structural perturbation only if an arc into the node is added or removed.
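A parametric perturbation of the kind described above can be sketched as follows. The CPT layout and the toy network are hypothetical, and reshuffling until the vector actually changes is an assumption of this sketch; the description above only states that the vector is replaced with a permutation of itself.

```python
import random

def parametric_perturbation(cpt, rng):
    """Pick a variable, then one of its conditional distributions, and permute
    that probability vector in place. cpt[var][config] is a list of probabilities."""
    var = rng.choice(sorted(cpt))
    config = rng.choice(sorted(cpt[var]))
    original = cpt[var][config]
    permuted = original[:]
    # Assumption: reshuffle until the vector differs, when that is possible.
    if len(set(original)) > 1:
        while permuted == original:
            rng.shuffle(permuted)
    cpt[var][config] = permuted
    return var, config

# Hypothetical two-node network X1 -> X2 with made-up probabilities.
cpt = {
    "X1": {(): [0.7, 0.3]},
    "X2": {(0,): [0.9, 0.1], (1,): [0.4, 0.6]},
}
before = {v: {c: p[:] for c, p in table.items()} for v, table in cpt.items()}
var, config = parametric_perturbation(cpt, random.Random(0))
```

The perturbed distribution keeps the same probability values, so it remains a valid distribution; only the assignment of probabilities to values changes.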

We provide the ordering of the variables in the generating model to the logistic regression method so that it may take advantage of that information. We do not provide this information to our method in the tests reported in Table 1.

Table 1.

Table of AUCs obtained from 72 blocks of tests. The first column indicates the data source for each block, the second column indicates whether the perturbation introduced was structural (Struct.) or parametric (Param.), and the third column indicates the number of perturbations. AUCs that were statistically significantly higher at the α = 0.05 level are shown in bold.

			500 samples per group			1000 samples per group			5000 samples per group
Data source	Perturbation	Count	O_i AUC	λ AUC	p-value	O_i AUC	λ AUC	p-value	O_i AUC	λ AUC	p-value
balance-scale	Param.	1	0.8931	0.7031	0.0074	0.9444	0.7981	0.0303	0.9563	0.8056	0.0542
balance-scale	Param.	3	0.8303	0.6457	0.0007	0.7895	0.6675	0.0336	0.8936	0.7319	0.0024
balance-scale	Param.	5	0.9038	0.7217	0.0003	0.8780	0.7733	0.0248	0.8973	0.7827	0.0131
balance-scale	Struct.	1	0.9956	0.9788	0.1481	0.9981	0.9888	0.2029	1.0000	0.9981	0.3850
balance-scale	Struct.	3	0.9868	0.9836	0.7843	0.9964	0.9804	0.1088	0.9992	0.9892	0.1684
balance-scale	Struct.	5	0.9825	0.9721	0.4884	0.9958	0.9812	0.1509	0.9921	0.9975	0.4761
car	Param.	1	0.6587	0.5033	0.0282	0.7171	0.5744	0.0008	0.7221	0.6477	0.0604
car	Param.	3	0.7544	0.6296	0.0026	0.7981	0.6595	<10⁻⁴	0.8260	0.7136	0.0001
car	Param.	5	0.7540	0.6149	<10⁻⁴	0.7567	0.6867	0.0265	0.8079	0.7209	0.0018
car	Struct.	1	0.9229	0.8525	0.1168	0.9367	0.9096	0.6049	0.9788	0.9442	0.0696
car	Struct.	3	0.8272	0.8952	0.0439	0.8739	0.8920	0.6019	0.9606	0.9572	0.8760
car	Struct.	5	0.7774	0.8891	0.0034	0.8695	0.9340	0.0593	0.9043	0.9538	0.1347
hayes-roth	Param.	1	0.7256	0.6944	0.6718	0.7750	0.7375	0.5904	0.6713	0.6275	0.5563
hayes-roth	Param.	3	0.7648	0.5889	0.0012	0.7210	0.5990	0.0277	0.6973	0.5978	0.0634
hayes-roth	Param.	5	0.7560	0.6766	0.1211	0.7490	0.7088	0.4394	0.7331	0.6830	0.2590
hayes-roth	Struct.	1	0.7763	0.8844	0.0401	0.8081	0.9137	0.0047	0.8344	0.9238	0.0058
hayes-roth	Struct.	3	0.8814	0.8998	0.6295	0.9387	0.9026	0.2959	0.8666	0.9131	0.2554
hayes-roth	Struct.	5	0.8546	0.9479	0.0137	0.9214	0.9505	0.1106	0.9136	0.9674	0.0297
nursery	Param.	1	0.6234	0.6153	0.9283	0.6644	0.6247	0.6983	0.8263	0.6931	0.0499
nursery	Param.	3	0.6460	0.5845	0.2978	0.7179	0.5596	0.0035	0.6833	0.5852	0.0987
nursery	Param.	5	0.6156	0.6105	0.9226	0.6149	0.5776	0.4796	0.7198	0.5755	0.0032
nursery	Struct.	1	0.5897	0.7984	0.0035	0.7781	0.8469	0.2932	0.8534	0.8372	0.7410
nursery	Struct.	3	0.7686	0.8148	0.2443	0.8487	0.8527	0.8924	0.9251	0.8895	0.2596
nursery	Struct.	5	0.8142	0.8326	0.5419	0.8125	0.8345	0.4145	0.9270	0.8919	0.1333

4.3 Results

Table 1 shows areas under receiver operating characteristic curves (AUCs) of perturbation detection obtained using the posterior odds O_i, compared to AUCs obtained using the λ-based score from lasso-regularized logistic regression, for data group pairs generated from the respective data sources. The table also shows the p-value for a two-tailed test of the difference between the AUCs of the two methods, based on [6]. Of a total of 72 blocks of tests, in 53 the O_i AUC is higher than the λ AUC. At the α = 0.05 significance level, the O_i AUCs are statistically significantly better than the λ AUCs in 21 test blocks, whereas the O_i AUC is statistically significantly worse than the λ AUC in only eight test blocks. The p-value of a two-sided paired Wilcoxon signed rank test on the AUCs is less than 10⁻⁴, supporting the conclusion that the overall better performance of O_i is not due to chance.
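For reference, an AUC of the kind reported here can be computed directly from per-variable scores and ground-truth perturbation labels. The sketch below uses the pairwise-comparison form of the AUC; the function name and the example scores are made up, and the significance test of [6] is not reproduced.

```python
def auc(scores, labels):
    """Probability that a perturbed variable receives a higher score than an
    unperturbed one, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one perturbed and one unperturbed variable")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four variables; label 1 marks a perturbed variable.
example = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In this toy example three of the four perturbed/unperturbed pairs are ordered correctly, giving an AUC of 0.75. Because only the ordering of scores matters, any monotone transformation of O_i yields the same AUC.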

Every case where the posterior odds perform statistically significantly worse than the regression-based method is a case of a structural perturbation, and we suspect that this is because perturbed structure is more difficult to recover with no order information. In a different series of tests where we provided order information to the posterior-odds-based method, of a total of 72 blocks of tests, in 62 the O_i AUC was higher than the λ AUC. At the α = 0.05 significance level, the O_i AUC was statistically significantly better than the λ AUC in 43 test blocks, and worse in only one block.

As is typical for statistical methods, we see better performance for data with more samples as well as for lower-dimensional data. The results also suggest that structural differences are easier to detect than parametric ones. We believe that this is because a structural difference reflects a more substantial distributional difference than a simple parametric one, since it can be expressed as a collection of parametric differences in a network containing the removed or added arcs. Overall, our experiments show consistently good AUC for the O_i score over the various generated group pairs.

5 Discussion

We introduced a novel variable-based approach for identifying statistical differences across a pair of groups. Evaluation of the approach on simulated data showed good performance compared to a logistic lasso baseline. The data used in the evaluation is low-dimensional because the logistic lasso baseline scales poorly to many dimensions. The most computationally demanding step in our approach is learning the three network structures. Consequently, our method scales to more dimensions to the extent that the BN structure learning algorithm used with it does. Any structure search strategy that maximizes a Bayesian Dirichlet score is a good fit for our method.

For Bayesian networks with Bayesian Dirichlet priors, we showed how to compute the posterior odds that a given variable has different distribution across the two groups, as well as the posterior odds that the two groups are different overall. The property that enables this is parameter independence in the BD framework. This approach can be applied to other models as well. The distribution of a variable in the BN formulation is simply a grouping of finer-level model parameters. Hence, any model that has similar groupings of parameters in a framework where parameter independence holds can be used with this approach.

Identification of variable-level differences across groups of multivariate data is useful in many application areas. The method presented here considers differences over the sets of relationships that are present in the MAP models constructed for modeling the groups as independent vs. identical. Particularly, for settings in which the typical approaches in practice tend to be univariate analyses and ad-hoc exploration of relationships that are suspected to be important a priori, we present a more systematic approach.

Acknowledgments

This research was supported by grant IIS-0911032 from the National Science Foundation and grant T15 LM007359 from the National Library of Medicine.

Contributor Information

Yuriy Sverchkov, Email: yuriy@biostat.wisc.edu.

Gregory F. Cooper, Email: gfc@pitt.edu.

References

1. Bache K, Lichman M. UCI machine learning repository. 2013. http://archive.ics.uci.edu/ml
2. Bay SD, Pazzani MJ. Detecting group differences: mining contrast sets. Data Min Knowl Disc. 2001;5(3):213–246.
3. Chickering DM. Learning Bayesian networks is NP-complete. In: Fisher D, Lenz HJ, editors. Learning from Data. Lecture Notes in Statistics. Vol. 112. Springer; Heidelberg: 1996. pp. 121–130.
4. Cooper GF, Herskovits E. A Bayesian method for the induction of probabilistic networks from data. Mach Learn. 1992;9(4):309–347.
5. Daly R, Shen Q, Aitken S. Learning Bayesian networks: approaches and issues. Knowl Eng Rev. 2011;26(2):99–157.
6. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
7. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33(1):1–22.
8. Heckerman D. A tutorial on learning with Bayesian networks. In: Jordan MI, editor. Learning in Graphical Models. MIT Press; Cambridge: 1999. pp. 301–354.
9. Heckerman D, Geiger D, Chickering DM. Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn. 1995;20:197–243.
10. Koivisto M, Sood K. Exact Bayesian structure discovery in Bayesian networks. J Mach Learn Res. 2004;5:549–573.
11. Novak PK, Lavrač N, Webb GI. Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res. 2009;10:377–403.
12. Silander T, Myllymaki P. A simple approach for finding the globally optimal Bayesian network structure. In: Dechter R, Richardson T, editors. Proceedings of the Twenty-second Annual Conference on Uncertainty in Artificial Intelligence (UAI 2006). AUAI Press; 2006. pp. 445–452.
13. Spirtes P, Glymour C. An algorithm for fast recovery of sparse causal graphs. Soc Sci Comput Rev. 1991;9(1):62–72.
14. Yuan C, Malone B, Wu X. Learning optimal Bayesian networks using A* search. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011). Helsinki, Finland: 2011. pp. 2186–2191.

