Published in final edited form as: Bernoulli (Andover). 2020 Nov 20;27(1):637–672. doi: 10.3150/20-BEJ1253

Bayesian graph selection consistency under model misspecification

Yabo Niu, Debdeep Pati, Bani K. Mallick

Abstract

Gaussian graphical models are a popular tool to learn the dependence structure among variables of interest in the form of a graph. Bayesian methods have gained popularity in the last two decades due to their ability to simultaneously learn the covariance and the graph. There is a wide variety of model-based methods to learn the underlying graph, assuming various forms of the graphical structure. Although decomposability is commonly imposed on the graph space for scalability of the Markov chain Monte Carlo algorithms, its possible implication for the posterior distribution of the graph is not clear. An open problem in Bayesian decomposable structure learning is whether the posterior distribution is able to select a meaningful decomposable graph that is “close” to the true non-decomposable graph, when the dimension of the variables increases with the sample size. In this article, we explore specific conditions on the true precision matrix and the graph which result in an affirmative answer to this question, using a commonly used hyper-inverse Wishart prior on the covariance matrix and a suitable complexity prior on the graph space. In the absence of structural sparsity assumptions, our strong selection consistency holds in a high-dimensional setting where p = O(n^α) for α < 1/3. We show that when the true graph is non-decomposable, the posterior distribution concentrates on a set of graphs that are minimal triangulations of the true graph.

Keywords: decomposable graph, Gaussian graphical model, graph selection consistency, hyper-inverse Wishart distribution, minimal triangulation, model misspecification, partial correlation

1. Introduction

Graphical models provide a framework for describing statistical dependencies in (possibly large) collections of random variables [27]. In this article, we revisit the well-known problem of inference on the underlying graph using observed data from a Bayesian point of view. Research on Bayesian inference for natural exponential families and the associated conjugate priors, called Diaconis–Ylvisaker (DY) priors, was pioneered by [13] and has had a profound impact on the development of Bayesian Gaussian graphical models. Consider independent and identically distributed vectors Y1, Y2, … , Yn drawn from a p-variate normal distribution with mean vector 0 and a sparse inverse covariance matrix Ω. The sparsity pattern in Ω can be encoded in terms of a graph G on the set of variables as follows: if the variables i and j do not share an edge in G, then Ωij = 0. Hence, an undirected (or concentration) graphical model corresponding to G restricts the inverse covariance matrix Ω to a linear subspace of the cone of positive definite matrices.

A probabilistic framework for learning the dependence structure and the graph G requires specification of a prior distribution for (Ω, G). Conditional on G, a hyper-inverse Wishart (HIW) distribution [11] on Σ = Ω−1 and the corresponding induced class of distributions on Ω [36] are attractive choices of DY priors. A rich family of conjugate priors that subsumes the DY class was developed by [29]. Bayesian procedures corresponding to these Letac–Massam priors have been derived in a decision-theoretic framework in the recent work of [33]. The key component of Bayesian structure learning is the specification of a prior distribution on the space of graphs. There is a need for a flexible but tractable family of such priors, capable of representing a variety of prior beliefs about the conditional independence structure. During the last two decades, there has been a growing literature on the development of HIW priors focusing on decomposable graphs [9,10,19,20] and their non-decomposable counterparts [2,12,25,31,37,41]. Although deemed a restrictive choice in the space of graphs, HIW priors on decomposable graphs continue to be widely used in the interest of tractability and scalability. In this paper, we focus on HIW priors on decomposable graphs, as this construction enjoys many advantages, such as computational efficiency due to its conjugate formulation and exact calculation of marginal likelihoods [38]. Stochastic search algorithms have empirically demonstrated good practical performance in these models. For detailed descriptions and comparisons of various Bayesian computational methods in these scenarios, see [15,24].

There has been a growing literature on model selection consistency in Gaussian graphical models from a frequentist point of view [16,30,34,42]. Beyond the literature on Gaussian graphical models, there is a substantial body of frequentist work on estimating high-dimensional covariance matrices, with rates of convergence of various regularized covariance estimators derived in [6,7,17,26], among others. There is a relatively smaller literature on asymptotic properties of Bayesian procedures for covariance or precision matrices in graphical models; refer to [3,4]. Although the methodology in [4] can be used for graph selection, the theoretical results in [3,4] focus on achieving an optimal rate of posterior convergence for the covariance and precision matrices. The literature on graph selection consistency in a Bayesian paradigm is surprisingly sparse [8,18,28]. In the context of decomposable graphs, the only article we are aware of is [18], which considered the behavior of Bayesian procedures that perform model selection for decomposable Gaussian graphical models. However, the analysis is restricted to the fixed dimensional regime and involves the behavior of the marginal likelihood ratios between graphs differing by only one edge. Although Bayes factors in the general case can be decomposed into Bayes factors between graphs that differ by one edge, it is not possible to derive the consistency results by simple aggregation of Bayes factors for single edge moves. Furthermore, in high dimensions the number of single edge moves involved in the aggregation may increase with the graph size. In such cases, the consistency results necessitate more restrictive assumptions on the growth of the number of edges. For general graph selection consistency within a Bayesian framework, refer to the very recent articles [8,28] in the context of Gaussian directed acyclic graph (DAG) models.

In this article, focusing on the hyper-inverse Wishart g-prior [10] on the covariance matrix and a complexity prior on the graph, we derive sufficient conditions for strong selection consistency when p = O(n^α) with α < 1/3, considering both the case when the true graph is decomposable and when it is not. The key conditions are precise upper and lower bounds on the partial correlation coefficients and a suitable complexity prior on the space of graphs. We emphasize here that we do not need conditions to be verified on all subgraphs – all assumptions are easy to understand and relatively straightforward to verify. We find that the HIW g-prior places a heavy penalty on missing true edges (false negatives), but a comparatively smaller penalty on adding false edges (false positives). Hence, in the high-dimensional regime a carefully chosen complexity prior on the graph space is needed to penalize false positives and achieve strong consistency.

In the well-specified case, the hierarchical model used here is a special case of that in [8], since the hyper-inverse Wishart prior is a special case of the DAG-Wishart prior proposed in [5] under perfect DAGs. However, the assumptions in this paper are distinctly different from those stated in [8]. In particular, our assumptions are on the magnitude of the elements of the partial correlation matrix rather than on the eigenvalues of the covariance matrix as in [8]. Also, the main focus of this article is to study graph selection consistency under model misspecification, which cannot be addressed within a DAG framework. To the best of our knowledge, this is the first paper to show strong selection consistency under HIW priors for high-dimensional graphs under model misspecification. In particular, we show that the posterior concentrates on decomposable graphs which are in some sense closest to the true non-decomposable graph. Interestingly, the pairwise Bayes factors between such graphs are stochastically bounded. Our result under model misspecification is inspired by [18], but extends to the case when p grows with n and provides a rigorous proof of the convergence of the posterior distribution to the class of decomposable graphs which are closest to the true one. We also present a detailed simulation study for both the well-specified and the misspecified case, which provides empirical justification for some of our technical results.

En route, we develop precise bounds for the Bayes factor in favor of an alternative graph with respect to the true graph. The main proof technique is a combination of (a) localization, which breaks down the Bayes factor between any two graphs into local moves, that is, additions and deletions of one edge, using the decomposable graph chain rule [27]; and (b) correlation association, which converts the Bayes factor between two graphs differing by an edge into a suitable function of sample partial correlations. By developing sharp concentration inequalities and tail bounds for sample partial correlations, we obtain bounds for ratios of local marginal likelihoods which are then combined to yield strong selection consistency results.

The remaining part of the paper is organized as follows. In Section 2, we introduce the necessary background and notations. Section 3 introduces the model with the HIW prior. Section 4 describes the main results on pairwise posterior ratio consistency and consistent graph selection when the true graph is decomposable. In Section 4, the results are presented progressively as follows. First, we provide a non-asymptotic sharp upper bound for pairwise Bayes factors. Next, we state the main theorem on posterior ratio consistency when p diverges with n, where p is of the order n^α for α < 1/2. Finally, we state the main theorem on strong graph selection consistency, which further requires α < 1/3. Section 5 states the main results on consistent graph selection under model misspecification and results on the equivalence of minimal triangulations. Numerical experiments are presented in Section 6, followed by a discussion in Section 7.

2. Preliminaries

In this section, we define a collection of notations required to describe the model and the prior. Section 2.1 introduces sample and population correlations and partial correlations, Section 2.2 sets up the notations for undirected graphs and briefly introduces the definitions and properties associated with decomposable graphs. Section 2.3 contains matrix abbreviations and notations used throughout the paper.

2.1. Correlation and partial correlation

Let Xp = (X1, X2, … , Xp)T denote a random vector which follows a p-dimensional Gaussian distribution and x(1), x(2), … , x(n) denote n independent and identically distributed (i.i.d.) sample observations from Xp. Clearly, the n × p matrix formed by augmenting the n-dimensional column vectors xi, denoted by (x1, x2, … , xp), is the same as (x(1), x(2), … , x(n))T, and x̄i = n−1 1nT xi, i = 1, 2, … , p. Here 1n is an n-dimensional vector with all ones. Let In denote the n × n identity matrix and E denote the expectation with respect to the Gaussian random variable Xp.

Definition 2.1 (Population correlation coefficient). The population correlation coefficient between Xi and Xj, 1 ≤ i, j ≤ p, is defined as

ρij = σij / √(σii σjj),

where σii = E(Xi − EXi)² and σij = E{(Xi − EXi)(Xj − EXj)}.

Definition 2.2 (Sample/Pearson correlation coefficient). The sample correlation coefficient between Xi and Xj, 1 ≤ i, j ≤ p, is defined as

ρ̂ij = σ̂ij / √(σ̂ii σ̂jj),

where σ̂ii = (xi − x̄i1n)T(xi − x̄i1n)/n and σ̂ij = (xi − x̄i1n)T(xj − x̄j1n)/n.

Definition 2.3 (Population partial correlation coefficient). Let S = {i1, i2, … , i|S|}, where 1 ≤ i1, i2, … , i|S| ≤ p and |S| is the cardinality of the set S. Define XS = (Xi1, Xi2, … , Xi|S|)T. The population partial correlation coefficient between Xi and Xj, where i, j ∉ S and 1 ≤ i, j ≤ p, holding XS fixed is defined as

ρij|S = σij|S / √(σii|S σjj|S),

where σii|S = σii − σSiT σSS−1 σSi and σij|S = σij − σSiT σSS−1 σSj, with σSi = E{(XS − EXS)(Xi − EXi)} and σSS = E{(XS − EXS)(XS − EXS)T}.

Definition 2.4 (Sample partial correlation coefficient). Define xS = (xi1, xi2, … , xi|S|). The sample partial correlation coefficient between Xi and Xj, where i, j ∉ S and 1 ≤ i, j ≤ p, holding XS fixed is defined as

ρ̂ij|S = σ̂ij|S / √(σ̂ii|S σ̂jj|S),

where σ̂ii|S = σ̂ii − σ̂SiT σ̂SS−1 σ̂Si and σ̂ij|S = σ̂ij − σ̂SiT σ̂SS−1 σ̂Sj, with σ̂Si = (xS − x̄S)T(xi − x̄i1n)/n, σ̂SS = (xS − x̄S)T(xS − x̄S)/n and x̄S = (x̄i1 1n, … , x̄i|S| 1n).
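These formulas translate directly into code. The following is a small numpy sketch (ours, not from the paper) computing ρ̂ij|S via the Schur-complement formulas of Definition 2.4 and cross-checking it against the inverse of the full sample covariance matrix; the covariance used to generate X is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.multivariate_normal(np.zeros(p), np.eye(p) + 0.5, size=n)  # n i.i.d. rows

def sample_partial_corr(X, i, j, S):
    # hat{rho}_{ij|S} via the Schur-complement formulas in Definition 2.4
    xc = X - X.mean(axis=0)                  # subtract bar{x}_k from each column
    Sig = xc.T @ xc / X.shape[0]             # all hat{sigma}_{kl} at once
    w = np.linalg.solve(Sig[np.ix_(S, S)],
                        np.column_stack([Sig[S, i], Sig[S, j]]))
    s_ii = Sig[i, i] - Sig[S, i] @ w[:, 0]   # hat{sigma}_{ii|S}
    s_jj = Sig[j, j] - Sig[S, j] @ w[:, 1]   # hat{sigma}_{jj|S}
    s_ij = Sig[i, j] - Sig[S, i] @ w[:, 1]   # hat{sigma}_{ij|S}
    return s_ij / np.sqrt(s_ii * s_jj)

# With S = all remaining nodes, this agrees with -K_ij / sqrt(K_ii K_jj),
# where K is the inverse of the full sample covariance matrix.
K = np.linalg.inv((X - X.mean(axis=0)).T @ (X - X.mean(axis=0)) / n)
print(sample_partial_corr(X, 0, 1, [2, 3]))
print(-K[0, 1] / np.sqrt(K[0, 0] * K[1, 1]))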

2.2. Undirected decomposable graphs

Denote an undirected graph by G = (V, E) with a vertex set V = {1, 2, … , p} and an edge set E = {(r, s) : ers = 1, 1 ≤ r < s ≤ p}, with ers = 1 if the edge (r, s) is present in G and 0 otherwise. For the purpose of a self-contained exposition, we first review some basic terminology of graph theory. A path of length k in G from vertex u to v is a sequence of distinct vertices u = v0, v1, … , vk−1, vk = v such that (vi−1, vi) ∈ E for all i = 1, 2, … , k. The path is a k-cycle if the end points coincide, u = v. If there is a path from u to v, then we say u and v are connected. A subset S ⊆ V is said to be a uv-separator if all paths from u to v intersect S. The subset S is said to separate A from B if it is a uv-separator for every u ∈ A, v ∈ B. A chord of a cycle is a pair of vertices that are not consecutive on the cycle but are adjacent in G. A graph is complete if all pairs of vertices are joined by an edge. A clique is a maximal complete subgraph, i.e., a complete subgraph not contained in any larger complete subgraph. See [27] for more graph related terminology.

We shall focus on decomposable graphs in this paper. A graph is decomposable [27] if and only if every cycle of length greater than or equal to four possesses a chord. A decomposable graph G can be represented by a perfect ordering of its cliques and separators. Refer to [27] for formal definitions of a clique and a separator, and other equivalent representations. An ordering (C1, S2, C2, S3, … , Ck) of the cliques Ci ∈ C and separators Si ∈ S, where C = {Ci, i = 1, … , k} and S = {Si, i = 2, … , k}, is said to be perfect if for every i = 2, 3, … , k the running intersection property ([27], page 15) is fulfilled, meaning that there exists a j < i such that Si = Ci ∩ Hi−1 ⊆ Cj, where Hi−1 = C1 ∪ ⋯ ∪ Ci−1. A junction tree for the decomposable graph G is a tree representation of the cliques. (For a non-decomposable graph, the junction tree consists of its prime components, which are not necessarily cliques, i.e., not complete.) A tree with vertex set equal to the set of cliques of G is said to be a junction tree if, for any two cliques Ci and Cj and any clique C on the unique path between Ci and Cj, we have Ci ∩ Cj ⊆ C. A set of vertices shared by two adjacent nodes of the junction tree is complete and defines the separator of the two subgraphs induced by the two adjacent nodes. Figures 1 and 2 briefly illustrate a decomposable and a non-decomposable graph, both defined on 6 nodes.

Figure 1.

G6 is a 6-node decomposable graph and its junction tree decomposition (right) has 3 cliques and 2 separators, that is, C1 = {1, 2}, S2 = {2}, C2 = {2, 3, 4}, S3 = {3, 4}, C3 = {3, 4, 5, 6}.

Figure 2.

G6 is a 6-node non-decomposable graph because it has a cycle of length four, 3 − 4 − 5 − 6 − 3, that does not have a chord. Its junction tree decomposition (right) has 3 prime components and 2 separators, that is, P1 = {1, 2}, S2 = {2}, P2 = {2, 3, 4}, S3 = {3, 4}, P3 = {3, 4, 5, 6}. Of the three prime components, only P1 and P2 are cliques.
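Decomposability is easy to check programmatically. Below is a quick sketch (ours) using networkx, with the two graphs transcribed from the clique and prime-component decompositions in the captions of Figures 1 and 2:

import networkx as nx

# Figure 1: decomposable G6 with cliques {1,2}, {2,3,4}, {3,4,5,6}
G_dec = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4),
                  (3, 5), (3, 6), (4, 5), (4, 6), (5, 6)])
# Figure 2: the chordless 4-cycle 3-4-5-6-3 destroys decomposability
G_non = nx.Graph([(1, 2), (2, 3), (2, 4), (3, 4),
                  (3, 6), (4, 5), (5, 6)])

print(nx.is_chordal(G_dec))   # True: decomposable
print(nx.is_chordal(G_non))   # False: non-decomposable
print(sorted(map(sorted, nx.chordal_graph_cliques(G_dec))))  # C1, C2, C3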

2.3. Miscellaneous notations

For an n × p matrix Y, YC is defined as the submatrix of Y consisting of the columns with indices in the clique C. Let (y1, y2, … , yp) = (Y1, Y2, … , Yn)T, where yi is the ith column of Yn×p. If C = {i1, i2, … , i|C|}, where 1 ≤ i1 < i2 < ⋯ < i|C| ≤ p, then YC = (yi1, yi2, … , yi|C|). For any square matrix A = (aij)p×p, define AC = (aij)|C|×|C|, where i, j ∈ C and the order of entries carries into the new submatrix AC. Therefore, YCTYC = (YTY)C. MNm×n(M, Σr, Σc) is an m × n matrix normal distribution with mean matrix M, and Σr and Σc as covariance matrices between rows and columns, respectively.

Let P be the probability corresponding to the true data generating distribution. Denote Gk and Dk as the k-dimensional graph space and the k-dimensional decomposable graph space. Let Mt be the minimal triangulation space of Gt when Gt is non-decomposable. a ≍ b denotes C1 b ≤ a ≤ C2 b for positive constants C1 and C2. a ≲ b denotes a ≤ C3 b for a positive constant C3. a ≳ b denotes a ≥ C4 b for a positive constant C4. For set relations, A ⊆ B means A is a subset of B; A ⊂ B means A ⊆ B and A ≠ B; A ⊈ B means A is not a subset of B. |·|, determined by context, can be the absolute value of a real number, the cardinality of a set or the determinant of a matrix. π(·) and π(· | Y) are the prior and the posterior distribution of graphs, respectively. Refer also to Table 2 for a detailed list of notations used in the theorem statements and the proofs.

Table 2.

Summary of notations

Symbol Definition
P The probability corresponding to the true data generating distribution
Gk, Dk k-dimensional graph space, k-dimensional decomposable graph space
Mt The minimal triangulation space of Gt when Gt is non-decomposable
a ≍ b C1 b ≤ a ≤ C2 b for positive constants C1, C2
a ≲ b, a ≳ b a ≤ C3 b for a positive constant C3, a ≥ C4 b for a positive constant C4
A ⊆ B, A ⊈ B A is a subset of B, A is not a subset of B
A ⊂ B A ⊆ B and A ≠ B
|·| Absolute value, cardinality of a set or determinant of a matrix by context
π(·), π(· | Y) The prior distribution and posterior distribution of graphs
Y, YiT, yi The n × p data matrix, the ith row of Y, the ith column of Y
ρij, ρij|S The correlation between Xi and Xj, their partial correlation given XS
ρ̂ij, ρ̂ij|S The sample correlation and sample partial correlation given XS
ρL, ρU The lower and upper bounds for all |ρij|V\{i,j}|, where (i, j) ∈ Et
Ci, C, Si, S A clique, the set of cliques, a separator, the set of separators
Gt, Ga, Gc The true graph, any decomposable graph, the complete graph
Gm, G0 A minimal triangulation of Gt, the empty graph
Ĝ The posterior mode in the decomposable graph space
Et, Ea, Ec, Ea1 Edge sets of Gt, Ga, Gc and Ea1 = Ea ∩ Et
p, V The graph dimension, the vertex set, where V = {1, 2, … , p}
x, x̄, x̃ Nodes in the graph
i, j Determined by context, nodes in the graph or indices of nodes
S, S̄, S̃ Separators in the graph
dS, q The cardinality of separator S, the prior edge inclusion probability
Δϵ, Δϵ(n), Δ′ϵ(n) Probability regions of sample partial correlations
Πxy The set of all sets that separate nodes x and y, where (x, y) ∉ Et
G±(x,y)∈Et A graph with/without true edge (x, y)
G±(x,y)∉Et A graph with/without false edge (x, y)
Ḡi^ca, G̃i^tc The ith graph in the sequence from Gc to Ga and from Gt to Gc

3. Bayesian hierarchical model for graph selection

Suppose we observe independent and identically distributed p-dimensional Gaussian random variables Yi, i = 1, … , n. To describe the common distribution of Yi, define a p × p covariance matrix ΣG that depends on an undirected decomposable graph as defined in Section 2.2. Assume Yi | ΣG, G ∼ Np(0, ΣG). In matrix notations,

Yn×p | ΣG, G ∼ MNn×p(0n×p, In, ΣG),

where Yn×p = (Y1, Y2, … , Yn)T and 0n×p is an n × p matrix with all zeros. The prior used here for covariance matrix ΣG given a decomposable graph G is the hyper-inverse Wishart prior, described below. We emphasize here that although we restrict our model and prior to decomposable graphs only, our results in Section 5 allow for model misspecification.

3.1. The hyper-inverse Wishart distribution

The hyper-inverse Wishart (HIW) distribution, denoted HIWG(b, D) [10,11], is a distribution on the cone of p × p positive definite matrices with degrees of freedom b > 2 [24] and a fixed p × p positive definite matrix D, such that the joint density factorizes over the junction tree of the given decomposable graph G as

p(ΣG | b, D) = ∏C∈C p(ΣC | b, DC) / ∏S∈S p(ΣS | b, DS), (3.1)

where for each C ∈ C, ΣC ∼ IW|C|(b, DC) with density

p(ΣC | b, DC) ∝ |ΣC|^{−(b+2|C|)/2} etr{−(1/2)ΣC−1DC},

where |C| is the cardinality of the clique C and etr(·) = exp{tr(·)}. IWp(b, D) is an inverse Wishart distribution with b degrees of freedom and a fixed p × p positive definite matrix D. Its normalizing constant is

|(1/2)D|^{(b+p−1)/2} Γp^{−1}((b + p − 1)/2),

where Γp(·) is a multivariate gamma function. Refer to [10] for more details about this parametrization of the inverse Wishart distribution.

3.2. Bayesian inference on decomposable graphs

Since the joint density factorizes over cliques and separators in the same way as in (3.1), the likelihood function can be rewritten as

f(Y | ΣG) = (2π)^{−np/2} [∏C∈C |ΣC|^{−n/2} etr(−(1/2)ΣC−1YCTYC)] / [∏S∈S |ΣS|^{−n/2} etr(−(1/2)ΣS−1YSTYS)]. (3.2)

From (3.1), the prior on ΣG is

f(ΣG | G) = ∏C∈C p(ΣC | b, DC) / ∏S∈S p(ΣS | b, DS) = [∏C∈C |(1/2)DC|^{(b+|C|−1)/2} Γ|C|^{−1}((b + |C| − 1)/2) |ΣC|^{−(b+2|C|)/2} etr(−(1/2)ΣC−1DC)] / [∏S∈S |(1/2)DS|^{(b+|S|−1)/2} Γ|S|^{−1}((b + |S| − 1)/2) |ΣS|^{−(b+2|S|)/2} etr(−(1/2)ΣS−1DS)].

It is straightforward to obtain the marginal likelihood of the decomposable graph G,

f(Y | G) = (2π)^{−np/2} h(G, b, D) / h(G, b + n, D + YTY) = (2π)^{−np/2} ∏C∈C w(C) / ∏S∈S w(S),

where

h(G, b, D) = [∏C∈C |(1/2)DC|^{(b+|C|−1)/2} Γ|C|^{−1}((b + |C| − 1)/2)] / [∏S∈S |(1/2)DS|^{(b+|S|−1)/2} Γ|S|^{−1}((b + |S| − 1)/2)],
w(C) = [|DC|^{(b+|C|−1)/2} / |DC + YCTYC|^{(b+n+|C|−1)/2}] 2^{n|C|/2} Γ|C|((b + n + |C| − 1)/2) Γ|C|^{−1}((b + |C| − 1)/2).

Throughout the remainder of the paper, we shall be working with the hyper-inverse Wishart g-prior [10], denoted as

ΣG | G ∼ HIWG(b, gYTY), (3.3)

where g is some suitably small fraction in (0, 1) and b > 2 is a fixed constant. Following the recommendation in [10], we choose g = 1/n for the rest of the paper. Intuitively, this choice of g avoids overwhelming the likelihood asymptotically as well as arbitrarily diffusing the prior. In that case,

w(C) = (n + 1)^{−|C|(b+n+|C|−1)/2} |YCTYC|^{−n/2} (2n)^{n|C|/2} Γ|C|((b + n + |C| − 1)/2) Γ|C|^{−1}((b + |C| − 1)/2).
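To make the computation concrete, the following is a short sketch (ours, not the authors' code) of the log marginal likelihood log f(Y | G) under the HIW g-prior with g = 1/n, assuming the reconstruction of w(C) above and that the cliques and separators of a perfect ordering of G are supplied:

import numpy as np
from scipy.special import multigammaln

def log_w(Y, C, b, n):
    # log w(C) for an index set C under the g-prior with g = 1/n
    c = len(C)
    _, logdet = np.linalg.slogdet(Y[:, C].T @ Y[:, C])   # log |Y_C^T Y_C|
    return (-c * (b + n + c - 1) / 2 * np.log(n + 1)
            - n / 2 * logdet
            + n * c / 2 * np.log(2 * n)
            + multigammaln((b + n + c - 1) / 2, c)
            - multigammaln((b + c - 1) / 2, c))

def log_marginal(Y, cliques, separators, b=3.0):
    # log f(Y|G) = -(np/2) log(2 pi) + sum_C log w(C) - sum_S log w(S)
    n, p = Y.shape
    out = -n * p / 2 * np.log(2 * np.pi)
    out += sum(log_w(Y, C, b, n) for C in cliques)
    out -= sum(log_w(Y, S, b, n) for S in separators)
    return out

# Example: the chain graph 1-2-3 has cliques {1,2}, {2,3} and separator {2}.
Y = np.random.default_rng(1).standard_normal((200, 3))
print(log_marginal(Y, cliques=[[0, 1], [1, 2]], separators=[[1]]))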

The choice of focusing on the HIW g-prior in this paper is driven by the following two reasons. First, we can state the signal strength assumption in terms of the smallest nonzero entry in the partial correlation matrix, which serves as a natural interpretation of the edge strength compared to assumptions on the eigenvalues of the correlation matrix. Second, we conjecture that the results stated in Sections 4 and 5 continue to hold for any choice of HIW prior. The proof techniques under the HIW g-prior are representative of the principal ideas of the article and can be easily adapted to other variations of HIW priors.

To complete a fully Bayesian specification, we place a prior distribution π(·) on the decomposable graph G. Our theoretical results in Sections 4 and 5 are independent of the prior choice on G under fixed-p asymptotics. However, for p increasing with n, we need a suitable penalty on the number of edges of the random graph to penalize false positives. A popular example [8,10,14,24,38], which we use in this paper, is the following. For an undirected decomposable graph G, we assume the edges are independently drawn from a Bernoulli distribution with a common probability q,

π(G | q) ∝ [∏r<s q^{ers} (1 − q)^{1−ers}] 1Dp(G).

Here, q is the prior edge inclusion probability. We control the parameter q to induce sparsity in the number of edges. [24] recommends using 2/(p − 1) as the hyper-parameter of the Bernoulli distribution in practice. With this choice, the implied prior on the number of edges of an unrestricted undirected graph has its mode at p edges; restricted to decomposable graphs, the mode is smaller. We outline specific choices in Sections 4 and 5 below.

To summarize, the Bayesian hierarchical model for Gaussian graphical models with decomposable graphs is as follows.

Yn×p | ΣG, G ∼ MNn×p(0n×p, In, ΣG),
ΣG | G ∼ HIWG(b, gYTY),
π(G | q) ∝ [∏r<s q^{ers} (1 − q)^{1−ers}] 1Dp(G).

This model and its numerous variations have been studied extensively during the past few decades and many efficient algorithms have been proposed for posterior sampling [1,20,24]. The most commonly used one is the add-delete Metropolis–Hastings sampler which is based on the reversible jump Markov chain Monte Carlo (MCMC) algorithm proposed in [20,21]. The algorithm samples decomposable graphs through local updates to traverse through Dp. To highlight the key steps of the MCMC procedure, let G denote the current decomposable graph at a certain step of the MCMC. Then

  • Propose a new decomposable graph G′ by adding or deleting an edge with equal probability from the current decomposable graph G while maintaining decomposability. This can be achieved by applying the procedure in [20].

  • Calculate the acceptance probability
    α(G, G′) = min{1, f(Y | G′)π(G′) / [f(Y | G)π(G)]}.
  • Let Gnew be the updated decomposable graph. Sample ΣGnew from its posterior distribution
    ΣGnew | Y, Gnew ∼ HIWGnew{b + n, (g + 1)YTY}.

    This step can be performed by using the sampling procedure in [9].
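The following is a minimal sketch (ours, not the authors' implementation) of one such add-delete move. The decomposability-preserving proposal of [20] is replaced here by a simpler propose-then-reject-if-not-chordal step, and log_score – the log of f(Y | G)π(G) – is assumed to be supplied by the user, for example the log_marginal sketch above combined with the Bernoulli edge prior.

import math
import random
import networkx as nx

def mh_step(G, log_score, p):
    # one add/delete Metropolis-Hastings move on a decomposable graph over 0..p-1
    u, v = random.sample(range(p), 2)
    Gp = G.copy()
    if Gp.has_edge(u, v):
        Gp.remove_edge(u, v)
    else:
        Gp.add_edge(u, v)
    if not nx.is_chordal(Gp):                 # stay inside the decomposable space
        return G
    log_alpha = log_score(Gp) - log_score(G)  # symmetric proposal terms cancel
    if random.random() < math.exp(min(0.0, log_alpha)):
        return Gp
    return G

# Toy run: a score favoring sparsity drags the chain toward graphs with few edges.
p = 5
toy_score = lambda G: -1.0 * G.number_of_edges()  # placeholder for log f(Y|G)pi(G)
G = nx.complete_graph(p)                          # complete graphs are decomposable
for _ in range(2000):
    G = mh_step(G, toy_score, p)
print(G.number_of_edges())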

4. Theoretical results in the well-specified case

In this section, we present our main consistency results when the true graph is assumed to be decomposable. The assumptions and theorems introduced here serve as the foundation for the consistency results when the model is misspecified. Understanding how the results are derived for decomposable graphs is crucial for developing the theoretical tools when the true graph is non-decomposable. This section provides an alternative perspective on selection consistency problems for perfect DAGs [8] based on a set of new assumptions. A detailed comparison of our assumptions with those in [8] is given in Section 4.4. The proofs are deferred to the Appendix and the Supplementary Materials [32].

Before introducing the assumptions, we adapt the previous notations to the high-dimensional graph selection problem. Let Y = (Y1, Y2, … , Yn)T and Ω0 = Σ0−1 be the corresponding precision matrix. Without loss of generality, we assume all column means of Y are zero. Let ρij|V\{i,j} denote the true partial correlation between nodes i and j given the rest of the nodes V\{i, j}. Let ρL and ρU be the smallest and largest (in absolute value) non-zero population partial correlations, that is,

ρL = min{1≤i<j≤p, (i,j)∈Et} |ρij|V\{i,j}|,  ρU = max{1≤i<j≤p, (i,j)∈Et} |ρij|V\{i,j}|.

Furthermore, we assume all partial correlations of the form ρij|V\{i,j} are uniformly bounded away from 1 in absolute value to avoid degeneracy of the Gaussian distribution. More precisely, we assume 0 < ρL ≤ ρU < 1.
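In practice, ρL and ρU can be read off the true precision matrix using the standard identity ρij|V\{i,j} = −ωij/√(ωii ωjj). A small helper (ours) for doing so:

import numpy as np

def partial_corr_bounds(Omega, tol=1e-12):
    # rho_{ij|V\{i,j}} = -omega_ij / sqrt(omega_ii omega_jj)
    d = np.sqrt(np.diag(Omega))
    R = -Omega / np.outer(d, d)                 # partial correlation matrix
    vals = np.abs(R[np.triu_indices_from(R, k=1)])
    nz = vals[vals > tol]                       # non-zero entries = edges of G_t
    return nz.min(), nz.max()                   # (rho_L, rho_U)

Omega0 = np.array([[2.0, -1.0, 0.0],
                   [-1.0, 2.0, -0.8],
                   [0.0, -0.8, 2.0]])           # the Omega_3 used later in Section 6
print(partial_corr_bounds(Omega0))              # (0.4, 0.5)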

Let Gt = (V, Et) denote the true decomposable graph induced by Ω0. Let Ga = (V, Ea) be any alternative decomposable graph other than the true graph Gt. Ea1 = Ea ∩ Et denotes the set of true edges in Ga. Notice that when Et ⊆ Ea, we have Ea1 = Et. Letting |·| denote the cardinality of a set, |Et| is the number of edges in Gt (the number of true edges) and |Ea1| is the number of true edges in Ga. Define Gc = (V, Ec), where Ec = {(i, j) : eij = 1, 1 ≤ i < j ≤ p} and |Ec| = p(p − 1)/2, to be the complete graph. By definition Gc is a decomposable graph. We use Ga ⊇ Gt to denote Ea ⊇ Et, Ga ⊃ Gt for Ea ⊃ Et, and Ga ⊉ Gt for Ea ⊉ Et. In the following, we introduce a set of general assumptions essential to establish the theoretical results on graph selection consistency. The main results will place additional restrictions on the parameters (α, λ, σ, γ) introduced below in the general assumptions.

Assumption 4.1 (Graph size).

p ≍ n^α, where 0 < α < 1.

This assumption states that the dimension p of graphs has to grow slower than n, unlike [8] (hereafter CKG19) where p is allowed to grow up to a sub-exponential rate in n. This is simply because the sparsity assumptions in our paper are completely different from those in CKG19. As clarified below, we allow graphs to be “dense” while graphs considered by CKG19 are sparse in a systematic way. Similar to CKG19, [28] (hereafter LLL19) allows p to be sub-exponential in n but with some compromise in allowing dense graphs.

Assumption 4.2 (Signal strength and identifiability).

ρL ≳ n^{−λ}, where 0 ≤ λ < 1/2.

This assumption indicates that the smallest nonzero partial correlation (in absolute value) cannot converge to zero faster than 1/√n. Interpreting ρL as the “signal size”, the assumption restricts the smallest signal size so that the hierarchical model in Section 3 is identifiable. We compare this assumption with analogous assumptions from CKG19 and LLL19 in Section 4.4.

Assumption 4.3 (Maximum number of edges in Gt).

|Et| ≍ n^σ, where 0 < σ ≤ 2α.

In the well-specified case (Theorems 4.2 and 4.3), the maximum number of edges in Gt is not restricted, that is, Gt is allowed to be complete, meaning that we do not impose any sparsity conditions on the true graph. For model misspecification, some restrictions on |Et| are needed to ensure selection consistency. CKG19 and LLL19 mainly focus on ultra high-dimensional cases (p > n and p ≫ n) with sparsity assumptions on Gt, which apply to the p < n case as well.

Assumption 4.4 (Prior edge inclusion probability).

q ≍ e^{−Cq·n^γ}, where 0 < γ < 1 and 0 < Cq < ∞.

The assumption states that −log(q) has to be a fractional power in n and cannot increase faster than n. A similar assumption on prior edge inclusion probability can be found in CKG19. A detailed comparison of assumptions on q is provided in Section 4.4. On the other hand, LLL19 uses a different prior specification, called the empirical sparse Cholesky (ESC) prior which has close connections with the prior setting in CKG19. However, the ESC prior does not impose elementwise sparsity with independent Bernoulli distributions, thus it loses the interpretation of the prior edge inclusion probability.

Assumption 4.5 (Imperfect linear relationship).

1 − ρU ≳ n^{−k}, where k ≥ 0.

In general, ρU is strictly less than 1 to avoid degeneracy in the Gaussian distribution. However, it is allowed to approach 1 as n grows, at any polynomial rate (or slower). This limitation on the rate is due to a technical condition on the tail behavior of the distribution of non-zero sample partial correlations. See Section S.1 in the Supplementary Materials [32] for more details.

4.1. Pairwise Bayes factor consistency for fixed p

In this section, we investigate the behavior of the pairwise Bayes factor

BF(Ga; Gt) = f(Y | Ga) / f(Y | Gt), (4.1)

for fixed p, ρL and ρU. We aim to derive sufficient conditions on the likelihood (3.2) and the prior on ΣG given by (3.3) such that the Bayes factor (4.1) converges to 0 as n → ∞ for any graph Ga ≠ Gt.

Theorem 4.1 (Upper bounds for pairwise Bayes factors). Let p be fixed. Given any decomposable graph Ga ≠ Gt, there exists a set Δa, such that on the set Δa, if n > max{p + b, 4p}, we have

  1. when Gt ⊈ Ga,
    BF(Ga; Gt) < exp{−nρL²/2 + δ(n)},
  2. when Gt ⊂ Ga,
    BF(Ga; Gt) < (ep²) n^{(1/2)(|Ea|−|Et|)(1−2τ)},
    and
    P(Δa) ≥ 1 − 42p²(1 − ρU)^{−2}(n − p)^{1/4−τ}{(1/τ)log(n − p)}^{−1/2},
    where τ > 2 and δ(n) = p² log n + √(n log n) + 3p² log p, satisfying δ(n)/n → 0 as n → ∞.

When Gt ⊈ Ga, Ga lacks at least one true edge in Gt. The first part of Theorem 4.1 says that the Bayes factor in favor of true-edge deletions is exponentially small, with the actual rate depending on ρL. When Gt ⊂ Ga, Ga contains all true edges in Gt and at least one false edge. The Bayes factor in favor of false-edge additions decreases to zero only at a polynomial rate depending on the number of false edges in Ga.

The behavior of Bayes factors observed from Theorem 4.1 depends solely on the property of the HIW prior as no prior on the graph space is involved and no sparsity condition on Gt is imposed. The HIW prior enforces a stronger penalty on true-edge deletions than false-edge additions. In high-dimensional cases, as the complexity of the graph space grows, the polynomial penalty on false-edge additions is not enough to overcome the complexity of the graph space. In such situations, a prior on the graph space is needed to achieve the selection consistency (refer to Theorems 4.2 and 4.3). In high-dimensional cases, the HIW prior alone does not favor parsimonious models, that is, it does not guarantee the selection of sparse graphs when Gt is sparse. The next two corollaries are direct consequences of Theorem 4.1.

Although Theorem 4.1 holds for finite dimensional graphs, its proof techniques serve as the foundation to the subsequent theorems on decomposable graphs presented in this article. Therefore, it is important to discuss the key steps of the proof. First, to find an upper bound of Bayes factors between any two decomposable graphs, the Bayes factors are decomposed in terms of “local moves” consisting of adding or deleting edges between two decomposable graphs. Hence, the Bayes factor between any two decomposable graphs which differ by one edge can be seen as the basic unit in the proof. Lemmas A.1–A.4 in Appendix A provide the theoretical support for this procedure. Second, we enumerate all such Bayes factors and express them as functions of sample partial correlations specific to each Bayes factor. Finally, in Section S.1 we develop results on the tail behavior of sample partial correlations, which enable us to replace the sample partial correlations with the upper or lower bound defined in Assumptions 4.2 and 4.5.

Corollary 4.1 (Finite graph pairwise Bayes factor consistency). Let Ga be any decomposable graph and Ga ≠ Gt. If the graph dimension p is a fixed constant, BF(Ga; Gt) → 0 in P-probability as n → ∞.

Corollary 4.2 (Finite graph strong selection consistency). Let p be fixed. Define

π̃(Gt | Y) = 1 / {1 + ∑Ga≠Gt BF(Ga; Gt)}. (4.2)

We have π̃(Gt | Y) → 1 in P-probability as n → ∞.

Notice that the number of Bayes factor terms in the denominator on the right-hand side of (4.2) is finite. Thus, strong selection consistency follows trivially from Theorem 4.1.

4.2. Posterior ratio consistency for growing p

Next, we examine the convergence of the following posterior ratio,

PR(Ga; Gt) = f(Y | Ga)π(Ga) / [f(Y | Gt)π(Gt)],

when the dimension of graphs p grows with the sample size n.

Theorem 4.2 (High-dimensional graph posterior ratio consistency). Let Ga be any decomposable graph and Ga ≠ Gt. If Assumptions 4.1–4.5 are satisfied with

0 < α < 1/2,  0 ≤ λ < min{1/4, (1 − 2α)/2}, (4.3)

by choosing γ in the interval (0, 1 − σ − 2λ), we have PR(Ga; Gt) → 0 in P-probability as n → ∞.

We provide some intuition for the constraints imposed on (α, λ, σ, γ) in Theorem 4.2. For posterior ratio consistency, the graph size p cannot grow at a rate equal to or faster than √n. In Figure 3(a), as α increases past 1/4 toward 1/2, the range of λ becomes narrower. This means the rate at which ρL converges to zero needs to be slower to achieve the selection consistency. Therefore, for larger dimensional graphs a stronger identifiability is needed for consistent recovery of the true graph.

Figure 3.

Possible range of (α, λ, σ, γ) required for the high-dimensional graph posterior ratio consistency. (a) The gray area shows all possible values of (α, λ) in the assumptions of Theorem 4.2. The range of λ is (0, 1/4) for all α ∈ (0, 1/4). As α increases passing 1/4, the range of λ becomes narrower. (b) For α = 1/4, then λ ∈ [0, 1/4) and σ ∈ (0, 1/2]. The plotted plane is the upper bound of γ in Theorem 4.2, that is, γ = 1 − σ − 2λ. The lower bound of γ is the bottom plane γ = 0 in this plot. The area between the upper and bottom planes contains all possible values of (λ, σ, γ) when α = 1/4.

From Theorem 4.1, recall that HIW priors do not enforce a strong penalty on false-edge additions. When the graph size grows with n, a prior on the number of edges is needed. The penalty on the number of edges is controlled by γ, that is, larger γ means smaller edge inclusion probability and larger penalty on dense graphs, and vice versa. Only the upper bound of γ is restricted by σ and λ, see Figure 3(b). The upper bound of γ decreases when σ increases (the total number of edges |Et| increases). This ensures the consistency when Gt is dense.

The upper bound of γ also decreases when λ increases (ρL converges to zero faster). As ρL goes to zero faster, the identifiability issue becomes more severe. Based on Theorem 4.1, the risk of false-edge additions (polynomial) is lower than the risk of true-edge deletions (exponential). In order to overcome the identifiability issue, increasing the edge inclusion probability would result in a lower risk. This explains the decrease in the upper bound of γ when λ increases. Furthermore, there is no restriction on σ in Assumption 4.3 suggesting that no sparsity assumption on the true graph is needed for the consistency.

4.3. Strong graph selection consistency for growing p

In this section, we examine the behavior of

π(G | Y) = f(Y | G)π(G) / ∑G′∈Dp f(Y | G′)π(G′),

as n, p → ∞.

Theorem 4.3 (Strong graph selection consistency). Let Ga be any decomposable graph and Ga ≠ Gt. If Assumptions 4.1–4.5 are satisfied with

0 < α < 1/3,  0 ≤ λ < min{(1 − α)/4, (1 − 3α)/2}, (4.4)

by choosing γ in the interval (α, 1 − σ − 2λ), we have

π(Gt | Y) → 1 in P-probability as n → ∞.

Strong selection consistency demands that all posterior ratios converge simultaneously at a sufficiently fast rate so that their sum is convergent. Since the number of alternative graphs is of the order 2^{p²}, to make the summation convergent we require further assumptions on the model complexity and an accompanying stronger penalty on the number of edges. We achieve this by shrinking the dimension of the graph space (α < 1/3) and inducing a slightly stronger sparsity (by selecting larger γ) in the prior over the graph space. As α increases, the upper bound of λ (the fastest rate allowed for ρL to converge to zero) becomes more restrictive and so does its feasible range. This alleviates the identifiability problem caused by the growth of p.

The upper bound of γ is the same as in Theorem 4.2, but λ has a different range in Theorem 4.3, see Figure 4(b). While there is no specific lower bound of γ in Theorem 4.2, there is a lower bound on γ increasing with α in Theorem 4.3. This ensures when p grows with n, the upper bound of the edge inclusion probability q becomes smaller imposing a stronger penalty coming from the prior on the graph space. Finally, as in Theorem 4.2 we do not need any further restriction on σ in Assumption 4.3 meaning that the true graph is allowed to be dense or even complete in the strong selection consistency.
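As a quick aid for navigating these constraints, the following tiny checker (ours) encodes the feasible region of Theorem 4.3 as stated above, together with Assumption 4.3:

def feasible_43(alpha, lam, sigma, gamma):
    # constraints of Theorem 4.3 on (alpha, lambda, sigma, gamma)
    return (0 < alpha < 1/3
            and 0 <= lam < min((1 - alpha) / 4, (1 - 3 * alpha) / 2)
            and 0 < sigma <= 2 * alpha                  # Assumption 4.3
            and alpha < gamma < 1 - sigma - 2 * lam)    # prior penalty window

print(feasible_43(alpha=0.2, lam=0.1, sigma=0.4, gamma=0.3))   # True
print(feasible_43(alpha=0.3, lam=0.1, sigma=0.6, gamma=0.35))  # False: lambda too large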

Figure 4.

Possible range of (α, λ, σ, γ) required for the high-dimensional strong graph selection consistency. (a) The gray area shows all possible values of (α, λ) in Theorem 4.3. For α ∈ (0, 1/5], the range of λ narrows slowly with a rate of 1/4; for α ∈ (1/5, 1/3), the upper bound of λ decreases faster at a rate of 3/2. (b) For α = 1/5, then λ ∈ [0, 1/5) and σ ∈ (0, 2/5]. The plotted plane is the upper bound of γ in Theorem 4.3, that is, γ = 1 − σ − 2λ. The lower bound of γ is the bottom plane γ = 0.2 in this plot. The area between the upper and bottom planes contains all possible values of (λ, σ, γ) when α = 1/5.

In practice, one might be interested in a consistent point estimate rather than the entire posterior distribution. In Bayesian inference for discrete configurations, the posterior mode provides a natural point estimate. In the following, we investigate the consistency of the posterior mode obtained from our Bayesian hierarchical model as a simple by-product of Theorems 4.2 and 4.3. Define Ĝ to be the posterior mode in the decomposable graph space, that is,

Ĝ = arg maxG∈Dp π(G | Y).

Then the following is true.

Corollary 4.3 (Posterior mode consistency when Gt is decomposable). Under the assumptions of Theorem 4.3, the probability that the posterior mode Ĝ equals the true graph Gt goes to one, that is,

P(Ĝ = Gt) → 1 as n → ∞.

4.4. Comparing assumptions with CKG19 and LLL19

The results in Section 4.2 and Section 4.3 are established with certain restrictions on the parameters (α, λ, σ, γ) in Assumptions 4.1–4.5. Recently, CKG19 and LLL19 developed the DAG selection consistency with the DAG-Wishart prior [5] and the ESC prior [28], respectively. For any undirected decomposable graph there exists a directed acyclic graph whose skeleton is the same as the decomposable graph [8]. Also, the DAG-Wishart prior is equivalent to the hyper-inverse Wishart prior if one considers decomposable graphs. The ESC prior is related to the DAG-Wishart prior from an autoregressive perspective of Gaussian DAG models [28]. Thus, most assumptions in LLL19 can be related to CKG19. Due to the similarities between the hierarchical models in CKG19 and this paper, we mainly focus on comparing the differences of assumptions between these two papers. For a fair comparison of the assumptions, we consider p to be increasing only at a fractional exponent in n.

4.4.1. Assumptions on eigenvalues of Ω0

In CKG19, the lower bound of the smallest eigenvalue of the true precision matrix Ω0 has to decrease at a rate slower than n^{−1/8} and the upper bound of the largest eigenvalue cannot grow faster than n^{1/8}. In LLL19, only a fixed constant and its reciprocal are used to bound the smallest and largest eigenvalues of the true precision matrix, but this can be relaxed to an assumption similar to that of CKG19 at the sacrifice of other conditions. We do not need any assumptions on the eigenvalues of the true precision matrix. Instead, we only restrict ρL, the smallest nonzero partial correlation in absolute value. This assumption is further examined below.

In general, there is no association between eigenvalues and partial correlations of a positive definite matrix. Here is a simple example to demonstrate this. Consider a (p + 2)-dimensional positive definite precision matrix

[A 0_{2×p}; 0_{p×2} Ip], where A = [a b; b a] and a² − b² > 0, b ≠ 0.

The two nontrivial eigenvalues are a ± b and the only non-zero partial correlation is b/a (in absolute value). For example, let a = n^s and b = n^t, where s > t > 0. Setting s − t < 1/2, the partial correlation n^{−(s−t)} does not go to zero faster than 1/√n. But in this case, both eigenvalues a ± b = O(n^s) can diverge to infinity at an arbitrary polynomial rate. On the other hand, restricting the eigenvalues of Ω0 does not guarantee that ρL can be controlled. Let a = n^{−s} and b = n^{−t}, where t > s > 0, and restrict the eigenvalues to decrease slower than n^{−1/8} by taking s < 1/8. In this case, t − s > t − 1/8, which means the partial correlation can converge to zero at an arbitrary polynomial rate. We note that there also exist examples in which eigenvalues and partial correlations are related. We omit the details for simplicity.
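The two claims in this example are easy to confirm numerically; a minimal sketch (ours):

import numpy as np

n, s, t = 10_000, 0.6, 0.2                      # s > t > 0 and s - t < 1/2
a, b = n**s, n**t
Omega = np.block([[np.array([[a, b], [b, a]]), np.zeros((2, 3))],
                  [np.zeros((3, 2)), np.eye(3)]])
print(np.sort(np.linalg.eigvalsh(Omega))[-2:])  # nontrivial eigenvalues a -+ b
print(b / a, n**(t - s))                        # |partial correlation| = n^{t-s}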

4.4.2. Assumptions on the sparsity of Gt

Considering decomposable graphs, CKG19 restricts the number of edges directed from any node of the true graph to be less than n^{1/4}. Assumption 4 in CKG19, involving the signal size and eigenvalue restrictions, limits this number even further. But Theorems 4.2 and 4.3 do not enforce any sparsity assumptions on Gt. For p = o(n^{1/2}), CKG19 requires the number of children of each node in Gt to grow slower than n^{3/4}, while Theorem 4.2 does not have any constraints on |Et| (it can be as large as p(p − 1)/2). On the other hand, LLL19 sets the same upper bound for the cardinality of the parent set and the children set of each node, which is allowed to increase with the sample size. When p < n, LLL19 does not have any constraints on the total number of edges.

In the strong selection consistency results, CKG19 adds another restriction on the total number of edges. Assuming p = o(n^{1/3}) in CKG19, this implies that the total number of edges has to increase slower than n^{7/12}, while Theorem 4.3 only requires it to be slower than n^{2/3}, where Gt is allowed to be Gc. Overall, for p < n, CKG19 enforces stronger sparsity conditions on Gt, at least in the well-specified case.

4.4.3. Assumptions on the prior edge inclusion probability

In CKG19, the prior edge inclusion probability always penalizes dense graphs through the maximum number of children for each node, while in this paper the prior edge inclusion probability is chosen based on ρL, |Et| and p (only in Theorem 4.3) under the well-specified case. There is no direct comparison between these two assumptions due to very different sparsity assumptions and perspectives under which the consistency is studied in both papers. As discussed before, the ESC prior in LLL19 does not have an interpretation of the prior edge inclusion probability.

4.4.4. Assumptions on the signal size

In CKG19, the signal size is defined as the smallest (in absolute value) non-zero entry in the lower triangular matrix of the modified Cholesky decomposition of Ω0, denoted by sn. The beta-min condition in LLL19 restricts the square of the signal size defined in CKG19. In this paper, we consider ρL, the smallest non-zero partial correlation (in absolute value), as the analogue of the signal size. Although the definitions are different, sn and ρL both relate to quantities involving the number of edges. In CKG19 the signal size goes to zero slower when the maximum number of children of each node increases; in this paper the signal size converges to zero slower when |Et| increases; in LLL19 the signal size cannot converge to zero faster than 1/√n, which is similar to our general assumption on ρL. Both signal sizes serve as a measure of identifiability in the selection consistency.

In the following, we look at an example showing that neither the assumptions in CKG19 nor those in this paper imply the other. Let Ω denote the precision matrix of a complete graph with the maximum number of children d in the modified Cholesky decomposition Ω = L(aIp)−1LT, where a > 0. The following calculation does not depend on the choice of a as long as it is greater than zero. For simplicity, all non-zero lower triangular entries are assumed to be the same, say s. Thus, s is also the signal size defined in CKG19. The smallest partial correlation in terms of d and s is given by

ρL = 1/√(1/s² + d).

First, when s → ∞, ρL → 1/√d. In CKG19, d cannot grow faster than n^{1/4}; hence the assumption about ρL in this paper is always satisfied. When s → 0, based on Assumption 4 in CKG19, the signal size must converge to zero at a rate slower than n^{−1/4} while d grows slower than 1/s². In this case, ρL must be slower than n^{−1/4} as well. Thus, the assumptions of the strong selection consistency (Theorem 4.3) in this paper are not met, since it requires ρL to be slower than n^{−1/5}. But Theorem 4.2 still holds.
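The expression for ρL above is simple to verify numerically under the stated construction (all sub-diagonal entries of L equal to s); a short sketch (ours):

import numpy as np

d, s, a = 4, 0.7, 1.0                            # p = d + 1 nodes
p = d + 1
L = np.eye(p) + s * np.tril(np.ones((p, p)), k=-1)
Omega = L @ L.T / a
D = np.sqrt(np.diag(Omega))
R = np.abs(-Omega / np.outer(D, D))              # |partial correlations|
print(R[np.triu_indices(p, k=1)].min())          # smallest partial correlation
print(1 / np.sqrt(1 / s**2 + d))                 # matches the formula above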

4.4.5. Assumptions on HIW g-priors and DAG-Wishart priors

For the HIW g-prior used in this paper, the assumption that its degrees of freedom satisfy b > 2 is analogous to Assumption 5(i) in CKG19. Also, the fact that (1/n)YTY has eigenvalues of constant order is essentially the same as Assumption 5(ii) in CKG19. Although the ESC prior in LLL19 is closely related to the DAG-Wishart prior under certain conditions, there is no direct connection with the HIW g-prior.

Based on the comparisons presented in this section, the assumptions in CKG19 (and LLL19) do not imply the ones in this paper and vice versa, even when p < n. The assumptions in all three papers are only sufficient conditions, and violation of any assumption does not necessarily compromise consistency.

5. Theoretical results under model misspecification

In this section, we investigate the effect of model misspecification when the underlying true graph Gt is non-decomposable. We develop theoretical results for non-decomposable graphs based on the theorems and techniques in the well-specified case. The general assumptions remain the same as in Assumptions 4.1–4.5 but different versions of restrictions on (α, λ, σ, γ) are introduced to handle the misspecification. We begin with some definitions on triangulations and minimal triangulations of a graph.

Definition 5.1 (Triangulation). Let G = (V, E) be a non-decomposable graph. A graph GΔ = (V, E ∪ F) with E ∩ F = ∅ is called a triangulation of G if GΔ is decomposable. The edges in F are called fill-in edges.

Definition 5.2 (Minimal triangulation [22,35]). Let GΔ = (V, E ∪ F) be a triangulation of the non-decomposable graph G = (V, E). GΔ is a minimal triangulation of G if (V, E ∪ F′) is non-decomposable for every F′ ⊂ F. In other words, a triangulation is minimal if and only if the removal of any single fill-in edge from it results in a non-decomposable graph.

Definition 5.2 captures the important aspect of minimal triangulations, namely that no fill-in edge in a minimal triangulation is redundant. To better understand this concept, see the following example in Figure 5. Graph Go6 is not decomposable since it contains a chordless cycle of length 6, i.e., 1 − 2 − 3 − 4 − 5 − 6 − 1. Graphs Gm1, Gm2, Gm3 and Gtri are all triangulations of Go6, since they are all decomposable and the edge set of Go6 is a subset of theirs. Furthermore, Gm1, Gm2 and Gm3 are minimal triangulations of Go6, since removing any fill-in edge from them results in a non-decomposable graph. Gtri is not a minimal triangulation of Go6, since the removal of edge 3 − 6 or 1 − 5 results in the decomposable graph Gm2 or Gm3, respectively. It has redundant fill-in edges and is thus not minimal. Go4 is not a triangulation of Go6, since it is not decomposable due to the chordless cycle of length 4, i.e., 1 − 3 − 4 − 6 − 1. This example shows that minimal triangulations are not unique. For a survey of minimal triangulations, see [22]. Next, we state theorems on graph selection consistency when the true graph Gt is non-decomposable.

Figure 5.

Go6 is not decomposable; Gm1, Gm2, Gm3 are all minimal triangulations of Go6; Gtri is a triangulation of Go6 but not minimal; Go4 is not a triangulation of Go6.
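Definition 5.2 can be checked mechanically: a triangulation is minimal iff dropping any single fill-in edge breaks chordality. Below is a sketch (ours) on the 6-cycle underlying Figure 5; the fill-in sets are illustrative choices, not necessarily the exact graphs drawn in the figure.

import networkx as nx

cycle6 = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

def is_minimal_triangulation(base_edges, fill):
    G = nx.Graph(base_edges + fill)
    if not nx.is_chordal(G):                     # not even a triangulation
        return False
    return all(not nx.is_chordal(nx.Graph(base_edges + [f for f in fill if f != e]))
               for e in fill)

print(is_minimal_triangulation(cycle6, [(1, 3), (1, 4), (1, 5)]))          # True
print(is_minimal_triangulation(cycle6, [(1, 3), (1, 4), (1, 5), (2, 4)]))  # False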

Theorem 5.1 (Convergence of minimal triangulations for finite graphs). Assume the true graph Gt is non-decomposable. When the graph dimension p is a fixed constant (ρL and ρU are fixed constants), we have the following results.

  1. Let Gm be any minimal triangulation of Gt and Ga be any decomposable graph that is not a minimal triangulation of Gt. If ρU ≠ 1, then BF(Ga; Gm) → 0 in P-probability as n → ∞.

  2. If ρU ≠ 1, we have ∑Gm∈Mt π(Gm | Y) → 1 in P-probability as n → ∞, where Mt is the minimal triangulation space of Gt.

Theorem 5.1 states that Bayes factors favor minimal triangulations over any other decomposable graphs. This leads to the pairwise Bayes factor consistency for minimal triangulations. The second part of the theorem ensures that, under model misspecification, the posterior eventually concentrates on the minimal triangulation space, that is, the sum of the posterior probabilities of all minimal triangulations converges to 1. The minimal triangulations serve as a proxy for Gt: they contain all edges in Gt with the minimal number of fill-in edges. Thus, a minimal number of false positive edges and no false negative edges are guaranteed during model fitting.

Theorem 5.2 (Equivalence of minimal triangulations for finite graphs). Assume the true graph Gt is non-decomposable. Let Gm1 and Gm2 be any two different minimal triangulations of Gt (with the same number of fill-in edges). When the graph dimension p is a fixed constant, the Bayes factor between them is stochastically bounded; that is, for any 0 < ϵ < 1, there exist two positive finite constants A1(ϵ) < 1 and A2(ϵ) > 1 such that

P{A1 < BF(Gm1; Gm2) < A2} > 1 − ϵ  for n > p + max{3, b, 6 log(10p²/ϵ)}.

Although minimal triangulations are not unique, Theorem 5.2 says that the Bayes factor between any two minimal triangulations is stochastically bounded. Asymptotically, the posterior probability may not converge to 1 for any single minimal triangulation of Gt. In fact, the posterior assigns non-vanishing probabilities to different minimal triangulations. The highest posterior probability model depends on the given data set. Refer to the second simulation in Section 6 for an example. In [18], an asymptotic version of Theorem 5.2 is presented. In contrast, our result is non-asymptotic and provides an exact form of the constants; see Section B.2 in the Supplementary Materials [32] for details.

The Bayes factors considered in Theorems 5.1 and 5.2 are in fact between two decomposable graphs, except one of them is a minimal triangulation (which is decomposable) of the true graph. Therefore, the proofs follow similar steps as in the proof of Theorem 4.1.

Theorem 5.3 (Convergence of minimal triangulations for high-dimensional graphs). Assume the true graph Gt is not decomposable. When the graph dimension p grows with n, we have the following results.

  1. Let Gm be any minimal triangulation of Gt and Ga be any decomposable graph that is not a minimal triangulation of Gt. Assume
    0 < α < 1/2,  0 ≤ λ < min{1/4, (1 − 2α)/2},  0 < σ < 1 − 2α − 2λ.
    Choose γ in the interval (2α, 1 − σ − 2λ). Then under Assumptions 4.1–4.5, we have PR(Ga; Gm) → 0 in P-probability as n → ∞.
  2. If
    0 < α < 1/3,  0 ≤ λ < min{(1 − α)/4, (1 − 3α)/2},  0 < σ < 1 − 3α − 2λ,
    and we choose γ in the interval (3α, 1 − σ − 2λ), then under Assumptions 4.1–4.5, we have ∑Gm∈Mt π(Gm | Y) → 1 in P-probability as n → ∞, where Mt is the minimal triangulation space of Gt.

Theorem 5.3 is the high-dimensional version of the convergence of minimal triangulations under model misspecification. The consistency results are the same as in Theorem 5.1, with some additional assumptions on the parameters (α, λ, σ, γ). Compared with Theorems 4.2 and 4.3, the assumptions on α and ρL remain the same. The upper bound of γ is also the same as in the well-specified case. Under model misspecification, a sparsity assumption on the number of edges through σ and a larger lower bound on γ are needed to ensure the consistency of minimal triangulations.

In the well-specified case, we only consider two cases in the proofs, i.e., Gt ⊈ Ga and Gt ⊂ Ga. In the case of model misspecification, there are three scenarios: Gm ⊂ Ga; Gm ⊈ Ga with |Ea1| < |Et|; and Gm ⊈ Ga with Gt ⊆ Gm, Ga. The first two scenarios mirror the two cases in the well-specified case. The third scenario is added because of properties of minimal triangulations. Due to the proof techniques adopted here, in order to show consistency for the third scenario, additional constraints on γ have to be placed. This lowers the edge inclusion probability, which in turn imposes a sparsity condition on the total number of edges. This occurs in both parts of Theorem 5.3. We note that the assumptions in Theorem 5.3 are only sufficient conditions for achieving consistency. The current proof techniques are not suitable for removing the extra assumption on the number of edges.

Theorem 5.4 (Equivalence of minimal triangulations for high-dimensional graphs). Assume the true graph Gt is not decomposable and the graph dimension p grows with n. Let Gm1 and Gm2 be any two different minimal triangulations of Gt. If the number of fill-in edges is finite, then the Bayes factor between them is stochastically bounded.

This is an extension of Theorem 5.2 to the case when p grows with n. Based on Theorem 5.4, the equivalence among minimal triangulations holds when the number of fill-in edges is finite. Adding infinitely many fill-in edges prompts the minimal triangulations to drift further away from each other; thus, the equivalence may not be valid in that case. It is worth noting that no decomposable subgraph of the true graph is a good posterior estimate. This is simply because deleting true edges results in an exponential decay of the Bayes factors in favor of such subgraphs. Analogous to Corollary 4.3, we show in Corollary 5.1 that when the true graph Gt is not decomposable, the posterior mode is one of the minimal triangulations and is consistent.

Corollary 5.1 (Consistency of the posterior mode when Gt is non-decomposable). Under the assumptions of the second part of Theorem 5.3, the posterior mode G^ is in the minimal triangulation space Mt of the true graph Gt with probability converging to one, that is,

P(Ĝ ∈ Mt) → 1 as n → ∞.

6. Simulations

We conduct two sets of simulations to demonstrate the convergence of Bayes factors in the well-specified case (Theorem 4.1) and in the misspecified case (Theorem 5.1) for fixed p.

6.1. Simulation 1: Demonstration of pairwise Bayes factor convergence rate

In this section, we conduct a simulation study in D3 to demonstrate the convergence rate of pairwise Bayes factors. Since there is no non-decomposable graph with 3 nodes, D3 is the same as G3. All 8 graphs in D3 are enumerated in Figure 6.

Figure 6.

Enumerating all 3-node decomposable graphs in D3 with Gt as the true graph, G0 as the null graph and Gc as the complete graph.

The underlying covariance matrix Σ3 and its precision matrix Ω3 are shown below, along with the correlation matrix R3 and the partial correlation matrix R̄3. Samples are drawn independently from N3(0, Σ3). The sample sizes range from 100 to 10,000 in increments of 100. The Bayes factor for each sample size is averaged over 1000 simulation replicates. The degrees of freedom b in the HIW g-prior is chosen to be 3. The first six pairwise Bayes factors in logarithmic scale are shown in Figure 7(a), and the logarithm of BF(Gc; Gt) is shown separately in Figure 7(b) due to its slower convergence rate. To better understand the simulation results, the asymptotic leading terms of the pairwise Bayes factors in logarithmic scale and the empirically estimated slopes with respect to n or log n are listed in the second and third columns of Table 1. To calculate the leading terms, sample partial correlations or sample correlations are replaced with their population counterparts, which do not depend on n. The slopes of the logarithms of the first six Bayes factors in Figure 7(a) are calculated in Table 1 by performing linear regressions on n. The last slope in Table 1 is calculated with a linear regression on log n; refer to Figure 7(b). Table 1 shows that the theoretical asymptotic leading terms match their empirical values well.

Figure 7. Simulation results of pairwise Bayes factors of $\mathcal{D}_3$ on the logarithmic scale. (a) Six Bayes factors where $G_t \nsubseteq G_a$ (at least one true edge of $G_t$ missing). (b) Bayes factor BF($G_c$; $G_t$) when $G_t \subsetneq G_a = G_c$ (only false-edge additions).

Table 1. Asymptotic leading terms and simulation slopes of Bayes factors in logarithmic scale.

| Bayes factor | Asymptotic leading term | Simulation slope |
| BF($G_0$; $G_t$) | $\{\log(1-\rho_{12}^2)+\log(1-\rho_{23}^2)\}\,\frac{n}{2} = -0.2967\,n$ | $-0.2963$ |
| BF($G_{13}$; $G_t$) | $\{\log(1-\rho_{12}^2)+\log(1-\rho_{23|1}^2)\}\,\frac{n}{2} = -0.2639\,n$ | $-0.2637$ |
| BF($G_{23}$; $G_t$) | $\log(1-\rho_{12}^2)\,\frac{n}{2} = -0.1767\,n$ | $-0.1765$ |
| BF($G_{-12}$; $G_t$) | $\log(1-\rho_{12|3}^2)\,\frac{n}{2} = -0.1438\,n$ | $-0.1439$ |
| BF($G_{12}$; $G_t$) | $\log(1-\rho_{23}^2)\,\frac{n}{2} = -0.1200\,n$ | $-0.1198$ |
| BF($G_{-23}$; $G_t$) | $\log(1-\rho_{23|1}^2)\,\frac{n}{2} = -0.0872\,n$ | $-0.0873$ |
| BF($G_c$; $G_t$) | $-0.5 \log n$ | $-0.5106$ |

$$\Sigma_3 = \begin{bmatrix} 0.7119 & 0.4237 & 0.1695 \\ 0.4237 & 0.8475 & 0.3390 \\ 0.1695 & 0.3390 & 0.6356 \end{bmatrix}, \qquad \Omega_3 = \begin{bmatrix} 2 & -1 & 0 \\ -1 & 2 & -0.8 \\ 0 & -0.8 & 2 \end{bmatrix},$$

$$R_3 = \begin{bmatrix} 1.0000 & 0.5456 & 0.2520 \\ 0.5456 & 1.0000 & 0.4619 \\ 0.2520 & 0.4619 & 1.0000 \end{bmatrix}, \qquad \bar R_3 = \begin{bmatrix} 1 & 0.5 & 0 \\ 0.5 & 1 & 0.4 \\ 0 & 0.4 & 1 \end{bmatrix}.$$
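The leading terms above can be checked directly from the population quantities. The following short sketch (an assumed illustration in Python, not code from the paper) recomputes the correlations and partial correlations from $\Omega_3$ and prints the per-$n$ slopes appearing in the first rows of Table 1:

```python
import numpy as np

# population precision matrix Omega_3 of the D_3 example
Omega3 = np.array([[ 2.0, -1.0,  0.0],
                   [-1.0,  2.0, -0.8],
                   [ 0.0, -0.8,  2.0]])
Sigma3 = np.linalg.inv(Omega3)

def corr(i, j):        # population correlation rho_ij
    return Sigma3[i, j] / np.sqrt(Sigma3[i, i] * Sigma3[j, j])

def pcorr(i, j):       # partial correlation given the remaining node (p = 3)
    return -Omega3[i, j] / np.sqrt(Omega3[i, i] * Omega3[j, j])

rho12, rho23 = corr(0, 1), corr(1, 2)
rho12_3, rho23_1 = pcorr(0, 1), pcorr(1, 2)

# per-n slopes of log BF(. ; G_t) predicted by the leading terms in Table 1
print((np.log(1 - rho12**2) + np.log(1 - rho23**2)) / 2)    # about -0.2967
print((np.log(1 - rho12**2) + np.log(1 - rho23_1**2)) / 2)  # about -0.2639
print(np.log(1 - rho12**2) / 2)                             # about -0.1767
print(np.log(1 - rho12_3**2) / 2)                           # about -0.1438
```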

From the simulation results, we can see that missing at least one true edge of $G_t$ in $G_a$ results in the Bayes factor converging to zero exponentially fast; this is clearly illustrated by all six Bayes factors in Figure 7(a). On the other hand, adding false edges to $G_a$ results in the Bayes factor going to zero at a polynomial rate, which is much slower than in the case of missing a true edge; see Figure 7(b). These findings are consistent with Table 1 and our proofs.

Next, we compare the convergence rates of the first six Bayes factors. The convergence rate associated with missing two true edges of $G_t$ is faster than that associated with missing only one true edge; compare BF($G_0$; $G_t$) with BF($G_{23}$; $G_t$), and BF($G_0$; $G_t$) with BF($G_{12}$; $G_t$). The convergence rate is faster when the missing true edge of $G_t$ corresponds to a larger partial correlation (or correlation) in absolute value; compare BF($G_{-12}$; $G_t$) with BF($G_{-23}$; $G_t$), and BF($G_{23}$; $G_t$) with BF($G_{12}$; $G_t$). One interesting fact is that although $G_0$ and $G_{13}$ are both missing two true edges of $G_t$, with $G_{13}$ having an additional false edge compared to $G_0$, the convergence rate of the Bayes factor for $G_{13}$ is slower than that for $G_0$. The reason is clear from Table 1: since the absolute value of the correlation between nodes 2 and 3 ($|\rho_{23}| = 0.4619$) is larger than the absolute value of their partial correlation given node 1 ($|\rho_{23|1}| = 0.4$), the leading term of BF($G_0$; $G_t$) is smaller than that of BF($G_{13}$; $G_t$). The effect of the false edge $1{-}3$ (polynomial rate) is overwhelmed by the leading term (exponential rate). It is evident that HIW priors place a larger penalty on false negative edges than on false positive edges. This confirms Theorem 4.1. Similar conclusions follow from comparing BF($G_{23}$; $G_t$) with BF($G_{-12}$; $G_t$), and BF($G_{12}$; $G_t$) with BF($G_{-23}$; $G_t$).

6.2. Simulation 2: Examination of model misspecification

In this section, we illustrate the stochastic equivalence between minimal triangulations when the true graph is non-decomposable. The smallest non-decomposable graph is a cycle of length 4 without a chord, so we focus our simulation on $\mathcal{D}_4$. Since the number of decomposable graphs increases exponentially with the graph dimension, we select only 5 alternative graphs in $\mathcal{D}_4$ besides the minimal triangulations; see Figure 8. The true covariance matrix $\Sigma_4$ and its precision matrix $\Omega_4$ are listed below, along with the correlation matrix $R_4$ and the partial correlation matrix $\bar R_4$. All simulation settings are the same as in the $\mathcal{D}_3$ simulation.

Figure 8. Some selected graphs in $\mathcal{G}_4$, including the true graph $G_t$, which is non-decomposable. $G_{m_1}$ and $G_{m_2}$ are two minimal triangulations of $G_t$.

$$\Sigma_4 = \begin{bmatrix} 1.8364 & 1.0909 & 0.8909 & 1.3636 \\ 1.0909 & 1.0606 & 0.7273 & 0.9091 \\ 0.8909 & 0.7273 & 0.9273 & 0.9091 \\ 1.3636 & 0.9091 & 0.9091 & 1.6364 \end{bmatrix}, \qquad \Omega_4 = \begin{bmatrix} 2 & -1.2 & 0 & -1 \\ -1.2 & 3 & -1.2 & 0 \\ 0 & -1.2 & 3 & -1 \\ -1 & 0 & -1 & 2 \end{bmatrix},$$

$$R_4 = \begin{bmatrix} 1.0000 & 0.7817 & 0.6827 & 0.7866 \\ 0.7817 & 1.0000 & 0.7334 & 0.6901 \\ 0.6827 & 0.7334 & 1.0000 & 0.7380 \\ 0.7866 & 0.6901 & 0.7380 & 1.0000 \end{bmatrix}, \qquad \bar R_4 = \begin{bmatrix} 1 & 0.49 & 0 & 0.50 \\ 0.49 & 1 & 0.40 & 0 \\ 0 & 0.40 & 1 & 0.41 \\ 0.50 & 0 & 0.41 & 1 \end{bmatrix}.$$
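As a quick structural check, the sketch below (an assumed illustration using numpy and networkx, not code from the paper) recovers $G_t$ from the zero pattern of $\Omega_4$, confirms it is the chordless 4-cycle, and verifies that adding either diagonal produces a decomposable (chordal) graph, i.e., one of the two minimal triangulations $G_{m_1}$ and $G_{m_2}$:

```python
import numpy as np
import networkx as nx

Omega4 = np.array([[ 2.0, -1.2,  0.0, -1.0],
                   [-1.2,  3.0, -1.2,  0.0],
                   [ 0.0, -1.2,  3.0, -1.0],
                   [-1.0,  0.0, -1.0,  2.0]])

# partial correlations: rho_ij|rest = -omega_ij / sqrt(omega_ii * omega_jj)
d = np.sqrt(np.diag(Omega4))
Rbar = -Omega4 / np.outer(d, d)
np.fill_diagonal(Rbar, 1.0)
print(np.round(Rbar, 2))               # matches the displayed partial correlation matrix

# the true graph is encoded by the nonzero off-diagonal entries of Omega_4
G_t = nx.Graph((i, j) for i in range(4) for j in range(i + 1, 4)
               if Omega4[i, j] != 0)
assert sorted(G_t.edges()) == [(0, 1), (0, 3), (1, 2), (2, 3)]  # a 4-cycle
assert not nx.is_chordal(G_t)          # non-decomposable

for diag in [(0, 2), (1, 3)]:          # the two possible fill-in edges
    G_m = G_t.copy()
    G_m.add_edge(*diag)
    assert nx.is_chordal(G_m)          # each diagonal yields a minimal triangulation
```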

When the true graph $G_t$ is non-decomposable, the two minimal triangulations of $G_t$ act as proxy graphs for $G_t$. We plot the first four pairwise Bayes factors where $G_{m_i} \nsubseteq G_a$ ($i = 1, 2$), for $G_{m_1}$ and $G_{m_2}$, on the logarithmic scale in Figure 9(a) and (b), respectively. The logarithm of the Bayes factor between the two minimal triangulations is in Figure 9(c). Finally, we plot the Bayes factors between $G_c$ (a triangulation of $G_t$, but not a minimal one) and both minimal triangulations on the logarithmic scale in Figure 9(d).

Figure 9. Simulation results of pairwise Bayes factors of $\mathcal{D}_4$ on the logarithmic scale. (a) Bayes factor BF($G_a$; $G_{m_1}$) when $G_{m_1} \nsubseteq G_a$ (missing true edges). (b) Bayes factor BF($G_a$; $G_{m_2}$) when $G_{m_2} \nsubseteq G_a$ (missing true edges). (c) Bayes factor between the two minimal triangulations of $G_t$, BF($G_{m_2}$; $G_{m_1}$). (d) Bayes factor BF($G_a$; $G_{m_i}$) when $G_{m_i} \subsetneq G_a = G_c$, $i = 1, 2$ (only false-edge additions).

Figure 9(a) and (b) show Bayes factors with true-edge deletions from $G_{m_1}$ and $G_{m_2}$, respectively. They behave as in the well-specified case, meaning that true-edge deletions cause exponential decay of Bayes factors under model misspecification as well. Figure 9(d) shows that Bayes factors with false-edge additions decay at a polynomial rate under model misspecification; this, too, matches the well-specified case. Based on the simulation result in Figure 9(c), the Bayes factor between the two minimal triangulations $G_{m_1}$ and $G_{m_2}$ appears to be stochastically bounded.

7. Discussion

In this paper, we provide a complete theoretical foundation for high-dimensional decomposable graph selection under model misspecification. When the graph dimension is finite, Fitch, Jones and Massam [18] present pairwise Bayes factor consistency results and stochastic equivalence among minimal triangulations. We provide more general results on both pairwise consistency and strong selection consistency in high-dimensional scenarios; to the best of our knowledge, these are the first complete results on this topic. Our current choice of edge inclusion probability requires knowledge of the total number of edges in the true graph. We anticipate this can be relaxed by placing an appropriate prior on γ.

In our results, the graph dimension cannot equal or exceed $n^{1/2}$ and $n^{1/3}$ for posterior ratio consistency and strong selection consistency, respectively. This limitation on the growth rate of the graph dimension is caused by the convergence rates of sample partial correlations and sample correlations; with the current techniques, and without further investigating the relationships among sample partial correlations, these results cannot be improved. Observe that in the i.i.d. case without any sparsity assumptions, such an assumption on the growth of the graph dimension is common [23,39]. We conjecture that it may not be possible to relax the growth rate of p for achieving strong selection consistency using the current formulation of HIW priors. This is because HIW priors do not penalize false edges heavily enough, so that in high dimensions a prior on the graph space is needed to achieve both pairwise and strong selection consistency. Moreover, sparsity restrictions on the elements of the precision matrix are not supported by HIW priors, due to their inability to enforce sufficient shrinkage conditional on the graph; this precludes extending the technical results to ultra-high-dimensional cases by imposing additional sparsity assumptions on the elements of the precision matrix. This apparent "flaw" lies in the construction of HIW priors and cannot be remedied by adding any reasonable penalty on the graph space.

For technical simplicity, our results are based on the HIW g-prior only; we conjecture that the consistency results continue to hold for general HIW priors. Moreover, extensions to non-decomposable graphical models can be pursued using G-Wishart priors, but major bottlenecks are expected, stemming from the lack of a closed form for the normalizing constant. A recent work [40] on exact formulas for the normalizing constant could be used to establish consistency results for Bayes factors on general graphs. A future area of research will also involve studying the repercussions of model misspecification on other related approaches to Bayesian graph learning [18,24].

Overview of results in the Appendix.

Note that the proofs in the Appendix rely on a set of auxiliary results; the reader is referred to the Supplementary Materials [32] for these. The Appendix begins with a set of graph-theoretic results required to prove Theorem 4.1. We then provide the proof of Theorem 4.1, followed by the proof of Theorem 4.2. We also provide the proofs of the theorems related to minimal triangulations, that is, Theorems 5.1 and 5.2.


Acknowledgements

Debdeep Pati acknowledges support from NSF DMS (1854731, 1916371) and NSF CCF 1934904 (HDR-TRIPODS). Bani K. Mallick acknowledges support from NIH R01CA194391 (NCI) and NSF CCF 1934904 (HDR-TRIPODS).

Appendix A: Pairwise Bayes factor consistency and posterior ratio consistency – any graph $G_a$ versus the true graph $G_t$

Lemma A.1 (Decomposable graph chain rule [27]). Let $G = (V, E)$ be a decomposable graph and let $G' = (V, E')$ be a subgraph of $G$ that is also decomposable, with $|E \setminus E'| = k$. Then there is an increasing sequence $G' = G_0 \subset G_1 \subset \cdots \subset G_{k-1} \subset G_k = G$ of decomposable graphs that differ by exactly one edge.

Assume $G_t \nsubseteq G_a$; then $|E_t| > |E_a^1|$. By Lemma A.1, there exists a decreasing sequence of decomposable graphs from $G_c$ to $G_a$ that differ by exactly one edge, say $\{\bar G_i^{ca}\}_{i=0}^{|E_c|-|E_a|}$, where $G_c = \bar G_0^{ca} \supset \bar G_1^{ca} \supset \cdots \supset \bar G_{|E_c|-|E_a|-1}^{ca} \supset \bar G_{|E_c|-|E_a|}^{ca} = G_a$. There are $|E_c| - |E_a|$ steps for moving from $G_c$ to $G_a$. Let $\{\rho_{\bar x_i \bar y_i | \bar S_i}\}_{i=1}^{|E_c|-|E_a|}$ be the corresponding population partial correlation (or correlation, when $\bar S_i = \emptyset$) sequence and $\{\mathrm{BF}(\bar G_i^{ca}; \bar G_{i-1}^{ca})\}_{i=1}^{|E_c|-|E_a|}$ be the corresponding Bayes factor sequence for each step. By that, we mean in the $i$th step, edge $(\bar x_i, \bar y_i)$ is removed; $\rho_{\bar x_i \bar y_i | \bar S_i}$ and $\mathrm{BF}(\bar G_i^{ca}; \bar G_{i-1}^{ca})$ are the population partial correlation and the Bayes factor accordingly, $i = 1, 2, \ldots, |E_c| - |E_a|$. $\bar S_i$ is the specific separator corresponding to the $i$th step. Among them, $|E_t| - |E_a^1|$ steps are removals of true edges (true-edge deletion cases); $|E_c| - |E_a| - |E_t| + |E_a^1|$ steps are removals of false edges (false-edge deletion cases).
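To make the existence of such one-edge-at-a-time decomposable sequences concrete, here is a small sketch (a hypothetical example built on networkx; the specific choice of $G_c$ and $G_a$ is ours, not the paper's) that greedily deletes edges of the complete graph absent from a target decomposable graph while keeping every intermediate graph chordal, as guaranteed by Lemma A.1:

```python
import networkx as nx

p = 5
G_c = nx.complete_graph(p)   # complete graph (trivially decomposable)
G_a = nx.path_graph(p)       # an example target decomposable graph

G = G_c.copy()
extra = set(G_c.edges()) - set(G_a.edges())   # false edges to delete
steps = []
while extra:
    for e in sorted(extra):
        H = G.copy()
        H.remove_edge(*e)
        if nx.is_chordal(H):                  # intermediate graph stays decomposable
            G, steps = H, steps + [e]
            extra.remove(e)
            break
    else:
        raise RuntimeError("no single-edge decomposable move found")

print(len(steps), "single-edge deletions from G_c to G_a")  # 10 - 4 = 6
```

Lemma A.1 guarantees the greedy scan never gets stuck: at every intermediate decomposable graph containing $G_a$, at least one removable edge keeps the graph decomposable.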

Lemma A.2 (Origin of the exponential rate in true-edge deletion cases). Assume $G_t \nsubseteq G_a$. There exists at least one sequence of partial correlations $\{\rho_{\bar x_i \bar y_i | \bar S_i}\}_{i=1}^{|E_c|-|E_a|}$ for moving from $G_c$ to $G_a$ such that, among all population partial correlations in the sequence corresponding to the removal of true edges, at least one is non-zero and is not a population correlation ($\bar S_i \neq \emptyset$).

Proof. There are many sequences $\{(\bar x_i, \bar y_i)\}_{i=1}^{|E_c|-|E_a|}$ (in different orders) that achieve the move from $G_c$ to $G_a$ while maintaining decomposability along the way. Let $(\bar x, \bar y) \in E_t \setminus E_a^1$; thus $(\bar x, \bar y) \in \{(\bar x_i, \bar y_i)\}_{i=1}^{|E_c|-|E_a|}$. Choose $(\bar x_1, \bar y_1) = (\bar x, \bar y)$; that is, the first step is the removal of a true edge in $E_t \setminus E_a^1$ from $G_c$. (The remaining steps can be arbitrary as long as they maintain decomposability.) Let $\bar S$ be the corresponding separator for the first step; then $\bar S = V \setminus \{\bar x, \bar y\}$, since $(\bar x, \bar y)$ is removed from $G_c$. In fact, the removal of any edge from a complete graph maintains decomposability, so $\bar G_1^{ca}$ is a decomposable graph. Since $(\bar x, \bar y) \in E_t$, by the pairwise Markov property, $\rho_{\bar x \bar y | V \setminus \{\bar x, \bar y\}} \neq 0$, and $\rho_L \leq |\rho_{\bar x \bar y | V \setminus \{\bar x, \bar y\}}| \leq \rho_U$. Therefore, the partial correlation sequence starting with $\rho_{\bar x \bar y | V \setminus \{\bar x, \bar y\}}$ satisfies all conditions of the lemma, since $\rho_{\bar x \bar y | V \setminus \{\bar x, \bar y\}} \neq 0$ and it corresponds to the removal of the true edge $(\bar x, \bar y)$. This proves existence and exhibits such a sequence. □

Lemma A.3 (The inheritance of separators). Let $G = (V, E)$ and $G' = (V, E')$ be two undirected graphs (not necessarily decomposable). Assume $E \subseteq E'$. If $S \subset V$ separates node $x \in V$ from node $y \in V$ in $G'$, where $(x, y) \notin E'$, then $S$ also separates them in $G$.

Proof. First, if $x$ and $y$ are not connected in $G$ (in particular, they are not adjacent), any node set separates them by definition, and so does $S$; the lemma holds trivially in this case. When $x$ and $y$ are connected in $G$, assume $S$ does not separate $x$ from $y$ in $G$. By the definition of separators (using the fact that $x$ and $y$ are connected), there exists a path from $x$ to $y$ in $G$, say $x = v_0, v_1, \ldots, v_{l-1}, v_l = y$, with $v_i \notin S$ for all $i = 0, 1, \ldots, l$. Since $E \subseteq E'$, this path from $x$ to $y$ is still a path from $x$ to $y$ in $G'$. By the definition of separators again, $S$ does not separate $x$ from $y$ in $G'$, which contradicts the assumption of the lemma. Therefore, $S$ separates $x$ from $y$ in $G$. □
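The separator property used in this proof is easy to check computationally: $S$ separates $x$ from $y$ exactly when $x$ and $y$ lie in different connected components of the graph with $S$ removed. A minimal sketch (hypothetical example graphs, assuming networkx):

```python
import networkx as nx

def separates(G, S, x, y):
    """True iff the node set S separates x from y in G."""
    H = G.copy()
    H.remove_nodes_from(S)
    return not nx.has_path(H, x, y)

G_sub = nx.path_graph(4)      # subgraph G with edges 0-1, 1-2, 2-3
G_sup = G_sub.copy()
G_sup.add_edge(0, 2)          # supergraph G' with one extra edge, E ⊆ E'

S = {1, 2}
assert separates(G_sup, S, 0, 3)   # S separates 0 and 3 in the supergraph...
assert separates(G_sub, S, 0, 3)   # ...hence also in the subgraph (Lemma A.3)
```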

Assume $G_t \subsetneq G_a$; thus $|E_t| = |E_a^1|$. By Lemma A.1, there exists an increasing sequence of decomposable graphs from $G_t$ to $G_a$ that differ by exactly one edge, say $\{\tilde G_i^{ta}\}_{i=0}^{|E_a|-|E_t|}$, where $G_t = \tilde G_0^{ta} \subset \tilde G_1^{ta} \subset \cdots \subset \tilde G_{|E_a|-|E_t|-1}^{ta} \subset \tilde G_{|E_a|-|E_t|}^{ta} = G_a$. There are $|E_a| - |E_t|$ steps for moving from $G_t$ to $G_a$, all of which are additions of false edges (false-edge addition cases). Let $\{\rho_{\tilde x_i \tilde y_i | \tilde S_i}\}_{i=1}^{|E_a|-|E_t|}$ be the corresponding population partial correlation (or correlation, when $\tilde S_i = \emptyset$) sequence and $\{\mathrm{BF}(\tilde G_i^{ta}; \tilde G_{i-1}^{ta})\}_{i=1}^{|E_a|-|E_t|}$ be the corresponding Bayes factor sequence for each step. By that, we mean in the $i$th step, edge $(\tilde x_i, \tilde y_i) \notin E_t$ is added; $\rho_{\tilde x_i \tilde y_i | \tilde S_i}$ and $\mathrm{BF}(\tilde G_i^{ta}; \tilde G_{i-1}^{ta})$ are the population partial correlation and the Bayes factor accordingly, $i = 1, 2, \ldots, |E_a| - |E_t|$. $\tilde S_i$ is the specific separator corresponding to the $i$th step.

Lemma A.4 (Origin of the polynomial rate in false-edge addition cases). Assume $G_t \subsetneq G_a$. For any edge sequence $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_a|-|E_t|}$ from $G_t$ to $G_a$ described above, all population partial correlations in $\{\rho_{\tilde x_i \tilde y_i | \tilde S_i}\}_{i=1}^{|E_a|-|E_t|}$ are zero (or correlations, when $\tilde S_i = \emptyset$).

Proof. Assume in the $i$th step we add edge $(\tilde x_i, \tilde y_i) \notin E_t$ to graph $\tilde G_{i-1}^{ta}$, with $\tilde S_i$ the corresponding separator, where $1 \leq i \leq |E_a| - |E_t|$.

First, consider $\tilde S_i \neq \emptyset$; this case is shown in Figure 11. Since edge $(\tilde x_i, \tilde y_i) \notin E_t$ is added in the $i$th step, by Lemma S.3.1, $C_{\tilde x_i}$ and $C_{\tilde y_i}$ are adjacent in some junction tree of $\tilde G_{i-1}^{ta}$, where $C_{\tilde x_i}$ and $C_{\tilde y_i}$ are the cliques that contain $\tilde x_i$ and $\tilde y_i$, respectively, and $\tilde S_i$ is the separator between them, i.e., $\tilde S_i = C_{\tilde x_i} \cap C_{\tilde y_i}$. By the property of junction trees, $\tilde S_i$ separates $\tilde x_i$ from $\tilde y_i$ in $\tilde G_{i-1}^{ta}$. Since $\{\tilde G_i^{ta}\}_{i=0}^{|E_a|-|E_t|}$ is a sequence increasing one edge at a time, by Lemma A.3, $\tilde S_i$ also separates $\tilde x_i$ from $\tilde y_i$ in $\tilde G_0^{ta} = G_t$. By the global Markov property, $\rho_{\tilde x_i \tilde y_i | \tilde S_i} = 0$.

Figure 11. $\tilde G_{i-1}^{ta}$ before adding edge $(\tilde x_i, \tilde y_i) \notin E_t$, where $\tilde S_i \neq \emptyset$.

Next, when $\tilde S_i = \emptyset$ (see Figure 12), we show $\rho_{\tilde x_i \tilde y_i} = 0$. By the property of junction trees, nodes $\tilde x_i$ and $\tilde y_i$ are disconnected. Furthermore, in the current graph $\tilde G_{i-1}^{ta}$, the nodes before clique $C_{\tilde x_i}$ (including the nodes in $C_{\tilde x_i}$) and the nodes after clique $C_{\tilde y_i}$ (including the nodes in $C_{\tilde y_i}$) are disconnected. Since $G_t \subset \tilde G_{i-1}^{ta}$, the same holds in $G_t$. We can therefore rearrange the precision matrix of $G_t$ into a block matrix such that the block containing $\tilde x_i$ and the block containing $\tilde y_i$ are independent. Hence $\tilde x_i$ and $\tilde y_i$ are marginally independent in $G_t$, i.e., $\rho_{\tilde x_i \tilde y_i} = 0$. Notice that the lemma still holds when $G_a = G_c$. □

Figure 12. $\tilde G_{i-1}^{ta}$ before adding edge $(\tilde x_i, \tilde y_i) \notin E_t$, where $\tilde S_i = \emptyset$.

For the rest of the proofs, when $G_t \nsubseteq G_a$, the move from $G_c$ to $G_a$ is restricted to the sequence of Lemma A.2 (deleting a true edge at the beginning); when $G_t \subsetneq G_a$, the move from $G_t$ to $G_a$ (or $G_c$) can be a sequence with any order of false-edge additions (as long as decomposability is maintained), by Lemma A.4. Note that there are many other sequences of decomposable graphs for moving from $G_c$ to $G_a$ and from $G_t$ to $G_a$; we choose the sequences of Lemmas A.2 and A.4 mainly to simplify the proofs. The specific choice of sequence does not affect the final consistency results, but it does determine the sufficient conditions of each theorem; other sequences could also be used to prove the consistency results, but they may result in different assumptions. Following the notation of Lemmas A.2 and A.4, we have the following decomposition of the Bayes factor in favor of $G_a$. (i) When $G_t \nsubseteq G_a$,

$$\begin{aligned}
\mathrm{BF}(G_a; G_t) &= \frac{f(\mathbf{Y} \mid G_a)}{f(\mathbf{Y} \mid G_t)} = \frac{f(\mathbf{Y} \mid G_a)}{f(\mathbf{Y} \mid G_c)} \cdot \frac{f(\mathbf{Y} \mid G_c)}{f(\mathbf{Y} \mid G_t)} \\
&= \frac{p(\mathbf{Y} \mid G_a)}{p(\mathbf{Y} \mid \bar G_{|E_c|-|E_a|-1}^{ca})} \cdot \frac{p(\mathbf{Y} \mid \bar G_{|E_c|-|E_a|-1}^{ca})}{p(\mathbf{Y} \mid \bar G_{|E_c|-|E_a|-2}^{ca})} \cdots \frac{p(\mathbf{Y} \mid \bar G_2^{ca})}{p(\mathbf{Y} \mid \bar G_1^{ca})} \cdot \frac{p(\mathbf{Y} \mid \bar G_1^{ca})}{p(\mathbf{Y} \mid G_c)} \times \frac{p(\mathbf{Y} \mid G_c)}{p(\mathbf{Y} \mid \tilde G_{|E_c|-|E_t|-1}^{tc})} \cdots \frac{p(\mathbf{Y} \mid \tilde G_1^{tc})}{p(\mathbf{Y} \mid G_t)} \\
&= \prod_{i=1}^{|E_c|-|E_a|} \mathrm{BF}(\bar G_i^{ca}; \bar G_{i-1}^{ca}) \cdot \prod_{i=1}^{|E_c|-|E_t|} \mathrm{BF}(\tilde G_i^{tc}; \tilde G_{i-1}^{tc}) = \mathrm{BF}^{ca} \cdot \mathrm{BF}^{tc},
\end{aligned}$$

$$\mathrm{PR}(G_a; G_t) = \frac{p(G_a \mid \mathbf{Y})}{p(G_t \mid \mathbf{Y})} = \frac{f(\mathbf{Y} \mid G_a)\,\pi(G_a)}{f(\mathbf{Y} \mid G_t)\,\pi(G_t)} = \mathrm{BF}(G_a; G_t)\,\frac{\pi(G_a)}{\pi(G_t)} = \mathrm{BF}^{ca}\,\mathrm{BF}^{tc} \left(\frac{q}{1-q}\right)^{|E_a|-|E_t|}.$$

$\mathrm{BF}^{ca}$ contains $|E_c| - |E_a|$ terms, of which $|E_t| - |E_a^1|$ are true-edge deletion cases and $|E_c| - |E_a| - |E_t| + |E_a^1|$ are false-edge deletion cases. $\mathrm{BF}^{tc}$ has $|E_c| - |E_t|$ terms, all of which are false-edge addition cases. (ii) When $G_t \subsetneq G_a$,

$$\mathrm{BF}(G_a; G_t) = \prod_{i=1}^{|E_a|-|E_t|} \mathrm{BF}(\tilde G_i^{ta}; \tilde G_{i-1}^{ta}) = \mathrm{BF}^{ta},$$

$$\mathrm{PR}(G_a; G_t) = \mathrm{BF}^{ta} \left(\frac{q}{1-q}\right)^{|E_a|-|E_t|}.$$

Here we introduce notation such as $\mathrm{BF}^{ca}$. Its purpose is to indicate the direction of movement in the sequence of decomposable graphs: the traditional Bayes factor notation, in this case $\mathrm{BF}(G_a; G_c)$, might be read as denoting a movement from $G_a$ to $G_c$, whereas we view it as the movement from $G_c$ to $G_a$. These two directions are very different, one corresponding to adding edges and the other to deleting edges; the additional notation keeps the two cases distinct.
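As a concrete instance of decomposition (i), consider the $\mathcal{D}_3$ example of Section 6.1, where $E_t = \{(1,2), (2,3)\}$, and take $G_a = G_0$, the null graph, so that $G_t \nsubseteq G_0$. Then $\mathrm{BF}(G_0; G_t) = \mathrm{BF}^{ca}\,\mathrm{BF}^{tc}$, where $\mathrm{BF}^{ca} = \mathrm{BF}(G_0; G_c)$ telescopes over $|E_c| - |E_0| = 3$ single-edge deletions (two true-edge deletions, for $(1,2)$ and $(2,3)$, and one false-edge deletion, for $(1,3)$), while $\mathrm{BF}^{tc} = \mathrm{BF}(G_c; G_t)$ telescopes over the single false-edge addition of $(1,3)$. The true-edge deletion steps are what produce the exponential decay recorded in the first row of Table 1, whose leading term involves the correlations of exactly those two true edges.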

A.1. Proof of Theorem 4.1

First, for any $\tau^* > 2$, let $\epsilon_{1,n} = \left\{\frac{\log(n-p)}{\tau^*(n-p)}\right\}^{1/2}$. Then define

$$R_{ij|S} = \left\{\left|\hat\rho_{ij|S} - \rho_{ij|S}\right| < \epsilon_{1,n}\right\}.$$

Given any decomposable graph $G_a \neq G_t$: when $G_t \nsubseteq G_a$, by Lemma A.2 we have the edge sequence $\{(\bar x_i, \bar y_i)\}_{i=1}^{|E_c|-|E_a|}$ for moving from $G_c$ to $G_a$, and we let $(\bar x_1, \bar y_1) = (\bar x, \bar y)$ be the first element of the sequence, where a true edge is deleted from $G_c$. Let $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_c|-|E_t|}$ and $\{\tilde S_i\}_{i=1}^{|E_c|-|E_t|}$ be the edge sequence and the corresponding separator sequence for moving from $G_t$ to $G_c$ according to Lemma A.4. Let

$$\Delta_{t \nsubseteq a,\, \epsilon_1} = R_{\bar x \bar y | V \setminus \{\bar x, \bar y\}} \cap \left(\bigcap_{i=1}^{|E_c|-|E_t|} R_{\tilde x_i \tilde y_i | \tilde S_i}\right).$$

Since $\rho_U \neq 1$, by the proof of Lemma S.1.1 we have

$$P(\Delta_{t \nsubseteq a,\, \epsilon_1}) \geq P(\Delta_{\epsilon_1}) \geq 1 - \frac{4\sqrt{2}\, p^2}{(1-\rho_U)^2}\,(n-p)^{-\frac{\tau^*}{4}} \left\{\frac{1}{\tau^*}\log(n-p)\right\}^{-\frac{1}{2}}.$$

When $G_t \subsetneq G_a$, let $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_a|-|E_t|}$ and $\{\tilde S_i\}_{i=1}^{|E_a|-|E_t|}$ be the edge sequence and the corresponding separator sequence for moving from $G_t$ to $G_a$ according to Lemma A.4. (Notice that we use the same edge and separator notation as for the move from $G_t$ to $G_c$, for consistency; the move from $G_t$ to $G_a$ can be seen as part of the move from $G_t$ to $G_c$.) Let

$$\Delta_{t \subsetneq a,\, \epsilon_1} = \bigcap_{i=1}^{|E_a|-|E_t|} R_{\tilde x_i \tilde y_i | \tilde S_i}.$$

Since $\rho_U \neq 1$, by the proof of Lemma S.1.1 we also have

$$P(\Delta_{t \subsetneq a,\, \epsilon_1}) \geq P(\Delta_{\epsilon_1}) \geq 1 - \frac{4\sqrt{2}\, p^2}{(1-\rho_U)^2}\,(n-p)^{-\frac{\tau^*}{4}} \left\{\frac{1}{\tau^*}\log(n-p)\right\}^{-\frac{1}{2}}.$$

Thus, $\Delta_{a,\epsilon_1} = \Delta_{t \nsubseteq a,\,\epsilon_1}$ when $G_t \nsubseteq G_a$ and $\Delta_{a,\epsilon_1} = \Delta_{t \subsetneq a,\,\epsilon_1}$ when $G_t \subsetneq G_a$. For the following proof, we restrict attention to the event $\Delta_{a,\epsilon_1}$. Next, we consider two scenarios for Bayes factor consistency, i.e., $G_t \nsubseteq G_a$ and $G_t \subsetneq G_a$.

First, when $G_t \nsubseteq G_a$ and $G_t \neq G_c$, we have $|E_t| > |E_a^1|$ and $|E_c| > |E_t|$. We begin by bounding $\mathrm{BF}^{tc}$ from above. (For $G_t = G_c$, $\mathrm{BF}^{tc} = 1$.) By Lemmas S.3.1 and A.4,

$$\begin{aligned}
\mathrm{BF}^{tc} &= \prod_{i=1}^{|E_c|-|E_t|} \mathrm{BF}(\tilde G_i^{tc}; \tilde G_{i-1}^{tc}) < \prod_{i=1}^{|E_c|-|E_t|} \left\{\frac{g}{g+1}\,\frac{b+n+d_{\tilde S_i}}{b+d_{\tilde S_i}}\right\}^{\frac{1}{2}} \left(1 - \hat\rho^2_{\tilde x_i \tilde y_i | \tilde S_i}\right)^{-\frac{n}{2}} \\
&< \left(\frac{2}{n}\right)^{\frac{|E_c|-|E_t|}{2}} \left\{1 - \frac{\log(n-p)}{\tau^*(n-p)}\right\}^{-\frac{(|E_c|-|E_t|)\,n}{2}} \qquad \text{when } n > b+p \\
&< \left(\frac{2}{n}\right)^{\frac{|E_c|-|E_t|}{2}} \exp\left\{\frac{|E_c|-|E_t|}{\tau^*}\log n\right\} \qquad \text{when } n > 4p \\
&< \exp\left\{p^2 - \left(\frac{1}{2}-\frac{1}{\tau^*}\right)(|E_c|-|E_t|)\log n\right\}.
\end{aligned}$$

Next, we examine $\mathrm{BF}^{ca}$. Based on Lemma A.2 and its proof, we divide it into two parts: true-edge deletion cases and false-edge deletion cases. For true-edge deletion cases, we use $\{(\bar x_i^d, \bar y_i^d)\}_{i=1}^{|E_t|-|E_a^1|}$ to denote the sequence of true edges, with $\{\bar S_i^d\}_{i=1}^{|E_t|-|E_a^1|}$ the corresponding separator sequence. For false-edge deletion cases, we use $\{(\bar x_i^a, \bar y_i^a)\}_{i=1}^{|E_c|-|E_a|-|E_t|+|E_a^1|}$ and $\{\bar S_i^a\}_{i=1}^{|E_c|-|E_a|-|E_t|+|E_a^1|}$. Since $p$ is finite, by its definition $\rho_L$ is a positive finite constant. Then

$$\begin{aligned}
\mathrm{BF}^{ca} &= \prod_{i=1}^{|E_c|-|E_a|} \mathrm{BF}(\bar G_i^{ca}; \bar G_{i-1}^{ca}) \\
&< \prod_{i=1}^{|E_t|-|E_a^1|} \left\{\left(1+\frac{1}{g}\right)\frac{b+d_{\bar S_i^d}}{b+n+d_{\bar S_i^d}}\right\}^{\frac{1}{2}} \left(1-\hat\rho^2_{\bar x_i^d \bar y_i^d | \bar S_i^d}\right)^{\frac{n}{2}} \times \prod_{i=1}^{|E_c|-|E_a|-|E_t|+|E_a^1|} \left\{\left(1+\frac{1}{g}\right)\frac{b+d_{\bar S_i^a}}{b+n+d_{\bar S_i^a}}\right\}^{\frac{1}{2}} \left(1-\hat\rho^2_{\bar x_i^a \bar y_i^a | \bar S_i^a}\right)^{\frac{n}{2}} \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \left(1-\hat\rho^2_{\bar x \bar y | V \setminus \{\bar x, \bar y\}}\right)^{\frac{n}{2}} \qquad \text{(w.l.o.g. assume } p > b) \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \left\{1-\left(\epsilon_{1,n} - \left|\rho_{\bar x \bar y | V \setminus \{\bar x, \bar y\}}\right|\right)^2\right\}^{\frac{n}{2}} \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \exp\left(-\frac{n\rho_L^2}{2} + n\epsilon_{1,n} - \frac{n\epsilon_{1,n}^2}{2}\right) \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \exp\left\{-\frac{n\rho_L^2}{2} + \sqrt{n\log n} - \frac{1}{2\tau^*}\log(n-p)\right\} \qquad \text{when } n > 2p \\
&< \exp\left\{-\frac{n\rho_L^2}{2} + p^2\log n + \sqrt{n\log n} - \frac{1}{2\tau^*}\log(n-p) + 2p^2\log p\right\} \qquad \text{when } n > 1.
\end{aligned}$$

Let $\delta(n) = p^2\log n + \sqrt{n\log n} + 3p^2\log p$, so that $\delta(n)/n \to 0$ as $n \to \infty$. Hence,

$$\mathrm{BF}(G_a; G_t \mid G_t \nsubseteq G_a) = \mathrm{BF}^{ca}\,\mathrm{BF}^{tc} < \exp\left\{-\frac{n\rho_L^2}{2} + \delta(n)\right\}.$$

When $G_t \subsetneq G_a$, by Lemmas S.3.1 and A.4 we have

$$\mathrm{BF}(G_a; G_t \mid G_t \subsetneq G_a) = \prod_{i=1}^{|E_a|-|E_t|} \mathrm{BF}(\tilde G_i^{ta}; \tilde G_{i-1}^{ta}) < \exp\left\{p^2 - \left(\frac{1}{2}-\frac{1}{\tau^*}\right)(|E_a|-|E_t|)\log n\right\}.$$

A.2. Proof of Theorem 4.2

From $\lambda < \frac{1}{2} - \alpha$, we have $\alpha + \lambda < \frac{1}{2}$; from $\lambda < \frac{1}{4}$, we have $2\lambda < \frac{1}{2}$. For any $\beta^*$ that satisfies

$$\max\left\{2\lambda,\ \alpha+\lambda,\ \frac{1-\gamma}{2}\right\} < \beta^* < \frac{1}{2},$$

let $\epsilon_{2,n} = (n-p)^{-\beta^*}$. Then define

$$R_{ij|S} = \left\{\left|\hat\rho_{ij|S} - \rho_{ij|S}\right| < \epsilon_{2,n}\right\}.$$

Given any decomposable graph $G_a \neq G_t$: when $G_t \nsubseteq G_a$, by Lemma A.2 we have the edge sequence $\{(\bar x_i, \bar y_i)\}_{i=1}^{|E_c|-|E_a|}$ for moving from $G_c$ to $G_a$, and we let $(\bar x_1, \bar y_1) = (\bar x, \bar y)$ be the first element of the sequence, where a true edge is deleted from $G_c$. Let $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_c|-|E_t|}$ and $\{\tilde S_i\}_{i=1}^{|E_c|-|E_t|}$ be the edge sequence and the corresponding separator sequence for moving from $G_t$ to $G_c$ according to Lemma A.4. Let

$$\Delta_{t \nsubseteq a,\, \epsilon_2(n)} = R_{\bar x \bar y | V \setminus \{\bar x, \bar y\}} \cap \left(\bigcap_{i=1}^{|E_c|-|E_t|} R_{\tilde x_i \tilde y_i | \tilde S_i}\right).$$

Since $0 < \beta^* < \frac{1}{2}$ and Assumption 4.5 holds, by Lemma S.1.2, as $n \to \infty$,

$$P\{\Delta_{t \nsubseteq a,\, \epsilon_2(n)}\} \geq P\{\Delta_{\epsilon_2(n)}\} \geq 1 - \frac{4\sqrt{2}\, p^2}{(1-\rho_U)^2}\,(n-p)^{\beta^*-\frac{1}{2}} \exp\left\{-\frac{1}{4}(n-p)^{1-2\beta^*}\right\} \to 1.$$

When $G_t \subsetneq G_a$, let $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_a|-|E_t|}$ and $\{\tilde S_i\}_{i=1}^{|E_a|-|E_t|}$ be the edge sequence and the corresponding separator sequence for moving from $G_t$ to $G_a$ according to Lemma A.4. (As before, we use the same edge and separator notation as for the move from $G_t$ to $G_c$; the move from $G_t$ to $G_a$ can be seen as part of the move from $G_t$ to $G_c$.) Let

$$\Delta_{t \subsetneq a,\, \epsilon_2(n)} = \bigcap_{i=1}^{|E_a|-|E_t|} R_{\tilde x_i \tilde y_i | \tilde S_i}.$$

Since $0 < \beta^* < \frac{1}{2}$ and Assumption 4.5 holds, by Lemma S.1.2, as $n \to \infty$,

$$P\{\Delta_{t \subsetneq a,\, \epsilon_2(n)}\} \geq P\{\Delta_{\epsilon_2(n)}\} \geq 1 - \frac{4\sqrt{2}\, p^2}{(1-\rho_U)^2}\,(n-p)^{\beta^*-\frac{1}{2}} \exp\left\{-\frac{1}{4}(n-p)^{1-2\beta^*}\right\} \to 1.$$

Thus, $\Delta_{a,\epsilon_2(n)} = \Delta_{t \nsubseteq a,\,\epsilon_2(n)}$ when $G_t \nsubseteq G_a$ and $\Delta_{a,\epsilon_2(n)} = \Delta_{t \subsetneq a,\,\epsilon_2(n)}$ when $G_t \subsetneq G_a$. For the following proof, we restrict attention to the event $\Delta_{a,\epsilon_2(n)}$. Similar to the proof of Theorem 4.1, we consider two scenarios for posterior ratio consistency, i.e., $G_t \nsubseteq G_a$ and $G_t \subsetneq G_a$.

First, when $G_t \nsubseteq G_a$ and $G_t \neq G_c$, we have $|E_t| > |E_a^1|$ and $|E_c| > |E_t|$. (For $G_t = G_c$, $\mathrm{BF}^{tc} = 1$.) By Lemmas S.3.1 and A.4,

$$\begin{aligned}
\mathrm{BF}^{tc} &= \prod_{i=1}^{|E_c|-|E_t|} \mathrm{BF}(\tilde G_i^{tc}; \tilde G_{i-1}^{tc}) < \left(\frac{2}{n}\right)^{\frac{|E_c|-|E_t|}{2}} \left\{1-(n-p)^{-2\beta^*}\right\}^{-\frac{(|E_c|-|E_t|)\,n}{2}} \qquad \text{when } n > b+p \\
&< \left(\frac{2}{n}\right)^{\frac{|E_c|-|E_t|}{2}} \left\{1+2(n-p)^{-2\beta^*}\right\}^{\frac{(|E_c|-|E_t|)\,n}{2}} \qquad \text{when } n > \max\{2p,\ 2^{\frac{1}{2\beta^*}}+1\} \\
&< \exp\left\{n p^2 (n-p)^{-2\beta^*} - \frac{|E_c|-|E_t|}{4}\log n\right\} \qquad \text{when } n > 4.
\end{aligned}$$

Similar to the proof of Theorem 4.1, we have

$$\begin{aligned}
\mathrm{BF}^{ca} &= \prod_{i=1}^{|E_c|-|E_a|} \mathrm{BF}(\bar G_i^{ca}; \bar G_{i-1}^{ca}) < \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \left(1-\hat\rho^2_{\bar x \bar y | V \setminus \{\bar x, \bar y\}}\right)^{\frac{n}{2}} \qquad \text{when } p > b \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \exp\left(-\frac{n\rho_L^2}{2} + n\epsilon_{2,n} - \frac{n\epsilon_{2,n}^2}{2}\right) \\
&< \{2p(n+1)\}^{\frac{|E_c|-|E_a|}{2}} \exp\left\{-\frac{n\rho_L^2}{2} + n(n-p)^{-\beta^*} - \frac{1}{2} n^{1-2\beta^*}\right\} \\
&< \exp\left\{-\frac{n\rho_L^2}{2} + n(n-p)^{-\beta^*} - \frac{1}{2} n^{1-2\beta^*} + 3p^2\log n\right\}.
\end{aligned}$$

When $n > 3\exp\{(1-2\beta^*)^{-2}\}$, we have $n(n-p)^{-2\beta^*} > 3\log n$. Hence,

$$\mathrm{BF}(G_a; G_t \mid G_t \nsubseteq G_a) < \exp\left\{-\frac{n\rho_L^2}{2} + n(n-p)^{-\beta^*} - \frac{1}{2} n^{1-2\beta^*} + 2n p^2 (n-p)^{-2\beta^*}\right\}.$$

Therefore, when $G_t \nsubseteq G_a$, for $n > (\log 2 / C_q)^{1/\gamma}$,

$$\mathrm{PR}(G_a; G_t \mid G_t \nsubseteq G_a) < \exp\left\{-\frac{n\rho_L^2}{2} + n(n-p)^{-\beta^*} - \frac{1}{2} n^{1-2\beta^*} + 2n p^2 (n-p)^{-2\beta^*} + (|E_a|-|E_t|)\log(2q)\right\}.$$

By the construction of $\beta^*$, we have

$$1-2\lambda > \max\left\{2\alpha,\ 1-2\beta^*,\ 1-\beta^*,\ 1+2\alpha-2\beta^*\right\},$$

and $1-2\lambda > \sigma + \gamma$. Therefore, $-\frac{n\rho_L^2}{2}$ is the leading term in the exponent of the upper bound on $\mathrm{PR}(G_a; G_t \mid G_t \nsubseteq G_a)$. Thus, $\mathrm{PR}(G_a; G_t) \to 0$ as $n \to \infty$ when $G_t \nsubseteq G_a$.

When $G_t \subsetneq G_a$, by Lemmas S.3.1 and A.4, we have

$$\mathrm{BF}(G_a; G_t \mid G_t \subsetneq G_a) < \exp\left\{(|E_a|-|E_t|)\, n (n-p)^{-2\beta^*}\right\},$$
$$\mathrm{PR}(G_a; G_t \mid G_t \subsetneq G_a) < \exp\left\{(|E_a|-|E_t|)\, n (n-p)^{-2\beta^*} + (|E_a|-|E_t|)\log(2q)\right\}.$$

Since $\beta^* > \frac{1-\gamma}{2}$, the term $(|E_a|-|E_t|)\log(2q)$ dominates the exponent above, and $|E_a|-|E_t| > 0$. Therefore, $\mathrm{PR}(G_a; G_t \mid G_t \subsetneq G_a) \to 0$ as $n \to \infty$.

Appendix B: Equivalence of minimal triangulations when Gt is not decomposable

Let $G_m = (V, E_m)$ be any minimal triangulation of $G_t$, where $E_m = E_t \cup F$ with $F \neq \emptyset$. Here $G_a$ denotes any decomposable graph other than a minimal triangulation of $G_t$. Since $G_m$ is a minimal triangulation, $E_a \neq E_t \cup F'$ for any $F' \subsetneq F$. Unlike the case where $G_t$ is decomposable, there are three cases here: (1) $|E_a^1| < |E_m^1| = |E_t|$, so that $G_m \nsubseteq G_a$; (2) $|E_a^1| = |E_m^1| = |E_t|$ and $G_m \subsetneq G_a$; (3) $|E_a^1| = |E_m^1| = |E_t|$ and $G_m \nsubseteq G_a$. In case (3) there exists at least one minimal triangulation of $G_t$ that is a subgraph of $G_a$, and in both (2) and (3) we have $|E_m| < |E_a|$.

For case (1), when $|E_a^1| < |E_m^1| = |E_t|$, i.e., one of the two cases where $G_m \nsubseteq G_a$, we inherit all notation from Lemma A.2: $\{(\bar x_i, \bar y_i)\}_{i=1}^{|E_c|-|E_a|}$ is the edge sequence from $G_c$ to $G_a$ and $\{\rho_{\bar x_i \bar y_i | \bar S_i}\}_{i=1}^{|E_c|-|E_a|}$ is the corresponding population partial correlation sequence. Lemma A.2 still holds here; that is, we can find a sequence of partial correlations $\{\rho_{\bar x_i \bar y_i | \bar S_i}\}_{i=1}^{|E_c|-|E_a|}$ in which at least one population partial correlation corresponding to the removal of a true edge is non-zero and is not a correlation. The proof proceeds exactly as in Lemma A.2: simply let the first step of the move from $G_c$ to $G_a$ be the deletion of a true edge that is missing in $G_a$. For case (3), where $|E_a^1| = |E_m^1| = |E_t|$ but $G_m \nsubseteq G_a$, all steps of the move from $G_c$ to $G_a$ are false-edge deletion cases; there is no true-edge deletion, since $G_a$ contains all the true edges of $G_t$.

For case (2), when $G_m \subsetneq G_a$ and $|E_a^1| = |E_m^1| = |E_t|$, we still use $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_a|-|E_m|}$ to denote the sequence of edges added in each step from $G_m$ to $G_a$, and $\{\rho_{\tilde x_i \tilde y_i | \tilde S_i}\}_{i=1}^{|E_a|-|E_m|}$ is the corresponding population partial correlation sequence. A version of Lemma A.4 still holds here.

Lemma B.1. For any edge sequence $\{(\tilde x_i, \tilde y_i)\}_{i=1}^{|E_a|-|E_m|}$ from $G_m$ to $G_a$ described above in case (2), all population partial correlations in $\{\rho_{\tilde x_i \tilde y_i | \tilde S_i}\}_{i=1}^{|E_a|-|E_m|}$ are zero (or correlations, when $\tilde S_i = \emptyset$).

Proof. The proof follows that of Lemma A.4. Assume in the $i$th step we add edge $(\tilde x_i, \tilde y_i) \notin E_t$ to graph $\tilde G_{i-1}^{ma}$, with $\tilde S_i$ the corresponding separator.

When $\tilde S_i \neq \emptyset$: adding edge $(\tilde x_i, \tilde y_i) \notin E_t$ to $\tilde G_{i-1}^{ma}$ maintains decomposability of $\tilde G_i^{ma}$, so by Lemma S.3.1, $\tilde x_i$ and $\tilde y_i$ lie in two cliques that are adjacent in the current junction tree of $\tilde G_{i-1}^{ma}$. Thus, by the property of junction trees, $\tilde S_i$ separates $\tilde x_i$ from $\tilde y_i$ in $\tilde G_{i-1}^{ma}$. Since the sequence from $G_m$ to $G_a$ is increasing in edges, $G_m \subseteq \tilde G_{i-1}^{ma}$, and because $G_m$ is a minimal triangulation, $G_t \subset G_m \subseteq \tilde G_{i-1}^{ma}$. By Lemma A.3, $\tilde S_i$ separates node $\tilde x_i$ from $\tilde y_i$ in $G_t$, and hence $\rho_{\tilde x_i \tilde y_i | \tilde S_i} = 0$.

When $\tilde S_i = \emptyset$, $\tilde x_i$ and $\tilde y_i$ are disconnected in the current graph $\tilde G_{i-1}^{ma}$, and therefore also disconnected in $G_t$. Thus they are marginally independent in $G_t$, i.e., $\rho_{\tilde x_i \tilde y_i} = 0$. □

Remark B.1. For $|E_a^1| = |E_t|$ and $|E_a| - |E_a^1| \in \{0, \ldots, |F|-1\}$, no decomposable $G_a$ exists; for $|E_a^1| = |E_t|$ and $|E_a| - |E_a^1| > |F|$, at least one decomposable $G_a$ exists; but for $|E_a^1| < |E_t|$ and $|E_a| - |E_a^1| \geq 0$, a decomposable $G_a$ may not exist. The Bayes factor $\mathrm{BF}(G_a; G_m)$ under $|E_a^1| < |E_t|$ and $|E_a| - |E_a^1| \geq 0$ is only valid when a decomposable $G_a$ exists; otherwise it is defined to be zero.

B.1. Proof of Theorem 5.1

Part 1. For any given decomposable graph $G_a$ that is not a minimal triangulation of $G_t$, let

$$\tau^* > \max\left\{2,\ \frac{2(|E_c|-|E_m|)}{|E_a|-|E_m|}\right\}.$$

The construction of $\Delta_{a,\epsilon_1}$ is the same as in the proof of Theorem 4.1; we restrict the following proof to the event $\Delta_{a,\epsilon_1}$. For case (1), when $|E_a^1| < |E_m^1| = |E_t|$, we have

$$\mathrm{BF}^{mc} < \exp\left\{p^2 - \left(\frac{1}{2}-\frac{1}{\tau^*}\right)(|E_c|-|E_m|)\log n\right\} \to 0,$$
$$\mathrm{BF}^{ca} < \exp\left\{-\frac{n\rho_L^2}{2} + p^2\log n + \sqrt{n\log n} - \frac{1}{2\tau^*}\log(n-p) + 2p^2\log p\right\} \to 0.$$

Hence,

$$\mathrm{BF}(G_a; G_m \mid G_m \nsubseteq G_a,\ |E_a^1| < |E_m^1|) = \mathrm{BF}^{ca}\,\mathrm{BF}^{mc} \to 0.$$

For case (2), when $G_m \subsetneq G_a$, that is, $|E_a^1| = |E_m^1| = |E_t|$ and $|E_a| > |E_m|$, we have

$$\mathrm{BF}(G_a; G_m \mid G_m \subsetneq G_a) < \exp\left\{p^2 - \left(\frac{1}{2}-\frac{1}{\tau^*}\right)(|E_a|-|E_m|)\log n\right\} \to 0.$$

For case (3), when $|E_a^1| = |E_m^1| = |E_t|$ and $G_m \nsubseteq G_a$, also with $|E_a| > |E_m|$, we have

$$\mathrm{BF}^{mc} < 2^{p^2} n^{-\frac{|E_c|-|E_a|}{2}} \exp\left[-\left\{\frac{|E_a|-|E_m|}{2(|E_c|-|E_m|)} - \frac{1}{\tau^*}\right\}(|E_c|-|E_m|)\log n\right],$$
$$\mathrm{BF}^{ca} < (4p)^{p^2}\, n^{\frac{|E_c|-|E_a|}{2}} \qquad \text{when } n > 1.$$

Hence,

$$\mathrm{BF}(G_a; G_m \mid G_m \nsubseteq G_a,\ |E_a^1| = |E_m^1|) < (8p)^{p^2} \exp\left[-\left\{\frac{|E_a|-|E_m|}{2(|E_c|-|E_m|)} - \frac{1}{\tau^*}\right\}(|E_c|-|E_m|)\log n\right] \to 0.$$

Therefore, $\mathrm{BF}(G_a; G_m) \to 0$ as $n \to \infty$.

Part 2. Let $G_{m_1}, G_{m_2}, \ldots, G_{m_l}$ be all the minimal triangulations of $G_t$, where $l$ is a positive finite integer since the graph dimension is finite. By Part 1, on the event $\Delta_{a,\epsilon_1}$, $\mathrm{BF}(G_{m_i}; G_a) \to \infty$, $i = 1, 2, \ldots, l$, for any $G_a \notin \mathcal{M}_t$. Therefore,

$$\sum_{G_m \in \mathcal{M}_t} \pi(G_m \mid \mathbf{Y}) = \frac{\sum_{i=1}^{l} p(\mathbf{Y} \mid G_{m_i})}{\sum_{i=1}^{l} p(\mathbf{Y} \mid G_{m_i}) + \sum_{G_a \notin \mathcal{M}_t} p(\mathbf{Y} \mid G_a)} = \frac{1}{1 + \sum_{G_a \notin \mathcal{M}_t} \left\{\sum_{i=1}^{l} \mathrm{BF}(G_{m_i}; G_a)\right\}^{-1}} \to 1 \quad \text{as } n \to \infty.$$

B.2. Proof of Theorem 5.2

Let $\{\hat\rho_{m_1,i}\}_{i=1}^{|E_c|-|E_{m_1}|}$ and $\{\rho_{m_1,i}\}_{i=1}^{|E_c|-|E_{m_1}|}$ be the sample and population partial correlation sequences corresponding to each step from $G_{m_1}$ to $G_c$. By Lemma B.1, $\rho_{m_1,i} = 0$ for $i = 1, 2, \ldots, |E_c|-|E_{m_1}|$. By Lemma S.1.4, for any $0 < \epsilon < 1$ there exist $0 < M_1(\epsilon) < \frac{1}{4}$ and $M_2(\epsilon) > 3$ (the choice of $M_1$ and $M_2$ is the same as in the proof of Lemma S.1.4) such that $P(\Delta_\epsilon^0) > 1 - \frac{\epsilon}{2}$ for $n > p+3$. Let

$$R_{m_1,i} = \left\{\frac{M_1}{n} < \hat\rho_{m_1,i}^2 < \frac{M_2}{n-p}\right\}, \qquad \Delta_{m_1} \equiv \bigcap_{i=1}^{|E_c|-|E_{m_1}|} R_{m_1,i}.$$

Then

$$P(\Delta_{m_1}) \geq P(\Delta_\epsilon^0) \geq 1 - \frac{\epsilon}{2}.$$
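The event $\Delta_{m_1}$ is easy to visualize by simulation: for the false edge added in a step from $G_{m_1}$ toward $G_c$, the population partial correlation is zero, and the squared sample partial correlation scales like $1/n$. A Monte Carlo sketch (an assumed illustration using the $\mathcal{D}_4$ example of Section 6.2, not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
Omega4 = np.array([[ 2.0, -1.2,  0.0, -1.0],
                   [-1.2,  3.0, -1.2,  0.0],
                   [ 0.0, -1.2,  3.0, -1.0],
                   [-1.0,  0.0, -1.0,  2.0]])
Sigma4 = np.linalg.inv(Omega4)

# G_m1 adds diagonal 1-3 to the 4-cycle; moving on to G_c adds edge 2-4,
# whose population partial correlation given the rest is zero (Lemma B.1).
for n in [100, 1000, 10000]:
    Y = rng.multivariate_normal(np.zeros(4), Sigma4, size=n)
    K = np.linalg.inv(np.cov(Y, rowvar=False))        # sample precision matrix
    rho_hat = -K[1, 3] / np.sqrt(K[1, 1] * K[3, 3])   # edge 2-4 given {1, 3}
    print(n, n * rho_hat**2)                          # stays O(1) as n grows
```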

By Lemma S.3.1, when $n > b+p$, we have

$$\left(\frac{1}{2n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}} \prod_{i=1}^{|E_c|-|E_{m_1}|} \left(1-\hat\rho_{m_1,i}^2\right)^{-\frac{n}{2}} < \mathrm{BF}(G_c; G_{m_1}) < \left(\frac{2}{n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}} \prod_{i=1}^{|E_c|-|E_{m_1}|} \left(1-\hat\rho_{m_1,i}^2\right)^{-\frac{n}{2}}.$$

Under the event $\Delta_{m_1}$, when $n > p + M_2$,

$$\left(\frac{e^{M_1}}{2n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}} < \mathrm{BF}(G_c; G_{m_1}) < \left(\frac{2e^{2M_2}}{n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}}.$$

Thus we have

$$P\left\{\left(\frac{e^{M_1}}{2n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}} < \mathrm{BF}(G_c; G_{m_1}) < \left(\frac{2e^{2M_2}}{n}\right)^{\frac{|E_c|-|E_{m_1}|}{2}}\right\} > 1 - \frac{\epsilon}{2}.$$

Similarly,

$$P\left\{\left(\frac{2e^{2M_2}}{n}\right)^{-\frac{|E_c|-|E_{m_2}|}{2}} < \mathrm{BF}(G_{m_2}; G_c) < \left(\frac{e^{M_1}}{2n}\right)^{-\frac{|E_c|-|E_{m_2}|}{2}}\right\} > 1 - \frac{\epsilon}{2}.$$

Therefore, letting $A_1 = \left(4e^{2M_2}\right)^{-p^2}$ and $A_2 = \left(4e^{2M_2}\right)^{p^2}$, we have $P\{A_1 < \mathrm{BF}(G_{m_1}; G_{m_2}) < A_2\} > 1 - \epsilon$. □

Footnotes

Supplementary Material

Supplement to “Bayesian graph selection consistency under model misspecification” (DOI: 10.3150/20-BEJ1253SUPP; .pdf). There are five sections in the Supplementary Materials [32]. In Section S.1, we provide a set of auxiliary results related to the concentration and tail behavior of partial correlations. The bounds for Bayes factors of local moves are developed in Section S.2 and Section S.3. For the proofs of Theorems 4.3, 5.3, 5.4 and Corollaries 4.3, 5.1, see Section S.4 and Section S.5.

References

  • [1] Armstrong H, Carter CK, Wong KFK and Kohn R (2009). Bayesian covariance matrix estimation using a mixture of decomposable graphical models. Statistics and Computing 19 303–316.
  • [2] Atay-Kayis A and Massam H (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika 92 317–335.
  • [3] Banerjee S and Ghosal S (2014). Posterior convergence rates for estimating large precision matrices using graphical models. Electronic Journal of Statistics 8 2111–2137.
  • [4] Banerjee S and Ghosal S (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis 136 147–162.
  • [5] Ben-David E, Li T, Massam H and Rajaratnam B (2011). High dimensional Bayesian inference for Gaussian directed acyclic graph models. arXiv preprint arXiv:1109.4371.
  • [6] Bickel PJ and Levina E (2008). Regularized estimation of large covariance matrices. The Annals of Statistics 36 199–227.
  • [7] Cai T and Liu W (2011). Adaptive thresholding for sparse covariance matrix estimation. Journal of the American Statistical Association 106 672–684.
  • [8] Cao X, Khare K and Ghosh M (2019). Posterior graph selection and estimation consistency for high-dimensional Bayesian DAG models. The Annals of Statistics 47 319–348.
  • [9] Carvalho CM, Massam H and West M (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika 94 647–659.
  • [10] Carvalho CM and Scott JG (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika 96 497–512.
  • [11] Dawid AP and Lauritzen SL (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics 21 1272–1317.
  • [12] Dellaportas P, Giudici P and Roberts G (2003). Bayesian inference for nondecomposable graphical Gaussian models. Sankhyā: The Indian Journal of Statistics 43–55.
  • [13] Diaconis P and Ylvisaker D (1979). Conjugate priors for exponential families. The Annals of Statistics 7 269–281.
  • [14] Dobra A, Hans C, Jones B, Nevins JR, Yao G and West M (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis 90 196–212.
  • [15] Donnet S and Marin J-M (2012). An empirical Bayes procedure for the selection of Gaussian graphical models. Statistics and Computing 22 1113–1123.
  • [16] Drton M and Perlman MD (2007). Multiple testing and error control in Gaussian graphical model selection. Statistical Science 22 430–449.
  • [17] El Karoui N (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. The Annals of Statistics 36 2717–2756.
  • [18] Fitch AM, Jones MB and Massam H (2014). The performance of covariance selection methods that consider decomposable models only. Bayesian Analysis 9 659–684.
  • [19] Giudici P (1996). Learning in graphical Gaussian models. Bayesian Statistics 5 621–628.
  • [20] Giudici P and Green P (1999). Decomposable graphical Gaussian model determination. Biometrika 86 785–801.
  • [21] Green PJ (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82 711–732.
  • [22] Heggernes P (2006). Minimal triangulations of graphs: A survey. Discrete Mathematics 306 297–317.
  • [23] Johnstone IM (2010). High dimensional Bernstein–von Mises: simple examples. Institute of Mathematical Statistics Collections 6 87.
  • [24] Jones B, Carvalho C, Dobra A, Hans C, Carter C and West M (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science 20 388–400.
  • [25] Khare K, Rajaratnam B and Saha A (2018). Bayesian inference for Gaussian graphical models beyond decomposable graphs. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80 727–747.
  • [26] Lam C and Fan J (2009). Sparsistency and rates of convergence in large covariance matrix estimation. The Annals of Statistics 37 4254.
  • [27] Lauritzen SL (1996). Graphical Models. Oxford Statistical Science Series 17. Clarendon Press.
  • [28] Lee K, Lee J and Lin L (2019). Minimax posterior convergence rates and model selection consistency in high-dimensional DAG models based on sparse Cholesky factors. The Annals of Statistics 47 3413–3437.
  • [29] Letac G and Massam H (2007). Wishart distributions for decomposable graphs. The Annals of Statistics 35 1278–1323.
  • [30] Meinshausen N and Bühlmann P (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics 34 1436–1462.
  • [31] Moghaddam B, Khan E, Murphy KP and Marlin BM (2009). Accelerating Bayesian structural inference for non-decomposable Gaussian graphical models. In Advances in Neural Information Processing Systems 1285–1293.
  • [32] Niu Y, Pati D and Mallick BK (2020). Supplementary materials to "Bayesian graph selection consistency under model misspecification."
  • [33] Rajaratnam B, Massam H and Carvalho CM (2008). Flexible covariance estimation in graphical Gaussian models. The Annals of Statistics 36 2818–2849.
  • [34] Raskutti G, Yu B, Wainwright MJ and Ravikumar PK (2009). Model selection in Gaussian graphical models: High-dimensional consistency of ℓ1-regularized MLE. In Advances in Neural Information Processing Systems 1329–1336.
  • [35] Rose DJ, Tarjan RE and Lueker GS (1976). Algorithmic aspects of vertex elimination on graphs. SIAM Journal on Computing 5 266–283.
  • [36] Roverato A (2000). Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika 87 99–112.
  • [37] Roverato A (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics 29 391–411.
  • [38] Scott JG and Carvalho CM (2008). Feature-inclusion stochastic search for Gaussian graphical models. Journal of Computational and Graphical Statistics 17 790–808.
  • [39] Spokoiny V (2013). Bernstein–von Mises theorem for growing parameter dimension. arXiv preprint arXiv:1302.3430.
  • [40] Uhler C, Lenkoski A and Richards D (2018). Exact formulas for the normalizing constants of Wishart distributions for graphical models. The Annals of Statistics 46 90–118.
  • [41] Wang H and Carvalho CM (2010). Simulation of hyper-inverse Wishart distributions for non-decomposable graphs. Electronic Journal of Statistics 4 1470–1475.
  • [42] Yuan M and Lin Y (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94 19–35.
