Author manuscript; available in PMC: 2012 Feb 1.
Published in final edited form as: Stat Probab Lett. 2011 Feb 1;81(2):220–230. doi: 10.1016/j.spl.2010.11.009

A Gaussian Mixed Model for Learning Discrete Bayesian Networks

Nikolay Balov
PMCID: PMC3046375  NIHMSID: NIHMS254441  PMID: 21379362

Abstract

In this paper we address the problem of learning discrete Bayesian networks from noisy data. We consider a graphical model based on a mixture of Gaussian distributions whose categorical mixing structure comes from a discrete Bayesian network. Network learning is formulated as a Maximum Likelihood estimation problem and carried out with an EM algorithm. The proposed approach is relevant to a variety of statistical problems for which Bayesian network models are suitable, from simple regression analysis to learning gene/protein regulatory networks from microarray data.

Keywords: Bayesian networks, learning, mixture models, MLE, EM algorithm

1. Introduction

Graphical models provide a framework for representing the dependence between random variables through directed edges in graphs. Bayesian networks form a class of graphical models subject to the assumption that the variables can be ordered according to their causal relationship. They have attracted renewed interest in recent years owing to promising applications in systems biology and genetics; for example, Bayesian networks provide an efficient means of describing the mechanisms of control and regulation of gene expression.

Over the last few decades, a large body of research has been devoted to the problem of learning Bayesian networks; some main sources are Buntine (1996), Heckerman et al. (1995) and Neapolitan (2003). A typical learning procedure follows the so-called score-based approach. It identifies a network structure according to some scoring criterion, such as the posterior probability of a network given the data, or a likelihood-based criterion such as AIC or BIC. Another approach is formed by the constraint-based algorithms, which use conditional independence tests to find associations between the node-variables. Once the graphical structure of a Bayesian network is identified, in the context of a specified probability model, it is usually straightforward to estimate the parameters of the corresponding conditional probabilities. However, finding the structure of Bayesian networks in which nodes may have more than one parent is an NP-hard problem, Chickering (1996), which motivates the search for model constraints and assumptions that relax its general intractability.

In this paper we restrict our attention to discrete Bayesian networks with multinomial conditional distributions. The data are assumed to be observations on categorical variables plus additional Gaussian noise distributed independently among them. In other words, the nodes of the network are assumed to be latent (unobserved) random variables, while the observed variables are their noisy instances. In this case, the conditional distributions of the observed continuous variables are essentially mixtures of Gaussians, and the model can be referred to as a mixed Bayesian network. The model estimation is performed in a frequentist manner using the Maximum Likelihood (ML) principle. Under the assumption that the causal order of the node-variables is known, the MLE solution is approximated by employing an Expectation Maximization (EM) algorithm. The node order assumption, although often considered unacceptable from a general learning standpoint, is what allows the global MLE, if it exists, to be found, in contrast to many more general but heuristic learning algorithms (see Heckerman et al. (1995) for a systematic overview), which provide only ‘local’ solutions.

The model considered in this paper can be formulated as a special case of the Conditional Gaussian (CG) model, Lauritzen (1992), not to be confused with the substantially different linear Gaussian model. CG models belong to the family of mixed graphical association models that were first introduced by Lauritzen & Wermuth (1984). A CG model allows both discrete and continuous variables, but the latter can have as parents only discrete ones, on which they depend through conditional Gaussian distributions. CG models are well studied, and efficient local procedures for calculating their marginal distributions, means and variances are available, Lauritzen (1992) and Lauritzen & Jensen (2001). However, a method for estimating both the graphical and probability structures of a general CG model is not available.

In the context of network learning, the EM algorithm is typically applied in two settings: when some of the data are missing, Lauritzen (1995), or when some of the variables are hidden. Chickering & Heckerman (1996) consider a general approach to parameter estimation for networks with hidden variables that resorts to computationally intensive methods, such as Monte Carlo, for approximating the marginal distributions. The lack of closed-form expressions for the marginals in a Bayesian network model substantially limits the applicability of EM.

The application of EM has been mostly limited to learning the probability model parameters of Bayesian networks. One exception is Friedman (1997), where a method for learning the structure of discrete Bayesian networks from incomplete data is proposed. The algorithm described there is a variant of EM in which the M-step is replaced by two alternating steps: a local structure search, similar to the greedy hill-climbing procedure, and parameter estimation. Its advantage is that it does not require the node order to be known. However, it has two drawbacks: because it converges to locally optimal solutions, it is sensitive to the initial configuration, and it requires discrete data. In contrast, the estimation procedure proposed here seamlessly combines the discretization and structure learning steps in one unified EM algorithm.

The main advantage of the framework considered in this paper is its ability to model continuous variables by unrestricted categorical distributions and thus to express complex, non-linear relationships between them (see Example 1 below). Consequently, in appropriate settings, this approach provides greater representation power than discrete Bayesian networks alone and than parametric models such as the linear Gaussian networks. Moreover, with closed-form expressions for the expected sufficient statistics, our model estimation algorithm can be implemented accurately and efficiently.

The paper is organized as follows. In Section 2 we introduce the discrete Bayesian network model and in Section 3 its extension, the mixture of Gaussian Bayesian network (MGBN). In Section 4 we formulate the estimation problem for MGBN and derive an EM algorithm to solve it. Finally, in Section 5, we present simulation results and a small study on learning the interactions between some genes implicated in lung cancer.

The MGBN model inference is implemented in the mugnet R-package, which also contains code for reproducing the examples presented in the paper.

Example 1. (Motivation Example)

The MGBN is designed to model random vectors whose components are multi-modal continuous variables that follow some mixture of Gaussian distributions. In Table 1 we show the distribution of a random vector $(X, Y)$ such that $X = A + e_A$ and $Y = B + e_B$ for independent $e_A \sim \mathcal{N}(0, 0.04)$ and $e_B \sim \mathcal{N}(0, 0.08)$, and discrete random variables $A$ and $B$ with probability table fully specified by $P(A)$ and $P(B|A)$, shown on the left. On the right, a large sample of $(X, Y)$ is plotted, which clearly shows the cluster-like structure of the joint distribution. Note that in this case fitting the linear Gaussian model B ~ A is not appropriate. In the remainder of this paper, we present a method for recovering not only the distribution of $(A, B)$ from an $(X, Y)$-sample, but also the dependence relations in the case of many node-variables.

Table 1.

A simple 2-node MGBN model, A → B, with its discrete probability table (on the left) and a sample drawn from it (on the right).

levels        1     2     3
P(A)          0.3   0.5   0.2
P(B|A = 1)    0.5   0.2   0.3
P(B|A = 2)    0.1   0.1   0.8
P(B|A = 3)    0.1   0.7   0.2

[Right panel: scatter plot of a large sample of (X, Y); image not reproduced.]
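
As a rough illustration of how a sample like the one in Table 1 could be generated, the following Python sketch draws (X, Y) pairs from the two-node model. It assumes that the second argument of N(0, 0.04) and N(0, 0.08) is a variance and that the category labels 1, 2, 3 themselves serve as the cluster means; the function and variable names are illustrative and are not taken from the mugnet package.

```python
import random

# Probability tables copied from Table 1 (levels 1, 2, 3).
P_A = [0.3, 0.5, 0.2]
P_B_GIVEN_A = {1: [0.5, 0.2, 0.3], 2: [0.1, 0.1, 0.8], 3: [0.1, 0.7, 0.2]}
LEVELS = [1, 2, 3]


def sample_xy(m, var_a=0.04, var_b=0.08, seed=0):
    """Draw m pairs (X, Y) with X = A + e_A and Y = B + e_B."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(m):
        a = rng.choices(LEVELS, weights=P_A)[0]             # latent A
        b = rng.choices(LEVELS, weights=P_B_GIVEN_A[a])[0]  # latent B given A
        # random.gauss expects a standard deviation, hence the square roots
        # (assumption: 0.04 and 0.08 are variances).
        x = a + rng.gauss(0.0, var_a ** 0.5)
        y = b + rng.gauss(0.0, var_b ** 0.5)
        pairs.append((x, y))
    return pairs


sample = sample_xy(2000)
```

Plotting `sample` as a scatter plot should show up to nine clusters centred on the integer grid, with cluster weights given by P(A) P(B|A).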

2. Discrete Bayesian Networks

A Bayesian network (BN) $\mathcal{G} = (G, P)$ is defined by a directed acyclic graph (DAG) $G$ with random node-variables $x_i$, $i = 1, \dots, n$, and a probability model $P$. The directed arcs of $G$ determine the parent-child relations, while the model $P$ is a set of conditional distributions, one for each node, specifying the distribution of the node conditional on its parents. $\mathcal{G}$ is a discrete BN if each node $x$ is a discrete random variable that takes values in some finite set of categories $C_x$ and the conditional distributions are categorical. Let us denote by $\Pi_i$ the set of parent indices of node $x_i$ and by $P(x_i \mid x_j, j \in \Pi_i)$, or more concisely $P(x_i \mid \Pi_i)$, its conditional probability. There is a topological order of the nodes of $\mathcal{G}$ such that the parents of each node always appear earlier in that order. To simplify the analysis, throughout this paper we will assume, without loss of generality, that the topological order of the nodes in $G$ is the same as their index order, a fact expressed symbolically as $x_1 \prec x_2 \prec \dots \prec x_n$. That is, for every $i > 1$, $\Pi_i \subset \{1, 2, \dots, i-1\}$ and $\Pi_1 = \emptyset$. An application of the Bayes formula gives us the following factorization of the joint probability

$$\log P(x_1, \dots, x_n \mid \mathcal{G}) = \sum_{i=1}^{n} \log P(x_i \mid x_j, j < i) = \sum_{i=1}^{n} \log P(x_i \mid \Pi_i). \qquad (1)$$

Hereafter, we assume that the node categorical sets are $C_i = \{1, \dots, a_i\}$ for some natural numbers $a_i > 1$. We also denote by $C[\Pi_i]$ the set of possible combinations of parent categories, $C[\Pi_i] \equiv \otimes_{j \in \Pi_i} C_j$.
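
The factorization (1) can be made concrete with a small Python sketch. It evaluates the log joint probability of a full configuration by summing the log conditional probabilities node by node. The two-node network A → B and its probabilities are borrowed from Table 1 purely for illustration; the dictionary layout and the name `log_joint` are assumptions of this sketch.

```python
import math

# Two-node network A -> B with the probabilities of Table 1 (illustrative choice).
PARENTS = {"A": (), "B": ("A",)}
CPT = {
    "A": {(): {1: 0.3, 2: 0.5, 3: 0.2}},
    "B": {
        (1,): {1: 0.5, 2: 0.2, 3: 0.3},
        (2,): {1: 0.1, 2: 0.1, 3: 0.8},
        (3,): {1: 0.1, 2: 0.7, 3: 0.2},
    },
}


def log_joint(x, parents=PARENTS, cpt=CPT):
    """Sum of log P(x_i | Pi_i) over the nodes, as in Eq. (1)."""
    total = 0.0
    for node, value in x.items():
        config = tuple(x[p] for p in parents[node])  # parent configuration C
        total += math.log(cpt[node][config][value])
    return total


log_p = log_joint({"A": 2, "B": 3})  # equals log(0.5) + log(0.8)
```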

Categorical distributions, also called multinomial distributions, are the most general model for the conditional probabilities $P(x_i \mid \Pi_i)$. With a slight abuse of notation, by $P(x_i \mid \Pi_i = C)$ we will denote $P(x_i \mid x_j = c_j, j \in \Pi_i)$, for a parent configuration $C = (c_j) \in C[\Pi_i]$. By definition, $\sum_{c \in C_i} P(x_i = c \mid \Pi_i = C) = 1$. Hence, each conditional probability $P(x_i \mid \Pi_i)$ requires $\alpha_i = (a_i - 1) \times \prod_{j \in \Pi_i} a_j$ parameters to be specified. The total number $\alpha = \sum_{i=1}^{n} \alpha_i$, called network complexity, gives the overall number of parameters of the discrete BN $\mathcal{G}$.

As defined, discrete BNs have no parametric constraints on their conditional distributions except the number of node categories. However, this generality is counterbalanced by the potential information loss in the process of data categorization. In practice, most data are obtained by analog instruments and are effectively measured in continuous form. In order to fit a discrete BN one needs, before anything else, to discretize the continuous outcome, which usually results in information loss and hampers the statistical analysis. This loss can be especially costly when dealing with small sample sizes, as is the case with microarray data. In existing practice, the problems of discretization and of fitting a discrete BN are considered separately, which is not a statistically satisfactory approach. The methodology we present next addresses this issue.

3. Mixture of Gaussian Bayesian Networks

We consider the following model

$$y_i = \beta_{i,x_i} + e_i, \quad i = 1, \dots, n, \qquad (2)$$

where $y = (y_1, \dots, y_n)$ is a continuous random vector, $x = (x_1, \dots, x_n)$ comes from a discrete BN $\mathcal{G}$, and the $e_i$ are independent, zero-mean Gaussian random variables with variances $\sigma_i^2$, i.e. $e = (e_1, \dots, e_n) \sim \mathcal{N}(0, \mathrm{diag}(\sigma_i^2))$. In this setting, $y$ is the observed random vector, while $x$ is an unobserved (latent) discrete random vector. The triple

$$\theta = \left( \{\beta_i = (\beta_{i,k})_{k \in C_i}\}_{i=1}^{n}, \ \{\sigma_i^2\}_{i=1}^{n}, \ \mathcal{G} \right)$$

constitutes the parameter set for model (2).

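For given $x_i$, $y_i$ is a Gaussian variable with mean $\beta_{i,x_i}$ and variance $\sigma_i^2$. In an equivalent formulation of (2), the logarithm of the density of $y_i$ conditional on $x_i$ can be written as

$$\log P(y_i \mid x_i, \theta) = -\frac{1}{2}\left[ \log(2\pi\sigma_i^2) + \frac{1}{\sigma_i^2}(y_i - \beta_{i,x_i})^2 \right]. \qquad (3)$$

The local Markov property implies that each $x_i$ depends on $x_{j<i}$ (all $x_j$ with $j < i$) only through $x_{j \in \Pi_i}$ (all $x_j$ with $j \in \Pi_i$). Moreover, $y_i$ depends on $x_{j \le i}$ only through $x_i$. By applying these observations we obtain a convenient expression for the logarithm of the joint distribution of $x$ and $y$

$$\begin{aligned} \log P(y, x \mid \theta) &= \log P(y_1, x_1 \mid \theta) + \sum_{i=2}^{n} \log P(y_i, x_i \mid y_{j<i}, x_{j<i}, \theta) \\ &= \sum_{i=1}^{n} \left[ \log P(y_i \mid x_i, \theta) + \log P(x_i \mid x_{j \in \Pi_i}, \theta) \right] \\ &= \sum_{i=1}^{n} \log P(y_i, x_i \mid x_{j \in \Pi_i}, \theta). \qquad (4) \end{aligned}$$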

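By combining equations (1), (3) and (4), we write the log-likelihood of $\theta$ with respect to a given sample $(y, x)$ as

$$l(\theta \mid y, x) = -\frac{1}{2} \sum_{i=1}^{n} \left[ \log(2\pi\sigma_i^2) + \frac{1}{\sigma_i^2}(y_i - \beta_{i,x_i})^2 \right] + \log P(x_1, \dots, x_n \mid \mathcal{G}). \qquad (5)$$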

Some of the properties of model (2) can be elucidated by deriving the conditional distributions of the $y$'s. For example, the distribution of $y_i$ conditional on its parents $y_{j \in \Pi_i}$ is given by

$$E_{x_i \mid y_{j \in \Pi_i}} P(y_i \mid x_i, y_{j \in \Pi_i}) = E_{x_i \mid y_{j \in \Pi_i}} P(y_i \mid x_i) = \sum_{c \in C_i} P(y_i \mid x_i = c) \, P(x_i = c \mid y_{j \in \Pi_i}),$$

where the $P(y_i \mid x_i)$ are Gaussian distributions; the conditional distribution of $y_i$ is thus a mixture of Gaussians. In light of this observation we tentatively refer to model (2) as a mixture of Gaussian Bayesian network (MGBN).
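
To make the mixture structure tangible, the Python sketch below evaluates such a conditional density as a weighted sum of Gaussian densities. The mixing weights P(x_i = c | y_parents) are simply taken as given inputs here; the numbers used (one row of Table 1, means equal to the category labels, variance 0.08) are illustrative assumptions, as are the function names.

```python
import math


def normal_pdf(y, mean, var):
    """Density of a Gaussian with the given mean and variance."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_density(y, weights, beta, var):
    """Sum over categories c of P(x_i = c | y_parents) * P(y | x_i = c).

    weights: dict c -> mixing probability (assumed already computed)
    beta:    dict c -> mean beta_{i,c}
    var:     common variance sigma_i^2 of node i
    """
    return sum(w * normal_pdf(y, beta[c], var) for c, w in weights.items())


# Illustrative inputs (assumptions, not fitted values).
WEIGHTS = {1: 0.1, 2: 0.1, 3: 0.8}
BETA = {1: 1.0, 2: 2.0, 3: 3.0}
density_at_3 = mixture_density(3.0, WEIGHTS, BETA, 0.08)
```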

The additional parameters $\beta$ and $\sigma^2$ increase the complexity of an MGBN compared to its underlying discrete BN. The $i$-th node with $a_i$ categories contributes $1 + a_i$ additional parameters (one for $\sigma_i^2$ and $a_i$ for the $\beta_{i,c}$'s), thus increasing the overall complexity of $\mathcal{G}$ to

$$\alpha(\mathcal{G}) = \sum_{i=1}^{n} \left[ 1 + a_i + (a_i - 1) \prod_{j \in \Pi_i} a_j \right].$$
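
The complexity formula is easy to check numerically. The Python sketch below computes it for a given list of category counts and parent sets, and applies it to nine binary nodes arranged so that every non-root node has one parent, or up to two parents; these reproduce the complexities 44 and 58 quoted for the saturated networks in Section 5.2. The specific parent choices (each node's immediate predecessors) are an assumption made only to have something concrete; with equal category counts any choice gives the same totals.

```python
from math import prod


def mgbn_complexity(categories, parents):
    """Number of MGBN parameters: sum of 1 + a_i + (a_i - 1) * prod_j a_j.

    categories: list of a_i, one per node (0-based node indices)
    parents:    list of lists of parent indices, one per node
    """
    total = 0
    for a_i, parent_idx in zip(categories, parents):
        # prod of an empty sequence is 1, which covers root nodes
        discrete_part = (a_i - 1) * prod(categories[j] for j in parent_idx)
        total += 1 + a_i + discrete_part
    return total


cats = [2] * 9
one_parent = [[]] + [[i - 1] for i in range(1, 9)]
two_parents = [[], [0]] + [[i - 2, i - 1] for i in range(2, 9)]
alpha_one = mgbn_complexity(cats, one_parent)
alpha_two = mgbn_complexity(cats, two_parents)
```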

Example 2

Here we consider the expression levels of the AKT1 and GSK3A genes from the lung cancer data described and analyzed in Section 5 below. In Fig. 1, we show the cross plot for the original sample of size 62 (left), as well as for a sample of size 200 simulated from a fitted MGBN model (right). Also shown are the β parameters of the model corresponding to the AKT1 and GSK3A variables, drawn as vertical and horizontal lines, respectively. The crossing points of the β-lines are the centroids of Gaussian clusters. This is a situation where modeling continuous variables with categorical distributions, as the MGBN model does, seems appropriate.

Figure 1.

The cross plot of the GSK3A and AKT1 genes for the original (left) and simulated (right) data. The latter is obtained from an MGBN model fitted to the original sample. See Example 2 and the analysis in Section 5 for more details.

4. EM Estimation

4.1. Problem Formulation

The main problem considered in this article is MGBN model estimation from observations on $y$. It involves learning the discrete BN $\mathcal{G}$ as well as estimating the mean parameters $\beta$ and variances $\sigma^2$. A standard solution is provided by the Maximum Likelihood principle. Moreover, the Expectation Maximization algorithm seems particularly well suited for the task because it can handle latent variables, in our case the discrete random vector $x$.

We assume that the network $\mathcal{G}$ has a known topological order and that this is the index order, i.e. $x_1 \prec x_2 \prec \dots \prec x_n$. In many practical problems, such as recovering gene regulatory networks, this assumption is a natural one if viewed from the perspective that any prior information about the node order should come from experimental studies. For example, the gene-node order in microarray studies can be inferred using perturbed samples. Identifying the causal relations between random variables by statistical means alone is rarely possible, for there may be many networks with different orders that fit the data equally well. Specifying an order resolves this issue and, eventually, makes the MLE network identifiable. This is the reason why, in the framework considered here, we provide MLE solutions only relative to given orders.

The EM algorithm is a standard technique for finding a local ML parameter configuration (Dempster, Laird & Rubin (1977)). It essentially requires optimization of the following score function of $\theta$, $r(\theta \mid y, \theta^*) = E\, l(\theta \mid y, x)$, where the expectation is taken with respect to the conditional distribution of $(x \mid y, \theta^*)$ for a fixed choice $\theta^*$ of the parameters. Starting from an initial $\theta^1$, a sequence $\theta^k$ is constructed according to $\theta^{k+1} = \arg\max_\theta r(\theta \mid y, \theta^k)$, until some convergence criteria are met.

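Applying Eq. (4) we obtain $r(\theta \mid y, \theta^*) = \sum_{i=1}^{n} r_i(\theta_i \mid y, \theta^*)$, with $r_i$ defined as

$$r_i(\theta_i \mid y, \theta^*) \equiv E_{x \mid y, \theta^*} \log P(y_i, x_i \mid x_{j \in \Pi_i}, \theta) = E_{x_i, x_{j \in \Pi_i} \mid y, \theta^*} \log P(y_i, x_i \mid x_{j \in \Pi_i}, \theta_i), \qquad (6)$$

where $\theta = (\theta_i)_{i=1}^{n}$ and $\theta_i = (\beta_{i,k}, \sigma_i^2, \Pi_i, P(x_i \mid \Pi_i))$ is the model parameter corresponding to the $i$-th node. The last equality in (6) follows from the fact that $P(y_i, x_i \mid x_{j \in \Pi_i})$ depends on $\theta$ only through $\theta_i$.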

At the second step of EM, one needs to maximize $r(\theta \mid y, \theta^*)$ as a function of $\theta$. Unfortunately, the expectations in (6) involve the calculation of the joint distribution of the $y$'s. The latter problem can easily become computationally prohibitive since it requires considering all possible configurations of the $x$'s ($\prod_{i=1}^{n} a_i$ in total). Thus we exclude a straightforward implementation of EM based on Eq. (6) as impractical.

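In regard to the difficulties of finding $r(\theta \mid y, \theta^*)$, we observe that only ‘local’ expectation calculations, which involve distributions of limited subsets of nodes, seem accessible. For example, we can readily express the distribution of $x_i$ and its parents conditional on $y_i$ and $y_{j \in \Pi_i}$. Indeed, let us define $p_{i,cC} \equiv P(x_i = c, \Pi_i = C \mid \mathcal{G})$, for $c \in C_i$ and $C \in C[\Pi_i]$, to denote the joint distribution of $x_i$ and its parents $\Pi_i$ in the network $\mathcal{G}$. Since $y_j$ does not depend on $\mathcal{G}$ if the value of $x_j$ is known, and all $x_j$ depend on $\theta$ only through the network $\mathcal{G}$, we infer that

$$\begin{aligned} P(x_i = c, \Pi_i = C \mid y_i, y_{j \in \Pi_i}, \theta) &= \frac{P(x_i = c, \Pi_i = C \mid \mathcal{G}) \, P(y_i \mid x_i = c, \theta) \prod_{j \in \Pi_i} P(y_j \mid x_j = c_j, \theta)}{P(y_i, y_{j \in \Pi_i} \mid \theta)} \\ &= \frac{p_{i,cC} \exp\left( -\frac{1}{2\sigma_i^2}(y_i - \beta_{i,c})^2 - \sum_{j \in \Pi_i} \frac{1}{2\sigma_j^2}(y_j - \beta_{j,c_j})^2 \right)}{\sum_{d \in C_i} \sum_{D \in C[\Pi_i]} p_{i,dD} \exp\left( -\frac{1}{2\sigma_i^2}(y_i - \beta_{i,d})^2 - \sum_{j \in \Pi_i} \frac{1}{2\sigma_j^2}(y_j - \beta_{j,d_j})^2 \right)}, \qquad (7) \end{aligned}$$

where $C = (c_j)_{j \in \Pi_i}$ and $D = (d_j)_{j \in \Pi_i}$. In addition, the conditional distribution of $x_i$ is

$$P(x_i = c \mid y_i, y_{j \in \Pi_i}, \theta) = \sum_{C \in C[\Pi_i]} P(x_i = c, \Pi_i = C \mid y_i, y_{j \in \Pi_i}, \theta). \qquad (8)$$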
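
A minimal Python sketch of the local posterior computations (7) and (8) is given below. It assumes the joint prior of a node and its parents is supplied as a dictionary keyed by (c, C), with C a tuple of parent categories, and it subtracts the largest exponent before exponentiating to limit underflow, which is a numerical convenience added here and not part of the derivation. All names are illustrative; the example inputs reuse the A → B network of Table 1 with means equal to the category labels.

```python
import math


def local_posterior(y_i, y_pa, p_joint, beta_i, var_i, beta_pa, var_pa):
    """Joint posterior of (x_i, parents) given y_i and the parents' y, as in Eq. (7).

    p_joint: dict (c, C) -> P(x_i = c, Pi_i = C | G), C a tuple of parent categories
    beta_i:  dict c -> beta_{i,c};  var_i: sigma_i^2
    beta_pa: list of dicts c_j -> beta_{j,c_j}, one per parent;  var_pa: list of sigma_j^2
    """
    exponents = {}
    for (c, C) in p_joint:
        expo = -(y_i - beta_i[c]) ** 2 / (2.0 * var_i)
        for y_j, c_j, b_j, v_j in zip(y_pa, C, beta_pa, var_pa):
            expo -= (y_j - b_j[c_j]) ** 2 / (2.0 * v_j)
        exponents[(c, C)] = expo
    shift = max(exponents.values())  # common factor cancels in the ratio
    weights = {k: p_joint[k] * math.exp(e - shift) for k, e in exponents.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


def marginal_posterior(q_joint):
    """Eq. (8): sum the joint posterior over the parent configurations."""
    q = {}
    for (c, _config), value in q_joint.items():
        q[c] = q.get(c, 0.0) + value
    return q


# Illustrative inputs: node B with parent A, probabilities from Table 1.
P_A = {1: 0.3, 2: 0.5, 3: 0.2}
P_B_A = {1: {1: 0.5, 2: 0.2, 3: 0.3}, 2: {1: 0.1, 2: 0.1, 3: 0.8}, 3: {1: 0.1, 2: 0.7, 3: 0.2}}
p_joint = {(b, (a,)): P_A[a] * P_B_A[a][b] for a in P_A for b in (1, 2, 3)}
means = {1: 1.0, 2: 2.0, 3: 3.0}
q_joint = local_posterior(3.0, [2.0], p_joint, means, 0.08, [means], [0.04])
q_marg = marginal_posterior(q_joint)
```

For an observation sitting exactly on the cluster centre (x, y) = (2, 3), almost all posterior mass falls on the configuration A = 2, B = 3.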

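Because of the additive form of the score function, the maximization of $r(\theta \mid y, \theta^*)$ can be split into a sequence of $n$ maximization problems, one for each $\theta_i$. Motivated by this observation we propose to relax the baseline EM algorithm by considering the alternative score function $h(\theta \mid y, \theta^*) = \sum_{i=1}^{n} h_i(\theta_i \mid y, \theta^*)$, defined as

$$\begin{aligned} h_i(\theta_i \mid y, \theta^*) &\equiv E_{x_i, x_{j \in \Pi_i} \mid y_i, y_{j \in \Pi_i}, \theta^*} \log P(y_i, x_i \mid x_{j \in \Pi_i}, \theta_i) \\ &= E_{x_i, x_{j \in \Pi_i} \mid y_i, y_{j \in \Pi_i}, \theta^*} \left[ \log P(y_i \mid x_i, \beta_i, \sigma_i^2) + \log P(x_i \mid x_{j \in \Pi_i}, \theta_i) \right], \qquad (9) \end{aligned}$$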

instead of the original (6). For each node, we thus consider the conditional expectation involving only those x’s and y’s used in the corresponding probability term in (6). Next, we briefly comment on the validity of using h instead of r as an EM score function.

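Let $Y = \{y^1, \dots, y^m\}$ be a sample of $y$ coming from an MGBN model with parameter $\theta_0$. Denote by $\bar{l}$, $\bar{r}$ and $\bar{h}$ the sample averages of the likelihood $l$, $r$ and $h$; for example, $\bar{h}_i(\theta_i \mid Y, \theta^*) = \frac{1}{m} \sum_{j=1}^{m} h_i(\theta_i \mid y^j, \theta^*)$. Let $\hat\theta_{r,m}$ be a limit point of an EM sequence $\theta^k$, $\theta^{k+1} = \arg\max_\theta r(\theta \mid y, \theta^k)$, and $\hat\theta_{h,m}$ be a limit point of an EM sequence based on $h$. Then, it must be that $\hat\theta_{r,m} = \arg\max_\theta \bar{r}(\theta \mid Y, \hat\theta_{r,m})$ and $\hat\theta_{h,m} = \arg\max_\theta \bar{h}(\theta \mid Y, \hat\theta_{h,m})$. We claim that the estimators based on $r$ and $h$ are asymptotically equivalent in the following sense: if the $\hat\theta_{r,m}$ are consistent estimators of the true parameter $\theta_0$, so are the $\hat\theta_{h,m}$. Support for this claim is given in Appendix A.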

4.2. Algorithm Derivation

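To evaluate the score function $h$, we start with the first term in (9)

$$\begin{aligned} E_{x_i \mid y_i, y_{j \in \Pi_i}, \theta^*} \log P(y_i \mid x_i, \beta_i, \sigma_i^2) &= -\frac{1}{2}\left[ \log(2\pi\sigma_i^2) + \frac{1}{\sigma_i^2} E_{x_i \mid y_i, y_{j \in \Pi_i}, \theta^*} (y_i - \beta_{i,x_i})^2 \right] \\ &= -\frac{1}{2}\left[ \log(2\pi\sigma_i^2) + \frac{1}{\sigma_i^2} \sum_{c \in C_i} (y_i - \beta_{i,c})^2 q_{i,c} \right], \qquad (10) \end{aligned}$$

where $q_{i,c} = P(x_i = c \mid y_i, y_{j \in \Pi_i}, \theta^*)$ are the conditional probabilities defined by Eq. (8). For the second term in (9) we have

$$E_{x_i, x_{j \in \Pi_i} \mid y_i, y_{j \in \Pi_i}, \theta^*} \log P(x_i \mid x_{j \in \Pi_i}) = \sum_{C \in C[\Pi_i]} \sum_{c \in C_i} \log P(x_i = c \mid \Pi_i = C) \, q_{i,cC}, \qquad (11)$$

where $q_{i,cC} \equiv P(x_i = c, \Pi_i = C \mid y_i, y_{j \in \Pi_i}, \theta^*)$ as in Eq. (7).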

At the second step of the EM algorithm, we need to maximize the sum of (10) and (11) in order to find an optimal network $\mathcal{G}$. This problem requires finding, for each $i$, first the optimal parent set $\Pi_i$ of node $x_i$ and second the conditional probability $P(x_i \mid \Pi_i)$. Provided $\Pi_i$ is known, the optimal conditional probability can be easily derived. Indeed, by using the fact that for every fixed $C \in C[\Pi_i]$

$$\arg\max_{v} \left\{ \sum_{c} \log(v_c) \, q_{i,cC} \ \Big| \ \sum_{c} v_c = 1 \right\} = \left( \frac{q_{i,cC}}{\sum_{d} q_{i,dC}} \right)_{c},$$

for the expectation in Eq. (11) we can write

$$\max_{P} E_{x_i, x_{j \in \Pi_i} \mid y_i, y_{j \in \Pi_i}, \theta^*} \log P(x_i \mid x_{j \in \Pi_i}) = \sum_{C} \sum_{c} q_{i,cC} \log\left( q_{i,cC} \Big/ \sum_{d} q_{i,dC} \right). \qquad (12)$$
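
The maximization fact used above can be checked numerically. The Python sketch below normalizes one column of posterior weights (a fixed parent configuration C) and compares the resulting expected log-probability against a few other probability vectors. The weights are made-up numbers chosen only for illustration, and the function names are not from the paper.

```python
import math


def best_conditional(q_column):
    """Maximizer v of sum_c q_c * log(v_c) subject to sum_c v_c = 1.

    q_column: dict c -> q_{i,cC} for one fixed parent configuration C
    """
    total = sum(q_column.values())
    return {c: q / total for c, q in q_column.items()}


def expected_log(v, q_column):
    """Objective sum_c q_c * log(v_c); terms with q_c = 0 contribute nothing."""
    return sum(q * math.log(v[c]) for c, q in q_column.items() if q > 0)


q_col = {1: 0.2, 2: 0.1, 3: 0.3}  # illustrative posterior weights
v_star = best_conditional(q_col)
alternatives = [
    {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    {1: 0.2, 2: 0.2, 3: 0.6},
    {1: 0.4, 2: 0.1, 3: 0.5},
]
```

Each alternative yields a smaller value of `expected_log` than `v_star`, in line with the Gibbs inequality.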

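Recall that the EM algorithm is an iterative procedure that, for a sample $Y$ and a starting triple $\theta^0 = (\sigma^0, \beta^0, \mathcal{G}^0)$, constructs a sequence $\theta^k = (\sigma^k, \beta^k, \mathcal{G}^k)$, $k = 1, 2, \dots$, until a target convergence threshold for $\bar{h}(\theta^{k+1} \mid Y, \theta^k)$ is achieved. To simplify the iteration equations, we introduce notation for the probabilities (7) and (8) referring to iteration index $k$ and sample instance $l$

$$q^{k,l}_{i,cC} \equiv \frac{p^{k}_{i,cC} \exp\left( -\frac{1}{2(\sigma_i^k)^2}(y_i^l - \beta^k_{i,c})^2 - \sum_{j \in \Pi_i} \frac{1}{2(\sigma_j^k)^2}(y_j^l - \beta^k_{j,c_j})^2 \right)}{\sum_{d \in C_i} \sum_{D \in C[\Pi_i]} p^{k}_{i,dD} \exp\left( -\frac{1}{2(\sigma_i^k)^2}(y_i^l - \beta^k_{i,d})^2 - \sum_{j \in \Pi_i} \frac{1}{2(\sigma_j^k)^2}(y_j^l - \beta^k_{j,d_j})^2 \right)},$$

$$q^{k,l}_{i,c} \equiv \sum_{D \in C[\Pi_i]} q^{k,l}_{i,cD}, \quad \text{and let} \quad \bar{q}^{k}_{i,cC} \equiv \sum_{l=1}^{m} q^{k,l}_{i,cC},$$

where $p^k_{i,c}$ and $p^k_{i,cC}$ denote the probabilities $P(x_i = c \mid \mathcal{G}^k)$ and $P(x_i = c, \Pi_i = C \mid \mathcal{G}^k)$, respectively. We consider $q^{k,l}_{i,c}$ and $q^{k,l}_{i,cC}$ as functions of the parent set $\Pi_i$.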

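Then, at the $k$-th iteration, the EM algorithm computes $\theta^{k+1}$ from $\theta^k$ as a solution of

$$\theta^{k+1} \equiv (\sigma^{k+1}, \beta^{k+1}, \mathcal{G}^{k+1}) = \arg\max_{\sigma, \beta, \mathcal{G}} \sum_{i=1}^{n} H_i^k(Y, \theta_i), \qquad (13)$$

where, by utilizing Eq. (12) for the sums $\bar{q}^k_{i,cC}$, we define $H_i^k(Y, \theta_i)$ to be

$$-\frac{m}{2} \log(2\pi\sigma_i^2) - \frac{1}{2\sigma_i^2} \sum_{l=1}^{m} \sum_{c \in C_i} q^{k,l}_{i,c} (y_i^l - \beta_{i,c})^2 + \sum_{C \in C[\Pi_i]} \sum_{c \in C_i} \bar{q}^k_{i,cC} \log\left( \bar{q}^k_{i,cC} \Big/ \sum_{d \in C_i} \bar{q}^k_{i,dC} \right).$$

Note that the objective function in (13) is additive over the nodes of the network and $\theta_i^{k+1} = \arg\max_{\theta_i} H_i^k(Y, \theta_i)$. This fact greatly simplifies the computation by allowing the evaluation of $\sigma^{k+1}$, $\beta^{k+1}$ and $\mathcal{G}^{k+1}$ on a node-by-node basis.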

Without imposing some restrictions on the network $\mathcal{G}$, Eq. (13) will always select the maximally connected network, the one in which each node has as parents all the nodes appearing earlier in the node order. This trivial solution is the least economical one and will surely over-fit any reasonable data. Clearly, some model selection restrictions on $\mathcal{G}$ are necessary, such as searching for networks with a complexity fixed in advance. Alternatively, we may apply some standard model selection criterion such as AIC or BIC. Then the complexity of $\mathcal{G}$ may vary from iteration to iteration. Next, we write the estimation equations for these cases.

4.3. EM Algorithm with Model Selection

If the parent set sizes $|\Pi_i|$ of the nodes are known and fixed, then it is straightforward to write the iteration equations for $\theta^{k+1}$. In this case the complexities of the resulting networks are predetermined. The optimal graph structure $\mathcal{G}^{k+1}$ in (13) can be found following the topological order of the nodes: the parent set of the $i$-th node is a subset of size $|\Pi_i|$ of the nodes appearing earlier in the order that maximizes $H_i^k$, and it can be searched for exhaustively. The optimal $\beta$'s and $\sigma$'s are the usual MLE solutions from linear regression analysis. Specifically, the solution of (13) includes the following update rules for $\beta_i^k$, $\sigma_i^k$ and $\Pi_i^k$

$$\beta^{k+1}_{i,c} = \sum_{l=1}^{m} q^{k,l}_{i,c} \, y_i^l \Big/ \bar{q}^k_{i,c}, \qquad (14)$$

$$(\sigma_i^{k+1})^2 = \frac{1}{m} \sum_{l=1}^{m} \sum_{c \in C_i} q^{k,l}_{i,c} (y_i^l - \beta^{k+1}_{i,c})^2, \qquad (15)$$

$$\Pi_i^{k+1} = \arg\max_{\Pi_i} \left\{ -\frac{m}{2} \log\left( \sum_{l=1}^{m} \sum_{c \in C_i} q^{k,l}_{i,c} (y_i^l - \beta^{k+1}_{i,c})^2 \right) + \sum_{C \in C[\Pi_i]} \sum_{c \in C_i} \bar{q}^k_{i,cC} \log\left( \bar{q}^k_{i,cC} \Big/ \sum_{d \in C_i} \bar{q}^k_{i,dC} \right) \right\}, \qquad (16)$$

where $\bar{q}^k_{i,c} \equiv \sum_{l=1}^{m} q^{k,l}_{i,c}$ and $\Pi_i$ is chosen among the subsets of $\{1, \dots, i-1\}$ of size $|\Pi_i|$. At every iteration step $k$, the update equations (14)–(16) are applied in order from $x_1$ to $x_n$, or according to the node topological order if it is different from the assumed one $x_1 \prec x_2 \prec \dots \prec x_n$.
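
The updates (14) and (15) for a single node amount to weighted means and a weighted residual variance. A minimal Python sketch follows, assuming the posterior weights for the node have already been computed and are passed in as one dictionary per sample point; the function name and the tiny data set with hard 0/1 weights are illustrative assumptions only.

```python
def m_step_node(y, q):
    """Update the means and variance of one node, following Eqs. (14) and (15).

    y: list of m observations y_i^l of node i
    q: list of m dicts c -> q_{i,c}^{k,l} (posterior weight of category c for sample l)
    Every category is assumed to receive positive total weight.
    """
    categories = sorted(q[0])
    beta = {}
    for c in categories:
        total_weight = sum(q_l[c] for q_l in q)  # plays the role of q-bar_{i,c}^k
        beta[c] = sum(q_l[c] * y_l for y_l, q_l in zip(y, q)) / total_weight
    m = len(y)
    var = sum(
        q_l[c] * (y_l - beta[c]) ** 2 for y_l, q_l in zip(y, q) for c in categories
    ) / m
    return beta, var


# Tiny example with hard (0/1) weights: two points per category.
y_obs = [0.9, 1.1, 2.0, 2.2]
weights = [{1: 1.0, 2: 0.0}, {1: 1.0, 2: 0.0}, {1: 0.0, 2: 1.0}, {1: 0.0, 2: 1.0}]
beta_new, var_new = m_step_node(y_obs, weights)
```

With hard weights the update reduces to per-cluster sample means (1.0 and 2.1 here) and the pooled within-cluster variance (0.01).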

If the parent set sizes are unknown, one needs an additional selection criterion to avoid the trivial, most connected, network as a solution of (13). This is essentially a model selection problem similar to that considered in Salzman & Almudevar (2006). A natural choice is a network with a specified target complexity $A$. Then the algorithm guarantees optimality for this target complexity. To express it formally we modify Eq. (13) to

$$(\sigma^{k+1}, \beta^{k+1}, \mathcal{G}^{k+1}) = \arg\max_{\theta} \left\{ \sum_{i=1}^{n} H_i^k(Y, \theta_i) \ \Big| \ \alpha(\mathcal{G}) = A \right\}. \qquad (17)$$

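A second option is the EM algorithm with AIC network selection, which replaces (13) with

$$(\sigma^{k+1}, \beta^{k+1}, \mathcal{G}^{k+1}) = \arg\max_{\theta} \left\{ \sum_{i=1}^{n} H_i^k(Y, \theta_i) - \alpha(\mathcal{G}) \right\}, \qquad (18)$$

where, recall, $\alpha(\mathcal{G})$ is the network complexity. A third option is to use the BIC criterion, which leads to the following iteration equation

$$(\sigma^{k+1}, \beta^{k+1}, \mathcal{G}^{k+1}) = \arg\max_{\theta} \left\{ \sum_{i=1}^{n} H_i^k(Y, \theta_i) - 0.5 \, \alpha(\mathcal{G}) \log(m) \right\}. \qquad (19)$$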
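
The two penalized objectives differ only in the penalty attached to the network complexity. The short Python sketch below computes both from a given value of the summed score and a given complexity; the score value used in the example is a made-up number, and the function name is an assumption of this sketch.

```python
import math


def penalized_score(h_total, complexity, m, criterion="AIC"):
    """Objective inside the arg max of Eq. (18) (AIC) or Eq. (19) (BIC).

    h_total:    value of the sum over nodes of H_i^k(Y, theta_i), assumed precomputed
    complexity: network complexity alpha(G)
    m:          sample size
    """
    if criterion == "AIC":
        return h_total - complexity
    if criterion == "BIC":
        return h_total - 0.5 * complexity * math.log(m)
    raise ValueError("criterion must be 'AIC' or 'BIC'")


aic = penalized_score(-100.0, 44, 62, "AIC")
bic = penalized_score(-100.0, 44, 62, "BIC")
```

Since 0.5 log(m) exceeds 1 whenever m is larger than about 7.4, the BIC penalty is the heavier of the two for any realistic sample size and therefore tends to select sparser networks.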

If $n$ is the number of nodes and $\mathrm{maxc} = \max_{1 \le i \le n} |C_i|$ is the maximum number of categories per node, the complexity of one EM iteration is exponential in the maximum parent set size $\mathrm{maxp} = \max_{1 \le i \le n} |\Pi_i|$ and is of order $O(m \, (n \cdot \mathrm{maxc})^{\mathrm{maxp}+1})$. Note that the computational cost of the algorithm, both in processing time and memory, may quickly become prohibitive as maxp increases. Nevertheless, estimating networks with $n \le 100$, $\mathrm{maxc} \le 3$ and $\mathrm{maxp} \le 3$ is currently quite manageable (see the simulation results in the next section), and these limits are sufficient for many practical problems.
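
To get a feel for where the exponential dependence on maxp comes from, the Python sketch below enumerates the candidate parent sets of a node under a known node order and counts them over a whole network. It assumes that all subsets of the earlier nodes up to a maximum size are candidates (as one would need for AIC or BIC selection) and uses 0-based node indices; both are choices of this sketch.

```python
from itertools import combinations
from math import comb


def candidate_parent_sets(i, max_parents):
    """All subsets of the earlier nodes {0, ..., i-1} with at most max_parents elements."""
    sets = []
    for size in range(min(max_parents, i) + 1):
        sets.extend(combinations(range(i), size))
    return sets


def total_candidates(n, max_parents):
    """Number of (node, parent set) pairs to be scored in one exhaustive pass."""
    # math.comb(i, s) is 0 when s > i, so early nodes are handled automatically.
    return sum(comb(i, s) for i in range(n) for s in range(max_parents + 1))


n_sets_25 = total_candidates(25, 2)
n_sets_100 = total_candidates(100, 2)
```

Each candidate set with p parents additionally requires on the order of maxc^(p+1) posterior terms per sample point, which gives the overall order stated above.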

5. Results

5.1. Simulation Studies

Here we demonstrate the performance of the proposed EM learning algorithm on simulated networks. In each test trial we generate a random MGBN, serving as a ground truth network, then draw samples of varying size from it and finally perform the estimation. The network structure recovery is measured by the so-called F-score, which is the harmonic mean of the specificity (TN/(TN + FP)) and sensitivity (TP/(TP + FN)), where TP (true positives) is the number of correctly found directed edges, FP is the number of edges present in the estimated network but not in the original one, and FN is the number of edges present in the original network but not in the estimated one. An F-score of 1 means perfect reconstruction. MGBN is compared to an alternative learning procedure in which the discretization and network fitting steps are performed separately. The latter method is denoted as D+BN. Note that without some assumptions on the marginal distributions of the nodes, the probability structure of a discrete BN cannot be recovered, although the edge structure eventually can. The MGBN model, being a parametric one, provides such a set of assumptions. Since D+BN with a general discretization method cannot be properly compared to MGBN, to place both methods on an equal footing, we perform the discretization step in D+BN using the (known) node marginal distributions from the original networks.
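
A minimal Python sketch of the F-score as defined above is given next. The text does not spell out how true negatives are counted; the sketch assumes that, with a known node order, the candidate edges are the n(n − 1)/2 ordered pairs (parent, child) with parent earlier than child, so TN is the number of such pairs absent from both networks. The function name and the small example are illustrative.

```python
def f_score(true_edges, est_edges, n):
    """Harmonic mean of sensitivity and specificity for directed edge recovery.

    true_edges, est_edges: iterables of (parent, child) pairs with parent < child
    n: number of nodes
    Assumes at least one true edge and at least one absent candidate edge.
    """
    true_edges, est_edges = set(true_edges), set(est_edges)
    tp = len(true_edges & est_edges)
    fp = len(est_edges - true_edges)
    fn = len(true_edges - est_edges)
    # Assumption: candidate edges are all pairs (j, i) with j < i in the node order.
    tn = n * (n - 1) // 2 - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    if sensitivity + specificity == 0:
        return 0.0
    return 2 * sensitivity * specificity / (sensitivity + specificity)


score = f_score({(0, 1), (1, 2)}, {(0, 1), (0, 2)}, 4)
```

In the example, one of two true edges is recovered (sensitivity 0.5) and three of four absent edges are correctly left out (specificity 0.75).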

Table 2 presents results for two networks with sizes 25 and 100, 3 categories per node and a maximum of three parents per node. The networks have β-parameters in the [−1, 1] range and a substantial noise level with σ2 = 0.1. From each of the networks four samples of sizes 100, 200, 400 and 800 are drawn and MGBN and D+BN estimation is performed. For each trial we report the F-scores as well as the processing times in seconds. We limit the maximum parent size of the estimated MGBN to 2, one less than that of the original, which prevents perfect reconstruction but speeds up the computations. The number of EM-steps for MGBN varies from trial to trial, but is limited to a maximum of 60.

Table 2.

Performance comparison of MGBN vs. separate discretization and BN fitting (D+BN), both using AIC selection. Reported are the F-scores and the processing time in seconds (in brackets).

m         100          200          400          800

Network with n = 25 nodes and 24 edges
D+BN      0.28 (1.1)   0.54 (1.1)   0.73 (1.1)   0.87 (1.2)
MGBN      0.34 (35)    0.73 (62)    0.83 (98)    0.88 (178)

Network with n = 100 nodes and 89 edges
D+BN      0.23 (54)    0.48 (57)    0.63 (58)    0.77 (62)
MGBN      0.37 (1635)  0.62 (2990)  0.74 (5523)  0.85 (11048)

Based on the F-score numbers, MGBN performs better than D+BN, although at a much higher computational cost. The results also show that the computation time of MGBN estimation agrees with the theoretical $O(m(3n)^3)$ complexity and is linear in the sample size. Note that the presented simulation results favor the D+BN method, for in a real problem the node marginal distributions will be unknown and any other discretization method will result in worse performance of D+BN.

5.2. Data Example: Learning Gene Regulatory Network

In this section we illustrate the application potential of the MGBN model with a small microarray data study. The renewed interest in graphical statistical models in recent years is due to the expectation that they will help in understanding the complicated mechanisms of gene and protein interactions within cells. The data we have chosen are from one of the Broad Institute GSEA’s publicly available datasets of transcriptional profiles from lung cancer cell lines, the Boston cohort. Extensive gene expression analysis of the data is published in Beer et al. (2002) and Bhattacharjee et al. (2001). The lung dataset contains about 12600 genes and 62 samples taken from patients with similar disease condition. Furthermore, the sample is separated into two equal classes of size 31, depending on the survival status at the end of the study: ‘alive’ and ‘deceased’.

For illustration purposes only, we concentrate on a small subset of genes and try to learn the interactions between them using the framework we have just presented. We select the following nine genes: IGF1, PIK3CA, PTEN, PDK1, AKT1, BAD, CASP9, GSK3A and MTOR, which play an important role in the PI3K/PTEN/AKT pathway. This pathway is believed to have a critical role in the cell survival process and its activation is linked to tumorigenesis, Saal et al. (2007). Briefly, IGF1 is a growth factor gene, PIK3CA encodes the protein p110α and is mutated in a range of human cancers, PTEN is a tumor suppressor gene, and AKT1 has an anti-apoptotic effect when activated. The selected subset of genes is ordered following the gene interaction causality available in the literature.

We fit MGBN to the combined data of the two survival classes and then look at the class differences using the fitted model. To avoid over-fitting the data, we carefully choose the number of node categories and the maximum number of parents per node such that the maximum node complexity is smaller than the sample size by some factor. With 2 parents and 2 categories per node the maximum node complexity is 7. In this case, the saturated network with 1 parent per node has complexity 44, while that with 2 parents has complexity 58. We fit MGBN using three different network selection criteria: AIC, 1-parent saturated network and 2-parent saturated network. The three resulting networks, G1, G2 and G3, are shown in Fig. 2.

Figure 2.

Three networks representing the lung cancer data according to different selection criteria. See the main text for details.

A distinguishing feature of any graphical probability model is its accessible interpretation. For example, the AIC-selected network, G1, includes the top five most significant arcs, with PIK3CA → PTEN having the highest likelihood score. G1 is a sub-network of G2, which is saturated, with all gene-nodes except the root having one parent, their most significant predictor. In the third network, G3, each parent pair is the most significant among the possible ones for the corresponding node. The three networks found confirm some known gene regulations such as IGF1 → PIK3CA → PTEN → PDK1 and AKT1 → GSK3A.

The conditional probabilities of the fitted networks can be used to reveal details about the difference between the two survival classes. In network G3, BAD depends on PTEN and AKT1. To quantify the class difference, we estimate the BAD conditional distribution using the ‘alive’ and ‘deceased’ sub-samples. The two empirical conditional distributions obtained are plotted in Fig. 3. Recall that each discrete gene-node has two categories (1 and 2) and thus the pair (PTEN, AKT1) has four possible configurations. When AKT1 is down-regulated (AKT1 = 1), the distributions of BAD are essentially the same for the two classes and are not influenced by PTEN. However, when AKT1 is up-regulated (AKT1 = 2), the regulation of BAD depends on PTEN and is different between the survival classes. If both PTEN and AKT1 are up-regulated (see the fourth column in Fig. 3), the activity of BAD is slightly disturbed in the ‘deceased’ class in comparison to the ‘alive’ one. Similar plots, Fig. 4, show the behavior of AKT1. A consistent up-regulation of AKT1 is evident for class ‘deceased’, in contrast to class ‘alive’, where its activity is moderated by PIK3CA (see the second column, IGF1 = 1, PIK3CA = 2). These observations are consistent with the inhibiting role of AKT1 in cell apoptosis, which in turn is known to lead to cancer cell proliferation.

Figure 3.

The probabilities of BAD conditional on PTEN and AKT1 for class ‘alive’, first row, and ‘deceased’, second row. The columns in each row correspond to the four combinations of PTEN and AKT1. The largest difference is observed in the second column, corresponding to PTEN = 1 and AKT1 = 2. There is also a slight difference when PTEN = 2 and AKT1 = 2 (fourth column).

Figure 4.

The probabilities of AKT1 conditional on IGF1 and PIK3CA for class ‘alive’, first row, and ‘deceased’, second row. For the latter we observe a consistent activity of AKT1. There is an apparent class difference when IGF1 = 1 and PIK3CA = 2 (second column).

The MGBN model is implemented in the mugnet R package, which is available from the CRAN repositories. The lung cancer data and the analysis in this section are also available as a demonstration in mugnet.

6. Conclusion

We have introduced the MGBN model, a discrete Bayesian network with Gaussian errors distributed independently across its nodes. It provides a framework for representing continuous variables via unconstrained categorical probability distributions. The model unifies two problems that so far have been considered independently: data discretization and network learning. We have also derived an efficient procedure for MGBN estimation based on the Expectation Maximization method. In the appendix we briefly discuss the convergence properties of our estimator, including the MLE consistency, and present directions for a more detailed treatment of this important subject. We believe that the MGBN model is suited for a variety of problems that involve complicated dependencies, such as learning gene/protein interactions from microarray data, as our small study has illustrated. The framework can be further extended by replacing the Gaussian error model with other distributions from the exponential family, thus diversifying its application scope.

Acknowledgments

This work was supported by NIH grant K99LM009477 from the National Library of Medicine. The content is solely the responsibility of the author and does not necessarily represent the official views of the National Library of Medicine or the National Institutes of Health.

The author thanks Peter Salzman for many valuable discussions and suggestions. The author also thanks the reviewers, whose careful reading and valuable comments helped improve this paper.

Appendix A

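As in the main text, let Y = {y1, …, ym} be a sample from a MGBN with true parameter θ0, let r̄ and h̄ be the sample averages of r and h, and let θ̂r,m and θ̂h,m be limit points of the r-based and h-based EM sequences, respectively. By introducing the complement index set $\overline{i,\Pi_i} = \{j : j \neq i \text{ and } j \notin \Pi_i\}$ and observing that

$$h_i(\theta_i \mid y, \theta^*) = E_{y_{\overline{i,\Pi_i}} \mid y_i,\, y_{j \in \Pi_i}}\, r_i(\theta_i \mid y, \theta^*), \tag{A.1}$$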

we immediately obtain Ey|θ*ri(θi|y, θ*) = Ey|θ*hi(θi|y, θ*), for all θ and θ*. Therefore, by the strong law of large numbers, both r̄(θ|Y, θ*) and h̄(θ|Y, θ*) converge to Er(θ|y, θ*), a.s., provided the expectations with respect to y|θ* exist. Tentatively, we expect that the estimators θ̂r,m and θ̂h,m share the same properties.
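The equality of the two expectations is an instance of the tower property of conditional expectation. As a sketch, assuming the inner conditional expectation in (A.1) is taken under the same θ* as the outer one, and writing $y_{\overline{i,\Pi_i}}$ for the components of y outside node i and its parents:

$$
E_{y \mid \theta^*}\, h_i(\theta_i \mid y, \theta^*)
= E_{y \mid \theta^*}\Big[ E_{y_{\overline{i,\Pi_i}} \mid y_i,\, y_{j \in \Pi_i}}\, r_i(\theta_i \mid y, \theta^*) \Big]
= E_{y \mid \theta^*}\, r_i(\theta_i \mid y, \theta^*).
$$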

The question of consistency of the MLE for the MGBN model, which does not hold unconditionally, is too involved to be discussed in detail here, and we only sketch one approach to this problem. For simplicity, we assume hereafter that the parent sizes, |Πi|, of all nodes are fixed. Any other setting would involve a more complex structure for the parameter space and, generally, would require more stringent conditions to guarantee the uniqueness of the MLE.

In order to show the asymptotic equivalence of the h-based and r-based EM estimations, we assume the following sufficient conditions for consistency on r, known from the general asymptotic theory. Let there be a parameter set Θ, with θ0 in the interior of Θ, such that

$$P(\hat\theta_{r,m} \in \Theta) \to 1 \quad \text{and} \quad P(\hat\theta_{h,m} \in \Theta) \to 1, \quad \text{as } m \to \infty, \tag{A.2}$$

$$\sup_{\theta^* \in \Theta} E_{y \mid \theta^*} \sup_{\theta \in \Theta} \left| \bar r_i(\theta_i \mid Y, \theta^*) - E_{y \mid \theta^*}\, r_i(\theta_i \mid y, \theta^*) \right| \to 0, \quad \text{as } m \to \infty, \tag{A.3}$$

for i = 1, …, n, and that all Eri(θi|y, θ0) have well-separated maxima at θ0,i. The latter guarantees that the system of equations $\theta_i = \arg\max_{\theta_i} E\, r_i(\theta_i \mid y, \theta^*)$ has a unique solution for θ* = θ0. If we let θ* = θ0 in condition (A.3), we obtain uniform convergence in mean of the sample log-likelihood of y to the expected one, which, together with the third condition, the well-separatedness of the maxima, is sufficient for the MLE consistency.

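Next we observe that

$$
\sup_{\theta \in \Theta} \left| \bar h_i(\theta_i \mid Y, \theta^*) - E\, h_i(\theta_i \mid y, \theta^*) \right|
\le \sup_{\theta \in \Theta} E_{y_{\overline{i,\Pi_i}} \mid y_i,\, y_{j \in \Pi_i}} \left| \bar r_i(\theta_i \mid Y, \theta^*) - E\, r_i(\theta_i \mid y, \theta^*) \right|
\le E_{y_{\overline{i,\Pi_i}} \mid y_i,\, y_{j \in \Pi_i}} \sup_{\theta \in \Theta} \left| \bar r_i(\theta_i \mid Y, \theta^*) - E\, r_i(\theta_i \mid y, \theta^*) \right|
$$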

by (A.1). Then condition (A.3) holds for h as well and therefore, the score function h also produces consistent estimators.
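The two steps of the bound can be read as follows. This is a sketch in shorthand introduced here: $E_c$ abbreviates the conditional expectation appearing in (A.1), and the arguments $(\theta_i \mid \cdot, \theta^*)$ are suppressed. The first step uses (A.1), the equality $E h_i = E r_i$ and the inequality $|E X| \le E|X|$. The second uses the fact that a supremum of expectations is at most the expectation of the supremum.

$$
\left| \bar h_i - E h_i \right| = \left| E_c\!\left( \bar r_i - E r_i \right) \right| \le E_c \left| \bar r_i - E r_i \right|,
\qquad
\sup_{\theta \in \Theta} E_c \left| \bar r_i - E r_i \right| \le E_c \sup_{\theta \in \Theta} \left| \bar r_i - E r_i \right|.
$$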

Are the above sufficient conditions for consistency reasonable to assume? Which ones are general model properties that can be induced, and which ones are essential? The identifiability of the MLE, guaranteed by the well-separatedness of the maximum of Er(θ|y, θ0), is an essential condition and is model specific; it cannot be induced. Let Θ be a compact subset of the parameter space that contains θ0 in its interior and does not intersect any of the hyper-planes σi = 0 and P(xi = c | Πi = C, θi) = 0. Such a Θ exists whenever the conditional probabilities in θ0 are bounded away from zero. Then it can be shown that condition (A.2) follows from the well-separatedness of the maximum of Eri(θi|y, θ0) and the compactness of Θ. To show (A.3) we express (6) as

$$
r_i(\theta_i \mid y, \theta^*) = -\frac{1}{2}\log(2\pi\sigma_i^2) - \frac{1}{2\sigma_i^2} \sum_{c \in \mathcal{C}_i} (y_i - \beta_{i,c})^2\, P(x_i = c \mid y, \theta^*) + \sum_{c \in \mathcal{C}_i,\, C \in \mathcal{C}[\Pi_i]} \log\!\big(P(x_i = c \mid \Pi_i = C, \theta_i)\big)\, P(x_i = c, \Pi_i = C \mid y, \theta^*)
$$

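and observe that, because of the compactness of Θ, the supremum in (A.3) is bounded by

$$
\max_{\theta^*} E_{y \mid \theta^*} \max_{\theta} \Big\{ b_1 \left| \overline{(y_i - \beta_i)^2} - E (y_i - \beta_i)^2 \right| + b_2 \left| \overline{P(x_i, \Pi_i \mid y, \theta^*)} - E\, P(x_i, \Pi_i \mid y, \theta^*) \right| \Big\},
$$

for $b_1 = \max_{\Theta} (1/2\sigma_i^2) < \infty$ and $b_2 = \max_{\Theta} \left| \log P(x_i = c \mid \Pi_i = C, \theta_i) \right| < \infty$, where the bar denotes the sample average. Further considerations show that both terms above converge to zero. The convergence of the first can be verified using the properties of the Gaussian distribution. For the second, since there are only finitely many possible parent sets Πi, in fact at most $\binom{n-1}{|\Pi_i|}$, we have

$$
\max_{\theta} \left| \overline{P(x_i, \Pi_i \mid y, \theta^*)} - E\, P(x_i, \Pi_i \mid y, \theta^*) \right| \le \sum_{\Pi_i} \left| \overline{P(x_i, \Pi_i \mid y, \theta^*)} - E\, P(x_i, \Pi_i \mid y, \theta^*) \right|,
$$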

and the latter sum converges to zero in mean by an application of the dominated convergence theorem. Therefore, for Θ as chosen above, (A.3) is satisfied.
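To make the structure of ri concrete, the following Python sketch evaluates it for a single observation as the sum of a Gaussian part and a categorical part: −½ log(2πσi²), minus the posterior-weighted squared residuals divided by 2σi², plus the posterior-weighted log conditional probabilities. The function name, the argument layout and the toy numbers are illustrative assumptions and are not part of mugnet. The posterior weights are taken as given rather than computed by an E-step.

```python
import math


def r_i(y_i, sigma_i, beta, post_x, cpt, post_joint):
    """Evaluate the node-i term for one observation.

    beta[c]            : Gaussian mean beta_{i,c} for category c
    post_x[c]          : P(x_i = c | y, theta*)
    cpt[(c, C)]        : P(x_i = c | Pi_i = C, theta_i), assumed strictly positive
    post_joint[(c, C)] : P(x_i = c, Pi_i = C | y, theta*)
    """
    gauss_const = -0.5 * math.log(2.0 * math.pi * sigma_i ** 2)
    gauss_quad = -sum((y_i - beta[c]) ** 2 * post_x[c] for c in beta) / (2.0 * sigma_i ** 2)
    # Terms with zero posterior weight contribute nothing and are skipped.
    categorical = sum(math.log(cpt[key]) * w for key, w in post_joint.items() if w > 0.0)
    return gauss_const + gauss_quad + categorical


# Toy inputs (assumed values): two categories, one binary parent.
beta = {1: 0.0, 2: 2.0}
post_x = {1: 0.5, 2: 0.5}
cpt = {(c, C): 0.5 for c in (1, 2) for C in (1, 2)}
post_joint = {(c, C): 0.25 for c in (1, 2) for C in (1, 2)}

value = r_i(0.0, 1.0, beta, post_x, cpt, post_joint)
print(value)
```

On a set Θ that keeps σi and the conditional probabilities away from zero, both 1/(2σi²) and the logarithms stay finite, which is what the constants b1 and b2 capture.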

Footnotes

1

The simulations are run on a 2.66 GHz i5 processor. The D+BN and MGBN estimations are performed using the catnet and mugnet R packages.

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  1. Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci USA. 2001;98(24):13790–13795. doi: 10.1073/pnas.191502998.
  2. Beer D, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med. 2002;8(8):816–824. doi: 10.1038/nm733.
  3. Buntine WL. A Guide to the Literature on Learning Probabilistic Networks from Data. IEEE Transactions on Knowledge and Data Engineering. 1996;8(2).
  4. Chickering DM. Learning Bayesian networks is NP-complete. In: Learning from Data: Artificial Intelligence and Statistics V. Springer-Verlag; 1996. p. 121–130.
  5. Chickering DM, Heckerman D. Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables. Machine Learning. 1996;29:181–212.
  6. Dempster A, Laird N, Rubin D. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B. 1977;39:1–38.
  7. Heckerman D, Geiger D, Chickering D. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning. 1995.
  8. Heckerman D, Geiger D. Learning Bayesian Networks: A Unification for Discrete and Gaussian Domains. Proceedings of UAI ’95. 1995:274–284.
  9. Lauritzen SL. The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis. 1995;19:191–201.
  10. Lauritzen SL. Propagation of Probabilities, Means and Variances in Mixed Graphical Association Models. Journal of the American Statistical Association. 1992.
  11. Lauritzen SL, Wermuth N. Mixed interaction models. Tech Rep R-84-8. Institute of Electronic Systems, Aalborg University, Denmark; 1984.
  12. Lauritzen SL, Jensen F. Stable local computation with conditional Gaussian distributions. Statistics and Computing. 2001:191–203.
  13. Friedman N. Learning Belief Networks in the Presence of Missing Values and Hidden Variables. Proceedings of the Fourteenth International Conference on Machine Learning; 1997. pp. 125–133.
  14. Neapolitan RE. Learning Bayesian Networks. Prentice Hall; 2003.
  15. Saal L, et al. Poor prognosis in carcinoma is associated with a gene expression signature of aberrant PTEN tumor suppressor pathway activity. Proc Natl Acad Sci USA. 2007;104(18):7564–7569. doi: 10.1073/pnas.0702507104.
  16. Salzman P, Almudevar A. Using Complexity for the Estimation of Bayesian Networks. Statistical Applications in Genetics and Molecular Biology. 2006;5(1). doi: 10.2202/1544-6115.1208.