Author manuscript; available in PMC 2013 Dec 17.
Published in final edited form as: Stat Anal Data Min. 2012 Oct 29;5(6). doi: 10.1002/sam.11168

Maximum Likelihood Estimation Over Directed Acyclic Gaussian Graphs

Yiping Yuan 1, Xiaotong Shen 1, Wei Pan 2
PMCID: PMC3866136  NIHMSID: NIHMS461070  PMID: 24358074

Abstract

Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over the elements and element-wise differences of adjacency matrices, for identifying the sparsity structure as well as detecting structural changes across the adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving the convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives on simulated and real data.

Keywords: collapsed networks, nonconvex constraints, pairwise coordinate descent

1. INTRODUCTION

Directed acyclic graphical models have been useful in analyzing and visualizing causal relations among random variables. The major components of these models are nodes, representing random variables, and edges, encoding conditional dependence relations between the nodes they connect. In gene network analysis, a large amount of current biological knowledge has been organized as regulatory and other interactive networks, for instance, as a collection of pathways representing molecular and biochemical reactions in the Kyoto Encyclopedia of Genes and Genomes. In such a situation, a key question is how to extract gene–gene causal effects to advance our understanding of molecular interactions within a subnetwork of the genes. In this article, we therefore reconstruct multiple directed acyclic graphs (DAGs) given an ordering to detect structural changes over graphs.

Reconstruction of a single DAG from data without a given ordering is challenging due to the enormous number of candidate DAGs, which is super-exponential in the number of nodes. Existing approaches can be categorized into search-and-score and constraint-based methods [1]. For problems with a small or moderate number of nodes, both methods perform well [1–5]. Recently, interesting hybrid methods have been developed to reduce computational cost [6]. For high-dimensional problems with a large number of nodes, challenges remain on the computational front. Note that even the Peter–Clark (PC) algorithm [7, 8], which has a polynomial runtime when a graph is sparse, runs in the worst case in time that is exponential in the number of nodes. In fact, existing methods are impractical because the space of equivalence classes is conjectured to grow super-exponentially in the number of nodes [9].

Reconstruction of a DAG given an ordering is equivalent to sparse estimation of the Cholesky decomposition of a covariance matrix, and thus is computationally feasible [10]. The ordering information is usually determined by a natural ordering of temporal observations, previous experiments, and prior knowledge. In this article, we reconstruct multiple DAGs given a known ordering. To our knowledge, estimation of structural changes over multiple DAGs has not yet been explored, although that over multiple undirected graphs has been studied [12–14]. In practice, identifying a change in causal structure arises when detecting changes corresponding to different experimental conditions, or responses to a certain event or treatment.

This paper focuses on maximum likelihood estimation of multiple DAGs under a structural equation model. It is known that maximum likelihood estimation breaks down when the number of variables exceeds the sample size. Even for a moderately sized problem, it always yields a complete graph and does not estimate graphical structures well. Therefore, different methods using penalization have been proposed for sparse estimation of graphical models [14–16, 10]. To achieve our goal of learning graphical structures, we construct two nonconvex constraints based on the truncated L1-function (TLP; [17]), as a computational surrogate of the L0-function, with one constraint imposing sparseness and the other encouraging a common structure. Computationally, with difference convex (DC) programming and augmented Lagrange multipliers, the nonconvex minimization is solved through a sequence of convex subproblems iteratively. For each subproblem, we develop a fast algorithm to treat a constrained L1-problem, which we call the pairwise coordinate descent algorithm.

The rest of the paper is organized as follows. Section 2 introduces the methodology, followed by our computational development in Section 3. Operating characteristics of the proposed method are examined on simulated and real data in Sections 4 and 5, respectively.

2. METHODOLOGY

Given L p-dimensional random vectors $X^{(l)} = (X_1^{(l)}, \ldots, X_p^{(l)})^T$ with a known ordering, one from each population, we use L DAGs to describe causal relations within each population and to explore differences among the populations. That is, each component $X_i^{(l)}$ corresponds to one node in the lth DAG, with a directed edge between two nodes indicating a causal relation between them. Without loss of generality, we assume that $X^{(l)}$ has been sorted according to its order, so that a causal relation is only possible from $X_i^{(l)}$ to $X_j^{(l)}$ for i < j.

To model causality among the components of X(l), consider a structural equation model of the form

$$X^{(l)} = A^{(l)} X^{(l)} + \varepsilon^{(l)}, \qquad l = 1, \ldots, L, \tag{1}$$

where $A^{(l)}$ is an adjacency matrix in which a nonzero $(j,k)$-th element of $A^{(l)}$ corresponds to a directed edge from parent node k to child node j, with its value $A_{jk}^{(l)}$ indicating the strength of the relationship, and $\varepsilon^{(l)} = (\varepsilon_1^{(l)}, \ldots, \varepsilon_p^{(l)})^T$ is an independent latent variable vector representing unexplained variations in the nodes and acting as random noise. Note that $A_{i,j}^{(l)} = 0$ for i < j, since the $X^{(l)}$'s are assumed to be ordered. In addition, $A_{ii}^{(l)} = 0$ for all i, as a self-loop is not allowed in a DAG. Therefore, the adjacency matrices $A^{(l)}$, l = 1, …, L, are lower triangular with zero diagonal elements, that is, $A_{i,j}^{(l)} = 0$ for $j \ge i$. The model basically says that each $X_j^{(l)}$ depends linearly on its parent variables and a latent variable $\varepsilon_j^{(l)}$. Here we assume that $\varepsilon_1^{(l)}, \ldots, \varepsilon_p^{(l)}$ follow independent normal distributions, that is, $\varepsilon_j^{(l)} \stackrel{\mathrm{ind}}{\sim} N(0, (\sigma_j^{(l)})^2)$. This implies that the $X^{(l)}$'s follow multivariate normal distributions. Note that Eq. (1) becomes a Gaussian autoregressive model when the subscript of $X_i^{(l)}$ denotes consecutive time. Our likelihood method is readily generalizable to other distributions. The reader may refer to Ref [18] for Eq. (1) and structural equations.
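Given the ordering, sampling from Eq. (1) reduces to a forward recursion over nodes. A minimal sketch, with a hypothetical 3-node chain and illustrative weights and noise scales not taken from the paper:

```python
import random

def simulate_sem(A, sigma, rng):
    """Draw one sample of X from X = A X + eps, with A strictly lower triangular.

    Because of the known ordering, X_j depends only on earlier nodes, so the
    sample is built recursively: X_j = sum_{k<j} A[j][k] * X_k + eps_j.
    """
    p = len(A)
    x = [0.0] * p
    for j in range(p):
        x[j] = sum(A[j][k] * x[k] for k in range(j)) + rng.gauss(0.0, sigma[j])
    return x

# Hypothetical 3-node chain X1 -> X2 -> X3 with illustrative edge weights.
A = [[0.0, 0.0, 0.0],
     [0.8, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
sigma = [1.0, 1.0, 1.0]
sample = simulate_sem(A, sigma, random.Random(0))
```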

In Eq. (1), nonzero entries of A(l) are uniquely specified by the lth DAG. Thus, we estimate (A(1), …, A(L)) to preserve a common structure and identify differences among them.

For a total of $n = \sum_{l=1}^{L} n_l$ random samples, $n_l$ samples are drawn according to Eq. (1) for each l to form an $n_l \times p$ data matrix $\mathcal{X}^{(l)} = (x_1^{(l)}, \ldots, x_p^{(l)})$, where each $x_j^{(l)} = (x_{1,j}^{(l)}, \ldots, x_{n_l,j}^{(l)})^T$ is an $n_l$-dimensional column vector of observations for node j, and samples from different populations $\mathcal{X}^{(l)}$ are assumed to be independent. Note that an arbitrary mean vector can be incorporated by adding an intercept to Eq. (1).

Let $\prec_k = \{1, \ldots, k-1\}$ be the index set of nodes preceding node k, with $\prec_1$ indicating the null set. Let $\mathcal{X}_{\prec j}^{(l)} = (x_1^{(l)}, \ldots, x_{j-1}^{(l)})$ and $A_{j, \prec j}^{(l)} = (A_{j,1}^{(l)}, \ldots, A_{j,j-1}^{(l)})$. The likelihood of $(\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(L)})$ can be written as

$$f(\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(L)}) = \prod_{l=1}^{L} f(\mathcal{X}^{(l)}) = \prod_{l=1}^{L} \prod_{j=1}^{p} f\big(x_j^{(l)} \mid \mathcal{X}_{\prec j}^{(l)}\big) = \prod_{j=1}^{p} \prod_{l=1}^{L} f\big(x_j^{(l)} \mid \mathcal{X}_{\prec j}^{(l)}\big). \tag{2}$$

This yields the negative log-likelihood

$$-\log f(\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(L)}) = \sum_{j=1}^{p} \Big( \sum_{l=1}^{L} \big( -\log f\big(x_j^{(l)} \mid \mathcal{X}_{\prec j}^{(l)}\big) \big) \Big). \tag{3}$$

Using the fact that $X_j^{(l)} \mid X_{\prec j}^{(l)}$ follows $N\big(A_{j,\prec j}^{(l)} X_{\prec j}^{(l)}, (\sigma_j^{(l)})^2\big)$ from Eq. (1), we obtain that

$$-\log f(\mathcal{X}^{(1)}, \ldots, \mathcal{X}^{(L)}) = \sum_{j=1}^{p} \Big( \sum_{l=1}^{L} \Big( \frac{1}{2(\sigma_j^{(l)})^2} \big\| x_j^{(l)} - \mathcal{X}_{\prec j}^{(l)} \big(A_{j,\prec j}^{(l)}\big)^T \big\|^2 + n_l \log \sigma_j^{(l)} \Big) \Big). \tag{4}$$

Maximizing Eq. (2), or equivalently minimizing Eq. (4), may result in over-fitting and lead to fully connected DAGs, especially when the number of unknown parameters exceeds the sample size. We therefore regularize Eq. (3) through nonconvex constraints to pursue sparsity and detect structural changes. Note that the constrained approach is not equivalent to its penalized regularization counterpart because of the nonconvexity in this case.
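A single (j, l) term of Eq. (4) is just a residual sum of squares scaled by the noise variance plus a log-variance term; a minimal sketch (function and variable names are ours):

```python
import math

def node_neg_loglik(y, X, beta, sigma):
    """One (j, l) term of Eq. (4): ||y - X beta||^2 / (2 sigma^2) + n log(sigma),
    where y holds the node's n_l observations and X its parents' columns."""
    n = len(y)
    rss = sum((y[i] - sum(X[i][k] * beta[k] for k in range(len(beta)))) ** 2
              for i in range(n))
    return rss / (2.0 * sigma ** 2) + n * math.log(sigma)

# Toy data: one parent, two observations (illustrative numbers).
val = node_neg_loglik([1.0, 2.0], [[1.0], [2.0]], [1.0], 1.0)
```

With a perfect fit and unit variance both terms vanish, so `val` is 0.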

Our method regularizes the number of nonzeros of the adjacency matrices as well as the number of pairwise differences between the corresponding entries across adjacency matrices. Let ℒ be a set of index pairs in which a pair (l, s) indicates the possibility that $A^{(l)}$ and $A^{(s)}$ share some common entries and can be grouped, or collapsed, if the data suggest so. Constraints of the following form are used to regularize:

$$\sum_{l=1}^{L} \|A^{(l)}\|_0 \le t_1, \qquad \sum_{(l,s) \in \mathcal{L}} \|A^{(l)} - A^{(s)}\|_0 \le t_2, \tag{5}$$

where $\|A\|_0 := \sum_{i=1}^{p} \sum_{j=1}^{p} I(|A_{i,j}| \neq 0)$ is the L0-norm of A, that is, the number of nonzero entries of A, and $t_1 \ge 0$ and $t_2 \ge 0$ are tuning parameters bounding the number of nonzeros of the adjacency matrices and the number of element-wise differences with respect to ℒ. A complete set ℒ = {(l, s) : 1 ≤ l < s ≤ L} is used unless additional information is available. For example, a temporal set ℒ = {(l, l + 1) : 1 ≤ l ≤ L − 1} is used in dynamic networks, with l representing consecutive times.

To allow for different degrees of sparsity over different rows of the lower triangular matrices $A^{(l)}$, we replace Eq. (5) by p pairs of row-wise constraints through 2p different tuning parameters $\{t_{1,j}, t_{2,j}\}$:

$$\sum_{l=1}^{L} \sum_{k=1}^{j-1} \big\|A_{j,k}^{(l)}\big\|_0 \le t_{1,j}, \qquad \sum_{(l,s)\in\mathcal{L}} \sum_{k=1}^{j-1} \big\|A_{j,k}^{(l)} - A_{j,k}^{(s)}\big\|_0 \le t_{2,j}, \qquad j = 1, \ldots, p. \tag{6}$$

This is computationally feasible because the log-likelihood (3) is separable or decomposable in j. In fact, minimizing Eq. (3) subject to Eq. (6) reduces to p subproblems.

To circumvent the computational difficulty of minimizing Eq. (3) subject to Eq. (6), we approximate the L0-function there by a surrogate, the truncated L1-function (TLP; [17]), defined as $P_\lambda(x) = \min(|x|/\lambda, 1)$. As λ tends to 0, the TLP recovers the L0-function exactly. Now the constraints in Eq. (6) become

$$\sum_{l=1}^{L} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)}\big) \le t_{1,j}, \qquad \sum_{(l,s)\in\mathcal{L}} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)} - A_{j,k}^{(s)}\big) \le t_{2,j}, \qquad j = 1, \ldots, p. \tag{7}$$
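The TLP is one line of code; the sketch below also illustrates why it acts as a surrogate for the L0-function when λ is small:

```python
def tlp(x, lam):
    """Truncated L1 penalty: P_lambda(x) = min(|x| / lambda, 1)."""
    return min(abs(x) / lam, 1.0)

# For small lambda, the TLP behaves like the L0 indicator I(x != 0):
values = [tlp(v, 0.01) for v in (0.0, 0.5, -2.0)]
```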

To simplify tuning, we introduce a single constraint for each row as opposed to the two constraints in Eq. (7), with new tuning parameters (κj, tj) corresponding to (t1,j, t2,j),

$$\sum_{l=1}^{L} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)}\big) + \kappa_j \sum_{(l,s)\in\mathcal{L}} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)} - A_{j,k}^{(s)}\big) \le t_j, \qquad j = 1, \ldots, p, \tag{8}$$

where κj seeks a trade-off between sparsity and grouping.

On the basis of the foregoing discussion, we solve Eq. (3) subject to Eq. (8) by solving its equivalent form through p subproblems: minimize $\sum_{l=1}^{L} \Big( \frac{1}{2(\sigma_j^{(l)})^2} \big\| x_j^{(l)} - \mathcal{X}_{\prec j}^{(l)} \big(A_{j,\prec j}^{(l)}\big)^T \big\|^2 + n_l \log \sigma_j^{(l)} \Big)$, subject to

$$\sum_{l=1}^{L} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)}\big) + \kappa_j \sum_{(l,s)\in\mathcal{L}} \sum_{k=1}^{j-1} P_\lambda\big(A_{j,k}^{(l)} - A_{j,k}^{(s)}\big) \le t_j, \qquad j = 1, \ldots, p. \tag{9}$$

These p subproblems are of the same type; hence we only need to consider a general form. Let $Y^{(l)}$ be a vector of length $n_l$, corresponding to $x_j^{(l)}$ in Eq. (9); let $X^{(l)}$ be an $n_l \times m$ matrix, corresponding to $\mathcal{X}_{\prec j}^{(l)}$ with m = j − 1; and let $\beta^{(l)}$ be an m-dimensional vector corresponding to $A_{j,\prec j}^{(l)}$. Then a general form is: minimize $\sum_{l=1}^{L} \Big( \frac{1}{2\sigma_l^2} \big\| Y^{(l)} - X^{(l)} \beta^{(l)} \big\|^2 + n_l \log \sigma_l \Big)$, subject to

$$\sum_{l=1}^{L} \sum_{j=1}^{m} \min\Big(\frac{|\beta_j^{(l)}|}{\lambda}, 1\Big) + \kappa \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \min\Big(\frac{|\alpha_j^{(ls)}|}{\lambda}, 1\Big) \le t, \tag{10}$$

where $\alpha_j^{(ls)} = \beta_j^{(l)} - \beta_j^{(s)}$, and $\zeta = \big( \{\beta_j^{(l)}\}_{l=1,\ldots,L;\, j=1,\ldots,m},\ \{\alpha_j^{(ls)}\}_{j=1,\ldots,m;\,(l,s)\in\mathcal{L}} \big)$ is our new set of variables to be optimized. In addition, a new constraint $T\zeta = 0$ is imposed, namely $\beta_j^{(l)} - \beta_j^{(s)} - \alpha_j^{(ls)} = 0$, for j = 1, …, m and (l, s) ∈ ℒ.

3. COMPUTATION

This section develops a computational method for nonconvex minimization in Eq. (10) through DC programming, augmented Lagrange multipliers and our pairwise coordinate descent algorithm.

3.1. Difference Convex Programming

For minimization in Eq. (10), we employ DC programming, which leads to a finite-step termination due to piecewise linearity of the TLP function [17]. Here Pλ can be decomposed into a difference of two convex functions:

$$P_\lambda(x) = \min\Big(\frac{|x|}{\lambda}, 1\Big) = \frac{|x|}{\lambda} - \max\Big(\frac{|x|}{\lambda} - 1, 0\Big). \tag{11}$$

This in turn yields a decomposition of the left-hand side of Eq. (10) into a difference of two convex parts, that is,

$$S_1(\zeta) - S_2(\zeta) \le t,$$

where

$$S_1(\zeta) = \sum_{l=1}^{L} \sum_{j=1}^{m} \frac{|\beta_j^{(l)}|}{\lambda} + \kappa \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \frac{|\alpha_j^{(ls)}|}{\lambda}, \tag{12}$$
$$S_2(\zeta) = \sum_{l=1}^{L} \sum_{j=1}^{m} \max\Big(\frac{|\beta_j^{(l)}|}{\lambda} - 1, 0\Big) + \kappa \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \max\Big(\frac{|\alpha_j^{(ls)}|}{\lambda} - 1, 0\Big). \tag{13}$$
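The identity in Eq. (11) can be checked numerically coordinate by coordinate; S1 and S2 in Eqs. (12)–(13) are sums of exactly these two pieces:

```python
def tlp(x, lam):
    return min(abs(x) / lam, 1.0)

def convex_part(x, lam):      # the |x|/lambda piece of Eq. (11)
    return abs(x) / lam

def concave_part(x, lam):     # the max(|x|/lambda - 1, 0) piece of Eq. (11)
    return max(abs(x) / lam - 1.0, 0.0)

# The decomposition P_lambda = (convex part) - (subtracted convex part)
# holds pointwise:
checks = [abs(tlp(v, 0.5) - (convex_part(v, 0.5) - concave_part(v, 0.5))) < 1e-12
          for v in (-3.0, -0.5, 0.0, 0.2, 1.0, 4.0)]
```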

This constructs a sequence of convex approximation sets contained in the original nonconvex set, built iteratively by replacing $S_2$ at iteration k with its affine minorization from iteration k − 1. At iteration k, we carry out the minimization in Eq. (10) subject to Tζ = 0 and the relaxed constraint

$$\sum_{l=1}^{L} \sum_{j=1}^{m} \big|\beta_j^{(l)}\big|\, I\big(\big|\hat\beta_j^{(l)[k-1]}\big| \le \lambda\big) + \kappa \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \big|\alpha_j^{(ls)}\big|\, I\big(\big|\hat\alpha_j^{(ls)[k-1]}\big| \le \lambda\big) \le \lambda \Big( t - \sum_{l=1}^{L} \sum_{j=1}^{m} I\big(\big|\hat\beta_j^{(l)[k-1]}\big| > \lambda\big) - \kappa \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} I\big(\big|\hat\alpha_j^{(ls)[k-1]}\big| > \lambda\big) \Big), \tag{14}$$

where $\hat\beta_j^{(l)[k-1]}$ and $\hat\alpha_j^{(ls)[k-1]}$ are the estimates of $\beta_j^{(l)}$ and $\alpha_j^{(ls)}$ at the (k − 1)th iteration.

3.2. Augmented Lagrange Multipliers

The constraint Tζ = 0, defined through the slack variables $\alpha_j^{(ls)}$, is treated through augmented Lagrange multipliers, which convert a constrained problem into an unconstrained one. At iteration w, we minimize S(ζ) subject to Eq. (14), where

$$S(\zeta) = \sum_{l=1}^{L} \Big( \frac{1}{2(\sigma^{(l)})^2} \big\| Y^{(l)} - X^{(l)} \beta^{(l)} \big\|^2 + n_l \log \sigma^{(l)} \Big) + \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \tau_j^{(ls)[w]} \big( \beta_j^{(l)} - \beta_j^{(s)} - \alpha_j^{(ls)} \big) + \frac{1}{2} \sum_{(l,s)\in\mathcal{L}} \sum_{j=1}^{m} \nu_j^{(ls)[w]} \big( \beta_j^{(l)} - \beta_j^{(s)} - \alpha_j^{(ls)} \big)^2, \tag{15}$$

where $\tau_j^{(ls)[w]}$ are Lagrange multipliers for Tζ = 0 and $\nu_j^{(ls)[w]}$ control the convergence speed of the algorithm. They are updated until convergence:

$$\tau_j^{(ls)[w+1]} = \tau_j^{(ls)[w]} + \nu_j^{(ls)[w]} \big( \beta_j^{(l)} - \beta_j^{(s)} - \alpha_j^{(ls)} \big), \qquad \nu_j^{(ls)[w+1]} = \rho\, \nu_j^{(ls)[w]},$$

where ρ ∈ (1, 2) is pre-determined. At iteration k of the DC loop and iteration w of the augmentation loop, we minimize Eq. (15) subject to Eq. (14). This weighted Lasso problem is solved by the pairwise coordinate descent algorithm introduced next.
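The multiplier update can be sketched per constraint. Note that the source alternates between ν and μ for the inflation factor's companion; we use ν throughout, and ρ = 1.5 is an illustrative choice within (1, 2):

```python
def update_multipliers(beta_l, beta_s, alpha, tau, nu, rho=1.5):
    """One augmented-Lagrangian update for the constraint
    beta_l - beta_s - alpha = 0: the multiplier tau moves by nu times the
    constraint violation, and nu is inflated by rho in (1, 2)."""
    residual = beta_l - beta_s - alpha
    return tau + nu * residual, rho * nu

tau, nu = update_multipliers(1.0, 0.4, 0.1, tau=0.0, nu=2.0)
```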

3.3. Pairwise Coordinate Descent

A Lasso problem [19] solves

$$\min_{\beta_1, \ldots, \beta_m} f(\beta), \qquad \text{subject to } \sum_{j=1}^{m} |\beta_j| \le t, \tag{16}$$

where f is a convex cost function.

For Eq. (16), coordinate descent methods are applicable to its regularized version [20, 21]. However, such a method breaks down for the constrained form in Eq. (16), as its solution may be trapped [20]. Here we develop a directional blockwise coordinate descent method to solve Eq. (15) subject to Eq. (14). The main idea is to seek an optimal solution only over the simplex boundary, where the solution lies. This directional search strategy prevents the search from being trapped at the constraint boundary.

A solution of Eq. (16) exists when f is continuous, since the feasible region for β is compact. For sparse learning, the Lasso constraint is active only when the feasible region defined by the constraint excludes all global minima of f(β), in which case any solution of Eq. (16) lies on the boundary. Instead of searching coordinate-wise, we move along directions with Δ|β_i| = −Δ|β_j| to search over the boundary, where Δ|β_i| is the change in the absolute value of β_i after a step. The pair (i, j) is chosen to give the steepest descent in the cost function value among all pairs.

The algorithm is summarized as follows.

Algorithm 1 Pairwise Coordinate Descent
Step 1. (Initialization) Input an initial value β0 for β satisfying ‖β0‖1 = t.
Step 2. (Iteration) Find the steepest descent direction satisfying Δ|β_i| = −Δ|β_j|. Perform an exact line search to determine the best step length, and update.
Step 3. (Stopping rule) Terminate when the following subdifferential condition is satisfied: there exists λ > 0 such that $-\operatorname{sign}(\beta_j)\, \partial f(\beta)/\partial \beta_j = \lambda$ for $\beta_j \neq 0$, and $|\partial f(\beta)/\partial \beta_j| < \lambda$ for $\beta_j = 0$.
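The steps above can be made concrete for a quadratic cost. The sketch below is our simplified reading of Algorithm 1, not the authors' implementation: candidate pairs are scored by the directional derivative, the step uses an exact line search for the least-squares loss, and the budget ‖β‖1 = t is preserved by capping the step where β_j reaches zero.

```python
def pairwise_cd(X, y, t, max_iter=500, tol=1e-9):
    """Sketch of pairwise coordinate descent for f(beta) = ||y - X beta||^2
    subject to ||beta||_1 <= t: each move transfers mass between a pair of
    coordinates (delta|beta_i| = -delta|beta_j|), staying on the boundary."""
    n, m = len(X), len(X[0])
    beta = [t / float(m)] * m                  # initial point with ||beta||_1 = t

    def gradient(b):
        r = [y[i] - sum(X[i][k] * b[k] for k in range(m)) for i in range(n)]
        return [-2.0 * sum(X[i][k] * r[i] for i in range(n)) for k in range(m)]

    for _ in range(max_iter):
        g = gradient(beta)
        best = None                            # (descent rate, i, s_i, j, s_j)
        for j in range(m):                     # j: coordinate whose |beta_j| shrinks
            if beta[j] == 0.0:
                continue
            sj = 1.0 if beta[j] > 0 else -1.0
            for i in range(m):                 # i: coordinate whose |beta_i| grows
                if i == j:
                    continue
                if beta[i] != 0.0:
                    si = 1.0 if beta[i] > 0 else -1.0
                else:
                    si = -1.0 if g[i] > 0 else 1.0
                rate = si * g[i] - sj * g[j]   # directional derivative along d
                if best is None or rate < best[0]:
                    best = (rate, i, si, j, sj)
        if best is None or best[0] >= -tol:
            break                              # no descending pair: stop
        rate, i, si, j, sj = best
        # exact line search for the quadratic along d = si*e_i - sj*e_j
        dhd = 2.0 * sum((X[r][i] * si - X[r][j] * sj) ** 2 for r in range(n))
        step = -rate / dhd if dhd > 0.0 else abs(beta[j])
        step = min(step, abs(beta[j]))         # stop where beta_j hits zero
        beta[i] += si * step
        beta[j] -= sj * step
    return beta

# Identity design, y = (3, 1), budget t = 2: the Lasso solution is (2, 0).
est = pairwise_cd([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0], 2.0)
```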

Convergence of the algorithm is assured by Theorem 1.

THEOREM 1: For Eq. (16), stationary points and minimum points coincide. If the minimizer is unique, the algorithm converges to it.

If the cost function has nuisance parameters in addition to β, for example $\sigma^{(l)}$, then we treat them as unconstrained coordinates and update them coordinate-wise in every iteration.

The DC loop and the augmentation loop converge in only a few steps in practice [17]. Taking advantage of Algorithm 1 in the inner loop, our method is capable of treating multiple graphs with thousands of variables in real time. This is desirable for our nonconvex minimization problem.

4. NUMERICAL EXAMPLES

This section examines operating characteristics of the proposed method and contrasts it against its competitors through simulations. In particular, the proposed method, denoted by "nonconvex," is examined together with its convex counterpart, namely our method with the L1-function replacing the TLP, denoted by "convex," and a sparse L1 method from Ref [10] applied to each DAG individually, denoted by "DAGlasso."

In simulations, two DAGs are considered, and the complete set ℒ = {(1, 2)} is used for possible grouping. All simulations are performed in R. Performance metrics for sparsity and grouping pursuit are the number of false positives (FP), the number of false negatives (FN), the number of correctly identified differences between graphs (TD), and the number of falsely identified differences between graphs (FD). Overall, we use the Matthews correlation coefficient (MCC) [22] as a performance metric; it is commonly used in binary classification and is defined as

$$\mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}}, \tag{17}$$

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. A larger value of MCC indicates a better fit, with 1 being the best and −1 the worst.
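Eq. (17) translates directly into code; the convention of returning 0 when a marginal count is zero is ours, not the paper's:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient as in Eq. (17)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0

perfect = mcc(20, 80, 0, 0)   # all edges and non-edges recovered
```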

Two graphs are generated as follows. The first DAG 𝒢1 is generated at random with p = 50, 100, 200 nodes, where each node connects to any node of higher order with average probability 0.02. The second DAG 𝒢2 is obtained by removing a number of edges from 𝒢1, controlled to be less than 1% of the total edges in 𝒢1. For p = 50, the two DAGs are identical. For p = 100, 200, removal is done by deleting all edges emanating from a specific node, which mimics a situation in which a certain gene in 𝒢1 becomes isolated due to some treatment effect. Finally, we generate random samples from 𝒢1 and 𝒢2, respectively, with equal sample sizes n1 = n2 = n.
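The simulation design above can be sketched as follows; the edge-weight range is illustrative, as the paper does not state how the weights were drawn:

```python
import random

def random_dag(p, prob, rng):
    """G1: strictly lower-triangular adjacency with each feasible edge k -> j
    (k < j) present with probability `prob`; weights here are illustrative."""
    A = [[0.0] * p for _ in range(p)]
    for j in range(p):
        for k in range(j):
            if rng.random() < prob:
                A[j][k] = rng.uniform(0.5, 1.0)
    return A

def isolate_node(A, node):
    """G2: delete all edges emanating from `node`, mimicking a gene
    isolated by a treatment effect."""
    B = [row[:] for row in A]
    for j in range(len(B)):
        B[j][node] = 0.0
    return B

rng = random.Random(1)
G1 = random_dag(100, 0.02, rng)
G2 = isolate_node(G1, 10)
```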

For the proposed method, there are three data-dependent tuning parameters (λ, κ, t) in Eq. (10). It has been shown in Ref [17] that estimation based on the TLP is not sensitive to the choice of λ as long as λ is small enough; a common set of values for λ is {0.1, 0.01, 0.001}. The parameter κ ranges between 0 and 1 and takes three to five values, based on our limited numerical experience. This makes tuning the three parameters much easier, with the effort focused on tuning t. In this simulation study, tuning parameters for all methods are optimized for prediction over an independent data set of size 1000 for each graph.

As suggested in Table 1, the proposed method compares favorably against its competitors across all the situations, in terms of the MCC, accuracy of estimation, and detection of a structural change. Interestingly, seeking common structures, or detecting structural changes, is more critical for multiple graphs than for a single graph, especially when the graphs share some common structure. In addition, our nonconvex method improves significantly over its convex counterpart. In most cases, our nonconvex method yields an MCC value close to 1, suggesting almost perfect learning of structures. In this sense, a nonconvex method is useful in estimating directed graphs.

Table 1.

Estimated quantities and their corresponding estimated standard errors (in parentheses) based on 100 simulation replications.

n    p    Method     FP           FN         TD        FD           MCC
50   50   DAGlasso   15.7(3.9)    0.2(0.4)   0(0)      15.3(3.8)    0.84(0.03)
50   50   Convex     3.9(2.7)     0(0)       0(0)      0.3(0.5)     0.95(0.03)
50   50   Nonconvex  0.3(0.8)     0.1(0.3)   0(0)      0(0.2)       0.99(0.01)
50   100  DAGlasso   73.7(8.6)    1.1(1)     2(0)      72.3(8.2)    0.83(0.02)
50   100  Convex     18.9(6.2)    0.4(0.9)   1.7(0.5)  0.9(1)       0.95(0.02)
50   100  Nonconvex  4.7(3.5)     1.6(2)     1.6(0.6)  0.2(0.5)     0.98(0.02)
50   200  DAGlasso   406.5(21.7)  14.5(3.8)  4.9(0.3)  402.1(20.8)  0.80(0.01)
50   200  Convex     122.3(15.3)  46.7(9.2)  2.5(1)    1.4(1.2)     0.90(0.01)
50   200  Nonconvex  30.1(11.2)   32.9(9.5)  3.2(1.2)  0.6(0.9)     0.96(0.01)
100  50   DAGlasso   6.7(2.6)     0(0)       0(0)      6.6(2.6)     0.92(0.03)
100  50   Convex     1.2(1.5)     0(0)       0(0)      0.1(0.3)     0.98(0.02)
100  50   Nonconvex  0(0.3)       0(0)       0(0)      0(0.2)       1.00(0.01)
100  100  DAGlasso   32.3(5.9)    0(0.1)     2(0)      31.8(6)      0.92(0.01)
100  100  Convex     6.3(3.2)     0(0)       1.9(0.3)  0.5(0.7)     0.98(0.01)
100  100  Nonconvex  1(1.3)       0(0)       2(0.1)    0.1(0.3)     1.00(0.01)
100  200  DAGlasso   162.7(14.0)  0.1(0.3)   5(0)      158.2(13.7)  0.91(0.01)
100  200  Convex     28.9(7.1)    0.1(0.4)   4.5(0.7)  1(0.9)       0.98(0.01)
100  200  Nonconvex  3.7(3.2)     1(1.8)     4.8(0.4)  0.3(0.7)     1.00(0.01)

Notes: For p = 50, there are 20 edges in 𝒢1 and 20 edges in 𝒢2 with no differences. For p = 100, there are 88 edges in 𝒢1 and 86 edges in 𝒢2 with two differences. For p = 200, there are 427 edges in 𝒢1 and 422 edges in 𝒢2 with five differences.

5. ANALYSIS OF CELL SIGNALING DATA

This section applies the proposed method to analyze the multivariate flow cytometry data of Ref [23]. Data were collected after a series of stimulatory cues and inhibitory interventions, with cell reactions stopped at 15 minutes after stimulation by fixation, to profile the effects of each condition on the intracellular signaling networks of human primary naive CD4+ T cells. Over 10,000 flow cytometry measurements were made on 11 phosphorylated proteins and phospholipids under 14 experimental conditions. The main purpose of the experiment was to infer causal influences in cellular signaling networks through perturbations with molecular interventions. The simultaneous measurements permit multivariate as opposed to univariate approaches. The data sets are available; see Ref [23] for more details. The DAG representation of the network is displayed in Fig. 1 [23]. A directed edge from node X to node Y is interpreted as a causal influence from X to Y.

Fig. 1. DAG representation for 11 proteins.

A DAG model was fit in Ref [23] with data from the first nine conditions, henceforth called Group 1, whereas the remaining five conditions, denoted as Group 2, were not used for estimation. Note that all five conditions in Group 2 employed ICAM-2, a general perturbation, in addition to the perturbations used in Group 1. In our analysis, we are interested in whether the use of ICAM-2 in Group 2 activates or inhibits some causal relationships in the network. Thus, the data have been split into two datasets: 7466 samples from Group 1 and 4206 samples from Group 2. We fit a two-DAG model with 1/10th of the samples for training and 9/10th for tuning. The reconstructed networks are displayed in Fig. 2, where the DAGlasso [10], applied to each graph individually, is used for comparison. Correct edges are marked with solid arrows, while false positives are indicated by long dashed arrows.

Fig. 2. Analysis of cell signaling data: correct edges are marked with solid arrows, while false positives are indicated by long dashes. (a) Group 1 by nonconvex. (b) Group 2 by nonconvex. (c) Group 1 by DAGlasso. (d) Group 2 by DAGlasso.

Our method is more reliable in that it gives far fewer false positives with almost the same number of true positives as the DAGlasso, so the identified links and differences between the two groups are likely to be real, which may be cross-validated experimentally. By comparison, the two graphs reconstructed by the DAGlasso are not as consistent, and the estimated differences are mainly false discoveries.

In the analysis, several known links were missed by both the proposed method and the DAGlasso. This is because many causal relations in protein-signaling networks are believed to be nonlinear, which may not be detectable by methods based on linear models, such as ours and the DAGlasso. In fact, to our knowledge, no linear method can reconstruct most of these links. This nonlinearity goes beyond the scope of the linear causal effects specified in Eq. (1).

ACKNOWLEDGMENTS

Research supported in part by NSF grant DMS-1207771 and NIH grants R02GM08153 and R01HL105397. The authors would like to thank the reviewers and editors for helpful comments and suggestions.

APPENDIX

Proof of Theorem 1: Let $\beta^m$ denote the solution at iteration m. Convergence follows directly from the fact that $f(\beta^m)$ is decreasing in m. To prove that the limit $\beta^*$ satisfies the subdifferential condition, that is, there exists λ > 0 such that $-\operatorname{sign}(\beta_j)\, \partial f(\beta)/\partial \beta_j = \lambda$ for $\beta_j \neq 0$ and $|\partial f(\beta)/\partial \beta_j| < \lambda$ for $\beta_j = 0$, suppose the condition does not hold at the convergence point $\beta^*$. Let $i = \arg\max_{k=1,\ldots,p} |\partial f(\beta^*)/\partial \beta_k|$ and $j = \arg\min_{\{k : \beta_k^* \neq 0\}} |\partial f(\beta^*)/\partial \beta_k|$. Then $|\partial f(\beta^*)/\partial \beta_i| > |\partial f(\beta^*)/\partial \beta_j|$, which implies that we can further reduce the value of the objective function along the corresponding pairwise direction, contradicting convergence. From Ref [24], this subdifferential condition is sufficient and necessary for $\beta^*$ to be the minimizer of the Lasso. This completes the proof.

REFERENCES

  • 1. Brown L, Tsamardinos I, Aliferis C. A comparison of novel and state-of-the-art polynomial Bayesian network learning algorithms. Proceedings of the 20th National Conference on Artificial Intelligence (AAAI); 2005. pp. 739–745.
  • 2. Heckerman D, Geiger D, Chickering D. Learning Bayesian networks: the combination of knowledge and statistical data. Mach Learn. 1995;20(3):197–243.
  • 3. Chickering D. Optimal structure identification with greedy search. J Mach Learn Res. 2003;3:507–554.
  • 4. Spiegelhalter D, Dawid A, Lauritzen S, Cowell R. Bayesian analysis in expert systems. Stat Sci. 1993;8(3):219–247.
  • 5. Cheng J, Greiner R, Kelly J, Bell D, Liu W. Learning Bayesian networks from data: an information-theory based approach. Artif Intell. 2002;137(1–2):43–90.
  • 6. Tsamardinos I, Brown LE, Aliferis CF. The max-min hill-climbing Bayesian network structure learning algorithm. Mach Learn. 2006;65:31–78.
  • 7. Spirtes P, Glymour C, Scheines R. Causation, Prediction, and Search. Cambridge: The MIT Press; 2000.
  • 8. Kalisch M, Bühlmann P. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. J Mach Learn Res. 2007;8:613–636.
  • 9. Gillispie S, Perlman M. Enumerating Markov equivalence classes of acyclic digraph models. In: Breese J, Koller D, editors. Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence. San Francisco: Morgan Kaufmann; 2001. pp. 171–177.
  • 10. Shojaie A, Michailidis G. Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika. 2010;97:519–538. doi: 10.1093/biomet/asq038.
  • 11. Zhou S, Lafferty J, Wasserman L. Time varying undirected graphs. arXiv preprint arXiv:0802.2758. 2008.
  • 12. Kolar M, Song L, Ahmed A, Xing E. Estimating time-varying networks. Ann Appl Stat. 2010;4(1):94–123.
  • 13. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2011;98(1):1. doi: 10.1093/biomet/asq060.
  • 14. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat. 2006;34(3):1436–1462.
  • 15. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19.
  • 16. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432. doi: 10.1093/biostatistics/kxm045.
  • 17. Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. J Am Stat Assoc. 2012;107:223–232. doi: 10.1080/01621459.2011.645783.
  • 18. Pearl J. Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press; 2000.
  • 19. Tibshirani R. Regression shrinkage and selection via the lasso. J Roy Stat Soc Ser B (Methodological). 1996;58(1):267–288.
  • 20. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. Ann Appl Stat. 2007;1(2):302–332.
  • 21. Wu T, Lange K. Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat. 2008;2(1):224–244. doi: 10.1214/10-AOAS388.
  • 22. Baldi P, Brunak S, Chauvin Y, Andersen C, Nielsen H. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics. 2000;16(5):412–424. doi: 10.1093/bioinformatics/16.5.412.
  • 23. Sachs K, Perez O, Pe'er D, Lauffenburger D, Nolan G. Causal protein-signaling networks derived from multiparameter single-cell data. Science. 2005;308(5721):523. doi: 10.1126/science.1105809.
  • 24. Osborne M, Presnell B, Turlach B. On the lasso and its dual. J Comput Graph Stat. 2000;9(2):319–337.
