Abstract
We study the problem of estimating a temporally varying coefficient and varying structure (VCVS) graphical model underlying data collected over a period of time, such as social states of interacting individuals or microarray expression profiles of gene networks, as opposed to i.i.d. data from an invariant model widely considered in the current literature on structural estimation. In particular, we consider the scenario in which the model evolves in a piecewise constant fashion. We propose a procedure that estimates the structure of a graphical model by minimizing a temporally smoothed ℓ1-penalized regression, which allows jointly estimating the partition boundaries of the VCVS model and the coefficients of the sparse precision matrix on each block of the partition. A highly scalable proximal gradient method is proposed to solve the resulting convex optimization problem, and the conditions for sparsistent estimation and the convergence rates of both the partition boundaries and the network structure are established for the first time for such estimators.
Keywords and phrases: Gaussian graphical models, network models, dynamic network models, structural changes
1. Introduction
Networks are a fundamental form of representation of relational information underlying large, noisy data from various domains. For example, in a biological study, nodes of a network can represent genes in one organism and edges can represent associations or regulatory dependencies among genes. In a social analysis, nodes of a network can represent actors and edges can represent interactions or friendships between actors. Exploring the statistical properties and hidden characteristics of network entities, and the stochastic processes behind temporal evolution of network topologies is essential for computational knowledge discovery and prediction based on network data.
In many dynamical environments, such as a developing biological system, it is often technically impossible to experimentally determine the network topologies specific to every time point in a certain time period. Resorting to computational inference methods, such as extant structural learning algorithms, is also difficult because, for every model unique to a single time point, there may exist as few as a single snapshot of the nodal states distributed according to the model in question. In this paper, we consider an estimation problem in a particular dynamic context, where the model evolves piecewise constantly, i.e., it stays structurally invariant during unknown segments of time and then jumps to a different structure.
Approximately piecewise constantly evolving networks underlie many natural dynamic systems of intellectual and practical interest. For example, in a developing biological system such as the fruit fly, the entire life cycle consists of four discrete developmental stages, namely, embryo, larva, pupa, and adult. Across the stages, one expects to see dramatic rewiring of the regulatory network to realize very different regulatory functions driven by different developmental needs, whereas within each stage the changes of the network topology are expected to be relatively mild, as revealed by the smoother trajectories of the gene expression activities, because a largely stable regulatory machinery is employed to control stage-specific developmental processes. Such phenomena are not uncommon in social systems. For example, in the latent social network among senators, even though it is not visible to outsiders, we would expect the network structure to be more stable between elections but more volatile once campaigns start. Although it is legitimate to use a completely unconstrained time-evolving network model to describe or analyze such systems, an approximately piecewise constantly evolving network model better captures the different amounts of network dynamics during different phases of an entire life cycle, and it can detect boundaries between different phases when desirable.
A popular technique for deriving the network structure from an i.i.d. sample is to estimate a sparse precision matrix. The importance of estimating precision matrices with zeros was recognized by Dempster [11], who coined the term covariance selection. The elements of the precision matrix represent the associations, or conditional covariances, between corresponding variables. Once a sparse precision matrix is estimated, a network can be drawn by connecting variables whose corresponding elements of the precision matrix are non-zero. Recent studies have shown that covariance selection methods based on penalized likelihood maximization can lead to a consistent estimate of the network structure underlying a Gaussian Markov random field [12, 32]. Moreover, a particular procedure for covariance selection known as neighborhood selection, which is built on ℓ1-norm regularized regression, can produce a consistent estimate of the network structure when the sample is assumed to follow a general Markov random field distribution whose structure corresponds to the network in question [33, 28, 31]. Specifically, a Markov random field (MRF) is a probabilistic graphical model defined on a graph G = (V, E), where V = {1, …, p} is a vertex set corresponding to the set of random variables to be modeled (in this paper we call them nodes and variables interchangeably), and E ⊆ V × V is the edge set capturing conditional independencies among these nodes. Let X = (X1, …, Xp)′ denote a p-dimensional random vector whose elements are indexed by the nodes of the graph G. Under the MRF, a pair (a, b) is not an element of the edge set E if and only if the variable Xa is conditionally independent of Xb given all the remaining variables XV\{a,b}, that is, Xa ⊥ Xb | XV\{a,b}. A distribution over X can be defined by taking the following log-linear form that makes explicit use of the (presence and absence of edges in the) edge set: p(X) ∝ exp{Σ(a,b)∈E θab Xa Xb}. When the elements of the random vector X are discrete, e.g., X ∈ {0, 1}p, the model is referred to as a discrete MRF, sometimes known as an Ising model in the statistical physics community; whereas when X is a continuous vector, the model is referred to as a Gaussian graphical model (GGM), because one can easily show that the p(X) above is actually a multivariate Gaussian. MRFs have been widely used for modeling data with graphical relational structures over a fixed set of entities [39, 14]. The vertices can describe entities such as genes in a biological regulatory network, stocks in the market, or people in society, while the edges can describe relationships between vertices, for example, interaction, correlation, or influence.
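As a toy illustration of this correspondence between zeros of the precision matrix and missing edges, consider the following minimal NumPy sketch (the three-node chain and its numerical values are chosen purely for illustration):

```python
import numpy as np

# A 3-node chain graph 1 -- 2 -- 3: nodes 1 and 3 are not adjacent,
# so the (1, 3) entry of the precision matrix is zero.
Omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(Omega)

# X1 and X3 are marginally dependent (non-zero covariance), yet their
# partial correlation given X2 vanishes, reflecting the missing edge.
partial_corr_13 = -Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2])
print(Sigma[0, 2])        # non-zero marginal covariance
print(partial_corr_13)    # 0.0: X1 and X3 conditionally independent given X2
```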
The statistical problem we consider in this paper is to estimate the structure of the Gaussian graphical model from observed samples of nodal states in a dynamic world. Traditional methods handle this problem under the assumption that the samples are i.i.d. Let 𝒟 = {x1, …, xn} be an independent and identically distributed sample from a p-dimensional multivariate normal distribution 𝒩(0, Σ), where Σ is the covariance matrix. Let Ω := Σ−1 denote the precision matrix, with elements (ωab), 1 ≤ a, b ≤ p. One can then obtain an estimator of Ω from 𝒟 by optimizing a proper statistical loss function, such as the likelihood or a penalized likelihood. As mentioned earlier, the precision matrix Ω encodes the conditional independence structure of the distribution, and the pattern of zero elements in the precision matrix defines the structure of the associated graph G. There has been a dramatic growth of interest in the recent literature in the problem of covariance selection, which deals with the graph estimation problem above. Existing work ranges from algorithmic development focusing on efficient estimation procedures to theoretical analysis focusing on statistical guarantees of different estimators. We do not intend to give an extensive overview of the literature here, but interested readers can follow the pointers below. In the classical literature (e.g., [22]), procedures are developed for small-dimensional graphs and commonly involve hypothesis testing with greedy selection of edges. More recent literature estimates the sparse precision matrix by optimizing a penalized likelihood [42, 12, 4, 35, 13, 32, 16, 44] or through neighborhood selection [28, 31, 15, 40], where the structure of the graph is estimated by estimating the neighborhood of each node. Both of these approaches are suitable for high-dimensional problems, even when p ≫ n, and can be efficiently implemented using scalable convex program solvers.
Most of the above mentioned work assumes that a single invariant network model is sufficient to describe the dependencies in the observed data. However, when the observed data are not iid, such an assumption is not justifiable. For example, when data consist of microarray measurements of the gene expression levels collected throughout the cell cycle or development of an organism, different genes can be active during different stages. This suggests that different distributions and hence different networks should be used to describe the dependencies between measured variables at different time intervals. In this paper, we are going to tackle the problem of estimating the structure of the GGM when the structure is allowed to change over time. By assuming that the parameters of the precision matrix change with time, we obtain extra flexibility to model a larger class of distributions while still retaining the interpretability of the static GGM. In particular, as the coefficients of the precision matrix change over time, we also allow the structure of the underlying graph to change as well. This semi-parametric generalization of the parametric model is referred to as a varying coefficient varying structure (VCVS) model.
Now, let {xi}i∈[n], xi ∈ ℝp, be a sequence of n conditionally independent observations (we use [n] to denote the set {1, …, n}) from p-dimensional multivariate normal distributions, not necessarily the same for every observation. Let {ℬj}j∈[B] be a disjoint partition of the set [n] in which each block consists of consecutive elements, that is, ℬj ∩ ℬj′ = ∅ for j ≠ j′, ⋃j ℬj = [n], and ℬj = [Tj−1 : Tj] := {Tj−1, Tj−1 + 1, …, Tj − 1}. Let 𝒯 := {T0 = 1 < T1 < ··· < TB = n + 1} denote the set of partition boundaries.
We consider the following model

xi ~ 𝒩(0, Σj)  for all i ∈ ℬj, j ∈ [B],   (1.1)

so that the observations indexed by elements of ℬj are p-dimensional realizations of a multivariate normal distribution with zero mean and a covariance matrix Σj that is unique to the j-th segment of the time series. Let Ωj := (Σj)−1 denote the precision matrix with elements (ωj,ab). With the number of blocks B and the partition boundaries 𝒯 unknown, we study the problem of estimating both the partition {ℬj}j∈[B] and the non-zero elements of the precision matrices {Ωj}j∈[B] from the sample {xi}i∈[n]. Note that in this work we study a particular case of the VCVS model, in which the coefficients are piecewise constant functions of time. Although this model does not yet entirely agree with how real-world time series data behave, as no existing model does, this instantiation of the VCVS model comes, in some sense, one step closer to the real-world scenario than other popular approaches to time series modeling, such as hidden Markov models or state space models, where stationary emission models such as linear Gaussian ones are usually employed to relate observations at different time points to simple latent states. Here, instead of positing that an observation at time t is derived from a latent state that transitions stationarily from the previous time point, we assume that such an observation is generated from a latent network model that is related to the network models active at the previous and subsequent time points nonparametrically. As suggested above, many real-world dynamic systems, such as the stage-specific development of multi-cellular organisms like the fruit fly and the evolving network of latent relatedness between politicians, are likely to behave approximately piecewise constantly; therefore, time series data from such systems, such as continuous-valued gene expression microarray time series and discrete-valued voting records, are suitable examples to which our proposed model can be applied [1]. A scenario in which the coefficients are smoothly varying functions of time has been considered in [44] for the GGM and in [21] and [19] for the Ising model; that setting complements the model studied in this paper, whose asymptotic properties are somewhat easier to analyze.
If the partition {ℬj}j∈[B] were known, the problem would be trivially reduced to the setting analyzed in previous work. Dealing with the unknown partition, together with the structure estimation of the model, calls for new methods. We propose and analyze a method based on time-coupled neighborhood selection, where the model estimates are forced to stay similar across time using a fusion-type total variation penalty, and the sparsity of each neighborhood is obtained through the ℓ1 penalty. Details of the approach are given in §2.
The model in Eq. (1.1) is related to varying-coefficient models (e.g., [18]) with the coefficients being piecewise constant functions. Varying-coefficient regression models with piecewise constant coefficients are also known as segmented multivariate regression models [24] or linear models with structural changes [2]. The structural changes are commonly determined through hypothesis testing, and a separate linear model is fit to each of the estimated segments. In our work, we use a penalized model selection approach to jointly estimate the partition boundaries and the model parameters.
Little work has been done so far on modeling dynamic networks and estimating changing precision matrices. [44] develops a nonparametric method for estimating a time-varying GGM, where xt ~ 𝒩(0, Σ(t)) and Σ(t) changes smoothly over time. The procedure is based on the penalized likelihood approach of [42], with the empirical covariance matrix obtained using a kernel smoother. Our work is very different from that of [44], since under our assumptions the network changes abruptly rather than smoothly. Furthermore, as we outline in §2, our estimation procedure is not based on the penalized likelihood approach. Estimation of time-varying Ising models has been discussed in [1] and [21]. [41] and [20] studied nonparametric ways to estimate the conditional covariance matrix. The work of [1] is the most similar to our setting, as they also use a fused-type penalty combined with an ℓ1 penalty to estimate the structure of a varying Ising model. Here, in addition to focusing on GGMs, there is a subtle but important difference from [1]: we use a modification of the fusion penalty (formally described in §2) which allows us to characterize the model selection consistency of our estimates and the convergence properties of the estimated partition boundaries, neither of which is available in the earlier work.
The remainder of the paper is organized as follows. In §2, we describe our estimation procedure and provide an efficient first-order optimization procedure capable of estimating large graphs. The optimization procedure is based on the smoothing technique of [29] and converges in 𝒪(1/ε) iterations, where ε is the desired accuracy. Our main theoretical results are presented in §3. In particular, we show that the partition boundaries are estimated consistently. Furthermore, the graph structure is consistently estimated on every block of the partition that contains enough samples. In §4, we discuss alternative estimation procedures based on penalized maximum likelihood estimation instead of neighborhood selection. Numerical studies showing the finite-sample performance of our procedure are given in §5. The proofs of the main results are relegated to §7, with some technical details presented in the Appendix.
Notation schemes
For clarity, we end the introduction with a summary of the notation used in the paper. We use [n] to denote the set {1, …, n} and [l : r] to denote the set {l, l + 1, …, r − 1}. We use ℬj to denote the j-th block of the partition; with some abuse of notation, we also use ℬj to denote the index set [Tj−1 : Tj]. The number of samples in the block ℬj is denoted |ℬj|. For a set S ⊂ V, we use the notation XS to denote the set {Xa : a ∈ S} of random variables. We use X to denote the n × p matrix whose rows consist of observations. The vector Xa = (x1,a, …, xn,a)′ denotes a column of the matrix X and, similarly, XS = (Xb : b ∈ S) denotes the n × |S| sub-matrix of X whose columns are indexed by the set S, while Xℬj denotes the |ℬj| × p sub-matrix whose rows are indexed by the set ℬj. For simplicity of notation, we use \a to denote the index set [p]\{a}, so that X\a = (Xb : b ∈ [p]\{a}). For a vector a ∈ ℝp, we let S(a) denote the set of non-zero components of a. Throughout the paper, we use c1, c2, … to denote positive constants whose values may change from line to line. For a vector a ∈ ℝn, define ||a||1 = Σi∈[n] |ai|, ||a||2 = (Σi∈[n] ai²)^(1/2) and ||a||∞ = maxi |ai|. For a symmetric matrix A, Λmin(A) denotes the smallest and Λmax(A) the largest eigenvalue. For a matrix A (not necessarily symmetric), we use |||A|||∞ = maxi Σj |Aij|. For two vectors a, b ∈ ℝn, the dot product is denoted 〈a, b〉 = Σi∈[n] aibi. For two matrices A, B ∈ ℝn×m, the dot product is denoted 〈〈A, B〉〉 = tr(A′B). Given two sequences {an} and {bn}, the notation an = 𝒪(bn) means that there exists a constant c1 such that an ≤ c1bn; the notation an = Ω(bn) means that there exists a constant c2 such that an ≥ c2bn; and the notation an ≍ bn means that an = 𝒪(bn) and bn = 𝒪(an). Similarly, we use the notation an = op(bn) to denote that an/bn converges to 0 in probability. Table 1 summarizes the symbols used throughout the paper.
2. Graph estimation via Temporal-Difference Lasso
In this section, we introduce our time-varying covariance selection procedure, which is based on the time-coupled neighborhood selection using the fused-type penalty. We call the proposed procedure Temporal-Difference Lasso (TD-Lasso). We start by reviewing the basic neighborhood selection procedure, which has previously been used to estimate graphs in, e.g., [31, 28, 33, 15].
We start by relating the elements of the precision matrix Ω to a regression problem. Let Sa denote the neighborhood of the node a, let S̄a denote the closure of Sa, S̄a := Sa ∪ {a}, and let Na denote the set of nodes not in the neighborhood of the node a, Na = [p]\S̄a. It holds that Xa ⊥ XNa | XSa. The neighborhood of the node a can be read off from the non-zero pattern of the elements of the precision matrix Ω, Sa = {b ∈ [p]\{a} : ωab ≠ 0}. See [22] for more details. It is a well-known result for Gaussian graphical models that the coefficients θa = (θab)b∈\a of the regression of Xa on X\a are given by θab = −ωab/ωaa. Therefore, the neighborhood of a node a, Sa, is equal to the set of non-zero coefficients of θa. Using the expression for θa, we can write Xa = X′\a θa + εa, where εa ~ 𝒩(0, ω−1aa) is independent of X\a.
The neighborhood selection procedure was motivated by the above relationship between the regression coefficients and the elements of the precision matrix. [28] proposed to solve the following optimization problem

θ̂a = argminθ∈ℝp−1 (2n)−1 ||Xa − X\a θ||2² + λ||θ||1,   (2.1)

and proved that, for an i.i.d. sample and a suitably chosen penalty parameter λ, the non-zero coefficients of θ̂a consistently estimate the neighborhood of the node a.
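To make the neighborhood selection step concrete, the following minimal scikit-learn sketch regresses one node on the remaining ones with an ℓ1 penalty and reads the estimated neighborhood off the non-zero coefficients (the sample size and the penalty value below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, n = 10, 500

# Sparse precision matrix of a chain graph and the implied covariance.
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

a = 0                                     # node whose neighborhood we estimate
y, Z = X[:, a], np.delete(X, a, axis=1)   # regress X_a on X_{\a}

# Lasso minimizes (1/2n)||y - Z theta||_2^2 + lambda ||theta||_1.
theta_hat = Lasso(alpha=0.1).fit(Z, y).coef_
neighbors = np.delete(np.arange(p), a)[np.abs(theta_hat) > 1e-8]
print(neighbors)   # ideally recovers {1}, the only neighbor of node 0
```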
In this paper, we build on the neighborhood selection procedure to estimate the changing graph structure in model (1.1). We use Sja to denote the neighborhood of the node a on the block ℬj and Nja to denote the nodes not in the neighborhood of the node a on the j-th block, Nja = [p]\(Sja ∪ {a}). Consider the following estimation procedure

β̂a = argminβ∈ℝ(p−1)×n ℒ(β) + pen(β),   (2.2)

where the loss is defined for β = (βb,i)b∈[p−1],i∈[n] as

ℒ(β) = ½ Σi∈[n] (xi,a − x′i,\a β·,i)²,   (2.3)

and the penalty is defined as

pen(β) = λ1 Σi∈[2:n+1] ||β·,i − β·,i−1||2 + λ2 Σi∈[n] ||β·,i||1.   (2.4)

The penalty is constructed from two terms. The first term ensures that the solution is piecewise constant for some partition of [n] (possibly a trivial one); it can be seen as a sparsity-inducing term in the temporal domain, since it penalizes the difference between the coefficients β·,i and β·,i+1 at successive time points. The second term results in estimates that have many zero coefficients within each block of the partition. The estimated set of partition boundaries 𝒯̂n := {T̂0 = 1 < T̂1 < ··· < T̂B̂ = n + 1} contains the indices of the points at which a change is estimated, that is, the points i for which β̂a·,i ≠ β̂a·,i−1, with B̂ being an estimate of the number of blocks B. The estimated number of blocks B̂ is controlled through the user-defined penalty parameter λ1, while the sparsity of the neighborhoods is controlled through the penalty parameter λ2.
Based on the estimated set of partition boundaries 𝒯̂n, we can define the neighborhood estimate of the node a for each estimated block. Let θ̂a,j := β̂a·,i, ∀i ∈ [T̂j−1 : T̂j], be the estimated coefficient vector for the block ℬ̂j = [T̂j−1 : T̂j]. Using the estimated vector θ̂a,j, we define the neighborhood estimate of the node a for the block ℬ̂j as

Ŝja := S(θ̂a,j) = {b ∈ \a : θ̂a,jb ≠ 0}.

Solving (2.2) for each node a ∈ V gives a neighborhood estimate for each node. Combining the neighborhood estimates, we obtain an estimate of the graph structure at each time point i ∈ [n].
The choice of the penalty term is motivated by work on penalization using total variation [34, 27], which results in a piecewise constant approximation of an unknown regression function. The fusion penalty has also been applied in the context of multivariate linear regression [36], where coefficients that are spatially close are encouraged to take similar values; as a result, nearby coefficients are fused to the same estimated value. Instead of penalizing the ℓ1 norm of the difference between coefficients, we use the ℓ2 norm in order to enforce that all coordinates of the coefficient vector change at the same time points.
The objective (2.2) estimates the neighborhood of one node of the graph at all time points. After solving (2.2) for all nodes a ∈ V, we need to combine the estimates to obtain the graph structure. We use the following procedure to combine {β̂a}a∈V:

Êi := {(a, b) ∈ V × V : max(|β̂ab,i|, |β̂ba,i|) > 0},  i ∈ [n].

That is, an edge between nodes a and b is included in the graph if at least one of the nodes a or b is included in the neighborhood of the other node. We use the max operator to combine different neighborhoods because we believe that, for the purpose of network exploration, it is more important to occasionally include spurious edges than to omit relevant ones. For further discussion of the differences between the min and the max combination, we refer the interested reader to [4].
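A small sketch of this combination step at a fixed time point (the helper combine_neighborhoods and its adjacency-matrix output format are illustrative):

```python
import numpy as np

def combine_neighborhoods(beta_hat, rule="max"):
    """Combine per-node coefficient vectors into an adjacency matrix.

    beta_hat[a] is the (p-1)-vector of coefficients of node a at a fixed
    time point (node a itself removed).  With the "max" rule an edge (a, b)
    is reported whenever at least one of the two nodes selects the other;
    the "min" rule would require both to agree.
    """
    p = len(beta_hat)
    coef = np.zeros((p, p))
    for a in range(p):
        others = [b for b in range(p) if b != a]
        coef[a, others] = beta_hat[a]
    indicator = (np.abs(coef) > 1e-8).astype(float)
    combine = np.maximum if rule == "max" else np.minimum
    adj = combine(indicator, indicator.T)
    np.fill_diagonal(adj, 0.0)
    return adj
```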
2.1. Numerical procedure
Finding a minimizer β̂a of (2.2) can be a computationally challenging task for an off-the-shelf convex optimization procedure. We propose to use an accelerated gradient method with a smoothing technique [29], which converges in 𝒪(1/ε) iterations, where ε is the desired accuracy.
We start by defining a smooth approximation of the fused penalty term. Let H ∈ ℝn×(n−1) be the matrix with elements Hi,i = −1, Hi+1,i = 1 for i ∈ [n − 1] and Hij = 0 otherwise, so that the i-th column of βH equals β·,i+1 − β·,i. With the matrix H we can rewrite the fused penalty term as Ψ(β) := λ1 Σi∈[n−1] ||(βH)·,i||2, and using the fact that the ℓ2 norm is self-dual (e.g., see [7]) we have the following representation

Ψ(β) = maxU∈𝒰 λ1 〈〈βH, U〉〉,   (2.5)

where 𝒰 := {U ∈ ℝ(p−1)×(n−1) : ||U·,i||2 ≤ 1, ∀i ∈ [n − 1]}. The following function is defined as a smooth approximation to the fused penalty,

Ψμ(β) = maxU∈𝒰 { λ1 〈〈βH, U〉〉 − (μ/2) ||U||F² },   (2.6)

where μ > 0 is the smoothness parameter and ||·||F denotes the Frobenius norm. It is easy to see that

Ψμ(β) ≤ Ψ(β) ≤ Ψμ(β) + μ(n − 1)/2.

Setting the smoothness parameter to μ = ε/(n − 1) ensures the correct rate of convergence. Let Uμ(β) be the optimal solution of the maximization problem in (2.6), which can be obtained analytically as

Uμ(β) = Π𝒰( λ1 μ−1 βH ),   (2.7)

where Π𝒰(·) is the (column-wise) projection operator onto the set 𝒰. From Theorem 1 in [29], we have that Ψμ(β) is continuously differentiable and convex, with the gradient

∇Ψμ(β) = λ1 Uμ(β) H′,   (2.8)

which is Lipschitz continuous.
With the above defined smooth approximation, we focus on minimizing the following objective

F(β) = ℒ(β) + Ψμ(β) + λ2 Σi∈[n] ||β·,i||1.

Following [5] (see also [30]), we define the following quadratic approximation of F(β) at a point β0,

QL(β, β0) = ℒ(β0) + Ψμ(β0) + 〈〈β − β0, ∇ℒ(β0) + ∇Ψμ(β0)〉〉 + (L/2) ||β − β0||F² + λ2 Σi∈[n] ||β·,i||1,   (2.9)

where L > 0 is a parameter chosen as an upper bound on the Lipschitz constant of ∇ℒ + ∇Ψμ. Let pL(β0) be a minimizer of QL(β, β0). Ignoring constant terms, pL(β0) can be obtained as

pL(β0) = argminβ (L/2) ||β − (β0 − L−1(∇ℒ(β0) + ∇Ψμ(β0)))||F² + λ2 Σi∈[n] ||β·,i||1.

It is clear that pL(β0) is the unique minimizer, which can be obtained in closed form via soft-thresholding,

pL(β0) = T( β0 − L−1(∇ℒ(β0) + ∇Ψμ(β0)), λ2/L ),   (2.10)

where T(x, λ) = sign(x) max(0, |x| − λ) is the soft-thresholding operator applied element-wise.
In practice, an upper bound on the Lipschitz constant of ∇ℒ + ∇Ψμ can be expensive to compute, so the parameter L is determined iteratively. Combining all of the above, we arrive at Algorithm 1. In the algorithm, β0 is set to zero or, if the optimization problem is solved for a sequence of tuning parameters, it can be set to the solution β̂ obtained for the previous set of tuning parameters. The parameter γ is a constant used to increase the estimate of the Lipschitz constant L; we set γ = 1.5 in our experiments and initialize L = 1. Compared to the gradient descent method (which can be obtained by iterating βk+1 = pL(βk)), the accelerated gradient method updates two sequences {βk} and {zk} recursively. Instead of performing the gradient step from the latest approximate solution βk, the gradient step is performed from a search point zk obtained as a linear combination of the last two approximate solutions βk−1 and βk. Since the condition F(pL(zk)) ≤ QL(pL(zk), zk) is satisfied in every iteration, the algorithm converges in 𝒪(1/ε) iterations, following [5]. As the convergence criterion, we stop iterating once the relative change in the objective value is below a threshold (e.g., we use 10−4).
Algorithm 1.
Accelerated Gradient Method for Equation (2.2)
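The iteration can be summarized by the following minimal Python sketch of the accelerated scheme with backtracking described above (the function name td_lasso_node, the default values of μ, γ and the stopping tolerance, and the particular momentum sequence are illustrative choices rather than part of the formal algorithm):

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft-thresholding operator T(x, lam)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def td_lasso_node(y, Z, lam1, lam2, mu=1e-3, max_iter=2000, tol=1e-4, gamma=1.5):
    """Accelerated proximal gradient sketch for one node of objective (2.2).

    y : (n,) observations of node a;  Z : (n, p-1) observations of the other
    nodes.  Returns beta of shape (p-1, n), one coefficient column per time
    point.  The smoothness parameter mu should scale with the target
    accuracy; the default here is an arbitrary small value.
    """
    n, q = Z.shape
    beta, z, t, L = np.zeros((q, n)), np.zeros((q, n)), 1.0, 1.0
    obj_old = np.inf

    def smooth_value_grad(b):
        r = y - np.einsum('ij,ji->i', Z, b)                  # per-time residuals
        D = b[:, 1:] - b[:, :-1]                             # successive differences (bH)
        V = (lam1 / mu) * D
        U = V / np.maximum(1.0, np.linalg.norm(V, axis=0))   # column-wise l2 projection
        value = 0.5 * np.sum(r ** 2) + lam1 * np.sum(U * D) - 0.5 * mu * np.sum(U ** 2)
        grad = -Z.T * r + lam1 * np.hstack(
            [-U[:, :1], U[:, :-1] - U[:, 1:], U[:, -1:]])    # grad of loss plus U H'
        return value, grad

    for _ in range(max_iter):
        f_z, g_z = smooth_value_grad(z)
        while True:                                          # backtracking on L
            beta_new = soft_threshold(z - g_z / L, lam2 / L)
            f_new, _ = smooth_value_grad(beta_new)
            d = beta_new - z
            if f_new <= f_z + np.sum(g_z * d) + 0.5 * L * np.sum(d ** 2):
                break
            L *= gamma
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        beta, t = beta_new, t_new
        obj = f_new + lam2 * np.abs(beta).sum()
        if abs(obj_old - obj) <= tol * max(1.0, abs(obj_old)):
            break
        obj_old = obj
    return beta
```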
2.2. Tuning parameter selection
The penalty parameters λ1 and λ2 control the complexity of the estimated model. In this work, we propose to use the BIC score to select the tuning parameters. Define the BIC score for each node a ∈ V as

BICa(λ1, λ2) := n log( 2ℒ(β̂a)/n ) + log(n) · df(β̂a),   (2.11)

where ℒ(·) is defined in (2.3), β̂a = β̂a(λ1, λ2) is a solution of (2.2), and df(β̂a) denotes the number of free parameters of the fitted piecewise constant model, that is, the number of estimated blocks plus the number of non-zero coefficients on each estimated block. The penalty parameters can now be chosen as

(λ̂1, λ̂2) = argminλ1,λ2 Σa∈V BICa(λ1, λ2).   (2.12)
We will use the above formula to select the tuning parameters in our simulations, where we are going to search for the best choice of parameters over a grid.
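A minimal sketch of this grid search follows (the degrees-of-freedom count inside the BIC score and the grid handling are illustrative; the solver argument can be, for example, the td_lasso_node sketch from §2.1):

```python
import numpy as np

def bic_for_node(y, Z, beta, eps=1e-6):
    """Gaussian-style BIC: n*log(RSS/n) + log(n)*df, with df an assumed count."""
    n = len(y)
    resid = y - np.einsum('ij,ji->i', Z, beta)
    rss = max(float(np.sum(resid ** 2)), 1e-12)
    jumps = np.linalg.norm(np.diff(beta, axis=1), axis=0) > eps
    starts = np.concatenate(([0], np.flatnonzero(jumps) + 1))   # block start points
    # df: number of estimated blocks plus non-zero coefficients in each block.
    df = len(starts) + sum(int(np.sum(np.abs(beta[:, s]) > eps)) for s in starts)
    return n * np.log(rss / n) + np.log(n) * df

def select_tuning(y, Z, lam1_grid, lam2_grid, solver):
    """Grid search over (lambda1, lambda2) minimizing the per-node BIC score."""
    best = None
    for lam1 in lam1_grid:
        for lam2 in lam2_grid:
            beta = solver(y, Z, lam1, lam2)
            score = bic_for_node(y, Z, beta)
            if best is None or score < best[0]:
                best = (score, lam1, lam2, beta)
    return best
```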
3. Theoretical results
This section addresses the statistical properties of the estimation procedure presented in Section 2. The properties are studied in an asymptotic framework by letting the sample size n grow while keeping the other parameters fixed. For the asymptotic framework to make sense, we assume that there exists a fixed unknown sequence of numbers {τj} that defines the partition boundaries as Tj = ⌊nτj⌋, where ⌊a⌋ denotes the largest integer not greater than a. This assures that, as the number of samples grows, the same fraction of samples falls into every block of the partition. We call {τj} the boundary fractions.
We give sufficient conditions under which the sequence {τj} is consistently estimated. In particular, if the number of partition blocks is estimated correctly, then we show that maxj∈[B] |T̂j − Tj| ≤ nδn with probability tending to 1, where {δn}n is a non-increasing sequence of positive numbers that tends to zero. If the number of partition segments is overestimated, then we show that, for the distance defined for two sets A and B as

h(A, B) := maxb∈B mina∈A |a − b|,   (3.1)

we have h(𝒯̂n, 𝒯) ≤ nδn with probability tending to 1. With the partition boundaries consistently estimated, we further show that, under suitable conditions, for each node a ∈ V the correct neighborhood is selected on all estimated blocks of the partition that are sufficiently large.
The proof technique employed in this section is quite involved, so we briefly describe its steps. Our analysis is based on a careful inspection of the optimality conditions that a solution β̂a of the optimization problem (2.2) needs to satisfy; these conditions are given in §3.2. Using the optimality conditions, we establish the rate of convergence of the partition boundaries. This is done by contradiction: suppose that there is a solution whose partition boundary estimate 𝒯̂n satisfies h(𝒯̂n, 𝒯) ≥ nδn. We then show that, with high probability, all such solutions fail to satisfy the KKT conditions and therefore cannot be optimal. This shows that, with high probability, all solutions of the optimization problem (2.2) produce partition boundaries that are "close" to the true partition boundaries. Once it is established that 𝒯̂n and 𝒯 satisfy h(𝒯̂n, 𝒯) ≤ nδn, we can further show that the neighborhoods are consistently estimated, under the assumption that the estimated blocks of the partition contain enough samples. This part of the analysis follows the strategy commonly used to prove that the Lasso is sparsistent (e.g., see [9, 38, 28]); however, important modifications are required because the positions of the partition boundaries are being estimated.
Our analysis is going to focus on one node a ∈ V and its neighborhood. However, using the union bound over all nodes in V, we will be able to carry over conclusions to the whole graph. To simplify our notation, when it is clear from the context, we will omit the superscript a and write β̂, θ̂ and S, etc., to denote β̂a, θ̂a and Sa, etc.
3.1. Assumptions
Before presenting our theoretical results, we give some definitions and assumptions that will be used throughout this section. Let Δmin := minj∈[B] |Tj − Tj−1| denote the minimum length between change points, ξmin := mina∈V minj∈[B−1] ||θa,j+1 − θa,j||2 the minimum jump size, and θmin := mina∈V minj∈[B] minb∈Sja |θa,jb| the minimum coefficient size. Throughout the section, we assume that the following holds.
A1 There exist two constants φmin > 0 and φmax < ∞ such that φmin ≤ Λmin(Σj) and Λmax(Σj) ≤ φmax for all j ∈ [B].
A2 The variables are scaled so that σj,aa = 1 for all j ∈ [B] and all a ∈ V.
The assumption A1 is commonly used to ensure that the model is identifiable. If the population covariance matrix is ill-conditioned, the question of correct model identification is not well defined, as the neighborhood of a node may not be uniquely defined. The assumption A2 is made for simplicity of presentation; unit variances can be obtained through scaling.
A3 There exists a constant M > 0 such that maxa∈V maxj,k∈[B] ||θa,j − θa,k||2 ≤ M.
The assumption A3 states that the difference between the coefficients on two different blocks, ||θa,k − θa,j||2, is bounded for all j, k ∈ [B]. This assumption is trivially satisfied if the coefficients θa,j are bounded in the ℓ2 norm.
A4 There exists a constant α ∈ (0, 1] such that the following holds:

maxj∈[B] maxb∈Nja || Σjb,Sja (ΣjSja,Sja)−1 ||1 ≤ 1 − α.

The assumption A4 states that the variables in the neighborhood of the node a, XSja, are not too correlated with the variables in the set Nja. This assumption is necessary and sufficient for correct identification of the relevant variables in Lasso regression problems (e.g., see [43, 37]). Note that this condition remains sufficient in our setting, where the correct partition boundaries are not known.
A5 The minimum coefficient size θmin satisfies a suitable lower bound, so that non-zero partial correlation coefficients remain detectable.
The lower bound on the minimum coefficient size θmin is necessary: if a partial correlation coefficient is too close to zero, the corresponding edge in the graph cannot be detected.
A6 The sequence of partition boundaries {Tj} satisfies Tj = ⌊nτj⌋, where {τj} is a fixed, unknown sequence of boundary fractions belonging to [0, 1].
This assumption is needed for the asymptotic setting: as n → ∞, there will be enough sample points in each block to estimate the neighborhoods of the nodes correctly.
3.2. Convergence of the partition boundaries
In this subsection we establish the rate of convergence of the partition boundaries for the estimator (2.2). We start with a lemma that characterizes the solutions of the optimization problem (2.2). Note that the optimization problem (2.2) is convex; however, it may have multiple solutions, since it is not strictly convex.
Lemma 1
A matrix β̂ ∈ ℝ(p−1)×n is optimal for the optimization problem (2.2) if and only if there exists a collection of subgradient vectors {ẑi}i∈[2:n+1] and {ŷi}i∈[n], with ẑi ∈ ∂||β̂·,i − β̂·,i−1||2 and ŷi ∈ ∂||β̂·,i||1, such that

Σi∈[k:n+1] xi,\a (xi,a − x′i,\a β̂·,i) = λ1 ẑk + λ2 Σi∈[k:n+1] ŷi   (3.2)

for all k ∈ [n], with the convention ẑ1 = ẑn+1 = 0.
The following theorem provides the convergence rate of the estimated partition boundaries in 𝒯̂n, under the assumption that the correct number of blocks is known.
Theorem 2
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that A1–A3 and A5–A6 hold. Suppose that the penalty parameters λ1 and λ2 satisfy the scaling condition

(3.3)

Let {β̂·,i}i∈[n] be any solution of (2.2) and let 𝒯̂n be the associated estimate of the block partition. Let {δn}n≥1 be a non-increasing positive sequence that converges to zero as n → ∞ and satisfies Δmin ≥ nδn for all n ≥ 1. Furthermore, suppose that (nδnξmin)−1λ1 → 0, together with the remaining rate conditions on λ2, δn and ξmin used in the proof in §7. Then, if |𝒯̂n| = B + 1, the following holds:

ℙ[ maxj∈[B] |T̂j − Tj| ≤ nδn ] → 1  as n → ∞.
The proof builds on techniques developed in [17] and is presented in §7.
Suppose that δn = (log n)γ/n for some γ > 1 and that the penalty parameters are chosen so that the conditions of Theorem 2 are satisfied; then the sequence of boundary fractions {τj} is consistently estimated. Since the boundary fractions are consistently estimated, we will see below that the estimated neighborhood S(θ̂j) on the block ℬ̂j consistently recovers the true neighborhood Sj.
Unfortunately, the correct number of blocks B may not be known. However, a conservative upper bound Bmax on the number of blocks B may be available. Suppose that the sequence of observations is over-segmented, with the number of estimated blocks bounded by Bmax. Then the following proposition gives an upper bound on h(𝒯̂n, 𝒯), where h(·, ·) is defined in (3.1).
Proposition 3
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that the conditions of Theorem 2 are satisfied. Let β̂ be a solution of (2.2) and 𝒯̂n the corresponding set of partition boundaries, with B̂ blocks. If the number of blocks satisfies B ≤ B̂ ≤ Bmax, then

ℙ[ h(𝒯̂n, 𝒯) ≤ nδn ] → 1  as n → ∞.
The proof of the proposition follows the same ideas of Theorem 2 and its sketch is given in the appendix.
The above proposition assures us that even if the number of blocks is overestimated, there will be a partition boundary close to every true unknown partition boundary. In many cases it is reasonable to assume that a practitioner has an idea about the number of blocks that she wishes to discover. In that way, our procedure can be used to explore and visualize the data. It remains an open question how to pick the tuning parameters in a data-dependent way so that the number of blocks is estimated consistently.
3.3. Correct neighborhood selection
In this section, we give a result on the consistency of the neighborhood estimation. We will show that, whenever the estimated block ℬ̂j is large enough, say |ℬ̂j| ≥ rn, where {rn}n≥1 is an increasing sequence of numbers that satisfies (rnλ2)−1λ1 → 0 and rnλ2 → ∞ as n → ∞, we have that S(θ̂j) = S(θk), where θk is the true parameter on the true block ℬk that overlaps ℬ̂j the most. Figure 1 illustrates this idea. The blue region in the figure denotes the overlap between the true block and the estimated block of the partition. The orange region corresponds to the overlap of the estimated block with a different true block. If the blue region is considerably larger than the orange region, the bias coming from the samples in the orange region will not be strong enough to prevent us from selecting the correct neighborhood. On the other hand, since the orange region is small, as seen from Theorem 2, there is little hope of estimating the neighborhood correctly on that portion of the sample.
Fig 1.

The figure illustrates where we expect to estimate a neighborhood of a node consistently. The blue region corresponds to the overlap between the true block (bounded by gray lines) and the estimated block (bounded by black lines). If the blue region is much larger than the orange regions, the additional bias introduced from the samples from the orange region will not considerably affect the estimation of the neighborhood of a node on the blue region. However, we cannot hope to consistently estimate the neighborhood of a node on the orange region.
Suppose that we know that there is a solution to the optimization problem (2.2) with the partition boundaries 𝒯̂n. Then that solution is also a minimizer of the following objective

minθ1,…,θB̂∈ℝp−1 Σj∈[B̂] { ½ Σi∈ℬ̂j (xi,a − x′i,\a θj)² + λ2 |ℬ̂j| ||θj||1 } + λ1 Σj∈[2:B̂+1] ||θj − θj−1||2.   (3.4)

Note that the problem (3.4) does not give a practical way of solving (2.2), but it helps us reason about the solutions of (2.2). In particular, while there may be multiple solutions to the problem (2.2), under some conditions we can characterize the sparsity pattern of any solution that has the specified partition boundaries 𝒯̂n.
Lemma 4
Let β̂ be a solution to (2.2), with 𝒯̂n an associated estimate of the partition boundaries. Suppose that the subgradient vectors satisfy |ŷi,b| < 1 for all b ∉ S(β̂·,i). Then any other solution β̌ of (2.2) with the same partition boundaries 𝒯̂n satisfies β̌b,i = 0 for all b ∉ S(β̂·,i).
The above lemma gives sufficient conditions under which the sparsity pattern of a solution with the partition boundaries 𝒯̂n is unique. Note, however, that there may be other solutions to (2.2) that have different partition boundaries.
Now, we are ready to state the following theorem, which establishes that the correct neighborhood is selected on every sufficiently large estimated block of the partition.
Theorem 5
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that the conditions of Theorem 2 are satisfied and, in addition, that A4 holds. Then, if |𝒯̂n| = B + 1, it holds that

ℙ[ S(θ̂a,j) = Sja for all a ∈ V and all j ∈ [B] ] → 1  as n → ∞.
Under the assumptions of Theorem 2, each estimated block is of size of order n. As a result, there are enough samples in each block to consistently estimate the underlying neighborhood structure. Observe that the neighborhood is consistently estimated at each i ∈ ℬ̂j ∩ ℬj for all j ∈ [B], and an error is made only on the small fraction of samples with i ∉ ℬ̂j ∩ ℬj, which is of order 𝒪(nδn).
Using Proposition 3 in place of Theorem 2, it can similarly be shown that, for a large fraction of samples, the neighborhood is consistently estimated even in the case of over-segmentation. In particular, whenever there is a sufficiently large estimated block, with |ℬ̂k ∩ ℬj| = Ω(rn), it holds that S(θ̂k) = Sj with probability tending to one.
4. Alternative estimation procedures
In this section, we discuss some alternative estimation methods to the neighborhood selection detailed in §2. We start by describing how to solve the objective (2.2) with penalties different from the one given in (2.4); in particular, we describe how to minimize the objective when the ℓ2 norm in (2.4) is replaced with the ℓq norm, q ∈ {1, ∞}. Next, we describe how to solve a penalized maximum likelihood objective with the temporal-difference penalty. We do not provide statistical guarantees for the solutions of these objective functions.
4.1. Neighborhood selection with modified penalty
We consider the optimization problem given in (2.2) with the following penalty

penq(β) = λ1 Σi∈[2:n+1] ||β·,i − β·,i−1||q + λ2 Σi∈[n] ||β·,i||1,   (4.1)

which we call the TDq penalty. As in §2.1, we apply the smoothing procedure to the first term in (4.1). Using the dual norm representation, we have

λ1 Σi∈[n−1] ||(βH)·,i||q = maxU∈𝒰q̄ λ1 〈〈βH, U〉〉,

where q̄ is the conjugate exponent of q (1/q + 1/q̄ = 1) and

𝒰q̄ := {U ∈ ℝ(p−1)×(n−1) : ||U·,i||q̄ ≤ 1, ∀i ∈ [n − 1]}.

Next, we define a smooth approximation to the penalty as

Ψq,μ(β) = maxU∈𝒰q̄ { λ1 〈〈βH, U〉〉 − (μ/2) ||U||F² },   (4.2)

where μ > 0 is the smoothness parameter. Let

Uμ(β) = Π𝒰q̄( λ1 μ−1 βH )   (4.3)

be the optimal solution of the maximization problem in (4.2), where Π𝒰q̄(·) is the projection operator onto the set 𝒰q̄. We observe that the projection onto the ℓ∞ unit ball can be easily obtained, while a fast algorithm for projection onto the ℓ1 unit ball can be found in [8]. The gradient can now be obtained as

∇Ψq,μ(β) = λ1 Uμ(β) H′,   (4.4)

and we can proceed as in §2.1 to arrive at the update (2.10).
We have described how to optimize (2.2) with the TDq penalty for q ∈ {1, 2, ∞}. Other ℓq norms are not commonly used in practice. We also note that a different procedure for q = 1 can be found in [26].
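Since the TD1 penalty requires projections onto the ℓ1 unit ball, we include for completeness a sketch of the standard sort-and-threshold projection (an illustrative re-implementation of the well-known algorithm, not the implementation of [8]):

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of a vector v onto {u : ||u||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                     # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, len(v) + 1) > cssv - radius)[-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)       # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```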
4.2. Penalized maximum likelihood estimation
In §2, we related the problem of estimating the zero elements of a precision matrix to a penalized regression procedure. Now, we consider estimating a sparse precision matrix using a penalized maximum likelihood approach. That is, we consider the following optimization procedure

{Ω̂i}i∈[n] = argminΩ1,…,Ωn≻0 Σi∈[n] { 〈〈xix′i, Ωi〉〉 − log |Ωi| } + pen(Ω1, …, Ωn),   (4.5)

where

pen(Ω1, …, Ωn) = λ1 Σi∈[2:n+1] ||Ωi − Ωi−1||F + λ2 Σi∈[n] Σa≠b |ωi,ab|.   (4.6)
In order to optimize (4.5) using the smoothing technique described in §2.1, we need to show that the gradient of the log-likelihood is Lipschitz continuous. The following Lemma establishes the desired result.
Lemma 6
The function f(A) = tr(SA) − log |A| has a Lipschitz continuous gradient on the set {A ∈ ℝp×p : A = A′, Λmin(A) ≥ γ}, with Lipschitz constant L = γ−2.
Following [3], we can show that a solution to the optimization problem (4.5) is, on each estimated block, a positive definite matrix with smallest eigenvalue bounded away from zero. This allows us to use Nesterov's smoothing technique to solve (4.5).
A penalized maximum likelihood approach for estimating a sparse precision matrix was proposed by [42]. Here, we have modified the penalty to perform estimation under the model (1.1). Although the parameters of the precision matrix can be estimated consistently using the penalized maximum likelihood approach, a number of theoretical results have shown that the neighborhood selection procedure requires less stringent assumptions in order to estimate the underlying network consistently [28, 32]. We observe this phenomenon in our simulation studies as well.
5. Numerical studies
In this section, we present a small numerical study on simulated networks. A full performance test and an application to real-world data are beyond the scope of this paper, which mainly focuses on the theory of time-varying model estimation. In all of our simulation studies we set p = 30 and B = 3 with |ℬ1| = 80, |ℬ2| = 130 and |ℬ3| = 90, so that in total we have n = 300 samples. We consider two types of random networks: a chain and a nearest-neighbor network. We measure the performance of the estimation procedure outlined in §2 using the following metrics: average precision of estimated edges, average recall of estimated edges, and the average F1 score, which combines precision and recall. The precision, recall and F1 score are respectively defined as

precision = (number of correctly estimated edges) / (number of estimated edges),
recall = (number of correctly estimated edges) / (number of true edges),
F1 = 2 · precision · recall / (precision + recall).
Furthermore, we report results on estimating the partition boundaries using n−1h(𝒯̂n, 𝒯), where h(𝒯̂n, 𝒯) is defined in (3.1). Results are averaged over 50 simulation runs. We compare the TD-Lasso algorithm introduced in §2.1 against an oracle algorithm that knows the true partition boundaries exactly; in this case, it suffices to run the algorithm of [28] on each block of the partition independently. We use a BIC criterion to select the tuning parameter for this oracle procedure, as described in [31]. Furthermore, we report results for the neighborhood selection procedures introduced in §4, which are denoted TD1-Lasso and TD∞-Lasso, as well as for the penalized maximum likelihood procedure, which is denoted LLmax. We choose the tuning parameters for the penalized maximum likelihood procedure using the BIC procedure.
Chain networks
We follow the simulation in [12] to generate a chain network (see Figure 2). This network corresponds to a tridiagonal precision matrix (after an appropriate permutation of nodes). The network is generated as follows. First, we generate a random permutation π of [p]. Next, the covariance matrix is generated by setting the element at position (a, b) to σab = exp(−|tπ(a) − tπ(b)|/2), where t1 < t2 < ··· < tp and ti − ti−1 ~ Unif(0.5, 1) for i = 2, …, p. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations respectively.
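A minimal sketch of this data-generating process (the helper name chain_covariance and the random seed are illustrative):

```python
import numpy as np

def chain_covariance(p, rng):
    """Chain-graph covariance: sigma_ab = exp(-|t_pi(a) - t_pi(b)| / 2)."""
    t = np.cumsum(np.concatenate(([0.0], rng.uniform(0.5, 1.0, size=p - 1))))
    perm = rng.permutation(p)                 # random relabeling of the nodes
    tp = t[perm]                              # t_{pi(a)} for each node a
    return np.exp(-np.abs(tp[:, None] - tp[None, :]) / 2.0)

rng = np.random.default_rng(1)
sigmas = [chain_covariance(30, rng) for _ in range(3)]          # one per block
sizes = [80, 130, 90]
X = np.vstack([rng.multivariate_normal(np.zeros(30), S, size=m)
               for S, m in zip(sigmas, sizes)])                 # (300, 30) sample
```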
Fig 2.

A chain graph
For illustrative purposes, Figure 3 plots the precision, recall and F1 score computed for different values of the penalty parameters λ1 and λ2. Table 2 shows the precision, recall and F1 score for the parameters chosen using the BIC score described in §2.2, as well as the error in estimating the partition boundaries. The numbers in parentheses correspond to standard deviations. Because there is some error in estimating the partition boundaries, we observe a decrease in performance compared to the oracle procedure that knows the correct positions of the partition boundaries. Further, we observe that the neighborhood selection procedures estimate the graph structure more accurately than the maximum likelihood procedure. For TD1-Lasso we do not report n−1h(𝒯̂n, 𝒯), as the procedure does not estimate the partition boundaries.
Fig 3.
Plots of the precision, recall and F1 scores as functions of the penalty parameters λ1 and λ2 for chain networks estimated using the TD-Lasso. The parameter λ1 is obtained as 100 * 0.9850+i, where i indexes y-axis. The parameter λ2 is computed as 285 * 0.98230+j, where j indexes x-axis. Black dot represents the selected tuning parameters. The white region of each plot corresponds to a region of the parameter space that we did not explore.
Table 2.
Performance of different procedures when estimating chain networks

| Method name | Precision | Recall | F1 score | n−1h(𝒯̂n, 𝒯) |
|---|---|---|---|---|
| TD-Lasso | 0.84 (0.04) | 0.80 (0.04) | 0.82 (0.04) | 0.03 (0.01) |
| TD1-Lasso | 0.78 (0.05) | 0.70 (0.03) | 0.74 (0.04) | N/A |
| TD∞-Lasso | 0.83 (0.03) | 0.80 (0.03) | 0.81 (0.03) | 0.03 (0.01) |
| LLmax | 0.72 (0.03) | 0.65 (0.03) | 0.68 (0.04) | 0.06 (0.02) |
| Oracle procedure | 0.97 (0.02) | 0.89 (0.02) | 0.93 (0.02) | 0 (0) |
Nearest neighbor networks
We generate nearest-neighbor networks following the procedure outlined in [23]. For each node, we draw a point uniformly at random on the unit square and compute the pairwise distances between nodes. Each node is then connected to its 4 closest neighbors (see Figure 4). Since some nodes will have more than 4 adjacent edges, we randomly remove edges from nodes whose degree is larger than 4 until the maximum degree in the network is 4. Each edge (a, b) in this network corresponds to a non-zero element of the precision matrix Ω, whose value is generated uniformly on [−1, −0.5] ∪ [0.5, 1]. The diagonal elements of the precision matrix are set to the smallest positive number that makes the matrix positive definite. Next, we scale the corresponding covariance matrix Σ = Ω−1 to have diagonal elements equal to 1. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations respectively.
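A sketch of this construction (the helper name nn_precision is illustrative, and the diagonal shift used below is a simple surrogate for the smallest-positive-number rule described above):

```python
import numpy as np

def nn_precision(p, rng, k=4):
    """Random nearest-neighbor graph on the unit square with a sparse precision."""
    pts = rng.uniform(size=(p, 2))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    adj = np.zeros((p, p), dtype=bool)
    for a in range(p):                               # each node picks its k nearest
        adj[a, np.argsort(dist[a])[:k]] = True
    adj |= adj.T
    while adj.sum(axis=0).max() > k:                 # trim edges until max degree <= k
        a = int(np.argmax(adj.sum(axis=0)))
        b = int(rng.choice(np.flatnonzero(adj[a])))
        adj[a, b] = adj[b, a] = False
    vals = rng.uniform(0.5, 1.0, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    Omega = np.triu(vals * adj, k=1)
    Omega = Omega + Omega.T
    Omega += (np.abs(np.linalg.eigvalsh(Omega).min()) + 0.1) * np.eye(p)
    Sigma = np.linalg.inv(Omega)
    d = np.sqrt(np.diag(Sigma))
    return Omega, Sigma / np.outer(d, d)             # precision, unit-diagonal covariance
```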
Fig 4.
An instance of a random neighborhood graph with 30 nodes.
For illustrative purposes, Figure 5 plots the precision, recall and F1 score computed for different values of the penalty parameters λ1 and λ2. Table 3 shows the precision, recall, F1 score and n−1h(𝒯̂n, 𝒯) for the parameters chosen using the BIC score, together with their standard deviations. The results obtained for nearest-neighbor networks are qualitatively similar to those obtained for chain networks.
Fig 5.

Plots of the precision, recall and F1 scores as functions of the penalty parameters λ1 and λ2 for nearest neighbor networks estimated using the TD-Lasso. The parameter λ1 is obtained as 100 * 0.9850+i, where i indexes y-axis. The parameter λ2 is computed as 285 * 0.98230+j, where j indexes x-axis. Black dot represents the selected tuning parameters. The white region of each plot corresponds to a region of the parameter space that we did not explore.
Table 3.
Performance of different procedures when estimating random nearest neighbor networks

| Method name | Precision | Recall | F1 score | n−1h(𝒯̂n, 𝒯) |
|---|---|---|---|---|
| TD-Lasso | 0.79 (0.06) | 0.76 (0.05) | 0.77 (0.05) | 0.04 (0.02) |
| TD1-Lasso | 0.70 (0.05) | 0.68 (0.07) | 0.69 (0.06) | N/A |
| TD∞-Lasso | 0.80 (0.06) | 0.75 (0.06) | 0.77 (0.06) | 0.04 (0.02) |
| LLmax | 0.62 (0.08) | 0.60 (0.06) | 0.61 (0.06) | 0.06 (0.02) |
| Oracle procedure | 0.87 (0.05) | 0.82 (0.05) | 0.84 (0.04) | 0 (0) |
6. Conclusion
We have addressed the problem of time-varying covariance selection when the underlying probability distribution changes abruptly at unknown points in time. Using a penalized neighborhood selection approach with a fused-type penalty, we are able to consistently estimate the times at which the distribution changes and the network structure underlying the sample. The proof technique used to establish the convergence of the boundary fractions under the fused-type penalty is novel and constitutes an important contribution of the paper. Furthermore, our procedure estimates the network structure consistently whenever there is a large overlap between the estimated blocks and the unknown true blocks of samples coming from the same distribution. The proof technique used to establish the consistency of the network structure builds on the proof of consistency of the neighborhood selection procedure; however, important modifications are necessary since the times of the distribution changes are not known in advance. Applications of the proposed approach range from cognitive neuroscience, where the problem is to identify changing associations between different parts of the brain when presented with different stimuli, to systems biology, where the task is to identify changing patterns of interactions between genes involved in different cellular processes. We conjecture that our estimation procedure is also valid in the high-dimensional setting where the number of variables p is much larger than the sample size n. We leave the investigation of the rates of convergence in the high-dimensional setting for future work.
7. Proofs
7.1. Proof of Lemma 1
For each i ∈ [n], introduce a (p − 1)-dimensional vector γi defined as γ1 = β·,1 and γi = β·,i − β·,i−1 for i ∈ [2 : n + 1], and rewrite the objective (2.2) as

Σi∈[n] ½ (xi,a − x′i,\a Σj≤i γj)² + λ1 Σi∈[2:n+1] ||γi||2 + λ2 Σi∈[n] ||Σj≤i γj||1.   (7.1)

A necessary and sufficient condition for {γ̂i}i∈[n] to be a solution of (7.1) is that, for each k ∈ [n], the (p − 1)-dimensional zero vector, 0, belongs to the subdifferential of (7.1) with respect to γk evaluated at {γ̂i}i∈[n], that is,

−Σi∈[k:n+1] xi,\a (xi,a − x′i,\a Σj≤i γ̂j) + λ1 ẑk + λ2 Σi∈[k:n+1] ŷi = 0,   (7.2)

where ẑk ∈ ∂||γ̂k||2 (with the convention ẑ1 = 0), that is, ẑk = γ̂k/||γ̂k||2 if γ̂k ≠ 0 and ||ẑk||2 ≤ 1 otherwise, and ŷi ∈ ∂||Σj≤i γ̂j||1, that is, ŷi,b = sign(Σj≤i γ̂j,b) if Σj≤i γ̂j,b ≠ 0 and ŷi,b ∈ [−1, 1] otherwise. The lemma now simply follows from (7.2).
7.2. Proof of Theorem 2
We build on the ideas presented in the proof of Proposition 5 in [17]. Using the union bound,
and it is enough to show that ℙ[|Tj − T̃j| > nδn] → 0 for all j ∈ [B]. Define the event An,j as
and the event Cn as
We show that ℙ[An,j] → 0 by showing that both ℙ[An,j ∩ Cn] → 0 and as n → ∞. The idea here is that, in some sense, the event Cn is a good event on which the estimated boundary partitions and the true boundary partitions are not too far from each other. Considering the two cases will make the analysis simpler.
First, we show that ℙ[An,j ∩ Cn] → 0. Without loss of generality, we assume that T̂j < Tj, since the other case follows using the same reasoning. Using (3.2) twice with k = T̂j and with k = Tj and then applying the triangle inequality we have
| (7.3) |
Some algebra on the above display gives
The above display occurs with probability one, so that the event also occurs with probability one, which gives us the following bound
First, we focus on the event An,j,1. Using lemma 9, we can upper bound ℙ[An,j,1] with
Since under the assumptions of the theorem (nδnξmin)−1 λ1 → 0 and as n → ∞, we have that ℙ[An,j,1] → 0 as n → ∞.
Next, we show that the probability of the event An,j,2 converges to zero. Let T̄j:= ⌊2−1(Tj + Tj+1)⌋. Observe that on the event Cn, T̂j+1 > T̄j so that β̂·,i = θ̂j+1 for all i ∈ [Tj, T̄j]. Using (3.2) with k = Tj and k = T̄j we have that
Using lemma 9 on the display above we have
| (7.4) |
which holds with probability at least 1–2 exp(−Δmin/4+2 log n). We will use the above bound to deal with the event { }. Using lemma 9, we have that φmin(Tj − T̂j)ξmin/9 ≤ ||R1||2 and ||R2||2 ≤ (Tj − T̂j)9φmax||θj+1 − θ̂j+1||2 with probability at least 1 – 4 exp(−nδn/2 + 2 log n). Combining with (7.4), the probability ℙ[An,j,2] is upper bounded by
Under the conditions of the theorem, the first term above converges to zero, since Δmin > nδn and (nδnξmin)−1 λ1 → 0. The second term also converges to zero, since . Using lemma 8, the third term converges to zero with the rate exp(−c6 log n), since . Combining all the bounds, we have that ℙ[An,j,2] → 0 as n → ∞.
Finally, we upper bound the probability of the event An,j,3. As before, φmin(Tj−T̂j)ξmin/9 ≤ ||R1||2 with probability at least 1 – 2 exp(−nδn/2 + 2 log n). This gives us an upper bound on ℙ[An,j,3] as
which, using lemma 8, converges to zero as under the conditions of the theorem . Thus we have shown that ℙ[An,j,3] → 0. Since the case when T̂j > Tj is shown similarly, we have proved that ℙ[An,j ∩ Cn] → 0 as n → ∞.
We proceed to show that as n → ∞. Recall that . Define the following events
and write . First, consider the event under the assumption that T̂j ≤ Tj. Due to symmetry, the other case will follow in a similar way. Observe that
| (7.5) |
We bound the first term in (7.5) and note that the other terms can be bounded in the same way. The following analysis is performed on the event . Using (3.2) with k = T̂j and k = Tj, after some algebra (similar to the derivation of (7.3)) the following holds
with probability at least 1–2 exp(−nδn/2+2 log n), where we have used lemma 9. Let T̄j = ⌊2−1(Tj + Tj+1)⌋. Using (3.2) with k = T̄j and k = Tj after some algebra (similar to the derivation of (7.4)) we obtain the following bound
which holds with probability at least 1 − c1 exp(−nδn/2 + 2 log n), where we have used lemma 9 twice. Combining the last two displays, we can upper bound the first term in (7.5) with
where we have used lemma 8 to obtain the third term. Under the conditions of the theorem, all terms converge to zero. Reasoning similar about the other terms in (7.5), we can conclude that as n → ∞.
Next, we bound the probability of the event , which is upper bounded by
Observe that
so that we have
Using the same arguments as those used to bound terms in (7.5), we have that as n → ∞ under the conditions of the theorem. Similarly, we can show that the term as n → ∞. Thus, we have shown that , which concludes the proof.
7.3. Proof of Lemma 4
Consider 𝒯̂n fixed. The lemma is a simple consequence of duality theory, which states that, given the subdifferential vectors ŷi (which are constant for all i in an estimated block ℬ̂j of the partition 𝒯̂n), all solutions {β̌·,i}i∈[n] of (2.2) need to satisfy the complementary slackness condition Σb∈\a ŷi,b β̌b,i = ||β̌·,i||1, which holds only if β̌b,i = 0 for all b ∈ \a for which |ŷi,b| < 1.
7.4. Proof of Theorem 5
Since the assumptions of theorem 2 are satisfied, we are going to work on the event
In this case, |
| =
(n). For i ∈
, we write
| (7.6) |
where
is the bias. Observe that ∀i ∈
∩
, the bias ei = 0, while for i ∉
∩
, the bias ei is normally distributed with variance bounded by M2φmax under the assumption A1 and A3.
We proceed to show that S(θ̂k) ⊂ Sk. Since θ̂k is an optimal solution of (2.2), it needs to satisfy
| (7.7) |
Now, we will construct the vectors θ̌k, žT̂k−1, žT̂k and y̌T̂k−1 that satisfy (7.7) and verify that the subdifferential vectors are dual feasible. Consider the following restricted optimization problem
| (7.8) |
where the vector is constrained to be 0. Let {θ̌j}j∈ [B̂] be a solution to the restricted optimization problem (7.8). Set the subgradient vectors as žT̂k−1 ∈ ∂||θ̌k − θ̌k−1||, žTk ∈ ∂||θ̌k+1 − θ̌k|| and . Solve (7.7) for y̌T̂k−1,Nk. By construction, the vectors θ̌k, žT̂k−1, žT̂k and y̌T̂k−1 satisfy (7.7). Furthermore, the vectors žT̂k−1 and žT̂k are elements of the subdifferential, and hence dual feasible. To show that θ̌k is also a solution to (3.4), we need to show that ||y̌T̂k−1,Nk||∞ ≤ 1, that is, that y̌T̂k−1 is also dual feasible variable. Using lemma 4, if we show that y̌T̂k−1,Nk is strict dual feasible, ||y̌T̂k−1,Nk||∞ < 1, then any other solution to (3.4) will satisfy .
From (7.7) we can obtain an explicit formula for θ̌Sk
| (7.9) |
Recall that for large enough n we have that |
| > p, so that the matrix
is invertible with probability one. Plugging (7.9) into (7.7), we have that ||y̌T̂k−1,Nk||∞ < 1 if maxb∈Nk |Yb| < 1, where Yb is defined to be
| (7.10) |
where is the projection matrix
Let Σ̃k and be defined as
For i ∈ [n], we let
(i) index the block to which the sample i belongs to. Now, for any b ∈ Nk, we can write
where
is normally distributed with variance
and independent of
. Let Fb ∈ ℝ|
| be the vector whose components are equal to
, i ∈
, and Wb ∈ ℝ|
| be the vector with components equal to
. Using this notation, we write
where
| (7.11) |
| (7.12) |
| (7.13) |
and
| (7.14) |
We analyze each of the terms separately. Starting with the term , after some algebra, we obtain that
| (7.15) |
Recall that we are working on the event
, so that
and
element-wise. Using (7.20) we bound the first two terms in the equation above. We bound the first term by observing that for any j and any b ∈ Nk and n sufficiently large
with probability 1−c1 exp(−c2 log n). Next, for any b ∈ Nk we bound the second term as
with probability 1−c1 exp(−c2 log n). Choosing ε1, ε2 sufficiently small and for n large enough, we have that under the assumption A4.
We proceed with the next term, which can be written as a sum of two contributions. Since we are working on the event of theorem 2, the second of these contributions is dominated by the first. Next, using (7.15) together with (7.20), we have that for all b ∈ Nk
Combining with Lemma 8, we have that under the assumptions of the theorem
We deal with the remaining term by conditioning on the design and on ε. Conditionally, Wb is independent of the terms in the square bracket, since žT̂k−1,S, žT̂k,S and ŷT̂k−1,S are all determined by the solution to the restricted optimization problem. To bound the second term, we observe that, conditionally on the design and ε, the relevant variance can be bounded as
| (7.16) |
where
Using lemma 9 and Young’s inequality, the first term in (7.16) is upper bounded by
with probability at least 1 − 2 exp(−|
|/2 + 2 log n). Using lemma 7 we have that the second term is upper bounded by
with probability at least 1−exp(−c1|
|δ′2+2 log n). Combining the two bounds, we have that
with high probability, using the fact that (|
|λ2)−1λ1 → 0 and |
|λ2 → ∞ as n → ∞. Using the bound on the variance of the term
and the Gaussian tail bound, we have that
Combining the results, we have that maxb∈Nk |Yb| ≤ 1 − α + op(1). For sufficiently large n, under the conditions of the theorem, we have shown that maxb∈Nk |Yb| < 1, which implies that S(θ̂k) ⊆ Sk.
Next, we proceed to show that Sk ⊆ S(θ̂k). Observe that
From (7.7) we have that the quantity above is upper bounded by
Since ẽi ≠ 0 only for samples i outside the intersection of the true and estimated blocks, and nδn divided by the block length tends to 0, the term involving ẽ is stochastically dominated by the term involving ε and can be ignored. Define the following terms
Conditioning on the design, the term T1 is a |Sk|-dimensional Gaussian vector with variance bounded by c1/n, with probability at least 1 − c1 exp(−c2 log n), using lemma 9. Combining with the Gaussian tail bound, the term ||T1||∞ can be upper bounded as
| (7.17) |
Using lemma 9, we have that the corresponding term is suitably bounded with probability greater than 1 − c1 exp(−c2 log n) under the conditions of the theorem. Similarly, the remaining term is bounded with probability greater than 1 − c1 exp(−c2 log n). Combining the terms, the overall bound holds with probability at least 1 − c3 exp(−c4 log n). Since this bound is smaller than θmin for large n, we have shown that Sk ⊆ S(θ̂k). Combining with the first part, it follows that S(θ̂k) = Sk with probability tending to one.
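The control of ||T1||∞ above rests on a variance bound combined with a Gaussian maximal inequality. As a purely numerical illustration with made-up dimensions (the constant c and the size |Sk| below are not the paper's), the maximum of |Sk| centered Gaussians with variance c/n concentrates at the scale sqrt(2 c log|Sk| / n):

```python
# Monte Carlo illustration of the Gaussian maximal inequality; all numbers are
# hypothetical and chosen only to show the scaling.
import numpy as np

rng = np.random.default_rng(2)
n, d, c, reps = 1000, 50, 1.0, 2000           # d plays the role of |Sk|
sigma = np.sqrt(c / n)
maxima = np.abs(sigma * rng.standard_normal((reps, d))).max(axis=1)
print("empirical mean of the max:", float(maxima.mean()))
print("theoretical scale sqrt(2 c log d / n):", float(np.sqrt(2 * c * np.log(d) / n)))
```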
7.5. Proof of Lemma 6
We have that ∇f(A) = A⁻¹. Then
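Assuming f is the log-determinant, which is what the gradient ∇f(A) = A⁻¹ suggests (the definition of f is not restated here), the identity can be checked numerically by finite differences:

```python
# Finite-difference check of grad log det(A) = A^{-1} on a random positive
# definite matrix; illustration only.
import numpy as np

rng = np.random.default_rng(3)
p, eps = 4, 1e-6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)                   # a positive definite matrix

def f(A):
    return np.linalg.slogdet(A)[1]            # log det for a PD matrix

grad_fd = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps
        grad_fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print("max abs error vs A^{-1}:", float(np.max(np.abs(grad_fd - np.linalg.inv(A)))))
```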
Table 1.
Summary of symbols used throughout the paper

| Symbol | Meaning | Example |
|---|---|---|
| [n] | used to denote the set {1, …, n} | |
| [t1 : t2] | used to denote the set {t1, t1 + 1, …, t2 − 1} | |
| i | used for indexing related to samples | xi |
| j, k | used for indexing related to blocks | θa,j |
| a, b | used for indexing nodes in a graph | a, b ∈ V |
| G | the graph consisting of vertices and edges | G = (V, E) |
| V | the set of nodes in a graph | V = [p] |
| Ei | the set of edges at time i | |
| Xa | the component of a random vector X indexed by the vertex a | |
| | the vector of regression coefficients for sample i | |
| θa,j | the vector of regression coefficients for block j | |
| {Tj}j | the set of partition boundaries | |
| {τj}j | the set of boundary fractions | Tj = ⌊nτj⌋ |
| | an index set for the samples in partition j | ⊂ [n] |
| B | the number of partitions | |
| | the set of neighbors of node a in block j | |
| S(θa,j) | the set of non-zero elements of θa,j | |
| | the closure of a set | |
| | nodes not in the neighborhood of node a in block j | |
| \a | the set of all vertices excluding the vertex a | \a = [p]\{a} |
| \|·\| | cardinality of a set or absolute value | |
| Σ | the covariance matrix | |
| σab | an element of the covariance matrix | |
| Ω | the precision matrix | |
| ωab | an element of the precision matrix | |
| 〈·, ·〉 | the dot product | 〈a, b〉 = a′b |
| 〈〈·, ·〉〉 | the dot product between matrices | 〈〈A, B〉〉 = tr(A′B) |
| ξmin | the minimum change between regression coefficients | ‖θa,j − θa,j−1‖2 ≥ ξmin |
| θmin | the minimum size of a coefficient | |
| Δmin | the minimum size of a block | block length ≥ Δmin |
Acknowledgments
We are thankful to Zaïd Harchaoui for an early version of his manuscript [17] and many useful discussions. We thank Larry Wasserman and Ankur P. Parikh for providing comments on an early version of this work and many insightful suggestions. Furthermore, we are very grateful to the Associate Editor and two anonymous referees whose suggestions helped to tremendously improve the manuscript.
Appendix
Technical results
In this section we collect some technical results needed for the proofs presented in §7.
Lemma 7
Let {ζi}i∈[n] be a sequence of i.i.d. 𝒩(0, 1) random variables. If vn ≥ C log n for some constant C > 16, then
for some constant c1 > 0.
Proof
For any 1 ≤ l < r ≤ n, with r − l > vn we have
using (7.21). The lemma follows from an application of the union bound.
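The statement of lemma 7 is not reproduced in full above; the following Monte Carlo sketch only illustrates its flavor, under the assumed choice C = 20 and a hypothetical sample size: averages of squared i.i.d. 𝒩(0, 1) variables stay uniformly close to their mean 1 over every segment longer than vn = C log n.

```python
# Uniform control over all long segments; constants and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, C = 1000, 20
vn = int(np.ceil(C * np.log(n)))
zeta = rng.standard_normal(n)
csum = np.concatenate(([0.0], np.cumsum(zeta ** 2)))

worst = 0.0
for l in range(n):
    for r in range(l + vn + 1, n):            # segments with r - l > v_n
        avg = (csum[r + 1] - csum[l]) / (r - l + 1)
        worst = max(worst, abs(avg - 1.0))

# The worst deviation is of the order sqrt(log n / v_n) = sqrt(1 / C).
print("largest deviation over all long segments:", round(worst, 3))
print("reference scale sqrt(log n / v_n):", round(float(np.sqrt(np.log(n) / vn)), 3))
```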
Lemma 8
Let {xi}i∈[n] be independent observations from (1.1) and let {εi}i∈[n] be independent 𝒩(0, 1) random variables. Assume that A1 holds. If vn ≥ C log n for some constant C > 16, then
for some constants c1, c2 > 0.
Proof
Let Σ1/2 denote the symmetric square root of the covariance matrix ΣSS and, for each i, let (i) index the block of the true partition containing sample i. With this notation, we can write xi = (Σ(i))1/2ui, where ui ~ 𝒩(0, I). For any l ≤ r in the same block we have
Conditioning on {εi}i, for each b ∈ [p] the corresponding quantity is a normal random variable with the stated variance. Hence, conditioned on {εi}i, its squared norm follows a scaled χ2 distribution and
where the last inequality follows from (7.21). Using lemma 7, for all l, r in the same block with r − l > vn, the relevant quantity is bounded by (1 + C)(r − l + 1) with probability at least 1 − exp(−c1 log n), which gives us the following bound
Lemma 9
Let {xi}i ∈ [n] be independent observations from (1.1). Assume that A1 holds. Then for any vn > p,
and
Proof
For any 1 ≤ l < r ≤ n, with r − l ≥ vn we have
using (7.18), convexity of Λmax(·) and A1. The lemma follows from an application of the union bound. The other inequality follows using a similar argument.
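The exact constants and the event in lemma 9 are likewise not reproduced; the sketch below is only a numerical illustration, under an assumed population covariance, that segments longer than vn > p have empirical covariances whose extreme eigenvalues stay within a constant factor of the population ones.

```python
# Segment-wise eigenvalue check; Sigma, n, p and the segment length are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n, p = 2000, 10
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)               # an assumed population covariance
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T

vn = 5 * p                                    # segment length, with vn > p
lmax, lmin = [], []
for l in range(0, n - vn, vn):
    seg = X[l:l + vn]
    S_hat = seg.T @ seg / vn                  # empirical covariance of the segment
    ev = np.linalg.eigvalsh(S_hat)
    lmin.append(ev[0])
    lmax.append(ev[-1])

print("population eigenvalue range:", np.linalg.eigvalsh(Sigma)[[0, -1]])
print("worst segment eigenvalue range:", min(lmin), max(lmax))
```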
Proof of Proposition 3
The proof follows the main ideas already given in the proof of theorem 2; we provide only a sketch.
Given an upper bound on the number of partitions Bmax, we are going to perform the analysis on the event {B̂ ≤ Bmax}. Since
we are going to focus on the probability that the distance between the estimated and the true partition boundaries exceeds nδn, conditionally on the estimated partition having B′ + 1 boundaries, for B′ > B (for B′ = B it follows from theorem 2 that this distance is below nδn with high probability). Let us define the following events
Using the above events, we have the following bound
The probabilities of the above events can be bounded using the same reasoning as in the proof of theorem 2, by repeatedly using the KKT conditions given in (3.2). In particular, we can use the strategy used to bound the event An,j,2. Since the proof is technical and does not reveal any new insight, we omit the details.
A collection of known results
This section collects some known results that we have used in the paper. We start by collecting some results on the eigenvalues of random matrices. Let xi ~ 𝒩(0, Σ), i ∈ [n], and let Σ̂ = n−1 Σi xi(xi)′ be the empirical covariance matrix. Denote the elements of the covariance matrix Σ by [σab] and those of the empirical covariance matrix Σ̂ by [σ̂ab].
Using standard results on concentration of spectral norms and eigenvalues [10], [38] derives the following two crude bounds that can be very useful. Under the assumption that p < n,
| (7.18) |
| (7.19) |
From Lemma A.3 in [6] we have the following bound on the elements of the empirical covariance matrix
| (7.20) |
where c1 and c2 are positive constants that depend only on Λmax(Σ) and ε0.
Next, we use the following tail bound for the χ2 distribution from [25], which holds for all ε > 0,
| (7.21) |
Footnotes
We emphasize that the independence is only present when each instance of the latent time-varying model is given. In practice, such models are unknown, and therefore marginally the samples are dependent. Furthermore, the instances of the latent evolving models that generate the samples are not independent, as will become clear later in the presentation.
Contributor Information
Mladen Kolar, Email: mladenk@cs.cmu.edu.
Eric P. Xing, Email: epxing@cs.cmu.edu.
References
- 1. Ahmed A, Xing EP. Recovering time-varying networks of dependencies in social and biological studies. Proc Natl Acad Sci. 2009;106(29):11878–11883. doi: 10.1073/pnas.0901910106.
- 2. Bai J, Perron P. Estimating and testing linear models with multiple structural changes. Econometrica. 1998;66(1):47–78.
- 3. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research. 2008;9:485–516.
- 4. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res. 2008;9:485–516.
- 5. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci. 2009;2(1):183–202.
- 6. Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann Stat. 2008;36(1):199–227.
- 7. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
- 8. Brucker P. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters. 1984;3(3):163–166.
- 9. Bunea F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electron J Stat. 2008;2:1153.
- 10. Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces. 2001;1:317–366.
- 11. Dempster AP. Covariance selection. Biometrics. 1972;28(1):157–175.
- 12. Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann Appl Stat. 2009;3(2):521–541. doi: 10.1214/08-AOAS215SUPP.
- 13. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
- 14. Getoor L, Taskar B. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press; 2007.
- 15. Guo J, Levina E, Michailidis G, Zhu J. Joint structure estimation for categorical Markov networks. Technical report, Department of Statistics, University of Michigan; 2010.
- 16. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2010, to appear. doi: 10.1093/biomet/asq060.
- 17. Harchaoui Z, Lévy-Leduc C. Multiple change-point estimation with a total-variation penalty. J Am Stat Assoc. 2010;105(492):1480–1493.
- 18. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc B. 1993;55(4):757–796.
- 19. Kolar M, Xing EP. Sparsistent estimation of time-varying discrete Markov random fields. Technical report, Machine Learning Department, Carnegie Mellon University; 2009. Available at arXiv:0907.2337.
- 20. Kolar M, Parikh AP, Xing EP. On sparse nonparametric conditional covariance selection. Proc 27th Ann Int'l Conf Machine Learn; 2010.
- 21. Kolar M, Song L, Ahmed A, Xing EP. Estimating time-varying networks. Ann Appl Stat. 2010;4(1):94–123.
- 22. Lauritzen SL. Graphical Models (Oxford Statistical Science Series). Oxford University Press; 1996.
- 23. Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7(2):302. doi: 10.1093/biostatistics/kxj008.
- 24. Liu J, Wu S, Zidek JV. On segmented multivariate regression. Stat Sin. 1997;7:497–526.
- 25. Lounici K, Pontil M, Tsybakov AB, van de Geer S. Taking advantage of sparsity in multi-task learning. Proc Conf Learning Theory; 2009.
- 26. Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research. 2010;11:19–60.
- 27. Mammen E, van de Geer S. Locally adaptive regression splines. Ann Stat. 1997;25(1):387–413.
- 28. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat. 2006;34(3):1436–1462.
- 29. Nesterov Y. Smooth minimization of non-smooth functions. Math Program. 2005;103(1):127–152.
- 30. Nesterov Y. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain; 2007.
- 31. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. J Am Stat Assoc. 2009;104(486):735–746. doi: 10.1198/jasa.2009.0126.
- 32. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Technical report, Department of Statistics, University of California, Berkeley; 2008.
- 33. Ravikumar P, Wainwright MJ, Lafferty JD. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann Stat. 2010;38(3):1287–1319.
- 34. Rinaldo A. Properties and refinements of the fused lasso. Ann Stat. 2009;37(5):2922–2952.
- 35. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electron J Stat. 2008;2:494–515.
- 36. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J R Stat Soc B. 2005;67(1):91–108.
- 37. van de Geer S, Bühlmann P. On the conditions used to prove oracle results for the lasso. Electron J Stat. 2009;3:1360–1392.
- 38. Wainwright MJ. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE T Inform Theory. 2009;55(5):2183–2202.
- 39. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Found Trends Mach Learn. 2008;1(1–2):1–305.
- 40. Wang P, Chao DL, Hsu L. Learning networks from high dimensional binary data: an application to genomic instability data. Biometrics. 2009, to appear.
- 41. Yin J, Geng Z, Li R, Wang H. Nonparametric covariance model. Statistica Sinica. 2010;20:469–479.
- 42. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
- 43. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–2563.
- 44. Zhou S, Lafferty J, Wasserman L. Time varying undirected graphs. Proc Conf Learning Theory. 2008:455–466.