Abstract
We study the problem of estimating a temporally varying coefficient and varying structure (VCVS) graphical model underlying data collected over a period of time, such as social states of interacting individuals or microarray expression profiles of gene networks, as opposed to i.i.d. data from an invariant model widely considered in the current literature on structural estimation. In particular, we consider the scenario in which the model evolves in a piecewise constant fashion. We propose a procedure that estimates the structure of a graphical model by minimizing a temporally smoothed ℓ1-penalized regression, which allows jointly estimating the partition boundaries of the VCVS model and the coefficients of the sparse precision matrix on each block of the partition. A highly scalable proximal gradient method is proposed to solve the resulting convex optimization problem, and the conditions for sparsistent estimation and the convergence rates of both the partition boundaries and the network structure are established for the first time for such estimators.
Keywords and phrases: Gaussian graphical models, network models, dynamic network models, structural changes
1. Introduction
Networks are a fundamental form of representation of relational information underlying large, noisy data from various domains. For example, in a biological study, nodes of a network can represent genes in one organism and edges can represent associations or regulatory dependencies among genes. In a social analysis, nodes of a network can represent actors and edges can represent interactions or friendships between actors. Exploring the statistical properties and hidden characteristics of network entities, and the stochastic processes behind temporal evolution of network topologies is essential for computational knowledge discovery and prediction based on network data.
In many dynamical environments, such as a developing biological system, it is often technically impossible to experimentally determine the network topologies specific to every time point in a certain time period. Resorting to computational inference methods, such as extant structural learning algorithms, is also difficult because, for every model unique to a single time point, there may exist as few as a single snapshot of the nodal states distributed according to the model in question. In this paper, we consider an estimation problem in a particular dynamic context, where the model evolves piecewise constantly, i.e., it stays structurally invariant during unknown segments of time and then jumps to a different structure.
Approximately piecewise constantly evolving networks underlie many natural dynamic systems of intellectual and practical interest. For example, in a developing biological system such as the fruit fly, the entire life cycle consists of four discrete developmental stages, namely, embryo, larva, pupa, and adult. Across the stages, one expects to see dramatic rewiring of the regulatory network to realize very different regulatory functions driven by different developmental needs, whereas within each stage the changes of the network topology are expected to be relatively mild, as revealed by the smoother trajectories of the gene expression activities, because a largely stable regulatory machinery is employed to control stage-specific developmental processes. Such phenomena are not uncommon in social systems. For example, in the latent social network among senators, even though it is not visible to outsiders, we would expect the network structure to be more stable between elections but more volatile once campaigns start. Although it is legitimate to use a completely unconstrained time-evolving network model to describe or analyze such systems, an approximately piecewise constantly evolving network model better captures the different amounts of network dynamics during different phases of an entire life cycle, and it can detect boundaries between different phases when desirable.
A popular technique for deriving the network structure from an i.i.d. sample is to estimate a sparse precision matrix. The importance of estimating precision matrices with zeros was recognized by Dempster [11], who coined the term covariance selection. The elements of the precision matrix represent the associations, or conditional covariances, between corresponding variables. Once a sparse precision matrix is estimated, a network can be drawn by connecting variables whose corresponding elements of the precision matrix are non-zero. Recent studies have shown that covariance selection methods based on penalized likelihood maximization can lead to a consistent estimate of the network structure underlying a Gaussian Markov random field [12, 32]. Moreover, a particular procedure for covariance selection known as neighborhood selection, which is built on ℓ1-norm regularized regression, can produce a consistent estimate of the network structure when the sample is assumed to follow a general Markov random field distribution whose structure corresponds to the network in question [33, 28, 31]. Specifically, a Markov random field (MRF) is a probabilistic graphical model defined on a graph G = (V, E), where V = {1, …, p} is a vertex set corresponding to the set of random variables to be modeled (in this paper we call them nodes and variables interchangeably), and E ⊆ V × V is the edge set capturing conditional independencies among these nodes. Let X = (X1, …, Xp)′ denote a p-dimensional random vector whose elements are indexed by the nodes of the graph G. Under the MRF, a pair (a, b) is not an element of the edge set E if and only if the variable Xa is conditionally independent of Xb given all the remaining variables XV\{a,b}, that is, Xa ⊥ Xb | XV\{a,b}. A distribution over X can be defined by taking the following log-linear form that makes explicit use of the (presence and absence of edges in the) edge set: p(X) ∝ exp{Σ(a,b)∈E θab Xa Xb}. When the elements of the random vector X are discrete, e.g., X ∈ {0, 1}p, the model is referred to as a discrete MRF, sometimes known as an Ising model in the statistical physics community; whereas when X is a continuous vector, the model is referred to as a Gaussian graphical model (GGM), because one can easily show that the p(X) above is actually a multivariate Gaussian. MRFs have been widely used for modeling data with graphical relational structures over a fixed set of entities [39, 14]. The vertices can describe entities such as genes in a biological regulatory network, stocks in the market, or people in society, while the edges can describe relationships between vertices, for example, interaction, correlation, or influence.
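As a toy illustration of this correspondence between zeros of the precision matrix and missing edges, consider the following minimal NumPy sketch (the three-node chain and its numerical values are chosen purely for illustration):

```python
import numpy as np

# A 3-node chain graph 1 -- 2 -- 3: nodes 1 and 3 are not adjacent,
# so the (1, 3) entry of the precision matrix is zero.
Omega = np.array([[1.0, 0.4, 0.0],
                  [0.4, 1.0, 0.4],
                  [0.0, 0.4, 1.0]])
Sigma = np.linalg.inv(Omega)

# X1 and X3 are marginally dependent (non-zero covariance), yet their
# partial correlation given X2 vanishes, reflecting the missing edge.
partial_corr_13 = -Omega[0, 2] / np.sqrt(Omega[0, 0] * Omega[2, 2])
print(Sigma[0, 2])        # non-zero marginal covariance
print(partial_corr_13)    # 0.0: X1 and X3 conditionally independent given X2
```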
The statistical problem we consider in this paper is to estimate the structure of the Gaussian graphical model from observed samples of nodal states in a dynamic world. Traditional methods handle this problem under the assumption that the samples are i.i.d. Let 𝒟 = {x1, …, xn} be an independent and identically distributed sample from a p-dimensional multivariate normal distribution 𝒩(0, Σ), where Σ is the covariance matrix. Let Ω := Σ−1 denote the precision matrix, with elements (ωab), 1 ≤ a, b ≤ p. One can then obtain an estimator of Ω from 𝒟 by optimizing a proper statistical loss function, such as the likelihood or a penalized likelihood. As mentioned earlier, the precision matrix Ω encodes the conditional independence structure of the distribution, and the pattern of zero elements in the precision matrix defines the structure of the associated graph G. There has been a dramatic growth of interest in the recent literature in the problem of covariance selection, which deals with the graph estimation problem above. Existing work ranges from algorithmic development focusing on efficient estimation procedures to theoretical analysis focusing on statistical guarantees of different estimators. We do not intend to give an extensive overview of the literature here, but interested readers can follow the pointers below. In the classical literature (e.g., [22]), procedures are developed for small-dimensional graphs and commonly involve hypothesis testing with greedy selection of edges. More recent literature estimates the sparse precision matrix by optimizing a penalized likelihood [42, 12, 4, 35, 13, 32, 16, 44] or through neighborhood selection [28, 31, 15, 40], where the structure of the graph is estimated by estimating the neighborhood of each node. Both of these approaches are suitable for high-dimensional problems, even when p ≫ n, and can be efficiently implemented using scalable convex program solvers.
Most of the above mentioned work assumes that a single invariant network model is sufficient to describe the dependencies in the observed data. However, when the observed data are not iid, such an assumption is not justifiable. For example, when data consist of microarray measurements of the gene expression levels collected throughout the cell cycle or development of an organism, different genes can be active during different stages. This suggests that different distributions and hence different networks should be used to describe the dependencies between measured variables at different time intervals. In this paper, we are going to tackle the problem of estimating the structure of the GGM when the structure is allowed to change over time. By assuming that the parameters of the precision matrix change with time, we obtain extra flexibility to model a larger class of distributions while still retaining the interpretability of the static GGM. In particular, as the coefficients of the precision matrix change over time, we also allow the structure of the underlying graph to change as well. This semi-parametric generalization of the parametric model is referred to as a varying coefficient varying structure (VCVS) model.
Now, let {xi}i∈[n], xi ∈ ℝp, be a sequence of n conditionally independent observations (we use [n] to denote the set {1, …, n}) from p-dimensional multivariate normal distributions, not necessarily the same for every observation. Let {ℬj}j∈[B] be a disjoint partition of the set [n] in which each block consists of consecutive elements, that is, ℬj ∩ ℬj′ = ∅ for j ≠ j′, ⋃j ℬj = [n], and ℬj = [Tj−1 : Tj] := {Tj−1, Tj−1 + 1, …, Tj − 1}. Let 𝒯 := {T0 = 1 < T1 < ··· < TB = n + 1} denote the set of partition boundaries.
We consider the following model

xi ~ 𝒩(0, Σj)  for all i ∈ ℬj, j ∈ [B],   (1.1)

so that the observations indexed by elements of ℬj are p-dimensional realizations of a multivariate normal distribution with zero mean and a covariance matrix Σj that is unique to the j-th segment of the time series. Let Ωj := (Σj)−1 denote the precision matrix with elements (ωj,ab). With the number of blocks B and the partition boundaries 𝒯 unknown, we study the problem of estimating both the partition {ℬj}j∈[B] and the non-zero elements of the precision matrices {Ωj}j∈[B] from the sample {xi}i∈[n]. Note that in this work we study a particular case of the VCVS model, in which the coefficients are piecewise constant functions of time. Although this model does not yet entirely agree with how real-world time series data behave, as no existing model does, this instantiation of the VCVS model comes, in some sense, one step closer to the real-world scenario than other popular approaches to time series modeling, such as hidden Markov models or state space models, where stationary emission models such as linear Gaussian ones are usually employed to relate observations at different time points to simple latent states. Here, instead of positing that an observation at time t is derived from a latent state that transitions stationarily from the previous time point, we assume that such an observation is generated from a latent network model that is related to the network models active at the previous and subsequent time points nonparametrically. As suggested above, many real-world dynamic systems, such as the stage-specific development of multi-cellular organisms like the fruit fly and the evolving network of latent relatedness between politicians, are likely to behave approximately piecewise constantly; therefore, time series data from such systems, such as continuous-valued gene expression microarray time series and discrete-valued voting records, are suitable examples to which our proposed model can be applied [1]. A scenario in which the coefficients are smoothly varying functions of time has been considered in [44] for the GGM and in [21] and [19] for the Ising model; that setting complements the model studied in this paper, whose asymptotic properties are somewhat easier to analyze.
If the partition {ℬj}j∈[B] were known, the problem would be trivially reduced to the setting analyzed in previous work. Dealing with the unknown partition, together with the structure estimation of the model, calls for new methods. We propose and analyze a method based on time-coupled neighborhood selection, where the model estimates are forced to stay similar across time using a fusion-type total variation penalty, and the sparsity of each neighborhood is obtained through the ℓ1 penalty. Details of the approach are given in §2.
The model in Eq. (1.1) is related to varying-coefficient models (e.g., [18]) with the coefficients being piecewise constant functions. Varying-coefficient regression models with piecewise constant coefficients are also known as segmented multivariate regression models [24] or linear models with structural changes [2]. The structural changes are commonly determined through hypothesis testing, and a separate linear model is fit to each of the estimated segments. In our work, we use a penalized model selection approach to jointly estimate the partition boundaries and the model parameters.
Little work has been done so far on modeling dynamic networks and estimating changing precision matrices. [44] develops a nonparametric method for estimating a time-varying GGM, where xt ~ 𝒩(0, Σ(t)) and Σ(t) changes smoothly over time. The procedure is based on the penalized likelihood approach of [42], with the empirical covariance matrix obtained using a kernel smoother. Our work is very different from that of [44], since under our assumptions the network changes abruptly rather than smoothly. Furthermore, as we outline in §2, our estimation procedure is not based on the penalized likelihood approach. Estimation of time-varying Ising models has been discussed in [1] and [21]. [41] and [20] studied nonparametric ways to estimate the conditional covariance matrix. The work of [1] is the most similar to our setting, as they also use a fused-type penalty combined with an ℓ1 penalty to estimate the structure of a varying Ising model. Here, in addition to focusing on GGMs, there is a subtle but important difference from [1]: we use a modification of the fusion penalty (formally described in §2) which allows us to characterize the model selection consistency of our estimates and the convergence properties of the estimated partition boundaries, neither of which is available in the earlier work.
The remainder of the paper is organized as follows. In §2, we describe our estimation procedure and provide an efficient first-order optimization procedure capable of estimating large graphs. The optimization procedure is based on the smoothing technique of [29] and converges in 𝒪(1/ε) iterations, where ε is the desired accuracy. Our main theoretical results are presented in §3. In particular, we show that the partition boundaries are estimated consistently. Furthermore, the graph structure is consistently estimated on every block of the partition that contains enough samples. In §4, we discuss alternative estimation procedures based on penalized maximum likelihood estimation instead of neighborhood selection. Numerical studies showing the finite-sample performance of our procedure are given in §5. The proofs of the main results are relegated to §7, with some technical details presented in the Appendix.
Notation schemes
For clarity, we end the introduction with a summary of the notation used in the paper. We use [n] to denote the set {1, …, n} and [l : r] to denote the set {l, l + 1, …, r − 1}. We use ℬj to denote the j-th block of the partition; with some abuse of notation, we also use ℬj to denote the index set [Tj−1 : Tj]. The number of samples in the block ℬj is denoted |ℬj|. For a set S ⊂ V, we use the notation XS to denote the set {Xa : a ∈ S} of random variables. We use X to denote the n × p matrix whose rows consist of observations. The vector Xa = (x1,a, …, xn,a)′ denotes a column of the matrix X and, similarly, XS = (Xb : b ∈ S) denotes the n × |S| sub-matrix of X whose columns are indexed by the set S, while Xℬj denotes the |ℬj| × p sub-matrix whose rows are indexed by the set ℬj. For simplicity of notation, we use \a to denote the index set [p]\{a}, so that X\a = (Xb : b ∈ [p]\{a}). For a vector a ∈ ℝp, we let S(a) denote the set of non-zero components of a. Throughout the paper, we use c1, c2, … to denote positive constants whose values may change from line to line. For a vector a ∈ ℝn, define ||a||1 = Σi∈[n] |ai|, ||a||2 = (Σi∈[n] ai²)^(1/2) and ||a||∞ = maxi |ai|. For a symmetric matrix A, Λmin(A) denotes the smallest and Λmax(A) the largest eigenvalue. For a matrix A (not necessarily symmetric), we use |||A|||∞ = maxi Σj |Aij|. For two vectors a, b ∈ ℝn, the dot product is denoted 〈a, b〉 = Σi∈[n] aibi. For two matrices A, B ∈ ℝn×m, the dot product is denoted 〈〈A, B〉〉 = tr(A′B). Given two sequences {an} and {bn}, the notation an = 𝒪(bn) means that there exists a constant c1 such that an ≤ c1bn; the notation an = Ω(bn) means that there exists a constant c2 such that an ≥ c2bn; and the notation an ≍ bn means that an = 𝒪(bn) and bn = 𝒪(an). Similarly, we use the notation an = op(bn) to denote that an/bn converges to 0 in probability. Table 1 summarizes the symbols used throughout the paper.
2. Graph estimation via Temporal-Difference Lasso
In this section, we introduce our time-varying covariance selection procedure, which is based on the time-coupled neighborhood selection using the fused-type penalty. We call the proposed procedure Temporal-Difference Lasso (TD-Lasso). We start by reviewing the basic neighborhood selection procedure, which has previously been used to estimate graphs in, e.g., [31, 28, 33, 15].
We start by relating the elements of the precision matrix Ω to a regression problem. Let Sa denote the neighborhood of the node a, let S̄a denote the closure of Sa, S̄a := Sa ∪ {a}, and let Na denote the set of nodes not in the neighborhood of the node a, Na = [p]\S̄a. It holds that Xa ⊥ XNa | XSa. The neighborhood of the node a can be read off from the non-zero pattern of the elements of the precision matrix Ω, Sa = {b ∈ [p]\{a} : ωab ≠ 0}. See [22] for more details. It is a well-known result for Gaussian graphical models that the coefficients θa = (θab)b∈\a of the regression of Xa on X\a are given by θab = −ωab/ωaa. Therefore, the neighborhood of a node a, Sa, is equal to the set of non-zero coefficients of θa. Using the expression for θa, we can write Xa = X′\a θa + εa, where εa ~ 𝒩(0, ω−1aa) is independent of X\a.
The neighborhood selection procedure was motivated by the above relationship between the regression coefficients and the elements of the precision matrix. [28] proposed to solve the following optimization problem

θ̂a = argminθ∈ℝp−1 (2n)−1 ||Xa − X\a θ||2² + λ||θ||1,   (2.1)

and proved that, for an i.i.d. sample and a suitably chosen penalty parameter λ, the non-zero coefficients of θ̂a consistently estimate the neighborhood of the node a.
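To make the neighborhood selection step concrete, the following minimal scikit-learn sketch regresses one node on the remaining ones with an ℓ1 penalty and reads the estimated neighborhood off the non-zero coefficients (the sample size and the penalty value below are illustrative choices):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
p, n = 10, 500

# Sparse precision matrix of a chain graph and the implied covariance.
Omega = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)

a = 0                                     # node whose neighborhood we estimate
y, Z = X[:, a], np.delete(X, a, axis=1)   # regress X_a on X_{\a}

# Lasso minimizes (1/2n)||y - Z theta||_2^2 + lambda ||theta||_1.
theta_hat = Lasso(alpha=0.1).fit(Z, y).coef_
neighbors = np.delete(np.arange(p), a)[np.abs(theta_hat) > 1e-8]
print(neighbors)   # ideally recovers {1}, the only neighbor of node 0
```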
In this paper, we build on the neighborhood selection procedure to estimate the changing graph structure in model (1.1). We use Sja to denote the neighborhood of the node a on the block ℬj and Nja to denote the nodes not in the neighborhood of the node a on the j-th block, Nja = [p]\(Sja ∪ {a}). Consider the following estimation procedure

β̂a = argminβ∈ℝ(p−1)×n ℒ(β) + pen(β),   (2.2)

where the loss is defined for β = (βb,i)b∈[p−1],i∈[n] as

ℒ(β) = ½ Σi∈[n] (xi,a − x′i,\a β·,i)²,   (2.3)

and the penalty is defined as

pen(β) = λ1 Σi∈[2:n+1] ||β·,i − β·,i−1||2 + λ2 Σi∈[n] ||β·,i||1.   (2.4)

The penalty is constructed from two terms. The first term ensures that the solution is piecewise constant for some partition of [n] (possibly a trivial one); it can be seen as a sparsity-inducing term in the temporal domain, since it penalizes the difference between the coefficients β·,i and β·,i+1 at successive time points. The second term results in estimates that have many zero coefficients within each block of the partition. The estimated set of partition boundaries 𝒯̂n := {T̂0 = 1 < T̂1 < ··· < T̂B̂ = n + 1} contains the indices of the points at which a change is estimated, that is, the points i for which β̂a·,i ≠ β̂a·,i−1, with B̂ being an estimate of the number of blocks B. The estimated number of blocks B̂ is controlled through the user-defined penalty parameter λ1, while the sparsity of the neighborhoods is controlled through the penalty parameter λ2.
Based on the estimated set of partition boundaries 𝒯̂n, we can define the neighborhood estimate of the node a for each estimated block. Let θ̂a,j := β̂a·,i, ∀i ∈ [T̂j−1 : T̂j], be the estimated coefficient vector for the block ℬ̂j = [T̂j−1 : T̂j]. Using the estimated vector θ̂a,j, we define the neighborhood estimate of the node a for the block ℬ̂j as

Ŝja := S(θ̂a,j) = {b ∈ \a : θ̂a,jb ≠ 0}.

Solving (2.2) for each node a ∈ V gives a neighborhood estimate for each node. Combining the neighborhood estimates, we obtain an estimate of the graph structure at each time point i ∈ [n].
The choice of the penalty term is motivated by work on penalization using total variation [34, 27], which results in a piecewise constant approximation of an unknown regression function. The fusion penalty has also been applied in the context of multivariate linear regression [36], where coefficients that are spatially close are encouraged to take similar values; as a result, nearby coefficients are fused to the same estimated value. Instead of penalizing the ℓ1 norm of the difference between coefficients, we use the ℓ2 norm in order to enforce that all coordinates of the coefficient vector change at the same time points.
The objective (2.2) estimates the neighborhood of one node of the graph at all time points. After solving (2.2) for all nodes a ∈ V, we need to combine the estimates to obtain the graph structure. We use the following procedure to combine {β̂a}a∈V:

Êi := {(a, b) ∈ V × V : max(|β̂ab,i|, |β̂ba,i|) > 0},  i ∈ [n].

That is, an edge between nodes a and b is included in the graph if at least one of the nodes a or b is included in the neighborhood of the other node. We use the max operator to combine different neighborhoods because we believe that, for the purpose of network exploration, it is more important to occasionally include spurious edges than to omit relevant ones. For further discussion of the differences between the min and the max combination, we refer the interested reader to [4].
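A small sketch of this combination step at a fixed time point (the helper combine_neighborhoods and its adjacency-matrix output format are illustrative):

```python
import numpy as np

def combine_neighborhoods(beta_hat, rule="max"):
    """Combine per-node coefficient vectors into an adjacency matrix.

    beta_hat[a] is the (p-1)-vector of coefficients of node a at a fixed
    time point (node a itself removed).  With the "max" rule an edge (a, b)
    is reported whenever at least one of the two nodes selects the other;
    the "min" rule would require both to agree.
    """
    p = len(beta_hat)
    coef = np.zeros((p, p))
    for a in range(p):
        others = [b for b in range(p) if b != a]
        coef[a, others] = beta_hat[a]
    indicator = (np.abs(coef) > 1e-8).astype(float)
    combine = np.maximum if rule == "max" else np.minimum
    adj = combine(indicator, indicator.T)
    np.fill_diagonal(adj, 0.0)
    return adj
```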
2.1. Numerical procedure
Finding a minimizer β̂a of (2.2) can be a computationally challenging task for an off-the-shelf convex optimization procedure. We propose to use an accelerated gradient method with a smoothing technique [29], which converges in 𝒪(1/ε) iterations, where ε is the desired accuracy.
We start by defining a smooth approximation of the fused penalty term. Let H ∈ ℝn×(n−1) be the matrix with elements Hi,i = −1, Hi+1,i = 1 for i ∈ [n − 1] and Hij = 0 otherwise, so that the i-th column of βH equals β·,i+1 − β·,i. With the matrix H we can rewrite the fused penalty term as Ψ(β) := λ1 Σi∈[n−1] ||(βH)·,i||2, and using the fact that the ℓ2 norm is self-dual (e.g., see [7]) we have the following representation

Ψ(β) = maxU∈𝒰 λ1 〈〈βH, U〉〉,   (2.5)

where 𝒰 := {U ∈ ℝ(p−1)×(n−1) : ||U·,i||2 ≤ 1, ∀i ∈ [n − 1]}. The following function is defined as a smooth approximation to the fused penalty,

Ψμ(β) = maxU∈𝒰 { λ1 〈〈βH, U〉〉 − (μ/2) ||U||F² },   (2.6)

where μ > 0 is the smoothness parameter and ||·||F denotes the Frobenius norm. It is easy to see that

Ψμ(β) ≤ Ψ(β) ≤ Ψμ(β) + μ(n − 1)/2.

Setting the smoothness parameter to μ = ε/(n − 1) ensures the correct rate of convergence. Let Uμ(β) be the optimal solution of the maximization problem in (2.6), which can be obtained analytically as

Uμ(β) = Π𝒰( λ1 μ−1 βH ),   (2.7)

where Π𝒰(·) is the (column-wise) projection operator onto the set 𝒰. From Theorem 1 in [29], we have that Ψμ(β) is continuously differentiable and convex, with the gradient

∇Ψμ(β) = λ1 Uμ(β) H′,   (2.8)

which is Lipschitz continuous.
With the above defined smooth approximation, we focus on minimizing the following objective

F(β) = ℒ(β) + Ψμ(β) + λ2 Σi∈[n] ||β·,i||1.

Following [5] (see also [30]), we define the following quadratic approximation of F(β) at a point β0,

QL(β, β0) = ℒ(β0) + Ψμ(β0) + 〈〈β − β0, ∇ℒ(β0) + ∇Ψμ(β0)〉〉 + (L/2) ||β − β0||F² + λ2 Σi∈[n] ||β·,i||1,   (2.9)

where L > 0 is a parameter chosen as an upper bound on the Lipschitz constant of ∇ℒ + ∇Ψμ. Let pL(β0) be a minimizer of QL(β, β0). Ignoring constant terms, pL(β0) can be obtained as

pL(β0) = argminβ (L/2) ||β − (β0 − L−1(∇ℒ(β0) + ∇Ψμ(β0)))||F² + λ2 Σi∈[n] ||β·,i||1.

It is clear that pL(β0) is the unique minimizer, which can be obtained in closed form via soft-thresholding,

pL(β0) = T( β0 − L−1(∇ℒ(β0) + ∇Ψμ(β0)), λ2/L ),   (2.10)

where T(x, λ) = sign(x) max(0, |x| − λ) is the soft-thresholding operator applied element-wise.
In practice, an upper bound on the Lipschitz constant of ∇ℒ + ∇Ψμ can be expensive to compute, so the parameter L is determined iteratively. Combining all of the above, we arrive at Algorithm 1. In the algorithm, β0 is set to zero or, if the optimization problem is solved for a sequence of tuning parameters, it can be set to the solution β̂ obtained for the previous set of tuning parameters. The parameter γ is a constant used to increase the estimate of the Lipschitz constant L; we set γ = 1.5 in our experiments and initialize L = 1. Compared to the gradient descent method (which can be obtained by iterating βk+1 = pL(βk)), the accelerated gradient method updates two sequences {βk} and {zk} recursively. Instead of performing the gradient step from the latest approximate solution βk, the gradient step is performed from a search point zk obtained as a linear combination of the last two approximate solutions βk−1 and βk. Since the condition F(pL(zk)) ≤ QL(pL(zk), zk) is satisfied in every iteration, the algorithm converges in 𝒪(1/ε) iterations, following [5]. As the convergence criterion, we stop iterating once the relative change in the objective value is below a threshold (e.g., we use 10−4).
Algorithm 1.
Accelerated Gradient Method for Equation (2.2)
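The iteration can be summarized by the following minimal Python sketch of the accelerated scheme with backtracking described above (the function name td_lasso_node, the default values of μ, γ and the stopping tolerance, and the particular momentum sequence are illustrative choices rather than part of the formal algorithm):

```python
import numpy as np

def soft_threshold(x, lam):
    """Element-wise soft-thresholding operator T(x, lam)."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def td_lasso_node(y, Z, lam1, lam2, mu=1e-3, max_iter=2000, tol=1e-4, gamma=1.5):
    """Accelerated proximal gradient sketch for one node of objective (2.2).

    y : (n,) observations of node a;  Z : (n, p-1) observations of the other
    nodes.  Returns beta of shape (p-1, n), one coefficient column per time
    point.  The smoothness parameter mu should scale with the target
    accuracy; the default here is an arbitrary small value.
    """
    n, q = Z.shape
    beta, z, t, L = np.zeros((q, n)), np.zeros((q, n)), 1.0, 1.0
    obj_old = np.inf

    def smooth_value_grad(b):
        r = y - np.einsum('ij,ji->i', Z, b)                  # per-time residuals
        D = b[:, 1:] - b[:, :-1]                             # successive differences (bH)
        V = (lam1 / mu) * D
        U = V / np.maximum(1.0, np.linalg.norm(V, axis=0))   # column-wise l2 projection
        value = 0.5 * np.sum(r ** 2) + lam1 * np.sum(U * D) - 0.5 * mu * np.sum(U ** 2)
        grad = -Z.T * r + lam1 * np.hstack(
            [-U[:, :1], U[:, :-1] - U[:, 1:], U[:, -1:]])    # grad of loss plus U H'
        return value, grad

    for _ in range(max_iter):
        f_z, g_z = smooth_value_grad(z)
        while True:                                          # backtracking on L
            beta_new = soft_threshold(z - g_z / L, lam2 / L)
            f_new, _ = smooth_value_grad(beta_new)
            d = beta_new - z
            if f_new <= f_z + np.sum(g_z * d) + 0.5 * L * np.sum(d ** 2):
                break
            L *= gamma
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t ** 2))
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        beta, t = beta_new, t_new
        obj = f_new + lam2 * np.abs(beta).sum()
        if abs(obj_old - obj) <= tol * max(1.0, abs(obj_old)):
            break
        obj_old = obj
    return beta
```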
2.2. Tuning parameter selection
The penalty parameters λ1 and λ2 control the complexity of the estimated model. In this work, we propose to use the BIC score to select the tuning parameters. Define the BIC score for each node a ∈ V as

BICa(λ1, λ2) := n log( 2ℒ(β̂a)/n ) + log(n) · df(β̂a),   (2.11)

where ℒ(·) is defined in (2.3), β̂a = β̂a(λ1, λ2) is a solution of (2.2), and df(β̂a) denotes the number of free parameters of the fitted piecewise constant model, that is, the number of estimated blocks plus the number of non-zero coefficients on each estimated block. The penalty parameters can now be chosen as

(λ̂1, λ̂2) = argminλ1,λ2 Σa∈V BICa(λ1, λ2).   (2.12)
We will use the above formula to select the tuning parameters in our simulations, where we are going to search for the best choice of parameters over a grid.
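A minimal sketch of this grid search follows (the degrees-of-freedom count inside the BIC score and the grid handling are illustrative; the solver argument can be, for example, the td_lasso_node sketch from §2.1):

```python
import numpy as np

def bic_for_node(y, Z, beta, eps=1e-6):
    """Gaussian-style BIC: n*log(RSS/n) + log(n)*df, with df an assumed count."""
    n = len(y)
    resid = y - np.einsum('ij,ji->i', Z, beta)
    rss = max(float(np.sum(resid ** 2)), 1e-12)
    jumps = np.linalg.norm(np.diff(beta, axis=1), axis=0) > eps
    starts = np.concatenate(([0], np.flatnonzero(jumps) + 1))   # block start points
    # df: number of estimated blocks plus non-zero coefficients in each block.
    df = len(starts) + sum(int(np.sum(np.abs(beta[:, s]) > eps)) for s in starts)
    return n * np.log(rss / n) + np.log(n) * df

def select_tuning(y, Z, lam1_grid, lam2_grid, solver):
    """Grid search over (lambda1, lambda2) minimizing the per-node BIC score."""
    best = None
    for lam1 in lam1_grid:
        for lam2 in lam2_grid:
            beta = solver(y, Z, lam1, lam2)
            score = bic_for_node(y, Z, beta)
            if best is None or score < best[0]:
                best = (score, lam1, lam2, beta)
    return best
```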
3. Theoretical results
This section addresses the statistical properties of the estimation procedure presented in Section 2. The properties are studied in an asymptotic framework by letting the sample size n grow while keeping the other parameters fixed. For the asymptotic framework to make sense, we assume that there exists a fixed unknown sequence of numbers {τj} that defines the partition boundaries as Tj = ⌊nτj⌋, where ⌊a⌋ denotes the largest integer not greater than a. This assures that, as the number of samples grows, the same fraction of samples falls into every block of the partition. We call {τj} the boundary fractions.
We give sufficient conditions under which the sequence {τj} is consistently estimated. In particular, if the number of partition blocks is estimated correctly, then we show that maxj∈[B] |T̂j − Tj| ≤ nδn with probability tending to 1, where {δn}n is a non-increasing sequence of positive numbers that tends to zero. If the number of partition segments is overestimated, then we show that, for the distance defined for two sets A and B as

h(A, B) := maxb∈B mina∈A |a − b|,   (3.1)

we have h(𝒯̂n, 𝒯) ≤ nδn with probability tending to 1. With the partition boundaries consistently estimated, we further show that, under suitable conditions, for each node a ∈ V the correct neighborhood is selected on all estimated blocks of the partition that are sufficiently large.
The proof technique employed in this section is quite involved, so we briefly describe its steps. Our analysis is based on a careful inspection of the optimality conditions that a solution β̂a of the optimization problem (2.2) needs to satisfy; these conditions are given in §3.2. Using the optimality conditions, we establish the rate of convergence of the partition boundaries. This is done by contradiction: suppose that there is a solution whose partition boundary estimate 𝒯̂n satisfies h(𝒯̂n, 𝒯) ≥ nδn. We then show that, with high probability, all such solutions fail to satisfy the KKT conditions and therefore cannot be optimal. This shows that, with high probability, all solutions of the optimization problem (2.2) produce partition boundaries that are "close" to the true partition boundaries. Once it is established that 𝒯̂n and 𝒯 satisfy h(𝒯̂n, 𝒯) ≤ nδn, we can further show that the neighborhoods are consistently estimated, under the assumption that the estimated blocks of the partition contain enough samples. This part of the analysis follows the strategy commonly used to prove that the Lasso is sparsistent (e.g., see [9, 38, 28]); however, important modifications are required because the positions of the partition boundaries are being estimated.
Our analysis is going to focus on one node a ∈ V and its neighborhood. However, using the union bound over all nodes in V, we will be able to carry over conclusions to the whole graph. To simplify our notation, when it is clear from the context, we will omit the superscript a and write β̂, θ̂ and S, etc., to denote β̂a, θ̂a and Sa, etc.
3.1. Assumptions
Before presenting our theoretical results, we give some definitions and assumptions that will be used throughout this section. Let Δmin := minj∈[B] |Tj − Tj−1| denote the minimum length between change points, ξmin := mina∈V minj∈[B−1] ||θa,j+1 − θa,j||2 the minimum jump size, and θmin := mina∈V minj∈[B] minb∈Sja |θa,jb| the minimum coefficient size. Throughout the section, we assume that the following holds.
A1 There exist two constants φmin > 0 and φmax < ∞ such that φmin ≤ Λmin(Σj) and Λmax(Σj) ≤ φmax for all j ∈ [B].
A2 The variables are scaled so that σj,aa = 1 for all j ∈ [B] and all a ∈ V.
The assumption A1 is commonly used to ensure that the model is identifiable. If the population covariance matrix is ill-conditioned, the question of correct model identification is not well defined, as the neighborhood of a node may not be uniquely defined. The assumption A2 is made for simplicity of presentation; unit variances can be obtained through scaling.
A3 There exists a constant M > 0 such that maxa∈V maxj,k∈[B] ||θa,j − θa,k||2 ≤ M.
The assumption A3 states that the difference between the coefficients on two different blocks, ||θa,k − θa,j||2, is bounded for all j, k ∈ [B]. This assumption is trivially satisfied if the coefficients θa,j are bounded in the ℓ2 norm.
A4 There exists a constant α ∈ (0, 1] such that the following holds:

maxj∈[B] maxb∈Nja || Σjb,Sja (ΣjSja,Sja)−1 ||1 ≤ 1 − α.

The assumption A4 states that the variables in the neighborhood of the node a, XSja, are not too correlated with the variables in the set Nja. This assumption is necessary and sufficient for correct identification of the relevant variables in Lasso regression problems (e.g., see [43, 37]). Note that this condition remains sufficient in our setting, where the correct partition boundaries are not known.
A5 The minimum coefficient size θmin satisfies a suitable lower bound, so that non-zero partial correlation coefficients remain detectable.
The lower bound on the minimum coefficient size θmin is necessary: if a partial correlation coefficient is too close to zero, the corresponding edge in the graph cannot be detected.
A6 The sequence of partition boundaries {Tj} satisfies Tj = ⌊nτj⌋, where {τj} is a fixed, unknown sequence of boundary fractions belonging to [0, 1].
This assumption is needed for the asymptotic setting: as n → ∞, there will be enough sample points in each block to estimate the neighborhoods of the nodes correctly.
3.2. Convergence of the partition boundaries
In this subsection we establish the rate of convergence of the partition boundaries for the estimator (2.2). We start with a lemma that characterizes the solutions of the optimization problem (2.2). Note that the optimization problem (2.2) is convex; however, it may have multiple solutions, since it is not strictly convex.
Lemma 1
A matrix β̂ ∈ ℝ(p−1)×n is optimal for the optimization problem (2.2) if and only if there exists a collection of subgradient vectors {ẑi}i∈[2:n+1] and {ŷi}i∈[n], with ẑi ∈ ∂||β̂·,i − β̂·,i−1||2 and ŷi ∈ ∂||β̂·,i||1, such that

Σi∈[k:n+1] xi,\a (xi,a − x′i,\a β̂·,i) = λ1 ẑk + λ2 Σi∈[k:n+1] ŷi   (3.2)

for all k ∈ [n], with the convention ẑ1 = ẑn+1 = 0.
The following theorem provides the convergence rate of the estimated partition boundaries in 𝒯̂n, under the assumption that the correct number of blocks is known.
Theorem 2
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that A1–A3 and A5–A6 hold. Suppose that the penalty parameters λ1 and λ2 satisfy the scaling condition

(3.3)

Let {β̂·,i}i∈[n] be any solution of (2.2) and let 𝒯̂n be the associated estimate of the block partition. Let {δn}n≥1 be a non-increasing positive sequence that converges to zero as n → ∞ and satisfies Δmin ≥ nδn for all n ≥ 1. Furthermore, suppose that (nδnξmin)−1λ1 → 0, together with the remaining rate conditions on λ2, δn and ξmin used in the proof in §7. Then, if |𝒯̂n| = B + 1, the following holds:

ℙ[ maxj∈[B] |T̂j − Tj| ≤ nδn ] → 1  as n → ∞.
The proof builds on techniques developed in [17] and is presented in §7.
Suppose that δn = (log n)γ/n for some γ > 1 and that the penalty parameters are chosen so that the conditions of Theorem 2 are satisfied; then the sequence of boundary fractions {τj} is consistently estimated. Since the boundary fractions are consistently estimated, we will see below that the estimated neighborhood S(θ̂j) on the block ℬ̂j consistently recovers the true neighborhood Sj.
Unfortunately, the correct number of blocks B may not be known. However, a conservative upper bound Bmax on the number of blocks B may be available. Suppose that the sequence of observations is over-segmented, with the number of estimated blocks bounded by Bmax. Then the following proposition gives an upper bound on h(𝒯̂n, 𝒯), where h(·, ·) is defined in (3.1).
Proposition 3
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that the conditions of Theorem 2 are satisfied. Let β̂ be a solution of (2.2) and 𝒯̂n the corresponding set of partition boundaries, with B̂ blocks. If the number of blocks satisfies B ≤ B̂ ≤ Bmax, then

ℙ[ h(𝒯̂n, 𝒯) ≤ nδn ] → 1  as n → ∞.
The proof of the proposition follows the same ideas of Theorem 2 and its sketch is given in the appendix.
The above proposition assures us that even if the number of blocks is overestimated, there will be a partition boundary close to every true unknown partition boundary. In many cases it is reasonable to assume that a practitioner has an idea about the number of blocks that she wishes to discover. In that way, our procedure can be used to explore and visualize the data. It remains an open question how to pick the tuning parameters in a data-dependent way so that the number of blocks is estimated consistently.
3.3. Correct neighborhood selection
In this section, we give a result on the consistency of the neighborhood estimation. We will show that, whenever the estimated block ℬ̂j is large enough, say |ℬ̂j| ≥ rn, where {rn}n≥1 is an increasing sequence of numbers that satisfies (rnλ2)−1λ1 → 0 and rnλ2 → ∞ as n → ∞, we have that S(θ̂j) = S(θk), where θk is the true parameter on the true block ℬk that overlaps ℬ̂j the most. Figure 1 illustrates this idea. The blue region in the figure denotes the overlap between the true block and the estimated block of the partition. The orange region corresponds to the overlap of the estimated block with a different true block. If the blue region is considerably larger than the orange region, the bias coming from the samples in the orange region will not be strong enough to prevent us from selecting the correct neighborhood. On the other hand, since the orange region is small, as seen from Theorem 2, there is little hope of estimating the neighborhood correctly on that portion of the sample.
Fig 1.

The figure illustrates where we expect to estimate a neighborhood of a node consistently. The blue region corresponds to the overlap between the true block (bounded by gray lines) and the estimated block (bounded by black lines). If the blue region is much larger than the orange regions, the additional bias introduced from the samples from the orange region will not considerably affect the estimation of the neighborhood of a node on the blue region. However, we cannot hope to consistently estimate the neighborhood of a node on the orange region.
Suppose that we know that there is a solution to the optimization problem (2.2) with the partition boundaries 𝒯̂n. Then that solution is also a minimizer of the following objective

minθ1,…,θB̂∈ℝp−1 Σj∈[B̂] { ½ Σi∈ℬ̂j (xi,a − x′i,\a θj)² + λ2 |ℬ̂j| ||θj||1 } + λ1 Σj∈[2:B̂+1] ||θj − θj−1||2.   (3.4)

Note that the problem (3.4) does not give a practical way of solving (2.2), but it helps us reason about the solutions of (2.2). In particular, while there may be multiple solutions to the problem (2.2), under some conditions we can characterize the sparsity pattern of any solution that has the specified partition boundaries 𝒯̂n.
Lemma 4
Let β̂ be a solution to (2.2), with 𝒯̂n an associated estimate of the partition boundaries. Suppose that the subgradient vectors satisfy |ŷi,b| < 1 for all b ∉ S(β̂·,i). Then any other solution β̌ of (2.2) with the same partition boundaries 𝒯̂n satisfies β̌b,i = 0 for all b ∉ S(β̂·,i).
The above lemma gives sufficient conditions under which the sparsity pattern of a solution with the partition boundaries 𝒯̂n is unique. Note, however, that there may be other solutions to (2.2) that have different partition boundaries.
Now, we are ready to state the following theorem, which establishes that the correct neighborhood is selected on every sufficiently large estimated block of the partition.
Theorem 5
Let {xi}i∈[n] be a sequence of observations from the model (1.1). Assume that the conditions of Theorem 2 are satisfied and, in addition, that A4 holds. Then, if |𝒯̂n| = B + 1, it holds that

ℙ[ S(θ̂a,j) = Sja for all a ∈ V and all j ∈ [B] ] → 1  as n → ∞.
Under the assumptions of Theorem 2, each estimated block is of size of order n. As a result, there are enough samples in each block to consistently estimate the underlying neighborhood structure. Observe that the neighborhood is consistently estimated at each i ∈ ℬ̂j ∩ ℬj for all j ∈ [B], and an error is made only on the small fraction of samples with i ∉ ℬ̂j ∩ ℬj, which is of order 𝒪(nδn).
Using Proposition 3 in place of Theorem 2, it can similarly be shown that, for a large fraction of samples, the neighborhood is consistently estimated even in the case of over-segmentation. In particular, whenever there is a sufficiently large estimated block, with |ℬ̂k ∩ ℬj| = Ω(rn), it holds that S(θ̂k) = Sj with probability tending to one.
4. Alternative estimation procedures
In this section, we discuss some alternative estimation methods to the neighborhood selection detailed in §2. We start by describing how to solve the objective (2.2) with penalties different from the one given in (2.4); in particular, we describe how to minimize the objective when the ℓ2 norm in (2.4) is replaced with the ℓq norm, q ∈ {1, ∞}. Next, we describe how to solve a penalized maximum likelihood objective with the temporal-difference penalty. We do not provide statistical guarantees for the solutions of these objective functions.
4.1. Neighborhood selection with modified penalty
We consider the optimization problem given in (2.2) with the following penalty

penq(β) = λ1 Σi∈[2:n+1] ||β·,i − β·,i−1||q + λ2 Σi∈[n] ||β·,i||1,   (4.1)

which we call the TDq penalty. As in §2.1, we apply the smoothing procedure to the first term in (4.1). Using the dual norm representation, we have

λ1 Σi∈[n−1] ||(βH)·,i||q = maxU∈𝒰q̄ λ1 〈〈βH, U〉〉,

where q̄ is the conjugate exponent of q (1/q + 1/q̄ = 1) and

𝒰q̄ := {U ∈ ℝ(p−1)×(n−1) : ||U·,i||q̄ ≤ 1, ∀i ∈ [n − 1]}.

Next, we define a smooth approximation to the penalty as

Ψq,μ(β) = maxU∈𝒰q̄ { λ1 〈〈βH, U〉〉 − (μ/2) ||U||F² },   (4.2)

where μ > 0 is the smoothness parameter. Let

Uμ(β) = Π𝒰q̄( λ1 μ−1 βH )   (4.3)

be the optimal solution of the maximization problem in (4.2), where Π𝒰q̄(·) is the projection operator onto the set 𝒰q̄. We observe that the projection onto the ℓ∞ unit ball can be easily obtained, while a fast algorithm for projection onto the ℓ1 unit ball can be found in [8]. The gradient can now be obtained as

∇Ψq,μ(β) = λ1 Uμ(β) H′,   (4.4)

and we can proceed as in §2.1 to arrive at the update (2.10).
We have described how to optimize (2.2) with the TDq penalty for q ∈ {1, 2, ∞}. Other ℓq norms are not commonly used in practice. We also note that a different procedure for q = 1 can be found in [26].
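Since the TD1 penalty requires projections onto the ℓ1 unit ball, we include for completeness a sketch of the standard sort-and-threshold projection (an illustrative re-implementation of the well-known algorithm, not the implementation of [8]):

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of a vector v onto {u : ||u||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                     # magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.flatnonzero(u * np.arange(1, len(v) + 1) > cssv - radius)[-1]
    theta = (cssv[rho] - radius) / (rho + 1.0)       # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```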
4.2. Penalized maximum likelihood estimation
In §2, we related the problem of estimating the zero elements of a precision matrix to a penalized regression procedure. Now, we consider estimating a sparse precision matrix using a penalized maximum likelihood approach. That is, we consider the following optimization procedure

{Ω̂i}i∈[n] = argminΩ1,…,Ωn≻0 Σi∈[n] { 〈〈xix′i, Ωi〉〉 − log |Ωi| } + pen(Ω1, …, Ωn),   (4.5)

where

pen(Ω1, …, Ωn) = λ1 Σi∈[2:n+1] ||Ωi − Ωi−1||F + λ2 Σi∈[n] Σa≠b |ωi,ab|.   (4.6)
In order to optimize (4.5) using the smoothing technique described in §2.1, we need to show that the gradient of the log-likelihood is Lipschitz continuous. The following Lemma establishes the desired result.
Lemma 6
The function f(A) = tr(SA) − log |A| has a Lipschitz continuous gradient on the set {A ∈ ℝp×p : A = A′, Λmin(A) ≥ γ}, with Lipschitz constant L = γ−2.
Following [3], we can show that a solution to the optimization problem (4.5) is, on each estimated block, a positive definite matrix with smallest eigenvalue bounded away from zero. This allows us to use Nesterov's smoothing technique to solve (4.5).
A penalized maximum likelihood approach for estimating a sparse precision matrix was proposed by [42]. Here, we have modified the penalty to perform estimation under the model (1.1). Although the parameters of the precision matrix can be estimated consistently using the penalized maximum likelihood approach, a number of theoretical results have shown that the neighborhood selection procedure requires less stringent assumptions in order to estimate the underlying network consistently [28, 32]. We observe this phenomenon in our simulation studies as well.
5. Numerical studies
In this section, we present a small numerical study on simulated networks. A full performance test and an application to real-world data are beyond the scope of this paper, which mainly focuses on the theory of time-varying model estimation. In all of our simulation studies we set p = 30 and B = 3 with |ℬ1| = 80, |ℬ2| = 130 and |ℬ3| = 90, so that in total we have n = 300 samples. We consider two types of random networks: a chain and a nearest-neighbor network. We measure the performance of the estimation procedure outlined in §2 using the following metrics: average precision of estimated edges, average recall of estimated edges, and the average F1 score, which combines precision and recall. The precision, recall and F1 score are respectively defined as

precision = (number of correctly estimated edges) / (number of estimated edges),
recall = (number of correctly estimated edges) / (number of true edges),
F1 = 2 · precision · recall / (precision + recall).
Furthermore, we report results on estimating the partition boundaries using n−1h(𝒯̂n, 𝒯), where h(𝒯̂n, 𝒯) is defined in (3.1). Results are averaged over 50 simulation runs. We compare the TD-Lasso algorithm introduced in §2.1 against an oracle algorithm that knows the true partition boundaries exactly; in this case, it suffices to run the algorithm of [28] on each block of the partition independently. We use a BIC criterion to select the tuning parameter for this oracle procedure, as described in [31]. Furthermore, we report results for the neighborhood selection procedures introduced in §4, which are denoted TD1-Lasso and TD∞-Lasso, as well as for the penalized maximum likelihood procedure, which is denoted LLmax. We choose the tuning parameters for the penalized maximum likelihood procedure using the BIC procedure.
Chain networks
We follow the simulation in [12] to generate a chain network (see Figure 2). This network corresponds to a tridiagonal precision matrix (after an appropriate permutation of nodes). The network is generated as follows. First, we generate a random permutation π of [p]. Next, the covariance matrix is generated by setting the element at position (a, b) to σab = exp(−|tπ(a) − tπ(b)|/2), where t1 < t2 < ··· < tp and ti − ti−1 ~ Unif(0.5, 1) for i = 2, …, p. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations respectively.
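A minimal sketch of this data-generating process (the helper name chain_covariance and the random seed are illustrative):

```python
import numpy as np

def chain_covariance(p, rng):
    """Chain-graph covariance: sigma_ab = exp(-|t_pi(a) - t_pi(b)| / 2)."""
    t = np.cumsum(np.concatenate(([0.0], rng.uniform(0.5, 1.0, size=p - 1))))
    perm = rng.permutation(p)                 # random relabeling of the nodes
    tp = t[perm]                              # t_{pi(a)} for each node a
    return np.exp(-np.abs(tp[:, None] - tp[None, :]) / 2.0)

rng = np.random.default_rng(1)
sigmas = [chain_covariance(30, rng) for _ in range(3)]          # one per block
sizes = [80, 130, 90]
X = np.vstack([rng.multivariate_normal(np.zeros(30), S, size=m)
               for S, m in zip(sigmas, sizes)])                 # (300, 30) sample
```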
Fig 2.

A chain graph
For illustrative purposes, Figure 3 plots the precision, recall and F1 score computed for different values of the penalty parameters λ1 and λ2. Table 2 shows the precision, recall and F1 score for the parameters chosen using the BIC score described in §2.2, as well as the error in estimating the partition boundaries. The numbers in parentheses correspond to standard deviations. Because there is some error in estimating the partition boundaries, we observe a decrease in performance compared to the oracle procedure that knows the correct positions of the partition boundaries. Further, we observe that the neighborhood selection procedures estimate the graph structure more accurately than the maximum likelihood procedure. For TD1-Lasso we do not report n−1h(𝒯̂n, 𝒯), as the procedure does not estimate the partition boundaries.
Fig 3.
Plots of the precision, recall and F1 scores as functions of the penalty parameters λ1 and λ2 for chain networks estimated using the TD-Lasso. The parameter λ1 is obtained as 100 * 0.9850+i, where i indexes y-axis. The parameter λ2 is computed as 285 * 0.98230+j, where j indexes x-axis. Black dot represents the selected tuning parameters. The white region of each plot corresponds to a region of the parameter space that we did not explore.
Table 2.
Performance of different procedures when estimating chain networks

| Method name | Precision | Recall | F1 score | n−1h(𝒯̂n, 𝒯) |
|---|---|---|---|---|
| TD-Lasso | 0.84 (0.04) | 0.80 (0.04) | 0.82 (0.04) | 0.03 (0.01) |
| TD1-Lasso | 0.78 (0.05) | 0.70 (0.03) | 0.74 (0.04) | N/A |
| TD∞-Lasso | 0.83 (0.03) | 0.80 (0.03) | 0.81 (0.03) | 0.03 (0.01) |
| LLmax | 0.72 (0.03) | 0.65 (0.03) | 0.68 (0.04) | 0.06 (0.02) |
| Oracle procedure | 0.97 (0.02) | 0.89 (0.02) | 0.93 (0.02) | 0 (0) |
Nearest neighbor networks
We generate nearest-neighbor networks following the procedure outlined in [23]. For each node, we draw a point uniformly at random on the unit square and compute the pairwise distances between nodes. Each node is then connected to its 4 closest neighbors (see Figure 4). Since some nodes will have more than 4 adjacent edges, we randomly remove edges from nodes whose degree is larger than 4 until the maximum degree in the network is 4. Each edge (a, b) in this network corresponds to a non-zero element of the precision matrix Ω, whose value is generated uniformly on [−1, −0.5] ∪ [0.5, 1]. The diagonal elements of the precision matrix are set to the smallest positive number that makes the matrix positive definite. Next, we scale the corresponding covariance matrix Σ = Ω−1 to have diagonal elements equal to 1. This process is repeated three times to obtain three different covariance matrices, from which we sample 80, 130 and 90 observations respectively.
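A sketch of this construction (the helper name nn_precision is illustrative, and the diagonal shift used below is a simple surrogate for the smallest-positive-number rule described above):

```python
import numpy as np

def nn_precision(p, rng, k=4):
    """Random nearest-neighbor graph on the unit square with a sparse precision."""
    pts = rng.uniform(size=(p, 2))
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    adj = np.zeros((p, p), dtype=bool)
    for a in range(p):                               # each node picks its k nearest
        adj[a, np.argsort(dist[a])[:k]] = True
    adj |= adj.T
    while adj.sum(axis=0).max() > k:                 # trim edges until max degree <= k
        a = int(np.argmax(adj.sum(axis=0)))
        b = int(rng.choice(np.flatnonzero(adj[a])))
        adj[a, b] = adj[b, a] = False
    vals = rng.uniform(0.5, 1.0, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    Omega = np.triu(vals * adj, k=1)
    Omega = Omega + Omega.T
    Omega += (np.abs(np.linalg.eigvalsh(Omega).min()) + 0.1) * np.eye(p)
    Sigma = np.linalg.inv(Omega)
    d = np.sqrt(np.diag(Sigma))
    return Omega, Sigma / np.outer(d, d)             # precision, unit-diagonal covariance
```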
Fig 4.
An instance of a random neighborhood graph with 30 nodes.
For illustrative purposes, Figure 5 plots the precision, recall and F1 score computed for different values of the penalty parameters λ1 and λ2. Table 3 shows the precision, recall, F1 score and n−1h(𝒯̂n, 𝒯) for the parameters chosen using the BIC score, together with their standard deviations. The results obtained for nearest-neighbor networks are qualitatively similar to those obtained for chain networks.
Fig 5.

Plots of the precision, recall and F1 scores as functions of the penalty parameters λ1 and λ2 for nearest neighbor networks estimated using the TD-Lasso. The parameter λ1 is obtained as 100 * 0.9850+i, where i indexes y-axis. The parameter λ2 is computed as 285 * 0.98230+j, where j indexes x-axis. Black dot represents the selected tuning parameters. The white region of each plot corresponds to a region of the parameter space that we did not explore.
Table 3.
Performance of different procedures when estimating random nearest neighbor networks

| Method name | Precision | Recall | F1 score | n−1h(𝒯̂n, 𝒯) |
|---|---|---|---|---|
| TD-Lasso | 0.79 (0.06) | 0.76 (0.05) | 0.77 (0.05) | 0.04 (0.02) |
| TD1-Lasso | 0.70 (0.05) | 0.68 (0.07) | 0.69 (0.06) | N/A |
| TD∞-Lasso | 0.80 (0.06) | 0.75 (0.06) | 0.77 (0.06) | 0.04 (0.02) |
| LLmax | 0.62 (0.08) | 0.60 (0.06) | 0.61 (0.06) | 0.06 (0.02) |
| Oracle procedure | 0.87 (0.05) | 0.82 (0.05) | 0.84 (0.04) | 0 (0) |
6. Conclusion
We have addressed the problem of time-varying covariance selection when the underlying probability distribution changes abruptly at unknown points in time. Using a penalized neighborhood selection approach with a fused-type penalty, we are able to consistently estimate the times at which the distribution changes and the network structure underlying the sample. The proof technique used to establish the convergence of the boundary fractions under the fused-type penalty is novel and constitutes an important contribution of the paper. Furthermore, our procedure estimates the network structure consistently whenever there is a large overlap between the estimated blocks and the unknown true blocks of samples coming from the same distribution. The proof technique used to establish the consistency of the network structure builds on the proof of consistency of the neighborhood selection procedure; however, important modifications are necessary since the times of the distribution changes are not known in advance. Applications of the proposed approach range from cognitive neuroscience, where the problem is to identify changing associations between different parts of the brain when presented with different stimuli, to systems biology, where the task is to identify changing patterns of interactions between genes involved in different cellular processes. We conjecture that our estimation procedure is also valid in the high-dimensional setting where the number of variables p is much larger than the sample size n. We leave the investigation of the rates of convergence in the high-dimensional setting for future work.
7. Proofs
7.1. Proof of Lemma 1
For each i ∈ [n], introduce a (p − 1)-dimensional vector γi defined as γ1 = β·,1 and γi = β·,i − β·,i−1 for i ∈ [2 : n + 1], and rewrite the objective (2.2) as

Σi∈[n] ½ (xi,a − x′i,\a Σj≤i γj)² + λ1 Σi∈[2:n+1] ||γi||2 + λ2 Σi∈[n] ||Σj≤i γj||1.   (7.1)

A necessary and sufficient condition for {γ̂i}i∈[n] to be a solution of (7.1) is that, for each k ∈ [n], the (p − 1)-dimensional zero vector, 0, belongs to the subdifferential of (7.1) with respect to γk evaluated at {γ̂i}i∈[n], that is,

−Σi∈[k:n+1] xi,\a (xi,a − x′i,\a Σj≤i γ̂j) + λ1 ẑk + λ2 Σi∈[k:n+1] ŷi = 0,   (7.2)

where ẑk ∈ ∂||γ̂k||2 (with the convention ẑ1 = 0), that is, ẑk = γ̂k/||γ̂k||2 if γ̂k ≠ 0 and ||ẑk||2 ≤ 1 otherwise, and ŷi ∈ ∂||Σj≤i γ̂j||1, that is, ŷi,b = sign(Σj≤i γ̂j,b) if Σj≤i γ̂j,b ≠ 0 and ŷi,b ∈ [−1, 1] otherwise. The lemma now simply follows from (7.2).
7.2. Proof of Theorem 2
We build on the ideas presented in the proof of Proposition 5 in [17]. Using the union bound,
and it is enough to show that ℙ[|Tj − T̃j| > nδn] → 0 for all j ∈ [B]. Define the event An,j as
and the event Cn as
We show that ℙ[An,j] → 0 by showing that both ℙ[An,j ∩ Cn] → 0 and as n → ∞. The idea here is that, in some sense, the event Cn is a good event on which the estimated boundary partitions and the true boundary partitions are not too far from each other. Considering the two cases will make the analysis simpler.
First, we show that ℙ[An,j ∩ Cn] → 0. Without loss of generality, we assume that T̂j < Tj, since the other case follows using the same reasoning. Using (3.2) twice with k = T̂j and with k = Tj and then applying the triangle inequality we have
| (7.3) |
Some algebra on the above display gives
The above display occurs with probability one, so that the event also occurs with probability one, which gives us the following bound
First, we focus on the event An,j,1. Using lemma 9, we can upper bound ℙ[An,j,1] with
Since under the assumptions of the theorem (nδnξmin)−1 λ1 → 0 and as n → ∞, we have that ℙ[An,j,1] → 0 as n → ∞.
Next, we show that the probability of the event An,j,2 converges to zero. Let T̄j:= ⌊2−1(Tj + Tj+1)⌋. Observe that on the event Cn, T̂j+1 > T̄j so that β̂·,i = θ̂j+1 for all i ∈ [Tj, T̄j]. Using (3.2) with k = Tj and k = T̄j we have that
Using lemma 9 on the display above we have
| (7.4) |
which holds with probability at least 1–2 exp(−Δmin/4+2 log n). We will use the above bound to deal with the event { }. Using lemma 9, we have that φmin(Tj − T̂j)ξmin/9 ≤ ||R1||2 and ||R2||2 ≤ (Tj − T̂j)9φmax||θj+1 − θ̂j+1||2 with probability at least 1 – 4 exp(−nδn/2 + 2 log n). Combining with (7.4), the probability ℙ[An,j,2] is upper bounded by
Under the conditions of the theorem, the first term above converges to zero, since Δmin > nδn and (nδnξmin)−1 λ1 → 0. The second term also converges to zero, since . Using lemma 8, the third term converges to zero with the rate exp(−c6 log n), since . Combining all the bounds, we have that ℙ[An,j,2] → 0 as n → ∞.
Finally, we upper bound the probability of the event An,j,3. As before, φmin(Tj−T̂j)ξmin/9 ≤ ||R1||2 with probability at least 1 – 2 exp(−nδn/2 + 2 log n). This gives us an upper bound on ℙ[An,j,3] as
which, using lemma 8, converges to zero as under the conditions of the theorem . Thus we have shown that ℙ[An,j,3] → 0. Since the case when T̂j > Tj is shown similarly, we have proved that ℙ[An,j ∩ Cn] → 0 as n → ∞.
We proceed to show that as n → ∞. Recall that . Define the following events
and write . First, consider the event under the assumption that T̂j ≤ Tj. Due to symmetry, the other case will follow in a similar way. Observe that
| (7.5) |
We bound the first term in (7.5) and note that the other terms can be bounded in the same way. The following analysis is performed on the event . Using (3.2) with k = T̂j and k = Tj, after some algebra (similar to the derivation of (7.3)) the following holds
with probability at least 1–2 exp(−nδn/2+2 log n), where we have used lemma 9. Let T̄j = ⌊2−1(Tj + Tj+1)⌋. Using (3.2) with k = T̄j and k = Tj after some algebra (similar to the derivation of (7.4)) we obtain the following bound
which holds with probability at least 1 − c1 exp(−nδn/2 + 2 log n), where we have used lemma 9 twice. Combining the last two displays, we can upper bound the first term in (7.5) with
where we have used lemma 8 to obtain the third term. Under the conditions of the theorem, all terms converge to zero. Reasoning similar about the other terms in (7.5), we can conclude that as n → ∞.
Next, we bound the probability of the event , which is upper bounded by
Observe that
so that we have
Using the same arguments as those used to bound terms in (7.5), we have that as n → ∞ under the conditions of the theorem. Similarly, we can show that the term as n → ∞. Thus, we have shown that , which concludes the proof.
7.3. Proof of Lemma 4
Consider 𝒯̂n fixed. The lemma is a simple consequence of duality theory, which states that, given the subdifferential vectors ŷi (which are constant for all i in an estimated block ℬ̂j of the partition 𝒯̂n), all solutions {β̌·,i}i∈[n] of (2.2) need to satisfy the complementary slackness condition Σb∈\a ŷi,b β̌b,i = ||β̌·,i||1, which holds only if β̌b,i = 0 for all b ∈ \a for which |ŷi,b| < 1.
7.4. Proof of Theorem 5
Since the assumptions of theorem 2 are satisfied, we are going to work on the event
In this case, |
| =
(n). For i ∈
, we write
| (7.6) |
where
is the bias. Observe that ∀i ∈
∩
, the bias ei = 0, while for i ∉
∩
, the bias ei is normally distributed with variance bounded by M2φmax under the assumption A1 and A3.
We proceed to show that S(θ̂k) ⊂ Sk. Since θ̂k is an optimal solution of (2.2), it needs to satisfy
| (7.7) |
Now, we will construct the vectors θ̌k, žT̂k−1, žT̂k and y̌T̂k−1 that satisfy (7.7) and verify that the subdifferential vectors are dual feasible. Consider the following restricted optimization problem
| (7.8) |
where the vector is constrained to be 0. Let {θ̌j}j∈ [B̂] be a solution to the restricted optimization problem (7.8). Set the subgradient vectors as žT̂k−1 ∈ ∂||θ̌k − θ̌k−1||, žTk ∈ ∂||θ̌k+1 − θ̌k|| and . Solve (7.7) for y̌T̂k−1,Nk. By construction, the vectors θ̌k, žT̂k−1, žT̂k and y̌T̂k−1 satisfy (7.7). Furthermore, the vectors žT̂k−1 and žT̂k are elements of the subdifferential, and hence dual feasible. To show that θ̌k is also a solution to (3.4), we need to show that ||y̌T̂k−1,Nk||∞ ≤ 1, that is, that y̌T̂k−1 is also dual feasible variable. Using lemma 4, if we show that y̌T̂k−1,Nk is strict dual feasible, ||y̌T̂k−1,Nk||∞ < 1, then any other solution to (3.4) will satisfy .
From (7.7) we can obtain an explicit formula for θ̌Sk
| (7.9) |
Recall that for large enough n we have that |
| > p, so that the matrix
is invertible with probability one. Plugging (7.9) into (7.7), we have that ||y̌T̂k−1,Nk||∞ < 1 if maxb∈Nk |Yb| < 1, where Yb is defined to be
| (7.10) |
where is the projection matrix
Let Σ̃k and be defined as
For i ∈ [n], we let
(i) index the block to which the sample i belongs to. Now, for any b ∈ Nk, we can write
where
is normally distributed with variance
and independent of
. Let Fb ∈ ℝ|
| be the vector whose components are equal to
, i ∈
, and Wb ∈ ℝ|
| be the vector with components equal to
. Using this notation, we write
where
| (7.11) |
| (7.12) |
| (7.13) |
and
| (7.14) |
We analyze each of the terms separately. Starting with the term , after some algebra, we obtain that
| (7.15) |
Recall that we are working on the event
, so that
and
element-wise. Using (7.20) we bound the first two terms in the equation above. We bound the first term by observing that for any j and any b ∈ Nk and n sufficiently large
with probability 1−c1 exp(−c2 log n). Next, for any b ∈ Nk we bound the second term as
with probability 1−c1 exp(−c2 log n). Choosing ε1, ε2 sufficiently small and for n large enough, we have that under the assumption A4.
We proceed with the next term, which can be written as a sum of two contributions. Since we are working on the event of theorem 2, the second of these contributions is dominated by the first. Next, using (7.15) together with (7.20), we have that for all b ∈ Nk
Combining with Lemma 8, we have that under the assumptions of the theorem
We deal with the remaining term by conditioning on the design and on ε. Conditionally, Wb is independent of the terms in the square bracket, since žT̂k−1,S, žT̂k,S and ŷT̂k−1,S are all determined by the solution to the restricted optimization problem. To bound the second term, we observe that, conditionally on the design and ε, the relevant variance can be bounded as
| (7.16) |
where
Using lemma 9 and Young’s inequality, the first term in (7.16) is upper bounded by
with probability at least 1 − 2 exp(−|
|/2 + 2 log n). Using lemma 7 we have that the second term is upper bounded by
with probability at least 1−exp(−c1|
|δ′2+2 log n). Combining the two bounds, we have that
with high probability, using the fact that (|
|λ2)−1λ1 → 0 and |
|λ2 → ∞ as n → ∞. Using the bound on the variance of the term
and the Gaussian tail bound, we have that
Combining the results, we have that maxb∈Nk |Yb| ≤ 1 − α + op(1). For sufficiently large n, under the conditions of the theorem, we have shown that maxb∈Nk |Yb| < 1, which implies that S(θ̂k) ⊆ Sk.
Next, we proceed to show that Sk ⊆ S(θ̂k). Observe that
From (7.7) we have that the quantity above is upper bounded by
Since ẽi ≠ 0 only for samples i outside the intersection of the true and estimated blocks, and nδn divided by the block length tends to 0, the term involving ẽ is stochastically dominated by the term involving ε and can be ignored. Define the following terms
Conditioning on the design, the term T1 is a |Sk|-dimensional Gaussian vector with variance bounded by c1/n, with probability at least 1 − c1 exp(−c2 log n), using lemma 9. Combining with the Gaussian tail bound, the term ||T1||∞ can be upper bounded as
| (7.17) |
Using lemma 9, we have that the corresponding term is suitably bounded with probability greater than 1 − c1 exp(−c2 log n) under the conditions of the theorem. Similarly, the remaining term is bounded with probability greater than 1 − c1 exp(−c2 log n). Combining the terms, the overall bound holds with probability at least 1 − c3 exp(−c4 log n). Since this bound is smaller than θmin for large n, we have shown that Sk ⊆ S(θ̂k). Combining with the first part, it follows that S(θ̂k) = Sk with probability tending to one.
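The control of ||T1||∞ above rests on a variance bound combined with a Gaussian maximal inequality. As a purely numerical illustration with made-up dimensions (the constant c and the size |Sk| below are not the paper's), the maximum of |Sk| centered Gaussians with variance c/n concentrates at the scale sqrt(2 c log|Sk| / n):

```python
# Monte Carlo illustration of the Gaussian maximal inequality; all numbers are
# hypothetical and chosen only to show the scaling.
import numpy as np

rng = np.random.default_rng(2)
n, d, c, reps = 1000, 50, 1.0, 2000           # d plays the role of |Sk|
sigma = np.sqrt(c / n)
maxima = np.abs(sigma * rng.standard_normal((reps, d))).max(axis=1)
print("empirical mean of the max:", float(maxima.mean()))
print("theoretical scale sqrt(2 c log d / n):", float(np.sqrt(2 * c * np.log(d) / n)))
```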
7.5. Proof of Lemma 6
We have that ∇f(A) = A⁻¹. Then
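Assuming f is the log-determinant, which is what the gradient ∇f(A) = A⁻¹ suggests (the definition of f is not restated here), the identity can be checked numerically by finite differences:

```python
# Finite-difference check of grad log det(A) = A^{-1} on a random positive
# definite matrix; illustration only.
import numpy as np

rng = np.random.default_rng(3)
p, eps = 4, 1e-6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)                   # a positive definite matrix

def f(A):
    return np.linalg.slogdet(A)[1]            # log det for a PD matrix

grad_fd = np.zeros((p, p))
for i in range(p):
    for j in range(p):
        E = np.zeros((p, p))
        E[i, j] = eps
        grad_fd[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print("max abs error vs A^{-1}:", float(np.max(np.abs(grad_fd - np.linalg.inv(A)))))
```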
Table 1.
Summary of symbols used throughout the paper

| Symbol | Meaning | Example |
|---|---|---|
| [n] | used to denote the set {1, …, n} | |
| [t1 : t2] | used to denote the set {t1, t1 + 1, …, t2 − 1} | |
| i | used for indexing related to samples | xi |
| j, k | used for indexing related to blocks | θa,j |
| a, b | used for indexing nodes in a graph | a, b ∈ V |
| G | the graph consisting of vertices and edges | G = (V, E) |
| V | the set of nodes in a graph | V = [p] |
| Ei | the set of edges at time i | |
| Xa | the component of a random vector X indexed by the vertex a | |
| | the vector of regression coefficients for sample i | |
| θa,j | the vector of regression coefficients for block j | |
| {Tj}j | the set of partition boundaries | |
| {τj}j | the set of boundary fractions | Tj = ⌊nτj⌋ |
| | an index set for the samples in partition j | ⊂ [n] |
| B | the number of partitions | |
| | the set of neighbors of node a in block j | |
| S(θa,j) | the set of non-zero elements of θa,j | |
| | the closure of a set | |
| | nodes not in the neighborhood of node a in block j | |
| \a | the set of all vertices excluding the vertex a | \a = [p]\{a} |
| \|·\| | cardinality of a set or absolute value | |
| Σ | the covariance matrix | |
| σab | an element of the covariance matrix | |
| Ω | the precision matrix | |
| ωab | an element of the precision matrix | |
| 〈·, ·〉 | the dot product | 〈a, b〉 = a′b |
| 〈〈·, ·〉〉 | the dot product between matrices | 〈〈A, B〉〉 = tr(A′B) |
| ξmin | the minimum change between regression coefficients | ‖θa,j − θa,j−1‖2 ≥ ξmin |
| θmin | the minimum size of a coefficient | |
| Δmin | the minimum size of a block | block length ≥ Δmin |
Acknowledgments
We are thankful to Zaïd Harchaoui for an early version of his manuscript [17] and many useful discussions. We thank Larry Wasserman and Ankur P. Parikh for providing comments on an early version of this work and many insightful suggestions. Furthermore, we are very grateful to the Associate Editor and two anonymous referees whose suggestions helped to tremendously improve the manuscript.
Appendix
Technical results
In this section we collect some technical results needed for the proofs presented in §7.
Lemma 7
Let {ζi}i∈[n] be a sequence of i.i.d. 𝒩(0, 1) random variables. If vn ≥ C log n for some constant C > 16, then
for some constant c1 > 0.
Proof
For any 1 ≤ l < r ≤ n, with r − l > vn we have
using (7.21). The lemma follows from an application of the union bound.
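The statement of lemma 7 is not reproduced in full above; the following Monte Carlo sketch only illustrates its flavor, under the assumed choice C = 20 and a hypothetical sample size: averages of squared i.i.d. 𝒩(0, 1) variables stay uniformly close to their mean 1 over every segment longer than vn = C log n.

```python
# Uniform control over all long segments; constants and thresholds are illustrative.
import numpy as np

rng = np.random.default_rng(4)
n, C = 1000, 20
vn = int(np.ceil(C * np.log(n)))
zeta = rng.standard_normal(n)
csum = np.concatenate(([0.0], np.cumsum(zeta ** 2)))

worst = 0.0
for l in range(n):
    for r in range(l + vn + 1, n):            # segments with r - l > v_n
        avg = (csum[r + 1] - csum[l]) / (r - l + 1)
        worst = max(worst, abs(avg - 1.0))

# The worst deviation is of the order sqrt(log n / v_n) = sqrt(1 / C).
print("largest deviation over all long segments:", round(worst, 3))
print("reference scale sqrt(log n / v_n):", round(float(np.sqrt(np.log(n) / vn)), 3))
```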
Lemma 8
Let {xi}i∈[n] be independent observations from (1.1) and let {εi}i∈[n] be independent 𝒩(0, 1) random variables. Assume that A1 holds. If vn ≥ C log n for some constant C > 16, then
for some constants c1, c2 > 0.
Proof
Let Σ1/2 denote the symmetric square root of the covariance matrix ΣSS and, for each i, let (i) index the block of the true partition containing sample i. With this notation, we can write xi = (Σ(i))1/2ui, where ui ~ 𝒩(0, I). For any l ≤ r in the same block we have
Conditioning on {εi}i, for each b ∈ [p] the corresponding quantity is a normal random variable with the stated variance. Hence, conditioned on {εi}i, its squared norm follows a scaled χ2 distribution and
where the last inequality follows from (7.21). Using lemma 7, for all l, r in the same block with r − l > vn, the relevant quantity is bounded by (1 + C)(r − l + 1) with probability at least 1 − exp(−c1 log n), which gives us the following bound
Lemma 9
Let {xi}i ∈ [n] be independent observations from (1.1). Assume that A1 holds. Then for any vn > p,
and
Proof
For any 1 ≤ l < r ≤ n, with r − l ≥ vn we have
using (7.18), convexity of Λmax(·) and A1. The lemma follows from an application of the union bound. The other inequality follows using a similar argument.
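The exact constants and the event in lemma 9 are likewise not reproduced; the sketch below is only a numerical illustration, under an assumed population covariance, that segments longer than vn > p have empirical covariances whose extreme eigenvalues stay within a constant factor of the population ones.

```python
# Segment-wise eigenvalue check; Sigma, n, p and the segment length are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n, p = 2000, 10
A = rng.standard_normal((p, p))
Sigma = A @ A.T / p + np.eye(p)               # an assumed population covariance
L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T

vn = 5 * p                                    # segment length, with vn > p
lmax, lmin = [], []
for l in range(0, n - vn, vn):
    seg = X[l:l + vn]
    S_hat = seg.T @ seg / vn                  # empirical covariance of the segment
    ev = np.linalg.eigvalsh(S_hat)
    lmin.append(ev[0])
    lmax.append(ev[-1])

print("population eigenvalue range:", np.linalg.eigvalsh(Sigma)[[0, -1]])
print("worst segment eigenvalue range:", min(lmin), max(lmax))
```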
Proof of Proposition 3
The proof follows the main ideas already given in the proof of theorem 2; we provide only a sketch.
Given an upper bound on the number of partitions Bmax, we are going to perform the analysis on the event {B̂ ≤ Bmax}. Since
we are going to focus on the probability that the distance between the estimated and the true partition boundaries exceeds nδn, conditionally on the estimated partition having B′ + 1 boundaries, for B′ > B (for B′ = B it follows from theorem 2 that this distance is below nδn with high probability). Let us define the following events
Using the above events, we have the following bound
The probabilities of the above events can be bounded using the same reasoning as in the proof of theorem 2, by repeatedly using the KKT conditions given in (3.2). In particular, we can use the strategy used to bound the event An,j,2. Since the proof is technical and does not reveal any new insight, we omit the details.
A collection of known results
This section collects some known results that we have used in the paper. We start by collecting some results on the eigenvalues of random matrices. Let xi ~ 𝒩(0, Σ), i ∈ [n], and let Σ̂ = n−1 Σi xi(xi)′ be the empirical covariance matrix. Denote the elements of the covariance matrix Σ by [σab] and those of the empirical covariance matrix Σ̂ by [σ̂ab].
Using standard results on concentration of spectral norms and eigenvalues [10], [38] derives the following two crude bounds that can be very useful. Under the assumption that p < n,
| (7.18) |
| (7.19) |
From Lemma A.3 in [6] we have the following bound on the elements of the empirical covariance matrix
| (7.20) |
where c1 and c2 are positive constants that depend only on Λmax(Σ) and ε0.
Next, we use the following tail bound for the χ2 distribution from [25], which holds for all ε > 0,
| (7.21) |
Footnotes
We emphasize that the independence is only present when each instance of the latent time-varying model is given. In practice, such models are unknown, and therefore marginally the samples are dependent. Furthermore, the instances of the latent evolving models that generate the samples are not independent, as will become clear later in the presentation.
Contributor Information
Mladen Kolar, Email: mladenk@cs.cmu.edu.
Eric P. Xing, Email: epxing@cs.cmu.edu.
References
- 1. Ahmed A, Xing EP. Recovering time-varying networks of dependencies in social and biological studies. Proc Natl Acad Sci. 2009;106(29):11878–11883. doi: 10.1073/pnas.0901910106.
- 2. Bai J, Perron P. Estimating and testing linear models with multiple structural changes. Econometrica. 1998;66(1):47–78.
- 3. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research. 2008;9:485–516.
- 4. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J Mach Learn Res. 2008;9:485–516.
- 5. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J Imaging Sci. 2009;2(1):183–202.
- 6. Bickel PJ, Levina E. Regularized estimation of large covariance matrices. Ann Stat. 2008;36(1):199–227.
- 7. Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
- 8. Brucker P. An O(n) algorithm for quadratic knapsack problems. Operations Research Letters. 1984;3(3):163–166.
- 9. Bunea F. Honest variable selection in linear and logistic regression models via ℓ1 and ℓ1 + ℓ2 penalization. Electron J Stat. 2008;2:1153.
- 10. Davidson KR, Szarek SJ. Local operator theory, random matrices and Banach spaces. Handbook of the Geometry of Banach Spaces. 2001;1:317–366.
- 11. Dempster AP. Covariance selection. Biometrics. 1972;28(1):157–175.
- 12. Fan J, Feng Y, Wu Y. Network exploration via the adaptive LASSO and SCAD penalties. Ann Appl Stat. 2009;3(2):521–541. doi: 10.1214/08-AOAS215SUPP.
- 13. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
- 14. Getoor L, Taskar B. Introduction to Statistical Relational Learning (Adaptive Computation and Machine Learning). The MIT Press; 2007.
- 15. Guo J, Levina E, Michailidis G, Zhu J. Joint structure estimation for categorical Markov networks. Technical report, Department of Statistics, University of Michigan; 2010.
- 16. Guo J, Levina E, Michailidis G, Zhu J. Joint estimation of multiple graphical models. Biometrika. 2010, to appear. doi: 10.1093/biomet/asq060.
- 17. Harchaoui Z, Lévy-Leduc C. Multiple change-point estimation with a total-variation penalty. J Am Stat Assoc. 2010;105(492):1480–1493.
- 18. Hastie T, Tibshirani R. Varying-coefficient models. J R Stat Soc B. 1993;55(4):757–796.
- 19. Kolar M, Xing EP. Sparsistent estimation of time-varying discrete Markov random fields. Technical report, Machine Learning Department, Carnegie Mellon University; 2009. Available at arXiv:0907.2337.
- 20. Kolar M, Parikh AP, Xing EP. On sparse nonparametric conditional covariance selection. Proc 27th Ann Int'l Conf Machine Learn; 2010.
- 21. Kolar M, Song L, Ahmed A, Xing EP. Estimating time-varying networks. Ann Appl Stat. 2010;4(1):94–123.
- 22. Lauritzen SL. Graphical Models (Oxford Statistical Science Series). Oxford University Press; 1996.
- 23. Li H, Gui J. Gradient directed regularization for sparse Gaussian concentration graphs, with applications to inference of genetic networks. Biostatistics. 2006;7(2):302. doi: 10.1093/biostatistics/kxj008.
- 24. Liu J, Wu S, Zidek JV. On segmented multivariate regression. Stat Sin. 1997;7:497–526.
- 25. Lounici K, Pontil M, Tsybakov AB, van de Geer S. Taking advantage of sparsity in multi-task learning. Proc Conf Learning Theory; 2009.
- 26. Mairal J, Bach F, Ponce J, Sapiro G. Online learning for matrix factorization and sparse coding. The Journal of Machine Learning Research. 2010;11:19–60.
- 27. Mammen E, van de Geer S. Locally adaptive regression splines. Ann Stat. 1997;25(1):387–413.
- 28. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. Ann Stat. 2006;34(3):1436–1462.
- 29. Nesterov Y. Smooth minimization of non-smooth functions. Math Program. 2005;103(1):127–152.
- 30. Nesterov Y. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain; 2007.
- 31. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. J Am Stat Assoc. 2009;104(486):735–746. doi: 10.1198/jasa.2009.0126.
- 32. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Technical report, Department of Statistics, University of California, Berkeley; 2008.
- 33. Ravikumar P, Wainwright MJ, Lafferty JD. High-dimensional Ising model selection using ℓ1-regularized logistic regression. Ann Stat. 2010;38(3):1287–1319.
- 34. Rinaldo A. Properties and refinements of the fused lasso. Ann Stat. 2009;37(5):2922–2952.
- 35. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electron J Stat. 2008;2:494–515.
- 36. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J R Stat Soc B. 2005;67(1):91–108.
- 37. van de Geer S, Bühlmann P. On the conditions used to prove oracle results for the lasso. Electron J Stat. 2009;3:1360–1392.
- 38. Wainwright MJ. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (lasso). IEEE T Inform Theory. 2009;55(5):2183–2202.
- 39. Wainwright MJ, Jordan MI. Graphical models, exponential families, and variational inference. Found Trends Mach Learn. 2008;1(1–2):1–305.
- 40. Wang P, Chao DL, Hsu L. Learning networks from high dimensional binary data: an application to genomic instability data. Biometrics. 2009, to appear.
- 41. Yin J, Geng Z, Li R, Wang H. Nonparametric covariance model. Statistica Sinica. 2010;20:469–479.
- 42. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
- 43. Zhao P, Yu B. On model selection consistency of lasso. J Mach Learn Res. 2006;7:2541–2563.
- 44. Zhou S, Lafferty J, Wasserman L. Time varying undirected graphs. Proc Conf Learning Theory. 2008:455–466.