Skip to main content
NIHPA Author Manuscripts logoLink to NIHPA Author Manuscripts
. Author manuscript; available in PMC: 2017 Jan 27.
Published in final edited form as: J Comput Graph Stat. 2016 Nov 10;25(4):1272–1296. doi: 10.1080/10618600.2016.1164533

Accelerated Path-following Iterative Shrinkage Thresholding Algorithm with Application to Semiparametric Graph Estimation

Tuo Zhao *, Han Liu
PMCID: PMC5271586  NIHMSID: NIHMS752300  PMID: 28133430

Abstract

We propose an accelerated path-following iterative shrinkage thresholding algorithm (APISTA) for solving high dimensional sparse nonconvex learning problems. The main difference between APISTA and the path-following iterative shrinkage thresholding algorithm (PISTA) is that APISTA exploits an additional coordinate descent subroutine to boost the computational performance. Such a modification, though simple, has profound impact: APISTA not only enjoys the same theoretical guarantee as that of PISTA, i.e., APISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties, but also significantly outperforms PISTA in empirical benchmarks. As an application, we apply APISTA to solve a family of nonconvex optimization problems motivated by estimating sparse semiparametric graphical models. APISTA allows us to obtain new statistical recovery results which do not exist in the existing literature. Thorough numerical results are provided to back up our theory.

1 Introduction

High dimensional data challenge both statistics and computation. In the statistics community, researchers have proposed a large family of regularized M-estimators, including Lasso, Group Lasso, Fused Lasso, Graphical Lasso, Sparse Inverse Column Operator, Sparse Multivariate Regression, Sparse Linear Discriminant Analysis (Tibshirani, 1996; Zou and Hastie, 2005; Yuan and Lin, 2005, 2007; Banerjee et al., 2008; Tibshirani et al., 2005; Jacob et al., 2009; Fan et al., 2012; Liu and Luo, 2015; Han et al., 2012; Liu et al., 2015). Theoretical analysis of these methods usually rely on the sparsity of the parameter space and requires the resulting optimization problems to be strongly convex over a restricted parameter space. More details can be found in Meinshausen and Bühlmann (2006); Zhao and Yu (2006); Zou (2006); Rothman et al. (2008); Zhang and Huang (2008); Van de Geer (2008); Zhang (2009); Meinshausen and Yu (2009); Wainwright (2009); Fan et al. (2009); Zhang (2010a); Ravikumar et al. (2011); Liu et al. (2012a); Negahban et al. (2012); Han et al. (2012); Kim and Kwon (2012); Shen et al. (2012). In the optimization community, researchers have proposed a large variety of computational algorithms including the proximal gradient methods (Nesterov, 1988, 2005; NESTEROV, 2013; Beck and Teboulle, 2009b,a; Zhao and Liu, 2012; Liu et al., 2015) and coordinate descent methods (Fu, 1998; Friedman et al., 2007; Wu and Lange, 2008; Friedman et al., 2008; Meier et al., 2008; Liu et al., 2009; Friedman et al., 2010; Qin et al., 2010; Mazumder et al., 2011; Breheny and Huang, 2011; Shalev-Shwartz and Tewari, 2011; Zhao et al., 2014c).

Recently, Wang et al. (2014) propose the path-following iterative soft shrinkage thresholding algorithm (PISTA), which combines the proximal gradient algorithm with path-following optimization scheme. By exploiting the solution sparsity and restricted strong convexity, they show that PISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties for solving a large class of sparse nonconvex learning problems. However, though the PISTA has superior theoretical properties, it is empirical performance is in general not as good as some heuristic competing methods such as the path-following coordinate descent algorithm (PCDA) (Tseng and Yun, 2009b,a; Lu and Xiao, 2013; Friedman et al., 2010; Mazumder et al., 2011; Zhao et al., 2012, 2014a). To address this concern, we propose a new computational algorithm named APISTA (Accelerated Path-following Iterative Shrinkage Thresholding Algorithm). More specifically, we exploit an additional coordinate descent subroutine to assist PISTA to efficiently decrease the objective value in each iteration. This makes APISTA significantly outperform PISTA in practice. Meanwhile, the coordinate descent subroutine preserves the solution sparsity and restricted strong convexity, therefore APISTA enjoys the same theoretical guarantee as those of PISTA, i.e., APISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties. As an application, we apply APISTA to a family of nonconvex optimization problems motivated by estimating semiparametric graphical models (Liu et al., 2012b; Zhao and Liu, 2014). PISTA allows us to obtain new sparse recovery results on graph estimation consistency which has not been established before. Thorough numerical results are presented to back up our theory.

NOTATIONS

Let υ = (υ1, …, υd)T ∈ ℝd, we define ‖υ1 = ∑jj|, υ22=jυj2, and ‖υ = maxjj|. We denote the number of nonzero entries in υ as ‖υ0 = ∑j 𝟙(υj ≠ 0). We define the soft-thresholding operator as 𝒮λ(υ)=[sign(υj)·(|υj|λ)]j=1d for any λ ≥ 0. Given a matrix A ∈ ℝd×d, we use A*j = (A1j, …, Adj)T to denote the jth column of A, and Ak* = (Ak1, …, Akd)T to denote the kth row of A. Let Λmax(A) and Λmin(A) denote the largest and smallest eigenvalues of A. Let ψ1(A), …, ψd(A) be the singular values of A, we define the following matrix norms: AF2=jA*j22, ‖Amax = maxjA*j, ‖A1 = maxjA*j1, ‖A2 = maxj ψj(A), ‖A = maxkAk*1. We denote υ\j = (υ1, …, υj−1, υj+1, …, υd)T ∈ ℝd−1 as the subvector of υ with the jth entry removed. We denote A\i\j as the submatrix of A with the ith row and the jth column removed. We denote Ai\j to be the ith row of A with its jth entry removed. Let 𝒜 ⊆ {1, …, d}, we use υ𝒜 to denote a subvector of υ by extracting all entries of υ with indices in 𝒜, and A𝒜𝒜 to denote a submatrix of A by extracting all entries of A with both row and column indices in 𝒜.

2 Background and Problem Setup

Let θ*=(θ1*,,θd*)T be a parameter vector to be estimated. We are interested in solving a class of regularized optimization problems in a generic form:

minθd(θ)+λ(θ)λ(θ), (2.1)

where ℒ(θ) is a smooth loss function and ℛλ(θ) is a nonsmooth regularization function with a regularization parameter λ.

2.1 Sparsity-inducing Nonconvex Regularization Functions

For high dimensional problems, we exploit sparsity-inducing regularization functions, which are usually continuous and decomposable with respect to each coordinate, i.e., λ(θ)=j=1drλ(θj). For example, the widely used ℓ1 norm regularization decomposes as λθ1=j=1dλ|θj|. One drawback of the ℓ1 norm is that it incurs large estimation bias when |θj*| is large. This motivates the usage of nonconvex regularizers. Examples include the SCAD (Fan and Li, 2001) regularization

rλ(θj)=λ|θj|·𝟙(|θj|λ)θj22λβ|θj|+λ22(β1)·𝟙(λ<|θj|λβ)+(β+1)λ22·𝟙(|θj|>λβ) for β>2,

and MCP (Zhang, 2010a) regularization

rλ(θj)=λ(|θj|θj22λβ)·𝟙(|θj|<λβ)+λ2β2·𝟙(|θj|λβ) for β>1.

Both SCAD and MCP can be written as the sum of an ℓ1 norm and a concave function ℋλ(θ), i.e., ℛλ(θ) = λ‖θ1 + ℋλ(θ). It is easy to see that λ(θ)=j=1dhλ(θj) is also decomposable with respect to each coordinate. More specifically, the SCAD regularization has

hλ(θj)=2λ|θj|θj2λ22(β1)·𝟙(λ<|θj|λβ)+(β+1)λ22λ|θj|2·𝟙(|θj|>λβ),
hλ(θj)=λsign(θj)θjβ1·𝟙(λ<|θj|λβ)λsign(θj)·𝟙(|θj|>λβ),

and the MCP regularization has

hλ(θj)=θj22β·𝟙(|θj|<λβ)+λ2β2λ|θj|2·𝟙(|θj|λβ),
hλ(θj)=θjβ·𝟙(|θj|λβ)λsign(θj)·𝟙(|θj|>λβ).

In general, the concave function hλ(·) is smooth and symmetric about zero with hλ(0) = 0 and hλ(0)=0. Its gradient hλ(·) is monotone decreasing and Lipschitz continuous, i.e., for any θj>θj, there exists a constant α ≥ 0 such that

α(θjθj)hλ(θj)hλ(θj)0. (2.2)

Moreover, we require hλ(θj)=λsign(θj) if |θj| ≥ λβ, and hλ(θj)(λ,0) if |θj| ≤ λβ.

It is easy to verify that both SCAD and MCP satisfy the above properties. In particular, the SCAD regularization has α = 1/(β − 1), and the MCP regularization has α = 1/β. These nonconvex regularization functions have been shown to achieve better asymptotic behavior than the convex ℓ1 regularization. More technical details can be found in Fan and Li (2001); Zhang (2010a, b); Zhang and Zhang (2012); Fan et al. (2014); Xue et al. (2012); Wang et al. (2014, 2013); Liu et al. (2014). We present several illustrative examples of the nonconvex regularizers in Figure 2.1.

Figure 2.1.

Figure 2.1

Two illustrative examples of the nonconvex regularization functions: SCAD and MCP. Here we choose λ = 1 and β = 2.01 for both SCAD and MCP.

2.2 Nonconvex Loss Function

A motivating application of the method proposed in this paper is sparse transelliptical graphical model estimation (Liu et al., 2012b). The transelliptical graphical model is a semiparametric graphical modeling tool for exploring the relationships between a large number of variables. We start with a brief review the transelliptical distribution defined below.

Definition 2.1 (Transelliptical Distribution)

Let {fj}j=1d be a set of strictly increasing univariate functions. Given a positive semidefinite matrix Σ* ∈ ℝd×d with rank(Σ*) = rd and Σjj*=1 for j = 1, …, d, we say that a d-dimensional random vector X = (X1, …, Xd)T follows a transelliptical distribution, denoted as X ~ TEd(Σ*, ξ, f1, …, fd), if X has a stochastic representation

(f1(X1),,fd(Xd))T=dξAU,

where Σ* = AAT, U ∈ 𝕊r − 1 is uniformly distributed on the unit sphere in ℝr, and ξ ≥ 0 is a continuous random variable independent of U.

Note that Σ* in Definition 2.1 is not necessarily the correlation matrix of X. To interpret Σ*, Liu et al. (2012b) provide a latent Gaussian representation for the transelliptical distribution, which implies that the sparsity pattern of Θ* = (Σ*)−1 encodes the graph structure of some underlying Gaussian distribution. Since Σ* needs to be invertible, we have r = d. To estimate Θ*, Liu et al. (2012b) suggest to directly plug in the following transformed Kendall’s tau estimator into existing gaussian graphical model estimation procedures.

Definition 2.2 (Transformed Kendall’s tau Estimator)

Let x1, …, xn ∈ ℝd be n independent observations of X = (X1, …, Xd)T, where xi = (xi1, …, xid)T. The transformed Kendall’s tau estimator Ŝ ∈ ℝd×d is defined as Ŝ=[Ŝkj]=[sin(π2τ^kj)], where τ̂kj is the empirical Kendall’s tau statistic between Xk and Xj defined as

τ^kj={2n(n1)i<isign ((xijxij)(xikxik))ifjk,1otherwise.

We then adopt the sparse column inverse operator to estimate the jth column of Θ*. In particular, we solve the following regularized quadratic optimization problem (Liu and Luo, 2015),

minΘ*jd12Θ*jTŜΘ*jI*jTΘ*j+λ(Θ*j) forj=1,,d. (2.3)

For notational simplicity, we omit the column index j in (2.3), and denote Θ*j and I*j by θ and e respectively. Throughout the rest of this paper, if not specified, we study the following optimization problem for the transelliptical graph estimation

minθd12θTŜθeTθ+λ(θ). (2.4)

The quadratic loss function used in (2.4) is twice differentiable with

(θ)=Ŝθe,2(θ)=Ŝ.

Since the transformed Kendall’s tau estimator is rank-based and could be indefinite (Zhao et al., 2014b), the optimization in (2.3) may not be convex even if ℛλ(θ) is a convex.

Remark 2.1

It is worth mentioning that the indefiniteness of Ŝ also makes(2.3) unbounded from below, but as will be shown later, our proposed algorithm can still guarantee a unique sparse local solution with optimal statistical properties under suitable solutions.

Remark 2.2

To handle the possible nonconvexity, Liu et al. (2012b) estimate Θ*j* based on a graphical model estimation procedure proposed in Cai et al. (2011) as follows,

minΘ*jdΘ*j1 subject to ŜΘ*jI*jλ forj=1,,d. (2.5)

(2.5) is convex regardless the indefiniteness of Ŝ. But a major disadvantage of (2.5) is the computation. Existing solvers can only solve (2.5) up to moderate dimensions. We will present more empirical comparison between (2.3) and (2.5) in our numerical experiments.

3 Method

For notational convenience, we rewrite the objective function ℱλ(θ) as

λ(θ)=(θ)+λ(θ)˜λ(θ)+λθ1.

We call ℒ̃λ(θ) the augmented loss function, which is smooth but possibly nonconvex. We first introduce the path-following optimization scheme, which is a multistage optimization framework and also used in PISTA.

3.1 Path-following Optimization Scheme

The path-following optimization scheme solves the regularized optimization problem (2.1) using a decreasing sequence of N + 1 regularization parameters {λK}K=0N, and yields a sequence of N + 1 output solutions {θ^{K}}K=0N from sparse to dense. We set the initial tuning parameter as λ0 = ‖∇ℒ(0)‖. By checking the KKT condition of (2.1) for λ0, we have

minξ01˜λ(0)+λ0ξ=minξ01(0)+λ(0)+λ0ξ=0, (3.1)

where the second equality comes from ‖ξ ≤ 1 and λ(0)=(hλ(0),hλ(0),,hλ(0))T=0 as introduced in §2.1. Since (3.1) indicates that 0 is a local solution to (2.1) for λ0, we take the leading output solution as θ̂{0} = 0. Let η ∈ (0, 1), we set λK = ηλK − 1 for K = 1, …, N. We then solve (2.1) for the regularization parameter λK with θ̂{K − 1} as the initial solution, which leads to the next output solution θ̂{K}. The path-following optimization scheme is illustrated in Algorithm 1.

3.2 Accelerated Iterative Shrinkage Thresholding Algorithm

We then explain the accelerated iterative shrinkage thresholding (AISTA) subroutine, which solves (2.1) in each stage of the path-following optimization scheme. For notational simplicity, we omit the stage index K, and only consider the iteration index m of AISTA. Suppose that AISTA takes some initial solution θ[0] and an initial step size parameter L[0], and we want to solve (2.1) with the regularization parameter λ. Then at the mth iteration of AISTA, we already have L[m] and θ[m]. Each iteration of AISTA contains two steps: The first one is the proximal gradient descent iteration, and the second one is the coordinate descent subroutine.

Algorithm 1.

Path-following optimization. It solves the problem (2.1) using a decreasing sequence of regularization parameters {λK}K=0N. More specifically, λ0 = ‖ℒ(0)‖ yields an all zero output solution θ̂{0} = 0. For K = 1, …, N, we set λK = ηλK − 1, where η ∈ (0, 1). We solve (2.1) for λK with θ̂{K − 1} as an initial solution. Note that AISTA is the computational algorithm for obtaining θ̂K + 1 using θ̂K as the initial solution. Lmin and {L^{K}}K=0N are corresponding step size parameters. More technical details on AISTA are presented are Algorithm 3.

Algorithm: {θ^(K)}K=0NAPISTA({λK}K=0N)
Parameter: η, Lmin
Initialize: λ0 = ‖∇ℒ(0)‖, θ̂{0}0, {0}Lmin
For: K = 0, …., N − 1
  λK+1 ← ηλK, {θ̂{K+1}, {K+1}} ← AISTAK+1, θ̂{K}, {K})
End for
Output: {θ^(K)}K=0N

(I) Proximal Gradient Descent Iteration

We consider the following quadratic approximation of ℱλ(θ) at θ = θ[m],

𝒬λ,L[m+1](θ;θ[m])=˜λ(θ[m])+(θθ[m])T˜λ(θ[m])+L[m+1]2θθ[m]22+λθ1,

where L[m+1] is the step size parameter such that 𝒬λ,L[m+1] (θ; θ[m]) ≥ ℱλ(θ). We then take a proximal gradient descent iteration and obtain θ[m] by

θ[m+0.5]=argminθd𝒬λ,L[m+1](θ;θ[m])=argminθdL[m+1]2θθ˜[m]22+λθ1, (3.2)

where θ̃[m] = θ[m] − ∇ℒ̃λ(θ[m])/L[m+1]. For notational simplicity, we write

θ[m+0.5]=𝒯λ,L[m+1](θ[m])). (3.3)

For sparse column inverse operator, we can obtain a closed form solution to (3.2) by soft thresholding

θ[m+0.5]=𝒮λ/L[m+1](θ˜[m+0.5]).

The step size 1/L[m+1] can be obtained by the backtracking line search. In particular, we start with a small enough L[0]. Then in each iteration of the middle loop, we choose the minimum nonnegative integer z such that L[m+1] = 2zL[m] satisfies

λ(θ[m+0.5])𝒬λ,L[m+1](θ[m+0.5];θ[m]) for m=0,1,2,. (3.4)

(II) Coordinate Descent Subroutine

Unlike the proximal gradient algorithm which repeats (3.3) until convergence at each stage of the path-following optimization scheme, AISTA exploits an additional coordinate descent subroutine to further boost the computational performance. More specifically, we define 𝒜={j|θj[m+0.5]=0} and solve the following optimization problem

minθλ(θ) subject to θ𝒜=0 (3.5)

using the cyclic coordinate descent algorithm (CCDA) initiated by θ[m+0.5]. For notational simplicity, we omit the stage index K and iteration index m, and only consider the iteration index t of CCDA. Suppose that the CCDA algorithm takes some initial solution θ(0) for solving (2.1) with the regularization parameter λ. Without loss of generality, we denote 𝒜 = {1, …, |𝒜|}. At the tth iteration, we have θ(t). Then at the (t + 1)th iteration, we conduct the coordinate minimization cyclically over all active coordinates. Let w(t+1,k) be an auxiliary solution of the (t + 1)th iteration with the first k − 1 coordinates updated. For k = 1, we have w(t+1,1) = θ(t). We then update the kth coordinate to obtain the next auxiliary solution w(t+1,k+1).

More specifically, let ∇kℒ̃λ(θ) be the kth entry of ∇ℒ̃λ(θ). We minimize the objective function with respect to each selected coordinate and keep all other coordinates fixed,

wk(t+1,k+1)=argminθkλ(θk;w\k(t+1,k))+rλ(θk). (3.6)

Once we obtain wk(t+1,k+1), we can set w\k(t+1,k+1)=w\k(t+1,k) to obtain the next auxiliary solution w(t+1,k+1). For sparse column inverse operator, let w˜k(t+1,k)=ekŜ\kkTw\k(t+1,k), we have

wk(t+1,k+1)=argminθk12Ŝkkθk2+ekŜ\kkTw\k(t+1,k)θkekθk+rλ(θk)=argminθk12(θkw˜k(t+1,k))2+rλ(θk), (3.7)

where the last equality comes from the fact Ŝkk = 1 for all k = 1, …, d. By setting the subgradient of (3.7) equal to zero, we can obtain wk(t+1,k+1) as follows:

  • For the ℓ1 norm regularization, we have wk(t+1,k+1)=𝒮λ(w˜k(t+1,k)).

  • For the SCAD regularization, we have
    wk(t+1,k+1)={w˜k(t+1,k)if|w˜k(t+1,k)|γλ,𝒮γλ/(γ1)(w˜k(t+1,k))11/(γ1)if|w˜k(t+1,k)|[2λ,γλ),𝒮λ(w˜k(t+1,k))if|w˜k(t+1,k)|<2λ.
  • For the MCP regularization, we have
    wk(t+1,k+1)={w˜k(t+1,k)if|w˜k(t+1,k)|γλ,𝒮λ(w˜k(t+1,k))11/γif|w˜k(t+1,k)|<γλ.

When all |𝒜| coordinate updates in the (t + 1)th iteration of CCDA finish, we set θ(t+1) = w(t+1,|𝒜|+1). We summarize CCDA in Algorithm 2. Once CCDA terminates, we denote its output solution by θ[m+1], and start the next iteration of AISTA. We summarize AISTA in Algorithm 3.

Algorithm 2.

The cyclic coordinate descent algorithm (CCDA). The cyclic coordinate descent algorithm cyclically iterates over the support of the initial solution. Without loss of generality, we assume 𝒜 = {1, …, |𝒜|}.

Algorithm: θ̂CCDA(λ, θ(0)).
Initialize: t ← 0, 𝒜 = supp(θ(0))
Repeat:
  w(t+1,1)θ(t)
  For k = 1, …, |𝒜|
    wk(t+1,k+1)argminθkλ(θk;w\k(t+1,k))+rλ(θk) and w\k(t+1,k+1)w\k(t+1,k)
  End for
  θ(t+1)w(t+1,|𝒜|+1), tt + 1
Until convergence
θ̂θ(t)

Remark 3.1

The backtracking line search procedure in PISTA has been extensively studied in existing optimization literature on the adaptive step size selection (Dennis and Schnabel, 1983; Nocedal and Wright, 2006), especially for proximal gradient algorithms (Beck and Teboulle, 2009b,a; NESTEROV, 2013). Many empirical results have corroborated better computational performance than that using a fix step size. But unlike the classical proximal gradient algorithms, APISTA can efficiently reduce the objective value by the coordinate descent subroutine in each iteration. Therefore we can simply choose a constant step size parameter L such that

LsupθdΛmax(2(θ)). (3.8)

The step size parameter L in (3.8) guarantees 𝒬λ,L(θ; θ[m]) ≥ ℱλ(θ) in each iteration of AISTA. For sparse column inverse operator, ∇2ℒ(θ) = Ŝ does not depend on θ. Therefore we choose

Algorithm 3.

The accelerated iterative shrinkage thresholding algorithm (AISTA). Within each iteration, we exploit an additional coordinate descent subroutine to improve the empirical computational performance.

Algorithm: {θ̂, } ← AISTA(λ, θ[0], L[0])
Initialize: m ← 0
Repeat:
  z ← 0
  Repeat:
    L [m+1] ← 2z L[m], θ[m+0.5] ← 𝒯λ,Ω,L[m+1] (θ[m]), zz + 1
  Until: 𝒬λ,L[m+1] (θ[m+0.5]; θ[m]) ≥ ℱλ(θ[m+1])
  θ[m+1]CCDA(λ, θ[m+0.5]), mm + 1
Until convergence
θ̂θ[m−0.5], L[m]
Output: {θ̂, }

L = Λmax(Ŝ). Our numerical experiments show that choosing a fixed step not only simplifies the implementation, but also attains better empirical computational performance than the backtracking line search. See more details in §5.

3.3 Stopping Criteria

Since θ is a local minimum if and only if the KKT condition minξ∈∂‖θ1 ‖∇ℒ̃λ(θ) + λξ = 0 holds, we terminate AISTA when

ωλ(θ[m+0.5])=minξθ[m+0.5]1˜λ(θ[m+0.5])+λξε, (3.9)

where ε is the target precision and usually proportional to the regularization parameter. More specifically, given the regularization parameter λK, we have

εK=δKλK for K=1,,N, (3.10)

where δK ∈ (0, 1) is a convergence parameter for the Kth stage of the path-following optimization scheme. Moreover, for CCDA, we terminate the iteration when

θ(t+1)θ(t)22δ02λ2, (3.11)

where δ0 ∈ (0, 1) is a convergence parameter. This stopping criterion is natural to the sparse coordinate descent algorithm, since we only need to calculate the value change of each coordinate (not the gradient). We will discuss how to choose δK’s and δ0 in §4.1.

4 Theory

Before we present the computational and statistical theories of APISTA, we introduce some additional assumptions. The first one is about the choice of regularization parameters.

Assumption 4.1

Let δK’s and η saitsify

η[0.9,1) and max0KNδKδmax=1/4,

where η is the rescaling parameter of the path-following optimization scheme, δK’s are the convergence parameters defined in (3.10), and δ0 is the convergence parameter defined in (3.11). We have the regularization parameters

λ0>λ1>λN8(θ*).

Assumption 4.1 has been extensively studied in existing literature on high dimensional statistical theory of the regularized M-estimators (Rothman et al., 2008; Zhang and Huang, 2008; Negahban and Wainwright, 2011; Negahban et al., 2012). It requires the regularization parameters to be large enough such that irrelevant variables can be eliminated along the solution path. Though ‖∇ℒ(θ*)‖ cannot be explicitly calculated (θ* is unknown), we can exploit concentration inequalities to show that Assumption 4.1 holds with high probability (Ledoux, 2005). In particular, we will verify Assumption 4.1 for sparse transellpitical graphical model estimation in Lemma 4.8.

Before we proceed with our second assumption, we define the largest and smallest s-sparse eigenvalues of the Hessian matrix of the loss function as follows.

Definition 4.1

Given an integer s ≥ 1, we define the largest and smallest s-sparse eigenvalues of ∇2ℒ(θ) as

  • Largest s-Sparse Eigenvalue : ρ+(s)=supυd,υ0sυT2(θ)υυ22,

  • Smallest s-Sparse Eigenvalue : ρ(s)=infυd,υ0sυT2(θ)υυ22.

Moreover, we define ρ̃(s) = ρ(s) − α and ρ+(s) = ρ+(s) for notational simplicity, where α is defined in (2.2).

The next lemma shows the connection between the sparse eigenvalue conditions and restricted strongly convex and smooth conditions.

Lemma 4.1

Given ρ(s) > 0, for any θ, θ′ ∈ ℝd with |supp(θ) ∪ supp(θ′)| ≤ s, we have

(θ)(θ)+(θθ)T(θ)+ρ+(s)2θθ22,
(θ)(θ)+(θθ)T(θ)+ρ(s)2θθ22.

Moreover, if ρ(s) > α, then we have

˜λ(θ)˜λ(θ)+(θθ)T˜λ(θ)+ρ+(s)2θθ22,
˜λ(θ)˜λ(θ)+(θθ)T˜λ(θ)+ρ˜(s)2θθ22,

and for any ξ ∈ ∂‖θ1,

λ(θ)λ(θ)+(˜λ(θ)+λξ)T(θθ)+ρ˜(s)2θθ22.

The proof of Lemma 4.1 is provided in Wang et al. (2014), therefore omitted. We then introduce the second assumption.

Assumption 4.2

Given ‖θ*0s*, there exists an integer satisfying

s˜(144κ2+250κ)·s*,ρ+(s*+2s˜)<+, and ρ˜(s*+2s˜)>0,

where κ = ρ+(s + 2)ρ̃(s + 2).

Assumption 4.2 requires that ℒ̃λ(θ) satisfies the strong convexity and smoothness when θ is sparse. As will be shown later, APISTA can always guarantee the number of irrelevant coordinates with nonzero values not to exceed . Therefore the restricted strong convexity is preserved along the solution path. We will verify that Assumption 4.2 holds with high probability for the transellpitical graphical model estimation in Lemma 4.9.

Remark 4.2 (Step Size Initialization)

We take the initial step size parameter as Lmin ≥ ρ+(1). For sparse column inverse operator, we directly choose Lmin = ρ+(1) = 1.

4.1 Computational Theory

We develop the computational theory of APISTA. For notational simplicity, we define 𝒮={j|θj*0} and 𝒮={j|θj*=0} for characterizing the the solution sparsity. We first start with the convergence analysis for the cyclic coordinate descent algorithm (CCDA). The next theorem presents its rate of convergence in term of the objective value.

Theorem 4.3 (Geometric Rate of Convergence of CCDA)

Suppose that Assumption 4.2 holds. Given a sparse initial solution satisfying θ𝒮(0)0s˜, (3.5) is a strongly convex optimization problem with a unique global minimizer θ̄. Moreover, for t = 1, 2…, we have

λ(θ(t))λ(θ¯)((s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜))t[λ(θ(0))λ(θ¯)].

The proof of Theorems 4.3 is provided in Appendix A. Theorem 4.3 suggests that when the initial solution is sparse, CCDA essentially solves a strongly convex optimization problem with a unique global minimizer. Consequently we can establish the geometric rate of convergence in term of the objective value for CCDA. We then proceed with the convergence analysis of AISTA. The next theorem presents its theoretical rate of convergence in term of the objective value.

Theorem 4.4 (Geometric Rate of Convergence of AISTA)

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if the initial solution θ[0] satisfies

θ𝒮[0]0s˜,ωλ(θ[0])λ/2, (4.1)

then we have θ𝒮(m)0s˜ for m = 0.5, 1, 1.5, 2, …. Moreover, for m = 1, 2, …, we have

λ(θ[m])λ(θ¯λ)(118κ)m[λ(θ(0))λ(θ¯)],

where θ̄λ is a unique sparse local solution to (2.1) satisfying ωλ(θ̄λ) = 0 and θ¯𝒮λ0s˜.

The proof of Theorem 4.4 is provided in Appendix B. Theorem 4.4 suggests that all solutions of AISTA are sparse such that the restricted strongly convex and smooth conditions hold for all iterations. Therefore, AISTA attains the geometric rate of convergence in term of the objective value. Theorem 4.4 requires a proper initial solution to satisfy (4.1). This can be verified by the following theorem.

Theorem 4.5 (Path-following Optimization Scheme)

Suppose that Assumptions 4.1 and 4.2 hold. Given θ satisfying

θ𝒮0s and ωλK1(θ)δK1λK1, (4.2)

we have ωλK(θ) ≤ λK/2.

The proof of Theorem 4.5 is provided in Wang et al. (2014), therefore omitted. Since θ{0} naturally satisfies (4.2) for λ1, by Theorem 4.5 and induction, we can show that the path-following optimization scheme always guarantees that the output solution of the (K − 1)th stage is a proper initial solution for the Kth stage, where K = 1, …, N. Eventually, we combine Theorems 4.3 and 4.4 with Theorem 4.5, and establish the global geometric rate of convergence in term of the objective value for APISTA in the next theorem.

Theorem 4.6 (Global Geometric Rate of Convergence of APISTA)

Suppose that Assumptions 4.1 and 4.2 hold. Recall that δ0 and δK’s are defined in §3.3, κ and are defined in Assumption 4.2, and α is defined in (2.2). We have the following results:

  1. At the Kth stage (K = 1, …, N), the number of coordinate descent iterations within each CCDA is at most C1 log (C20), where
    C1=2 log1((s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜)) and C2=21s*ρ˜(s*+s˜)ρ˜(1);
  2. At the Kth stage (K = 1, …, N), the number of the proximal gradient iterations in each AISTA is at most C3 log (C4K), where
    C3=2 log1(118κ) and C4=10κs*;
  3. To compute all N + 1 output solutions, the total number of coordinate descent iterations in APISTA is at most
    C1 log (C2/δ0)K=1NC3 log (C4/δK); (4.3)
  4. At the Kth stage (K = 1, …, N), we have
    λN(θ^{K})λN(θ¯λN)[𝟙(K<N)+δK]·105λK2s*ρ˜(s*+s˜);

The proof Theorem 4.6 is provided in Appendix C. We then present a more intuitive explanation about Result (3). To secure the generalization performance in practice, we usually tune the regularization parameter over a refined sequence based on cross validation. In particular, we solve (2.1) using partial data with high precision for every regularization parameter. If we set δK = δoptλK for K = 1, …N, where δopt is a very small value (e.g. 10−8), then we can rewrite (4.3) as

NC1 log (C2δ0)C3 log (C4δopt)=𝒪(N log (1δopt)), (4.4)

where δ0 is some reasonably large value (e.g. 10−2) defined in §3.3. The iteration complexity in (4.4) depends on N.

Once the regularization parameter is selected, we still need to solve (2.1) using full data with some regularization sequence. But we only need high precision for the selected regularization parameter (e.g., λN), and for K = 1, …, N − 1, we only solve (2.1) for λK up to an adequate precision, e.g., δK = δ0 for K = 1, …, N − 1 and δN = δoptλN. Since 1/δopt is much larger than N, we can rewrite (4.3) as

C1 log (C2δ0)((N1)C3 log (C4δ0)+C3 log (C4δopt))=𝒪(log(1δopt)). (4.5)

Now the iteration complexity in (4.5) does not depend on N.

Remark 4.7

To establish computational theories of APISTA with a fixed step size, we only need to slightly modify the proofs of Theorems 4.4 and 4.6 by replacing ρ+(s* + 2) and ρ+(s* + ) by their upper bound L defined in (3.8). Then a global geometric rate of convergence can also be derived, but with a worse constant term.

4.2 Statistical Theory

We then establish the statistical theory of the SCIO estimator obtained by APISTA under transelliptical models. We use Θ* and Σ* to denote the true latent precision and covariance matrices. We assume that Θ* belongs to the following class of sparse, positive definite, and symmetric matrices:

𝒰ψmax,ψmin(M,s*)={Θd×d|Θ=ΘT, max jΘ*j0s*,Θ1M,0<ψmax1Λmin(Θ)Λmax(Θ)ψmin1<},

where ψmax and ψmin are positive constants, and do not scale with (M, s*, n, d). Since Σ* = (Θ*)−1, we also have ψmin ≤ Λmin(Σ*) ≤ Λmax(Σ*) ≤ ψmax.

We first verify Assumptions 4.1 and 4.2 in the next two lemmas for transelliptical models.

Lemma 4.8

Suppose that X~TEd(Σ*,ξ,{fj}j=1d). Given λN=82πMlog d/n, we have

(λN8(θ*))11d2.

The proof of Lemma 4.8 is provided in Appendix D. Lemma 4.8 guarantees that the selected regularization parameter λN satisfies Assumption 4.1 with high probability.

Lemma 4.9

Suppose that X~TEd(Σ*,ξ,{fj}j=1d). Given α = ψmin/2, there exist universal positive constants c1 and c2 such that for n4ψmin1c2(1+2c1)s*log d, with probability at least 1 − 2/d2, we have

s˜=c1s*(144κ2+250κ)s*,ρ˜(s*+2s˜)ψmin4,ρ+(s*+2s˜)5ψmax4,

where κ is defined in Assumption 4.2.

The proof of Lemma 4.9 is provided in Appendix E. Lemma 4.9 guarantees that if the Lipschitz constant of hλ defined in (2.2) satisfies α = ψmin/2, then the transformed Kendall’s tau estimator Ŝ = ∇2ℒ(θ) satisfies Assumption 4.2 with high probability.

Remark 4.10

Since Assumptions 4.1 and 4.2 have been verified, by Theorem 4.6, we know that APISTA attains the geometric rate of convergence to a unique sparse local solution to (2.3) in term of the objective value with high probability.

Recall that we use θ to denote Θ*j in (2.4), by solving (2.3) with respect to all d columns, we obtain Θ^{N}=[Θ^*1{N},,Θ^*d{N}] and Θ¯λN=[Θ¯*1λN,,Θ¯*dλN], where Θ¯*jλN denotes the output solution of APISTA corresponding to λN for the jth column (j = 1, ‥d), and Θ¯*jλN to denote the unique sparse local solution corresponding to λN for the jth column (j = 1, ‥d), which APISTA converges to. We then present concrete rates of convergence of the estimator obtained by APISTA under the matrix ℓ1 and Frobenius norms in the following theorem.

Theorem 4.11. [Parameter Estimation]

Suppose that X~TEd(Σ*,ξ,{fj}j=1d), and α = ψmin/2. For n4ψmin1c2(1+2c1)s*log d, given λN=82πMlog d/n, we have

Θ^{N}Θ*1=OP(Ms*log dn),1dΘ^{N}Θ*F2=OP(M2s*log dn).

The proof of (4.11) is provided in Appendix F. The results in Theorem 4.11 show that the SCIO estimator obtained by APISTA achieves the same rates of convergence as those for subguassian distributions (Liu and Luo, 2015). Moreover, when using the nonconvex regularization such as MCP and SCAD, we can achieve graph estimation consistency under the following assumption.

Assumption 4.3

Suppose that X~TEd(Σ*,ξ,{fj}j=1d). Define *={(k,j)|Θkj*0} as the support of Θ*. There exists some universal constant c3 such that

min(k,j)*|Θkj*|c3M·s* log dn.

Assumption 4.3 is a sufficient condition for sparse column inverse operator to achieve graph estimation consistency in high dimensions for transelliptical models. The violation of Assumption 4.3 may result in underselection of the nonzero entries in Θ*.

The next theorem shows that, with high probability, Θ̅λN and the oracle solution Θ̂o are identical. More specifically, let 𝒮j=supp(Θ*j*) for j = 1, …, d, Θ^o=[Θ^*1o,,Θ^*do] defined as follows,

Θ^𝒮jjo=argminΘ𝒮jj|𝒮j|12Θ𝒮jjTŜ𝒮j𝒮jΘ𝒮jjI𝒮jjTΘ𝒮jj and Θ^𝒮jjo=0 for j=1,,d. (4.6)

Theorem 4.12. [Graph Estimation]

Suppose that X~TEd(Σ*,ξ,{fj}j=1d), α = ψmin/2, and Assumption 4.3 holds. There exists a universal constant c4 such that n4ψmin1c2(1+2c1)s*log d, if we choose λN=c42πMs*log d/n, then we have

(Θ¯λN=Θ^o)13d2.

The proof of Theorem 4.12 is provided in Appendix G. Since Θ̂o shares the same support with Θ*, Theorem 4.12 guarantees that the SCIO estimator obtained by APISTA can perfectly recover ℰ* with high probability. To the best of our knowledge, Theorem 4.12 is the first graph estimation consistency result for transelliptical models without any post-processing procedure (e.g. thresholding).

Remark 4.13

In Theorem (4.12), we choose λN=c42πMlog d/n, which is different from the selected regularization parameter in Assumption 4.8. But as long as we have c4s*8, which is not an issue under the high dimensional scaling

M,s*,n,d and Ms*logd/n0,

λN ≥ 8‖∇ℒ(θ*)‖ still holds with high probability. Therefore all computational theories in §4.1 hold for Θ̅λN in Theorem 4.12.

5 Numerical Experiments

In this section, we study the computational and statistical performance of APISTA method through numerical experiments on sparse transelliptical graphical model estimation. All experiments are conducted on a personal computer with Intel Core i5 3.3 GHz and 16GB memory. All programs are coded in double precision C, called from R. The computation are optimized by exploiting the sparseness of vector and matrices. Thus we can gain a significant speedup in vector and matrix manipulations (e.g. calculating the gradient and evaluating the objective value). We choose the MCP regularization with varying β’s for all simulations.

5.1 Simulated Data

We consider the chain and Erdös-Rényi graph generation schemes with varying d = 200, 400, and 800 to obtain the latent precision matrices:

  • Chain. Each node is assigned a coordinate j for j = 1, …, d. Two nodes are connected by an edge whenever the corresponding points are at distance no more than 1.

  • Erdös-Rényi. We set an edge between each pair of nodes with probability 1/d, independently of the other edges.

Two illustrative examples are presented in Figure 5.1. Let 𝒟 be the adjacency matrix of the generated graph, and ℳ2 be the rescaling operator that converts a symmetric positive semidefinite matrix to a correlation matrix. We calculate

Σ*=2[(𝒟˜+(1Λmin(𝒟))I)1].

We use Σ* as the covariance matrix to generate n = ⌈60 log d⌉ independent observations from a multivariate t-distribution with mean 0 and degrees of freedom 3. We then adopt the power transformation g(t) = t5 to convert to the t-distributed data to the transelliptical data. Note that the corresponding latent precision matrix is Ω* = (Σ*)−1. We compare the following five computational methods:

  1. APISTA: The computational algorithm proposed in §3.

  2. F-APISTA: APISTA without the backtracking line search (using a fixed step size instead).

  3. PISTA: The pathwise iterative shrinkage thresholding algoritm proposed in Wang et al. (2014).

  4. CLIME: The sparse latent precision matrix estimation method proposed in Liu et al. (2012b), which solves (2.5) by the ADMM method (Alternating Direction Method of Multipliers, Li et al. (2015); Liu et al. (2014)).

  5. SCIO(P): The SCIO estimator based on the positive semidefinite projection method proposed in Zhao et al. (2014b). More specifically, we first project the possibly indefinite Kendall’s tau matrix into the cone of all positive semidefinite matrices. Then we plug the obtained replacement into (2.3), and solve it by the coordinate descent method proposed in Liu and Luo (2015).

Figure 5.1.

Figure 5.1

Different graph patterns. To ease the visualization, we only present graphs with d = 200.

Note that (4) and (5) have theoretical guarantees only when the ℓ1 norm regularization is applied. For (1)–(3), we set δ0 = δK = 10−5 for K = 1, …, N.

We first compare the statistical performance in parameter estimation and graph estimation of all methods. To meet this end, we generate a validation set of the same size as the training set. We use the regularization sequence with N = 100 and λN=0.5log d/n0.0645. The optimal regularization parameter is selected by

λ^=argminλ{λ1,,λN}Θ^λS˜Imax,

where Θ̂λ denotes the estimated latent precision matrix using the training set with the regularization parameter λ, and denotes the estimated latent covariance matrix using the validation set. We repeat the simulation for 100 times, and summarize the averaged results in Tables 5.1 and 5.2. For all settings, we set δ0 = δK = 10−5. We also vary β of the MCP regularization from 100 to 20/19, thus the corresponding α varies from 0.01 to 0.95. The parameter estimation performance is evaluated by the difference between the obtained estimator and the true latent prediction matrix under the Forbenius and matrix ℓ1 norms. The graph estimation performance is evaluated by the true positive rate (T. P. R.) and false positive rate (F. P. R.) defined as follows,

T.P.R.=kj𝟙(Θ^kjλ^0)·𝟙(Θkj*0)kj𝟙(Θkj*0),F.P.R.=kj𝟙(Θ^kjλ^0)·𝟙(Θkj*0)kj𝟙(Θkj*0).

Table 5.1.

Quantitive comparison among different estimators on the chain model. Since APISTA and F-APISTA can output valid results for large α’s, their estimator attains better performance than other competitors. The SCIO(P) and CLIME estimators use the ℓ1 norm regularization with no bias reduction. Thus their performance is worse than the other competitors in both parameter estimation and graph estimation.

Method d ‖Θ̂−Θ‖F ‖Θ̂−Θ‖1 T. P. R. F. P. R. α
PISTA 200 4.1112(0.7856) 1.0517(0.1141) 1.0000(0.0000) 0.0048(0.0079) 0.20
400 6.4507(0.9062) 1.0756(0.0717) 1.0000(0.0000) 0.0007(0.0004) 0.20
800 8.2640(1.1456) 1.0434(0.0673) 1.0000(0.0000) 0.0003(0.0006) 0.20

APISTA 200 2.5162(0.2677) 0.7665(0.1583) 0.9993(0.0012) 0.0001(0.0001) 0.95
400 3.3664(0.2735) 0.8298(0.0986) 1.0000(0.0000) 0.0002(0.0000) 0.67
800 5.0244(0.7984) 0.9312(0.1226) 1.0000(0.0000) 0.0002(0.0004) 0.50

F-APISTA 200 2.5163(0.2670) 0.7658(0.1559) 0.9994(0.0015) 0.0001(0.0002) 0.95
400 3.3629(0.2702) 0.8253(0.0959) 1.0000(0.0000) 0.0002(0.0000) 0.67
800 5.0237(0.7963) 0.9373(0.1289) 1.0000(0.0000) 0.0002(0.0005) 0.50

SCIO(P) 200 6.1812(1.2924) 1.2245(0.0777) 1.0000(0.0000) 0.0165(0.0220) 0.00
400 8.9991(0.9894) 1.2255(0.0785) 1.0000(0.0000) 0.0058(0.0047) 0.00

CLIME 200 6.4771(0.8617) 1.2187(0.0358) 1.0000(0.0000) 0.0126(0.0043) 0.00
400 9.1221(0.9997) 1.2177(0.0629) 1.0000(0.0000) 0.0043(0.0032) 0.00

Table 5.2.

Quantitive comparison among different estimators on the Erdös-Rényi model. Since A-PISTA and F-APISTA can output valid results for large α’s, their estimators attains better performance than other competitors. The SCIO(P) and CLIME estimators use the ℓ1 norm regularization with no bias reduction. Thus their performance is worse than the other competitors in both parameter estimation and graph estimation.

Method d ‖Θ̂−Θ‖F ‖Θ̂−Θ‖1 T. P. R. F. P. R. α̂
PISTA 200 3.2647(0.1235) 1.6807(0.2675) 1.0000(0.0000) 0.0587(0.0013) 0.20
400 4.5609(0.7666) 2.2113(0.3358) 1.0000(0.0000) 0.0295(0.0091) 0.20
800 5.0751(0.3832) 2.5718(0.2826) 1.0000(0.0000) 0.0099(0.0020) 0.20

APISTA 200 2.2888(0.1141) 1.1644(0.2343) 1.0000(0.0000) 0.0193(0.0005) 0.33
400 3.2206(0.2733) 1.4974(0.2778) 1.0000(0.0000) 0.0067(0.0100) 0.33
800 4.0929(0.1862) 1.6347(0.2023) 1.0000(0.0000) 0.0036(0.0008) 0.50

F-APISTA 200 2.2890(0.1161) 1.1647(0.2390) 1.0000(0.0000) 0.0197(0.0007) 0.33
400 3.2251(0.2702) 1.4928(0.2731) 1.0000(0.0000) 0.0060(0.0102) 0.33
800 4.0984(0.1891) 1.6397(0.2096) 1.0000(0.0000) 0.0034(0.0009) 0.50

SCIO(P) 200 3.4277(0.5405) 1.5213(0.3223) 1.0000(0.0000) 0.0618(0.0170) 0.00
400 5.7144(0.8158) 1.9057(0.2933) 0.9994(0.0017) 0.0341(0.0145) 0.00

CLIME 200 3.6297(0.6103) 1.4876(0.2855) 1.0000(0.0000) 0.0581(0.0159) 0.00
400 5.9206(0.8385) 1.8246(0.2817) 1.0000(0.0000) 0.0320(0.0112) 0.00

Since the convergence of PISTA is very slow when α is large, we only present its results for α = 0.2. APISTA and F-APISTA can work for larger α’s. Therefore they effectively reduces the estimation bias to attain the best statistical performance in both parameter estimation and graph estimation among all estimators. The SCIO(P) and CLIME methods only use ℓ1 norm without any bias reduction, their performance is worse than the other competitors. Moreover, due to the poor scalability of their solvers, SCIO(P) and CLIME fail to output valid results within 10 hours when d = 800.

We then compare the computational performance of all methods. We use a regularization sequence with N = 50, and λN is proper selected such that the graphs obtained by all methods have approximately the same number of edges for each regularization parameter. In particular, the obtained graphs corresponding to λN have approximately 0.1 · d(d − 1)/2 edges. To make a fair comparison, we choose the ℓ1 norm regularization for all methods. We repeat the simulation for 100 times, and the timing results are summarized in Tables 5.3 and 5.4. We see that F-APISTA method is up to 10 times faster than PISTA algorithm, and APISTA is up to 5 times after than PISTA. SCIO(P) and CLIME are much slower than the other three competitors.

Table 5.3.

Quantitive comparison of computational performance on the chain model (in seconds). We see that the F-APISTA method attains the best timing performance among all methods. The SCIO(P) and CLIME methods are much slower than the other three methods.

d PISTA APISTA F-APISTA SCIO(P) CLIME
200 0.8342(0.0248) 0.2693(0.0031) 0.1013(0.0022) 2.6572(0.1253) 8.5932(0.5396)
400 3.8782(0.0696) 1.2103(0.0368) 0.4559(0.0308) 25.451(2.5752) 48.235(5.3494)
800 30.014(0.3514) 6.5970(0.2338) 2.4283(0.2605) 315.87(34.638) 460.12(45.121)

Table 5.4.

Quantitive comparison of computational performance on the Erdös-Rényi model (in seconds). We see that the F-APISTA method attains the best timing performance among all methods. The SCIO(P) and CLIME methods are much slower than the other three methods.

d PISTA APISTA F-APISTA SCIO(P) CLIME
200 0.5401(0.0248) 0.2048(0.0056) 0.1063(0.0110) 2.712(0.13558) 7.1325(0.7891)
400 3.0501(0.0829) 0.9982(0.0453) 0.4555(0.0071) 26.140(2.1503) 45.160(4.9026)
800 28.581(0.3517) 6.8417(0.7543) 2.7037(0.2145) 332.90(30.115) 442.57(50.978)

5.2 Real Data

We present a real data example to demonstrate the usefulness of the transelliptical graph obtained by the sparse column inverse operator (based on the transformed Kendall’s tau matrix). We acquire closing prices from all stocks of the S&P 500 for all the days that the market was open between January 1, 2003 and January 1, 2005, which results in 504 samples for the 452 stocks. We transform the dataset by calculating the log-ratio of the price at time t + 1 to price at time t. The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors.

We adopt the stability graphs obtained by the following procedure (Meinshausen and Bühlmann, 2010; Liu et al., 2010):

  1. Calculate the graph path using all samples, and choose the regularization parameter at the sparsity level 0.1;

  2. Randomly choose 50% of all samples without replacement using the regularization parameter chosen in (1);

  3. Repeat (2) 100 times and retain the edges that appear with frequencies no less than 95%.

We choose the sparsity level 0.1 in (1) and subsampling ratio 50% in (2) based on two criteria: The resulting graphs need to be sparse to ease visualization, interpretation, and computation; The resulting graphs need to be stable. We then present the obtained stability graphs in Figure 5.2. The nodes are colored according to the GICS sector of the corresponding stock. We highlight a region in the transelliptical graph obtained by the SCIO method and by color coding we see that the nodes in this region belong to the same sector of the market. A similar pattern is also found in the transelliptical graph obtained by the CLIME method. In contrast, this region is shown to be sparse in the Gaussian graph obtained by the SCIO method (based on the Pearson correlation matrix). Therefore we can see that the SCIO method is also capable of generating refined structures as the CLIME method when estimating the transelliptical graph.

Figure 5.2.

Figure 5.2

Stock Graphs. We see that both transelliptical graphs reveal more refined structures than the Gaussian graph.

6 Discussions

We compare F-APISTA with a closely related algorithm – the path-following coordinate descent algorithm (PCDA1) in timing performance. In particular, we give a failure example of PCDA for solving sparse linear regression. Let X ∈ ℝn×d denote design matrix and y ∈ ℝn denote the response vector. We solve the following regularized optimization problem,

minθ12nyXθ22+λ(θ).

We generate each row of the design matrix Xi* from a d-variate Gaussian distribution with mean 0 and covariance matrix Σ ∈ ℝd×d, where Σkj = 0.75 if kj and Σkk = 1 for all j, k = 1, …, d. We then normalize each column of the design matrix X*j such that X*j22=n. The response vector is generated from the linear model y = * + ε, where θ* ∈ ℝd is the regression coefficient vector, and ε is generated from a n-variate Gaussian distribution with mean 0 and covariance matrix I. We set n = 60 and d = 1000. We set the coefficient vector as θ250*=3,θ500*=2,θ750*=1.5, and θj*=0 for all j ≠ 250, 500, 750. We then set α = 0.95, N = 100, λN=0.25log d/n, and δc = δK = 10−5.

We then generate a validation set using the same design matrix as the training set for the regularization selection. We denote the response vector of the validation set as ∈ ℝn. Let θ̂λ denote the obtained estimator using the regularization parameter λ. We then choose the optimal regularization parameter λ̂ by

λ^=argminλ{λ1,,λN}Xθ^λ22.

We repeat 100 simulations, and summarize the average results in Table 6. We see that F-APISTA and PCDA attain similar timing results. But PCDA achieves worse statistical performance than F-APISTA in both support recovery and parameter estimation. This is because PCDA has no control over the solution sparsity. The overselection irrelevant variables compromise the restricted strong convexity, and make PCDA attain some local optima with poor statistical properties.

Table 6.1.

Quantitative comparison between F-APISTA and PCDA. We see that F-APISTA and PCDA attain similar timing results. But PCDA achieves worse statistical performance than F-APISTA in both support recovery and parameter estimation.

Method ‖θ̂−θ*2 ‖θ̂𝒮0 ‖θ̂𝒮c0 Correct Selection Timing
F-APISTA 0.8001(0.9089) 2.801(0.5123) 0.890(2.112) 667/1000 0.0181(0.0025)
PCDA 1.1275(1.2539) 2.655(0.7051) 1.644(3.016) 517/1000 0.0195(0.0021)

Acknowledgments

Research supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841, and FDA HHSF223201000072C.

Appendix

A Proof of Theorem 4.3

Proof

Since ‖θ(0)0s* + implies that |𝒜| ≤ s* + , by Assumption 4.2 and Lemma 4.1, we know that (3.5) is strongly convex over θ𝒜. Thus it has a unique global minimizer. We then analyze the amount of successive decrease. By the restricted strong convexity of ℱλ(θ), we have

λ(w(t+1,k))λ(w(t+1,k+1))(kλ(θk(t+1),w\k(t+1,k))+λξk(t+1))(θk(t)θk(t+1))+ρ˜(1)2(θk(t)θk(t+1))2, (A.1)

where ξk(t+1)|θk(t+1)| satisfies the optimality condition of (3.6),

k˜λ(θk(t+1),w\k(t+1,k))+λξk(t+1)=0. (A.2)

By combining (A.1) with (A.2), we have

λ(w(t+1,k))λ(w(t+1,k+1))ρ˜(1)2(θk(t+1)θk(t))2,

which further implies

λ(θ(t))λ(θ(t+1))ρ˜(1)2θ(t)θ(t+1)22. (A.3)

We then analyze the gap in the objective value yet to be minimized after each iteration. For any θ′, θ ∈ ℝd with θ𝒜=θ𝒜=0, by the restricted strong convexity of ℱλ(θ), we have

λ(θ)λ(θ)+(˜λ(θ)+λξ)T(θθ)+ρ˜(s*+s˜)2θθ22, (A.4)

where ξ ∈ ℝd with ξ𝒜 ∈ ∂‖θ𝒜1 and ξ𝒜 = 0. We then minimize both sides of (A.4) with respect to θ𝒜 and obtain

λ(θ(t+1))λ(θ¯)12ρ˜(s*+s˜)𝒜˜λ(θ(t+1))+λξ𝒜(t+1)22=(i)12ρ˜(s*+s˜)k=1|𝒜|[k˜λ(θ(t+1))k˜λ(θk(t+1),w\k(t+1,k))]2(ii)ρ+2(s*+s˜)2ρ˜(s*+s˜)k=1|𝒜|θ(t+1)w(t+1,k)2(s*+s˜)ρ+2(s*+s˜)2ρ˜(s*+s˜)θ(t+1)θ(t)2, (A.5)

where (i) comes from (A.2) and (ii) comes from the restricted strong smoothness of ℒ̃λ(θ).

Eventually, by combing (A.5) with (A.3), we obtain

λ(θ(t+1))λ(θ¯)(s*+s˜)ρ+2(s*+s˜)ρ˜(1)ρ˜(s*+s˜)[λ(θ(t))λ(θ(t+1))](s*+s˜)ρ+2(s*+s˜)ρ˜(1)ρ˜(s*+s˜)([λ(θ(t))λ(θ¯)][λ(θ(t+1))λ(θ¯)]),

which further implies

λ(θ(t+1))λ(θ¯)(s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜))[λ(θ(t))λ(θ¯)]. (A.6)

By recursively applying (A.6), we complete the proof.

B Proof of Theorem 4.4

Proof

Before we proceed with the proof, we first introduce several important lemmas.

Lemma B.1

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,

θ𝒮0s˜ and ωλ(θ)λ/2, (B.1)

then we have

θθ*221λs*8ρ˜(s*+s˜),θθ*121λs*ρ˜(s*+s˜), and λ(θ)λ(θ*)+21λ2s*2ρ˜(s*+s˜).

Lemma B.2

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,

θ𝒮0s˜ and λ(θ)λ(θ*)+21λ2s*2ρ˜(s*+s˜),

then we have ‖[𝒯L(θ)]𝒮0 for any L ≤ 2ρ+(s* + 2).

The proofs of Lemmas B.1 and B.2 are provided in Wang et al. (2014), therefore omitted. Since the initial solution θ[0] satisfies the approximate KKT condition. By Lemma B.1, we know that θ[0] satisfies

λ(θ[0])λ(θ*)+21λ2s*2ρ˜(s*+s˜). (B.2)

We assume L[m] ≤ 2ρ+(s* + 2). Since θ𝒮[0]0s˜, by (B.2) and Lemma B.2, we have θ[0.5] = 𝒯L(θ[0]) and θ𝒮[0.5]0s˜. Since the coordinate descent subroutine iterates over 𝒜 = supp(θ[0.5]), its output solution θ[1] also satisfies θ𝒮[1]0s˜. Since the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, by (B.2), we also have

λ(θ[1])λ(θ[0.5])λ(θ[0])λ(θ*)+21λ2s*2ρ˜(s*+s˜).

Then by induction, we know that all successive θ[m]’s satisfy θ𝒮[m]0s˜ for m = 1.5, 2, 2.5, ….

Now we verify L[m] ≤ 2ρ+(s* + 2). Since we start with a small enough L = ρ+(1) ≤ 2ρ+(s* + 2). If L does not satisfy the stopping criterion for the backtracking line search in (3.4), then we multiply L by 2. Once L attains the interval ∈ [ρ+(s* + 2), 2ρ+(s* + 2)], it stops increasing. Because by the restricted strong smoothness of ℒ̃λ(θ), such a step size parameter always guarantees that the algorithm iterates from a sparse θ[m] to a sparse θ[m+0.5], and meanwhile satisfies the stopping criterion of the backtracking line search. Thus L[m] ≤ 2ρ+(s* + 2) is verified.

The existence and uniqueness of θ̄λ has been verified in Wang et al. (2014). Therefore the proof is omitted. We then proceed to derive the geometric rate of convergence to θ̄λ by the next lemma.

Lemma B.3

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies

θ𝒮0s˜ and λ(θ)λ(θ*)+21λ2s*2ρ˜(s*+s˜), (B.3)

given L ≤ 2ρ+(s* + 2), then we have

λ(𝒯λ,L(θ))λ(θ¯λ)(118κ)[λ(θ)λ(θ¯λ)].

The proof of Lemma B.3 is provided in Wang et al. (2014), therefore omitted. Since we have verified that all θ[m]’s satisfy (B.3) and all L[m]’s satisfy L[m] ≤ 2ρ+(s* + 2) for m = 0, 1, 2, …, Lemma B.3 implies

λ(θ[m+1])λ(θ¯λ)λ(θ[m+0.5])λ(θ¯λ)(118κ)[λ(θ[m])λ(θ¯λ)], (B.4)

where the first inequality holds because the coordinate descent subroutine decreases the objective value. Then by recursively applying (B.4), we compete the proof.

C Proof of Theorem 4.6

Proof

Before we proceed with the proof of Result (1), we first introduce the following lemma.

Lemma C.1

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies

θ𝒮0s˜ and ωλ(θ)δmaxλ,

then for any λ′ ∈ [λN, λ], we have

λ(θ)λ(θ¯λ)21[δmaxλ+2(λλ)](λ+λ)s*ρ˜(s*+s˜).

The proof of Lemmas C.1 is provided in Wang et al. (2014), therefore omitted. If we take λ = λ′ = λK and θ = θ̂{K−1}, then Lemma C.1 implies

λK(θ^{K1})λK(θ¯λK)21s*λK22ρ˜(s*+s˜). (C.1)

Recall (A.3) in Appendix A. Within each coordinate descent subroutine for λK, we have

θ(t)θ(t+1)222[λK(θ(t))λK(θ(t+1))]ρ˜(1)2[λK(θ(t))λK(θ¯)]ρ˜(1). (C.2)

By combining Theorem 4.3 with (C.2), we have

θ(t)θ(t+1)222((s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜))t[λK(θ(0))λK(θ¯)]ρ˜(1).

Therefore given

tlog(2[λK(θ(0))λK(θ¯)]ρ˜(1)δ02λK2)/log1((s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜)), (C.3)

we have

θ(t)θ(t+1)222(s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜))t[λK(θ(0))λK(θ¯)]ρ˜(1)δ02λK2,

which satisfies the stopping criterion of CCDA for λK. Since both the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, we have

λK(θ^{K1})λK(θ(0))λK(θ¯)λK(θ¯λK) (C.4)

within each coordinate descent subroutine for the Kth stage. By combining (C.1) and (C.3) with (C.4), we have

tlog(21s*ρ˜(s*+s˜)ρ˜(1)δ02)/log1((s*+s˜)ρ+2(s*+s˜)(s*+s˜)ρ+2(s*+s˜)+ρ˜(1)ρ˜(s*+s˜)).

Before we proceed with the proof of Result (2), we first introduce the following lemma.

Lemma C.2

Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,

θ𝒮0s˜ and λ(θ)λ(θ*)+21λ2s*2ρ˜(s*+s˜), (C.5)

given L ≤ 2ρ+(s* + 2), we have

ωλ(𝒯λ,L(θ))3ρ+(s*+2s˜)[λ(𝒯λ,L(θ))λ(θ)].

The proof of Lemma C.2 is provided in Wang et al. (2014), therefore omitted. Recall that in Appendix B, we have shown that at the Kth stage, θ[m] satisfies (C.5). The backtracking line search guarantees L[m+1] ≤ 2ρ+(s* + 2). Thus by Lemma C.2, we have

ωλK(θ[m+0.5])3ρ+(s*+2s˜)[λK(θ[m+0.5])λK(θ[m])]3ρ+(s*+2s˜)[λK(θ[m+1])λK(θ¯λK)], (C.6)

where the last inequality holds since the coordinate descent subroutine decreases the objective value. By combining (C.6) with Theorem 4.4, we obtain

ωλK(θ[m+0.5])3ρ+(s*+2s˜)(118κ)m+1[λK(θ[0])λK(θ¯λK)].

Thus as long as

mlog(9ρ+(s*+2s˜)[λK(θ[0])λK(θ¯λK)]δK2λK2)/log1(118κ), (C.7)

we have

ωλK(θ[m+0.5])3ρ+(s*+2s˜)(118κ)m[λK(θ[0])λK(θ¯λK)]δKλK,

which satisfies the stopping criterion of AISTA at the Kth stage. By combining (C.1) with (C.7), we have

mlog(189κλK2s*2δK2λK2)/log1(118κ).

Result (3) is just a straightforward combination of Results (1) and (2).

To prove Result (4), we need to use Lemma C.1 again. In particular, for K < N, we take λ′ = λN, λ = λK and θ = θ̂{K}. We then have

λN(θ^{K})λN(θ¯λN)21(λK+λN)(ωλK(θ^{K})+2(λKλN))s*)ρ˜(s*+s˜). (C.8)

Since we have λK > λN for K = 1, …, N − 1, (C.8) implies

λN(θ^{K})λN(θ¯λN)105λK2s*ρ˜(s*+s˜). (C.9)

For K = N, (C.8) implies

λN(θ^{N})λN(θ¯λN)105δNλN2s*ρ˜(s*+s˜). (C.10)

By combining (C.9) with (C.10), we prove Result (4).

D Proof of Lemma 4.8

Proof

Before we proceed with the proof, we need to introduce the following lemma.

Lemma D.1

Suppose that X~TEd(Σ*,ξ,{fj}j=1d). We have

(ŜΣ*max2πlog dn)11d2. (D.1)

The proof of Lemma D.1 is provided in Liu et al. (2012a), therefore omitted. We consider the following decomposition,

(θ*)=Ŝθ*e=(ŜΣ*)θ*θ*1ŜΣ*max. (D.2)

Then by combining (D.1) and (D.2) with the fact ‖θ*1 ≤ ‖Θ*1M, we have

((θ*)2πMlog dn)11d2,

which completes the proof.

E Proof of Lemma 4.9

Proof

Before we proceed with the proof, we first introduce the following lemma.

Lemma E.1

Suppose that X~TEd(Σ*,ξ,{fj}j=1d). There exists a universal constant c2 such that

(supθ0s|θT(ŜΣ*)θ|c2s log dnθ22)12d2. (E.1)

The proof of Lemma E.1 is provided in Han and Liu (2015), therefore omitted. We consider the decomposition

θTŜθ=θTΣ*θ+θT(ŜΣ*)θ. (E.2)

By assuming ‖θ0s* + 2 and

|θT(ŜΣ¯)θ|c2(s*+2s˜)log dnθ22/n,

we further have

θTŜθΛmax(Σ*)·θ22+|θT(ŜΣ*)θ|ψmaxθ22+c2(s*+2s˜) log dnθ22, (E.3)
θTŜθΛmin(Σ*)·θ22|θT(ŜΣ*)θ|ψminθ22c2(s*+2s˜) log dnθ22. (E.4)

Thus for n4ψmin1c2(s*+2s˜)log d, we have

3ψminθ22/4θTŜθ5ψmaxθ22/4.

Given α = ψmin/2, we have

ρ+(s*+2s˜)5ψmax/4,ρ˜(s*+2s˜)ψmin/4,κ5ψmax/ψmin. (E.5)

Since we need to secure = c1s* ≥ (144κ2 + 250κ)s*, we take

c1=3600ψmax2/ψmin2+1250ψmax/ψmin72(1+γ)κ2+250κ. (E.6)

In another word, we need

n4ψmin1c2(1+2c1)s*log d4ψmin1c2(s*+2s˜)log d.

Eventually by combining (E.1) and (E.5) with (E.6), we complete the proof.

F Proof of Theorem 4.11

Proof

Recall that the output solution θ̂{N} satisfies θ^𝒮{N}0s˜ and ωλN ≤ δNλN. By Lemma B.1, we have

θ^{N}θ*121λNs*ρ˜(s*+s˜) and θ^{N}θ*227λN2s*ρ˜(s*+s˜). (F.1)

By the definition of the matrix ℓ1 and Frobenius norms, we have

Θ^{N}Θ*1=max1jdΘ*j{N}Θ*j*1 and Θ^{N}Θ*F2=j=1dΘ*j{N}Θ*j*22. (F.2)

Recall that we use θ̂{N} to denote arbitrary column of Θ̂{N}. By combining (F.2) with (F.1), we have

Θ^{N}Θ*121λNs*ρ˜(s*+s˜) and 1dΘ^{N}Θ*F27λN2s*ρ˜(s*+s˜).

Since all above results rely on Assumptions 4.1 and 4.2, by Lemma 4.8 and 4.9, we have

Θ^{N}Θ*11682πs*Mρ˜(s*+s˜)log dn and 1dΘ^{N}Θ*F2896π2s*M2log dρ˜(s*+s˜)n

with probability at least 1 − 3d−2, which completes the proof.
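For completeness, the constants above follow from substituting the regularization level into (F.1). Assuming, as suggested by Lemma 4.8 and the requirement λN ≥ 8‖∇ℒ(θ*)‖∞, that λN = 8√(2π)M√(log d/n) (this particular choice is an assumption of this sketch), the arithmetic is

\frac{21\lambda_N s^*}{\tilde{\rho}(s^*+\tilde{s})} = \frac{21\cdot 8\sqrt{2\pi}\,s^* M}{\tilde{\rho}(s^*+\tilde{s})}\sqrt{\frac{\log d}{n}} = \frac{168\sqrt{2\pi}\,s^* M}{\tilde{\rho}(s^*+\tilde{s})}\sqrt{\frac{\log d}{n}}, \qquad \frac{7\lambda_N^2 s^*}{\tilde{\rho}(s^*+\tilde{s})} = \frac{7\cdot 64\cdot 2\pi\,s^* M^2\log d}{\tilde{\rho}(s^*+\tilde{s})\,n} = \frac{896\pi\,s^* M^2\log d}{\tilde{\rho}(s^*+\tilde{s})\,n}.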

G Proof of Theorem 4.12

Proof

For notational simplicity, we omit the column index j, and use 𝒮 and θ̂o ∈ ℝd to denote the true support 𝒮j and the jth column of the oracle estimator Θ̂o, respectively. In particular, we can rewrite (4.6) as follows,

\hat{\theta}^o_{\mathcal{S}} = \mathop{\mathrm{argmin}}_{\theta_{\mathcal{S}}\in\mathbb{R}^{|\mathcal{S}|}} \frac{1}{2}\theta_{\mathcal{S}}^\top\hat{S}_{\mathcal{S}\mathcal{S}}\theta_{\mathcal{S}} - e_{\mathcal{S}}^\top\theta_{\mathcal{S}} \quad\text{and}\quad \hat{\theta}^o_{\overline{\mathcal{S}}} = 0. (G.1)

Suppose that Assumption 4.2 holds. We have

\Lambda_{\min}(\hat{S}_{\mathcal{S}\mathcal{S}}) \ge \rho_-(s^*) \ge \rho_-(s^*+2\tilde{s}) = \tilde{\rho}(s^*+2\tilde{s}) + \alpha > \alpha,

which implies that Ŝ𝒮𝒮 is positive definite. Thus (G.1) is strongly convex and θ̂o is the unique minimizer. In the following analysis, we also assume

\|\hat{S} - \Sigma^*\|_{\max} \le \sqrt{2\pi}\sqrt{\frac{\log d}{n}}. (G.2)

By the strong convexity of (G.1), we have

0 \overset{(i)}{\ge} \frac{1}{2}(\hat{\theta}^o_{\mathcal{S}})^\top\hat{S}_{\mathcal{S}\mathcal{S}}\hat{\theta}^o_{\mathcal{S}} - e_{\mathcal{S}}^\top\hat{\theta}^o_{\mathcal{S}} - \frac{1}{2}(\theta^*_{\mathcal{S}})^\top\hat{S}_{\mathcal{S}\mathcal{S}}\theta^*_{\mathcal{S}} + e_{\mathcal{S}}^\top\theta^*_{\mathcal{S}} \ge (\hat{S}_{\mathcal{S}\mathcal{S}}\theta^*_{\mathcal{S}} - e_{\mathcal{S}})^\top(\hat{\theta}^o_{\mathcal{S}} - \theta^*_{\mathcal{S}}) + \frac{\rho_-(s^*)}{2}\|\theta^*_{\mathcal{S}} - \hat{\theta}^o_{\mathcal{S}}\|_2^2, (G.3)

where (i) comes from the fact that θ̂o𝒮 is the minimizer of (G.1). For notational simplicity, we denote Δ̂o𝒮 = θ̂o𝒮 − θ*𝒮. By the Cauchy-Schwarz inequality, (G.3) can be rewritten as

\frac{\rho_-(s^*)}{2}\|\hat{\Delta}^o_{\mathcal{S}}\|_2^2 \le -(\hat{S}_{\mathcal{S}\mathcal{S}}\theta^*_{\mathcal{S}} - e_{\mathcal{S}})^\top\hat{\Delta}^o_{\mathcal{S}} \le \|\hat{S}_{\mathcal{S}\mathcal{S}}\theta^*_{\mathcal{S}} - e_{\mathcal{S}}\|_{\max}\|\hat{\Delta}^o_{\mathcal{S}}\|_1 \le \|\hat{S}\theta^* - e\|_{\max}\sqrt{s^*}\,\|\hat{\Delta}^o_{\mathcal{S}}\|_2,

where the last inequality comes from (G.2) and the fact that Δ̂o contains at most s* entries. By simple manipulations, we obtain

\|\hat{\Delta}^o_{\mathcal{S}}\|_2 \le \frac{2\sqrt{s^*}\|\hat{S}\theta^* - e\|_{\max}}{\rho_-(s^*)} \le \frac{2\sqrt{s^*}\|\theta^*_{\mathcal{S}}\|_1\|\hat{S}_{\mathcal{S}\mathcal{S}} - \Sigma^*_{\mathcal{S}\mathcal{S}}\|_{\max}}{\rho_-(s^*)} \le \frac{2\sqrt{2\pi}M}{\rho_-(s^*)}\sqrt{\frac{s^*\log d}{n}}, (G.4)

where the last inequality comes from the fact that ‖θ*‖1 ≤ ‖Θ*‖1 ≤ M. By combining (G.4) with Assumption 4.3, we obtain

\min_{j\in\mathcal{S}}|\hat{\theta}^o_j| \ge \min_{j\in\mathcal{S}}|\theta^*_j| - \|\hat{\Delta}^o_{\mathcal{S}}\|_\infty \overset{(i)}{\ge} \min_{j\in\mathcal{S}}|\theta^*_j| - \|\hat{\Delta}^o_{\mathcal{S}}\|_2 \ge \left(c_3 - \frac{2\sqrt{2\pi}}{\rho_-(s^*)}\right)M\sqrt{\frac{s^*\log d}{n}},

where (i) comes from the fact that ‖Δ̂o𝒮‖∞ ≤ ‖Δ̂o𝒮‖2. Now we assume c3 ≥ 2√(2π)/ρ−(s*) + c4√(2π)β for some constant c4 (which will be discussed later). We then have

\min_{j\in\mathcal{S}}|\hat{\theta}^o_j| \ge c_4\sqrt{2\pi}\,M\sqrt{\frac{s^*\log d}{n}} \ge \lambda_N\beta.

Now we show that θ̂o is a sparse local solution to (2.4). In particular, we have the following decomposition,

\nabla\mathcal{L}(\hat{\theta}^o) = \hat{S}\hat{\theta}^o - e = \begin{bmatrix}\hat{S}_{\mathcal{S}\mathcal{S}} & \hat{S}_{\mathcal{S}\overline{\mathcal{S}}}\\ \hat{S}_{\overline{\mathcal{S}}\mathcal{S}} & \hat{S}_{\overline{\mathcal{S}}\overline{\mathcal{S}}}\end{bmatrix}\begin{bmatrix}\hat{\theta}^o_{\mathcal{S}}\\ 0\end{bmatrix} - \begin{bmatrix}e_{\mathcal{S}}\\ 0\end{bmatrix}.

Since θ^𝒮o is the minimizer to (G.1), by the KKT condition of (G.1), we have

\hat{S}_{\mathcal{S}\mathcal{S}}\hat{\theta}^o_{\mathcal{S}} - e_{\mathcal{S}} = 0. (G.5)

Moreover, since min_{j∈𝒮}|θ̂oj| ≥ λNβ, we have

\nabla\mathcal{H}_{\lambda_N}(\hat{\theta}^o_{\mathcal{S}}) + \lambda_N\nabla\|\hat{\theta}^o_{\mathcal{S}}\|_1 = 0. (G.6)

By combining (G.5) with (G.6), we have

\hat{S}_{\mathcal{S}\mathcal{S}}\hat{\theta}^o_{\mathcal{S}} - e_{\mathcal{S}} + \nabla\mathcal{H}_{\lambda_N}(\hat{\theta}^o_{\mathcal{S}}) + \lambda_N\nabla\|\hat{\theta}^o_{\mathcal{S}}\|_1 = 0. (G.7)

Now we consider

\|\hat{S}_{\overline{\mathcal{S}}\mathcal{S}}\hat{\theta}^o_{\mathcal{S}}\|_\infty = \|(\hat{S}_{\overline{\mathcal{S}}\mathcal{S}} - \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}} + \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}})(\hat{\theta}^o_{\mathcal{S}} - \theta^*_{\mathcal{S}} + \theta^*_{\mathcal{S}})\|_\infty = \|(\hat{S}_{\overline{\mathcal{S}}\mathcal{S}} - \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}})\hat{\Delta}^o_{\mathcal{S}} + \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}}\hat{\Delta}^o_{\mathcal{S}} + (\hat{S}_{\overline{\mathcal{S}}\mathcal{S}} - \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}})\theta^*_{\mathcal{S}}\|_\infty \le \sqrt{s^*}\|\hat{S}_{\overline{\mathcal{S}}\mathcal{S}} - \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}}\|_{\max}\|\hat{\Delta}^o_{\mathcal{S}}\|_2 + \|\hat{\Delta}^o_{\mathcal{S}}\|_2 + \|\hat{S}_{\overline{\mathcal{S}}\mathcal{S}} - \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}}\|_{\max}\|\theta^*_{\mathcal{S}}\|_1 \le \frac{4\pi M s^*\log d}{\rho_-(s^*)\,n} + \frac{2\sqrt{2\pi}M}{\rho_-(s^*)}\sqrt{\frac{s^*\log d}{n}} + \sqrt{2\pi}M\sqrt{\frac{\log d}{n}} \le \left(\frac{\sqrt{2\pi\psi_{\min}}}{\sqrt{c_2(1+2c_3)}\,\rho_-(s^*)} + \frac{2}{\rho_-(s^*)} + 1\right)\sqrt{2\pi}M\sqrt{\frac{s^*\log d}{n}},
where the second equality uses \Sigma^*_{\overline{\mathcal{S}}\mathcal{S}}\theta^*_{\mathcal{S}} = (\Sigma^*\theta^*)_{\overline{\mathcal{S}}} = e_{\overline{\mathcal{S}}} = 0.

Therefore as long as

c_4 \ge \frac{\sqrt{2\pi\psi_{\min}}}{\sqrt{c_2(1+2c_3)}\,\rho_-(s^*)} + \frac{2}{\rho_-(s^*)} + 1,

we have ‖Ŝ𝒮̄𝒮θ̂o𝒮‖∞ ≤ λN, which implies that there exists ξ ∈ ∂‖0‖1 such that

\hat{S}_{\overline{\mathcal{S}}\mathcal{S}}\hat{\theta}^o_{\mathcal{S}} + \nabla\mathcal{H}_{\lambda_N}(0) + \lambda_N\xi = 0. (G.8)

By combining (G.7) with (G.8), we know that θ̂o satisfies the KKT condition and is a local solution to (2.4).
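The construction in (G.1)-(G.8) can be mirrored numerically: solve the restricted quadratic program for θ̂o𝒮, extend by zeros, and check that the gradient vanishes on 𝒮 while staying below λN on 𝒮̄. The Python sketch below uses synthetic Gaussian data, a hypothetical support and λN, and ignores the concave penalty component (whose gradient vanishes at zero and at large entries); it illustrates the mechanics only, not the paper's transelliptical estimator.

import numpy as np

rng = np.random.default_rng(3)
d, n, j = 30, 5000, 0

# Hypothetical sparse precision matrix; column j has support S = {0, 3, 7}.
Omega = np.eye(d)
Omega[0, 3] = Omega[3, 0] = 0.4
Omega[0, 7] = Omega[7, 0] = -0.3
Sigma = np.linalg.inv(Omega)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
S_hat = X.T @ X / n                                   # sample covariance, stands in for S_hat
S = np.array([0, 3, 7])
Sbar = np.setdiff1d(np.arange(d), S)
e = np.zeros(d); e[j] = 1.0

# Oracle estimator (G.1): restricted quadratic program, zeros off the support.
theta_o = np.zeros(d)
theta_o[S] = np.linalg.solve(S_hat[np.ix_(S, S)], e[S])

grad = S_hat @ theta_o - e                            # gradient of the column-wise loss
lam_N = 0.1                                           # hypothetical regularization level
print(np.max(np.abs(grad[S])))                        # zero up to round-off, cf. (G.5)
print(np.max(np.abs(grad[Sbar])) <= lam_N)            # KKT on the complement, cf. (G.8)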

Now we will show that θ̂o and θ̄λN are identical. Since ‖(θ̄λN)𝒮̄‖0 ≤ s̃ and θ̂o𝒮̄ = 0, we have

|\mathrm{supp}(\hat{\theta}^o) \cup \mathrm{supp}(\bar{\theta}_{\lambda_N})| \le s^* + \tilde{s}.

By the restricted strong convexity of ℱλN, we have

\mathcal{F}_{\lambda_N}(\bar{\theta}_{\lambda_N}) \ge \mathcal{F}_{\lambda_N}(\hat{\theta}^o) + (\nabla\widetilde{\mathcal{L}}_{\lambda_N}(\hat{\theta}^o) + \lambda_N\tilde{\xi}^o)^\top(\bar{\theta}_{\lambda_N} - \hat{\theta}^o) + \frac{\tilde{\rho}(s^*+\tilde{s})}{2}\|\bar{\theta}_{\lambda_N} - \hat{\theta}^o\|_2^2 = \mathcal{F}_{\lambda_N}(\hat{\theta}^o) + \frac{\tilde{\rho}(s^*+\tilde{s})}{2}\|\bar{\theta}_{\lambda_N} - \hat{\theta}^o\|_2^2, (G.9)
\mathcal{F}_{\lambda_N}(\hat{\theta}^o) \ge \mathcal{F}_{\lambda_N}(\bar{\theta}_{\lambda_N}) + (\nabla\widetilde{\mathcal{L}}_{\lambda_N}(\bar{\theta}_{\lambda_N}) + \lambda_N\tilde{\xi})^\top(\hat{\theta}^o - \bar{\theta}_{\lambda_N}) + \frac{\tilde{\rho}(s^*+\tilde{s})}{2}\|\hat{\theta}^o - \bar{\theta}_{\lambda_N}\|_2^2 = \mathcal{F}_{\lambda_N}(\bar{\theta}_{\lambda_N}) + \frac{\tilde{\rho}(s^*+\tilde{s})}{2}\|\bar{\theta}_{\lambda_N} - \hat{\theta}^o\|_2^2, (G.10)

where ξ̃ and ξ̃o are defined as

\tilde{\xi} = \mathop{\mathrm{argmin}}_{\xi\in\partial\|\bar{\theta}_{\lambda_N}\|_1}\|\nabla\widetilde{\mathcal{L}}_{\lambda_N}(\bar{\theta}_{\lambda_N}) + \lambda_N\xi\|_\infty \quad\text{and}\quad \tilde{\xi}^o = \mathop{\mathrm{argmin}}_{\xi\in\partial\|\hat{\theta}^o\|_1}\|\nabla\widetilde{\mathcal{L}}_{\lambda_N}(\hat{\theta}^o) + \lambda_N\xi\|_\infty.

By combining (G.9) with (G.10), we have ‖θ̂o − θ̄λN‖₂² = 0, i.e., θ̂o = θ̄λN. Note that we choose λN = c4√(2π)M√(log d/n), which is different from the selected regularization parameter in Assumption 4.8. But as long as we have c4√s* ≥ 8, which is not an issue under the high dimensional scaling

M, s^*, n, d \to \infty \quad\text{and}\quad M s^*\sqrt{\frac{\log d}{n}} \to 0,

λN ≥ 8‖∇ℒ(θ*)‖∞ still holds with high probability. Since the above results hold universally over all columns of Θ̄λN and Θ* under Assumptions 4.1 and 4.2, by Lemmas 4.8 and 4.9, we obtain Θ̂o = Θ̄λN, which completes the proof.

Footnotes

1. In our numerical experiments, PCDA is implemented by the R package "ncvreg".

References

1. Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. The Journal of Machine Learning Research. 2008;9:485–516.
2. Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Transactions on Image Processing. 2009a;18:2419–2434. doi: 10.1109/TIP.2009.2028250.
3. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009b;2:183–202.
4. Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics. 2011;5:232–253. doi: 10.1214/10-AOAS388.
5. Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
6. Dennis JE Jr, Schnabel RB. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Vol. 16. SIAM; 1983.
7. Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B. 2012;74:745–771. doi: 10.1111/j.1467-9868.2012.01029.x.
8. Fan J, Feng Y, Wu Y. Network exploration via the adaptive lasso and SCAD penalties. The Annals of Applied Statistics. 2009;3:521–541. doi: 10.1214/08-AOAS215SUPP.
9. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
10. Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics. 2014;42:819–849. doi: 10.1214/13-AOS1198.
11. Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
12. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
13. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–13.
14. Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
15. Han F, Liu H. Statistical analysis of latent generalized correlation matrix estimation in transelliptical distribution. Bernoulli. 2015 (accepted). doi: 10.3150/15-BEJ702.
16. Han F, Zhao T, Liu H. CODA: High dimensional copula discriminant analysis. Journal of Machine Learning Research. 2012;14:629–671.
17. Jacob L, Obozinski G, Vert J-P. Group lasso with overlap and graph lasso. In: Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
18. Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99:315–325.
19. Ledoux M. The Concentration of Measure Phenomenon. Vol. 89. AMS Bookstore; 2005.
20. Li X, Zhao T, Yuan X, Liu H. The "flare" package for high-dimensional sparse linear regression in R. Journal of Machine Learning Research. 2015;16:553–557.
21. Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics. 2012a;40:2293–2326.
22. Liu H, Han F, Zhang C-H. Transelliptical graphical models. In: Advances in Neural Information Processing Systems 25; 2012b.
23. Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In: Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
24. Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (StARS) for high dimensional graphical models. In: Advances in Neural Information Processing Systems; 2010.
25. Liu H, Wang L, Zhao T. Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics. 2014;23:439–459. doi: 10.1080/10618600.2013.782818.
26. Liu H, Wang L, Zhao T. Calibrated multivariate regression with application to neural semantic basis discovery. Journal of Machine Learning Research. 2015;16:1579–1606.
27. Liu W, Luo X. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis. 2015;135:153–162. doi: 10.1016/j.jmva.2014.11.005.
28. Lu Z, Xiao L. Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. arXiv preprint arXiv:1306.5918. 2013.
29. Mazumder R, Friedman JH, Hastie T. SparseNet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106:1125–1138. doi: 10.1198/jasa.2011.tm09738.
30. Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B. 2008;70:53–71.
31. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
32. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B. 2010;72:417–473.
33. Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37:246–270.
34. Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics. 2011;39:1069–1097.
35. Negahban SN, Ravikumar P, Wainwright MJ, Yu B. A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science. 2012;27:538–557.
36. Nesterov Y. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Matematicheskie Metody. 1988;24:509–517.
37. Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–152.
38. Nesterov Y. Gradient methods for minimizing composite objective function. Mathematical Programming, Series B. 2013;140:125–161.
39. Nocedal J, Wright S. Numerical Optimization. Springer Series in Operations Research and Financial Engineering. New York: Springer; 2006.
40. Qin Z, Scheinberg K, Goldfarb D. Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation. 2010:1–27.
41. Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
42. Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
43. Shalev-Shwartz S, Tewari A. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research. 2011;12:1865–1892.
44. Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association. 2012;107:223–232. doi: 10.1080/01621459.2011.645783.
45. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
46. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B. 2005;67:91–108.
47. Tseng P, Yun S. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications. 2009a;140:513–535.
48. Tseng P, Yun S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming. 2009b;117:387–423.
49. Van de Geer SA. High-dimensional generalized linear models and the lasso. The Annals of Statistics. 2008;36:614–645.
50. Wainwright M. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory. 2009;55:2183–2201.
51. Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. The Annals of Statistics. 2013;41:2505–2536. doi: 10.1214/13-AOS1159.
52. Wang Z, Liu H, Zhang T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics. 2014;42:2164–2201. doi: 10.1214/14-AOS1238.
53. Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics. 2008;2:224–244.
54. Xue L, Zou H, Cai T. Nonconcave penalized composite conditional likelihood estimation of sparse Ising models. The Annals of Statistics. 2012;40:1403–1429.
55. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2005;68:49–67.
56. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94:19–35.
57. Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010a;38:894–942.
58. Zhang C-H, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36:1567–1594.
59. Zhang C-H, Zhang T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 2012;27:576–593.
60. Zhang T. Some sharp performance bounds for least squares regression with ℓ1 regularization. The Annals of Statistics. 2009;37:2109–2144.
61. Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research. 2010b;11:1081–1107.
62. Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
63. Zhao T, Liu H. Sparse additive machine. In: International Conference on Artificial Intelligence and Statistics; 2012.
64. Zhao T, Liu H. Calibrated precision matrix estimation for high-dimensional elliptical distributions. IEEE Transactions on Information Theory. 2014;60:7874. doi: 10.1109/TIT.2014.2360980.
65. Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research. 2012;13:1059–1062.
66. Zhao T, Liu H, Zhang T. A general theory of pathwise coordinate optimization. arXiv preprint arXiv:1412.7477. 2014a.
67. Zhao T, Roeder K, Liu H. Positive semidefinite rank-based correlation matrix estimation with application to semiparametric graph estimation. Journal of Computational and Graphical Statistics. 2014b;23:895–922. doi: 10.1080/10618600.2013.858633.
68. Zhao T, Yu M, Wang Y, Arora R, Liu H. Accelerated mini-batch randomized block coordinate descent method. In: Advances in Neural Information Processing Systems; 2014c.
69. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
70. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67:301–320.
