Abstract
We propose an accelerated path-following iterative shrinkage thresholding algorithm (APISTA) for solving high dimensional sparse nonconvex learning problems. The main difference between APISTA and the path-following iterative shrinkage thresholding algorithm (PISTA) is that APISTA exploits an additional coordinate descent subroutine to boost the computational performance. Such a modification, though simple, has a profound impact: APISTA not only enjoys the same theoretical guarantee as PISTA, i.e., it attains a linear rate of convergence to a unique sparse local optimum with good statistical properties, but also significantly outperforms PISTA in empirical benchmarks. As an application, we apply APISTA to solve a family of nonconvex optimization problems motivated by estimating sparse semiparametric graphical models. APISTA allows us to obtain new statistical recovery results that do not exist in the previous literature. Thorough numerical results are provided to back up our theory.
1 Introduction
High dimensional data challenge both statistics and computation. In the statistics community, researchers have proposed a large family of regularized M-estimators, including the Lasso, Group Lasso, Fused Lasso, Graphical Lasso, Sparse Column Inverse Operator, Sparse Multivariate Regression, and Sparse Linear Discriminant Analysis (Tibshirani, 1996; Zou and Hastie, 2005; Yuan and Lin, 2005, 2007; Banerjee et al., 2008; Tibshirani et al., 2005; Jacob et al., 2009; Fan et al., 2012; Liu and Luo, 2015; Han et al., 2012; Liu et al., 2015). Theoretical analyses of these methods usually rely on the sparsity of the parameter space and require the resulting optimization problems to be strongly convex over a restricted parameter space. More details can be found in Meinshausen and Bühlmann (2006); Zhao and Yu (2006); Zou (2006); Rothman et al. (2008); Zhang and Huang (2008); Van de Geer (2008); Zhang (2009); Meinshausen and Yu (2009); Wainwright (2009); Fan et al. (2009); Zhang (2010a); Ravikumar et al. (2011); Liu et al. (2012a); Negahban et al. (2012); Han et al. (2012); Kim and Kwon (2012); Shen et al. (2012). In the optimization community, researchers have proposed a large variety of computational algorithms including the proximal gradient methods (Nesterov, 1988, 2005, 2013; Beck and Teboulle, 2009b,a; Zhao and Liu, 2012; Liu et al., 2015) and coordinate descent methods (Fu, 1998; Friedman et al., 2007; Wu and Lange, 2008; Friedman et al., 2008; Meier et al., 2008; Liu et al., 2009; Friedman et al., 2010; Qin et al., 2010; Mazumder et al., 2011; Breheny and Huang, 2011; Shalev-Shwartz and Tewari, 2011; Zhao et al., 2014c).
Recently, Wang et al. (2014) propose the path-following iterative soft shrinkage thresholding algorithm (PISTA), which combines the proximal gradient algorithm with a path-following optimization scheme. By exploiting the solution sparsity and restricted strong convexity, they show that PISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties for solving a large class of sparse nonconvex learning problems. However, though PISTA has superior theoretical properties, its empirical performance is in general not as good as that of some heuristic competing methods such as the path-following coordinate descent algorithm (PCDA) (Tseng and Yun, 2009b,a; Lu and Xiao, 2013; Friedman et al., 2010; Mazumder et al., 2011; Zhao et al., 2012, 2014a). To address this concern, we propose a new computational algorithm named APISTA (Accelerated Path-following Iterative Shrinkage Thresholding Algorithm). More specifically, we exploit an additional coordinate descent subroutine to help PISTA efficiently decrease the objective value in each iteration. This makes APISTA significantly outperform PISTA in practice. Meanwhile, the coordinate descent subroutine preserves the solution sparsity and restricted strong convexity; therefore APISTA enjoys the same theoretical guarantees as PISTA, i.e., APISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties. As an application, we apply APISTA to a family of nonconvex optimization problems motivated by estimating semiparametric graphical models (Liu et al., 2012b; Zhao and Liu, 2014). APISTA allows us to obtain new sparse recovery results on graph estimation consistency which have not been established before. Thorough numerical results are presented to back up our theory.
NOTATIONS
Let υ = (υ1, …, υd)T ∈ ℝd. We define ‖υ‖1 = ∑j |υj|, ‖υ‖2 = √(∑j υj²), and ‖υ‖∞ = maxj |υj|. We denote the number of nonzero entries in υ as ‖υ‖0 = ∑j 𝟙(υj ≠ 0). We define the soft-thresholding operator as 𝒮λ(υj) = sign(υj) · max(|υj| − λ, 0) for any λ ≥ 0. Given a matrix A ∈ ℝd×d, we use A*j = (A1j, …, Adj)T to denote the jth column of A, and Ak* = (Ak1, …, Akd)T to denote the kth row of A. Let Λmax(A) and Λmin(A) denote the largest and smallest eigenvalues of A. Let ψ1(A), …, ψd(A) be the singular values of A; we define the following matrix norms: ‖A‖F = √(∑j ψj²(A)), ‖A‖max = maxj ‖A*j‖∞, ‖A‖1 = maxj ‖A*j‖1, ‖A‖2 = maxj ψj(A), ‖A‖∞ = maxk ‖Ak*‖1. We denote υ\j = (υ1, …, υj−1, υj+1, …, υd)T ∈ ℝd−1 as the subvector of υ with the jth entry removed. We denote A\i\j as the submatrix of A with the ith row and the jth column removed. We denote Ai\j to be the ith row of A with its jth entry removed. Let 𝒜 ⊆ {1, …, d}; we use υ𝒜 to denote a subvector of υ by extracting all entries of υ with indices in 𝒜, and A𝒜𝒜 to denote a submatrix of A by extracting all entries of A with both row and column indices in 𝒜.
2 Background and Problem Setup
Let θ ∈ ℝd be a parameter vector to be estimated. We are interested in solving a class of regularized optimization problems in a generic form:
minθ∈ℝd ℱλ(θ) = ℒ(θ) + ℛλ(θ),   (2.1)
where ℒ(θ) is a smooth loss function and ℛλ(θ) is a nonsmooth regularization function with a regularization parameter λ.
2.1 Sparsity-inducing Nonconvex Regularization Functions
For high dimensional problems, we exploit sparsity-inducing regularization functions, which are usually continuous and decomposable with respect to each coordinate, i.e., ℛλ(θ) = ∑j rλ(θj). For example, the widely used ℓ1 norm regularization decomposes as ℛλ(θ) = λ‖θ‖1 = ∑j λ|θj|. One drawback of the ℓ1 norm is that it incurs large estimation bias when |θj| is large. This motivates the usage of nonconvex regularizers. Examples include the SCAD (Fan and Li, 2001) regularization
and MCP (Zhang, 2010a) regularization
Both SCAD and MCP can be written as the sum of an ℓ1 norm and a concave function ℋλ(θ), i.e., ℛλ(θ) = λ‖θ‖1 + ℋλ(θ). It is easy to see that ℋλ(θ) is also decomposable with respect to each coordinate, i.e., ℋλ(θ) = ∑j hλ(θj). More specifically, the SCAD regularization has
and the MCP regularization has
In general, the concave function hλ(·) is smooth and symmetric about zero with hλ(0) = 0 and h′λ(0) = 0. Its gradient h′λ(·) is monotone decreasing and Lipschitz continuous, i.e., for any θj, θ′j ∈ ℝ, there exists a constant α ≥ 0 such that
|h′λ(θj) − h′λ(θ′j)| ≤ α|θj − θ′j|.   (2.2)
Moreover, we require if |θj| ≥ λβ, and if |θj| ≤ λβ.
It is easy to verify that both SCAD and MCP satisfy the above properties. In particular, the SCAD regularization has α = 1/(β − 1), and the MCP regularization has α = 1/β. These nonconvex regularization functions have been shown to achieve better asymptotic behavior than the convex ℓ1 regularization. More technical details can be found in Fan and Li (2001); Zhang (2010a, b); Zhang and Zhang (2012); Fan et al. (2014); Xue et al. (2012); Wang et al. (2014, 2013); Liu et al. (2014). We present several illustrative examples of the nonconvex regularizers in Figure 2.1.
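Since the displayed forms of hλ(·) above are elided, the following sketch writes out the gradients of the standard SCAD and MCP concave parts, using β as the SCAD/MCP parameter as in the text. The function names are ours and the code is only an illustration, not the authors' implementation.

```python
import numpy as np

def h_scad_grad(theta, lam, beta):
    """Gradient of the concave part h_lam(.) of the SCAD penalty (requires beta > 2).
    Its Lipschitz constant is alpha = 1 / (beta - 1), matching the text."""
    theta = np.asarray(theta, dtype=float)
    t, s = np.abs(theta), np.sign(theta)
    g = np.zeros_like(theta)
    mid = (t > lam) & (t <= beta * lam)
    g[mid] = s[mid] * (lam - t[mid]) / (beta - 1.0)   # partially cancels the l1 gradient
    g[t > beta * lam] = -lam * s[t > beta * lam]      # fully cancels the l1 gradient
    return g

def h_mcp_grad(theta, lam, beta):
    """Gradient of the concave part h_lam(.) of the MCP penalty (requires beta > 1).
    Its Lipschitz constant is alpha = 1 / beta, matching the text."""
    theta = np.asarray(theta, dtype=float)
    return np.where(np.abs(theta) <= beta * lam, -theta / beta, -lam * np.sign(theta))
```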
2.2 Nonconvex Loss Function
A motivating application of the method proposed in this paper is sparse transelliptical graphical model estimation (Liu et al., 2012b). The transelliptical graphical model is a semiparametric graphical modeling tool for exploring the relationships between a large number of variables. We start with a brief review of the transelliptical distribution, defined below.
Definition 2.1 (Transelliptical Distribution)
Let {f1, …, fd} be a set of strictly increasing univariate functions. Given a positive semidefinite matrix Σ* ∈ ℝd×d with rank(Σ*) = r ≤ d and Σ*jj = 1 for j = 1, …, d, we say that a d-dimensional random vector X = (X1, …, Xd)T follows a transelliptical distribution, denoted as X ~ TEd(Σ*, ξ, f1, …, fd), if X has the stochastic representation
(f1(X1), …, fd(Xd))T =d ξAU,
where Σ* = AAT, U ∈ 𝕊r − 1 is uniformly distributed on the unit sphere in ℝr, and ξ ≥ 0 is a continuous random variable independent of U.
Note that Σ* in Definition 2.1 is not necessarily the correlation matrix of X. To interpret Σ*, Liu et al. (2012b) provide a latent Gaussian representation for the transelliptical distribution, which implies that the sparsity pattern of Θ* = (Σ*)−1 encodes the graph structure of some underlying Gaussian distribution. Since Σ* needs to be invertible, we have r = d. To estimate Θ*, Liu et al. (2012b) suggest directly plugging the following transformed Kendall’s tau estimator into existing Gaussian graphical model estimation procedures.
Definition 2.2 (Transformed Kendall’s tau Estimator)
Let x1, …, xn ∈ ℝd be n independent observations of X = (X1, …, Xd)T, where xi = (xi1, …, xid)T. The transformed Kendall’s tau estimator Ŝ ∈ ℝd×d is defined as Ŝkj = sin(π τ̂kj / 2), where τ̂kj is the empirical Kendall’s tau statistic between Xk and Xj defined as
τ̂kj = 2/(n(n − 1)) ∑i<i′ sign(xik − xi′k) · sign(xij − xi′j).
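For concreteness, a small Python sketch of the transformed Kendall's tau estimator is given below, using scipy's kendalltau for the pairwise statistics; it is an illustrative implementation rather than the one used in the paper's experiments.

```python
import numpy as np
from scipy.stats import kendalltau

def transformed_kendall_tau(X):
    """Transformed Kendall's tau estimator: S_hat[k, j] = sin(pi * tau_hat[k, j] / 2),
    with ones on the diagonal, from an n x d data matrix X."""
    n, d = X.shape
    S_hat = np.eye(d)
    for k in range(d):
        for j in range(k + 1, d):
            tau, _ = kendalltau(X[:, k], X[:, j])
            S_hat[k, j] = S_hat[j, k] = np.sin(np.pi * tau / 2.0)
    return S_hat
```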
We then adopt the sparse column inverse operator to estimate the jth column of Θ*. In particular, we solve the following regularized quadratic optimization problem (Liu and Luo, 2015),
minΘ*j∈ℝd (1/2) Θ*jT Ŝ Θ*j − I*jT Θ*j + ℛλ(Θ*j).   (2.3)
For notational simplicity, we omit the column index j in (2.3), and denote Θ*j and I*j by θ and e respectively. Throughout the rest of this paper, if not specified, we study the following optimization problem for the transelliptical graph estimation
minθ∈ℝd ℱλ(θ) = (1/2) θT Ŝ θ − eT θ + ℛλ(θ).   (2.4)
The quadratic loss function used in (2.4) is twice differentiable with ∇ℒ(θ) = Ŝθ − e and ∇2ℒ(θ) = Ŝ.
Since the transformed Kendall’s tau estimator is rank-based and could be indefinite (Zhao et al., 2014b), the optimization problem in (2.3) may not be convex even if ℛλ(θ) is convex.
Remark 2.1
It is worth mentioning that the indefiniteness of Ŝ also makes (2.3) unbounded from below. But as will be shown later, our proposed algorithm can still guarantee a unique sparse local solution with optimal statistical properties under suitable conditions.
Remark 2.2
To handle the possible nonconvexity, Liu et al. (2012b) estimate Θ* based on a graphical model estimation procedure proposed in Cai et al. (2011) as follows,
(2.5) |
(2.5) is convex regardless of the indefiniteness of Ŝ. But a major disadvantage of (2.5) is its computational cost: existing solvers can only solve (2.5) up to moderate dimensions. We will present more empirical comparisons between (2.3) and (2.5) in our numerical experiments.
3 Method
For notational convenience, we rewrite the objective function ℱλ(θ) as
ℱλ(θ) = ℒ(θ) + ℋλ(θ) + λ‖θ‖1 = ℒ̃λ(θ) + λ‖θ‖1, where ℒ̃λ(θ) = ℒ(θ) + ℋλ(θ).
We call ℒ̃λ(θ) the augmented loss function, which is smooth but possibly nonconvex. We first introduce the path-following optimization scheme, which is a multistage optimization framework and also used in PISTA.
3.1 Path-following Optimization Scheme
The path-following optimization scheme solves the regularized optimization problem (2.1) using a decreasing sequence of N + 1 regularization parameters {λ0, λ1, …, λN}, and yields a sequence of N + 1 output solutions {θ̂{0}, θ̂{1}, …, θ̂{N}} from sparse to dense. We set the initial tuning parameter as λ0 = ‖∇ℒ(0)‖∞. By checking the KKT condition of (2.1) for λ0, we have
minξ∈∂‖0‖1 ‖∇ℒ̃λ0(0) + λ0ξ‖∞ = minξ∈∂‖0‖1 ‖∇ℒ(0) + λ0ξ‖∞ = 0,   (3.1)
where the second equality comes from ‖ξ‖∞ ≤ 1 and as introduced in §2.1. Since (3.1) indicates that 0 is a local solution to (2.1) for λ0, we take the leading output solution as θ̂{0} = 0. Let η ∈ (0, 1), we set λK = ηλK − 1 for K = 1, …, N. We then solve (2.1) for the regularization parameter λK with θ̂{K − 1} as the initial solution, which leads to the next output solution θ̂{K}. The path-following optimization scheme is illustrated in Algorithm 1.
3.2 Accelerated Iterative Shrinkage Thresholding Algorithm
We then explain the accelerated iterative shrinkage thresholding (AISTA) subroutine, which solves (2.1) in each stage of the path-following optimization scheme. For notational simplicity, we omit the stage index K, and only consider the iteration index m of AISTA. Suppose that AISTA takes some initial solution θ[0] and an initial step size parameter L[0], and we want to solve (2.1) with the regularization parameter λ. Then at the mth iteration of AISTA, we already have L[m] and θ[m]. Each iteration of AISTA contains two steps: The first one is the proximal gradient descent iteration, and the second one is the coordinate descent subroutine.
Algorithm 1.
Algorithm: |
Parameter: η, Lmin |
Initialize: λ0 = ‖∇ℒ(0)‖∞, θ̂{0} ← 0, L̂{0} ← Lmin |
For: K = 0, …, N − 1 |
λK+1 ← ηλK, {θ̂{K+1}, L̂{K+1}} ← AISTA(λK+1, θ̂{K}, L̂{K}) |
End for |
Output: {θ̂{0}, θ̂{1}, …, θ̂{N}} |
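The outer loop of Algorithm 1 can be sketched as follows; the aista callable stands for the subroutine described in §3.2, and its assumed signature aista(lam, theta_init, L_init) returning (theta_hat, L_hat) mirrors the call in Algorithm 1.

```python
import numpy as np

def path_following(grad_loss_at_zero, aista, eta=0.9, n_stages=100, L_min=1.0):
    """Path-following scheme (Algorithm 1): geometrically decreasing regularization
    parameters with warm starts from one stage to the next."""
    lam = np.max(np.abs(grad_loss_at_zero))      # lambda_0 = ||grad L(0)||_inf
    theta = np.zeros_like(grad_loss_at_zero)     # theta_hat{0} = 0 is a local solution at lambda_0
    L = L_min
    path = [(lam, theta.copy())]
    for _ in range(n_stages):
        lam = eta * lam                          # lambda_{K+1} = eta * lambda_K
        theta, L = aista(lam, theta, L)          # warm start from the previous stage
        path.append((lam, theta.copy()))
    return path
```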
(I) Proximal Gradient Descent Iteration
We consider the following quadratic approximation of ℱλ(θ) at θ = θ[m]:
𝒬λ,L[m+1](θ; θ[m]) = ℒ̃λ(θ[m]) + ∇ℒ̃λ(θ[m])T(θ − θ[m]) + (L[m+1]/2)‖θ − θ[m]‖2² + λ‖θ‖1,
where L[m+1] is the step size parameter such that 𝒬λ,L[m+1] (θ; θ[m]) ≥ ℱλ(θ). We then take a proximal gradient descent iteration and obtain θ[m+0.5] by
θ[m+0.5] = argminθ 𝒬λ,L[m+1](θ; θ[m]) = argminθ (L[m+1]/2)‖θ − θ̃[m]‖2² + λ‖θ‖1,   (3.2)
where θ̃[m] = θ[m] − ∇ℒ̃λ(θ[m])/L[m+1]. For notational simplicity, we write
𝒯λ,L[m+1](θ[m]) = argminθ 𝒬λ,L[m+1](θ; θ[m]).   (3.3)
For the sparse column inverse operator, we can obtain a closed form solution to (3.2) by soft thresholding: θj[m+0.5] = sign(θ̃j[m]) · max(|θ̃j[m]| − λ/L[m+1], 0) for j = 1, …, d.
The step size 1/L[m+1] can be obtained by the backtracking line search. In particular, we start with a small enough L[0]. Then in each iteration of the middle loop, we choose the minimum nonnegative integer z such that L[m+1] = 2^z L[m] satisfies
ℱλ(𝒯λ,L[m+1](θ[m])) ≤ 𝒬λ,L[m+1](𝒯λ,L[m+1](θ[m]); θ[m]).   (3.4)
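A minimal sketch of the proximal gradient step for the sparse column inverse operator is given below, assuming the gradient ∇ℒ̃λ(θ) = Ŝθ − e + ∇ℋλ(θ) from §2.2; h_grad can be any coordinate-wise gradient of ℋλ (e.g., h_mcp_grad from the sketch in §2.1).

```python
import numpy as np

def soft_threshold(x, thr):
    """Coordinate-wise soft thresholding S_thr(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def prox_step(theta, S_hat, e, h_grad, lam, L):
    """One proximal gradient step (3.2)-(3.3) for the sparse column inverse operator:
    gradient of the augmented loss is S_hat @ theta - e + h_grad(theta), followed by
    soft thresholding at level lam / L."""
    grad = S_hat @ theta - e + h_grad(theta)
    theta_tilde = theta - grad / L
    return soft_threshold(theta_tilde, lam / L)
```

In the backtracking version, L is doubled until (3.4) holds; with the fixed step size of Remark 3.1 that inner loop is skipped.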
(II) Coordinate Descent Subroutine
Unlike the proximal gradient algorithm, which repeats (3.3) until convergence at each stage of the path-following optimization scheme, AISTA exploits an additional coordinate descent subroutine to further boost the computational performance. More specifically, we define the active set 𝒜 = supp(θ[m+0.5]) and solve the following optimization problem
minθ∈ℝd ℱλ(θ) subject to θ𝒜⊥ = 0,   (3.5)
using the cyclic coordinate descent algorithm (CCDA) initiated by θ[m+0.5]. For notational simplicity, we omit the stage index K and iteration index m, and only consider the iteration index t of CCDA. Suppose that the CCDA algorithm takes some initial solution θ(0) for solving (2.1) with the regularization parameter λ. Without loss of generality, we denote 𝒜 = {1, …, |𝒜|}. At the tth iteration, we have θ(t). Then at the (t + 1)th iteration, we conduct the coordinate minimization cyclically over all active coordinates. Let w(t+1,k) be an auxiliary solution of the (t + 1)th iteration with the first k − 1 coordinates updated. For k = 1, we have w(t+1,1) = θ(t). We then update the kth coordinate to obtain the next auxiliary solution w(t+1,k+1).
More specifically, let ∇kℒ̃λ(θ) be the kth entry of ∇ℒ̃λ(θ). We minimize the objective function with respect to each selected coordinate and keep all other coordinates fixed,
(3.6) |
Once we obtain the minimizer of (3.6), we set it as the kth coordinate, keeping all other coordinates unchanged, to obtain the next auxiliary solution w(t+1,k+1). For the sparse column inverse operator, the coordinate-wise objective in (3.6) can be written as
(3.7) |
where the last equality comes from the fact that Ŝkk = 1 for all k = 1, …, d. By setting the subgradient of (3.7) equal to zero, we can obtain the minimizer in closed form as follows (see also the sketch after Algorithm 2):
For the ℓ1 norm regularization, we have .
- For the SCAD regularization, we have
- For the MCP regularization, we have
When all |𝒜| coordinate updates in the (t + 1)th iteration of CCDA finish, we set θ(t+1) = w(t+1,|𝒜|+1). We summarize CCDA in Algorithm 2. Once CCDA terminates, we denote its output solution by θ[m+1], and start the next iteration of AISTA. We summarize AISTA in Algorithm 3.
Algorithm 2.
Algorithm: θ̂ ← CCDA(λ, θ(0)). |
Initialize: t ← 0, 𝒜 = supp(θ(0)) |
Repeat: |
w(t+1,1) ← θ(t) |
For k = 1, …, |𝒜| |
w(t+1,k+1) ← update the kth coordinate of w(t+1,k) via (3.6) |
End for |
θ(t+1) ← w(t+1,|𝒜|+1), t ← t + 1 |
Until convergence |
θ̂ ← θ(t) |
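Below is a Python sketch of the CCDA subroutine for the sparse column inverse operator. Since the closed-form coordinate updates above are elided, the sketch uses the standard univariate ℓ1/SCAD/MCP solutions for a unit-curvature quadratic (Ŝkk = 1); treat these expressions and the simple stopping rule as assumptions consistent with, but not copied from, the paper.

```python
import numpy as np

def soft(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def coord_update(z, lam, beta, penalty="mcp"):
    """Minimizer of 0.5 * (w - z)**2 + penalty_lam(|w|) with unit curvature (S_hat_kk = 1)."""
    if penalty == "l1":
        return soft(z, lam)
    if penalty == "mcp":                               # requires beta > 1
        return soft(z, lam) / (1.0 - 1.0 / beta) if abs(z) <= beta * lam else z
    if penalty == "scad":                              # requires beta > 2
        if abs(z) <= 2.0 * lam:
            return soft(z, lam)
        if abs(z) <= beta * lam:
            return soft(z, beta * lam / (beta - 1.0)) * (beta - 1.0) / (beta - 2.0)
        return z
    raise ValueError("unknown penalty: " + penalty)

def ccda(S_hat, e, theta0, lam, beta, penalty="mcp", delta0=1e-5, max_iter=1000):
    """Cyclic coordinate descent (Algorithm 2 sketch) over the support of theta0."""
    theta = np.asarray(theta0, dtype=float).copy()
    active = np.flatnonzero(theta)                     # A = supp(theta^(0))
    for _ in range(max_iter):
        max_change = 0.0
        for k in active:
            # partial residual excluding the k-th coordinate's own contribution
            z = e[k] - S_hat[k] @ theta + S_hat[k, k] * theta[k]
            new = coord_update(z, lam, beta, penalty)
            max_change = max(max_change, abs(new - theta[k]))
            theta[k] = new
        if max_change <= delta0 * lam:                 # stop on small coordinate changes
            break
    return theta
```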
Remark 3.1
The backtracking line search procedure in PISTA has been extensively studied in the existing optimization literature on adaptive step size selection (Dennis and Schnabel, 1983; Nocedal and Wright, 2006), especially for proximal gradient algorithms (Beck and Teboulle, 2009b,a; Nesterov, 2013). Many empirical results have corroborated its better computational performance compared with using a fixed step size. But unlike the classical proximal gradient algorithms, APISTA can efficiently reduce the objective value by the coordinate descent subroutine in each iteration. Therefore we can simply choose a constant step size parameter L such that
L ≥ maxθ Λmax(∇2ℒ(θ)).   (3.8)
The step size parameter L in (3.8) guarantees 𝒬λ,L(θ; θ[m]) ≥ ℱλ(θ) in each iteration of AISTA. For the sparse column inverse operator, ∇2ℒ(θ) = Ŝ does not depend on θ. Therefore we choose L = Λmax(Ŝ).
Algorithm 3.
Algorithm: {θ̂, L̂} ← AISTA(λ, θ[0], L[0]) |
Initialize: m ← 0 |
Repeat: |
z ← 0 |
Repeat: |
L[m+1] ← 2^z L[m], θ[m+0.5] ← 𝒯λ,L[m+1] (θ[m]), z ← z + 1 |
Until: 𝒬λ,L[m+1] (θ[m+0.5]; θ[m]) ≥ ℱλ(θ[m+0.5]) |
θ[m+1] ← CCDA(λ, θ[m+0.5]), m ← m + 1 |
Until convergence |
θ̂ ← θ[m−0.5], L̂ ← L[m] |
Output: {θ̂, L̂} |
Our numerical experiments show that choosing a fixed step size not only simplifies the implementation, but also attains better empirical computational performance than the backtracking line search. See more details in §5.
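Combining the previous sketches, a fixed-step AISTA iteration might look as follows. The constant step size L = Λmax(Ŝ) follows Remark 3.1, while the simple iterate-change stopping rule (instead of the KKT residual ωλ of §3.3) and the use of MCP are simplifications of ours; the sketch reuses prox_step, h_mcp_grad, and ccda from the earlier sketches.

```python
import numpy as np

def aista_fixed_step(S_hat, e, lam, theta0, beta, L=None, tol=1e-5, max_iter=500):
    """AISTA sketch with the constant step size L = Lambda_max(S_hat) of Remark 3.1:
    one proximal step followed by the CCDA subroutine, repeated until convergence."""
    if L is None:
        L = np.max(np.linalg.eigvalsh(S_hat))          # fixed step size parameter
    h_grad = lambda t: h_mcp_grad(t, lam, beta)        # gradient of the concave part (MCP)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        theta_half = prox_step(theta, S_hat, e, h_grad, lam, L)   # theta^[m+0.5]
        theta_new = ccda(S_hat, e, theta_half, lam, beta)         # theta^[m+1]
        converged = np.max(np.abs(theta_new - theta)) <= tol * lam
        theta = theta_new
        if converged:
            break
    return theta, L
```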
3.3 Stopping Criteria
Since θ is a local minimum if and only if the KKT condition minξ∈∂‖θ‖1 ‖∇ℒ̃λ(θ) + λξ‖∞ = 0 holds, we terminate AISTA when
ωλ(θ[m]) = minξ∈∂‖θ[m]‖1 ‖∇ℒ̃λ(θ[m]) + λξ‖∞ ≤ ε,   (3.9)
where ε is the target precision and usually proportional to the regularization parameter. More specifically, given the regularization parameter λK, we have
ε = δKλK,   (3.10)
where δK ∈ (0, 1) is a convergence parameter for the Kth stage of the path-following optimization scheme. Moreover, for CCDA, we terminate the iteration when
(3.11) |
where δ0 ∈ (0, 1) is a convergence parameter. This stopping criterion is natural to the sparse coordinate descent algorithm, since we only need to calculate the value change of each coordinate (not the gradient). We will discuss how to choose δK’s and δ0 in §4.1.
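The KKT residual ωλ(θ) in (3.9) has a simple coordinate-wise form, sketched below; grad_aug_loss is assumed to be the already-computed gradient ∇ℒ̃λ(θ).

```python
import numpy as np

def kkt_residual(theta, grad_aug_loss, lam):
    """omega_lam(theta): minimum over subgradients xi of ||theta||_1 of
    ||grad_aug_loss + lam * xi||_inf, computed coordinate-wise."""
    g = np.asarray(grad_aug_loss, dtype=float)
    theta = np.asarray(theta, dtype=float)
    r = np.where(theta != 0,
                 np.abs(g + lam * np.sign(theta)),      # xi_j = sign(theta_j) is forced
                 np.maximum(np.abs(g) - lam, 0.0))      # best xi_j lies in [-1, 1]
    return float(np.max(r))
```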
4 Theory
Before we present the computational and statistical theories of APISTA, we introduce some additional assumptions. The first one is about the choice of regularization parameters.
Assumption 4.1
Let δK’s and η satisfy
where η is the rescaling parameter of the path-following optimization scheme, δK’s are the convergence parameters defined in (3.10), and δ0 is the convergence parameter defined in (3.11). We have the regularization parameters
Assumption 4.1 has been extensively studied in the existing literature on high dimensional statistical theory of regularized M-estimators (Rothman et al., 2008; Zhang and Huang, 2008; Negahban and Wainwright, 2011; Negahban et al., 2012). It requires the regularization parameters to be large enough such that irrelevant variables can be eliminated along the solution path. Though ‖∇ℒ(θ*)‖∞ cannot be explicitly calculated (θ* is unknown), we can exploit concentration inequalities to show that Assumption 4.1 holds with high probability (Ledoux, 2005). In particular, we will verify Assumption 4.1 for sparse transelliptical graphical model estimation in Lemma 4.8.
Before we proceed with our second assumption, we define the largest and smallest s-sparse eigenvalues of the Hessian matrix of the loss function as follows.
Definition 4.1
Given an integer s ≥ 1, we define the largest and smallest s-sparse eigenvalues of ∇2ℒ(θ) as
Largest s-Sparse Eigenvalue: ρ+(s) = sup{ vT∇2ℒ(θ)v / vTv : v ≠ 0, ‖v‖0 ≤ s },
Smallest s-Sparse Eigenvalue: ρ−(s) = inf{ vT∇2ℒ(θ)v / vTv : v ≠ 0, ‖v‖0 ≤ s }.
Moreover, we define ρ̃−(s) = ρ−(s) − α and ρ̃+(s) = ρ+(s) for notational simplicity, where α is defined in (2.2).
The next lemma shows the connection between the sparse eigenvalue conditions and restricted strongly convex and smooth conditions.
Lemma 4.1
Given ρ−(s) > 0, for any θ, θ′ ∈ ℝd with |supp(θ) ∪ supp(θ′)| ≤ s, we have
Moreover, if ρ−(s) > α, then we have
and for any ξ ∈ ∂‖θ‖1,
The proof of Lemma 4.1 is provided in Wang et al. (2014), therefore omitted. We then introduce the second assumption.
Assumption 4.2
Given ‖θ*‖0 ≤ s*, there exists an integer s̃ satisfying
where κ = ρ+(s* + 2s̃)/ρ̃−(s* + 2s̃).
Assumption 4.2 requires that ℒ̃λ(θ) satisfies the strong convexity and smoothness when θ is sparse. As will be shown later, APISTA can always guarantee that the number of irrelevant coordinates with nonzero values does not exceed s̃. Therefore the restricted strong convexity is preserved along the solution path. We will verify that Assumption 4.2 holds with high probability for the transelliptical graphical model estimation in Lemma 4.9.
Remark 4.2 (Step Size Initialization)
We take the initial step size parameter as Lmin ≥ ρ+(1). For sparse column inverse operator, we directly choose Lmin = ρ+(1) = 1.
4.1 Computational Theory
We develop the computational theory of APISTA. For notational simplicity, we define and for characterizing the solution sparsity. We first start with the convergence analysis for the cyclic coordinate descent algorithm (CCDA). The next theorem presents its rate of convergence in terms of the objective value.
Theorem 4.3 (Geometric Rate of Convergence of CCDA)
Suppose that Assumption 4.2 holds. Given a sparse initial solution satisfying , (3.5) is a strongly convex optimization problem with a unique global minimizer θ̄. Moreover, for t = 1, 2…, we have
The proof of Theorem 4.3 is provided in Appendix A. Theorem 4.3 suggests that when the initial solution is sparse, CCDA essentially solves a strongly convex optimization problem with a unique global minimizer. Consequently, we can establish the geometric rate of convergence in terms of the objective value for CCDA. We then proceed with the convergence analysis of AISTA. The next theorem presents its theoretical rate of convergence in terms of the objective value.
Theorem 4.4 (Geometric Rate of Convergence of AISTA)
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if the initial solution θ[0] satisfies
(4.1) |
then we have for m = 0.5, 1, 1.5, 2, …. Moreover, for m = 1, 2, …, we have
where θ̄λ is a unique sparse local solution to (2.1) satisfying ωλ(θ̄λ) = 0 and .
The proof of Theorem 4.4 is provided in Appendix B. Theorem 4.4 suggests that all solutions of AISTA are sparse, so that the restricted strongly convex and smooth conditions hold for all iterations. Therefore, AISTA attains the geometric rate of convergence in terms of the objective value. Theorem 4.4 requires a proper initial solution satisfying (4.1). This can be verified by the following theorem.
Theorem 4.5 (Path-following Optimization Scheme)
Suppose that Assumptions 4.1 and 4.2 hold. Given θ satisfying
(4.2) |
we have ωλK(θ) ≤ λK/2.
The proof of Theorem 4.5 is provided in Wang et al. (2014), therefore omitted. Since θ̂{0} naturally satisfies (4.2) for λ1, by Theorem 4.5 and induction, we can show that the path-following optimization scheme always guarantees that the output solution of the (K − 1)th stage is a proper initial solution for the Kth stage, where K = 1, …, N. Eventually, we combine Theorems 4.3 and 4.4 with Theorem 4.5, and establish the global geometric rate of convergence in terms of the objective value for APISTA in the next theorem.
Theorem 4.6 (Global Geometric Rate of Convergence of APISTA)
Suppose that Assumptions 4.1 and 4.2 hold. Recall that δ0 and δK’s are defined in §3.3, κ and s̃ are defined in Assumption 4.2, and α is defined in (2.2). We have the following results:
- At the Kth stage (K = 1, …, N), the number of coordinate descent iterations within each CCDA is at most C1 log (C2/δ0), where
- At the Kth stage (K = 1, …, N), the number of the proximal gradient iterations in each AISTA is at most C3 log (C4/δK), where
- To compute all N + 1 output solutions, the total number of coordinate descent iterations in APISTA is at most
(4.3) - At the Kth stage (K = 1, …, N), we have
The proof of Theorem 4.6 is provided in Appendix C. We then present a more intuitive explanation of Result (3). To secure the generalization performance in practice, we usually tune the regularization parameter over a refined sequence based on cross validation. In particular, we solve (2.1) using partial data with high precision for every regularization parameter. If we set δK = δoptλK for K = 1, …, N, where δopt is a very small value (e.g. 10−8), then we can rewrite (4.3) as
(4.4) |
where δ0 is some reasonably large value (e.g. 10−2) defined in §3.3. The iteration complexity in (4.4) depends on N.
Once the regularization parameter is selected, we still need to solve (2.1) using full data with some regularization sequence. But we only need high precision for the selected regularization parameter (e.g., λN), and for K = 1, …, N − 1, we only solve (2.1) for λK up to an adequate precision, e.g., δK = δ0 for K = 1, …, N − 1 and δN = δoptλN. Since 1/δopt is much larger than N, we can rewrite (4.3) as
(4.5) |
Now the iteration complexity in (4.5) does not depend on N.
Remark 4.7
To establish computational theories of APISTA with a fixed step size, we only need to slightly modify the proofs of Theorems 4.4 and 4.6 by replacing ρ+(s* + 2s̃) and ρ+(s* + s̃) by their upper bound L defined in (3.8). Then a global geometric rate of convergence can also be derived, but with a worse constant term.
4.2 Statistical Theory
We then establish the statistical theory of the SCIO estimator obtained by APISTA under transelliptical models. We use Θ* and Σ* to denote the true latent precision and covariance matrices. We assume that Θ* belongs to the following class of sparse, positive definite, and symmetric matrices:
where ψmax and ψmin are positive constants, and do not scale with (M, s*, n, d). Since Σ* = (Θ*)−1, we also have ψmin ≤ Λmin(Σ*) ≤ Λmax(Σ*) ≤ ψmax.
We first verify Assumptions 4.1 and 4.2 in the next two lemmas for transelliptical models.
Lemma 4.8
Suppose that . Given , we have
The proof of Lemma 4.8 is provided in Appendix D. Lemma 4.8 guarantees that the selected regularization parameter λN satisfies Assumption 4.1 with high probability.
Lemma 4.9
Suppose that . Given α = ψmin/2, there exist universal positive constants c1 and c2 such that for , with probability at least 1 − 2/d2, we have
where κ is defined in Assumption 4.2.
The proof of Lemma 4.9 is provided in Appendix E. Lemma 4.9 guarantees that if the Lipschitz constant α of h′λ(·) defined in (2.2) satisfies α = ψmin/2, then the transformed Kendall’s tau estimator Ŝ = ∇2ℒ(θ) satisfies Assumption 4.2 with high probability.
Remark 4.10
Since Assumptions 4.1 and 4.2 have been verified, by Theorem 4.6, we know that APISTA attains the geometric rate of convergence to a unique sparse local solution to (2.3) in terms of the objective value with high probability.
Recall that we use θ to denote Θ*j in (2.4). By solving (2.3) with respect to all d columns, we obtain the matrix estimators Θ̂{N} and Θ̅λN, where the jth column of Θ̂{N} is the output solution of APISTA corresponding to λN for the jth column problem, and the jth column of Θ̅λN is the unique sparse local solution corresponding to λN to which APISTA converges (j = 1, …, d). We then present concrete rates of convergence of the estimator obtained by APISTA under the matrix ℓ1 and Frobenius norms in the following theorem.
Theorem 4.11. [Parameter Estimation]
Suppose that , and α = ψmin/2. For , given , we have
The proof of Theorem 4.11 is provided in Appendix F. The results in Theorem 4.11 show that the SCIO estimator obtained by APISTA achieves the same rates of convergence as those for subgaussian distributions (Liu and Luo, 2015). Moreover, when using nonconvex regularization such as MCP and SCAD, we can achieve graph estimation consistency under the following assumption.
Assumption 4.3
Suppose that . Define ℰ* as the support of Θ*. There exists some universal constant c3 such that
Assumption 4.3 is a sufficient condition for sparse column inverse operator to achieve graph estimation consistency in high dimensions for transelliptical models. The violation of Assumption 4.3 may result in underselection of the nonzero entries in Θ*.
The next theorem shows that, with high probability, Θ̅λN and the oracle solution Θ̂o are identical. More specifically, let Θ̂o*j denote the jth column of the oracle solution for j = 1, …, d, defined as follows,
(4.6) |
Theorem 4.12. [Graph Estimation]
Suppose that , α = ψmin/2, and Assumption 4.3 holds. There exists a universal constant c4 such that , if we choose , then we have
The proof of Theorem 4.12 is provided in Appendix G. Since Θ̂o shares the same support with Θ*, Theorem 4.12 guarantees that the SCIO estimator obtained by APISTA can perfectly recover ℰ* with high probability. To the best of our knowledge, Theorem 4.12 is the first graph estimation consistency result for transelliptical models without any post-processing procedure (e.g. thresholding).
Remark 4.13
In Theorem 4.12, we choose , which is different from the regularization parameter selected in Lemma 4.8. But as long as we have , which is not an issue under the high dimensional scaling
λN ≥ 8‖∇ℒ(θ*)‖∞ still holds with high probability. Therefore all computational theories in §4.1 hold for Θ̅λN in Theorem 4.12.
5 Numerical Experiments
In this section, we study the computational and statistical performance of the APISTA method through numerical experiments on sparse transelliptical graphical model estimation. All experiments are conducted on a personal computer with an Intel Core i5 3.3 GHz CPU and 16GB memory. All programs are coded in double precision C, called from R. The computations are optimized by exploiting the sparsity of vectors and matrices. Thus we gain a significant speedup in vector and matrix manipulations (e.g. calculating the gradient and evaluating the objective value). We choose the MCP regularization with varying β’s for all simulations.
5.1 Simulated Data
We consider the chain and Erdös-Rényi graph generation schemes with varying d = 200, 400, and 800 to obtain the latent precision matrices:
Chain. Each node is assigned a coordinate j for j = 1, …, d. Two nodes are connected by an edge whenever the corresponding points are at distance no more than 1.
Erdös-Rényi. We set an edge between each pair of nodes with probability 1/d, independently of the other edges.
Two illustrative examples are presented in Figure 5.1. Let 𝒟 be the adjacency matrix of the generated graph, and ℳ2 be the rescaling operator that converts a symmetric positive semidefinite matrix to a correlation matrix. We calculate
We use Σ* as the covariance matrix to generate n = ⌈60 log d⌉ independent observations from a multivariate t-distribution with mean 0 and 3 degrees of freedom. We then adopt the power transformation g(t) = t^5 to convert the t-distributed data to transelliptical data. Note that the corresponding latent precision matrix is Ω* = (Σ*)−1. We compare the following five computational methods:
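A simplified data-generation sketch is given below. The exact construction of Σ* from the adjacency matrix 𝒟 and the rescaling operator ℳ2 is elided above, so the tridiagonal precision matrix and its off-diagonal value are placeholders of ours; the multivariate t sampling and the power transform g(t) = t^5 follow the description in the text.

```python
import numpy as np

def chain_precision(d, off_diag=0.4):
    """Tridiagonal (chain-graph) latent precision matrix; a stand-in for the elided
    construction from the adjacency matrix D and the rescaling operator M2."""
    return np.eye(d) + off_diag * (np.eye(d, k=1) + np.eye(d, k=-1))

def sample_transelliptical_t(Theta, n, df=3, power=5, seed=0):
    """Draw n samples from a multivariate t with latent covariance inv(Theta), then apply
    the monotone power transform g(t) = t**power coordinate-wise (odd power)."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Theta)
    d = Sigma.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    chi = rng.chisquare(df, size=n) / df
    T = Z / np.sqrt(chi)[:, None]                      # multivariate t with df degrees of freedom
    return T ** power
```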
(1) APISTA: The computational algorithm proposed in §3.
(2) F-APISTA: APISTA without the backtracking line search (using a fixed step size instead).
(3) PISTA: The path-following iterative shrinkage thresholding algorithm proposed in Wang et al. (2014).
(4) CLIME: The sparse latent precision matrix estimation method proposed in Liu et al. (2012b), which solves (2.5) by the ADMM method (Alternating Direction Method of Multipliers, Li et al. (2015); Liu et al. (2014)).
(5) SCIO(P): The SCIO estimator based on the positive semidefinite projection method proposed in Zhao et al. (2014b). More specifically, we first project the possibly indefinite Kendall’s tau matrix onto the cone of all positive semidefinite matrices. Then we plug the obtained replacement into (2.3), and solve it by the coordinate descent method proposed in Liu and Luo (2015).
Note that (4) and (5) have theoretical guarantees only when the ℓ1 norm regularization is applied. For (1)–(3), we set δ0 = δK = 10−5 for K = 1, …, N.
We first compare the statistical performance in parameter estimation and graph estimation of all methods. To this end, we generate a validation set of the same size as the training set. We use the regularization sequence with N = 100 and . The optimal regularization parameter is selected by
where Θ̂λ denotes the estimated latent precision matrix using the training set with the regularization parameter λ, and S̃ denotes the estimated latent covariance matrix using the validation set. We repeat the simulation 100 times, and summarize the averaged results in Tables 5.1 and 5.2. For all settings, we set δ0 = δK = 10−5. We also vary β of the MCP regularization from 100 to 20/19, so the corresponding α varies from 0.01 to 0.95. The parameter estimation performance is evaluated by the difference between the obtained estimator and the true latent precision matrix under the Frobenius and matrix ℓ1 norms. The graph estimation performance is evaluated by the true positive rate (T. P. R.) and false positive rate (F. P. R.) defined as follows,
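The displayed definitions of T. P. R. and F. P. R. are elided, so the following sketch computes them in the usual way, over off-diagonal entries of the estimated and true latent precision matrices.

```python
import numpy as np

def tpr_fpr(Theta_hat, Theta_star, tol=0.0):
    """True and false positive rates of the estimated edge set versus the true one,
    over off-diagonal entries only (the usual convention)."""
    d = Theta_star.shape[0]
    off = ~np.eye(d, dtype=bool)
    est = (np.abs(Theta_hat) > tol) & off
    true = (np.abs(Theta_star) > tol) & off
    tpr = est[true].mean()                 # fraction of true edges that are recovered
    fpr = est[off & ~true].mean()          # fraction of non-edges that are selected
    return tpr, fpr
```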
Table 5.1.
Method | d | ‖Θ̂−Θ‖F | ‖Θ̂−Θ‖1 | T. P. R. | F. P. R. | α |
---|---|---|---|---|---|---|
PISTA | 200 | 4.1112(0.7856) | 1.0517(0.1141) | 1.0000(0.0000) | 0.0048(0.0079) | 0.20 |
400 | 6.4507(0.9062) | 1.0756(0.0717) | 1.0000(0.0000) | 0.0007(0.0004) | 0.20 | |
800 | 8.2640(1.1456) | 1.0434(0.0673) | 1.0000(0.0000) | 0.0003(0.0006) | 0.20 | |
APISTA | 200 | 2.5162(0.2677) | 0.7665(0.1583) | 0.9993(0.0012) | 0.0001(0.0001) | 0.95 |
400 | 3.3664(0.2735) | 0.8298(0.0986) | 1.0000(0.0000) | 0.0002(0.0000) | 0.67 | |
800 | 5.0244(0.7984) | 0.9312(0.1226) | 1.0000(0.0000) | 0.0002(0.0004) | 0.50 | |
F-APISTA | 200 | 2.5163(0.2670) | 0.7658(0.1559) | 0.9994(0.0015) | 0.0001(0.0002) | 0.95 |
400 | 3.3629(0.2702) | 0.8253(0.0959) | 1.0000(0.0000) | 0.0002(0.0000) | 0.67 | |
800 | 5.0237(0.7963) | 0.9373(0.1289) | 1.0000(0.0000) | 0.0002(0.0005) | 0.50 | |
SCIO(P) | 200 | 6.1812(1.2924) | 1.2245(0.0777) | 1.0000(0.0000) | 0.0165(0.0220) | 0.00 |
400 | 8.9991(0.9894) | 1.2255(0.0785) | 1.0000(0.0000) | 0.0058(0.0047) | 0.00 | |
CLIME | 200 | 6.4771(0.8617) | 1.2187(0.0358) | 1.0000(0.0000) | 0.0126(0.0043) | 0.00 |
400 | 9.1221(0.9997) | 1.2177(0.0629) | 1.0000(0.0000) | 0.0043(0.0032) | 0.00 |
Table 5.2.
Method | d | ‖Θ̂−Θ‖F | ‖Θ̂−Θ‖1 | T. P. R. | F. P. R. | α̂ |
---|---|---|---|---|---|---|
PISTA | 200 | 3.2647(0.1235) | 1.6807(0.2675) | 1.0000(0.0000) | 0.0587(0.0013) | 0.20 |
400 | 4.5609(0.7666) | 2.2113(0.3358) | 1.0000(0.0000) | 0.0295(0.0091) | 0.20 | |
800 | 5.0751(0.3832) | 2.5718(0.2826) | 1.0000(0.0000) | 0.0099(0.0020) | 0.20 | |
APISTA | 200 | 2.2888(0.1141) | 1.1644(0.2343) | 1.0000(0.0000) | 0.0193(0.0005) | 0.33 |
400 | 3.2206(0.2733) | 1.4974(0.2778) | 1.0000(0.0000) | 0.0067(0.0100) | 0.33 | |
800 | 4.0929(0.1862) | 1.6347(0.2023) | 1.0000(0.0000) | 0.0036(0.0008) | 0.50 | |
F-APISTA | 200 | 2.2890(0.1161) | 1.1647(0.2390) | 1.0000(0.0000) | 0.0197(0.0007) | 0.33 |
400 | 3.2251(0.2702) | 1.4928(0.2731) | 1.0000(0.0000) | 0.0060(0.0102) | 0.33 | |
800 | 4.0984(0.1891) | 1.6397(0.2096) | 1.0000(0.0000) | 0.0034(0.0009) | 0.50 | |
SCIO(P) | 200 | 3.4277(0.5405) | 1.5213(0.3223) | 1.0000(0.0000) | 0.0618(0.0170) | 0.00 |
400 | 5.7144(0.8158) | 1.9057(0.2933) | 0.9994(0.0017) | 0.0341(0.0145) | 0.00 | |
CLIME | 200 | 3.6297(0.6103) | 1.4876(0.2855) | 1.0000(0.0000) | 0.0581(0.0159) | 0.00 |
400 | 5.9206(0.8385) | 1.8246(0.2817) | 1.0000(0.0000) | 0.0320(0.0112) | 0.00 |
Since the convergence of PISTA is very slow when α is large, we only present its results for α = 0.2. APISTA and F-APISTA can work with larger α’s; therefore they effectively reduce the estimation bias and attain the best statistical performance in both parameter estimation and graph estimation among all estimators. The SCIO(P) and CLIME methods only use the ℓ1 norm without any bias reduction, so their performance is worse than that of the other competitors. Moreover, due to the poor scalability of their solvers, SCIO(P) and CLIME fail to output valid results within 10 hours when d = 800.
We then compare the computational performance of all methods. We use a regularization sequence with N = 50, and λN is properly selected such that the graphs obtained by all methods have approximately the same number of edges for each regularization parameter. In particular, the obtained graphs corresponding to λN have approximately 0.1 · d(d − 1)/2 edges. To make a fair comparison, we choose the ℓ1 norm regularization for all methods. We repeat the simulation 100 times, and the timing results are summarized in Tables 5.3 and 5.4. We see that F-APISTA is up to 10 times faster than PISTA, and APISTA is up to 5 times faster than PISTA. SCIO(P) and CLIME are much slower than the other three competitors.
Table 5.3.
d | PISTA | APISTA | F-APISTA | SCIO(P) | CLIME |
---|---|---|---|---|---|
200 | 0.8342(0.0248) | 0.2693(0.0031) | 0.1013(0.0022) | 2.6572(0.1253) | 8.5932(0.5396) |
400 | 3.8782(0.0696) | 1.2103(0.0368) | 0.4559(0.0308) | 25.451(2.5752) | 48.235(5.3494) |
800 | 30.014(0.3514) | 6.5970(0.2338) | 2.4283(0.2605) | 315.87(34.638) | 460.12(45.121) |
Table 5.4.
d | PISTA | APISTA | F-APISTA | SCIO(P) | CLIME |
---|---|---|---|---|---|
200 | 0.5401(0.0248) | 0.2048(0.0056) | 0.1063(0.0110) | 2.712(0.13558) | 7.1325(0.7891) |
400 | 3.0501(0.0829) | 0.9982(0.0453) | 0.4555(0.0071) | 26.140(2.1503) | 45.160(4.9026) |
800 | 28.581(0.3517) | 6.8417(0.7543) | 2.7037(0.2145) | 332.90(30.115) | 442.57(50.978) |
5.2 Real Data
We present a real data example to demonstrate the usefulness of the transelliptical graph obtained by the sparse column inverse operator (based on the transformed Kendall’s tau matrix). We acquire closing prices of all stocks of the S&P 500 for all the days that the market was open between January 1, 2003 and January 1, 2005, which results in 504 samples for the 452 stocks. We transform the dataset by calculating the log-ratio of the price at time t + 1 to the price at time t. The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors.
We adopt the stability graphs obtained by the following procedure (Meinshausen and Bühlmann, 2010; Liu et al., 2010):
(1) Calculate the graph path using all samples, and choose the regularization parameter at the sparsity level 0.1;
(2) Randomly choose 50% of all samples without replacement and re-estimate the graph using the regularization parameter chosen in (1);
(3) Repeat (2) 100 times and retain the edges that appear with frequency no less than 95%. A sketch of this subsampling loop is given below.
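The sketch below follows steps (2)–(3); fit_graph is an assumed black-box estimator (e.g., the sparse column inverse operator applied column by column) returning an adjacency matrix at a given regularization parameter.

```python
import numpy as np

def stability_graph(X, fit_graph, lam, n_subsamples=100, frac=0.5, freq=0.95, seed=0):
    """Stability-graph procedure: refit on random half-samples at a fixed lambda and
    keep edges appearing in at least `freq` of the refits."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros((d, d))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # subsample without replacement
        counts += fit_graph(X[idx], lam)                         # boolean d x d adjacency matrix
    return counts / n_subsamples >= freq
```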
We choose the sparsity level 0.1 in (1) and the subsampling ratio 50% in (2) based on two criteria: the resulting graphs need to be sparse to ease visualization, interpretation, and computation; and the resulting graphs need to be stable. We then present the obtained stability graphs in Figure 5.2. The nodes are colored according to the GICS sector of the corresponding stock. We highlight a region in the transelliptical graph obtained by the SCIO method; by the color coding, we see that the nodes in this region belong to the same sector of the market. A similar pattern is also found in the transelliptical graph obtained by the CLIME method. In contrast, this region is shown to be sparse in the Gaussian graph obtained by the SCIO method (based on the Pearson correlation matrix). Therefore we see that the SCIO method is also capable of generating structures as refined as those of the CLIME method when estimating the transelliptical graph.
6 Discussions
We compare F-APISTA with a closely related algorithm – the path-following coordinate descent algorithm (PCDA)1 – in timing performance. In particular, we give a failure example of PCDA for solving sparse linear regression. Let X ∈ ℝn×d denote the design matrix and y ∈ ℝn denote the response vector. We solve the following regularized optimization problem,
We generate each row of the design matrix Xi* from a d-variate Gaussian distribution with mean 0 and covariance matrix Σ ∈ ℝd×d, where Σkj = 0.75 if k ≠ j and Σkk = 1 for all j, k = 1, …, d. We then normalize each column of the design matrix X*j such that . The response vector is generated from the linear model y = Xθ* + ε, where θ* ∈ ℝd is the regression coefficient vector, and ε is generated from an n-variate Gaussian distribution with mean 0 and covariance matrix I. We set n = 60 and d = 1000. We set the coefficient vector as , and θ*j = 0 for all j ≠ 250, 500, 750. We then set α = 0.95, N = 100, , and δ0 = δK = 10−5.
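The design and response generation can be sketched as follows. The nonzero coefficient magnitudes and the exact column normalization are elided in the text, so the values used below are placeholders of ours.

```python
import numpy as np

def make_regression_data(n=60, d=1000, rho=0.75, seed=0):
    """Equicorrelated Gaussian design, column normalization, and a 3-sparse signal.
    The nonzero magnitudes are placeholders (the paper's values are elided)."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # normalize columns (an assumption)
    theta_star = np.zeros(d)
    theta_star[[249, 499, 749]] = [2.0, -2.0, 2.0]     # j = 250, 500, 750 (1-based), placeholder values
    y = X @ theta_star + rng.standard_normal(n)
    return X, y, theta_star
```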
We then generate a validation set using the same design matrix as the training set for the regularization selection. We denote the response vector of the validation set as ỹ ∈ ℝn. Let θ̂λ denote the obtained estimator using the regularization parameter λ. We then choose the optimal regularization parameter λ̂ by
We repeat the simulation 100 times, and summarize the average results in Table 6.1. We see that F-APISTA and PCDA attain similar timing results. But PCDA achieves worse statistical performance than F-APISTA in both support recovery and parameter estimation. This is because PCDA has no control over the solution sparsity. The overselection of irrelevant variables compromises the restricted strong convexity, and makes PCDA converge to local optima with poor statistical properties.
Table 6.1.
Method | ‖θ̂−θ*‖2 | ‖θ̂𝒮‖0 | ‖θ̂𝒮c‖0 | Correct Selection | Timing |
---|---|---|---|---|---|
F-APISTA | 0.8001(0.9089) | 2.801(0.5123) | 0.890(2.112) | 667/1000 | 0.0181(0.0025) |
PCDA | 1.1275(1.2539) | 2.655(0.7051) | 1.644(3.016) | 517/1000 | 0.0195(0.0021) |
Acknowledgments
Research supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841, and FDA HHSF223201000072C.
Appendix
A Proof of Theorem 4.3
Proof
Since ‖θ(0)‖0 ≤ s* + s̃ implies that |𝒜| ≤ s* + s̃, by Assumption 4.2 and Lemma 4.1, we know that (3.5) is strongly convex over θ𝒜. Thus it has a unique global minimizer. We then analyze the amount of successive decrease. By the restricted strong convexity of ℱλ(θ), we have
(A.1) |
where satisfies the optimality condition of (3.6),
(A.2) |
By combining (A.1) with (A.2), we have
which further implies
(A.3) |
We then analyze the gap in the objective value yet to be minimized after each iteration. For any θ′, θ ∈ ℝd with , by the restricted strong convexity of ℱλ(θ), we have
(A.4) |
where ξ ∈ ℝd with ξ𝒜 ∈ ∂‖θ𝒜‖1 and ξ𝒜⊥ = 0. We then minimize both sides of (A.4) with respect to and obtain
(A.5) |
where (i) comes from (A.2) and (ii) comes from the restricted strong smoothness of ℒ̃λ(θ).
Eventually, by combining (A.5) with (A.3), we obtain
which further implies
(A.6) |
By recursively applying (A.6), we complete the proof.
B Proof of Theorem 4.4
Proof
Before we proceed with the proof, we first introduce several important lemmas.
Lemma B.1
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
(B.1) |
then we have
Lemma B.2
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
then we have ‖[𝒯L,λ(θ)]𝒮⊥‖0 ≤ s̃ for any L ≤ 2ρ+(s* + 2s̃).
The proofs of Lemmas B.1 and B.2 are provided in Wang et al. (2014), therefore omitted. Since the initial solution θ[0] satisfies the approximate KKT condition, by Lemma B.1 we know that θ[0] satisfies
(B.2) |
We assume L[m] ≤ 2ρ+(s* + 2s̃). Since , by (B.2) and Lemma B.2, we have θ[0.5] = 𝒯L,λ(θ[0]) and . Since the coordinate descent subroutine iterates over 𝒜 = supp(θ[0.5]), its output solution θ[1] also satisfies . Since the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, by (B.2), we also have
Then by induction, we know that all successive θ[m]’s satisfy for m = 1.5, 2, 2.5, ….
Now we verify L[m] ≤ 2ρ+(s* + 2s̃). We start with a small enough L = ρ+(1) ≤ 2ρ+(s* + 2s̃). If L does not satisfy the stopping criterion for the backtracking line search in (3.4), then we multiply L by 2. Once L enters the interval [ρ+(s* + 2s̃), 2ρ+(s* + 2s̃)], it stops increasing, because by the restricted strong smoothness of ℒ̃λ(θ), such a step size parameter always guarantees that the algorithm iterates from a sparse θ[m] to a sparse θ[m+0.5], and meanwhile satisfies the stopping criterion of the backtracking line search. Thus L[m] ≤ 2ρ+(s* + 2s̃) is verified.
The existence and uniqueness of θ̄λ has been verified in Wang et al. (2014). Therefore the proof is omitted. We then proceed to derive the geometric rate of convergence to θ̄λ by the next lemma.
Lemma B.3
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies
(B.3) |
given L ≤ 2ρ+(s* + 2s̃), then we have
The proof of Lemma B.3 is provided in Wang et al. (2014), therefore omitted. Since we have verified that all θ[m]’s satisfy (B.3) and all L[m]’s satisfy L[m] ≤ 2ρ+(s* + 2s̃) for m = 0, 1, 2, …, Lemma B.3 implies
(B.4) |
where the first inequality holds because the coordinate descent subroutine decreases the objective value. Then by recursively applying (B.4), we complete the proof.
C Proof of Theorem 4.6
Proof
Before we proceed with the proof of Result (1), we first introduce the following lemma.
Lemma C.1
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies
then for any λ′ ∈ [λN, λ], we have
The proof of Lemma C.1 is provided in Wang et al. (2014), therefore omitted. If we take λ = λ′ = λK and θ = θ̂{K−1}, then Lemma C.1 implies
(C.1) |
Recall (A.3) in Appendix A. Within each coordinate descent subroutine for λK, we have
(C.2) |
By combining Theorem 4.3 with (C.2), we have
Therefore given
(C.3) |
we have
which satisfies the stopping criterion of CCDA for λK. Since both the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, we have
(C.4) |
within each coordinate descent subroutine for the Kth stage. By combining (C.1) and (C.3) with (C.4), we have
Before we proceed with the proof of Result (2), we first introduce the following lemma.
Lemma C.2
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
(C.5) |
given L ≤ 2ρ+(s* + 2s̃), we have
The proof of Lemma C.2 is provided in Wang et al. (2014), therefore omitted. Recall that in Appendix B, we have shown that at the Kth stage, θ[m] satisfies (C.5). The backtracking line search guarantees L[m+1] ≤ 2ρ+(s* + 2s̃). Thus by Lemma C.2, we have
(C.6) |
where the last inequality holds since the coordinate descent subroutine decreases the objective value. By combining (C.6) with Theorem 4.4, we obtain
Thus as long as
(C.7) |
we have
which satisfies the stopping criterion of AISTA at the Kth stage. By combining (C.1) with (C.7), we have
Result (3) is just a straightforward combination of Results (1) and (2).
To prove Result (4), we need to use Lemma C.1 again. In particular, for K < N, we take λ′ = λN, λ = λK and θ = θ̂{K}. We then have
(C.8) |
Since we have λK > λN for K = 1, …, N − 1, (C.8) implies
(C.9) |
For K = N, (C.8) implies
(C.10) |
D Proof of Lemma 4.8
Proof
Before we proceed with the proof, we need to introduce the following lemma.
Lemma D.1
Suppose that . We have
(D.1) |
The proof of Lemma D.1 is provided in Liu et al. (2012a), therefore omitted. We consider the following decomposition,
(D.2) |
Then by combining (D.1) and (D.2) with the fact ‖θ*‖1 ≤ ‖Θ*‖1 ≤ M, we have
which completes the proof.
E Proof of Lemma 4.9
Proof
Before we proceed with the proof, we first introduce the following lemma.
Lemma E.1
Suppose that . There exists a universal constant c2 such that
(E.1) |
The proof of Lemma E.1 is provided in Han and Liu (2015), therefore omitted. We consider the decomposition
(E.2) |
By assuming ‖θ‖0 ≤ s* + 2s̃ and
we further have
(E.3) |
(E.4) |
Thus for , we have
Given α = ψmin/2, we have
(E.5) |
Since we need to secure s̃ = c1s* ≥ (144κ2 + 250κ)s*, we take
(E.6) |
In other words, we need
Eventually by combining (E.1) and (E.5) with (E.6), we complete the proof.
F Proof of Theorem 4.11
Proof
Recall that the output solution θ̂{N} satisfies and ωλN ≤ δNλN. By Lemma B.1, we have
(F.1) |
By the definition of the matrix ℓ1 and Frobenius norms, we have
(F.2) |
Recall that we use θ̂{N} to denote an arbitrary column of Θ̂{N}. By combining (F.2) with (F.1), we have
Since all above results rely on Assumptions 4.1 and 4.2, by Lemma 4.8 and 4.9, we have
with probability 1 − 3d−2, which completes the proof.
G Proof of Theorem 4.12
Proof
For notational simplicity, we omit the column index j, and use 𝒮 and θ̂o ∈ ℝd to denote the true support 𝒮j and corresponding oracle estimator Θ̂o respectively for the jth column. In particular, we can rewrite (4.6) as follows,
(G.1) |
Suppose that Assumption 4.2 holds. We have
which implies that Ŝ𝒮𝒮 is positive definite. Thus (G.1) is strongly convex and θ̂o is its unique minimizer. In our following analysis, we also assume
(G.2) |
By the strong convexity of (G.1), we have
(G.3) |
where (i) comes from the fact that θ̂o is the minimizer to (G.1). For notational simplicity, we denote . By the Cauchy-Schwarz inequality, (G.3) can be rewritten as
where the last inequality comes from (G.2) and the fact that Δ̂o contains at most s* entries. By simple manipulations, we obtain
(G.4) |
where the last inequality comes from the fact ‖θ*‖1 ≤ ‖Θ*‖1 ≤ M. By combining (G.4) with Assumption 4.3, we obtain
where (i) comes from the fact . Now we assume for some constant c4 (will be discussed later). We then have
Now we show that θ̂o is a sparse local solution to (2.4). In particular, we have the following decomposition,
Since is the minimizer to (G.1), by the KKT condition of (G.1), we have
(G.5) |
Moreover, since , we have
(G.6) |
By combining (G.5) with (G.6), we have
(G.7) |
Now we consider
Therefore as long as
we have , which implies that there exists ξ ∈ ∂‖0‖1 such that
(G.8) |
By combining (G.7) with (G.8), we know that θ̂o satisfies the KKT condition and is a local solution to (2.4).
Now we will show that θ̂o and θ̄λN are identical. Since and , we have
By the restricted strong convexity of ℱλN, we have
(G.9) |
(G.10) |
where ξ̃ and ξ̃o are defined as
By combining (G.9) with (G.10), we have , i.e., θ̂o = θ̄λN. Note that we choose , which is different from the selected regularization parameter in Assumption 4.8. But as long as we have , which is not an issue under the high dimensional scaling
λN ≥ 8‖∇ℒ(θ*)‖∞ still holds with high probability. Since the above results hold uniformly over all columns of Θ̅λN and Θ* under Assumptions 4.1 and 4.2, by Lemmas 4.8 and 4.9, we obtain Θ̂o = Θ̅λN, which completes the proof.
Footnotes
In our numerical experiments, PCDA is implemented by the R package “ncvreg”.
References
- Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research. 2008;9:485–516.
- Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. Image Processing, IEEE Transactions on. 2009a;18:2419–2434. doi: 10.1109/TIP.2009.2028250.
- Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009b;2:183–202.
- Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics. 2011;5:232–253. doi: 10.1214/10-AOAS388.
- Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
- Dennis JJE, Schnabel RB. Numerical methods for unconstrained optimization and nonlinear equations. Vol. 16. SIAM; 1983.
- Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B. 2012;74:745–771. doi: 10.1111/j.1467-9868.2012.01029.x.
- Fan J, Feng Y, Wu Y. Network exploration via the adaptive lasso and scad penalties. The Annals of Applied Statistics. 2009;3:521–541. doi: 10.1214/08-AOAS215SUPP.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics. 2014;42:819–849. doi: 10.1214/13-aos1198.
- Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–13.
- Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
- Han F, Liu H. Statistical analysis of latent generalized correlation matrix estimation in transelliptical distribution. Bernoulli. 2015 (Accepted). doi: 10.3150/15-BEJ702.
- Han F, Zhao T, Liu H. CODA: High dimensional copula discriminant analysis. Journal of Machine Learning Research. 2012;14:629–671.
- Jacob L, Obozinski G, Vert J-P. Group lasso with overlap and graph lasso; Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
- Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99:315–325.
- Ledoux M. The concentration of measure phenomenon. Vol. 89. AMS Bookstore; 2005.
- Li X, Zhao T, Yuan X, Liu H. The "flare" package for high-dimensional sparse linear regression in R. Journal of Machine Learning Research. 2015;16:553–557.
- Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics. 2012a;40:2293–2326.
- Liu H, Han F, Zhang C-H. Transelliptical graphical models. Advances in Neural Information Processing Systems 25. 2012b.
- Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery; Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
- Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (stars) for high dimensional graphical models. Advances in Neural Information Processing Systems. 2010.
- Liu H, Wang L, Zhao T. Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics. 2014;23:439–459. doi: 10.1080/10618600.2013.782818.
- Liu H, Wang L, Zhao T. Calibrated multivariate regression with application to neural semantic basis discovery. Journal of Machine Learning Research. 2015;16:1579–1606.
- Liu W, Luo X. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis. 2015;135:153–162. doi: 10.1016/j.jmva.2014.11.005.
- Lu Z, Xiao L. Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. arXiv preprint arXiv:1306.5918. 2013.
- Mazumder R, Friedman JH, Hastie T. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106:1125–1138. doi: 10.1198/jasa.2011.tm09738.
- Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B. 2008;70:53–71.
- Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B. 2010;72:417–473.
- Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37:246–270.
- Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics. 2011;39:1069–1097.
- Negahban SN, Ravikumar P, Wainwright MJ, Yu B. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science. 2012;27:538–557.
- Nesterov Y. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Mateaticheskie Metody. 1988;24:509–517.
- Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–152.
- Nesterov Y. Gradient methods for minimizing composite objective function. Mathematical Programming Series B. 2013;140:125–161.
- Nocedal J, Wright S. Numerical optimization, series in operations research and financial engineering. New York: Springer; 2006.
- Qin Z, Scheinberg K, Goldfarb D. Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation. 2010:1–27.
- Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
- Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
- Shalev-Shwartz S, Tewari A. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research. 2011;12:1865–1892.
- Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association. 2012;107:223–232. doi: 10.1080/01621459.2011.645783.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B. 2005;67:91–108.
- Tseng P, Yun S. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications. 2009a;140:513–535.
- Tseng P, Yun S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming. 2009b;117:387–423.
- Van de Geer SA. High-dimensional generalized linear models and the lasso. The Annals of Statistics. 2008;36:614–645.
- Wainwright M. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory. 2009;55:2183–2201.
- Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. The Annals of Statistics. 2013;41:2505–2536. doi: 10.1214/13-AOS1159.
- Wang Z, Liu H, Zhang T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics. 2014;42:2164–2201. doi: 10.1214/14-AOS1238.
- Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics. 2008;2:224–244.
- Xue L, Zou H, Cai T. Nonconcave penalized composite conditional likelihood estimation of sparse ising models. The Annals of Statistics. 2012;40:1403–1429.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2005;68:49–67.
- Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
- Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010a;38:894–942.
- Zhang C-H, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36:1567–1594.
- Zhang C-H, Zhang T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 2012;27:576–593.
- Zhang T. Some sharp performance bounds for least squares regression with l1 regularization. The Annals of Statistics. 2009;37:2109–2144.
- Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research. 2010b;11:1081–1107.
- Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhao T, Liu H. Sparse additive machine; International Conference on Artificial Intelligence and Statistics; 2012.
- Zhao T, Liu H. Calibrated precision matrix estimation for high-dimensional elliptical distributions. IEEE Transactions on Information Theory. 2014;60:7874. doi: 10.1109/TIT.2014.2360980.
- Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research. 2012;13:1059–1062.
- Zhao T, Liu H, Zhang T. A general theory of pathwise coordinate optimization. arXiv preprint arXiv:1412.7477. 2014a.
- Zhao T, Roeder K, Liu H. Positive semidefinite rank-based correlation matrix estimation with application to semiparametric graph estimation. Journal of Computational and Graphical Statistics. 2014b;23:895–922. doi: 10.1080/10618600.2013.858633.
- Zhao T, Yu M, Wang Y, Arora R, Liu H. Accelerated mini-batch randomized block coordinate descent method. Advances in Neural Information Processing Systems. 2014c.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67:301–320.