Abstract
We propose an accelerated path-following iterative shrinkage thresholding algorithm (APISTA) for solving high dimensional sparse nonconvex learning problems. The main difference between APISTA and the path-following iterative shrinkage thresholding algorithm (PISTA) is that APISTA exploits an additional coordinate descent subroutine to boost the computational performance. Such a modification, though simple, has a profound impact: APISTA not only enjoys the same theoretical guarantee as PISTA, i.e., it attains a linear rate of convergence to a unique sparse local optimum with good statistical properties, but also significantly outperforms PISTA in empirical benchmarks. As an application, we apply APISTA to solve a family of nonconvex optimization problems motivated by estimating sparse semiparametric graphical models. APISTA allows us to obtain new statistical recovery results that do not exist in the previous literature. Thorough numerical results are provided to back up our theory.
1 Introduction
High dimensional data challenge both statistics and computation. In the statistics community, researchers have proposed a large family of regularized M-estimators, including the Lasso, Group Lasso, Fused Lasso, Graphical Lasso, Sparse Column Inverse Operator, Sparse Multivariate Regression, and Sparse Linear Discriminant Analysis (Tibshirani, 1996; Zou and Hastie, 2005; Yuan and Lin, 2005, 2007; Banerjee et al., 2008; Tibshirani et al., 2005; Jacob et al., 2009; Fan et al., 2012; Liu and Luo, 2015; Han et al., 2012; Liu et al., 2015). Theoretical analyses of these methods usually rely on the sparsity of the parameter space and require the resulting optimization problems to be strongly convex over a restricted parameter space. More details can be found in Meinshausen and Bühlmann (2006); Zhao and Yu (2006); Zou (2006); Rothman et al. (2008); Zhang and Huang (2008); Van de Geer (2008); Zhang (2009); Meinshausen and Yu (2009); Wainwright (2009); Fan et al. (2009); Zhang (2010a); Ravikumar et al. (2011); Liu et al. (2012a); Negahban et al. (2012); Han et al. (2012); Kim and Kwon (2012); Shen et al. (2012). In the optimization community, researchers have proposed a large variety of computational algorithms including the proximal gradient methods (Nesterov, 1988, 2005, 2013; Beck and Teboulle, 2009b,a; Zhao and Liu, 2012; Liu et al., 2015) and coordinate descent methods (Fu, 1998; Friedman et al., 2007; Wu and Lange, 2008; Friedman et al., 2008; Meier et al., 2008; Liu et al., 2009; Friedman et al., 2010; Qin et al., 2010; Mazumder et al., 2011; Breheny and Huang, 2011; Shalev-Shwartz and Tewari, 2011; Zhao et al., 2014c).
Recently, Wang et al. (2014) propose the path-following iterative soft shrinkage thresholding algorithm (PISTA), which combines the proximal gradient algorithm with a path-following optimization scheme. By exploiting the solution sparsity and restricted strong convexity, they show that PISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties for solving a large class of sparse nonconvex learning problems. However, though PISTA has superior theoretical properties, its empirical performance is in general not as good as that of some heuristic competing methods such as the path-following coordinate descent algorithm (PCDA) (Tseng and Yun, 2009b,a; Lu and Xiao, 2013; Friedman et al., 2010; Mazumder et al., 2011; Zhao et al., 2012, 2014a). To address this concern, we propose a new computational algorithm named APISTA (Accelerated Path-following Iterative Shrinkage Thresholding Algorithm). More specifically, we exploit an additional coordinate descent subroutine to help PISTA efficiently decrease the objective value in each iteration. This makes APISTA significantly outperform PISTA in practice. Meanwhile, the coordinate descent subroutine preserves the solution sparsity and restricted strong convexity; therefore APISTA enjoys the same theoretical guarantees as PISTA, i.e., APISTA attains a linear rate of convergence to a unique sparse local optimum with good statistical properties. As an application, we apply APISTA to a family of nonconvex optimization problems motivated by estimating semiparametric graphical models (Liu et al., 2012b; Zhao and Liu, 2014). APISTA allows us to obtain new sparse recovery results on graph estimation consistency which have not been established before. Thorough numerical results are presented to back up our theory.
NOTATIONS
Let υ = (υ1, …, υd)T ∈ ℝd. We define ‖υ‖1 = ∑j |υj|, ‖υ‖2 = √(∑j υj²), and ‖υ‖∞ = maxj |υj|. We denote the number of nonzero entries in υ as ‖υ‖0 = ∑j 𝟙(υj ≠ 0). We define the soft-thresholding operator as 𝒮λ(υj) = sign(υj) · max(|υj| − λ, 0) for any λ ≥ 0. Given a matrix A ∈ ℝd×d, we use A*j = (A1j, …, Adj)T to denote the jth column of A, and Ak* = (Ak1, …, Akd)T to denote the kth row of A. Let Λmax(A) and Λmin(A) denote the largest and smallest eigenvalues of A. Let ψ1(A), …, ψd(A) be the singular values of A; we define the following matrix norms: ‖A‖F = √(∑j ψj²(A)), ‖A‖max = maxj ‖A*j‖∞, ‖A‖1 = maxj ‖A*j‖1, ‖A‖2 = maxj ψj(A), ‖A‖∞ = maxk ‖Ak*‖1. We denote υ\j = (υ1, …, υj−1, υj+1, …, υd)T ∈ ℝd−1 as the subvector of υ with the jth entry removed. We denote A\i\j as the submatrix of A with the ith row and the jth column removed. We denote Ai\j to be the ith row of A with its jth entry removed. Let 𝒜 ⊆ {1, …, d}; we use υ𝒜 to denote a subvector of υ by extracting all entries of υ with indices in 𝒜, and A𝒜𝒜 to denote a submatrix of A by extracting all entries of A with both row and column indices in 𝒜.
2 Background and Problem Setup
Let θ ∈ ℝd be a parameter vector to be estimated. We are interested in solving a class of regularized optimization problems in a generic form:
minθ∈ℝd ℱλ(θ) = ℒ(θ) + ℛλ(θ),   (2.1)
where ℒ(θ) is a smooth loss function and ℛλ(θ) is a nonsmooth regularization function with a regularization parameter λ.
2.1 Sparsity-inducing Nonconvex Regularization Functions
For high dimensional problems, we exploit sparsity-inducing regularization functions, which are usually continuous and decomposable with respect to each coordinate, i.e., ℛλ(θ) = ∑j rλ(θj). For example, the widely used ℓ1 norm regularization decomposes as ℛλ(θ) = λ‖θ‖1 = ∑j λ|θj|. One drawback of the ℓ1 norm is that it incurs large estimation bias when |θj| is large. This motivates the usage of nonconvex regularizers. Examples include the SCAD (Fan and Li, 2001) regularization
and MCP (Zhang, 2010a) regularization
Both SCAD and MCP can be written as the sum of an ℓ1 norm and a concave function ℋλ(θ), i.e., ℛλ(θ) = λ‖θ‖1 + ℋλ(θ). It is easy to see that ℋλ(θ) is also decomposable with respect to each coordinate, i.e., ℋλ(θ) = ∑j hλ(θj). More specifically, the SCAD regularization has
and the MCP regularization has
In general, the concave function hλ(·) is smooth and symmetric about zero with hλ(0) = 0 and h′λ(0) = 0. Its gradient h′λ(·) is monotone decreasing and Lipschitz continuous, i.e., for any θj, θ′j ∈ ℝ, there exists a constant α ≥ 0 such that
|h′λ(θj) − h′λ(θ′j)| ≤ α|θj − θ′j|.   (2.2)
Moreover, we require if |θj| ≥ λβ, and if |θj| ≤ λβ.
It is easy to verify that both SCAD and MCP satisfy the above properties. In particular, the SCAD regularization has α = 1/(β − 1), and the MCP regularization has α = 1/β. These nonconvex regularization functions have been shown to achieve better asymptotic behavior than the convex ℓ1 regularization. More technical details can be found in Fan and Li (2001); Zhang (2010a, b); Zhang and Zhang (2012); Fan et al. (2014); Xue et al. (2012); Wang et al. (2014, 2013); Liu et al. (2014). We present several illustrative examples of the nonconvex regularizers in Figure 2.1.
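Since the displayed forms of hλ(·) above are elided, the following sketch writes out the gradients of the standard SCAD and MCP concave parts, using β as the SCAD/MCP parameter as in the text. The function names are ours and the code is only an illustration, not the authors' implementation.

```python
import numpy as np

def h_scad_grad(theta, lam, beta):
    """Gradient of the concave part h_lam(.) of the SCAD penalty (requires beta > 2).
    Its Lipschitz constant is alpha = 1 / (beta - 1), matching the text."""
    theta = np.asarray(theta, dtype=float)
    t, s = np.abs(theta), np.sign(theta)
    g = np.zeros_like(theta)
    mid = (t > lam) & (t <= beta * lam)
    g[mid] = s[mid] * (lam - t[mid]) / (beta - 1.0)   # partially cancels the l1 gradient
    g[t > beta * lam] = -lam * s[t > beta * lam]      # fully cancels the l1 gradient
    return g

def h_mcp_grad(theta, lam, beta):
    """Gradient of the concave part h_lam(.) of the MCP penalty (requires beta > 1).
    Its Lipschitz constant is alpha = 1 / beta, matching the text."""
    theta = np.asarray(theta, dtype=float)
    return np.where(np.abs(theta) <= beta * lam, -theta / beta, -lam * np.sign(theta))
```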
2.2 Nonconvex Loss Function
A motivating application of the method proposed in this paper is sparse transelliptical graphical model estimation (Liu et al., 2012b). The transelliptical graphical model is a semiparametric graphical modeling tool for exploring the relationships between a large number of variables. We start with a brief review of the transelliptical distribution, defined below.
Definition 2.1 (Transelliptical Distribution)
Let {f1, …, fd} be a set of strictly increasing univariate functions. Given a positive semidefinite matrix Σ* ∈ ℝd×d with rank(Σ*) = r ≤ d and Σ*jj = 1 for j = 1, …, d, we say that a d-dimensional random vector X = (X1, …, Xd)T follows a transelliptical distribution, denoted as X ~ TEd(Σ*, ξ, f1, …, fd), if X has the stochastic representation
(f1(X1), …, fd(Xd))T =d ξAU,
where Σ* = AAT, U ∈ 𝕊r − 1 is uniformly distributed on the unit sphere in ℝr, and ξ ≥ 0 is a continuous random variable independent of U.
Note that Σ* in Definition 2.1 is not necessarily the correlation matrix of X. To interpret Σ*, Liu et al. (2012b) provide a latent Gaussian representation for the transelliptical distribution, which implies that the sparsity pattern of Θ* = (Σ*)−1 encodes the graph structure of some underlying Gaussian distribution. Since Σ* needs to be invertible, we have r = d. To estimate Θ*, Liu et al. (2012b) suggest directly plugging the following transformed Kendall’s tau estimator into existing Gaussian graphical model estimation procedures.
Definition 2.2 (Transformed Kendall’s tau Estimator)
Let x1, …, xn ∈ ℝd be n independent observations of X = (X1, …, Xd)T, where xi = (xi1, …, xid)T. The transformed Kendall’s tau estimator Ŝ ∈ ℝd×d is defined as Ŝkj = sin(π τ̂kj / 2), where τ̂kj is the empirical Kendall’s tau statistic between Xk and Xj defined as
τ̂kj = 2/(n(n − 1)) ∑i<i′ sign(xik − xi′k) · sign(xij − xi′j).
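For concreteness, a small Python sketch of the transformed Kendall's tau estimator is given below, using scipy's kendalltau for the pairwise statistics; it is an illustrative implementation rather than the one used in the paper's experiments.

```python
import numpy as np
from scipy.stats import kendalltau

def transformed_kendall_tau(X):
    """Transformed Kendall's tau estimator: S_hat[k, j] = sin(pi * tau_hat[k, j] / 2),
    with ones on the diagonal, from an n x d data matrix X."""
    n, d = X.shape
    S_hat = np.eye(d)
    for k in range(d):
        for j in range(k + 1, d):
            tau, _ = kendalltau(X[:, k], X[:, j])
            S_hat[k, j] = S_hat[j, k] = np.sin(np.pi * tau / 2.0)
    return S_hat
```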
We then adopt the sparse column inverse operator to estimate the jth column of Θ*. In particular, we solve the following regularized quadratic optimization problem (Liu and Luo, 2015),
minΘ*j∈ℝd (1/2) Θ*jT Ŝ Θ*j − I*jT Θ*j + ℛλ(Θ*j).   (2.3)
For notational simplicity, we omit the column index j in (2.3), and denote Θ*j and I*j by θ and e respectively. Throughout the rest of this paper, if not specified, we study the following optimization problem for the transelliptical graph estimation
minθ∈ℝd ℱλ(θ) = (1/2) θT Ŝ θ − eT θ + ℛλ(θ).   (2.4)
The quadratic loss function used in (2.4) is twice differentiable with ∇ℒ(θ) = Ŝθ − e and ∇2ℒ(θ) = Ŝ.
Since the transformed Kendall’s tau estimator is rank-based and could be indefinite (Zhao et al., 2014b), the optimization problem in (2.3) may not be convex even if ℛλ(θ) is convex.
Remark 2.1
It is worth mentioning that the indefiniteness of Ŝ also makes (2.3) unbounded from below. But as will be shown later, our proposed algorithm can still guarantee a unique sparse local solution with optimal statistical properties under suitable conditions.
Remark 2.2
To handle the possible nonconvexity, Liu et al. (2012b) estimate Θ* based on a graphical model estimation procedure proposed in Cai et al. (2011) as follows,
(2.5) |
(2.5) is convex regardless of the indefiniteness of Ŝ. But a major disadvantage of (2.5) is its computational cost: existing solvers can only solve (2.5) up to moderate dimensions. We will present more empirical comparisons between (2.3) and (2.5) in our numerical experiments.
3 Method
For notational convenience, we rewrite the objective function ℱλ(θ) as
ℱλ(θ) = ℒ(θ) + ℋλ(θ) + λ‖θ‖1 = ℒ̃λ(θ) + λ‖θ‖1, where ℒ̃λ(θ) = ℒ(θ) + ℋλ(θ).
We call ℒ̃λ(θ) the augmented loss function, which is smooth but possibly nonconvex. We first introduce the path-following optimization scheme, which is a multistage optimization framework and also used in PISTA.
3.1 Path-following Optimization Scheme
The path-following optimization scheme solves the regularized optimization problem (2.1) using a decreasing sequence of N + 1 regularization parameters {λ0, λ1, …, λN}, and yields a sequence of N + 1 output solutions {θ̂{0}, θ̂{1}, …, θ̂{N}} from sparse to dense. We set the initial tuning parameter as λ0 = ‖∇ℒ(0)‖∞. By checking the KKT condition of (2.1) for λ0, we have
minξ∈∂‖0‖1 ‖∇ℒ̃λ0(0) + λ0ξ‖∞ = minξ∈∂‖0‖1 ‖∇ℒ(0) + λ0ξ‖∞ = 0,   (3.1)
where the second equality comes from ‖ξ‖∞ ≤ 1 and as introduced in §2.1. Since (3.1) indicates that 0 is a local solution to (2.1) for λ0, we take the leading output solution as θ̂{0} = 0. Let η ∈ (0, 1), we set λK = ηλK − 1 for K = 1, …, N. We then solve (2.1) for the regularization parameter λK with θ̂{K − 1} as the initial solution, which leads to the next output solution θ̂{K}. The path-following optimization scheme is illustrated in Algorithm 1.
3.2 Accelerated Iterative Shrinkage Thresholding Algorithm
We then explain the accelerated iterative shrinkage thresholding (AISTA) subroutine, which solves (2.1) in each stage of the path-following optimization scheme. For notational simplicity, we omit the stage index K, and only consider the iteration index m of AISTA. Suppose that AISTA takes some initial solution θ[0] and an initial step size parameter L[0], and we want to solve (2.1) with the regularization parameter λ. Then at the mth iteration of AISTA, we already have L[m] and θ[m]. Each iteration of AISTA contains two steps: The first one is the proximal gradient descent iteration, and the second one is the coordinate descent subroutine.
Algorithm 1.
Algorithm: |
Parameter: η, Lmin |
Initialize: λ0 = ‖∇ℒ(0)‖∞, θ̂{0} ← 0, L̂{0} ← Lmin |
For: K = 0, …, N − 1 |
λK+1 ← ηλK, {θ̂{K+1}, L̂{K+1}} ← AISTA(λK+1, θ̂{K}, L̂{K}) |
End for |
Output: {θ̂{0}, θ̂{1}, …, θ̂{N}} |
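The outer loop of Algorithm 1 can be sketched as follows; the aista callable stands for the subroutine described in §3.2, and its assumed signature aista(lam, theta_init, L_init) returning (theta_hat, L_hat) mirrors the call in Algorithm 1.

```python
import numpy as np

def path_following(grad_loss_at_zero, aista, eta=0.9, n_stages=100, L_min=1.0):
    """Path-following scheme (Algorithm 1): geometrically decreasing regularization
    parameters with warm starts from one stage to the next."""
    lam = np.max(np.abs(grad_loss_at_zero))      # lambda_0 = ||grad L(0)||_inf
    theta = np.zeros_like(grad_loss_at_zero)     # theta_hat{0} = 0 is a local solution at lambda_0
    L = L_min
    path = [(lam, theta.copy())]
    for _ in range(n_stages):
        lam = eta * lam                          # lambda_{K+1} = eta * lambda_K
        theta, L = aista(lam, theta, L)          # warm start from the previous stage
        path.append((lam, theta.copy()))
    return path
```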
(I) Proximal Gradient Descent Iteration
We consider the following quadratic approximation of ℱλ(θ) at θ = θ[m]:
𝒬λ,L[m+1](θ; θ[m]) = ℒ̃λ(θ[m]) + ∇ℒ̃λ(θ[m])T(θ − θ[m]) + (L[m+1]/2)‖θ − θ[m]‖2² + λ‖θ‖1,
where L[m+1] is the step size parameter such that 𝒬λ,L[m+1] (θ; θ[m]) ≥ ℱλ(θ). We then take a proximal gradient descent iteration and obtain θ[m+0.5] by
θ[m+0.5] = argminθ 𝒬λ,L[m+1](θ; θ[m]) = argminθ (L[m+1]/2)‖θ − θ̃[m]‖2² + λ‖θ‖1,   (3.2)
where θ̃[m] = θ[m] − ∇ℒ̃λ(θ[m])/L[m+1]. For notational simplicity, we write
𝒯λ,L[m+1](θ[m]) = argminθ 𝒬λ,L[m+1](θ; θ[m]).   (3.3)
For the sparse column inverse operator, we can obtain a closed form solution to (3.2) by soft thresholding: θj[m+0.5] = sign(θ̃j[m]) · max(|θ̃j[m]| − λ/L[m+1], 0) for j = 1, …, d.
The step size 1/L[m+1] can be obtained by the backtracking line search. In particular, we start with a small enough L[0]. Then in each iteration of the middle loop, we choose the minimum nonnegative integer z such that L[m+1] = 2^z L[m] satisfies
ℱλ(𝒯λ,L[m+1](θ[m])) ≤ 𝒬λ,L[m+1](𝒯λ,L[m+1](θ[m]); θ[m]).   (3.4)
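A minimal sketch of the proximal gradient step for the sparse column inverse operator is given below, assuming the gradient ∇ℒ̃λ(θ) = Ŝθ − e + ∇ℋλ(θ) from §2.2; h_grad can be any coordinate-wise gradient of ℋλ (e.g., h_mcp_grad from the sketch in §2.1).

```python
import numpy as np

def soft_threshold(x, thr):
    """Coordinate-wise soft thresholding S_thr(x)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def prox_step(theta, S_hat, e, h_grad, lam, L):
    """One proximal gradient step (3.2)-(3.3) for the sparse column inverse operator:
    gradient of the augmented loss is S_hat @ theta - e + h_grad(theta), followed by
    soft thresholding at level lam / L."""
    grad = S_hat @ theta - e + h_grad(theta)
    theta_tilde = theta - grad / L
    return soft_threshold(theta_tilde, lam / L)
```

In the backtracking version, L is doubled until (3.4) holds; with the fixed step size of Remark 3.1 that inner loop is skipped.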
(II) Coordinate Descent Subroutine
Unlike the proximal gradient algorithm, which repeats (3.3) until convergence at each stage of the path-following optimization scheme, AISTA exploits an additional coordinate descent subroutine to further boost the computational performance. More specifically, we define the active set 𝒜 = supp(θ[m+0.5]) and solve the following optimization problem
minθ∈ℝd ℱλ(θ) subject to θ𝒜⊥ = 0,   (3.5)
using the cyclic coordinate descent algorithm (CCDA) initiated by θ[m+0.5]. For notational simplicity, we omit the stage index K and iteration index m, and only consider the iteration index t of CCDA. Suppose that the CCDA algorithm takes some initial solution θ(0) for solving (2.1) with the regularization parameter λ. Without loss of generality, we denote 𝒜 = {1, …, |𝒜|}. At the tth iteration, we have θ(t). Then at the (t + 1)th iteration, we conduct the coordinate minimization cyclically over all active coordinates. Let w(t+1,k) be an auxiliary solution of the (t + 1)th iteration with the first k − 1 coordinates updated. For k = 1, we have w(t+1,1) = θ(t). We then update the kth coordinate to obtain the next auxiliary solution w(t+1,k+1).
More specifically, let ∇kℒ̃λ(θ) be the kth entry of ∇ℒ̃λ(θ). We minimize the objective function with respect to each selected coordinate and keep all other coordinates fixed,
(3.6) |
Once we obtain the minimizer of (3.6), we set it as the kth coordinate, keeping all other coordinates unchanged, to obtain the next auxiliary solution w(t+1,k+1). For the sparse column inverse operator, the coordinate-wise objective in (3.6) can be written as
(3.7) |
where the last equality comes from the fact that Ŝkk = 1 for all k = 1, …, d. By setting the subgradient of (3.7) equal to zero, we can obtain the minimizer in closed form as follows (see also the sketch after Algorithm 2):
For the ℓ1 norm regularization, we have .
- For the SCAD regularization, we have
- For the MCP regularization, we have
When all |𝒜| coordinate updates in the (t + 1)th iteration of CCDA finish, we set θ(t+1) = w(t+1,|𝒜|+1). We summarize CCDA in Algorithm 2. Once CCDA terminates, we denote its output solution by θ[m+1], and start the next iteration of AISTA. We summarize AISTA in Algorithm 3.
Algorithm 2.
Algorithm: θ̂ ← CCDA(λ, θ(0)). |
Initialize: t ← 0, 𝒜 = supp(θ(0)) |
Repeat: |
w(t+1,1) ← θ(t) |
For k = 1, …, |𝒜| |
w(t+1,k+1) ← update the kth coordinate of w(t+1,k) via (3.6) |
End for |
θ(t+1) ← w(t+1,|𝒜|+1), t ← t + 1 |
Until convergence |
θ̂ ← θ(t) |
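Below is a Python sketch of the CCDA subroutine for the sparse column inverse operator. Since the closed-form coordinate updates above are elided, the sketch uses the standard univariate ℓ1/SCAD/MCP solutions for a unit-curvature quadratic (Ŝkk = 1); treat these expressions and the simple stopping rule as assumptions consistent with, but not copied from, the paper.

```python
import numpy as np

def soft(z, thr):
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def coord_update(z, lam, beta, penalty="mcp"):
    """Minimizer of 0.5 * (w - z)**2 + penalty_lam(|w|) with unit curvature (S_hat_kk = 1)."""
    if penalty == "l1":
        return soft(z, lam)
    if penalty == "mcp":                               # requires beta > 1
        return soft(z, lam) / (1.0 - 1.0 / beta) if abs(z) <= beta * lam else z
    if penalty == "scad":                              # requires beta > 2
        if abs(z) <= 2.0 * lam:
            return soft(z, lam)
        if abs(z) <= beta * lam:
            return soft(z, beta * lam / (beta - 1.0)) * (beta - 1.0) / (beta - 2.0)
        return z
    raise ValueError("unknown penalty: " + penalty)

def ccda(S_hat, e, theta0, lam, beta, penalty="mcp", delta0=1e-5, max_iter=1000):
    """Cyclic coordinate descent (Algorithm 2 sketch) over the support of theta0."""
    theta = np.asarray(theta0, dtype=float).copy()
    active = np.flatnonzero(theta)                     # A = supp(theta^(0))
    for _ in range(max_iter):
        max_change = 0.0
        for k in active:
            # partial residual excluding the k-th coordinate's own contribution
            z = e[k] - S_hat[k] @ theta + S_hat[k, k] * theta[k]
            new = coord_update(z, lam, beta, penalty)
            max_change = max(max_change, abs(new - theta[k]))
            theta[k] = new
        if max_change <= delta0 * lam:                 # stop on small coordinate changes
            break
    return theta
```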
Remark 3.1
The backtracking line search procedure in PISTA has been extensively studied in the existing optimization literature on adaptive step size selection (Dennis and Schnabel, 1983; Nocedal and Wright, 2006), especially for proximal gradient algorithms (Beck and Teboulle, 2009b,a; Nesterov, 2013). Many empirical results have corroborated its better computational performance compared with using a fixed step size. But unlike the classical proximal gradient algorithms, APISTA can efficiently reduce the objective value by the coordinate descent subroutine in each iteration. Therefore we can simply choose a constant step size parameter L such that
L ≥ maxθ Λmax(∇2ℒ(θ)).   (3.8)
The step size parameter L in (3.8) guarantees 𝒬λ,L(θ; θ[m]) ≥ ℱλ(θ) in each iteration of AISTA. For the sparse column inverse operator, ∇2ℒ(θ) = Ŝ does not depend on θ. Therefore we choose L = Λmax(Ŝ).
Algorithm 3.
Algorithm: {θ̂, L̂} ← AISTA(λ, θ[0], L[0]) |
Initialize: m ← 0 |
Repeat: |
z ← 0 |
Repeat: |
L[m+1] ← 2^z L[m], θ[m+0.5] ← 𝒯λ,L[m+1] (θ[m]), z ← z + 1 |
Until: 𝒬λ,L[m+1] (θ[m+0.5]; θ[m]) ≥ ℱλ(θ[m+0.5]) |
θ[m+1] ← CCDA(λ, θ[m+0.5]), m ← m + 1 |
Until convergence |
θ̂ ← θ[m−0.5], L̂ ← L[m] |
Output: {θ̂, L̂} |
Our numerical experiments show that choosing a fixed step size not only simplifies the implementation, but also attains better empirical computational performance than the backtracking line search. See more details in §5.
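Combining the previous sketches, a fixed-step AISTA iteration might look as follows. The constant step size L = Λmax(Ŝ) follows Remark 3.1, while the simple iterate-change stopping rule (instead of the KKT residual ωλ of §3.3) and the use of MCP are simplifications of ours; the sketch reuses prox_step, h_mcp_grad, and ccda from the earlier sketches.

```python
import numpy as np

def aista_fixed_step(S_hat, e, lam, theta0, beta, L=None, tol=1e-5, max_iter=500):
    """AISTA sketch with the constant step size L = Lambda_max(S_hat) of Remark 3.1:
    one proximal step followed by the CCDA subroutine, repeated until convergence."""
    if L is None:
        L = np.max(np.linalg.eigvalsh(S_hat))          # fixed step size parameter
    h_grad = lambda t: h_mcp_grad(t, lam, beta)        # gradient of the concave part (MCP)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(max_iter):
        theta_half = prox_step(theta, S_hat, e, h_grad, lam, L)   # theta^[m+0.5]
        theta_new = ccda(S_hat, e, theta_half, lam, beta)         # theta^[m+1]
        converged = np.max(np.abs(theta_new - theta)) <= tol * lam
        theta = theta_new
        if converged:
            break
    return theta, L
```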
3.3 Stopping Criteria
Since θ is a local minimum if and only if the KKT condition minξ∈∂‖θ‖1 ‖∇ℒ̃λ(θ) + λξ‖∞ = 0 holds, we terminate AISTA when
ωλ(θ[m]) = minξ∈∂‖θ[m]‖1 ‖∇ℒ̃λ(θ[m]) + λξ‖∞ ≤ ε,   (3.9)
where ε is the target precision and usually proportional to the regularization parameter. More specifically, given the regularization parameter λK, we have
ε = δKλK,   (3.10)
where δK ∈ (0, 1) is a convergence parameter for the Kth stage of the path-following optimization scheme. Moreover, for CCDA, we terminate the iteration when
(3.11) |
where δ0 ∈ (0, 1) is a convergence parameter. This stopping criterion is natural to the sparse coordinate descent algorithm, since we only need to calculate the value change of each coordinate (not the gradient). We will discuss how to choose δK’s and δ0 in §4.1.
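The KKT residual ωλ(θ) in (3.9) has a simple coordinate-wise form, sketched below; grad_aug_loss is assumed to be the already-computed gradient ∇ℒ̃λ(θ).

```python
import numpy as np

def kkt_residual(theta, grad_aug_loss, lam):
    """omega_lam(theta): minimum over subgradients xi of ||theta||_1 of
    ||grad_aug_loss + lam * xi||_inf, computed coordinate-wise."""
    g = np.asarray(grad_aug_loss, dtype=float)
    theta = np.asarray(theta, dtype=float)
    r = np.where(theta != 0,
                 np.abs(g + lam * np.sign(theta)),      # xi_j = sign(theta_j) is forced
                 np.maximum(np.abs(g) - lam, 0.0))      # best xi_j lies in [-1, 1]
    return float(np.max(r))
```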
4 Theory
Before we present the computational and statistical theories of APISTA, we introduce some additional assumptions. The first one is about the choice of regularization parameters.
Assumption 4.1
Let δK’s and η satisfy
where η is the rescaling parameter of the path-following optimization scheme, δK’s are the convergence parameters defined in (3.10), and δ0 is the convergence parameter defined in (3.11). We have the regularization parameters
Assumption 4.1 has been extensively studied in the existing literature on high dimensional statistical theory of regularized M-estimators (Rothman et al., 2008; Zhang and Huang, 2008; Negahban and Wainwright, 2011; Negahban et al., 2012). It requires the regularization parameters to be large enough such that irrelevant variables can be eliminated along the solution path. Though ‖∇ℒ(θ*)‖∞ cannot be explicitly calculated (θ* is unknown), we can exploit concentration inequalities to show that Assumption 4.1 holds with high probability (Ledoux, 2005). In particular, we will verify Assumption 4.1 for sparse transelliptical graphical model estimation in Lemma 4.8.
Before we proceed with our second assumption, we define the largest and smallest s-sparse eigenvalues of the Hessian matrix of the loss function as follows.
Definition 4.1
Given an integer s ≥ 1, we define the largest and smallest s-sparse eigenvalues of ∇2ℒ(θ) as
Largest s-Sparse Eigenvalue: ρ+(s) = sup{ vT∇2ℒ(θ)v / vTv : v ≠ 0, ‖v‖0 ≤ s },
Smallest s-Sparse Eigenvalue: ρ−(s) = inf{ vT∇2ℒ(θ)v / vTv : v ≠ 0, ‖v‖0 ≤ s }.
Moreover, we define ρ̃−(s) = ρ−(s) − α and ρ̃+(s) = ρ+(s) for notational simplicity, where α is defined in (2.2).
The next lemma shows the connection between the sparse eigenvalue conditions and restricted strongly convex and smooth conditions.
Lemma 4.1
Given ρ−(s) > 0, for any θ, θ′ ∈ ℝd with |supp(θ) ∪ supp(θ′)| ≤ s, we have
Moreover, if ρ−(s) > α, then we have
and for any ξ ∈ ∂‖θ‖1,
The proof of Lemma 4.1 is provided in Wang et al. (2014), therefore omitted. We then introduce the second assumption.
Assumption 4.2
Given ‖θ*‖0 ≤ s*, there exists an integer s̃ satisfying
where κ = ρ+(s* + 2s̃)/ρ̃−(s* + 2s̃).
Assumption 4.2 requires that ℒ̃λ(θ) satisfies the strong convexity and smoothness when θ is sparse. As will be shown later, APISTA can always guarantee that the number of irrelevant coordinates with nonzero values does not exceed s̃. Therefore the restricted strong convexity is preserved along the solution path. We will verify that Assumption 4.2 holds with high probability for the transelliptical graphical model estimation in Lemma 4.9.
Remark 4.2 (Step Size Initialization)
We take the initial step size parameter as Lmin ≥ ρ+(1). For sparse column inverse operator, we directly choose Lmin = ρ+(1) = 1.
4.1 Computational Theory
We develop the computational theory of APISTA. For notational simplicity, we define and for characterizing the solution sparsity. We first start with the convergence analysis for the cyclic coordinate descent algorithm (CCDA). The next theorem presents its rate of convergence in terms of the objective value.
Theorem 4.3 (Geometric Rate of Convergence of CCDA)
Suppose that Assumption 4.2 holds. Given a sparse initial solution satisfying , (3.5) is a strongly convex optimization problem with a unique global minimizer θ̄. Moreover, for t = 1, 2…, we have
The proof of Theorem 4.3 is provided in Appendix A. Theorem 4.3 suggests that when the initial solution is sparse, CCDA essentially solves a strongly convex optimization problem with a unique global minimizer. Consequently, we can establish the geometric rate of convergence in terms of the objective value for CCDA. We then proceed with the convergence analysis of AISTA. The next theorem presents its theoretical rate of convergence in terms of the objective value.
Theorem 4.4 (Geometric Rate of Convergence of AISTA)
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if the initial solution θ[0] satisfies
(4.1) |
then we have for m = 0.5, 1, 1.5, 2, …. Moreover, for m = 1, 2, …, we have
where θ̄λ is a unique sparse local solution to (2.1) satisfying ωλ(θ̄λ) = 0 and .
The proof of Theorem 4.4 is provided in Appendix B. Theorem 4.4 suggests that all solutions of AISTA are sparse, so that the restricted strongly convex and smooth conditions hold for all iterations. Therefore, AISTA attains the geometric rate of convergence in terms of the objective value. Theorem 4.4 requires a proper initial solution satisfying (4.1). This can be verified by the following theorem.
Theorem 4.5 (Path-following Optimization Scheme)
Suppose that Assumptions 4.1 and 4.2 hold. Given θ satisfying
(4.2) |
we have ωλK(θ) ≤ λK/2.
The proof of Theorem 4.5 is provided in Wang et al. (2014), therefore omitted. Since θ̂{0} naturally satisfies (4.2) for λ1, by Theorem 4.5 and induction, we can show that the path-following optimization scheme always guarantees that the output solution of the (K − 1)th stage is a proper initial solution for the Kth stage, where K = 1, …, N. Eventually, we combine Theorems 4.3 and 4.4 with Theorem 4.5, and establish the global geometric rate of convergence in terms of the objective value for APISTA in the next theorem.
Theorem 4.6 (Global Geometric Rate of Convergence of APISTA)
Suppose that Assumptions 4.1 and 4.2 hold. Recall that δ0 and δK’s are defined in §3.3, κ and s̃ are defined in Assumption 4.2, and α is defined in (2.2). We have the following results:
- At the Kth stage (K = 1, …, N), the number of coordinate descent iterations within each CCDA is at most C1 log (C2/δ0), where
- At the Kth stage (K = 1, …, N), the number of the proximal gradient iterations in each AISTA is at most C3 log (C4/δK), where
- To compute all N + 1 output solutions, the total number of coordinate descent iterations in APISTA is at most
(4.3) - At the Kth stage (K = 1, …, N), we have
The proof of Theorem 4.6 is provided in Appendix C. We then present a more intuitive explanation of Result (3). To secure the generalization performance in practice, we usually tune the regularization parameter over a refined sequence based on cross validation. In particular, we solve (2.1) using partial data with high precision for every regularization parameter. If we set δK = δoptλK for K = 1, …, N, where δopt is a very small value (e.g. 10−8), then we can rewrite (4.3) as
(4.4) |
where δ0 is some reasonably large value (e.g. 10−2) defined in §3.3. The iteration complexity in (4.4) depends on N.
Once the regularization parameter is selected, we still need to solve (2.1) using full data with some regularization sequence. But we only need high precision for the selected regularization parameter (e.g., λN), and for K = 1, …, N − 1, we only solve (2.1) for λK up to an adequate precision, e.g., δK = δ0 for K = 1, …, N − 1 and δN = δoptλN. Since 1/δopt is much larger than N, we can rewrite (4.3) as
(4.5) |
Now the iteration complexity in (4.5) does not depend on N.
Remark 4.7
To establish computational theories of APISTA with a fixed step size, we only need to slightly modify the proofs of Theorems 4.4 and 4.6 by replacing ρ+(s* + 2s̃) and ρ+(s* + s̃) by their upper bound L defined in (3.8). Then a global geometric rate of convergence can also be derived, but with a worse constant term.
4.2 Statistical Theory
We then establish the statistical theory of the SCIO estimator obtained by APISTA under transelliptical models. We use Θ* and Σ* to denote the true latent precision and covariance matrices. We assume that Θ* belongs to the following class of sparse, positive definite, and symmetric matrices:
where ψmax and ψmin are positive constants, and do not scale with (M, s*, n, d). Since Σ* = (Θ*)−1, we also have ψmin ≤ Λmin(Σ*) ≤ Λmax(Σ*) ≤ ψmax.
We first verify Assumptions 4.1 and 4.2 in the next two lemmas for transelliptical models.
Lemma 4.8
Suppose that . Given , we have
The proof of Lemma 4.8 is provided in Appendix D. Lemma 4.8 guarantees that the selected regularization parameter λN satisfies Assumption 4.1 with high probability.
Lemma 4.9
Suppose that . Given α = ψmin/2, there exist universal positive constants c1 and c2 such that for , with probability at least 1 − 2/d2, we have
where κ is defined in Assumption 4.2.
The proof of Lemma 4.9 is provided in Appendix E. Lemma 4.9 guarantees that if the Lipschitz constant α of h′λ(·) defined in (2.2) satisfies α = ψmin/2, then the transformed Kendall’s tau estimator Ŝ = ∇2ℒ(θ) satisfies Assumption 4.2 with high probability.
Remark 4.10
Since Assumptions 4.1 and 4.2 have been verified, by Theorem 4.6, we know that APISTA attains the geometric rate of convergence to a unique sparse local solution to (2.3) in terms of the objective value with high probability.
Recall that we use θ to denote Θ*j in (2.4). By solving (2.3) with respect to all d columns, we obtain the matrix estimators Θ̂{N} and Θ̅λN, where the jth column of Θ̂{N} is the output solution of APISTA corresponding to λN for the jth column problem, and the jth column of Θ̅λN is the unique sparse local solution corresponding to λN to which APISTA converges (j = 1, …, d). We then present concrete rates of convergence of the estimator obtained by APISTA under the matrix ℓ1 and Frobenius norms in the following theorem.
Theorem 4.11. [Parameter Estimation]
Suppose that , and α = ψmin/2. For , given , we have
The proof of Theorem 4.11 is provided in Appendix F. The results in Theorem 4.11 show that the SCIO estimator obtained by APISTA achieves the same rates of convergence as those for subgaussian distributions (Liu and Luo, 2015). Moreover, when using nonconvex regularization such as MCP and SCAD, we can achieve graph estimation consistency under the following assumption.
Assumption 4.3
Suppose that . Define ℰ* as the support of Θ*. There exists some universal constant c3 such that
Assumption 4.3 is a sufficient condition for sparse column inverse operator to achieve graph estimation consistency in high dimensions for transelliptical models. The violation of Assumption 4.3 may result in underselection of the nonzero entries in Θ*.
The next theorem shows that, with high probability, Θ̅λN and the oracle solution Θ̂o are identical. More specifically, let Θ̂o*j denote the jth column of the oracle solution for j = 1, …, d, defined as follows,
(4.6) |
Theorem 4.12. [Graph Estimation]
Suppose that , α = ψmin/2, and Assumption 4.3 holds. There exists a universal constant c4 such that , if we choose , then we have
The proof of Theorem 4.12 is provided in Appendix G. Since Θ̂o shares the same support with Θ*, Theorem 4.12 guarantees that the SCIO estimator obtained by APISTA can perfectly recover ℰ* with high probability. To the best of our knowledge, Theorem 4.12 is the first graph estimation consistency result for transelliptical models without any post-processing procedure (e.g. thresholding).
Remark 4.13
In Theorem 4.12, we choose , which is different from the regularization parameter selected in Lemma 4.8. But as long as we have , which is not an issue under the high dimensional scaling
λN ≥ 8‖∇ℒ(θ*)‖∞ still holds with high probability. Therefore all computational theories in §4.1 hold for Θ̅λN in Theorem 4.12.
5 Numerical Experiments
In this section, we study the computational and statistical performance of the APISTA method through numerical experiments on sparse transelliptical graphical model estimation. All experiments are conducted on a personal computer with an Intel Core i5 3.3 GHz CPU and 16GB memory. All programs are coded in double precision C, called from R. The computations are optimized by exploiting the sparsity of vectors and matrices. Thus we gain a significant speedup in vector and matrix manipulations (e.g. calculating the gradient and evaluating the objective value). We choose the MCP regularization with varying β’s for all simulations.
5.1 Simulated Data
We consider the chain and Erdös-Rényi graph generation schemes with varying d = 200, 400, and 800 to obtain the latent precision matrices:
Chain. Each node is assigned a coordinate j for j = 1, …, d. Two nodes are connected by an edge whenever the corresponding points are at distance no more than 1.
Erdös-Rényi. We set an edge between each pair of nodes with probability 1/d, independently of the other edges.
Two illustrative examples are presented in Figure 5.1. Let 𝒟 be the adjacency matrix of the generated graph, and ℳ2 be the rescaling operator that converts a symmetric positive semidefinite matrix to a correlation matrix. We calculate
We use Σ* as the covariance matrix to generate n = ⌈60 log d⌉ independent observations from a multivariate t-distribution with mean 0 and 3 degrees of freedom. We then adopt the power transformation g(t) = t^5 to convert the t-distributed data to transelliptical data. Note that the corresponding latent precision matrix is Ω* = (Σ*)−1. We compare the following five computational methods:
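A simplified data-generation sketch is given below. The exact construction of Σ* from the adjacency matrix 𝒟 and the rescaling operator ℳ2 is elided above, so the tridiagonal precision matrix and its off-diagonal value are placeholders of ours; the multivariate t sampling and the power transform g(t) = t^5 follow the description in the text.

```python
import numpy as np

def chain_precision(d, off_diag=0.4):
    """Tridiagonal (chain-graph) latent precision matrix; a stand-in for the elided
    construction from the adjacency matrix D and the rescaling operator M2."""
    return np.eye(d) + off_diag * (np.eye(d, k=1) + np.eye(d, k=-1))

def sample_transelliptical_t(Theta, n, df=3, power=5, seed=0):
    """Draw n samples from a multivariate t with latent covariance inv(Theta), then apply
    the monotone power transform g(t) = t**power coordinate-wise (odd power)."""
    rng = np.random.default_rng(seed)
    Sigma = np.linalg.inv(Theta)
    d = Sigma.shape[0]
    Z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    chi = rng.chisquare(df, size=n) / df
    T = Z / np.sqrt(chi)[:, None]                      # multivariate t with df degrees of freedom
    return T ** power
```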
(1) APISTA: The computational algorithm proposed in §3.
(2) F-APISTA: APISTA without the backtracking line search (using a fixed step size instead).
(3) PISTA: The path-following iterative shrinkage thresholding algorithm proposed in Wang et al. (2014).
(4) CLIME: The sparse latent precision matrix estimation method proposed in Liu et al. (2012b), which solves (2.5) by the ADMM method (Alternating Direction Method of Multipliers, Li et al. (2015); Liu et al. (2014)).
(5) SCIO(P): The SCIO estimator based on the positive semidefinite projection method proposed in Zhao et al. (2014b). More specifically, we first project the possibly indefinite Kendall’s tau matrix onto the cone of all positive semidefinite matrices. Then we plug the obtained replacement into (2.3), and solve it by the coordinate descent method proposed in Liu and Luo (2015).
Note that (4) and (5) have theoretical guarantees only when the ℓ1 norm regularization is applied. For (1)–(3), we set δ0 = δK = 10−5 for K = 1, …, N.
We first compare the statistical performance in parameter estimation and graph estimation of all methods. To this end, we generate a validation set of the same size as the training set. We use the regularization sequence with N = 100 and . The optimal regularization parameter is selected by
where Θ̂λ denotes the estimated latent precision matrix using the training set with the regularization parameter λ, and S̃ denotes the estimated latent covariance matrix using the validation set. We repeat the simulation 100 times, and summarize the averaged results in Tables 5.1 and 5.2. For all settings, we set δ0 = δK = 10−5. We also vary β of the MCP regularization from 100 to 20/19, so the corresponding α varies from 0.01 to 0.95. The parameter estimation performance is evaluated by the difference between the obtained estimator and the true latent precision matrix under the Frobenius and matrix ℓ1 norms. The graph estimation performance is evaluated by the true positive rate (T. P. R.) and false positive rate (F. P. R.) defined as follows,
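The displayed definitions of T. P. R. and F. P. R. are elided, so the following sketch computes them in the usual way, over off-diagonal entries of the estimated and true latent precision matrices.

```python
import numpy as np

def tpr_fpr(Theta_hat, Theta_star, tol=0.0):
    """True and false positive rates of the estimated edge set versus the true one,
    over off-diagonal entries only (the usual convention)."""
    d = Theta_star.shape[0]
    off = ~np.eye(d, dtype=bool)
    est = (np.abs(Theta_hat) > tol) & off
    true = (np.abs(Theta_star) > tol) & off
    tpr = est[true].mean()                 # fraction of true edges that are recovered
    fpr = est[off & ~true].mean()          # fraction of non-edges that are selected
    return tpr, fpr
```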
Table 5.1.
Method | d | ‖Θ̂−Θ‖F | ‖Θ̂−Θ‖1 | T. P. R. | F. P. R. | α |
---|---|---|---|---|---|---|
PISTA | 200 | 4.1112(0.7856) | 1.0517(0.1141) | 1.0000(0.0000) | 0.0048(0.0079) | 0.20 |
400 | 6.4507(0.9062) | 1.0756(0.0717) | 1.0000(0.0000) | 0.0007(0.0004) | 0.20 | |
800 | 8.2640(1.1456) | 1.0434(0.0673) | 1.0000(0.0000) | 0.0003(0.0006) | 0.20 | |
APISTA | 200 | 2.5162(0.2677) | 0.7665(0.1583) | 0.9993(0.0012) | 0.0001(0.0001) | 0.95 |
400 | 3.3664(0.2735) | 0.8298(0.0986) | 1.0000(0.0000) | 0.0002(0.0000) | 0.67 | |
800 | 5.0244(0.7984) | 0.9312(0.1226) | 1.0000(0.0000) | 0.0002(0.0004) | 0.50 | |
F-APISTA | 200 | 2.5163(0.2670) | 0.7658(0.1559) | 0.9994(0.0015) | 0.0001(0.0002) | 0.95 |
400 | 3.3629(0.2702) | 0.8253(0.0959) | 1.0000(0.0000) | 0.0002(0.0000) | 0.67 | |
800 | 5.0237(0.7963) | 0.9373(0.1289) | 1.0000(0.0000) | 0.0002(0.0005) | 0.50 | |
SCIO(P) | 200 | 6.1812(1.2924) | 1.2245(0.0777) | 1.0000(0.0000) | 0.0165(0.0220) | 0.00 |
400 | 8.9991(0.9894) | 1.2255(0.0785) | 1.0000(0.0000) | 0.0058(0.0047) | 0.00 | |
CLIME | 200 | 6.4771(0.8617) | 1.2187(0.0358) | 1.0000(0.0000) | 0.0126(0.0043) | 0.00 |
400 | 9.1221(0.9997) | 1.2177(0.0629) | 1.0000(0.0000) | 0.0043(0.0032) | 0.00 |
Table 5.2.
Method | d | ‖Θ̂−Θ‖F | ‖Θ̂−Θ‖1 | T. P. R. | F. P. R. | α̂ |
---|---|---|---|---|---|---|
PISTA | 200 | 3.2647(0.1235) | 1.6807(0.2675) | 1.0000(0.0000) | 0.0587(0.0013) | 0.20 |
400 | 4.5609(0.7666) | 2.2113(0.3358) | 1.0000(0.0000) | 0.0295(0.0091) | 0.20 | |
800 | 5.0751(0.3832) | 2.5718(0.2826) | 1.0000(0.0000) | 0.0099(0.0020) | 0.20 | |
APISTA | 200 | 2.2888(0.1141) | 1.1644(0.2343) | 1.0000(0.0000) | 0.0193(0.0005) | 0.33 |
400 | 3.2206(0.2733) | 1.4974(0.2778) | 1.0000(0.0000) | 0.0067(0.0100) | 0.33 | |
800 | 4.0929(0.1862) | 1.6347(0.2023) | 1.0000(0.0000) | 0.0036(0.0008) | 0.50 | |
F-APISTA | 200 | 2.2890(0.1161) | 1.1647(0.2390) | 1.0000(0.0000) | 0.0197(0.0007) | 0.33 |
400 | 3.2251(0.2702) | 1.4928(0.2731) | 1.0000(0.0000) | 0.0060(0.0102) | 0.33 | |
800 | 4.0984(0.1891) | 1.6397(0.2096) | 1.0000(0.0000) | 0.0034(0.0009) | 0.50 | |
SCIO(P) | 200 | 3.4277(0.5405) | 1.5213(0.3223) | 1.0000(0.0000) | 0.0618(0.0170) | 0.00 |
400 | 5.7144(0.8158) | 1.9057(0.2933) | 0.9994(0.0017) | 0.0341(0.0145) | 0.00 | |
CLIME | 200 | 3.6297(0.6103) | 1.4876(0.2855) | 1.0000(0.0000) | 0.0581(0.0159) | 0.00 |
400 | 5.9206(0.8385) | 1.8246(0.2817) | 1.0000(0.0000) | 0.0320(0.0112) | 0.00 |
Since the convergence of PISTA is very slow when α is large, we only present its results for α = 0.2. APISTA and F-APISTA can work with larger α’s; therefore they effectively reduce the estimation bias and attain the best statistical performance in both parameter estimation and graph estimation among all estimators. The SCIO(P) and CLIME methods only use the ℓ1 norm without any bias reduction, so their performance is worse than that of the other competitors. Moreover, due to the poor scalability of their solvers, SCIO(P) and CLIME fail to output valid results within 10 hours when d = 800.
We then compare the computational performance of all methods. We use a regularization sequence with N = 50, and λN is properly selected such that the graphs obtained by all methods have approximately the same number of edges for each regularization parameter. In particular, the obtained graphs corresponding to λN have approximately 0.1 · d(d − 1)/2 edges. To make a fair comparison, we choose the ℓ1 norm regularization for all methods. We repeat the simulation 100 times, and the timing results are summarized in Tables 5.3 and 5.4. We see that F-APISTA is up to 10 times faster than PISTA, and APISTA is up to 5 times faster than PISTA. SCIO(P) and CLIME are much slower than the other three competitors.
Table 5.3.
d | PISTA | APISTA | F-APISTA | SCIO(P) | CLIME |
---|---|---|---|---|---|
200 | 0.8342(0.0248) | 0.2693(0.0031) | 0.1013(0.0022) | 2.6572(0.1253) | 8.5932(0.5396) |
400 | 3.8782(0.0696) | 1.2103(0.0368) | 0.4559(0.0308) | 25.451(2.5752) | 48.235(5.3494) |
800 | 30.014(0.3514) | 6.5970(0.2338) | 2.4283(0.2605) | 315.87(34.638) | 460.12(45.121) |
Table 5.4.
d | PISTA | APISTA | F-APISTA | SCIO(P) | CLIME |
---|---|---|---|---|---|
200 | 0.5401(0.0248) | 0.2048(0.0056) | 0.1063(0.0110) | 2.712(0.13558) | 7.1325(0.7891) |
400 | 3.0501(0.0829) | 0.9982(0.0453) | 0.4555(0.0071) | 26.140(2.1503) | 45.160(4.9026) |
800 | 28.581(0.3517) | 6.8417(0.7543) | 2.7037(0.2145) | 332.90(30.115) | 442.57(50.978) |
5.2 Real Data
We present a real data example to demonstrate the usefulness of the transelliptical graph obtained by the sparse column inverse operator (based on the transformed Kendall’s tau matrix). We acquire closing prices of all stocks of the S&P 500 for all the days that the market was open between January 1, 2003 and January 1, 2005, which results in 504 samples for the 452 stocks. We transform the dataset by calculating the log-ratio of the price at time t + 1 to the price at time t. The 452 stocks are categorized into 10 Global Industry Classification Standard (GICS) sectors.
We adopt the stability graphs obtained by the following procedure (Meinshausen and Bühlmann, 2010; Liu et al., 2010):
(1) Calculate the graph path using all samples, and choose the regularization parameter at the sparsity level 0.1;
(2) Randomly choose 50% of all samples without replacement and re-estimate the graph using the regularization parameter chosen in (1);
(3) Repeat (2) 100 times and retain the edges that appear with frequency no less than 95%. A sketch of this subsampling loop is given below.
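The sketch below follows steps (2)–(3); fit_graph is an assumed black-box estimator (e.g., the sparse column inverse operator applied column by column) returning an adjacency matrix at a given regularization parameter.

```python
import numpy as np

def stability_graph(X, fit_graph, lam, n_subsamples=100, frac=0.5, freq=0.95, seed=0):
    """Stability-graph procedure: refit on random half-samples at a fixed lambda and
    keep edges appearing in at least `freq` of the refits."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    counts = np.zeros((d, d))
    for _ in range(n_subsamples):
        idx = rng.choice(n, size=int(frac * n), replace=False)   # subsample without replacement
        counts += fit_graph(X[idx], lam)                         # boolean d x d adjacency matrix
    return counts / n_subsamples >= freq
```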
We choose the sparsity level 0.1 in (1) and the subsampling ratio 50% in (2) based on two criteria: the resulting graphs need to be sparse to ease visualization, interpretation, and computation; and the resulting graphs need to be stable. We then present the obtained stability graphs in Figure 5.2. The nodes are colored according to the GICS sector of the corresponding stock. We highlight a region in the transelliptical graph obtained by the SCIO method; by the color coding, we see that the nodes in this region belong to the same sector of the market. A similar pattern is also found in the transelliptical graph obtained by the CLIME method. In contrast, this region is shown to be sparse in the Gaussian graph obtained by the SCIO method (based on the Pearson correlation matrix). Therefore we see that the SCIO method is also capable of generating structures as refined as those of the CLIME method when estimating the transelliptical graph.
6 Discussions
We compare F-APISTA with a closely related algorithm – the path-following coordinate descent algorithm (PCDA)1 – in timing performance. In particular, we give a failure example of PCDA for solving sparse linear regression. Let X ∈ ℝn×d denote the design matrix and y ∈ ℝn denote the response vector. We solve the following regularized optimization problem,
We generate each row of the design matrix Xi* from a d-variate Gaussian distribution with mean 0 and covariance matrix Σ ∈ ℝd×d, where Σkj = 0.75 if k ≠ j and Σkk = 1 for all j, k = 1, …, d. We then normalize each column of the design matrix X*j such that . The response vector is generated from the linear model y = Xθ* + ε, where θ* ∈ ℝd is the regression coefficient vector, and ε is generated from an n-variate Gaussian distribution with mean 0 and covariance matrix I. We set n = 60 and d = 1000. We set the coefficient vector as , and θ*j = 0 for all j ≠ 250, 500, 750. We then set α = 0.95, N = 100, , and δ0 = δK = 10−5.
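The design and response generation can be sketched as follows. The nonzero coefficient magnitudes and the exact column normalization are elided in the text, so the values used below are placeholders of ours.

```python
import numpy as np

def make_regression_data(n=60, d=1000, rho=0.75, seed=0):
    """Equicorrelated Gaussian design, column normalization, and a 3-sparse signal.
    The nonzero magnitudes are placeholders (the paper's values are elided)."""
    rng = np.random.default_rng(seed)
    Sigma = np.full((d, d), rho) + (1.0 - rho) * np.eye(d)
    X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
    X *= np.sqrt(n) / np.linalg.norm(X, axis=0)        # normalize columns (an assumption)
    theta_star = np.zeros(d)
    theta_star[[249, 499, 749]] = [2.0, -2.0, 2.0]     # j = 250, 500, 750 (1-based), placeholder values
    y = X @ theta_star + rng.standard_normal(n)
    return X, y, theta_star
```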
We then generate a validation set using the same design matrix as the training set for the regularization selection. We denote the response vector of the validation set as ỹ ∈ ℝn. Let θ̂λ denote the obtained estimator using the regularization parameter λ. We then choose the optimal regularization parameter λ̂ by
We repeat the simulation 100 times, and summarize the average results in Table 6.1. We see that F-APISTA and PCDA attain similar timing results. But PCDA achieves worse statistical performance than F-APISTA in both support recovery and parameter estimation. This is because PCDA has no control over the solution sparsity. The overselection of irrelevant variables compromises the restricted strong convexity, and makes PCDA converge to local optima with poor statistical properties.
Table 6.1.
Method | ‖θ̂−θ*‖2 | ‖θ̂𝒮‖0 | ‖θ̂𝒮c‖0 | Correct Selection | Timing |
---|---|---|---|---|---|
F-APISTA | 0.8001(0.9089) | 2.801(0.5123) | 0.890(2.112) | 667/1000 | 0.0181(0.0025) |
PCDA | 1.1275(1.2539) | 2.655(0.7051) | 1.644(3.016) | 517/1000 | 0.0195(0.0021) |
Acknowledgments
Research supported by NSF Grants III-1116730 and NSF III-1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841, and FDA HHSF223201000072C.
Appendix
A Proof of Theorem 4.3
Proof
Since ‖θ(0)‖0 ≤ s* + s̃ implies that |𝒜| ≤ s* + s̃, by Assumption 4.2 and Lemma 4.1, we know that (3.5) is strongly convex over θ𝒜. Thus it has a unique global minimizer. We then analyze the amount of successive decrease. By the restricted strong convexity of ℱλ(θ), we have
(A.1) |
where satisfies the optimality condition of (3.6),
(A.2) |
By combining (A.1) with (A.2), we have
which further implies
(A.3) |
We then analyze the gap in the objective value yet to be minimized after each iteration. For any θ′, θ ∈ ℝd with , by the restricted strong convexity of ℱλ(θ), we have
(A.4) |
where ξ ∈ ℝd with ξ𝒜 ∈ ∂‖θ𝒜‖1 and ξ𝒜⊥ = 0. We then minimize both sides of (A.4) with respect to and obtain
(A.5) |
where (i) comes from (A.2) and (ii) comes from the restricted strong smoothness of ℒ̃λ(θ).
Eventually, by combining (A.5) with (A.3), we obtain
which further implies
(A.6) |
By recursively applying (A.6), we complete the proof.
B Proof of Theorem 4.4
Proof
Before we proceed with the proof, we first introduce several important lemmas.
Lemma B.1
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
(B.1) |
then we have
Lemma B.2
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
then we have ‖[𝒯L,λ(θ)]𝒮⊥‖0 ≤ s̃ for any L ≤ 2ρ+(s* + 2s̃).
The proofs of Lemmas B.1 and B.2 are provided in Wang et al. (2014), therefore omitted. Since the initial solution θ[0] satisfies the approximate KKT condition, by Lemma B.1 we know that θ[0] satisfies
(B.2) |
We assume L[m] ≤ 2ρ+(s* + 2s̃). Since , by (B.2) and Lemma B.2, we have θ[0.5] = 𝒯L,λ(θ[0]) and . Since the coordinate descent subroutine iterates over 𝒜 = supp(θ[0.5]), its output solution θ[1] also satisfies . Since the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, by (B.2), we also have
Then by induction, we know that all successive θ[m]’s satisfy for m = 1.5, 2, 2.5, ….
Now we verify L[m] ≤ 2ρ+(s* + 2s̃). We start with a small enough L = ρ+(1) ≤ 2ρ+(s* + 2s̃). If L does not satisfy the stopping criterion for the backtracking line search in (3.4), then we multiply L by 2. Once L enters the interval [ρ+(s* + 2s̃), 2ρ+(s* + 2s̃)], it stops increasing, because by the restricted strong smoothness of ℒ̃λ(θ), such a step size parameter always guarantees that the algorithm iterates from a sparse θ[m] to a sparse θ[m+0.5], and meanwhile satisfies the stopping criterion of the backtracking line search. Thus L[m] ≤ 2ρ+(s* + 2s̃) is verified.
The existence and uniqueness of θ̄λ has been verified in Wang et al. (2014). Therefore the proof is omitted. We then proceed to derive the geometric rate of convergence to θ̄λ by the next lemma.
Lemma B.3
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies
(B.3) |
given L ≤ 2ρ+(s* + 2s̃), then we have
The proof of Lemma B.3 is provided in Wang et al. (2014), therefore omitted. Since we have verified that all θ[m]’s satisfy (B.3) and all L[m]’s satisfy L[m] ≤ 2ρ+(s* + 2s̃) for m = 0, 1, 2, …, Lemma B.3 implies
(B.4) |
where the first inequality holds because the coordinate descent subroutine decreases the objective value. Then by recursively applying (B.4), we complete the proof.
C Proof of Theorem 4.6
Proof
Before we proceed with the proof of Result (1), we first introduce the following lemma.
Lemma C.1
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies
then for any λ′ ∈ [λN, λ], we have
The proof of Lemma C.1 is provided in Wang et al. (2014), therefore omitted. If we take λ = λ′ = λK and θ = θ̂{K−1}, then Lemma C.1 implies
(C.1) |
Recall (A.3) in Appendix A. Within each coordinate descent subroutine for λK, we have
(C.2) |
By combining Theorem 4.3 with (C.2), we have
Therefore given
(C.3) |
we have
which satisfies the stopping criterion of CCDA for λK. Since both the proximal gradient descent iteration and coordinate descent subroutine decrease the objective value, we have
(C.4) |
within each coordinate descent subroutine for the Kth stage. By combining (C.1) and (C.3) with (C.4), we have
Before we proceed with the proof of Result (2), we first introduce the following lemma.
Lemma C.2
Suppose that Assumptions 4.1 and 4.2 hold. For any λ ≥ λN, if θ satisfies,
(C.5) |
given L ≤ 2ρ+(s* + 2s̃), we have
The proof of Lemma C.2 is provided in Wang et al. (2014), therefore omitted. Recall that in Appendix B, we have shown that at the Kth stage, θ[m] satisfies (C.5). The backtracking line search guarantees L[m+1] ≤ 2ρ+(s* + 2s̃). Thus by Lemma C.2, we have
(C.6) |
where the last inequality holds since the coordinate descent subroutine decreases the objective value. By combining (C.6) with Theorem 4.4, we obtain
Thus as long as
(C.7) |
we have
which satisfies the stopping criterion of AISTA at the Kth stage. By combining (C.1) with (C.7), we have
Result (3) is just a straightforward combination of Results (1) and (2).
To prove Result (4), we need to use Lemma C.1 again. In particular, for K < N, we take λ′ = λN, λ = λK and θ = θ̂{K}. We then have
(C.8) |
Since we have λK > λN for K = 1, …, N − 1, (C.8) implies
(C.9) |
For K = N, (C.8) implies
(C.10) |
D Proof of Lemma 4.8
Proof
Before we proceed with the proof, we need to introduce the following lemma.
Lemma D.1
Suppose that . We have
(D.1) |
The proof of Lemma D.1 is provided in Liu et al. (2012a), therefore omitted. We consider the following decomposition,
(D.2) |
Then by combining (D.1) and (D.2) with the fact ‖θ*‖1 ≤ ‖Θ*‖1 ≤ M, we have
which completes the proof.
E Proof of Lemma 4.9
Proof
Before we proceed with the proof, we first introduce the following lemma.
Lemma E.1
Suppose that . There exists a universal constant c2 such that
(E.1) |
The proof of Lemma E.1 is provided in Han and Liu (2015), therefore omitted. We consider the decomposition
(E.2) |
By assuming ‖θ‖0 ≤ s* + 2s̃ and
we further have
(E.3) |
(E.4) |
Thus for , we have
Given α = ψmin/2, we have
(E.5) |
Since we need to secure s̃ = c1s* ≥ (144κ2 + 250κ)s*, we take
(E.6) |
In other words, we need
Eventually by combining (E.1) and (E.5) with (E.6), we complete the proof.
F Proof of Theorem 4.11
Proof
Recall that the output solution θ̂{N} satisfies and ωλN ≤ δNλN. By Lemma B.1, we have
(F.1) |
By the definition of the matrix ℓ1 and Frobenius norms, we have
(F.2) |
Recall that we use θ̂{N} to denote an arbitrary column of Θ̂{N}. By combining (F.2) with (F.1), we have
Since all above results rely on Assumptions 4.1 and 4.2, by Lemma 4.8 and 4.9, we have
with probability 1 − 3d−2, which completes the proof.
G Proof of Theorem 4.12
Proof
For notational simplicity, we omit the column index j, and use 𝒮 and θ̂o ∈ ℝd to denote the true support 𝒮j and corresponding oracle estimator Θ̂o respectively for the jth column. In particular, we can rewrite (4.6) as follows,
(G.1) |
Suppose that Assumption 4.2 holds. We have
which implies that Ŝ𝒮𝒮 is positive definite. Thus (G.1) is strongly convex and θ̂o is its unique minimizer. In our following analysis, we also assume
(G.2) |
By the strong convexity of (G.1), we have
(G.3) |
where (i) comes from the fact that θ̂o is the minimizer to (G.1). For notational simplicity, we denote . By the Cauchy-Schwarz inequality, (G.3) can be rewritten as
where the last inequality comes from (G.2) and the fact that Δ̂o contains at most s* entries. By simple manipulations, we obtain
(G.4) |
where the last inequality comes from the fact ‖θ*‖1 ≤ ‖Θ*‖1 ≤ M. By combining (G.4) with Assumption 4.3, we obtain
where (i) comes from the fact . Now we assume for some constant c4 (will be discussed later). We then have
Now we show that θ̂o is a sparse local solution to (2.4). In particular, we have the following decomposition,
Since is the minimizer to (G.1), by the KKT condition of (G.1), we have
(G.5) |
Moreover, since , we have
(G.6) |
By combining (G.5) with (G.6), we have
(G.7) |
Now we consider
Therefore as long as
we have , which implies that there exists ξ ∈ ∂‖0‖1 such that
(G.8) |
By combining (G.7) with (G.8), we know that θ̂o satisfies the KKT condition and is a local solution to (2.4).
Now we will show that θ̂o and θ̄λN are identical. Since and , we have
By the restricted strong convexity of ℱλN, we have
(G.9) |
(G.10) |
where ξ̃ and ξ̃o are defined as
By combining (G.9) with (G.10), we have , i.e., θ̂o = θ̄λN. Note that we choose , which is different from the selected regularization parameter in Assumption 4.8. But as long as we have , which is not an issue under the high dimensional scaling
λN ≥ 8‖∇ℒ(θ*)‖∞ still holds with high probability. Since the above results hold uniformly over all columns of Θ̅λN and Θ* under Assumptions 4.1 and 4.2, by Lemmas 4.8 and 4.9, we obtain Θ̂o = Θ̅λN, which completes the proof.
Footnotes
In our numerical experiments, PCDA is implemented by the R package “ncvreg”.
References
- Banerjee O, El Ghaoui L, d’Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data. The Journal of Machine Learning Research. 2008;9:485–516.
- Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. Image Processing, IEEE Transactions on. 2009a;18:2419–2434. doi: 10.1109/TIP.2009.2028250.
- Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009b;2:183–202.
- Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics. 2011;5:232–253. doi: 10.1214/10-AOAS388.
- Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
- Dennis JJE, Schnabel RB. Numerical methods for unconstrained optimization and nonlinear equations. Vol. 16. SIAM; 1983.
- Fan J, Feng Y, Tong X. A road to classification in high dimensional space: the regularized optimal affine discriminant. Journal of the Royal Statistical Society: Series B. 2012;74:745–771. doi: 10.1111/j.1467-9868.2012.01029.x.
- Fan J, Feng Y, Wu Y. Network exploration via the adaptive lasso and scad penalties. The Annals of Applied Statistics. 2009;3:521–541. doi: 10.1214/08-AOAS215SUPP.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Xue L, Zou H. Strong oracle optimality of folded concave penalized estimation. The Annals of Statistics. 2014;42:819–849. doi: 10.1214/13-aos1198.
- Friedman J, Hastie T, Höfling H, Tibshirani R. Pathwise coordinate optimization. The Annals of Applied Statistics. 2007;1:302–332.
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
- Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33:1–13.
- Fu WJ. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
- Han F, Liu H. Statistical analysis of latent generalized correlation matrix estimation in transelliptical distribution. Bernoulli. 2015 (Accepted). doi: 10.3150/15-BEJ702.
- Han F, Zhao T, Liu H. CODA: High dimensional copula discriminant analysis. Journal of Machine Learning Research. 2012;14:629–671.
- Jacob L, Obozinski G, Vert J-P. Group lasso with overlap and graph lasso; Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
- Kim Y, Kwon S. Global optimality of nonconvex penalized estimators. Biometrika. 2012;99:315–325.
- Ledoux M. The concentration of measure phenomenon. Vol. 89. AMS Bookstore; 2005.
- Li X, Zhao T, Yuan X, Liu H. The "flare" package for high-dimensional sparse linear regression in R. Journal of Machine Learning Research. 2015;16:553–557.
- Liu H, Han F, Yuan M, Lafferty J, Wasserman L. High-dimensional semiparametric gaussian copula graphical models. The Annals of Statistics. 2012a;40:2293–2326.
- Liu H, Han F, Zhang C-H. Transelliptical graphical models. Advances in Neural Information Processing Systems 25. 2012b.
- Liu H, Palatucci M, Zhang J. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery; Proceedings of the 26th Annual International Conference on Machine Learning; 2009.
- Liu H, Roeder K, Wasserman L. Stability approach to regularization selection (stars) for high dimensional graphical models. Advances in Neural Information Processing Systems. 2010.
- Liu H, Wang L, Zhao T. Sparse covariance matrix estimation with eigenvalue constraints. Journal of Computational and Graphical Statistics. 2014;23:439–459. doi: 10.1080/10618600.2013.782818.
- Liu H, Wang L, Zhao T. Calibrated multivariate regression with application to neural semantic basis discovery. Journal of Machine Learning Research. 2015;16:1579–1606.
- Liu W, Luo X. Fast and adaptive sparse precision matrix estimation in high dimensions. Journal of Multivariate Analysis. 2015;135:153–162. doi: 10.1016/j.jmva.2014.11.005.
- Lu Z, Xiao L. Randomized block coordinate non-monotone gradient method for a class of nonlinear programming. arXiv preprint arXiv:1306.5918. 2013.
- Mazumder R, Friedman JH, Hastie T. Sparsenet: Coordinate descent with nonconvex penalties. Journal of the American Statistical Association. 2011;106:1125–1138. doi: 10.1198/jasa.2011.tm09738.
- Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B. 2008;70:53–71.
- Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34:1436–1462.
- Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B. 2010;72:417–473.
- Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37:246–270.
- Negahban S, Wainwright MJ. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. The Annals of Statistics. 2011;39:1069–1097.
- Negahban SN, Ravikumar P, Wainwright MJ, Yu B. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science. 2012;27:538–557.
- Nesterov Y. On an approach to the construction of optimal methods of minimization of smooth convex functions. Ekonomika i Mateaticheskie Metody. 1988;24:509–517.
- Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–152.
- Nesterov Y. Gradient methods for minimizing composite objective function. Mathematical Programming Series B. 2013;140:125–161.
- Nocedal J, Wright S. Numerical optimization, series in operations research and financial engineering. New York: Springer; 2006.
- Qin Z, Scheinberg K, Goldfarb D. Efficient block-coordinate descent algorithms for the group lasso. Mathematical Programming Computation. 2010:1–27.
- Ravikumar P, Wainwright MJ, Raskutti G, Yu B. High-dimensional covariance estimation by minimizing ℓ1-penalized log-determinant divergence. Electronic Journal of Statistics. 2011;5:935–980.
- Rothman AJ, Bickel PJ, Levina E, Zhu J. Sparse permutation invariant covariance estimation. Electronic Journal of Statistics. 2008;2:494–515.
- Shalev-Shwartz S, Tewari A. Stochastic methods for ℓ1-regularized loss minimization. The Journal of Machine Learning Research. 2011;12:1865–1892.
- Shen X, Pan W, Zhu Y. Likelihood-based selection and sharp parameter estimation. Journal of the American Statistical Association. 2012;107:223–232. doi: 10.1080/01621459.2011.645783.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B. 2005;67:91–108.
- Tseng P, Yun S. Block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications. 2009a;140:513–535.
- Tseng P, Yun S. A coordinate gradient descent method for nonsmooth separable minimization. Mathematical Programming. 2009b;117:387–423.
- Van de Geer SA. High-dimensional generalized linear models and the lasso. The Annals of Statistics. 2008;36:614–645.
- Wainwright M. Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming. IEEE Transactions on Information Theory. 2009;55:2183–2201.
- Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension. The Annals of Statistics. 2013;41:2505–2536. doi: 10.1214/13-AOS1159.
- Wang Z, Liu H, Zhang T. Optimal computational and statistical rates of convergence for sparse nonconvex learning problems. The Annals of Statistics. 2014;42:2164–2201. doi: 10.1214/14-AOS1238.
- Wu TT, Lange K. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics. 2008;2:224–244.
- Xue L, Zou H, Cai T. Nonconcave penalized composite conditional likelihood estimation of sparse ising models. The Annals of Statistics. 2012;40:1403–1429.
- Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B. 2005;68:49–67.
- Yuan M, Lin Y. Model selection and estimation in the gaussian graphical model. Biometrika. 2007;94:19–35.
- Zhang C-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010a;38:894–942.
- Zhang C-H, Huang J. The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics. 2008;36:1567–1594.
- Zhang C-H, Zhang T. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science. 2012;27:576–593.
- Zhang T. Some sharp performance bounds for least squares regression with l1 regularization. The Annals of Statistics. 2009;37:2109–2144.
- Zhang T. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research. 2010b;11:1081–1107.
- Zhao P, Yu B. On model selection consistency of lasso. Journal of Machine Learning Research. 2006;7:2541–2563.
- Zhao T, Liu H. Sparse additive machine; International Conference on Artificial Intelligence and Statistics; 2012.
- Zhao T, Liu H. Calibrated precision matrix estimation for high-dimensional elliptical distributions. IEEE Transactions on Information Theory. 2014;60:7874. doi: 10.1109/TIT.2014.2360980.
- Zhao T, Liu H, Roeder K, Lafferty J, Wasserman L. The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research. 2012;13:1059–1062.
- Zhao T, Liu H, Zhang T. A general theory of pathwise coordinate optimization. arXiv preprint arXiv:1412.7477. 2014a.
- Zhao T, Roeder K, Liu H. Positive semidefinite rank-based correlation matrix estimation with application to semiparametric graph estimation. Journal of Computational and Graphical Statistics. 2014b;23:895–922. doi: 10.1080/10618600.2013.858633.
- Zhao T, Yu M, Wang Y, Arora R, Liu H. Accelerated mini-batch randomized block coordinate descent method. Advances in Neural Information Processing Systems. 2014c.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
- Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B. 2005;67:301–320.