Abstract
We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model – an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.
Index Terms: approximate dual certificate, convex optimization, sparsity, sparse group Lasso, simultaneously structured model
I. Introduction
Consider the high-dimensional double sparse regression with simultaneously group-wise and element-wise sparsity structures
\[ y = X\beta^* + \varepsilon. \tag{1} \]
Here, the covariates and parameter β* are divided into d known groups, where the jth group contains bj variables,
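\[ X = [X_{(1)}, \ldots, X_{(d)}], \quad X_{(j)} \in \mathbb{R}^{n \times b_j}; \qquad \beta^* = [(\beta^*_{(1)})^\top, \ldots, (\beta^*_{(d)})^\top]^\top, \quad \beta^*_{(j)} \in \mathbb{R}^{b_j}. \tag{2} \]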
β* is a (s, sg)-sparse vector in the sense that
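\[ \|\beta^*\|_0 = \sum_{i} 1\{\beta^*_i \neq 0\} \le s, \qquad \|\beta^*\|_{0,2} = \sum_{j=1}^{d} 1\{\beta^*_{(j)} \neq 0\} \le s_g. \tag{3} \]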
The focus of this paper is on the estimation of and inference for β* based on (y, X). This problem has great importance in a variety of applications. For example, in genome-wide association studies (GWAS) [1], the genes can be grouped into pathways, and it is believed that only a small portion of the pathways contain causal single nucleotide polymorphisms (SNPs) and that, within a causal pathway, the number of causal SNPs is much smaller than the number of non-causal SNPs. The sparse group Lasso has been applied to identify causal genes or SNPs associated with a certain trait [1]. Other examples include cancer diagnosis and therapy [2], [3], classification [4], and climate prediction [5], among many others. The problem can also be viewed as a prototype of various problems in statistics and machine learning, such as sparse multiple response regression [6] and multi-task learning [7], [8], [9].
The sparse group Lasso [10], [11], [12] provides a classic and straightforward estimator for β*:
| (4) |
Here, ‖β‖1 = Σi |βi| and ‖β‖1,2 = Σj ‖β(j)‖2 are the ℓ1 and ℓ1,2 convex regularizers that account for element-wise and group-wise sparsity structures, respectively, and λ, λg ≥ 0 are tuning parameters. In the noiseless setting where ε = 0, one can instead apply the constrained ℓ1 + ℓ1,2 minimization to estimate β*:
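\[ \hat\beta = \arg\min_{\beta} \; \lambda\|\beta\|_1 + \lambda_g\|\beta\|_{1,2} \quad \text{subject to } y = X\beta. \tag{5} \]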
In fact, when λ, λg tend to zero while λ/λg is fixed as a constant, the sparse group Lasso (4) tends to the ℓ1 + ℓ1,2 minimization (5).
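As a rough illustration of how an estimator of the form (4) can be computed, the following Python sketch runs proximal gradient descent on a squared loss plus the ℓ1 + ℓ1,2 penalty. Several choices are assumptions made for illustration and are not taken from this paper: the loss is scaled as ‖y − Xβ‖2²/(2n), the step size is the reciprocal of the Lipschitz constant of the gradient, and the names `soft_threshold`, `prox_sgl`, `sgl_objective`, and `sparse_group_lasso` are hypothetical. The sketch relies on the standard fact that, for non-overlapping groups, the proximal operator of λ‖·‖1 + λg‖·‖1,2 is element-wise soft-thresholding followed by group-wise shrinkage.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Element-wise soft-thresholding: sgn(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def prox_sgl(v, groups, t_lam, t_lam_g):
    """Prox of t*(lam*||.||_1 + lam_g*||.||_{1,2}) for non-overlapping groups."""
    out = soft_threshold(v, t_lam)            # element-wise step
    for idx in groups:                        # group-wise shrinkage step
        norm = np.linalg.norm(out[idx])
        if norm > 0.0:
            out[idx] *= max(0.0, 1.0 - t_lam_g / norm)
    return out

def sgl_objective(X, y, beta, groups, lam, lam_g):
    """Assumed objective: ||y - X beta||_2^2 / (2n) + lam*l1 + lam_g*l_{1,2}."""
    n = X.shape[0]
    loss = np.sum((y - X @ beta) ** 2) / (2 * n)
    l1 = np.abs(beta).sum()
    l12 = sum(np.linalg.norm(beta[idx]) for idx in groups)
    return loss + lam * l1 + lam_g * l12

def sparse_group_lasso(X, y, groups, lam, lam_g, n_iter=500):
    """Plain proximal gradient descent with constant step size 1/L."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2      # L = sigma_max(X)^2 / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n
        beta = prox_sgl(beta - step * grad, groups, step * lam, step * lam_g)
    return beta
```

Here `groups` is a list of integer index arrays, one per group. With the step size 1/L the objective value does not increase from one iteration to the next; a more careful implementation would add a stopping rule and possibly acceleration.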
When β* is only element-wise sparse, the regular Lasso [13]
| (6) |
can be applied and its theoretical properties have been well studied. See, for example, [14], [15]. When β* is only group-wise sparse, the group Lasso
| (7) |
and its variations have been widely investigated [16], [17], [18]. However, despite the many empirical successes of the sparse group Lasso in practice, its theoretical properties for estimating a simultaneously element-wise and group-wise sparse vector β*, including the optimal rate of convergence and the sample complexity, remain unclear to the best of our knowledge.
A. Simultaneously Structured Models
More broadly speaking, simultaneously structured models, in which the parameter of interest has multiple structures at the same time, have attracted enormous attention in many fields including statistics, applied mathematics, and machine learning. In addition to the high-dimensional double sparse regression, other simultaneously structured models include sparse principal component analysis [19], [20], tensor singular value decomposition [21], [22], simultaneously sparse and low-rank matrix/tensor recovery [23], [24], sparse matrix/tensor SVD [25], and sparse phase retrieval [26], [27], [28]. As shown in [29], [23], by minimizing multi-objective regularizers with norms associated with these structures (such as the ℓ1 norm for element-wise sparsity, the nuclear norm for low-rankness, and the total variation norm for piecewise constant structures), one usually cannot do better than applying an algorithm that exploits only one structure. In particular, they illustrated that simultaneously sparse and low-rank matrices cannot be well estimated by combining ℓ1 and nuclear norm penalties. Instead, non-convex methods were proposed and shown to achieve better performance.
However, based on their results, it remains an open question whether convex regularization, such as the sparse group Lasso or ℓ1 + ℓ1,2 minimization, can achieve good performance in estimating parameters with two types of sparsity structures, as in the aforementioned high-dimensional double sparse regression. Specifically, as illustrated in Section II-B, a direct application of [23] does not provide a sample complexity lower bound for exact recovery that matches our upper bound.
B. Optimality and Related Literature
This paper fills the void of statistical limits of sparse group Lasso and provides an affirmative answer to the aforementioned question: by exploiting both element-wise and group-wise sparsity structures, the ℓ1 + ℓ1,2 regularization does provide better performance in high-dimensional double sparse regression. In particular, in the noiseless case, it is shown that (s, sg)-sparse vectors can be exactly recovered and approximately (s, sg)-sparse vectors can be stably estimated with high probability whenever the sample size satisfies n ≳ sg log(d/sg) + s log(esgb), where b = max1≤i≤d bi. On the other hand, we prove that exact recovery cannot be achieved by ℓ1 + ℓ1,2 regularization and stable estimation of approximately (s, sg)-sparse vectors is impossible in general unless n ≳ sg log(d/sg) + s log(esgb/s). We then consider the noisy case and develop matching upper and lower bounds on the convergence rate of the estimation error. Simulation studies are carried out and the results support our theoretical findings. In addition, statistical inference for the individual coordinates of β* is studied. A confidence interval is constructed based on the debiased sparse group Lasso estimator and its asymptotic property. The results show that by exploiting the simultaneously element-wise and group-wise sparsity structures, the debiased sparse group Lasso requires a smaller sample size than the debiased Lasso and debiased group Lasso in the literature [30], [31], [32], [33].
The theoretical analysis of sparse group Lasso and ℓ1 + ℓ1,2 minimization is highly non-trivial. First, the regularizer λ‖ · ‖1 + λg‖ · ‖1,2 is not decomposable with respect to the support of β*, so the classic techniques of decomposable regularizers [34] and null space property [35] may not be suitable here. Second, despite a substantial body of literature on high-dimensional element-wise sparse vector estimation based on the restricted isometry property (RIP) [36], [37], [38], [39], [40] and the restricted eigenvalue [14], these techniques cannot provide nearly optimal results for sparse group Lasso here, as it is technically difficult to partition general vectors into simultaneously element-wise and group-wise sparse ones in a way that preserves some ordering structures. Departing from the previous literature, our theoretical analysis relies on a novel construction of an approximate dual certificate. See Section II-C for further details. Although our results mostly focus on the performance of sparse group Lasso and ℓ1 + ℓ1,2 estimators, the techniques of approximate dual certificates on multi-norm structures here can also be of independent interest.
The statistical properties of sparse group Lasso and related estimators have been studied previously. For example, [5] developed consistency results for estimators with general tree-structured norm regularizers, of which the sparse group Lasso is a special case. [41] analyzed the asymptotic behaviors of the adaptive sparse group Lasso estimator. [4], [42] studied the multi-task learning and classification problems based on a variant of the sparse group Lasso estimator. [12] studied multivariate linear regression via sparse group Lasso. [43] provided a theoretical framework for developing error bounds of the group Lasso, sparse group Lasso, and group Lasso with tree structured overlapping groups. Specifically, their results imply that the group-wise sparse signal can be exactly recovered with high probability by solving (5) if the sample size satisfies n ≳ sg (b + log d). Different from previous results, this paper focuses on both the required sample size and the convergence rate of the estimation error of sparse group Lasso. To the best of our knowledge, this is the first paper that provides optimal theoretical guarantees for both the sample complexity and estimation error of sparse group Lasso.
C. Organization of the Paper
The rest of the article is organized as follows. After a brief introduction to notation and preliminaries in Section II-A, the main theoretical results on constrained ℓ1 + ℓ1,2 minimization in the noiseless setting are presented in Section II-B and the key proof ideas are explained in Section II-C. Results for sparse group Lasso in the noisy setting are discussed in Section III. In particular, the optimal rate of estimation error and statistical inference are studied in Sections III-A and III-B, respectively. In Section IV-A, we introduce a practical scheme to select tuning parameters. In Section IV-B, we provide simulation results in both noiseless and noisy cases to support our theoretical findings. The proofs of technical results are given in Section VI. All technical lemmas and their proofs can be found in Appendix A.
II. ℓ1 + ℓ1,2 Minimization in Noiseless Case
A. Notation and Preliminaries
The following notation will be used throughout the paper. We denote a⋀b = min{a, b}, a⋁b = max{a, b}. Let sgn(·) be the sign function, i.e., sgn(x) = 1, 0, or −1, if x > 0, x = 0, or x < 0, respectively. Hα(·) is the soft-thresholding function such that Hα(x) = sgn(x) · {(|x| − α) ⋁ 0} for any x ∈ ℝ. We say a ≲ b and a ≳ b if a ≤ Cb and b ≤ Ca for some uniform constant C > 0, respectively. a ≍ b means a ≲ b and a ≳ b both hold. Let the uppercase C, C1, C0, … and lowercase c, c1, c0, … denote large and small positive constants respectively, whose actual values may vary from line to line. Throughout the paper, we focus on the parameter index set {1, …, p} partitioned into d groups. Denote (1), …, (d) ⊆ {1, … , p} as the index sets belonging to each group. Additionally, for any group index subset G ⊆ {1, …, d}, define (G) = ∪j∈G(j), (Gc) = ∪j∉G(j). For any vector γ and index subset T, γT represents the sub-vector of γ with index set T. In particular, γ(G) represents the sub-vector of γ in the union of Groups j ∈ G. Define the ℓq norm of any vector γ as ‖γ‖q = (∑i |γi|^q)^(1/q). For any vector γ with group structures, we also define the ℓq1,q2 norm for any 0 ≤ q1, q2 ≤ ∞ as ‖γ‖q1,q2 = (∑j (‖γ(j)‖q2)^q1)^(1/q1), with the usual conventions when q1 or q2 equals 0 or ∞.
In particular, ‖γ‖0,2 = ∑j 1{γ(j) ≠ 0} is the number of non-zero groups of γ, ‖γ‖∞,2 = maxj ‖γ(j)‖2 is the maximum ℓ2 norm among all groups of γ, and ‖γ‖1,2 = ∑j ‖γ(j)‖2 is the group-wise ℓ1 penalty. With a slight abuse of notation, we simply denote ‖γT‖q1,q2 = ‖u‖q1,q2, where u ∈ ℝ^p is the vector whose restriction to T is γT and whose restriction to Tc is 0.
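The mixed norms above can be computed by taking the ℓq2 norm within each group and then the ℓq1 norm of the resulting vector of group norms. The sketch below assumes this usual definition, with q = 0 counting non-zero entries and q = ∞ taking a maximum; the function names `lq_norm` and `mixed_norm` and the list-of-index-lists group format are hypothetical.

```python
import math

def lq_norm(v, q):
    """l_q 'norm' of a sequence: count of non-zeros for q = 0, max for q = inf."""
    if q == 0:
        return float(sum(1 for x in v if x != 0))
    if math.isinf(q):
        return max((abs(x) for x in v), default=0.0)
    return sum(abs(x) ** q for x in v) ** (1.0 / q)

def mixed_norm(gamma, groups, q1, q2):
    """l_{q1,q2} norm: l_{q1} norm of the vector of within-group l_{q2} norms."""
    group_norms = [lq_norm([gamma[i] for i in idx], q2) for idx in groups]
    return lq_norm(group_norms, q1)

# Small example with three groups of size two; the middle group is zero.
gamma = [3.0, 4.0, 0.0, 0.0, 0.0, 12.0]
groups = [[0, 1], [2, 3], [4, 5]]
group_l1 = mixed_norm(gamma, groups, 1, 2)          # group-wise l1 penalty
n_active = mixed_norm(gamma, groups, 0, 2)          # number of non-zero groups
max_group = mixed_norm(gamma, groups, math.inf, 2)  # largest group l2 norm
```

In this example the group ℓ2 norms are 5, 0, and 12, so the three quantities are their sum, the number of non-zero ones, and their maximum.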
The focus of this paper is on simultaneously element-wise and group-wise sparse vectors defined as follows.
Definition 1 (Simultaneous element-wise and group-wise sparsity): Assume is associated with group partition (1), …, (d). For positive integers s, sg satisfying sg ≤ d and , we say β* is (s, sg)- sparse if
B. Noiseless Case and Sample Complexity
To analyze the performance of sparse group Lasso and ℓ1 + ℓ1,2 minimization, we first introduce the following assumption on the design matrix X.
Assumption 1 (Sub-Gaussian assumption): Suppose all rows of X are i.i.d. centered sub-Gaussian distributed. Specifically, , , and for any , we have for constant κ > 0. We also assume there exist two constants Cmax ≥ cmin > 0 such that cmin ≤ σmin (Σ) ≤ σmax(Σ) ≤ Cmax, where σmax(Σ) and σmin(Σ) are the largest and smallest eigenvalues of Σ, respectively.
Clearly, a random matrix X with i.i.d. standard normal entries satisfies this assumption. This design is referred to as the Gaussian ensemble and has been considered as a benchmark setting in the compressed sensing and high-dimensional regression literature [44], [45].
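For experimentation, a Gaussian ensemble together with a simultaneously element-wise and group-wise sparse parameter can be simulated as in the sketch below. This is an illustrative construction only: equal group sizes, a uniformly random support inside sg randomly chosen groups, and ±1 signal values are all assumptions, and the function name `draw_double_sparse_instance` is hypothetical.

```python
import numpy as np

def draw_double_sparse_instance(n, d, b, s, sg, sigma=0.0, seed=0):
    """Draw (X, y, beta) with X a Gaussian ensemble and beta (s, sg)-sparse."""
    if not (sg <= d and s <= sg * b):
        raise ValueError("need sg <= d and s <= sg * b")
    rng = np.random.default_rng(seed)
    p = d * b
    X = rng.standard_normal((n, p))                # i.i.d. N(0, 1) entries
    active_groups = rng.choice(d, size=sg, replace=False)
    # All coordinates belonging to the chosen groups (group j = j*b .. j*b+b-1).
    candidates = np.concatenate(
        [np.arange(j * b, (j + 1) * b) for j in active_groups]
    )
    support = rng.choice(candidates, size=s, replace=False)
    beta = np.zeros(p)
    beta[support] = rng.choice([-1.0, 1.0], size=s)  # assumed signal values
    y = X @ beta + sigma * rng.standard_normal(n)    # sigma = 0: noiseless
    return X, y, beta
```

By construction the returned vector has exactly s non-zero entries spread over at most sg groups, so it is (s, sg)-sparse in the sense used in this paper.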
The following theorem shows that the ℓ1 + ℓ1,2 minimization achieves the exact recovery with high probability when β* is simultaneously element-wise and group-wise sparse, X is weakly dependent, and Assumption 1 holds. The theorem also provides a more general upper bound on estimation error if β* is approximately element-wise and group-wise sparse.
Theorem 1 (ℓ1 + ℓ1,2 minimization in noiseless case): Suppose one observes y = Xβ*, where X has the group structure (2) and satisfies Assumption 1, β* is (s, sg)-sparse, and b = max1≤i≤d bi. Let T be the support of β*. Suppose there exist uniform constants C, c > 0 such that
| (8) |
| (9) |
then the constrained ℓ1 + ℓ1,2 minimization (5) with achieves the exact recovery with probability at least 1 − C exp(−cn/s).
Moreover, if is a general vector and is the solution to the constrained ℓ1 + ℓ1,2 minimization (5) with , then
| (10) |
with probability at least 1 − C exp(−cn/s). Here,
Remark 1 (Interpretation and comparison): In Theorem 1, the required sample size for achieving exact recovery contains two terms: sg log(d/sg) and s log(esgb). Intuitively speaking, sg log(d/sg) corresponds to the complexity of identifying sg non-zero groups and s log(esgb) corresponds to the complexity of estimating s non-zero elements of β in sg known groups.
When β* is only element-wise or group-wise sparse, one can apply respectively the classic ℓ1 or ℓ1,2 minimization to recover β*,
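\[ \hat\beta = \arg\min_{\beta} \|\beta\|_1 \quad \text{subject to } y = X\beta, \tag{11} \]
\[ \hat\beta = \arg\min_{\beta} \|\beta\|_{1,2} \quad \text{subject to } y = X\beta. \tag{12} \]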
The ℓ1 minimization and ℓ1,2 minimization here are special forms of the regular Lasso and group Lasso, respectively (obtained by letting λ, λg → 0+ in (6) and (7)). In particular, if the group sizes satisfy b1 ≍ ⋯ ≍ bd ≍ b, then to ensure exact recovery in the noiseless setting with high probability, (11) requires n ≳ s log(ebd/s) [46] and group Lasso requires n ≳ sg(b + log(ed/sg)). The ℓ1 + ℓ1,2 minimization (5) has provable advantages over both regular and group Lasso when b ≫ log(d) ≫ log(esgb) and sgb/ log(esgb) ≫ s ≫ sg. In particular, when sg = s, the double sparse regression reduces to the vanilla sparse linear regression, and the upper bound (10) matches the classic upper bound for ℓ1 minimization [44].
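To get a feel for these three rates, the following sketch evaluates them with all unspecified constants set to one, which is an assumption made purely for illustration (the true constants are not given here, so only the orders of magnitude are meaningful). The function names are hypothetical.

```python
import math

def n_lasso(s, d, b):
    """Order of the sample size for l1 minimization: s * log(e*b*d/s)."""
    return s * math.log(math.e * b * d / s)

def n_group_lasso(sg, d, b):
    """Order of the sample size for l_{1,2} minimization: sg * (b + log(e*d/sg))."""
    return sg * (b + math.log(math.e * d / sg))

def n_sparse_group_lasso(s, sg, d, b):
    """Order of the sample size for l1 + l_{1,2}: sg*log(d/sg) + s*log(e*sg*b)."""
    return sg * math.log(d / sg) + s * math.log(math.e * sg * b)

# A regime with many groups, moderately large groups, and few active entries.
d, b, sg, s = 10**12, 100, 2, 10
rates = {
    "lasso": n_lasso(s, d, b),
    "group_lasso": n_group_lasso(sg, d, b),
    "sparse_group_lasso": n_sparse_group_lasso(s, sg, d, b),
}
smallest = min(rates, key=rates.get)
```

In this regime the ℓ1 + ℓ1,2 rate is the smallest of the three, in line with the comparison above.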
In addition, Condition (9) is an important technical condition we used in our theoretical analysis.
Next, we consider the sample complexity lower bound. Suppose b1 = b2 = ⋯ = bd and d ≥ 2sg. Recall that one observes y = Xβ* without noise and aims to estimate the (s, sg)-sparse vector β* based on y and X. As indicated by classic results in compressed sensing [47], with sufficient computing power, the ℓ0 minimization below achieves exact recovery of β*
\[ \hat\beta = \arg\min_{\beta} \|\beta\|_0 \quad \text{subject to } y = X\beta, \tag{13} \]
as long as X is non-degenerate and n ≥ 2s. This bound is actually sharp: when n < 2s, for any set T ⊆ {1, …, db} with cardinality 2s, one can find a non-zero vector γ such that supp(γ) ⊆ T and Xγ = 0. By choosing an appropriate T, we can split the support of γ to obtain two (s, sg)-sparse vectors β1, β2 satisfying β1 + β2 = γ. Then Xβ1 = X(−β2), and there is no way to distinguish β1 from −β2 based merely on X and y = Xβ1 = X(−β2).
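The non-identifiability argument for n < 2s can be checked numerically. The sketch below assumes a Gaussian design and two disjoint index sets T1 and T2, each of size s and each contained in sg groups; it takes a null vector of the columns of X indexed by T1 ∪ T2 from the singular value decomposition and splits it across the two sets. The function name `indistinguishable_pair` and the specific dimensions are hypothetical.

```python
import numpy as np

def indistinguishable_pair(X, T1, T2):
    """Return beta1 (on T1) and beta2 (on T2) with X @ (beta1 + beta2) = 0."""
    T = np.concatenate([T1, T2])
    n, p = X.shape
    if n >= len(T):
        raise ValueError("need n < |T1| + |T2| to guarantee a null vector")
    # X[:, T] has more columns than rows, so its last right singular vector
    # (from the full SVD) lies in its null space.
    _, _, Vt = np.linalg.svd(X[:, T])
    gamma_T = Vt[-1]
    beta1 = np.zeros(p)
    beta2 = np.zeros(p)
    beta1[T1] = gamma_T[: len(T1)]
    beta2[T2] = gamma_T[len(T1):]
    return beta1, beta2

# d = 6 groups of size b = 5, s = 4, sg = 2, and n = 7 < 2s observations.
rng = np.random.default_rng(0)
X = rng.standard_normal((7, 30))
T1 = np.array([0, 1, 5, 6])      # two entries in group 0, two in group 1
T2 = np.array([10, 11, 15, 16])  # two entries in group 2, two in group 3
beta1, beta2 = indistinguishable_pair(X, T1, T2)
y1, y2 = X @ beta1, X @ (-beta2)
```

Both `beta1` and `-beta2` are (s, sg)-sparse and produce the same noiseless observations up to rounding error, although they have disjoint supports.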
However, the ℓ0 minimization (13) is computationally infeasible in practice, and more practical methods require a larger sample size. The following theorem shows that by performing the convex ℓ1 regularization, ℓ1,2 regularization, or any weighted combination of them, one requires at least Ω(sg log(d/sg) + s log(esgb/s)) observations to ensure exact recovery of (s, sg)-sparse vectors.
Theorem 2 (Sample complexity lower bound for exact recovery): Suppose b1 = ⋯ = bd = b, d, b ≥ 3. Suppose X is an n-by-(db) matrix. If every (2s, 2sg)-sparse vector is a minimizer of the following program for some (λ, λg) ∈ {(λ, λg) : λ, λg ≥ 0, λ + λg > 0}:
In other words, if the ℓ1 + ℓ1,2 minimization exactly recovers all (2s, 2sg)-sparse vectors β, then we must have n ≳ sg log(d/sg) + s log(esgb/s).
The following sample complexity lower bound shows that for arbitrary methods, to ensure stable estimation of all approximately sparse vectors, one requires at least Ω(sg log(d/sg) + s log(esgb/s)) observations.
Theorem 3 (Sample complexity lower bound for stable estimation): Suppose b1 = ⋯ = bd = b, b, d ≥ 3. Assume there exists a matrix , a map (Δ may depend on X), and a constant C > 0 satisfying
| (14) |
for all and some s, sg satisfying d ≥ sg, sgb ≥ s ≥ sg. Then there exist constants C0 and c0 that depend only on C such that whenever sg ≥ C0, we must have
Remark 2 (Optimality and comparison with previous results): Theorems 2 and 3 show that the sample complexity upper bound in Theorem 1 is rate-optimal under a weak condition: log(esgb) ≍ log(esgb) − log(s) or log(d) ≥ 2s log(s)/sg. Oymak et al. [23] provided a general analysis for convex regularization of simultaneously structured parameter estimation. Specifically, for the high-dimensional double sparse regression, a direct application of their Theorem 3.2 and Corollary 3.1 implies that if ℓ1 + ℓ1,2 minimization can exactly recover an (s, sg)-sparse vector β* with constant probability, one must have n ≳ s. Theorem 2 thus provides a sharper lower bound on sample complexity.
In addition, by setting sg = s, the lower bound in Theorems 2 and 3 reduces to n ≳ s log(p/s), which matches the optimal sample complexity lower bound for exact recovery of s-sparse vectors [46, Theorem 10.11, Proposition 10.7]. By setting s = sgb, we obtain a sample complexity lower bound n ≳ sg(b + log(d/sg)) for (approximate) sg-group-wise sparse vector recovery and stable estimation. To the best of our knowledge, this is the first sample complexity lower bound for group Lasso.
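The two reductions can be verified by direct substitution into the lower bound sg log(d/sg) + s log(esgb/s). The short calculation below is a sketch that assumes equal group sizes, so that p = db, and uses the natural logarithm.

```latex
\begin{aligned}
s_g = s:\quad
  & s_g\log\frac{d}{s_g} + s\log\frac{e s_g b}{s}
    = s\log\frac{d}{s} + s\log(eb)
    = s\log\frac{e\,db}{s}
    = s\log\frac{ep}{s}, \\
s = s_g b:\quad
  & s_g\log\frac{d}{s_g} + s\log\frac{e s_g b}{s}
    = s_g\log\frac{d}{s_g} + s_g b\log e
    = s_g\Big(b + \log\frac{d}{s_g}\Big).
\end{aligned}
```

In the first case, s log(ep/s) = s(1 + log(p/s)) is of the same order as s log(p/s) whenever p/s is bounded away from 1.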
C. Proof Sketches
We briefly discuss the proof sketches of the main technical results in this section. The detailed proofs are postponed to Section VI.
The proof of Theorem 1 is based on a novel dual certificate scheme. The dual certificate [48] has been used in the theoretical analysis for various convex optimization methods in high-dimensional problems, such as matrix completion [49], [50], compressed sensing [44], robust PCA [51], tensor completion [52], etc. The high-dimensional double sparse linear regression exhibits different aspects from these previous works due to the simultaneous sparsity structure. In particular, we can show that if the uet defined below is in the row space of X, it can be used as an exact dual certificate for recovery of (s, sg)-sparse vector β*:
| (15) |
Here, T and G are the element-wise and group-wise supports of β*:
Roughly speaking, uet is the sub-gradient of objective function (5) evaluated at β = β*. If uet is in the row space of X, the sub-gradient will be perpendicular to the feasible set of (5), which implies that β* is the unique minimizer of ℓ1 + ℓ1,2 minimization (5).
For more general vector β* that does not necessarily have a sparse support T or G, we consider the following (s, sg)-sparse approximation:
| (16) |
Let T = supp(βap) and G = {j : (βap)(j) ≠ 0} be the element-wise and group-wise supports of βap. Define
| (17) |
Here is the subvector β* restricted on the j-th group with all entries in Tc set to zero. Similarly to the exactly sparse case, if ũ0 is in the row space of X and the true β* is approximately (s, sg)-sparse, the minimizer of (5) will be close to β*.
However, it is often difficult to find an exact dual certificate that lies in the row space of X and satisfies stringent conditions in (15) or (17). We instead propose to analyze via the approximate dual certificate defined as (18) in the following lemma.
Lemma 1 (Approximate dual certificate for sparse group Lasso): Suppose T, G are element-wise and group-wise support defined in (16). ũ0 is defined in (17). Assume X satisfies . If there exists in the row span of X satisfying
| (18) |
then the conclusion (10) of Theorem 1 holds with probability at least 1 − 2e^(−cn). Here, H1/2(·) is the soft-thresholding operator defined at the beginning of Section II.
If we additionally assume β* is (s, sg)-sparse, then β* is the unique solution to the ℓ1 + ℓ1,2 minimization (5) with probability at least 1 − 2e^(−cn).
Lemma 1 shows that the conclusion of Theorem 1 holds if there exists an approximate dual certificate u satisfying the condition (18). The following lemma shows that, under the assumptions in Theorem 1, one can find such an approximate dual certificate with high probability.
Lemma 2: Suppose X has group structure (2) and satisfies Assumption 1. Recall is the least eigenvalue of . Then and (18) holds with probability at least 1 − Ce−cn/s, where T is defined in (16).
Another key technical tool in the proof of Theorem 1 is the following lemma, which shows that X satisfies the restricted isometry property for all simultaneously element-wise and group-wise sparse vectors with high probability when there are enough samples.
Lemma 3: If n ≥ C(sg log(d/sg) + s log(esgb)),
| (19) |
with probability at least 1 − 2e^(−cn).
Next we briefly discuss the proof of Theorem 2. Consider the quotient space and define an associated norm as ‖[γ]‖ = infv∈ker(X) {λ‖γ − v‖1 + λg‖γ − v‖1,2}. We show that there exist N different (s, sg)-sparse vectors β(1), …, β(N) such that log(N) ≍ s log(esgb/s) + sg log(d/sg) and ‖[β(i)]‖ = 1, ‖[β(i)] − [β(j)]‖ ≥ 2/9 for all 1 ≤ i ≠ j ≤ N. By a property of the packing number and the fact that , we must have N ≤ 10^n. Thus n ≳ log(N) ≍ s log(esgb/s) + sg log(d/sg).
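The constant 10 in this packing bound is what the standard volumetric argument gives for separation 2/9. The derivation below is a sketch under the assumption that the quotient space has dimension m = rank(X) ≤ n; the symbols B (the unit ball of the quotient norm), ε, and m are introduced here only for illustration.

```latex
\begin{aligned}
&\text{With } \varepsilon = \tfrac{2}{9}, \text{ the open balls } [\beta^{(i)}] + \tfrac{\varepsilon}{2}B,\ i = 1,\dots,N,
 \text{ are pairwise disjoint and contained in } \big(1+\tfrac{\varepsilon}{2}\big)B, \text{ so} \\
&N\Big(\frac{\varepsilon}{2}\Big)^{m}\operatorname{vol}(B)
 \;\le\; \Big(1+\frac{\varepsilon}{2}\Big)^{m}\operatorname{vol}(B)
 \quad\Longrightarrow\quad
 N \;\le\; \Big(1+\frac{2}{\varepsilon}\Big)^{m} = 10^{m} \le 10^{n}.
\end{aligned}
```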
We prove Theorem 3 by contradiction. Assume that
| (20) |
for a sufficiently small constant c0. Let and be the unit ball associated with ‖·‖. Define
We have by the assumption of this theorem. We can also show that there exists a uniform constant c > 0 such that
The previous two inequalities and (20) together imply that
This contradiction shows that
III. Sparse Group Lasso in Noisy Case
We now turn to the noisy case.
A. Optimal Rate of Estimation Error of Sparse Group Lasso
When observations are noisy, we have the following theoretical guarantee for the sparse group Lasso.
Theorem 4 (Upper bound of estimation error): Suppose y = Xβ* + ε, X satisfies Assumption 1, n ≥ C (sg log(d/sg) + s log(esgb)) for some uniform constant C > 0, , and b = max1≤i≤d bi. Then the sparse group Lasso estimator (4) with
and
satisfies
with probability at least
Here,
Especially, if β* is exactly (s, sg)-sparse and
holds, then
| (21) |
with probability at least
In addition, we focus on the following class of simultaneously element-wise and group-wise sparse vectors,
The following minimax lower bound of estimation error holds.
Theorem 5 (Lower bound of estimation error): Suppose X satisfies Assumption 1, b1 = ⋯ = bd = b, and d, b ≥ 3. Then we have
Remark 3: Theorems 4 and 5 together show that the sparse group Lasso yields the minimax optimal rate of convergence as long as the following condition holds: log(esgb) ≍ log(esgb) − log(s) or log(d) ≳ s log(s)/sg.
Remark 4: We briefly discuss the main proof ideas of Theorem 5 here. First, we randomly generate a series of subsets Ω(i) ⊆ {1, …, p} as feasible supports of (s, sg)-sparse vectors. Then, we prove by a probabilistic argument that there exist N ≳ (sg log(d/sg) + s log(esgb/s)) subsets such that |Ω(i) ∩ Ω(j)| < 8sg⌊s/sg⌋/9 for any i < j. Next, we construct a series of candidate (s, sg)-sparse vectors β(i) such that . Intuitively speaking, are non-distinguishable based only on observations (y, X) by such a construction. Theorem 5 then follows by choosing an appropriate τ and the generalized Fano’s lemma.
B. Statistical Inference via Debiased Sparse Group Lasso
We further consider the statistical inference for β* under the double sparse linear regression model. First, let be the sparse group Lasso estimator given by (4). Inspired by the recent advances in inference for high-dimensional linear regression [30], [53], [31], [33], we propose the following debiased sparse group Lasso estimator,
| (22) |
Here, is the sample covariance matrix and is an approximation of the inverse covariance matrix Σ−1, where is the solution to the following convex optimization,
| (23) |
Here, Hα is the soft-thresholding operator with thresholding level α defined at the beginning of Section II and ei is the i-th vector in the canonical basis of . The following theorem establishes an asymptotic result for debiased sparse group Lasso.
Theorem 6 (Asymptotic distribution of debiased sparse group Lasso): Suppose is (s, sg)-sparse, satisfies Assumption 1, and . Set and in (4), , in (23). Then with probability at least , the debiased sparse group Lasso estimator can be decomposed as , where
| (24) |
In particular, if , for any 1 ≤ i ≤ p,
| (25) |
Remark 5: (25) provides a method to construct confidence intervals for β*. Specifically, if σ̂ is a consistent estimator of σ, such as the scaled sparse group Lasso to be discussed in Section V,
would be an asymptotic (1 − α)-confidence interval for β*i. We can see that the debiased sparse group Lasso estimator has a provable advantage in sample complexity (n ≫ (s log(esgb) + sg log(ed/sg))^2) over the debiased Lasso (n ≫ s log p, see [30], [31], [33]) and the debiased group Lasso (n ≫ (sgb + sg log p)^2, see [32]) for constructing asymptotic confidence intervals of β*.
IV. Simulation Studies
In this section, we investigate the numerical performance of the sparse group Lasso and ℓ1 +ℓ1,2 minimization for double sparse regression. The results support our theoretical findings in Sections II and III. We first discuss the practical choice for the tuning parameters used in the proposed algorithms.
A. Practical Selection of Tuning Parameters
By introducing τ as a surrogate for (λg/λ)^2, we can rewrite the ℓ1 + ℓ1,2 minimization and the sparse group Lasso as
| (26) |
| (27) |
As suggested by Theorems 1 and 4, the theoretical choice of the tuning parameters (λ, τ) relies on σ, s, and sg in sparse group Lasso and ℓ1 + ℓ1,2 minimization for double sparse regression. These values, however, are usually unknown in practice. In addition, those theoretical values of tuning parameters may not achieve the best finite-sample numerical performance. We thus introduce in this section a data-driven approach to tuning parameter selection using K-fold cross-validation.
We first discuss how to select τ in the ℓ1 + ℓ1,2 minimization (26). Recall n is the sample size, p is the total number of covariates, d is the number of groups, b1, …, bd are the number of covariates in each group, and b = maxj bj. Since the theoretical value τ = s/sg and s/sg must satisfy 1 ≤ s/sg ≤ b, for a given integer L ≥ 1, we introduce a grid
| (28) |
as a set of candidate values for τ. Here, the grid size L can be set to a typical value of 10, or a larger value if more computing power is available. We split the data into K groups. For 1 ≤ k ≤ K, let Jk ⊂ {1, …, n} be the index set of the kth group and . For each τ ∈ S0, we solve
and calculate the prediction error
Let τ* be the minimizer of the prediction error: . Then, the final estimator is calculated using (26) with τ*.
Then we consider the sparse group Lasso (27), which includes two tuning parameters (τ, λ). We still define S0 in (28) as a grid of candidate values of τ. Following the idea in [11, Section 3.3], for each τ ∈ S0, we begin with a large value of λmax(τ) so that , the outcome of sparse group Lasso (27) with tuning parameters (τ, λmax(τ)), is zero (this can be achieved by the SGL package1). Let λmin(τ) be a small fraction of λmax(τ) (e.g., λmin = 0.1λmax as suggested in [11, Section 5]). Then we define
Next, we split the data into K groups. For 1 ≤ k ≤ K, let Jk ⊂ {1, …, n} be the index set of the kth group and . For each τ ∈ S0, λ ∈ Λ(τ), and k ∈ {1, …, K}, we solve
and calculate the prediction error
Let (τ*, λ*) be the minimizer of the prediction error:
The final estimator is calculated using (27) with (τ*, λ*).
In our simulation studies next, we will examine the performance of this cross-validation scheme with K = L = 10, λmin = 0.1λmax.
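The cross-validation scheme described above can be sketched generically as follows. The code assumes that the prediction error is the squared error on the held-out fold averaged over all observations, that the folds are formed by a random permutation, and that the estimator is supplied as a callable `fit(X, y, *params)` returning a coefficient vector; these choices and the names `kfold_indices` and `cv_select` are hypothetical and stand in for the exact formulas of the scheme.

```python
import numpy as np

def kfold_indices(n, K, seed=0):
    """Split {0, ..., n-1} into K folds of (nearly) equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

def cv_select(X, y, candidates, fit, K=10, seed=0):
    """Pick the tuning-parameter tuple with the smallest K-fold prediction error."""
    n = X.shape[0]
    folds = kfold_indices(n, K, seed)
    errors = []
    for params in candidates:
        total = 0.0
        for J in folds:
            train = np.ones(n, dtype=bool)
            train[J] = False                      # hold out fold J
            beta = fit(X[train], y[train], *params)
            total += np.sum((y[J] - X[J] @ beta) ** 2)
        errors.append(total / n)                  # assumed error criterion
    best = int(np.argmin(errors))
    return candidates[best], errors
```

For the sparse group Lasso, `candidates` would be the pairs (τ, λ) with τ in the grid S0 and λ in Λ(τ), and `fit` would solve (27) on the training folds; the final estimator is then refit on the full data with the selected pair.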
B. Numerical Results
We begin by considering the sample complexity for the exact recovery in the noiseless case. Suppose all group sizes are equal (b1 = ⋯ = bd = b) and the number of observations n varies from 5 to 200. We consider four simulation designs with (1) d = 60, b = 20, sg = 1; (2) d = 100, b = 30, sg = 2; (3) d = b = 20, sg = 1; and (4) d = b = 40, sg = 1. For each setting, we randomly draw X with i.i.d. standard normal entries, construct the fixed vector β* satisfying
and generate . We implement the ℓ1 + ℓ1,2 minimization (5) with (SGL), ℓ1 minimization (11) (Lasso), and ℓ1,2 minimization (12) (Group Lasso), and ℓ1 + ℓ1,2 minimization (5) with the tuning parameter λg/λ selected using cross validation discussed in Section IV-A (SGL_CV). An exact recovery of β* is considered to be successful if . The successful recovery rate based on 100 replicates is shown in Figure 1. It can be seen that SGL and SGL_CV have comparable performance and both methods have significantly better performance than Lasso and Group Lasso. This is in line with our theoretical results.
Fig. 1. Exact recovery rate in the noiseless case.
Then we consider the noisy case and focus on average estimation errors of different methods. We generate
where X, β* are drawn in the same way as the previous setting and . We consider four designs: i. d = 60, b = 20, sg = 1; ii. d = 100, b = 30, sg = 2; iii. d = b = 20, sg = 1; and iv. d = b = 40, sg = 2. For each case, the number of observations n is chosen from an equally spaced sequence from 5 to 200 and the simulation is replicated 500 times. We compare the average estimation error of (a) SGL_CV1: sparse group Lasso with theoretical value and λ selected via cross validation; (b) SGL_package: sparse group Lasso via the SGL package in R with the option of automatic tuning parameter selection; (c) Lasso: regular Lasso with tuning parameter selected via cross validation; (d) group Lasso: group Lasso with tuning parameter selected via cross validation; (e) SGL_CV2: sparse group Lasso with both λ and λg selected using the proposed cross validation scheme. We can see that the proposed method SGL_CV2 achieves smaller estimation error than all other methods, including SGL_CV1, the focus of our theory. These experimental results support our theory and the applicability of the proposed cross-validation scheme.
V. Discussions
In this paper, we study the high-dimensional double sparse regression and investigate the theoretical properties of the sparse group Lasso and ℓ1 + ℓ1,2 minimization. In particular, we develop matching upper and lower bounds on the sample complexity for ℓ1 + ℓ1,2 minimization in the noiseless case. We also prove that the sparse group Lasso achieves the minimax optimal rate of convergence in a range of settings in the noisy case. Our results give an affirmative answer to the open question for high-dimensional statistical inference for simultaneously structured models: by introducing both ℓ1 and ℓ1,2 penalties, one can achieve better performance on estimation and statistical inference for simultaneously element-wise and group-wise sparse vectors.
In addition to β*, the estimation and inference for the noise level σ is another important task in high-dimensional double sparse regression. Motivated by the recent development of scaled Lasso [54], one may consider the following scaled sparse group Lasso estimator:
where . and are tuning parameters that do not rely on σ. The consistency of can be established based on similar ideas of scaled Lasso in the literature [54], [31] and the approximate dual certificate in this work.
Moreover, our technical results can be useful in a variety of other problems with simultaneous sparsity structures. For example, [55], [56] considered the estimation of piece-wise constant sparse signals, i.e., both the signal vector and the difference between successive entries of the signal vector are sparse. [57], [58] discussed the estimation of structured parameters where both the number of non-zero elements and the number of distinct values of the parameter vectors are small. [59] considered the estimation of matrices with simultaneous sparsity structures within each block and among different blocks. It is interesting to further study the statistical limits, including the sample complexity and minimax optimal rate of convergence for these problems. In particular, based on the specific sparsity structures of each problem, we can introduce corresponding multi-objective regularizers and the convex regularization methods. The corresponding approximate dual certificates can be proposed, constructed, and analyzed to provide strong theoretical guarantees.
VI. Proofs
We collect the proofs of technical results in this section.
A. Proof of Lemma 1
Let T satisfy (16). For convenience, we denote s0 = s/sg and decompose u as
| (29) |
Note that |H1/2(x) − x| ≤ 1/2 for any . Based on the property of (18), ‖u(G)\T‖∞ ≤ 1/2, then
| (30) |
| (31) |
Suppose is the minimizer to (5), , then based on the sub-differential of ‖β‖1 and ‖β‖1,2, we have
| (32) |
The last inequality comes from and .
In particular, given Xh = 0 and that u lies in the row span of X, we have v⊤h + w⊤h = u⊤h = 0. Therefore,
| (33) |
Next note that , we must have , then
| (34) |
Combining (30), (33), and (34), one obtains
Plugging this inequality into (32), we finally have
Since is the minimizer to (5), we must have , then
| (35) |
If β* is (s, sg)-sparse, immediately we have . Then . By , we know is non-singular, then hT = 0.
Now, we consider the general case. Without loss of generality, suppose G = {1, …, g}, where g ≤ sg. Denote T1 as the indices of the s largest entries of h(G)\T, T2 as the indices of the s largest entries of , and so on. For sg + 1 ≤ i ≤ d, denote Si,1 as the indices of the ⌊s/sg⌋ largest entries of h(i), Si,2 as the indices of the ⌊s/sg⌋ largest entries of , and so on. Let be an arrangement of Si,j(1 ≤ j ≤ ⌈bi/⌊s/sg⌋⌉, g + 1 ≤ i ≤ d) such that . Let , , and so on. Then (T1, T2, …, R1, R2,…) is a partition of Tc, and |Ti|, |Rj| ≤ s, |g(Ti)|, |g(Rj)| ≤ sg, where g(S) = {i1, …, ik} if and S ∩ (ij) are not empty for all 1 ≤ j ≤ k. Let T = T ∪ T1 ∪ R1. If (19) holds, then
| (36) |
Since Xh = 0, we have
| (37) |
Now, we consider . By triangle inequality,
The triangle inequality shows that
Combining the parallelogram identity with (19), we have
Thus,
| (38) |
By (3.10) in [37], we have
| (39) |
and
For all g +1 ≤ i ≤ d, apply (3.10) in [37] again,
Moreover, by the definition of Si,1,
Therefore,
| (40) |
Combining (38), (39) and (40), if (19) holds, we have
Similarly, if (19) holds, then
and
Thus, with probability at least 1 − 2 exp(−cn),
| (41) |
The last inequality holds due to the Cauchy-Schwarz inequality. Combining (36), (37), (41) and Lemma 3, we know that with probability at least 1 − 2 exp(−cn),
i.e., with probability at least 1 − 2 exp(−cn),
Finally, by (35), (39), (40) and the previous inequality, with probability at least 1 − 2 exp(−cn),
In summary, we have finished the proof of this lemma. □
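The block decomposition used in the general case above (T1 collects the s largest entries, T2 the next s largest, and so on) amounts to sorting the candidate indices by magnitude and cutting the sorted list into consecutive blocks. The following sketch illustrates this step only; the names magnitude_blocks and tail_bound_holds and the toy vector are hypothetical. It also checks a standard consequence of the ordering, namely that the ℓ2 norm of each block is at most the ℓ1 norm of the preceding full block divided by √s, which is the type of bound the proof relies on.

```python
import math


def magnitude_blocks(h, candidates, block_size):
    """Order `candidates` by decreasing |h[i]| and cut into blocks of `block_size`."""
    if block_size < 1:
        raise ValueError("block_size must be a positive integer")
    ordered = sorted(candidates, key=lambda i: -abs(h[i]))
    return [ordered[k:k + block_size] for k in range(0, len(ordered), block_size)]


def tail_bound_holds(h, blocks, block_size, tol=1e-12):
    """Check ||h_next||_2 <= ||h_current||_1 / sqrt(block_size) for consecutive blocks.

    Every block except possibly the last is full, so each entry of the next
    block is bounded by the average magnitude of the current block.
    """
    for current, nxt in zip(blocks, blocks[1:]):
        l1_current = sum(abs(h[i]) for i in current)
        l2_next = math.sqrt(sum(h[i] ** 2 for i in nxt))
        if l2_next > l1_current / math.sqrt(block_size) + tol:
            return False
    return True


# Toy vector; in the proof the candidates would be the indices in (G) \ T.
h = [0.1, -3.0, 0.5, 2.0, -0.2, 1.5, 0.0]
blocks = magnitude_blocks(h, range(len(h)), 3)
print(blocks)
print(tail_bound_holds(h, blocks, 3))
```

The same routine, applied within each group outside G with block size ⌊s/sg⌋, gives the sets Si,1, Si,2, … described above.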
B. Proof of Lemma 2
Let T satisfy (16). Given , without loss of generality we assume that
We also denote T(j) as the support of . First by Lemma 6 Part 3 with
and noticing that x log(eu/x) ≥ log(eu) for all 1 ≤ x ≤ u, we have
| (42) |
provided that n ≥ C (s log(esgb/s) + sg log(d/sg)) for some large constant C > 0. Note that the fourth inequality comes from the facts that ‖ΣT,T‖ ≤ ‖Σ‖ ≤ Cmax and . By Lemma 7 Part 1, we also know
Next, we apply the well-regarded golfing scheme [50], [44] to find an approximate dual certificate u that satisfies (18). Let such that
| (43) |
Immediately we have . We divide the n rows of X into non-overlapping batches, say , , …, with |Il| = nl. Here, n1, n2, … will be specified later. Consider the following sequences
| (44) |
Finally the approximate dual certificate is defined as
| (45) |
From the inductive definition we can see
Next, we apply the random matrix results (Lemmas 7 and 6) and obtain the following tail probabilities.
- If nl ≥ C stl for a large constant C > 0 and tl ≥ C, by Part 1 of Lemma 7,
| (46) |
- Suppose is independent of . If for , by Lemma 7 Part 2,
| (47) |
The third inequality comes from .
- Suppose is fixed. If , by Lemma 6 Part 2,
| (48) |
Then we specify {nl, tl, δl, θl}l ≥ 1 as follows,
n1 = n2 ≥ C(s log(esgb) + sg log(d/sg)), t1 = t2 = cn1/(s log(es)) ≥ C, , ;
, , , , with lmax = ⌈C log(es)⌉ + 2.
We can see the following events happen
| (49) |
| (50) |
| (51) |
with probability at least . By triangle inequality, u0 satisfies
| (52) |
When and (49)–(52) hold, we have
| (53) |
For large constant . Notice that
we know that
In addition,
Since
where
Thus, the construction of u satisfies all required conditions in Lemma 1 with probability at least . This completes the proof of this lemma. □
C. Proof of Lemma 3
Let g(S) be the group support of set S, that is, g(S) = {i1, …, ik} if and S ∩ (ij) are not empty for all 1 ≤ j ≤ k. Lemma 7 Part 1 and the union bound show that
□
D. Proof of Theorem 2
If d ≥ 3sg and b ≥ 3s/sg, by (80), we can find Ω(1), …, Ω(N) ⊂ {1, …, db} such that |Ω(i)| = sg⌊s/sg⌋, for all 1 ≤ i ≤ N, 1 ≤ k ≤ d, and
| (54) |
| (55) |
where . For any 1 ≤ j ≤ db, 1 ≤ i ≤ N, define
then ∥β(i)∥0 ≤ s, ∥β(i)∥0,2 ≤ sg. We consider the quotient space
Then the dimension of is rank(X) ≤ n. Define the norm ∥[x]∥ = infv∈ker(X){λ∥x − v∥1 + λg ∥x − v∥1,2}. For any vector satisfying ∥x∥0 ≤ 2s, ∥x∥0,2 ≤ 2sg, note that x − v with v ∈ ker(X) satisfies X(x − v) = Xx, by our assumption, we have ∥[x]∥ = λ∥x∥1 +λg∥x∥1,2. Thus ∥[β(1)]∥ = ⋯ = ∥[β(N)]∥ = 1. Moreover, by (54) and (55),
and
where
Since β(i) − β(j) is (2s, 2sg)-sparse,
By [46, Proposition C.3], we have N ≤ 10^{rank(X)} ≤ 10^n. Therefore we have
which means that n ≥ c(sg log(d/sg) + s log(esgb/s)).
If d < 3sg or b < 3s/sg, let , , then and . Since all (2s, 2sg)-sparse vectors can be exactly recovered by the ℓ1 + ℓ1,2 minimization and s′ ≤ s, , we know that the ℓ1 + ℓ1,2 minimization exactly recovers all (2s′, )-sparse vectors. Therefore, we have
| (56) |
□
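The sparsity measures ∥·∥0 and ∥·∥0,2 and the penalty λ∥x∥1 + λg∥x∥1,2 appearing in the proof above can be computed directly once the group structure is given. The sketch below is only an illustration: the function name group_norms, the dictionary keys, and the example vector are hypothetical, and the ℓ1,2 norm is assumed to be the unweighted sum of the group ℓ2 norms.

```python
import math


def group_norms(x, groups, lam=1.0, lam_g=1.0):
    """Return element-wise and group-wise norms of x (hypothetical helper).

    `groups` is a list of disjoint index lists. Assumption: ||x||_{1,2} is the
    unweighted sum of the l_2 norms of the groups.
    """
    group_l2 = [math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups]
    l0 = sum(1 for v in x if v != 0)            # number of non-zero entries
    l0_2 = sum(1 for r in group_l2 if r > 0)    # number of non-zero groups
    l1 = sum(abs(v) for v in x)
    l1_2 = sum(group_l2)
    return {
        "l0": l0,
        "l0_2": l0_2,
        "l1": l1,
        "l1_2": l1_2,
        "penalty": lam * l1 + lam_g * l1_2,
    }


# Three groups of size three; the vector is (3, 2)-sparse.
x = [3.0, 4.0, 0.0, 0.0, 0.0, 0.0, 0.0, -1.0, 0.0]
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
norms = group_norms(x, groups, lam=0.5, lam_g=2.0)
print(norms)
```

A vector is (s, sg)-sparse in the sense used here when norms["l0"] ≤ s and norms["l0_2"] ≤ sg.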
E. Proof of Theorem 3
We prove Theorem 3 by contradiction. Let
where c′ is a uniform constant such that n ≥ c′(s log(esgb/s) + sg log(d/sg)) if the conditions in Theorem 2 are satisfied. Assume for contradiction that
| (57) |
Let s0 = s/sg, define the norm . Let ,
By [46, Theorem 10.4], we have
| (58) |
If
| (59) |
since
(58) and (59) together imply that
| (60) |
By (57),
| (61) |
In the last inequality, we used √x ≥ log(x)/2 for all x ≥ 1. Combining (60) and (61), we have
which is a contradiction.
Thus, we only need to prove (59) based on (57). We again argue by contradiction. If
then there exists a subspace Ln of with such that for all v ∈ Ln\{0},
Let satisfying ker(B) = Ln. Let , , by (57) and (61),
which means that
Moreover, we have . For any (2s′, )-sparse β with support set T and group support set G, and v ∈ ker(A), by Cauchy-Schwarz inequality,
i.e.,
Based on Cauchy-Schwarz inequality and the sub-differential of ∥β∥1 and ∥β∥1,2, we have
By Theorem 2,
Thus
provided that , which is a contradiction. This means that (59) holds if (57) is true.
Therefore, we have finished the proof of Theorem 3. □
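To get a feel for the sample complexity scale s log(esgb/s) + sg log(d/sg) that appears in the proofs of Theorems 2 and 3, the quantity can be evaluated numerically. The sketch below returns only this scale; the constants c and c′ are not specified by the theory and are not modeled. The function name, the use of a common group size b, and the parameter checks are assumptions made for the illustration.

```python
import math


def sample_complexity_scale(s, s_g, d, b):
    """Evaluate s * log(e * s_g * b / s) + s_g * log(d / s_g) (hypothetical helper).

    s   : number of non-zero entries
    s_g : number of non-zero groups
    d   : total number of groups
    b   : common group size (assumed equal across groups)
    """
    if not (1 <= s_g <= d):
        raise ValueError("need 1 <= s_g <= d")
    if not (s_g <= s <= s_g * b):
        raise ValueError("need s_g <= s <= s_g * b")
    return s * math.log(math.e * s_g * b / s) + s_g * math.log(d / s_g)


# Illustrative values only: 100 groups of size 10, 5 active groups, 20 non-zeros.
scale = sample_complexity_scale(s=20, s_g=5, d=100, b=10)
print(round(scale, 2))
```

The first term accounts for locating the non-zero entries within the active groups and the second for locating the active groups among all d groups.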
F. Proof of Theorem 4
Let , . By (86) in Lemma 5 and (101), one has
| (62) |
By the definition of and KKT condition, we have
where
Therefore,
(62), Lemma 8 Part 1 and the previous inequality together imply that
| (63) |
where . By the definition of , we have
(32) and the previous inequality show that
| (64) |
First, we consider 〈Xh, ε〉. Denote , since and (In − P)XT = 0,
| (65) |
Therefore, to give an upper bound of |〈Xh, ε〉|, we only need to bound and , respectively. By Part 1 of Lemma 7 and also notice that cmin ≤ σmin (Σ) ≤ σmax (Σ) ≤ Cmax,
| (66) |
(66), Lemma 9 and Cauchy-Schwarz inequality together imply that with probability at least ,
Combine Lemma 8 Part 2, (63) and the previous two inequalities together, with probability at least ,
| (67) |
Similarly to the proof of (62), also notice that ∥(In − P)ε∥2 ≤ ∥ε∥2 and is independent of In − P, we have
By Lemma 8 Part 2 and (62), with probability at least ,
Notice that and In − P are independent and |Tc\(Gc)| ≤ |G| ≤ sgb, by Lemma 9, with probability at least ,
Combine the previous two inequalities together, we have
| (68) |
with probability . Combine (65), (67) and (68) together, we know that with probability at least ,
| (69) |
Moreover, by the proof of Theorem 1, with probability at least 1 − C exp(−cn/s), there exists an approximate dual certificate in the row span of X satisfying (18), and , where v is defined in (29). Similarly to (33), we have
By Lemma 10, with probability at least 1 − C exp(−cn/s), u = X⊤ w with . Therefore, with probability at least 1 − C exp(−cn/s),
The two previous inequalities together imply that
| (70) |
with probability at least 1 − C exp(−cn/s).
Combine (64), (69) and (70) together, with probability at least ,
| (71) |
By (42), (63) and (66), with probability at least ,
| (72) |
The fourth inequality comes from for ; the fifth inequality holds since . (71) and (72) together imply that
with probability at least . Also notice that
with probability at least ,
| (73) |
From the proof of Lemma 1, we know that (36) and (41) hold with probability at least 1 − 2 exp(−cn). By Lemma 8 Part 2 and (63), with probability at least ,
| (74) |
The second inequality is due to , and Cauchy-Schwarz inequality.
Combine (36), (41), (73) and (74) together, with probability at least , we have
Therefore, with probability at least ,
| (75) |
By (39), (40), (73) and the previous inequality, also notice that , with probability at least ,
| (76) |
i.e., with probability at least ,
Moreover, if β* is (s, sg)-sparse, then . Therefore, with probability at least ,
□
G. Proof of Theorem 5
First, we consider the case that d ≥ 3sg and b ≥ 3s/sg. Let ω(1), …, ω(N) be vectors drawn uniformly at random from
Denote , and β(i) = τω(i), for all 1 ≤ i ≤ N, 1 ≤ k ≤ d, where τ is a parameter that will be specified later. Obviously, ∥β(i)∥0 = sg⌊s/sg⌋ ≤ s, therefore .
Moreover, if |Ω(i) ∩ Ω(j)| ≥ 8sg⌊s/sg⌋/9, then we must have
otherwise , which is a contradiction.
Therefore,
| (77) |
Note that
The inequality holds since , for all 1 ≤ i ≤ sg. Similarly, for 1 ≤ t ≤ ⌊s/sg⌋,
Combine (77) and the previous two inequalities together, we have
| (78) |
Set , then
i.e., the probability that β(1), …, β(N); Ω(1), ⋯, Ω(N) satisfy
| (79) |
| (80) |
is positive. For convenience, we fix β(1), …, β(N) to be the vectors satisfying (79).
Denote y(i) = Xβ(i) + ε for all 1 ≤ i ≤ N. We consider the Kullback-Leibler divergence between different distribution pairs:
where p(y(i), X) is the probability density of (y(i), X). Conditioning on X, we have
Thus for 1 ≤ i ≠ j ≤ N,
| (81) |
In the first inequality, we used . By the generalized Fano’s lemma,
Since , by setting , we have
If d < 3sg or b < 3s/sg, let , , then and . Similarly to (56), we have
□
H. Proof of Theorem 6
The proof of Theorem 6 relies on the following key lemma, which shows that Σ−1 is in the feasible set of the optimization problem (23) with high probability by choosing appropriate α and γ.
Lemma 4: By setting , in (23), we have
Note that Y = Xβ* + ε, we have
Since , we know that
Denote . Since β* is (s, sg)-sparse, by (73), (76) and Cauchy-Schwarz inequality, with probability at least ,
In addition, Lemma 4 shows that Σ−1 is in the feasible set of (23) with probability at least . By the definition of ,
| (82) |
Combining these facts, by Lemma 8 Part 2, we must have
with probability at least . This completes the proof of (24).
Next, we consider . By (82) and Lemma 8 Part 2, we have
Therefore, for any c ≥ 0,
Since m = cei/2 achieves the minimum of the right hand side, we have
If for all 1 ≤ i ≤ p, by setting , we have
| (83) |
Moreover, by Lemma 6 Part 2 with u = v = ei, we have
By the union bound,
Therefore, with probability at least 1 – 2 exp (−cn),
(83) and the previous inequality together imply that with probability at least 1 – 2 exp (−cn),
Fig. 2. Average estimation error in the noisy case.
Acknowledgments
The research of Tony Cai was supported in part by NSF grants DMS-1712735 and DMS-2015259 and NIH grants R01-GM129781 and R01-GM123056. The research of Anru R. Zhang and Yuchen Zhou was supported in part by NSF Grants CAREER-2203741 and DMS-1811868 and NIH grant R01-GM131399-01.
Biographies
T. Tony Cai received the Ph.D. degree from Cornell University, Ithaca, NY, in 1996. His research interests include high-dimensional statistics, machine learning, large-scale inference, nonparametric function estimation, functional data analysis, and statistical decision theory. He is the Daniel H. Silberberg Professor of Statistics and Data Science at the Wharton School of the University of Pennsylvania, Philadelphia, PA, USA. Dr. Cai is the recipient of the 2008 COPSS Presidents Award and a fellow of the Institute of Mathematical Statistics. He is a past editor of the Annals of Statistics.
Anru R. Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor in the Department of Biostatistics & Bioinformatics and Associate Professor in the Departments of Computer Science, Mathematics, and Statistical Science at Duke University. He was an assistant professor of statistics at the University of Wisconsin-Madison in 2015–2021. He obtained his bachelor’s degree from Peking University in 2010 and his Ph.D. from the University of Pennsylvania in 2015. His work focuses on high-dimensional statistical inference, non-convex optimization, statistical tensor analysis, computational complexity, and applications in genomics, microbiome, electronic health records, and computational imaging. He received the IMS Tweedie Award (2022), ASA Gottfried E. Noether Junior Award (2021), Bernoulli Society New Researcher Award (2021), ICSA Outstanding Young Researcher Award (2021), and NSF CAREER Award (2020).
Yuchen Zhou is a postdoctoral researcher in the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania. He received the B.E. degree from Peking University in 2016 and the Ph.D. degree in statistics from the University of Wisconsin-Madison in 2021. His research interests include high-dimensional statistical inference, tensor data analysis, reinforcement learning and statistical learning theory.
Appendix
We collect all additional technical lemmas and their proofs in this section.
Lemma 5 (Bernstein-type Inequality for Soft-thresholded Sub-Gaussian Vectors):
Suppose the rows of are independent sub-Gaussian vectors satisfying Assumption 1. is a fixed vector, Ω is a subset of {1, …, p} with |Ω| = r. Then
| (84) |
For any fixed vector and fixed index subset Ω ⊆ {1, …, p} with |Ω| = r,
| (85) |
In particular, for any b ≥ r, if , , we have
| (86) |
Proof of Lemma 5. We only need to focus on the case where ∥w∥2 = 1. Let , immediately we know that W1,Ω, …, Wn,Ω are isotropic sub-Gaussian distributed. Then for any fixed w, w⊤WΩ is also an isotropic sub-Gaussian vector such that for any ,
The last equation holds since (Σ1/2)Ω,·(Σ1/2)·,Ω = (Σ1/2 Σ1/2)Ω,Ω = ΣΩ,Ω.
By the tail inequality of sub-Gaussian quadratic form ([60, Theorem 2.1]),
By taking square-root of the previous inequality, we have
Also note that
we obtain (84).
For the second part of proof, note that
By the first part of this lemma,
Plugging this into the previous inequality, one has
Specifically, if , ,
Therefore, (85) shows that
□
Lemma 6 (sub-Gaussian quadratic form concentrations): Suppose is a sub-Gaussian vector satisfying Assumption 1.
- For any fixed , u, v ≠ 0, u⊤ZZ⊤v is sub-exponential such that for every t > 0,
| (87) |
- In addition, suppose is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1,
| (88) |
- More generally, for any fixed matrix , the following concentration inequality in spectral norm holds,
| (89) |
Proof of Lemma 6. Since we can rescale u and v without essentially changing the problem, without loss of generality we assume ∥u∥2 = ∥v∥2 = 1. Let A = uv⊤, then u⊤ZZ⊤v = Z⊤uv⊤Z = Z⊤AZ. By Assumption 1, and . By Hanson-Wright inequality ([61, Theorem 1.1]),
where
Therefore, for every t ≥ κ2,
Thus, there exists a constant c < log 2, for every t ≥ 0,
Notice that , for all 1 ≤ k ≤ n, by Bernstein-type concentration inequality (cf. [62, Proposition 5.16]),
This completes the proof of (88).
Finally, we consider (89), which can be done by an ε-net argument and the result in (88). For any , ∥w∥2 = 1, set u = Uw in (88), we have
By [62, Lemma 5.3], we can find a -net of with . By the union bound,
| (90) |
For any , g ≠ 0, set , we can find such that . By triangle inequality,
Therefore,
Combining (90) with the previous inequality, and noticing that , we have
| (91) |
Finally, note that
we have proved (89). □
We collect the random matrix properties of X in the following lemma. These properties will be extensively used in the main content of the paper.
Lemma 7: Suppose is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1.
- Suppose T ⊆ {1, …, p} has cardinality s. Then,
| (92) |
- For any fixed vector , δ > 0, and fixed index subset Ω ⊆ Tc satisfying |Ω| = r, ,
| (93) |
Here, Hλ(·) is the soft-thresholding estimator at level λ.
Proof of Lemma 7.
- The first statement is proved via an ε-net argument. Denote , then the rows of WT are independent isotropic sub-Gaussian distributed. For any fixed vector , by [62, Lemma 5.5], are independent sub-Gaussian random variables with and . Therefore, by Remark 5.18 and Lemma 5.14 in [62], . Bernstein-type inequality shows that
By [62, Lemma 5.2], we can find a -net of with . The union bound tells us
| (94) |
By [62, Lemma 5.4],
| (95) |
Since cmin ≤ σmin(Σ) ≤ σmax(Σ) ≤ Cmax, we have and . Therefore,
| (96) |
Combining (94), (95) and (96), we arrive at the conclusion.
- Now we consider the proof of (93). Note that implies that there exists Λ ⊂ Ω such that all entries of are greater than δ, and . Thus,
| (97) |
Since , we know that no matter |Λ| = ⌊(t/δ)²⌋ ∧ r or ⌈(t/δ)²⌉,
By Part 3 of Lemma 6, for any Λ ⊆ Ω, , we have
Combining (97) and the previous inequality, one obtains
□
Lemma 8 (Properties of Soft-thresholding):
- Suppose a, b > 0, , Ha(·) is the soft-thresholding operator satisfying Ha(x) = sgn(x) · (|x| − a)+. Then the following triangle inequality holds,
| (98) |
- Suppose a, b > 0, , if ∥Ha(x)∥∞,2 ≤ b, then
| (99) |
Proof of Lemma 8.
1)
2)
□
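For concreteness, the soft-thresholding operator Ha(x) = sgn(x) · (|x| − a)+ used throughout can be written in a few lines. The sketch below also checks numerically the elementary property |Ha(x) − x| ≤ a that is used with a = 1/2 in the proof of Lemma 1. The function names and the test vector are hypothetical and serve only as an illustration.

```python
import math


def soft_threshold(x, a):
    """Apply H_a entry-wise: sgn(v) * max(|v| - a, 0)."""
    if a < 0:
        raise ValueError("threshold a must be non-negative")
    return [math.copysign(max(abs(v) - a, 0.0), v) for v in x]


def max_deviation(x, a):
    """Largest entry-wise gap |H_a(v) - v|; it never exceeds a."""
    return max(abs(hv - v) for hv, v in zip(soft_threshold(x, a), x))


x = [2.0, -0.3, 0.5, -1.5]
thresholded = soft_threshold(x, 0.5)
print(thresholded)
print(max_deviation(x, 0.5) <= 0.5 + 1e-12)
```

Entries with magnitude at most a are set to zero, and all other entries are moved toward zero by exactly a, which is why the deviation is bounded by a.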
Lemma 9: Suppose is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1. Suppose T ⊆ {1, …, p} has cardinality s, is a projection matrix and independent of XT. Then, for any t ≥ log(es),
Proof of Lemma 9. For fixed vector , since Assumption 2 is satisfied, for i ∈ T, X1i, …, Xni are independent sub-Gaussian distributed such that
By Hoeffding-type inequality,
| (100) |
Moreover, by [63, Lemma 1], for any x ≥ 0,
Setting x = n in the last inequality, we have
| (101) |
Combining (100) and (101) and noticing that ∥Pε∥2 ≤ ∥ε∥2, we have
□
Lemma 10: With probability at least 1 − C exp(−cn/s), the approximate dual certificate defined in (45) can be written as u = X⊤w, where .
Proof of Lemma 10. By (45), we have u = X⊤w, where and . Thus . Also note that
By (53), with probability at least 1 − C exp(−cn/s),
□
Proof of Lemma 4. For any 1 ≤ i ≤ p, 1 ≤ j ≤ d, ,Λ ⊆ (j), |Λ| = k, by Lemma 6 with
we have
| (102) |
By the same method in Lemma 5 Part 2,
Combining (102) and the previous inequality, we have
| (103) |
By (103) and the union bound, we have
□
References
- [1] Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, Tai E-S, Teo Y-Y, and Montana G, “Pathways-driven sparse regression identifies pathways and genes associated with high-density lipoprotein cholesterol in two Asian cohorts,” PLoS Genetics, vol. 9, no. 11, p. e1003939, 2013.
- [2] Vidyasagar M, “Machine learning methods in the computational biology of cancer,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 470, no. 2167, p. 20140081, 2014.
- [3] Allahyar A and De Ridder J, “Feral: network-based classifier with application to breast cancer outcome prediction,” Bioinformatics, vol. 31, no. 12, pp. i311–i319, 2015.
- [4] Rao N, Nowak R, Cox C, and Rogers T, “Classification with the sparse group lasso,” IEEE Transactions on Signal Processing, vol. 64, no. 2, pp. 448–463, 2015.
- [5] Chatterjee S, Steinhaeuser K, Banerjee A, Chatterjee S, and Ganguly A, “Sparse group lasso: Consistency and climate applications,” in Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 2012, pp. 47–58.
- [6] Wang W, Liang Y, and Xing E, “Block regularized lasso for multivariate multi-response linear regression,” in Artificial Intelligence and Statistics, 2013, pp. 608–617.
- [7] Lounici K, Pontil M, Tsybakov A, and Van De Geer S, “Taking advantage of sparsity in multi-task learning,” in COLT 2009-The 22nd Conference on Learning Theory, 2009.
- [8] Lozano AC and Swirszcz G, “Multi-level lasso for sparse multi-task regression,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 595–602.
- [9] Zhou HH, Zhang Y, Ithapu VK, Johnson SC, and Singh V, “When can multi-site datasets be pooled for regression? hypothesis tests, ℓ2-consistency and neuroscience applications,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 4170–4179.
- [10] Friedman J, Hastie T, and Tibshirani R, “A note on the group lasso and a sparse group lasso,” arXiv preprint arXiv:1001.0736, 2010.
- [11] Simon N, Friedman J, Hastie T, and Tibshirani R, “A sparse-group lasso,” Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231–245, 2013.
- [12] Li Y, Nan B, and Zhu J, “Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure,” Biometrics, vol. 71, no. 2, pp. 354–363, 2015.
- [13] Tibshirani R, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
- [14] Bickel PJ, Ritov Y, and Tsybakov AB, “Simultaneous analysis of lasso and Dantzig selector,” The Annals of Statistics, vol. 37, no. 4, pp. 1705–1732, 2009.
- [15] Verzelen N, “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons,” Electronic Journal of Statistics, vol. 6, pp. 38–90, 2012.
- [16] Yuan M and Lin Y, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
- [17] Lounici K, Pontil M, Van De Geer S, and Tsybakov AB, “Oracle inequalities and optimal inference under group sparsity,” The Annals of Statistics, vol. 39, no. 4, pp. 2164–2204, 2011.
- [18] Bunea F, Lederer J, and She Y, “The group square-root lasso: Theoretical properties and fast algorithms,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 1313–1325, 2013.
- [19] Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
- [20] Ma Z, “Sparse principal component analysis and iterative thresholding,” The Annals of Statistics, vol. 41, no. 2, pp. 772–801, 2013.
- [21] Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7311–7338, 2018.
- [22] Wang M and Li L, “Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality,” arXiv preprint arXiv:1811.05076, 2018.
- [23] Oymak S, Jalali A, Fazel M, Eldar YC, and Hassibi B, “Simultaneously structured models with application to sparse and low-rank matrices,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2886–2908, 2015.
- [24] Hao B, Zhang A, and Cheng G, “Sparse and low-rank tensor estimation via cubic sketchings,” arXiv preprint arXiv:1801.09326, 2018.
- [25] Zhang A and Han R, “Optimal sparse singular value decomposition for high-dimensional high-order data,” Journal of the American Statistical Association, 2018.
- [26] Jaganathan K, Oymak S, and Hassibi B, “Sparse phase retrieval: Convex algorithms and limitations,” in 2013 IEEE International Symposium on Information Theory. IEEE, 2013, pp. 1022–1026.
- [27] Shechtman Y, Beck A, and Eldar YC, “Gespar: Efficient phase retrieval of sparse signals,” IEEE Transactions on Signal Processing, vol. 62, no. 4, pp. 928–938, 2014.
- [28] Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow,” The Annals of Statistics, vol. 44, pp. 2221–2251, 2016.
- [29] Oymak S, Jalali A, Fazel M, and Hassibi B, “Noisy estimation of simultaneously structured models: Limitations of convex relaxation,” in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013, pp. 6019–6024.
- [30] Zhang C-H and Zhang SS, “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 1, pp. 217–242, 2014.
- [31] Javanmard A and Montanari A, “Confidence intervals and hypothesis testing for high-dimensional regression,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2869–2909, 2014.
- [32] Mitra R and Zhang C-H, “The benefit of group sparsity in group inference with de-biased scaled group lasso,” Electronic Journal of Statistics, vol. 10, no. 2, pp. 1829–1873, 2016.
- [33] Cai TT and Guo Z, “Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity,” The Annals of Statistics, vol. 45, no. 2, pp. 615–646, 2017.
- [34] Negahban SN, Ravikumar P, Wainwright MJ, and Yu B, “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.
- [35] Stojnic M, Xu W, and Hassibi B, “Compressed sensing-probabilistic analysis of a null-space characterization,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 3377–3380.
- [36] Candes E, Romberg J, and Tao T, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
- [37] Candes E and Tao T, “The Dantzig selector: Statistical estimation when p is much larger than n,” The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.
- [38] Cai TT and Zhang A, “Compressed sensing and affine rank minimization under restricted isometry,” IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3279–3290, 2013.
- [39] Cai TT and Zhang A, “Sharp RIP bound for sparse signal and low-rank matrix recovery,” Applied and Computational Harmonic Analysis, vol. 35, no. 1, pp. 74–93, 2013.
- [40] Cai TT and Zhang A, “Sparse representation of a polytope and recovery of sparse signals and low-rank matrices,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 122–132, 2014.
- [41] Poignard B, “Asymptotic theory of the adaptive sparse group lasso,” Annals of the Institute of Statistical Mathematics, pp. 1–32, 2018.
- [42] Rao N, Cox C, Nowak R, and Rogers TT, “Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis,” in Advances in Neural Information Processing Systems, 2013, pp. 2202–2210.
- [43] Ahsen ME and Vidyasagar M, “Error bounds for compressed sensing algorithms with group sparsity: A unified approach,” Applied and Computational Harmonic Analysis, vol. 43, no. 2, pp. 212–232, 2017.
- [44] Candes EJ and Plan Y, “A probabilistic and RIPless theory of compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 11, pp. 7235–7254, 2011.
- [45] Javanmard A and Montanari A, “Debiasing the lasso: Optimal sample size for Gaussian designs,” The Annals of Statistics, vol. 46, no. 6A, pp. 2593–2622, 2018.
- [46] Foucart S and Rauhut H, A mathematical introduction to compressive sensing. Birkhäuser Basel, 2013, vol. 1, no. 3.
- [47] Candes E and Tao T, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
- [48] Bertsekas D and Nedic A, “Convex analysis and optimization (conservative),” 2003.
- [49] Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
- [50] Gross D, “Recovering low-rank matrices from few coefficients in any basis,” IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1548–1566, 2011.
- [51] Candès EJ, Li X, Ma Y, and Wright J, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
- [52] Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
- [53] Van de Geer S, Bühlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014.
- [54] Sun T and Zhang C-H, “Scaled sparse linear regression,” Biometrika, vol. 99, no. 4, pp. 879–898, 2012.
- [55] Tibshirani R, Saunders M, Rosset S, Zhu J, and Knight K, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
- [56] Rinaldo A, “Properties and refinements of the fused lasso,” The Annals of Statistics, vol. 37, no. 5B, pp. 2922–2952, 2009.
- [57] Jalali A and Fazel M, “A convex method for learning d-valued models,” in 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 1123–1126.
- [58] Jalali A, Javanmard A, and Fazel M, “New computational and statistical aspects of regularized regression with application to rare feature selection and aggregation,” arXiv preprint arXiv:1904.05338, 2019.
- [59] Sprechmann P, Ramirez I, Sapiro G, and Eldar Y, “Collaborative hierarchical sparse modeling,” in 2010 44th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2010, pp. 1–6.
- [60] Hsu D, Kakade S, and Zhang T, “A tail inequality for quadratic forms of subgaussian random vectors,” Electronic Communications in Probability, vol. 17, 2012.
- [61] Rudelson M and Vershynin R, “Hanson-Wright inequality and subgaussian concentration,” Electronic Communications in Probability, vol. 18, 2013.
- [62] Vershynin R, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
- [63] Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” Annals of Statistics, pp. 1302–1338, 2000.
