Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

T Tony Cai; Anru R Zhang; Yuchen Zhou

doi:10.1109/tit.2022.3175455

Author manuscript; available in PMC: 2023 Sep 1.

Published in final edited form as: IEEE Trans Inf Theory. 2022 May 16;68(9):5975–6002. doi: 10.1109/tit.2022.3175455

PMCID: PMC9974176 NIHMSID: NIHMS1830781 PMID: 36865503

Abstract

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model – an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

Index Terms: approximate dual certificate, convex optimization, sparsity, sparse group Lasso, simultaneously structured model

I. Introduction

Consider the high-dimensional double sparse regression with simultaneously group-wise and element-wise sparsity structures

y = X β^{*} + ε, or equivalently y_{i} = X_{i}^{⊤} β^{*} + ε_{i}, i = 1, \dots, n .

(1)

Here, the covariates $X \in ℝ^{n \times p}$ and parameter β* are divided into d known groups, where the jth group contains b_j variables,

X = [X_{(1)} \dots X_{(d)}], β^{*} = {({(β_{(1)}^{*})}^{⊤}, \dots {(β_{(d)}^{*})}^{⊤})}^{⊤}, X_{(j)} \in ℝ^{n \times b_{j}}, β_{(j)}^{*} \in ℝ^{b_{j}};

(2)

β* is a (s, s_g)-sparse vector in the sense that

{‖ β^{*} ‖}_{0, 2} ≔ \sum_{j = 1}^{d} 1_{{β_{(j)}^{*} \neq 0}} \leq s_{g} and {‖ β^{*} ‖}_{0} ≔ \sum_{i = 1}^{p} 1_{{β_{i}^{*} \neq 0}} \leq s,

(3)

The focus of this paper is on the estimation of and inference for β* based on (y, X). This problem is important in a variety of applications. For example, in genome-wide association studies (GWAS) [1], genes can be grouped into pathways, and it is believed that only a small portion of the pathways contain causal single nucleotide polymorphisms (SNPs) and that, within a causal pathway, the number of causal SNPs is much smaller than the number of non-causal SNPs. The sparse group Lasso has been applied to identify causal genes or SNPs associated with a certain trait [1]. Other examples include cancer diagnosis and therapy [2], [3], classification [4], and climate prediction [5], among many others. The problem can also be viewed as a prototype of various problems in statistics and machine learning, such as sparse multiple response regression [6] and multiple task learning [7], [8], [9].

The sparse group Lasso [10], [11], [12] provides a classic and straightforward estimator for β*:

\hat{β} = \underset{β}{arg min} {‖ y - X β ‖}_{2}^{2} + λ {‖ β ‖}_{1} + λ_{g} {‖ β ‖}_{1, 2} .

(4)

Here, ${‖ β ‖}_{1} = \sum_{i = 1}^{p} | β_{i} |$ and ‖β‖_1,2 = Σ_j ‖β_(j)‖₂ are ℓ₁ and ℓ_1,2 convex regularizers to account for element-wise and group-wise sparsity structures, respectively. λ, λ_g ≥ 0 are tuning parameters. In the noiseless setting that ε = 0, one can apply the constrained ℓ₁ + ℓ_1,2 minimization instead to estimate β*:

\hat{β} = arg min λ {‖ β ‖}_{1} + λ_{g} {‖ β ‖}_{1, 2} subject to y = X β .

(5)

In fact, when λ, λ_g tend to zero while λ/λ_g is fixed as a constant, the sparse group Lasso (4) tends to the ℓ₁ + ℓ_1,2 minimization (5).
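As an illustration of how the estimator (4) can be computed, the following is a minimal sketch using proximal gradient descent. This solver is an assumption made for illustration and is not taken from the paper; the function names (`soft_threshold`, `prox_sgl`, `sparse_group_lasso`), the fixed step size, and the iteration count are all hypothetical choices. It relies on the standard fact that the proximal map of the ℓ₁ + ℓ_1,2 penalty is element-wise soft-thresholding followed by group-wise shrinkage.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Element-wise soft-thresholding: sgn(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def prox_sgl(v, groups, t_l1, t_group):
    """Prox of t_l1*||.||_1 + t_group*||.||_{1,2}: soft-threshold, then shrink each group."""
    out = soft_threshold(v, t_l1)
    for idx in groups:
        norm = np.linalg.norm(out[idx])
        scale = max(0.0, 1.0 - t_group / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out

def sparse_group_lasso(X, y, groups, lam, lam_g, n_iter=500):
    """Proximal gradient descent on ||y - X b||_2^2 + lam*||b||_1 + lam_g*||b||_{1,2}."""
    beta = np.zeros(X.shape[1])
    # The squared loss has gradient 2 X^T (X b - y) with Lipschitz constant 2*sigma_max(X)^2.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        beta = prox_sgl(beta - step * grad, groups, step * lam, step * lam_g)
    return beta
```

Here `groups` is assumed to be a list of index lists, one per group (1), …, (d). Setting `lam_g = 0` gives a Lasso solver for (6) and setting `lam = 0` gives a group Lasso solver for (7), which makes the three estimators easy to compare on the same data.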

When β* is only element-wise sparse, the regular Lasso [13]

{\hat{β}}^{L} = \underset{β}{arg min} {‖ y - X β ‖}_{2}^{2} + λ {‖ β ‖}_{1}

(6)

can be applied and its theoretical properties have been well studied. See, for example, [14], [15]. When β* is only group-wise sparse, the group Lasso

{\hat{β}}^{G L} = \underset{β}{arg min} {‖ y - X β ‖}_{2}^{2} + λ_{g} {‖ β ‖}_{1, 2}

(7)

and its variations have been widely investigated [16], [17], [18]. However, for estimating a simultaneously element-wise and group-wise sparse vector β*, the theoretical properties of sparse group Lasso, including the optimal rate of convergence and sample complexity, remain unclear to the best of our knowledge, despite its many empirical successes in practice.

A. Simultaneously Structured Models

More broadly speaking, simultaneously structured models, in which the parameter of interest has multiple structures at the same time, have attracted enormous attention in many fields including statistics, applied mathematics, and machine learning. In addition to the high-dimensional double sparse regression, other simultaneously structured models include sparse principal component analysis [19], [20], tensor singular value decomposition [21], [22], simultaneously sparse and low-rank matrix/tensor recovery [23], [24], sparse matrix/tensor SVD [25], and sparse phase retrieval [26], [27], [28]. As shown in [29], [23], by minimizing multi-objective regularizers with norms associated with these structures (such as the ℓ₁ norm for element-wise sparsity, the nuclear norm for low-rankness, and the total variation norm for piecewise constant structures), one usually cannot do better than applying an algorithm that only exploits one structure. In particular, they illustrated that simultaneously sparse and low-rank matrices cannot be well estimated by penalizing with ℓ₁ and nuclear norm regularizers. Instead, non-convex methods were proposed and shown to achieve better performance.

However, based on their results, it remains an open question whether convex regularization, such as sparse group Lasso or ℓ₁ + ℓ_1,2 minimization, can achieve good performance in the estimation of parameters with two types of sparsity structures, such as the aforementioned high-dimensional double sparse regression. Specifically, as illustrated in Section II-B, a direct application of [23] does not provide a sample complexity lower bound for exact recovery that matches our upper bound.

B. Optimality and Related Literature

This paper fills the void of statistical limits of sparse group Lasso and provides an affirmative answer to the aforementioned question: by exploiting both element-wise and group-wise sparsity structures, the ℓ₁ + ℓ_1,2 regularization does provide better performance in high-dimensional double sparse regression. In particular, in the noiseless case, it is shown that (s, s_g)-sparse vectors can be exactly recovered and approximately (s, s_g)-sparse vectors can be stably estimated with high probability whenever the sample size satisfies n ≳ s_g log(d/s_g) + s log(es_gb), where b = max_1≤i≤d b_i. On the other hand, we prove that exact recovery cannot be achieved by ℓ₁ + ℓ_1,2 regularization and stable estimation of approximately (s, s_g)-sparse vectors is impossible in general unless n ≳ s_g log(d/s_g) + s log(es_gb/s). We then consider the noisy case and develop matching upper and lower bounds on the convergence rate for the estimation error. Simulation studies are carried out and the results support our theoretical findings. In addition, statistical inference for the individual coordinates of β* is studied. A confidence interval is constructed based on the debiased sparse group Lasso estimator and its asymptotic property. The results show that by exploiting the simultaneously element-wise and group-wise sparsity structures, the debiased sparse group Lasso requires a smaller sample size than the debiased Lasso and debiased group Lasso in the literature [30], [31], [32], [33].

The theoretical analysis of sparse group Lasso and ℓ₁ + ℓ_1,2 minimization is highly non-trivial. First, the regularizer λ‖ · ‖₁ + λ_g‖ · ‖_1,2 is not decomposable with respect to the support of β*, so the classic techniques of decomposable regularizers [34] and null space property [35] may not be suitable here. Despite a substantial body of literature on high-dimensional element-wise sparse vector estimation based on the restricted isometry property (RIP) [36], [37], [38], [39], [40] and the restricted eigenvalue [14], these techniques cannot provide nearly optimal results for sparse group Lasso here, as it is technically difficult to partition general vectors into simultaneously element-wise and group-wise sparse ones in a way that preserves some ordering structures. Departing from the previous literature, our theoretical analysis relies on a novel construction of an approximate dual certificate. See Section II-C for further details. Although our results mostly focus on the performance of sparse group Lasso and ℓ₁ + ℓ_1,2 estimators, the techniques of approximate dual certificates on multi-norm structures here can also be of independent interest.

The statistical properties of sparse group Lasso and related estimators have been studied previously. For example, [5] developed consistency results for estimators with general tree-structured norm regularizers, of which the sparse group Lasso is a special case. [41] analyzed the asymptotic behaviors of the adaptive sparse group Lasso estimator. [4], [42] studied multi-task learning and classification problems based on a variant of the sparse group Lasso estimator. [12] studied multivariate linear regression via sparse group Lasso. [43] provided a theoretical framework for developing error bounds of the group Lasso, sparse group Lasso, and group Lasso with tree structured overlapping groups. Specifically, their results imply that the group-wise sparse signal can be exactly recovered with high probability by solving (5) if the sample size satisfies n ≳ s_g (b + log d). Different from previous results, this paper focuses on both the required sample size and the convergence rate of the estimation error of sparse group Lasso. To the best of our knowledge, this is the first paper that provides optimal theoretical guarantees for both the sample complexity and estimation error of sparse group Lasso.

C. Organization of the Paper

The rest of the article is organized as follows. After a brief introduction to notation and preliminaries in Section II-A, the main theoretical results on constrained ℓ₁ + ℓ_1,2 minimization in the noiseless setting are presented in Section II-B and the key proof ideas are explained in Section II-C. Results for sparse group Lasso in the noisy setting are discussed in Section III. In particular, the optimal rate of estimation error and statistical inference are studied in Sections III-A and III-B, respectively. In Section IV-A, we introduce a practical scheme to select tuning parameters. In Section IV-B, we provide simulation results in both noiseless and noisy cases to justify our theoretical findings. The proofs of technical results are given in Section VI. All technical lemmas and their proofs can be found in Appendix A.

II. ℓ₁ + ℓ_1,2 Minimization in Noiseless Case

A. Notation and Preliminaries

The following notation will be used throughout the paper. We denote a⋀b = min{a, b}, a⋁b = max{a, b}. Let sgn(·) be the sign function, i.e., sgn(x) = 1, 0, or −1, if x > 0, x = 0, or x < 0, respectively. H_α(·) is the soft-thresholding function such that H_α(x) = sgn(x) · {(|x| − α) ⋁ 0} for any $x \in ℝ$ . We say a ≲ b and a ≳ b if a ≤ Cb and b ≤ Ca for some uniform constant C > 0, respectively. a ≍ b means a ≲ b and a ≳ b both hold. Let the uppercase C, C₁, C₀, … and lowercase c, c₁, c₀, … denote large and small positive constants respectively, whose actual values may vary from line to line. Throughout the paper, we focus on the parameter index set {1, …, p} partitioned into d groups. Denote (1), …, (d) ⊆ {1, … , p} as the index sets belonging to each group. Additionally, for any group index subset G ⊆ {1, …, d}, define (G) = ∪_j∈G(j), (G^c) = ∪_j∉G(j). For any vector γ and index subset T, $γ_{T} \in ℝ^{| T |}$ represents the sub-vector of γ with index set T. In particular, γ_(G) represents the sub-vector of γ in the union of groups j ∈ G. Define the ℓ_q norm of any vector γ as ‖γ‖_q = (∑_i|γ_i|^q)^1/q. For any vector $γ \in ℝ^{p}$ with group structures, we also define the ℓ_q1,q2 norm for any 0 ≤ q₁, q₂ ≤ ∞ as

{‖ γ ‖}_{q_{1}, q_{2}} = {(\sum_{j = 1}^{d} {‖ γ_{(j)} ‖}_{q_{2}}^{q_{1}})}^{1 / q_{1}} = {(\sum_{j = 1}^{d} {(\sum_{i \in (j)} {| γ_{i} |}^{q_{2}})}^{q_{1} / q_{2}})}^{1 / q_{1}} .

In particular, ${‖ γ ‖}_{0, 2} = \sum_{j = 1}^{d} 1_{{γ_{(j)} \neq 0}}$ is the number of non-zero groups of γ, ‖γ‖_∞,2 = max_j ‖γ_(j)‖₂ is the maximum ℓ₂ norm among all groups of γ, and ${‖ γ ‖}_{1, 2} = \sum_{j = 1}^{d} {‖ γ_{(j)} ‖}_{2}$ is the group-wise ℓ₁ penalty. With a slight abuse of notation, we simply denote ‖γ_T‖_q1,q2 = ‖u‖_q1,q2 if $u \in ℝ^{p}$ , u restricted on subset T is γ_T and u restricted on T^c is 0.
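The mixed norms above can be computed directly from the definition. The following is a minimal sketch; the function name `mixed_norm` and the representation of the groups as a list of index lists are assumptions made for illustration.

```python
import numpy as np

def mixed_norm(gamma, groups, q1, q2):
    """l_{q1,q2} norm: the l_{q1} norm of the vector of per-group l_{q2} norms."""
    group_norms = np.array([np.linalg.norm(gamma[idx], ord=q2) for idx in groups])
    return np.linalg.norm(group_norms, ord=q1)

gamma = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5]]       # three groups of size two
n_groups = mixed_norm(gamma, groups, 0, 2)        # number of non-zero groups
penalty = mixed_norm(gamma, groups, 1, 2)         # group-wise l1 penalty
largest = mixed_norm(gamma, groups, np.inf, 2)    # largest group l2 norm
```

With `ord=0`, NumPy's vector norm counts non-zero entries, which matches the convention for ‖γ‖_0,2 used here.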

The focus of this paper is on simultaneously element-wise and group-wise sparse vectors defined as follows.

Definition 1 (Simultaneous element-wise and group-wise sparsity): Assume $β^{*} \in ℝ^{p}$ is associated with group partition (1), …, (d). For positive integers s, s_g satisfying s_g ≤ d and $s_{g} \leq s \leq {max}_{Ω \subseteq {1, \dots, d}, | Ω | = s_{g}} \sum_{i \in Ω} b_{i}$ , we say β* is (s, s_g)-sparse if

{‖ β^{*} ‖}_{0, 2} = \sum_{j = 1}^{d} 1_{{β_{(j)}^{*} \neq 0}} \leq s_{g}, {‖ β^{*} ‖}_{0} = \sum_{i} 1_{{β_{i}^{*} \neq 0}} \leq s .
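As a concrete reading of Definition 1, the check below counts non-zero entries and non-zero groups. It is only a sketch; `is_double_sparse` and the list-of-index-lists group encoding are hypothetical names chosen for this example.

```python
import numpy as np

def is_double_sparse(beta, groups, s, s_g):
    """Return True if beta has at most s non-zero entries and at most s_g non-zero groups."""
    nnz = int(np.count_nonzero(beta))
    nnz_groups = sum(1 for idx in groups if np.any(beta[idx] != 0))
    return nnz <= s and nnz_groups <= s_g

beta = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, -1.0, 0.0])
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
ok = is_double_sparse(beta, groups, s=3, s_g=2)         # 3 non-zeros in 2 groups
too_tight = is_double_sparse(beta, groups, s=3, s_g=1)  # needs 2 groups, only 1 allowed
```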

B. Noiseless Case and Sample Complexity

To analyze the performance of sparse group Lasso and ℓ₁ + ℓ_1,2 minimization, we first introduce the following assumption on the design matrix X.

Assumption 1 (Sub-Gaussian assumption): Suppose all rows of X are i.i.d. centered sub-Gaussian distributed. Specifically, $E X_{i \cdot} = 0$ , $Var (X_{i \cdot}^{⊤}) = Σ$ , and for any $α \in ℝ^{p}$ , we have $E exp (α^{⊤} Σ^{- 1 / 2} X_{i \cdot}^{⊤}) \leq exp (κ^{2} {‖ α ‖}_{2}^{2} / 2)$ for constant κ > 0. We also assume there exist two constants C_max ≥ c_min > 0 such that c_min ≤ σ_min (Σ) ≤ σ_max(Σ) ≤ C_max, where σ_max(Σ) and σ_min(Σ) are the largest and smallest eigenvalues of Σ, respectively.

Clearly, a random matrix X with i.i.d. standard normal entries satisfies this assumption; this design is referred to as the Gaussian ensemble and has been considered as a benchmark setting in the compressed sensing and high-dimensional regression literature [44], [45].

The following theorem shows that the ℓ₁ + ℓ_1,2 minimization achieves the exact recovery with high probability when β* is simultaneously element-wise and group-wise sparse, X is weakly dependent, and Assumption 1 holds. The theorem also provides a more general upper bound on estimation error if β* is approximately element-wise and group-wise sparse.

Theorem 1 (ℓ₁ + ℓ_1,2 minimization in noiseless case): Suppose one observes y = Xβ*, where X has the group structure (2) and satisfies Assumption 1, β* is (s, s_g)-sparse, and b = max_1≤i≤d b_i. Let T be the support of β*. Suppose there exist uniform constants C, c > 0 such that

n \geq C (s_{g} log (d / s_{g}) + s log (e s_{g} b)),

(8)

max_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} \leq c / \sqrt{s},

(9)

then the constrained ℓ₁ + ℓ_1,2 minimization (5) with $λ_{g} = \sqrt{s / s_{g}} λ$ achieves the exact recovery with probability at least 1 − C exp(−cn/s).

Moreover, if $β^{*} \in ℝ^{p}$ is a general vector and $\hat{β}$ is the solution to the constrained ℓ₁ + ℓ_1,2 minimization (5) with $λ_{g} = \sqrt{s / s_{g}} λ$ , then

{‖ \hat{β} - β^{*} ‖}_{2} ≲ min_{S \in 𝒮} (\frac{1}{\sqrt{s}} {‖ β_{S^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{S^{c}}^{*} ‖}_{1, 2}) .

(10)

with probability at least 1 − C exp(−cn/s). Here,

𝒮 = {S : {‖ β_{S}^{*} ‖}_{0} \leq s, {‖ β_{S}^{*} ‖}_{0, 2} \leq s_{g}, max_{i \in S^{c}} {‖ Σ_{i, S} Σ_{S, S}^{- 1} ‖}_{2} \leq c / \sqrt{s}} .

Remark 1 (Interpretation and comparison): In Theorem 1, the required sample size for achieving exact recovery contains two terms: s_g log(d/s_g) and s log(es_gb). Intuitively speaking, s_g log(d/s_g) corresponds to the complexity of identifying s_g non-zero groups and s log(es_gb) corresponds to the complexity of estimating s non-zero elements of β in s_g known groups.

When β* is only element-wise or group-wise sparse, one can apply respectively the classic ℓ₁ or ℓ_1,2 minimization to recover β*,

{\hat{β}}^{ℓ_{1}} = \underset{β}{arg min} {‖ β ‖}_{1} subject to y = X β,

(11)

{\hat{β}}^{ℓ_{1, 2}} = \underset{β}{arg min} {‖ β ‖}_{1, 2} subject to y = X β .

(12)

The ℓ₁ minimization and ℓ_1,2 minimization here are special forms of the regular Lasso and group Lasso, respectively (if λ, λ_g = 0₊ in (6) and (7)). In particular, if the group sizes satisfy b₁ ≍ ⋯ ≍ b_d ≍ b, then to ensure exact recovery in the noiseless setting with high probability, the ℓ₁ minimization (11) requires n ≳ Cs log(ebd/s) [46] and the group Lasso (12) requires n ≳ s_g(b + log(ed/s_g)). The ℓ₁ + ℓ_1,2 minimization (5) has provable advantages over both regular and group Lasso when b ≫ log(d) ≫ log(es_gb) and s_gb/ log(es_gb) ≫ s ≫ s_g. In particular, when s_g = s, the double sparse regression reduces to the vanilla sparse linear regression, and the upper bound (10) matches the classic upper bound for ℓ₁ minimization [44].
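To get a feel for the comparison in Remark 1, the three sample-size orders can be evaluated numerically. The sketch below ignores all unspecified constants (they are set to one, which is an assumption), so only the relative magnitudes are meaningful; the function names and the example values of (s, s_g, d, b) are hypothetical.

```python
import math

def n_sparse_group(s, s_g, d, b):
    """Order of the sample size in (8): s_g*log(d/s_g) + s*log(e*s_g*b)."""
    return s_g * math.log(d / s_g) + s * math.log(math.e * s_g * b)

def n_lasso(s, d, b):
    """Order of the sample size for l1 minimization: s*log(e*b*d/s)."""
    return s * math.log(math.e * b * d / s)

def n_group_lasso(s_g, d, b):
    """Order of the sample size for l_{1,2} minimization: s_g*(b + log(e*d/s_g))."""
    return s_g * (b + math.log(math.e * d / s_g))

# Many large groups, few active groups, few active entries per active group.
s, s_g, d, b = 20, 2, 10**6, 100
orders = {
    "sparse group": n_sparse_group(s, s_g, d, b),
    "lasso": n_lasso(s, d, b),
    "group lasso": n_group_lasso(s_g, d, b),
}
```

In this illustrative configuration the sparse group order is the smallest of the three, in line with the regime described in the remark.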

In addition, Condition (9) is an important technical condition we used in our theoretical analysis.

Next, we consider the sample complexity lower bound. Suppose b₁ = b₂ = ⋯ = b_d and d ≥ 2s_g. Recall that one observes y = Xβ* without noise and aims to estimate the (s, s_g)-sparse vector β* based on y and X. As indicated by classic results in compressed sensing [47], with sufficient computing power, the ℓ₀ minimization below achieves exact recovery of β*

{\hat{β}}^{ℓ_{0}} = arg min {‖ β ‖}_{0} subject to X β = y

(13)

as long as X is non-degenerate and n ≥ 2s. This bound is actually sharp: when n < 2s, for any set T ⊆ {1, …, db} with cardinality 2s, one can find a non-zero vector γ such that supp(γ) ⊆ T and Xγ = 0. By choosing an appropriate T, we can split the support of γ to obtain two (s, s_g)-sparse vectors β₁, β₂ satisfying β₁ + β₂ = γ. Then, Xβ₁ = X(−β₂) but there is no way to distinguish β₁ and −β₂ merely based on X and y = Xβ₁ = X(−β₂).

However, the ℓ₀ minimization (13) is computationally infeasible in practice, while a larger sample size is required for applying more practical methods. The following theorem shows that by performing the convex ℓ₁ regularization, ℓ_1,2 regularization, or any weighted combination of them, one requires at least Ω(s_g log(d/s_g) + s log(es_gb/s)) observations to ensure exact recovery of (s, s_g)-sparse vectors.

Theorem 2 (Sample complexity lower bound for exact recovery): Suppose b₁ = ⋯ = b_d = b and d, b ≥ 3. Suppose X is an n-by-(db) matrix and every (2s, 2s_g)-sparse vector $β \in ℝ^{d b}$ is a minimizer of the following program for some (λ, λ_g) ∈ {(λ, λ_g) : λ, λ_g ≥ 0, λ + λ_g > 0}:

min_{z} λ {‖ z ‖}_{1} + λ_{g} {‖ z ‖}_{1, 2} subject to X z = y = X β .

In other words, suppose the ℓ₁ + ℓ_1,2 minimization exactly recovers all (2s, 2s_g)-sparse vectors β. Then we must have n ≳ s_g log(d/s_g) + s log(es_gb/s).

The following sample complexity lower bound shows that for arbitrary methods, to ensure stable estimation of all approximately sparse vectors, one requires at least Ω(s_g log(d/s_g) + s log(es_gb/s)) observations.

Theorem 3 (Sample complexity lower bound for stable estimation): Suppose b₁ = ⋯ = b_d = b, b, d ≥ 3. Assume there exists a matrix $X \in ℝ^{n \times (b d)}$ , a map $Δ : ℝ^{n} \to ℝ^{b d}$ (Δ may depend on X), and a constant C > 0 satisfying

{‖ β - Δ (X β) ‖}_{2} \leq C (\frac{{‖ β ‖}_{1}}{\sqrt{s}} + \frac{{‖ β ‖}_{1, 2}}{\sqrt{s_{g}}})

(14)

for all $β \in ℝ^{p}$ and some s, s_g satisfying d ≥ s_g, s_gb ≥ s ≥ s_g. Then there exist constants C₀ and c₀ that depend only on C such that whenever s_g ≥ C₀, we must have

n \geq c_{0} (s_{g} log (d / s_{g}) + s log (e s_{g} b / s)) .

Remark 2 (Optimality and comparison with previous results): Theorems 2 and 3 show that the sample complexity upper bound in Theorem 1 is rate-optimal under a weak condition: log(es_gb) ≍ log(es_gb)−log(s) or log(d) ≥ 2s log(s)/s_g. Oymak et al. [23] provided a general analysis for convex regularization of simultaneously structured parameter estimation. Specifically for the high-dimensional double sparse regression, a direct application of their Theorem 3.2 and Corollary 3.1 implies that if ℓ₁ + ℓ_1,2 minimization can exactly recover an (s, s_g)-sparse vector β* with a constant probability, one must have n ≳ s. We can see that Theorem 2 provides a sharper lower bound on sample complexity.

In addition, by setting s_g = s, the lower bound in Theorems 2 and 3 reduces to n ≳ s log(p/s), which matches the optimal sample complexity lower bound for exact recovery of s-sparse vectors [46, Theorem 10.11, Proposition 10.7]. By setting s = s_gb, we obtain a sample complexity lower bound n ≳ s_g(b + log(d/s_g)) for (approximate) s_g-group-wise sparse vector recovery and stable estimation. To the best of our knowledge, this is the first sample complexity lower bound for group Lasso.

C. Proof Sketches

We briefly discuss the proof sketches of the main technical results in this section. The detailed proofs are postponed to Section VI.

The proof of Theorem 1 is based on a novel dual certificate scheme. The dual certificate [48] has been used in the theoretical analysis for various convex optimization methods in high-dimensional problems, such as matrix completion [49], [50], compressed sensing [44], robust PCA [51], tensor completion [52], etc. The high-dimensional double sparse linear regression exhibits different aspects from these previous works due to the simultaneous sparsity structure. In particular, we can show that if the u_et defined below is in the row space of X, it can be used as an exact dual certificate for recovery of (s, s_g)-sparse vector β*:

u_{e t} = v_{e t} + w_{e t} \in ℝ^{p}, \quad \begin{cases} {(v_{e t})}_{(j)} = \sqrt{s / s_{g}} β_{(j)}^{*} / {‖ β_{(j)}^{*} ‖}_{2}, & j \in G; \\ {‖ {(v_{e t})}_{(j)} ‖}_{2} < \sqrt{s / s_{g}}, & j \in G^{c}; \end{cases} \quad \begin{cases} {(w_{e t})}_{T} = sgn (β_{T}^{*}), \\ {‖ {(w_{e t})}_{T^{c}} ‖}_{\infty} < 1. \end{cases}

(15)

Here, T and G are the element-wise and group-wise supports of β*:

T = {i : β_{i}^{*} \neq 0} \subseteq {1, \dots, p}, G = {j : β_{(j)}^{*} \neq 0} \subseteq {1, \dots, d} .

Roughly speaking, u_et is the sub-gradient of objective function (5) evaluated at β = β*. If u_et is in the row space of X, the sub-gradient will be perpendicular to the feasible set of (5), which implies that β* is the unique minimizer of ℓ₁ + ℓ_1,2 minimization (5).
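The optimality half of this argument can be written out in one chain of inequalities. The display below is a sketch under stated assumptions: λ_g = \sqrt{s/s_g} λ as in Theorem 1, u_et = X^⊤ z for some z ∈ ℝ^n (the symbol z is introduced here only for illustration), and h is any vector in ker(X), so that β* + h is feasible for (5). Uniqueness additionally uses the strict inequalities in (15) and is not shown.

```latex
% w_{et} is a subgradient of \|\cdot\|_1 at \beta^*, and
% v_{et} is a subgradient of \sqrt{s/s_g}\,\|\cdot\|_{1,2} at \beta^*.
\begin{align*}
\|\beta^* + h\|_1 + \sqrt{s/s_g}\,\|\beta^* + h\|_{1,2}
  &\ge \|\beta^*\|_1 + \langle w_{et}, h\rangle
     + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + \langle v_{et}, h\rangle \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + \langle u_{et}, h\rangle \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + z^\top X h \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2}.
\end{align*}
```

Multiplying through by λ shows that no feasible point of (5) has a smaller objective value than β*.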

For more general vector β* that does not necessarily have a sparse support T or G, we consider the following (s, s_g)-sparse approximation:

β^{a p} = \underset{S}{arg min} \frac{1}{\sqrt{s}} {‖ β_{S^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{S^{c}}^{*} ‖}_{1, 2} subject to {‖ β_{S}^{*} ‖}_{0} \leq s, {‖ β_{S}^{*} ‖}_{0, 2} \leq s_{g}, max_{i \in S^{c}} {‖ Σ_{i, S} Σ_{S, S}^{- 1} ‖}_{2} \leq c / \sqrt{s} .

(16)

Let $T = {i : β_{i}^{a p} \neq 0}$ and G = {j : (β^ap)_(j) ≠ 0} be the element-wise and group-wise supports of β^ap. Define

{\tilde{u}}_{0} = {\tilde{v}}_{0} + {\tilde{w}}_{0} \in ℝ^{p}, \quad \begin{cases} {({\tilde{v}}_{0})}_{(j)} = \sqrt{s / s_{g}} β_{T, (j)}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2}, & j \in G; \\ {‖ {({\tilde{v}}_{0})}_{(j)} ‖}_{2} < \sqrt{s / s_{g}}, & j \in G^{c}; \end{cases} \quad \begin{cases} {({\tilde{w}}_{0})}_{T} = sgn (β_{T}^{*}), \\ {‖ {({\tilde{w}}_{0})}_{T^{c}} ‖}_{\infty} < 1. \end{cases}

(17)

Here $β_{T, (j)}^{*} \in ℝ^{b_{j}}$ is the subvector β* restricted on the j-th group with all entries in T^c set to zero. Similarly to the exactly sparse case, if ũ₀ is in the row space of X and the true β* is approximately (s, s_g)-sparse, the minimizer of (5) will be close to β*.

However, it is often difficult to find an exact dual certificate that lies in the row space of X and satisfies stringent conditions in (15) or (17). We instead propose to analyze via the approximate dual certificate defined as (18) in the following lemma.

Lemma 1 (Approximate dual certificate for sparse group Lasso): Suppose T, G are element-wise and group-wise support defined in (16). ũ₀ is defined in (17). Assume X satisfies $σ_{min} (X_{T}^{⊤} X_{T} / n) \geq c_{min} / 2$ . If there exists $u \in ℝ^{p}$ in the row span of X satisfying

{‖ u_{T} - {({\tilde{u}}_{0})}_{T} ‖}_{2} \cdot max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \leq c_{min} / 8, {‖ H_{1 / 2} (u_{(G^{c})}) ‖}_{\infty, 2} \leq \sqrt{s_{0}} / 2, {‖ u_{(G) \ T} ‖}_{\infty} \leq 1 / 2,

(18)

then the conclusion (10) of Theorem 1 holds with probability at least 1 − 2e^−cn. Here, H_1/2(·) is the soft-thresholding operator defined at the beginning of Section II.

If we additionally assume β* is (s, s_g)-sparse, then β* is the unique solution to the sparse group ℓ₁ +ℓ_1,2 minimization (5) with probability at least 1 – 2e^−cn.

Lemma 1 shows that the conclusion of Theorem 1 holds if there exists an approximate dual certificate u satisfying the condition (18). The following lemma shows that, under the assumptions in Theorem 1, one can find such an approximate dual certificate with high probability.

Lemma 2: Suppose X has group structure (2) and satisfies Assumption 1. Recall $σ_{min} (X_{T}^{⊤} X_{T} / n)$ is the least eigenvalue of $X_{T}^{⊤} X_{T} / n$ . Then $σ_{min} (X_{T}^{⊤} X_{T} / n) \geq 1 / 2$ and (18) holds with probability at least 1 − Ce^−cn/s, where T is defined in (16).

Another key technical tool in the proof of Theorem 1 is the following lemma, which shows that X satisfies the restricted isometry property for all simultaneously element-wise and group-wise sparse vectors with high probability when there are enough samples.

Lemma 3: If n ≥ C(s_g log(d/s_g) + s log(es_gb)),

\frac{c_{min}}{2} {‖ γ ‖}_{2}^{2} \leq \frac{1}{n} {‖ X γ ‖}_{2}^{2} \leq (C_{max} + \frac{c_{min}}{2}) {‖ γ ‖}_{2}^{2}, \forall γ \in {γ \in ℝ^{p} : {‖ γ ‖}_{0} \leq 2 s, {‖ γ ‖}_{0, 2} \leq 2 s_{g}}

(19)

with probability at least 1 – 2e^−cn.

Next we briefly discuss the proof of Theorem 2. Consider the quotient space $ℝ^{d b} / ker (X) = {[γ] ≔ γ + ker (X), γ \in ℝ^{d b}}$ and define an associated norm as ‖[γ]‖ = inf_v∈ker(X) {λ‖γ − v‖₁ + λ_g‖γ − v‖_1,2}. We show that there exist N different (s, s_g)-sparse vectors β⁽¹⁾, …, β^(N) such that log(N) ≍ s log(es_gb/s) + s_g log(d/s_g) and ‖[β⁽ⁱ⁾]‖ = 1, ‖[β⁽ⁱ⁾] − [β^(j)]‖ ≥ 2/9 for all 1 ≤ i ≠ j ≤ N. By a property of the packing number and the fact that $dim (ℝ^{d b} / ker (X)) \leq n$ , we must have N ≤ 10ⁿ. Thus n ≳ log(N) ≍ s log(es_gb/s) + s_g log(d/s_g).

We prove Theorem 3 by contradiction. Assume that

n < c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g}))

(20)

for a sufficiently small constant c₀. Let $‖ \cdot ‖ = {‖ \cdot ‖}_{1} + \sqrt{s / s_{g}} {‖ \cdot ‖}_{1, 2}$ and $B = {x \in ℝ^{d b} : ‖ x ‖ \leq 1}$ be the unit ball associated with ‖·‖. Define

d^{n} (B, ℝ^{p}) = inf_{L^{n} is a subspace of ℝ^{p} with dim (ℝ^{p} / L^{n}) \leq n} {sup_{β \in B \cap L^{n}} {‖ β ‖}_{2}} .

We have $d^{n} (B, ℝ^{p}) \leq \frac{C}{\sqrt{s}}$ by the assumption of this theorem. We can also show that there exists a uniform constant c > 0 such that

d^{n} (B, ℝ^{p}) \geq c min {\frac{1}{\sqrt{s_{0}}}, {[(\frac{s_{g}}{s} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n}) + log (e s_{g} b / s)) / n]}^{1 / 2}} .

The previous two inequalities and (20) together imply that

n \geq c (s_{g} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n}) + s log (e s_{g} b / s)) \geq c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g})) > n .

This contradiction shows that

n \geq c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g})) .

III. Sparse Group Lasso in Noisy Case

We now turn to the noisy case.

A. Optimal Rate of Estimation Error of Sparse Group Lasso

When observations are noisy, we have the following theoretical guarantee for the sparse group Lasso.

Theorem 4 (Upper bound of estimation error): Suppose y = Xβ* + ε, X satisfies Assumption 1, n ≥ C (s_g log(d/s_g) + s log(es_gb)) for some uniform constant C > 0, $ε \overset{i i d}{~} N (0, σ^{2})$ , and b = max_1≤i≤d b_i. Then the sparse group Lasso estimator (4) with

λ = C σ \sqrt{(s log (e s_{g} b) + s_{g} log (e d / s_{g})) n / s}

and

λ_{g} = \sqrt{s / s_{g}} λ

satisfies

{‖ \hat{β} - β^{*} ‖}_{2} ≲ {min}_{S \in 𝒮} {\sqrt{\frac{σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b))}{n}} + \frac{{‖ β_{S^{c}}^{*} ‖}_{1}}{\sqrt{s}} + \frac{{‖ β_{S^{c}}^{*} ‖}_{1, 2}}{\sqrt{s_{g}}}}

with probability at least

1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) .

Here,

𝒮 = {S : {‖ β_{S}^{*} ‖}_{0} \leq s, {‖ β_{S}^{*} ‖}_{0, 2} \leq s_{g}, max_{i \in S^{c}} {‖ Σ_{i, S} Σ_{S, S}^{- 1} ‖}_{2} \leq c / \sqrt{s}} .

In particular, if β* is exactly (s, s_g)-sparse and

max_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} \leq c / \sqrt{s}

holds, then

{‖ \hat{β} - β^{*} ‖}_{2}^{2} ≲ \frac{σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b))}{n}

(21)

with probability at least

1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) .
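The tuning parameters and the error rate in Theorem 4 are simple closed-form expressions in (σ, n, s, s_g, d, b). The sketch below evaluates them; the unspecified constant C is exposed as a parameter with an arbitrary default of 1, which is an assumption, and the function names are hypothetical.

```python
import math

def theoretical_tuning(sigma, n, s, s_g, d, b, C=1.0):
    """Theoretical (lambda, lambda_g) from Theorem 4, with the unknown constant C as an input."""
    complexity = s * math.log(math.e * s_g * b) + s_g * math.log(math.e * d / s_g)
    lam = C * sigma * math.sqrt(complexity * n / s)
    lam_g = math.sqrt(s / s_g) * lam
    return lam, lam_g

def squared_error_rate(sigma, n, s, s_g, d, b):
    """Order of the squared l2 error in (21), up to a constant factor."""
    return sigma**2 * (s_g * math.log(d / s_g) + s * math.log(math.e * s_g * b)) / n

lam, lam_g = theoretical_tuning(sigma=1.0, n=500, s=20, s_g=4, d=1000, b=50)
rate_500 = squared_error_rate(1.0, 500, 20, 4, 1000, 50)
rate_1000 = squared_error_rate(1.0, 1000, 20, 4, 1000, 50)
```

Doubling n halves the squared-error order, and the ratio λ_g/λ depends only on s/s_g, which is what motivates the reparametrization by τ in Section IV-A.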

In addition, we focus on the following class of simultaneously element-wise and group-wise sparse vectors,

ℱ_{s, s_{g}} = {β : {‖ β ‖}_{0} \leq s, {‖ β ‖}_{0, 2} \leq s_{g}} .

The following minimax lower bound of estimation error holds.

Theorem 5 (Lower bound of estimation error): Suppose X satisfies Assumption 1, b₁ = ⋯ = b_d = b, and d, b ≥ 3. Then we have

inf_{\hat{β}} sup_{β \in ℱ_{s, s_{g}}} E {‖ \hat{β} - β ‖}_{2}^{2} ≳ \frac{σ^{2} (s_{g} log (e d / s_{g}) + s log (e s_{g} b / s))}{n} .

Remark 3: Theorems 4 and 5 together show that the sparse group Lasso yields the minimax optimal rate of convergence as long as the following condition holds: log(es_gb) ≍ log(es_gb) − log(s) or log(d) ≳ s log(s)/s_g.

Remark 4: We briefly discuss the main proof ideas of Theorem 5 here. First, we randomly generate a series of subsets Ω⁽ⁱ⁾ ⊆ {1, …, p} as feasible supports of (s, s_g)-sparse vectors. Then, we prove by a probabilistic argument that there exist N ≳ (s_g log(d/s_g) + s log(es_gb/s)) subsets ${Ω^{(i)}}_{i = 1}^{N}$ such that |Ω⁽ⁱ⁾ ∩ Ω^(j)| < 8s_g⌊s/s_g⌋/9 for any i < j. Next, we construct a series of candidate (s, s_g)-sparse vectors β⁽ⁱ⁾ such that $β_{k}^{(i)} = τ 1_{{k \in Ω^{(i)}}}$ . Intuitively speaking, ${β^{(i)}}_{i = 1}^{N}$ are non-distinguishable based only on observations (y, X) by such a construction. Theorem 5 then follows by choosing an appropriate τ and the generalized Fano’s lemma.

B. Statistical Inference via Debiased Sparse Group Lasso

We further consider the statistical inference for β* under the double sparse linear regression model. First, let $\hat{β}$ be the sparse group Lasso estimator given by (4). Inspired by the recent advances in inference for high-dimensional linear regression [30], [53], [31], [33], we propose the following debiased sparse group Lasso estimator,

{\hat{β}}^{u} = \hat{β} + \frac{1}{n} \hat{M} X^{⊤} (y - X \hat{β}) .

(22)

Here, $\hat{Σ} = \frac{1}{n} \sum_{k = 1}^{n} X_{k} X_{k}^{⊤}$ is the sample covariance matrix and $\hat{M} = {[{\hat{m}}_{1} \; \dots \; {\hat{m}}_{p}]}^{⊤}$ is an approximation of the inverse covariance matrix Σ⁻¹, where ${\hat{m}}_{i}$ is the solution to the following convex optimization,

minimize m^{⊤} \hat{Σ} m subject to {‖ H_{α} (\hat{Σ} m - e_{i}) ‖}_{\infty, 2} \leq γ .

(23)

Here, H_α is the soft-thresholding operator with thresholding level α defined at the beginning of Section II and e_i is the i-th vector in the canonical basis of $ℝ^{p}$ . The following theorem establishes an asymptotic result for debiased sparse group Lasso.

Theorem 6 (Asymptotic distribution of debiased sparse group Lasso): Suppose $β^{*} \in ℝ^{p}$ is (s, s_g)-sparse, $X \in ℝ^{n \times p}$ satisfies Assumption 1, and ${max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} \leq c / \sqrt{s}$ . Set $λ = C σ \sqrt{\frac{(s log (e s_{g} b) + s_{g} log (d / s_{g})) n}{s}}$ and $λ_{g} = \sqrt{\frac{s}{s_{g}}} λ$ in (4), $α = \frac{λ}{n σ}$ , $γ = \sqrt{\frac{s}{s_{g}}} \frac{λ}{n σ}$ in (23). Then with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ , the debiased sparse group Lasso estimator ${\hat{β}}^{u}$ can be decomposed as $\sqrt{n} ({\hat{β}}^{u} - β^{*}) = Δ + w$ , where

{‖ Δ ‖}_{\infty} \leq \frac{C (s log (e s_{g} b) + s_{g} log (e d / s_{g}))}{\sqrt{n}} σ, w ∣ X ~ N (0, σ^{2} \hat{M} \hat{Σ} {\hat{M}}^{⊤}) .

(24)

In particular, if $\sqrt{n} ≫ s log (e s_{g} b) + s_{g} log (e d / s_{g})$ , for any 1 ≤ i ≤ p,

\frac{\sqrt{n} ({\hat{β}}_{i}^{u} - β_{i}^{*})}{\sqrt{{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i}}} \to N (0, σ^{2}) .

(25)
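The debiasing step (22) and the feasibility constraint in (23) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation. It assumes that H_α acts entrywise as sgn(x)(|x| − α)₊ and that the ℓ_∞,2 norm is the largest groupwise ℓ₂ norm (both are defined in Section II, outside this section). The names `soft_threshold`, `constraint_value`, and `debias` are hypothetical, and $\hat{M}$ is taken as given rather than computed by solving (23).

```python
import math

def soft_threshold(x, alpha):
    # Entrywise soft-thresholding sgn(x) * max(|x| - alpha, 0) (assumed form of H_alpha).
    return math.copysign(max(abs(x) - alpha, 0.0), x)

def constraint_value(Sigma_hat, m, i, groups, alpha):
    # Left-hand side of the constraint in (23): ||H_alpha(Sigma_hat m - e_i)||_{inf,2},
    # with the (inf,2)-norm taken as the maximum l2 norm over groups (assumption).
    p = len(m)
    z = [sum(Sigma_hat[r][c] * m[c] for c in range(p)) - (1.0 if r == i else 0.0)
         for r in range(p)]
    z = [soft_threshold(v, alpha) for v in z]
    return max(math.sqrt(sum(z[k] ** 2 for k in g)) for g in groups)

def debias(beta_hat, M_hat, X, y):
    # Debiased estimator (22): beta_hat + (1/n) * M_hat X^T (y - X beta_hat).
    n, p = len(X), len(beta_hat)
    resid = [y[k] - sum(X[k][j] * beta_hat[j] for j in range(p)) for k in range(n)]
    grad = [sum(X[k][j] * resid[k] for k in range(n)) / n for j in range(p)]
    return [beta_hat[i] + sum(M_hat[i][j] * grad[j] for j in range(p))
            for i in range(p)]
```

When $\hat{M}$ equals the exact inverse of $\hat{Σ}$ (possible only if $\hat{Σ}$ is invertible), the constraint value is zero and the correction turns any initial estimate into the least-squares solution; in the high-dimensional regime, $\hat{M}$ is only an approximate inverse obtained from (23).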

Remark 5: (25) provides a method to construct confidence intervals for β*. Specifically if $\hat{σ}$ is a consistent estimator of σ, such as the scaled sparse group Lasso to be discussed in Section V,

[{\hat{β}}_{i}^{u} - Φ^{- 1} (1 - α / 2) \hat{σ} \sqrt{\frac{{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i}}{n}}, {\hat{β}}_{i}^{u} + Φ^{- 1} (1 - α / 2) \hat{σ} \sqrt{\frac{{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i}}{n}}]

would be an asymptotic (1 – α)-confidence interval for $β_{i}^{*}$ . We can see that the debiased sparse group Lasso estimator has a provable advantage in sample complexity (n ≫ (s log(es_gb) + s_g log(ed/s_g))²) over those based on the debiased Lasso (n ≫ s log p, see [30], [31], [33]) or the debiased group Lasso (n ≫ (s_gb + s_g log p)², see [32]) for constructing asymptotic confidence intervals for β*.
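As an illustration of Remark 5, the interval can be computed directly once ${\hat{β}}_{i}^{u}$, ${\hat{m}}_{i}$, $\hat{Σ}$, and $\hat{σ}$ are available. The sketch below is only a plug-in evaluation of the displayed formula; the function name `debiased_ci` and the argument name `level` (the α of Remark 5, renamed to avoid confusion with the thresholding level in (23)) are hypothetical.

```python
import math
from statistics import NormalDist

def debiased_ci(beta_u_i, m_i, Sigma_hat, sigma_hat, n, level=0.05):
    # Quadratic form m_i^T Sigma_hat m_i from the normalization in (25).
    p = len(m_i)
    quad = sum(m_i[a] * Sigma_hat[a][b] * m_i[b]
               for a in range(p) for b in range(p))
    # Standard normal quantile Phi^{-1}(1 - level / 2).
    z = NormalDist().inv_cdf(1.0 - level / 2.0)
    half_width = z * sigma_hat * math.sqrt(quad / n)
    return beta_u_i - half_width, beta_u_i + half_width
```

For example, with ${\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} = 1$, $\hat{σ} = 1$, and n = 100, the half-width at the 95% level is about 0.196.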

IV. Simulation Studies

In this section, we investigate the numerical performance of the sparse group Lasso and ℓ₁ +ℓ_1,2 minimization for double sparse regression. The results support our theoretical findings in Sections II and III. We first discuss the practical choice for the tuning parameters used in the proposed algorithms.

A. Practical Selection of Tuning Parameters

By introducing τ as a surrogate for (λ_g/λ)², we can rewrite the ℓ₁ + ℓ_1,2 minimization and the sparse group Lasso as

\hat{β} = arg min {‖ β ‖}_{1} + \sqrt{τ} {‖ β ‖}_{1, 2} subject to y = X β,

(26)

\hat{β} = \underset{β}{arg min} {‖ y - X β ‖}_{2}^{2} + λ {‖ β ‖}_{1} + λ \sqrt{τ} {‖ β ‖}_{1, 2} .

(27)
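To make the reparametrization concrete, the following sketch evaluates the penalty in (26) and the objective in (27) for a given β. It is an illustration only: groups are represented as lists of coordinate indices, and the function names are hypothetical.

```python
import math

def l1_norm(beta):
    return sum(abs(v) for v in beta)

def l12_norm(beta, groups):
    # Sum over groups of the groupwise l2 norms.
    return sum(math.sqrt(sum(beta[k] ** 2 for k in g)) for g in groups)

def sgl_penalty(beta, groups, tau):
    # Objective of (26): ||beta||_1 + sqrt(tau) * ||beta||_{1,2}.
    return l1_norm(beta) + math.sqrt(tau) * l12_norm(beta, groups)

def sgl_objective(y, X, beta, groups, lam, tau):
    # Objective of (27): ||y - X beta||_2^2 + lam * (||beta||_1 + sqrt(tau) * ||beta||_{1,2}).
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))) ** 2
              for i in range(len(y)))
    return rss + lam * sgl_penalty(beta, groups, tau)
```

For β = (3, 4, 0, 0) with two groups of size two and τ = 4, the penalty is 7 + 2 · 5 = 17.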

As suggested by Theorems 1 and 4, the theoretical choice of the tuning parameters (λ, τ) relies on σ, s, and s_g in sparse group Lasso and ℓ₁ + ℓ_1,2 minimization for double sparse regression. These values, however, are usually unknown in practice. In addition, those theoretical values of tuning parameters may not achieve the best finite-sample numerical performance. We thus introduce in this section a data-driven approach to tuning parameter selection using K-fold cross-validation.

We first discuss how to select τ in the ℓ₁ + ℓ_1,2 minimization (26). Recall that n is the sample size, p is the total number of covariates, d is the number of groups, b₁, …, b_d are the numbers of covariates in each group, and b = max_j b_j. Since the theoretical value is τ = s/s_g, and s/s_g must satisfy 1 ≤ s/s_g ≤ b, for a given integer L ≥ 2, we introduce a grid

S_{0} = {b^{(l - 1) / (L - 1)} : 1 \leq l \leq L}

(28)

as a set of candidate values for τ. Here, the grid size L can be set to a typical value of 10, or a larger value if more computing power is available. We split the data ${X_{i}, y_{i}}_{i = 1}^{n}$ into K groups. For 1 ≤ k ≤ K, let J_k ⊂ {1, …, n} be the index set of the kth group and $J_{k}^{c} = {1, \dots, n} \ J_{k}$ . For each τ ∈ S₀, we solve

{\hat{β}}^{(k)} (τ) = arg min {‖ β ‖}_{1} + \sqrt{τ} {‖ β ‖}_{1, 2} subject to y_{J_{k}^{c}} = X_{[J_{k}^{c}, :]} β

and calculate the prediction error

\hat{R} (τ) = \sum_{k = 1}^{K} \sum_{j \in J_{k}} {(y_{j} - X_{[j, :]} {\hat{β}}^{(k)} (τ))}^{2} .

Let τ_* be the minimizer of the prediction error: $τ_{*} = {arg min}_{τ \in S_{0}} \hat{R} (τ)$ . Then, the final estimator $\hat{β}$ is calculated using (26) with τ_*.
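The selection of τ described above can be sketched as follows. This is a schematic outline under stated assumptions: the solver for (26) is passed in as a user-supplied callable `fit(X_train, y_train, tau)` (hypothetical; any convex solver could play this role), folds are formed by index modulo K rather than at random, and the grid requires L ≥ 2 so that the exponent in (28) is defined.

```python
def tau_grid(b, L=10):
    # Grid S_0 in (28): b^((l-1)/(L-1)) for l = 1, ..., L, ranging from 1 to b.
    if L < 2:
        raise ValueError("the grid (28) needs L >= 2")
    return [b ** ((l - 1) / (L - 1)) for l in range(1, L + 1)]

def kfold_indices(n, K):
    # Deterministic split of {0, ..., n-1} into K folds (index modulo K).
    return [[i for i in range(n) if i % K == k] for k in range(K)]

def cv_select_tau(X, y, taus, K, fit):
    # Return the tau in `taus` minimizing the K-fold prediction error R_hat(tau).
    folds = kfold_indices(len(y), K)
    best_tau, best_err = None, float("inf")
    for tau in taus:
        err = 0.0
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            beta = fit([X[i] for i in train], [y[i] for i in train], tau)
            err += sum((y[j] - sum(x * c for x, c in zip(X[j], beta))) ** 2
                       for j in held_out)
        if err < best_err:
            best_tau, best_err = tau, err
    return best_tau
```

With b = 16 and L = 5, the grid is {1, 2, 4, 8, 16}, so the candidates are geometrically spaced between the two extreme values of s/s_g.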

Then we consider the sparse group Lasso (27), which includes two tuning parameters (τ, λ). We still define S₀ in (28) as a grid of candidate values of τ. Following the idea in [11, Section 3.3], for each τ ∈ S₀, we begin with a large value of λ_max(τ) so that $\hat{β}$ , the outcome of sparse group Lasso (27) with tuning parameters (τ, λ_max(τ)), is zero (this can be achieved by the SGL package¹). Let λ_min(τ) be a small fraction of λ_max(τ) (e.g., λ_min = 0.1λ_max as suggested in [11, Section 5]). Then we define

Λ (τ) = {{λ_{min} (τ)}^{(L - l) / (L - 1)} \cdot {λ_{max} (τ)}^{(l - 1) / (L - 1)} : l = 1, \dots, L} .
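The set Λ(τ) is a geometric interpolation between λ_min(τ) and λ_max(τ). A minimal sketch, assuming λ_max(τ) has already been obtained (for example from the SGL package, as noted above) and using the hypothetical function name `lambda_grid`:

```python
def lambda_grid(lam_max, ratio=0.1, L=10):
    # Lambda(tau): lam_min^((L-l)/(L-1)) * lam_max^((l-1)/(L-1)), l = 1, ..., L,
    # with lam_min = ratio * lam_max; consecutive values have a constant ratio.
    lam_min = ratio * lam_max
    return [lam_min ** ((L - l) / (L - 1)) * lam_max ** ((l - 1) / (L - 1))
            for l in range(1, L + 1)]
```

The first element equals λ_min(τ) and the last equals λ_max(τ), so the grid is equally spaced on the logarithmic scale.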

Next, we split the data ${X_{i}, y_{i}}_{i = 1}^{n}$ into K groups. For 1 ≤ k ≤ K, let J_k ⊂ {1, …, n} be the index set of the kth group and $J_{k}^{c} = {1, \dots, n} \ J_{k}$ . For each τ ∈ S₀, λ ∈ Λ(τ), and k ∈ {1, …, K}, we solve

{\hat{β}}^{(k)} (τ, λ) = \underset{β}{arg min} {{‖ y_{J_{k}^{c}} - X_{[J_{k}^{c}, :]} β ‖}_{2}^{2} + λ {‖ β ‖}_{1} + λ \sqrt{τ} {‖ β ‖}_{1, 2}}

and calculate the prediction error

\hat{R} (τ, λ) = \sum_{k = 1}^{K} \sum_{j \in J_{k}} {(y_{j} - X_{[j, :]} {\hat{β}}^{(k)} (τ, λ))}^{2} .

Let (τ_*, λ_*) be the minimizer of the prediction error:

(τ_{*}, λ_{*}) = \underset{τ \in S_{0}, λ \in Λ (τ)}{arg min} \hat{R} (τ, λ) .

The final estimator $\hat{β}$ is calculated using (27) with (τ_*, λ_*).

In our simulation studies next, we will examine the performance of this cross-validation scheme with K = L = 10, λ_min = 0.1λ_max.

B. Numerical Results

We begin by considering the sample complexity for the exact recovery in the noiseless case. Suppose all group sizes are equal (b₁ = ⋯ = b_d = b) and the number of observations n varies from 5 to 200. We consider four simulation designs with (1) d = 60, b = 20, s_g = 1; (2) d = 100, b = 30, s_g = 2; (3) d = b = 20, s_g = 1; and (4) d = b = 40, s_g = 1. For each setting, we randomly draw $X \in ℝ^{n \times d b}$ with i.i.d. standard normal entries, construct the fixed vector $β^{*} \in ℝ^{d b}$ satisfying

β_{(j)}^{*} = {\begin{array}{l} (1, 2, 3, 4, 5, 0, \dots, 0) \in ℝ^{b} & j = 1, \dots, s_{g}; \\ 0 & j = s_{g} + 1, \dots, d, \end{array}

and generate $y = X β^{*} = \sum_{j = 1}^{s_{g}} X_{(j)} β_{(j)}^{*}$ . We implement four methods: the ℓ₁ + ℓ_1,2 minimization (5) with $λ_{g} = \sqrt{s / s_{g}} λ$ (SGL); the ℓ₁ minimization (11) (Lasso); the ℓ_1,2 minimization (12) (Group Lasso); and the ℓ₁ + ℓ_1,2 minimization (5) with the tuning parameter λ_g/λ selected using the cross-validation scheme discussed in Section IV-A (SGL_CV). An exact recovery of β* is considered to be successful if ${‖ \hat{β} - β^{*} ‖}_{2} \leq 10^{- 4}$ . The successful recovery rate based on 100 replicates is shown in Figure 1. It can be seen that SGL and SGL_CV have comparable performance and that both methods perform significantly better than Lasso and Group Lasso. This is in line with our theoretical results.

Fig. 1. — Exact recovery rate in the noiseless case
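The noiseless design described above can be reproduced along the following lines. This sketch only generates the data and checks the success criterion; it does not solve any of the minimization problems. The function names are hypothetical, and the construction assumes b ≥ 5 so that the pattern (1, 2, 3, 4, 5, 0, …, 0) fits in one group.

```python
import math
import random

def make_beta_star(d, b, s_g):
    # First s_g groups equal (1, 2, 3, 4, 5, 0, ..., 0); remaining groups are zero.
    active = [1.0, 2.0, 3.0, 4.0, 5.0] + [0.0] * (b - 5)
    beta = []
    for j in range(d):
        beta.extend(active if j < s_g else [0.0] * b)
    return beta

def gaussian_design(n, p, rng):
    # n x p matrix with i.i.d. standard normal entries.
    return [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

def noiseless_response(X, beta):
    # y = X beta.
    return [sum(x * c for x, c in zip(row, beta)) for row in X]

def is_exact_recovery(beta_hat, beta_star, tol=1e-4):
    # Success criterion: ||beta_hat - beta_star||_2 <= tol.
    err = math.sqrt(sum((u - v) ** 2 for u, v in zip(beta_hat, beta_star)))
    return err <= tol
```

A recovery rate as in Fig. 1 would then be the fraction of replicates for which `is_exact_recovery` returns `True` for the estimator under study.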

Then we consider the noisy case and focus on average estimation errors of different methods. We generate

y = X β^{*} + ε = \sum_{j = 1}^{s_{g}} X_{(j)} β_{(j)}^{*} + ε,

where X, β* are drawn in the same way as the previous setting and $ε \overset{i i d}{~} N (0, {0.1}^{2})$ . We consider four designs: i. d = 60, b = 20, s_g = 1; ii. d = 100, b = 30, s_g = 2; iii. d = b = 20, s_g = 1; and iv. d = b = 40, s_g = 2. For each case, the number of observations n is chosen from an equally spaced sequence from 5 to 200 and the simulation is replicated 500 times. We compare the average estimation error of (a) SGL_CV1: sparse group Lasso with theoretical value $λ_{g} = \sqrt{s / s_{g}} λ$ and λ selected via cross validation; (b) SGL_package: sparse group Lasso via SGL package² in R with the option of automatic tuning parameter selection; (c) Lasso: regular Lasso with tuning parameter selected via cross validation; (d) group Lasso: group Lasso with tuning parameter selected via cross validation; (e) SGL_CV2: sparse group Lasso with both λ and λ_g selected using the proposed cross validation scheme. We can see that the proposed method SGL_CV2 achieves a smaller estimation error than all other methods, including SGL_CV1, the focus of our theory. These experimental results support our theory and demonstrate the applicability of the proposed cross-validation scheme.

V. Discussions

In this paper, we study the high-dimensional double sparse regression and investigate the theoretical properties of the sparse group Lasso and ℓ₁ + ℓ_1,2 minimization. Particularly, we develop the matching upper and lower bounds on the sample complexity for ℓ₁ + ℓ_1,2 minimization in the noiseless case. We also prove that the sparse group Lasso achieves minimax optimal rate of convergence in a range of settings in the noisy case. Our results give an affirmative answer to the open question for high-dimensional statistical inference for simultaneously structured model: by introducing both ℓ₁ and ℓ_1,2 penalties, one can achieve better performance on estimation and statistical inference for simultaneously element-wise and group-wise sparse vectors.

In addition to β*, the estimation and inference for the noise level σ is another important task in high-dimensional double sparse regression. Motivated by the recent development of scaled Lasso [54], one may consider the following scaled sparse group Lasso estimator:

{{\hat{β}}^{s}, \hat{σ}} = \underset{β \in ℝ^{p}, σ > 0}{arg min} {\frac{{‖ y - X β ‖}_{2}^{2}}{σ} + n σ + \tilde{λ} {‖ β ‖}_{1} + {\tilde{λ}}_{g} {‖ β ‖}_{1, 2}},

where $\tilde{λ}$ and ${\tilde{λ}}_{g}$ are tuning parameters that do not rely on σ. The consistency of $\hat{σ}$ can be established based on similar ideas of scaled Lasso in the literature [54], [31] and the approximate dual certificate in this work.
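To see why such a joint objective yields a noise-level estimate, one can profile out σ for fixed β. The short derivation below is a sketch based only on the objective as displayed above, writing R(β) for the residual sum of squares (a notation introduced here for convenience) and assuming R(β) > 0.

```latex
R(\beta) = \| y - X\beta \|_2^2, \qquad
\frac{\partial}{\partial \sigma}\Big( \frac{R(\beta)}{\sigma} + n\sigma \Big)
  = -\frac{R(\beta)}{\sigma^{2}} + n = 0
\;\Longrightarrow\;
\hat{\sigma}(\beta) = \sqrt{R(\beta)/n} = \frac{\| y - X\beta \|_2}{\sqrt{n}},
\qquad
\min_{\sigma > 0} \Big( \frac{R(\beta)}{\sigma} + n\sigma \Big) = 2\sqrt{n}\, \| y - X\beta \|_2 .
```

Hence, for fixed β, the minimizing σ is the root-mean-square residual, and the profiled problem in β has an unsquared ℓ₂ loss, which is consistent with the statement that the tuning parameters need not depend on σ.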

Moreover, our technical results can be useful in a variety of other problems with simultaneous sparsity structures. For example, [55], [56] considered the estimation of piece-wise constant sparse signals, i.e., both the signal vector and the difference between successive entries of the signal vector are sparse. [57], [58] discussed the estimation of structured parameters where both the number of non-zero elements and the number of distinct values of the parameter vectors are small. [59] considered the estimation of matrices with simultaneous sparsity structures within each block and among different blocks. It is interesting to further study the statistical limits, including the sample complexity and minimax optimal rate of convergence for these problems. In particular, based on the specific sparsity structures of each problem, we can introduce corresponding multi-objective regularizers and the convex regularization methods. The corresponding approximate dual certificates can be proposed, constructed, and analyzed to provide strong theoretical guarantees.

VI. Proofs

We collect the proofs of technical results in this section.

A. Proof of Lemma 1

Let T satisfy (16). For convenience, we denote s₀ = s/s_g and decompose u as

u = v + w, v_{i} = {\begin{array}{l} u_{i} - \sqrt{s_{0}} β_{i}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2}, & i \in T, i \in (j); \\ u_{i}, & i \in (G) \ T; \\ u_{i} - H_{1 / 2} (u_{i}), & i \in (G^{c}) . \end{array} w_{(j)} = {\begin{array}{l} \sqrt{s_{0}} β_{T, (j)}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2}, & j \in G; \\ H_{1 / 2} (u_{(j)}), & j \notin G . \end{array}

(29)

Note that |H_1/2(x) − x| ≤ 1/2 for any $x \in ℝ$ . Based on the property of (18), ‖u_(G)\T‖_∞ ≤ 1/2, then

max_{i \in T^{c}} | v_{i} | \leq 1 / 2, {‖ v_{T} - sgn (β_{T}^{*}) ‖}_{2} = {‖ u_{T} - {({\tilde{u}}_{0})}_{T} ‖}_{2} \leq \frac{c_{min}}{8 {max}_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2}};

(30)

w_{(j)} = \sqrt{s_{0}} β_{T, (j)}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2}, if j \in G; {‖ w_{(j)} ‖}_{2} \leq \sqrt{s_{0}} / 2, if j \notin G .

(31)

Suppose $\hat{β}$ is the minimizer to (5), $h = \hat{β} - β^{*}$ , then based on the sub-differential of ‖β‖₁ and ‖β‖_1,2, we have

𝒫 (\hat{β}) = {‖ \hat{β} ‖}_{1} + \sqrt{s_{0}} {‖ \hat{β} ‖}_{1, 2} = {‖ β^{*} + h ‖}_{1} + \sqrt{s_{0}} {‖ β^{*} + h ‖}_{1, 2} \geq {‖ β_{T}^{*} ‖}_{1} + sgn {(β_{T}^{*})}^{⊤} h_{T} + {‖ h_{T^{c}} ‖}_{1} + \sqrt{s_{0}} ({‖ β_{T}^{*} ‖}_{1, 2} + \sum_{j \in G} \frac{β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} + \sum_{j \notin G} {‖ h_{(j)} ‖}_{2}) - {‖ β_{T^{c}}^{*} ‖}_{1} - \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2} \geq 𝒫 (β^{*}) + {‖ h_{T^{c}} ‖}_{1} + \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} + sgn {(β_{T}^{*})}^{⊤} h_{T} + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} - 2 {‖ β_{T^{c}}^{*} ‖}_{1} - 2 \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2} .

(32)

The last inequality comes from ${‖ β^{*} ‖}_{1} = {‖ β_{T}^{*} ‖}_{1} + {‖ β_{T^{c}}^{*} ‖}_{1}$ and ${‖ β^{*} ‖}_{1, 2} \leq {‖ β_{T}^{*} ‖}_{1, 2} + {‖ β_{T^{c}}^{*} ‖}_{1, 2}$ .

In particular, given Xh = 0 and that u lies in the row span of X, we have v^⊤h + w^⊤h = u^⊤h = 0. Therefore,

sgn {(β_{T}^{*})}^{⊤} h_{T} + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} = sgn {(β_{T}^{*})}^{⊤} h_{T} - v^{⊤} h + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} - w^{⊤} h = - {(v_{T} - sgn (β_{T}^{*}))}^{⊤} h_{T} - v_{T^{c}}^{⊤} h_{T^{c}} - \sum_{j \in G} {(w_{(j)} - \sqrt{s_{0}} β_{T, (j)}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2})}^{⊤} h_{(j)} - {(w_{(G^{c})})}^{⊤} h_{(G^{c})} \geq - {‖ v_{T} - sgn (β_{T}^{*}) ‖}_{2} {‖ h_{T} ‖}_{2} - {‖ v_{T^{c}} ‖}_{\infty} \cdot {‖ h_{T^{c}} ‖}_{1} - max_{j \in G} {{‖ w_{(j)} - \sqrt{s_{0}} β_{T, (j)}^{*} / {‖ β_{T, (j)}^{*} ‖}_{2} ‖}_{2} \cdot {‖ h_{(G)} ‖}_{1, 2}} - {‖ w_{(G^{c})} ‖}_{\infty, 2} {‖ h_{(G^{c})} ‖}_{1, 2} \overset{(30) (31)}{\geq} - {‖ v_{T} - sgn (β_{T}^{*}) ‖}_{2} \cdot {‖ h_{T} ‖}_{2} - {‖ h_{T^{c}} ‖}_{1} / 2 - \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2.

(33)

Next, noting that $h = h_{T} + h_{T^{c}}$ and Xh = 0, we must have $X_{T} h_{T} = - X_{T^{c}} h_{T^{c}}$ ; then

{‖ h_{T} ‖}_{2} = {‖ {(X_{T}^{⊤} X_{T} / n)}^{- 1} X_{T}^{⊤} X_{T} h_{T} / n ‖}_{2} \leq σ_{min}^{- 1} (X_{T}^{⊤} X_{T} / n) {‖ X_{T}^{⊤} X_{T^{c}} h_{T^{c}} / n ‖}_{2} \leq \frac{2}{c_{min}} \cdot max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \cdot {‖ h_{T^{c}} ‖}_{1} .

(34)

Combining (30), (33), and (34), one obtains

sgn {(β_{T}^{*})}^{⊤} h_{T} + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} \geq - 3 / 4 \cdot {‖ h_{T^{c}} ‖}_{1} - \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2.

Plugging this inequality into (32), we finally have

𝒫 (\hat{β}) \geq 𝒫 (β^{*}) + {‖ h_{T^{c}} ‖}_{1} / 4 + \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2 - 2 {‖ β_{T^{c}}^{*} ‖}_{1} - 2 \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2} .
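For completeness, the constants in the last two displays can be traced as follows; this is a worked restatement of the combination of (30), (33), and (34), with the shorthand M for the maximal cross-correlation term (a symbol introduced here only for brevity).

```latex
M := \max_{i \in T^{c}} \| X_{T}^{\top} X_{i} / n \|_{2}, \qquad
\| v_{T} - \mathrm{sgn}(\beta_{T}^{*}) \|_{2} \, \| h_{T} \|_{2}
  \le \frac{c_{\min}}{8M} \cdot \frac{2M}{c_{\min}} \, \| h_{T^{c}} \|_{1}
  = \frac{1}{4} \| h_{T^{c}} \|_{1},
\qquad
-\Big( \tfrac{1}{4} + \tfrac{1}{2} \Big) \| h_{T^{c}} \|_{1} = -\tfrac{3}{4} \| h_{T^{c}} \|_{1},
\qquad
\Big( 1 - \tfrac{3}{4} \Big) \| h_{T^{c}} \|_{1} + \Big( 1 - \tfrac{1}{2} \Big) \sqrt{s_{0}} \, \| h_{(G^{c})} \|_{1,2}
  = \tfrac{1}{4} \| h_{T^{c}} \|_{1} + \tfrac{1}{2} \sqrt{s_{0}} \, \| h_{(G^{c})} \|_{1,2}.
```

The first bound uses (30) and (34), the second collects the two ‖h_{T^c}‖₁ terms in (33), and the third is what remains of the positive terms in (32).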

Since $\hat{β}$ is the minimizer to (5), we must have $𝒫 (\hat{β}) \leq 𝒫 (β^{*})$ , then

{‖ h_{T^{c}} ‖}_{1} / 4 + \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2 \leq 2 {‖ β_{T^{c}}^{*} ‖}_{1} + 2 \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2} .

(35)

If β* is (s, s_g)-sparse, immediately we have $h_{T^{c}} = 0$ . Then $0 = X_{T}^{⊤} X h = (X_{T}^{⊤} X_{T}) h_{T}$ . By $σ_{min} (X_{T}^{⊤} X_{T} / n) \geq c_{min} / 2$ , we know $X_{T}^{⊤} X_{T} / n$ is non-singular, then h_T = 0.

Now, we consider the general case. Without loss of generality, suppose G = {1, …, g}, where g ≤ s_g. Denote T₁ as the indices of the s largest entries (in absolute value, here and below) of h_(G)\T, T₂ as the indices of the s largest entries of $h_{(G) \ [T \cup T_{1}]}$ , and so on. For g + 1 ≤ i ≤ d, denote S_i,1 as the indices of the ⌊s/s_g⌋ largest entries of h_(i), S_i,2 as the indices of the ⌊s/s_g⌋ largest entries of $h_{(i) \ S_{i, 1}}$ , and so on. Let ${\tilde{S}}_{1}, \dots, {\tilde{S}}_{\sum_{i = g + 1}^{d} ⌈ b_{i} / ⌊ s / s_{g} ⌋ ⌉}$ be an arrangement of S_i,j (1 ≤ j ≤ ⌈b_i/⌊s/s_g⌋⌉, g + 1 ≤ i ≤ d) such that ${‖ h_{{\tilde{S}}_{1}} ‖}_{2}^{2} \geq \dots \geq {‖ h_{{\tilde{S}}_{\sum_{i = g + 1}^{d} ⌈ b_{i} / ⌊ s / s_{g} ⌋ ⌉}} ‖}_{2}^{2}$ . Let $R_{1} = \cup_{i = 1}^{s_{g}} {\tilde{S}}_{i}$ , $R_{2} = \cup_{i = s_{g} + 1}^{2 s_{g}} {\tilde{S}}_{i}$ , and so on. Then (T₁, T₂, …, R₁, R₂, …) is a partition of T^c, and |T_i|, |R_j| ≤ s, |g(T_i)|, |g(R_j)| ≤ s_g, where g(S) = {i₁, …, i_k} if $S \subseteq \cup_{j = 1}^{k} (i_{j})$ and S ∩ (i_j) are not empty for all 1 ≤ j ≤ k. Let $\tilde{T} = T \cup T_{1} \cup R_{1}$ . If (19) holds, then

\frac{c_{min}}{2} {‖ h_{\tilde{T}} ‖}_{2}^{2} \leq \frac{1}{n} {‖ X_{\tilde{T}} h_{\tilde{T}} ‖}_{2}^{2} = \frac{1}{n} 〈 X_{\tilde{T}} h_{\tilde{T}}, X h 〉 - \frac{1}{n} 〈 X_{\tilde{T}} h_{\tilde{T}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 .

(36)

Since Xh = 0, we have

〈 X_{\tilde{T}} h_{\tilde{T}}, X h 〉 = 0.

(37)

Now, we consider $| 〈 X_{\tilde{T}} h_{\tilde{T}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 |$ . By triangle inequality,

| 〈 X_{\tilde{T}} h_{\tilde{T}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq | 〈 X_{T} h_{T}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | + | 〈 X_{T_{1}} h_{T_{1}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | + | 〈 X_{R_{1}} h_{R_{1}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | .

The triangle inequality shows that

| 〈 X_{T} h_{T}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq \sum_{i \geq 2} | 〈 X_{T} h_{T}, X_{T_{i}} h_{T_{i}} 〉 | + \sum_{j \geq 2} | 〈 X_{T} h_{T}, X_{R_{j}} h_{R_{j}} 〉 | .

Combining the parallelogram identity with (19), we have

| 〈 X_{T} h_{T}, X_{T_{i}} h_{T_{i}} 〉 | \leq C_{max} n {‖ h_{T} ‖}_{2} {‖ h_{T_{i}} ‖}_{2}, | 〈 X_{T} h_{T}, X_{R_{j}} h_{R_{j}} 〉 | \leq C_{max} n {‖ h_{T} ‖}_{2} {‖ h_{R_{j}} ‖}_{2} .

Thus,

| 〈 X_{T} h_{T}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq C_{max} n {‖ h_{T} ‖}_{2} (\sum_{i \geq 2} {‖ h_{T_{i}} ‖}_{2} + \sum_{j \geq 2} {‖ h_{R_{j}} ‖}_{2}) .

(38)

By (3.10) in [37], we have

\sum_{i \geq 2} {‖ h_{T_{i}} ‖}_{2} \leq s^{- 1 / 2} {‖ h_{(G) \ T} ‖}_{1},

(39)

and

\sum_{j \geq 2} {‖ h_{R_{j}} ‖}_{2} = \sum_{j \geq 2} {(\sum_{i = (j - 1) s_{g} + 1}^{j s_{g}} {‖ h_{{\tilde{S}}_{i}} ‖}_{2}^{2})}^{1 / 2} \leq \sum_{j \geq 2} \sqrt{s_{g}} {‖ h_{{\tilde{S}}_{(j - 1) s_{g}}} ‖}_{2} \leq \sum_{j \geq 2} \sqrt{s_{g}} \sum_{i = (j - 2) s_{g} + 1}^{(j - 1) s_{g}} {‖ h_{{\tilde{S}}_{i}} ‖}_{2} / s_{g} = s_{g}^{- 1 / 2} \sum_{k} {‖ h_{{\tilde{S}}_{k}} ‖}_{2} = s_{g}^{- 1 / 2} \sum_{i = g + 1}^{d} \sum_{j} {‖ h_{S_{i, j}} ‖}_{2} .

For all g +1 ≤ i ≤ d, apply (3.10) in [37] again,

\sum_{j \geq 2} {‖ h_{S_{i, j}} ‖}_{2} \leq {(⌊ s / s_{g} ⌋)}^{- 1 / 2} {‖ h_{(i)} ‖}_{1} \leq \sqrt{2} {(s / s_{g})}^{- 1 / 2} {‖ h_{(i)} ‖}_{1} .

Moreover, by the definition of S_i,1,

\sum_{i = g + 1}^{d} {‖ h_{S_{i, 1}} ‖}_{2} \leq \sum_{i = g + 1}^{d} {‖ h_{(i)} ‖}_{2} = {‖ h_{(G^{c})} ‖}_{1, 2} .

Therefore,

\sum_{j \geq 2} {‖ h_{R_{j}} ‖}_{2} \leq s_{g}^{- 1 / 2} (\sum_{i = g + 1}^{d} \sqrt{2} {(s / s_{g})}^{- 1 / 2} {‖ h_{(i)} ‖}_{1}) + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2} = \sqrt{2} s^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2} .

(40)
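The block-wise bound (39), which is also the building block of (40), can be checked numerically. The sketch below sorts a vector by magnitude, splits it into consecutive blocks of size s (playing the role of T₁, T₂, … for the vector h_(G)\T), and compares the sum of the ℓ₂ norms of all blocks after the first with s^{−1/2} times the ℓ₁ norm. The function names are hypothetical, and the check is an illustration rather than a proof.

```python
import math
import random

def block_tail_sum(v, s):
    # Sort magnitudes in decreasing order, cut into blocks of size s,
    # and sum the l2 norms of the blocks T_2, T_3, ... (all but the first).
    mags = sorted((abs(x) for x in v), reverse=True)
    blocks = [mags[k:k + s] for k in range(0, len(mags), s)]
    return sum(math.sqrt(sum(x * x for x in blk)) for blk in blocks[1:])

def l1_bound(v, s):
    # Right-hand side of (39): s^{-1/2} * ||v||_1.
    return sum(abs(x) for x in v) / math.sqrt(s)
```

The inequality holds because every entry of a block is at most the average magnitude of the preceding block, so each block's ℓ₂ norm is at most s^{−1/2} times the ℓ₁ norm of its predecessor.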

Combining (38), (39), and (40), if (19) holds, we have

| 〈 X_{T} h_{T}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq (s^{- 1 / 2} {‖ h_{(G) \ T} ‖}_{1} + \sqrt{2} s^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) \cdot C_{max} n {‖ h_{T} ‖}_{2} \leq C_{max} n {‖ h_{T} ‖}_{2} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) .

Similarly, if (19) holds, then

| 〈 X_{T_{1}} h_{T_{1}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq C_{max} n {‖ h_{T_{1}} ‖}_{2} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2})

and

| 〈 X_{R_{1}} h_{R_{1}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq C_{max} n {‖ h_{R_{1}} ‖}_{2} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) .

Thus, with probability at least 1 – 2e^−cn,

| 〈 X_{\tilde{T}} h_{\tilde{T}}, X_{{\tilde{T}}^{c}} h_{{\tilde{T}}^{c}} 〉 | \leq C_{max} n ({‖ h_{T} ‖}_{2} + {‖ h_{T_{1}} ‖}_{2} + {‖ h_{R_{1}} ‖}_{2}) \cdot (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) \leq \sqrt{3} C_{max} n {‖ h_{\tilde{T}} ‖}_{2} \cdot (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) .

(41)

The last inequality holds due to the Cauchy-Schwarz inequality. Combining (36), (37), (41), and Lemma 3, we know that with probability at least 1 – 2e^−cn,

\frac{c_{min}}{2} {‖ h_{\tilde{T}} ‖}_{2}^{2} \leq \sqrt{3} C_{max} {‖ h_{\tilde{T}} ‖}_{2} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}),

i.e., with probability at least 1 – 2e^−cn,

{‖ h_{\tilde{T}} ‖}_{2} \leq 2 \sqrt{3} \frac{C_{max}}{c_{min}} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) .

Finally, by (35), (39), (40) and the previous inequality, with probability at least 1 – 2e^−cn,

‖ h ‖_{2} \leq {‖ h_{\tilde{T}} ‖}_{2} + \sum_{i \geq 2} {‖ h_{T_{i}} ‖}_{2} + \sum_{j \geq 2} {‖ h_{R_{j}} ‖}_{2} \leq 2 \sqrt{3} \frac{C_{max}}{c_{min}} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) + \sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2} \leq C (\frac{1}{\sqrt{s}} {‖ β_{T^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) .

In summary, we have finished the proof of this lemma. □

B. Proof of Lemma 2

Let T satisfy (16). Given ${‖ β_{T}^{*} ‖}_{0, 2} \leq s_{g}$ , without loss of generality we assume that

β_{T, (s_{g} + 1)}^{*}, \dots, β_{T, (d)}^{*} = 0.

We also denote T_(j) as the support of $β_{T, (j)}^{*}$ . First by Lemma 6 Part 3 with

v \in ℝ^{p}, v_{k} = {\begin{array}{l} 1, & k = i; \\ 0, & k \neq i; \end{array} U \in ℝ^{p \times | T |} = ℝ^{(\sum_{i = 1}^{d} b_{i}) \times | T |}, U_{[T, :]} = I; U_{[T^{c}, :]} = 0,

and noting that x log(eu/x) ≥ log(eu) for all 1 ≤ x ≤ u, we have

ℙ (max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \geq 1 / 2) \leq \sum_{i \in T^{c}} ℙ ({‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \geq 1 / 2) \leq \sum_{i \in T^{c}} ℙ ({‖ X_{T}^{⊤} X_{i} / n - E X_{T}^{⊤} X_{i} / n ‖}_{2} + {‖ E X_{T}^{⊤} X_{i} / n ‖}_{2} \geq 1 / 2) \leq \sum_{i \in T^{c}} ℙ ({‖ X_{T}^{⊤} X_{i} / n - E X_{T}^{⊤} X_{i} / n ‖}_{2} \geq 1 / 2 - ‖ Σ_{T, T} ‖ {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}) \leq \sum_{i \in T^{c}} ℙ ({‖ X_{T}^{⊤} X_{i} / n - E X_{T}^{⊤} X_{i} / n ‖}_{2} \geq 1 / 4) \leq d b \cdot C exp (C s - n) \leq C exp (log (d) + log (b) + C s - n) \leq C exp (s_{g} log (e d / s_{g}) + s log (e s_{g} b / s) + C s - n) \leq C exp (- c n)

(42)

provided that n ≥ C (s log(es_gb/s) + s_g log(d/s_g)) for some large constant C > 0. Note that the fourth inequality comes from the facts that ‖Σ_T,T‖ ≤ ‖Σ‖ ≤ C_max and ${‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} \leq c / \sqrt{s} \leq 1 / (4 C_{max})$ . By Lemma 7 Part 1, we also know

ℙ (σ_{min} (X_{T}^{⊤} X_{T} / n) \leq c_{min} / 2) \leq ℙ (‖ X_{T}^{⊤} X_{T} / n - Σ_{T, T} ‖ \geq c_{min} / 2) \leq ℙ (‖ X_{T}^{⊤} X_{T} Σ_{T, T}^{- 1} / n - I_{| T |} ‖ ‖ Σ_{T, T} ‖ \geq c_{min} / 2) \leq ℙ (‖ X_{T}^{⊤} X_{T} Σ_{T, T}^{- 1} / n - I_{| T |} ‖ \geq c_{min} / (2 C_{max})) \leq C exp (C s - c n) \leq C exp (- c n)

Next, we apply the well-regarded golfing scheme [50], [44] to find an approximate dual certificate u that satisfies (18). Let $u_{0} \in ℝ^{p}$ such that

{(u_{0})}_{(j)} = {\begin{array}{l} \frac{\sqrt{s / s_{g}} β_{T, (j)}^{*}}{{‖ β_{T, (j)}^{*} ‖}_{2}} + sgn (β_{T, (j)}^{*}), & j \in G; \\ 0, & j \in G^{c} . \end{array}

(43)

Immediately we have ${(u_{0})}_{T} = {({\tilde{u}}_{0})}_{T}$ . We divide the n rows of X into non-overlapping batches, say $X_{[I_{1}, :]}$ , $X_{[I_{2}, :]}$ , …, with |I_l| = n_l. Here, n₁, n₂, … will be specified later. Consider the following sequences

α_{0} = u_{0}, γ_{l} = X_{[I_{l}, :]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot {(α_{l - 1})}_{T}, α_{l} = α_{l - 1} - γ_{l}, l = 1, 2, \dots, l_{max} .

(44)

Finally the approximate dual certificate is defined as

u = \sum_{l = 1}^{l_{max}} γ_{l} = \sum_{l = 1}^{l_{max}} X_{[I_{l}, :]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot {(α_{l - 1})}_{T} .

(45)

From the inductive definition we can see

{(α_{l})}_{T} = (I - X_{[I_{l}, T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l}) {(α_{l - 1})}_{T}, {(γ_{l})}_{T^{c}} = X_{[I_{l}, T^{c}]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot {(α_{l - 1})}_{T}, l = 1, 2, \dots .
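The recursion (44) and the certificate (45) can be illustrated with a short numerical sketch. To keep the example free of matrix inversion, it assumes Σ_T,T = I, which is a simplification made only for this illustration and is not required by the proof; the function name `golfing_certificate` and the representation of batches as lists of row indices are likewise hypothetical.

```python
import random

def golfing_certificate(X, T, u0, batches):
    # Run alpha_l = alpha_{l-1} - gamma_l with
    # gamma_l = X_{[I_l,:]}^T X_{[I_l,T]} (alpha_{l-1})_T / n_l   (Sigma_{T,T} = I assumed),
    # and accumulate u = sum_l gamma_l as in (45).
    p = len(u0)
    alpha = list(u0)
    u = [0.0] * p
    for rows in batches:
        n_l = len(rows)
        q = [alpha[t] for t in T]                       # (alpha_{l-1})_T
        z = [sum(X[i][t] * qt for t, qt in zip(T, q))   # X_{[I_l,T]} q, one value per row
             for i in rows]
        gamma = [sum(X[i][j] * zi for i, zi in zip(rows, z)) / n_l
                 for j in range(p)]
        alpha = [a - g for a, g in zip(alpha, gamma)]
        u = [ui + g for ui, g in zip(u, gamma)]
    return u, alpha
```

By construction, each γ_l is a linear combination of rows of X, so u lies in the row span of X, and the sum telescopes to u = α₀ − α_{l_max}; the remainder of the proof controls how fast (α_l)_T shrinks and how large u can be off the support.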

Next, we apply the random matrix results (Lemmas 7 and 6) and obtain the following tail probabilities.

If n_l ≥ C s t_l for a large constant C > 0 and t_l ≥ C, then by Part 1 of Lemma 7,
$ℙ (‖ X_{[I_{l}, T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} - I_{| T |} ‖ \geq C \sqrt{\frac{s t_{l}}{n_{l}}}) \leq C exp (C s - n_{l} min {\frac{s t_{l}}{n_{l}}, {(\frac{s t_{l}}{n_{l}})}^{1 / 2}}) \leq C exp (- c s t_{l});$ (46)

Suppose $q_{l - 1} = {(α_{l - 1})}_{T} \in ℝ^{| T |}$ is independent of $X_{[I_{l}, :]}$ . If $n_{l} \geq \frac{C (s_{0} log (e s_{g} b / s) + log d)}{min {s_{0} δ_{l}^{2}, \sqrt{s_{0}} δ_{l}}}$ for $δ_{l} \geq C {max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} \geq C ({max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}) {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2} / {‖ q_{l - 1} ‖}_{2}$ , then by Lemma 7 Part 2,

ℙ (max_{j \in G^{c}} {‖ H_{{‖ q_{l - 1} ‖}_{2} δ_{l}} \cdot (X_{[I_{l}, (j)]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1}) ‖}_{2} \geq \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} δ_{l}) \leq \sum_{j \in G^{c}} ℙ ({‖ H_{{‖ q_{l - 1} ‖}_{2} δ_{l}} \cdot (X_{[I_{l}, (j)]}^{⊤} X_{[I_{l}, T]} (Σ_{T, T}^{- 1} q_{l - 1}) / n_{l}) ‖}_{2} \geq \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} δ_{l}) \leq d \cdot (\begin{matrix} b \\ ⌈ s_{0} ⌉ \end{matrix}) \cdot exp (C s_{0} - c n_{l} min {\frac{s_{0} {‖ q_{l - 1} ‖}_{2}^{2} δ_{l}^{2}}{κ^{4} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}^{2}}, \frac{\sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} δ_{l}}{κ^{2} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}}}) + d \cdot (\begin{matrix} b \\ ⌊ s_{0} ⌋ \end{matrix}) \cdot exp (C s_{0} - c n_{l} min {\frac{s_{0} {‖ q_{l - 1} ‖}_{2}^{2} δ_{l}^{2}}{κ^{4} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}^{2}}, \frac{\sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} δ_{l}}{κ^{2} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}}}) \leq 2 d \cdot {(\frac{e b}{⌊ s_{0} ⌋})}^{⌈ s_{0} ⌉} \cdot exp (C s_{0} - c n_{l} min {s_{0} δ_{l}^{2}, \sqrt{s_{0}} δ_{l}}) \leq C exp (log (d) + C s_{0} log (2 e s_{g} b / s) + C s_{0} - c n_{l} min {s_{0} δ_{l}^{2}, \sqrt{s_{0}} δ_{l}}) \leq C exp (- c n_{l} min {s_{0} δ_{l}^{2}, \sqrt{s_{0}} δ_{l}});

(47)

The third inequality comes from $‖ Σ_{T, T}^{- 1} ‖ \leq \frac{1}{c_{min}}$ .

Suppose $q_{l - 1} = {(α_{l - 1})}_{T} \in ℝ^{| T |}$ is fixed. If $n_{l} min {θ_{l}^{2}, θ_{l}} \geq C log (e s_{g} b)$ and $θ_{l} \geq 2 {max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}$ , then by Lemma 6 Part 2,

ℙ ({‖ X_{[I_{l}, (G) \ T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1} ‖}_{\infty} \geq θ_{l} {‖ q_{l - 1} ‖}_{2}) \leq \sum_{i \in (G) \ T} ℙ (| X_{[I_{l}, i]}^{⊤} X_{[I_{l}, T]} / n_{l} \cdot (Σ_{T, T}^{- 1} q_{l - 1}) | \geq θ_{l} {‖ q_{l - 1} ‖}_{2}) \leq \sum_{i \in (G) \ T} ℙ (| X_{[I_{l}, i]}^{⊤} X_{[I_{l}, T]} / n_{l} \cdot (Σ_{T, T}^{- 1} q_{l - 1}) - Σ_{i, T} Σ_{T, T}^{- 1} q_{l - 1} | \geq θ_{l} {‖ q_{l - 1} ‖}_{2} - | Σ_{i, T} Σ_{T, T}^{- 1} q_{l - 1} |) \leq \sum_{i \in (G) \ T} ℙ (| X_{[I_{l}, i]}^{⊤} X_{[I_{l}, T]} / n_{l} \cdot (Σ_{T, T}^{- 1} q_{l - 1}) - Σ_{i, T} Σ_{T, T}^{- 1} q_{l - 1} | \geq θ_{l} {‖ q_{l - 1} ‖}_{2} - {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} {‖ q_{l - 1} ‖}_{2}) \leq \sum_{i \in (G) \ T} ℙ (| X_{[I_{l}, i]}^{⊤} X_{[I_{l}, T]} / n_{l} \cdot (Σ_{T, T}^{- 1} q_{l - 1}) - Σ_{i, T} Σ_{T, T}^{- 1} q_{l - 1} | \geq \frac{1}{2} θ_{l} {‖ q_{l - 1} ‖}_{2}) \leq \sum_{i \in (G) \ T} ℙ (| X_{[I_{l}, i]}^{⊤} X_{[I_{l}, T]} / n_{l} \cdot (Σ_{T, T}^{- 1} q_{l - 1}) - Σ_{i, T} Σ_{T, T}^{- 1} q_{l - 1} | \geq \frac{c_{min}}{2} θ_{l} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}) \leq s_{g} b \cdot C exp (- c n_{l} min {θ_{l}^{2}, θ_{l}}) = C exp (log (s_{g} b) - c n_{l} min {θ_{l}^{2}, θ_{l}}) \leq C exp (- c n_{l} min {θ_{l}^{2}, θ_{l}}) .

(48)

Then we specify {n_l, t_l, δ_l, θ_l}_{l ≥ 1} as follows,

n₁ = n₂ ≥ C(s log(es_gb) + s_g log(d/s_g)), t₁ = t₂ = cn₁/(s log(es)) ≥ C, $δ_{1} = δ_{2} = 1 / (16 \sqrt{s})$ , $θ_{1} = θ_{2} = 1 / (16 \sqrt{s})$ ;
$n_{3} = \dots = n_{l_{max}} ≍ \frac{n_{1}}{l_{max} - 2} \geq C (s log (e s_{g} b) + s_{g} log (d / s_{g})) / log (e s)$ , $t_{3} = \dots = t_{l_{max}} = c n_{3} / s \geq C$ , $δ_{3} = \dots = δ_{l_{max}} = log (e s) / (16 \sqrt{s}) \geq max {{(log (e s) / s)}^{1 / 2} / 16, log (e s) \sqrt{s_{0}} / (16 s)}$ , $θ_{3} = \dots = θ_{l_{max}} = {(log (e s) / s)}^{1 / 2} / 16$ , with l_max = ⌈C log(es)⌉ + 2.

We can see that the following events hold

‖ X_{[I_{l}, T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} - I_{| T |} ‖ \leq C \sqrt{s t_{l} / n_{l}} \leq \sqrt{1 / log (e s)}, l = 1, 2; ‖ X_{[I_{l}, T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} - I_{| T |} ‖ \leq C \sqrt{s t_{l} / n_{l}} \leq 1 / 2, l = 3, \dots, l_{max};

(49)

max_{j \in G^{c}} {‖ H_{{‖ q_{l - 1} ‖}_{2} / (16 \sqrt{s})} \cdot (X_{[I_{l}, (j)]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1}) ‖}_{2} \leq \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} / (16 \sqrt{s}), l = 1, 2; max_{j \in G^{c}} {‖ H_{{‖ q_{l - 1} ‖}_{2} \cdot log (e s) / (16 \sqrt{s})} \cdot (X_{[I_{l}, (j)]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1}) ‖}_{2} \leq \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} log (e s) / (16 \sqrt{s}), l = 3, \dots, l_{max};

(50)

{‖ X_{[I_{l}, (G) \ T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1} ‖}_{\infty} \leq {‖ q_{l - 1} ‖}_{2} / (16 \sqrt{s}), l = 1, 2; {‖ X_{[I_{l}, (G) \ T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1} ‖}_{\infty} \leq {‖ q_{l - 1} ‖}_{2} \cdot {(log (e s) / s)}^{1 / 2} / 16, l = 3, \dots, l_{max} .

(51)

with probability at least $1 - C log (e s) exp (- c \frac{n}{log (e s)}) - C log (e s) exp (- c \frac{n}{s_{g}}) - C log (e s) exp (- c \frac{n}{s})$ . By triangle inequality, u₀ satisfies

{‖ u_{0} ‖}_{2} \leq \sqrt{s / s_{g}} {(\sum_{j \in G} {‖ \frac{β_{T, (j)}^{*}}{{‖ β_{T, (j)}^{*} ‖}_{2}} ‖}_{2}^{2})}^{1 / 2} + {‖ sgn (β_{T}^{*}) ‖}_{2} \leq 2 \sqrt{s} .

(52)

When ${max}_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \leq \frac{1}{2}$ and (49)–(52) hold, we have

{‖ q_{0} ‖}_{2} \leq 2 \sqrt{s}, {‖ q_{1} ‖}_{2} = ‖ (I_{| T |} - X_{I_{1}, T}^{⊤} X_{I_{1}, T} Σ_{T, T}^{- 1} / n_{1}) q_{0} ‖ \leq ‖ I_{| T |} - X_{I_{1}, T}^{⊤} X_{I_{1}, T} Σ_{T, T}^{- 1} / n_{1} ‖ \cdot {‖ q_{0} ‖}_{2} \leq 2 \sqrt{s / log (e s)}; similarly, {‖ q_{2} ‖}_{2} \leq {‖ q_{1} ‖}_{2} / \sqrt{log (e s)} \leq 2 \sqrt{s} / (log (e s)); {‖ q_{l} ‖}_{2} \leq {‖ q_{l - 1} ‖}_{2} / 2 \leq \dots \leq ‖ q_{2} ‖ / 2^{l - 2} \leq 2^{3 - l} \sqrt{s} / (log (e s)), l \geq 3.

(53)

For large constant $C > 0, {‖ q_{l_{max}} ‖}_{2} \leq 2^{3 - C log (e s)} \sqrt{s} / log (e s) \leq c_{min} / 8$ . Notice that

u_{T} = {(\sum_{l = 1}^{l_{max}} γ_{l})}_{T} = {(\sum_{l = 1}^{l_{max}} (α_{l - 1} - α_{l}))}_{T} = {(α_{0} - α_{l_{max}})}_{T} = {({\tilde{u}}_{0})}_{T} - {(q_{l_{max}})}_{T},

we know that

{‖ u_{T} - {({\tilde{u}}_{0})}_{T} ‖}_{2} \cdot max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} = {‖ q_{l_{max}} ‖}_{2} \cdot max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} \leq \frac{c_{min}}{8} \cdot \frac{1}{2} < \frac{c_{min}}{8} .

In addition,

{‖ u_{(G) \ T} ‖}_{\infty} \leq \sum_{l = 1}^{l_{max}} {‖ X_{[I_{l}, (G) \ T]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot {(α_{l - 1})}_{T} ‖}_{\infty} \leq {‖ q_{0} ‖}_{2} / (16 \sqrt{s}) + {‖ q_{1} ‖}_{2} / (16 \sqrt{s}) + \sum_{l = 3}^{l_{max}} {‖ q_{l - 1} ‖}_{2} \cdot {(log (e s) / s)}^{1 / 2} / 16 \leq 1 / 8 + 1 / 8 + \sum_{l = 3}^{\infty} 2^{4 - l} / 16 \leq 1 / 2.
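The arithmetic behind the bound 1/2 can be verified numerically from the upper bounds on ‖q_l‖₂ in (53). The sketch below plugs those bounds into the three groups of terms of the display above; `q_bounds` and `off_support_bound` are hypothetical helper names, and l_max is treated as a free parameter (at least 2) rather than set to ⌈C log(es)⌉ + 2.

```python
import math

def q_bounds(s, l_max):
    # Upper bounds on ||q_l||_2 for l = 0, ..., l_max taken from (53).
    log_es = math.log(math.e * s)
    q = [2.0 * math.sqrt(s), 2.0 * math.sqrt(s / log_es)]
    q += [2.0 ** (3 - l) * math.sqrt(s) / log_es for l in range(2, l_max + 1)]
    return q

def off_support_bound(s, l_max):
    # ||q_0||/(16 sqrt(s)) + ||q_1||/(16 sqrt(s))
    #   + sum_{l=3}^{l_max} ||q_{l-1}|| * sqrt(log(es)/s) / 16, with (53) plugged in.
    log_es = math.log(math.e * s)
    q = q_bounds(s, l_max)
    first_two = (q[0] + q[1]) / (16.0 * math.sqrt(s))
    rest = sum(q[l - 1] * math.sqrt(log_es / s) / 16.0
               for l in range(3, l_max + 1))
    return first_two + rest
```

Since log(es) ≥ 1 for s ≥ 1, the first two terms contribute at most 1/8 each and the geometric tail at most 1/4, which gives the stated total of 1/2.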

Since

{‖ q_{0} ‖}_{2} / (16 \sqrt{s}) + {‖ q_{1} ‖}_{2} / (16 \sqrt{s}) + \sum_{l = 3}^{l_{max}} {‖ q_{l - 1} ‖}_{2} \cdot log (e s) / (16 \sqrt{s}) \leq \frac{1}{8} + \frac{1}{8} + \sum_{l = 3}^{l_{max}} 2^{4 - l} \sqrt{s} / (log (e s)) \cdot log (e s) / (16 \sqrt{s}) \leq \frac{1}{2},

{‖ H_{1 / 2} (u_{(G^{c})}) ‖}_{\infty, 2} \leq {‖ H_{ν} (u_{(G^{c})}) ‖}_{\infty, 2} \leq \sum_{l = 3}^{l_{max}} {‖ H_{{‖ q_{l - 1} ‖}_{2} \cdot log (e s) / (16 \sqrt{s})} (X_{[I_{l}, (G^{c})]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1}) ‖}_{\infty, 2} + \sum_{l = 1}^{2} {‖ H_{{‖ q_{l - 1} ‖}_{2} / (16 \sqrt{s})} (X_{[I_{l}, (G^{c})]}^{⊤} X_{[I_{l}, T]} Σ_{T, T}^{- 1} / n_{l} \cdot q_{l - 1}) ‖}_{\infty, 2} \leq \sum_{l = 1}^{2} \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} / (16 \sqrt{s}) + \sum_{l = 3}^{l_{max}} \sqrt{s_{0}} {‖ q_{l - 1} ‖}_{2} \cdot log (e s) / (16 \sqrt{s}) \leq \sqrt{s_{0}} / 2,

where

ν = {‖ q_{0} ‖}_{2} / (16 \sqrt{s}) + {‖ q_{1} ‖}_{2} / (16 \sqrt{s}) + \sum_{l = 3}^{l_{max}} {‖ q_{l - 1} ‖}_{2} \cdot log (e s) / (16 \sqrt{s}) .

Thus, the construction of u satisfies all the required conditions in Lemma 1 with probability at least $1 - C exp (- c \frac{n}{s})$ . This completes the proof of the lemma. □

C. Proof of Lemma 3

Let g(S) be the group support of set S, that is, g(S) = {i₁, …, i_k} if $S \subset \cup_{j = 1}^{k} (i_{j})$ and S ∩ (i_j) are not empty for all 1 ≤ j ≤ k. Lemma 7 Part 1 and the union bound show that

ℙ (\exists γ \in ℝ^{p} : {‖ γ ‖}_{0} \leq 2 s, {‖ γ ‖}_{0, 2} \leq 2 s_{g}, \frac{1}{n} {‖ X γ ‖}_{2}^{2} \notin [\frac{c_{min}}{2} {‖ γ ‖}_{2}^{2}, (C_{max} + \frac{c_{min}}{2}) {‖ γ ‖}_{2}^{2}]) = ℙ (\exists x \in ℝ^{2 s \land p}, S \subseteq {1, \dots, p} : | S | = 2 s \land p, | g (S) | \leq 2 s_{g}, \frac{1}{n} {‖ X_{S} x ‖}_{2}^{2} \notin [\frac{c_{min}}{2} {‖ x ‖}_{2}^{2}, (C_{max} + \frac{c_{min}}{2}) {‖ x ‖}_{2}^{2}]) \leq \sum_{S \subseteq {1, \dots, p}, | S | = 2 s \land p, | g (S) | \leq 2 s_{g}} ℙ (\exists x \in ℝ^{2 s \land p} : \frac{1}{n} {‖ X_{S} x ‖}_{2}^{2} \notin [\frac{c_{min}}{2} {‖ x ‖}_{2}^{2}, (C_{max} + \frac{c_{min}}{2}) {‖ x ‖}_{2}^{2}]) \leq \sum_{S \subseteq {1, \dots, p}, | S | = 2 s \land p, | g (S) | \leq 2 s_{g}} ℙ (‖ \frac{1}{n} X_{S}^{⊤} X_{S} - Σ_{S, S} ‖ \geq \frac{c_{min}}{2}) \leq \sum_{S \subseteq {1, \dots, p}, | S | = 2 s \land p, | g (S) | \leq 2 s_{g}} ℙ (‖ \frac{1}{n} X_{S}^{⊤} X_{S} Σ_{S, S}^{- 1} - I_{| S |} ‖ \geq \frac{c_{min}}{2 C_{max}}) \leq [(\begin{matrix} d \\ 2 s_{g} \end{matrix}) \lor 1] (\begin{matrix} 2 s_{g} b \\ 2 s \end{matrix}) \cdot 2 exp (C s - c n) \leq {(\frac{e d}{2 s_{g}})}^{2 s_{g}} {(\frac{e \cdot 2 s_{g} b}{2 s})}^{2 s} \cdot 2 exp (C s - c n) \leq 2 exp (2 s log (e s_{g} b / s) + 2 s_{g} log (e d / s_{g}) + C s - c n) \leq 2 e^{- c n} .

□
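The counting step in the union bound above can be checked numerically. The following is a minimal sketch, not part of the original argument: the function names and the problem sizes are hypothetical, and the only fact used is the standard bound C(n, k) ≤ (en/k)^k.

```python
import math

def support_count(d, b, s, s_g):
    """Number of terms in the union bound: [C(d, 2 s_g) v 1] * C(2 s_g b, 2 s)."""
    return max(math.comb(d, 2 * s_g), 1) * math.comb(2 * s_g * b, 2 * s)

def log_count_bound(d, b, s, s_g):
    """Logarithm of (e d / (2 s_g))^(2 s_g) * (e s_g b / s)^(2 s)."""
    return (2 * s_g * math.log(math.e * d / (2 * s_g))
            + 2 * s * math.log(math.e * s_g * b / s))

# Hypothetical sizes, chosen only for illustration.
d, b, s, s_g = 50, 20, 12, 3
log_count = math.log(support_count(d, b, s, s_g))
log_bound = log_count_bound(d, b, s, s_g)
print(log_count, log_bound)
```

The logarithm of the number of admissible supports is what has to be dominated by cn for the final bound 2e^{−cn} to hold.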

D. Proof of Theorem 2

If d ≥ 3s_g and b ≥ 3s/s_g, by (80), we can find Ω⁽¹⁾, …, Ω^(N) ⊂ {1, …, db} such that |Ω⁽ⁱ⁾| = s_g⌊s/s_g⌋, $| Ω_{(k)}^{(i)} | = ⌊ s / s_{g} ⌋ 1_{{Ω_{(k)}^{(i)} is not empty}}$ for all 1 ≤ i ≤ N, 1 ≤ k ≤ d, and

| Ω^{(i)} \cap Ω^{(j)} | \leq 8 s_{g} ⌊ s / s_{g} ⌋ / 9, 1 \leq i \neq j \leq N,

(54)

| {k : | Ω_{(k)}^{(i)} \cap Ω_{(k)}^{(j)} | \geq 2 ⌊ s / s_{g} ⌋ / 3} | \leq 2 s_{g} / 3, 1 \leq i \neq j \leq N,

(55)

where $N = ⌊ {(\frac{d}{2 \sqrt{2} s_{g}})}^{s_{g} / 3} {(\frac{b}{2 \sqrt{2} ⌊ s / s_{g} ⌋})}^{s / 9} ⌋$ . For any 1 ≤ j ≤ db, 1 ≤ i ≤ N, define

β_{j}^{(i)} = {\begin{array}{l} \frac{1}{λ s_{g} ⌊ s / s_{g} ⌋ + λ_{g} s_{g} \sqrt{⌊ s / s_{g} ⌋}}, & j \in Ω^{(i)} \\ 0 & j \notin Ω^{(i)}, \end{array}

then ∥β⁽ⁱ⁾∥₀ ≤ s, ∥β⁽ⁱ⁾∥_0,2 ≤ s_g. We consider the quotient space

ℝ^{d b} / ker (X) = {[x] ≔ x + ker (X), x \in ℝ^{d b}} .

Then the dimension of $ℝ^{d b} / ker (X)$ is rank(X) ≤ n. Define the norm ∥[x]∥ = inf_{v∈ker(X)} {λ∥x − v∥₁ + λ_g∥x − v∥_1,2}. For any vector $x \in ℝ^{d b}$ satisfying ∥x∥₀ ≤ 2s, ∥x∥_0,2 ≤ 2s_g, every x − v with v ∈ ker(X) satisfies X(x − v) = Xx. Since, by our assumption, x is exactly recovered by the ℓ₁ + ℓ_1,2 minimization, it minimizes the objective among all such vectors, and we have ∥[x]∥ = λ∥x∥₁ + λ_g∥x∥_1,2. Thus ∥[β⁽¹⁾]∥ = ⋯ = ∥[β^(N)]∥ = 1. Moreover, by (54) and (55),

{‖ β^{(i)} - β^{(j)} ‖}_{1} = \frac{1}{λ s_{g} ⌊ s / s_{g} ⌋ + λ_{g} s_{g} \sqrt{⌊ s / s_{g} ⌋}} \cdot (| Ω^{(i)} | + | Ω^{(j)} | - 2 | Ω^{(i)} \cap Ω^{(j)} |) \geq \frac{2 s_{g} ⌊ s / s_{g} ⌋}{9 (λ s_{g} ⌊ s / s_{g} ⌋ + λ_{g} s_{g} \sqrt{⌊ s / s_{g} ⌋})},

and

{‖ β^{(i)} - β^{(j)} ‖}_{1, 2} = \sum_{k = 1}^{d} {‖ β_{(k)}^{(i)} - β_{(k)}^{(j)} ‖}_{2} \geq \sum_{k \in S_{i, j}} {‖ β_{(k)}^{(i)} - β_{(k)}^{(j)} ‖}_{2} \geq \frac{1}{λ s_{g} ⌊ s / s_{g} ⌋ + λ_{g} s_{g} \sqrt{⌊ s / s_{g} ⌋}} \sqrt{\frac{2 ⌊ s / s_{g} ⌋}{3}} \cdot | S_{i, j} | \geq \frac{1}{λ s_{g} ⌊ s / s_{g} ⌋ + λ_{g} s_{g} \sqrt{⌊ s / s_{g} ⌋}} \sqrt{\frac{2 ⌊ s / s_{g} ⌋}{3}} \cdot \frac{s_{g}}{3},

where

S_{i, j} = {k : Ω_{(k)}^{(i)}, Ω_{(k)}^{(j)} are not empty sets, | Ω_{(k)}^{(i)} \cap Ω_{(k)}^{(j)} | < 2 ⌊ s / s_{g} ⌋ / 3} .

Since β⁽ⁱ⁾ − β^(j) is (2s, 2s_g)-sparse,

‖ [β^{(i)}] - [β^{(j)}] ‖ = ‖ [β^{(i)} - β^{(j)}] ‖ = λ {‖ β^{(i)} - β^{(j)} ‖}_{1} + λ_{g} {‖ β^{(i)} - β^{(j)} ‖}_{1, 2} \geq 2 / 9.
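The normalization of the packing vectors can be illustrated with a small numerical sketch. Everything below is an assumption made for illustration: the sizes, the values of λ and λ_g, and the two supports (taken on disjoint groups, which trivially satisfy (54) and (55)). The code evaluates λ∥·∥₁ + λ_g∥·∥_1,2 on the representatives themselves, which coincides with the quotient norm only under the exact-recovery assumption used in the proof.

```python
import numpy as np

def sgl_norm(x, groups, lam, lam_g):
    """lam * ||x||_1 + lam_g * ||x||_{1,2} for a vector x split into groups."""
    l1 = np.abs(x).sum()
    l12 = sum(np.linalg.norm(x[g]) for g in groups)
    return lam * l1 + lam_g * l12

# Hypothetical sizes and weights.
d, b, s, s_g = 8, 10, 12, 3
m = s // s_g                      # floor(s / s_g) nonzeros per active group
lam, lam_g = 0.7, 1.3
groups = [np.arange(k * b, (k + 1) * b) for k in range(d)]
tau = 1.0 / (lam * s_g * m + lam_g * s_g * np.sqrt(m))

def packing_vector(active_groups):
    """Vector with m entries equal to tau in each active group, zero elsewhere."""
    beta = np.zeros(d * b)
    for k in active_groups:
        beta[groups[k][:m]] = tau
    return beta

beta_i = packing_vector([0, 1, 2])
beta_j = packing_vector([3, 4, 5])
norm_i = sgl_norm(beta_i, groups, lam, lam_g)
norm_diff = sgl_norm(beta_i - beta_j, groups, lam, lam_g)
print(norm_i, norm_diff)
```

Each vector has norm one by construction, and the separation 2/9 in the display is a worst-case bound over all admissible pairs of supports.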

By [46, Proposition C.3], we have N ≤ 10^rank(X) ≤ 10ⁿ. Therefore we have

⌊ {(\frac{d}{2 \sqrt{2} s_{g}})}^{s_{g} / 3} {(\frac{b}{2 \sqrt{2} ⌊ s / s_{g} ⌋})}^{s / 9} ⌋ \leq 10^{n},

which means that n ≥ c(s_g log(d/s_g) + s log(es_gb/s)).

If d < 3s_g or b < 3s/s_g, let $s_{g}^{'} = [s_{g} / 3] \lor 1 \geq s_{g} / 5$ , $s^{'} = [s / 15] \lor s_{g}^{'}$ , then $d \geq 3 s_{g}^{'}$ and $b \geq 3 s^{'} / s_{g}^{'}$ . Since all (2s, 2s_g)-sparse vectors can be exactly recovered by the ℓ₁ + ℓ_1,2 minimization and s′ ≤ s, $s_{g}^{'} \leq s_{g}$ , we know that the ℓ₁ + ℓ_1,2 minimization exactly recovers all (2s′, $2 s_{g}^{'}$ )-sparse vectors. Therefore, we have

n \geq c (s_{g}^{'} log (d / s_{g}^{'}) + s^{'} log (e s_{g}^{'} b / s^{'})) \geq c (\frac{s_{g}}{5} \cdot log (\frac{d}{s_{g}}) + \frac{s}{15} \cdot log (\frac{e b (s_{g} / 5)}{s / 15} \lor e b)) \geq c^{'} (s_{g} log (d / s_{g}) + s log (e s_{g} b / s)) .

(56)

□

E. Proof of Theorem 3

We prove Theorem 3 by contradiction. Let

c = min {\frac{1}{8}, c^{'}, \sqrt{\frac{c^{'}}{256}}}, c_{0} = min {\frac{c}{2 e}, \frac{c^{2}}{2 C^{2}}, 16 c^{2}}, C_{0} = max {\frac{C^{2}}{c^{2}}, \frac{1}{32 c^{2}}},

where c′ is a uniform constant such that n ≥ c′(s log(es_gb/s) + s_g log(d/s_g)) if the conditions in Theorem 2 are satisfied. Assume for contradiction that

n < c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g})) .

(57)

Let s₀ = s/s_g, define the norm $‖ \cdot ‖ = {‖ \cdot ‖}_{1} + \sqrt{s_{0}} {‖ \cdot ‖}_{1, 2}$ . Let $B = {x \in ℝ^{p} ∣ ‖ x ‖ \leq 1}$ ,

d^{n} (B, ℝ^{p}) = inf_{L^{n} is a subspace of ℝ^{p} with dim (ℝ^{p} / L^{n}) \leq n} {sup_{β \in B \cap L^{n}} {‖ β ‖}_{2}} .

By [46, Theorem 10.4], we have

d^{n} (B, ℝ^{p}) \leq sup_{β \in B} {‖ β - Δ (X β) ‖}_{2} \leq \frac{C}{\sqrt{s}} sup_{β \in B} ({‖ β ‖}_{1} + \sqrt{s_{0}} {‖ β ‖}_{1, 2}) = \frac{C}{\sqrt{s}} .

(58)

Suppose for the moment that, under (57),

d^{n} (B, ℝ^{p}) \geq c min {\frac{1}{\sqrt{s_{0}}}, {[(\frac{s_{g}}{s} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n}) + log (e s_{g} b / s)) / n]}^{1 / 2}},

(59)

since

\frac{C}{\sqrt{s}} \leq \frac{c \sqrt{C_{0}}}{\sqrt{s}} \leq \frac{c \sqrt{s_{g}}}{\sqrt{s}} = \frac{c}{\sqrt{s_{0}}},

(58) and (59) together imply that

n \geq \frac{c^{2}}{C^{2}} (s_{g} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n}) + s log (e s_{g} b / s)) .

(60)

By (57),

\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n} > \frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g}))} \geq 2 e \frac{\frac{s}{s_{g}} d log (e s_{g} b / s)}{s log (e s_{g} b / s) + s_{g} log (d / s_{g})} \geq min {e \frac{\frac{s}{s_{g}} d log (e s_{g} b / s)}{s log (e s_{g} b / s)}, \frac{e \frac{s}{s_{g}} d log (e s_{g} b / s)}{s_{g} log (e d / s_{g})}} \geq min {\frac{e d}{s_{g}}, \frac{\frac{e d}{s_{g}}}{log (\frac{e d}{s_{g}})}} \geq {(\frac{e d}{s_{g}})}^{1 / 2} .

(61)

In the last inequality, we used $x^{1 / 2} \geq log (x) / 2$ for all x ≥ 1. Combining (60) and (61), we have

n \geq \frac{c^{2}}{2 C^{2}} (s log (e s_{g} b / s) + s_{g} log (d / s_{g})) \geq c_{0} (s log (e s_{g} b / s) + s_{g} log (d / s_{g})) > n,

which is a contradiction.
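The elementary inequality behind the last step of (61) can be sanity-checked on a grid. This is only a numerical illustration under the assumption that x = ed/s_g ≥ e (which holds when d ≥ s_g); the grid range is arbitrary.

```python
import numpy as np

# Grid of values x = e * d / s_g, from e up to an arbitrary large value.
x = np.geomspace(np.e, 1e12, 100_000)

# Last step of (61): min{x, x / log(x)} >= sqrt(x).
lhs = np.minimum(x, x / np.log(x))
rhs = np.sqrt(x)
holds_everywhere = bool(np.all(lhs >= rhs))
print(holds_everywhere)
```

The check reduces to √x ≥ log x, which holds for every x ≥ 1.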

Thus, we only need to prove (59) based on (57). We again argue by contradiction. If

d^{n} (B, ℝ^{p}) < c min {\frac{1}{\sqrt{s_{0}}}, {[(\frac{s_{g}}{s} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n}) + log (e s_{g} b / s)) / n]}^{1 / 2}} = : μ,

then there exists a subspace Lⁿ of $ℝ^{p}$ with $dim (ℝ^{p} / L^{n}) \leq n$ such that for all v ∈ Lⁿ\{0},

{‖ v ‖}_{2} < μ ({‖ v ‖}_{1} + \sqrt{s_{0}} {‖ v ‖}_{1, 2}) .

Let $A \in ℝ^{n \times p}$ satisfy ker(A) = Lⁿ. Let $s^{'} = ⌊ \frac{1}{32 μ^{2}} ⌋$ and $s_{g}^{'} = ⌊ s^{'} / s_{0} ⌋$ . By (57) and (61),

\frac{1}{8} s_{0}^{- 1 / 2} \geq c s_{0}^{- 1 / 2} \geq μ \geq c min {\sqrt{\frac{C_{0}}{s}}, {(\frac{\frac{s_{g}}{2 s} log (d / s_{g}) + log (e s_{g} b / s)}{c_{0} (s_{g} log (d / s_{g}) + s log (e s_{g} b / s))})}^{1 / 2}} \geq \frac{1}{4 \sqrt{2}} s^{- 1 / 2},

which means that

1 \leq s^{'} \leq s, 1 \leq s_{g}^{'} \leq s_{g} .

Moreover, we have $\frac{1}{64 μ^{2}} < s^{'} \leq \frac{1}{32 μ^{2}}$ . For any (2s′, $2 s_{g}^{'}$ )-sparse β with support set T and group support set G, and v ∈ ker(A), by Cauchy-Schwarz inequality,

{‖ v_{T} ‖}_{1} + \sqrt{s_{0}} {‖ v_{(G)} ‖}_{1, 2} \leq \sqrt{2 s^{'}} {‖ v_{T} ‖}_{2} + \sqrt{s_{0}} \sqrt{2 s_{g}^{'}} {‖ v_{T} ‖}_{2} \leq 2 \sqrt{2 s^{'}} {‖ v_{T} ‖}_{2} < 2 \sqrt{2} \frac{1}{4 \sqrt{2} μ} μ ({‖ v ‖}_{1} + \sqrt{s_{0}} {‖ v ‖}_{1, 2}) = \frac{1}{2} ({‖ v ‖}_{1} + \sqrt{s_{0}} {‖ v ‖}_{1, 2}),

i.e.,

{‖ v_{T} ‖}_{1} + \sqrt{s_{0}} {‖ v_{(G)} ‖}_{1, 2} < {‖ v_{T^{c}} ‖}_{1} + \sqrt{s_{0}} {‖ v_{(G^{c})} ‖}_{1, 2} .

Based on Cauchy-Schwarz inequality and the sub-differential of ∥β∥₁ and ∥β∥_1,2, we have

{‖ β + v ‖}_{1} + \sqrt{s_{0}} {‖ β + v ‖}_{1, 2} \geq {‖ β ‖}_{1} + sgn {(β)}^{⊤} v_{T} + {‖ v_{T^{c}} ‖}_{1} + \sqrt{s_{0}} ({‖ β ‖}_{1, 2} + \sum_{j \in G} \frac{β_{(j)}^{⊤} v_{(j)}}{{‖ β_{(j)} ‖}_{2}} + \sum_{j \in G^{c}} {‖ v_{(j)} ‖}_{2}) \geq {‖ β ‖}_{1} - {‖ v_{T} ‖}_{1} + {‖ v_{T^{c}} ‖}_{1} + \sqrt{s_{0}} ({‖ β ‖}_{1, 2} - {‖ v_{(G)} ‖}_{1, 2} + {‖ v_{(G^{c})} ‖}_{1, 2}) > {‖ β ‖}_{1} + \sqrt{s_{0}} {‖ β ‖}_{1, 2} .

By Theorem 2,

n \geq c^{'} (s^{'} log (e s_{g}^{'} b / s^{'}) + s_{g}^{'} log (d / s_{g}^{'})) \geq c^{'} s^{'} {log (\frac{e s_{g} b}{2 s}) + \frac{1}{2 s_{0}} log (s_{0} d / s^{'})} \geq c^{'} s^{'} log (\frac{e s_{g} b}{2 s}) .

Thus

n \geq c^{'} s^{'} (log (\frac{e s_{g} b}{2 s}) + \frac{s_{g}}{s} log (\frac{c^{'} \frac{s}{s_{g}} d log (e s_{g} b / s)}{n})) > \frac{c^{'}}{64 μ^{2}} (\frac{1}{4} log (e s_{g} b / s) + \frac{s_{g}}{s} log (\frac{c \frac{s}{s_{g}} d log (e s_{g} b / s)}{n})) \geq n

provided that $c = min {\frac{1}{8}, c^{'}, \sqrt{\frac{c^{'}}{256}}}$ , which is a contradiction. This means that (59) holds whenever (57) is true.

Therefore, we have finished the proof of Theorem 3. □

F. Proof of Theorem 4

Let $λ = C σ \sqrt{n \cdot \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}}$ , $λ_{g} = \sqrt{s / s_{g}} λ$ . By (86) in Lemma 5 and (101), one has

ℙ ({‖ H_{\frac{1}{10} λ} (X^{⊤} ε) ‖}_{\infty, 2} \geq \frac{1}{10} λ_{g}) \leq ℙ (\exists 1 \leq j \leq d, {‖ H_{\frac{1}{10} λ} (X_{(j)}^{⊤} ε) ‖}_{2} \geq \frac{1}{10} λ_{g}, ‖ ε ‖_{2} \leq 5 \sqrt{n σ^{2}}) + ℙ (‖ ε ‖_{2} \geq 5 \sqrt{n σ^{2}}) \leq \sum_{j = 1}^{d} ℙ ({‖ H_{\frac{1}{10} λ} (X_{(j)}^{⊤} ε) ‖}_{2} \geq \frac{1}{10} λ_{g} ∣ ‖ ε ‖_{2} \leq 5 \sqrt{n σ^{2}}) + ℙ (‖ ε ‖_{2} \geq 5 \sqrt{n σ^{2}}) \leq d exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) + e^{- n} = exp (log (s_{g}) + log (d / s_{g}) - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) + e^{- n} \leq exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) + e^{- n} .

(62)

By the definition of $\hat{β}$ and the KKT conditions, we have

X^{⊤} (y - X \hat{β}) + λ z_{1} + λ_{g} z_{2} = 0,

where

{\begin{array}{l} {(z_{1})}_{i} = sgn ({\hat{β}}_{i}), & {\hat{β}}_{i} \neq 0; \\ | {(z_{1})}_{i} | \leq 1, & {\hat{β}}_{i} = 0; \end{array}

{\begin{array}{l} {(z_{2})}_{(j)} = \frac{{\hat{β}}_{(j)}}{{‖ {\hat{β}}_{(j)} ‖}_{2}}, & {\hat{β}}_{(j)} \neq 0; \\ {‖ {(z_{2})}_{(j)} ‖}_{2} \leq 1, & {\hat{β}}_{(j)} = 0. \end{array}

Therefore,

{‖ H_{λ} (X^{⊤} (X \hat{β} - y)) ‖}_{\infty, 2} \leq λ_{g} .
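The operator H and the norm ∥·∥_∞,2 appearing in this bound can be made concrete with a short sketch. It assumes that H_α is the entrywise soft-thresholding operator and that ∥·∥_∞,2 is the largest group-wise ℓ₂ norm, as defined earlier in the paper; the helper names and the toy vector are hypothetical.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Entrywise soft-thresholding: sgn(x_i) * max(|x_i| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def norm_inf_2(x, groups):
    """Largest l2 norm over the groups of x."""
    return max(np.linalg.norm(x[g]) for g in groups)

# Toy vector with two groups of size two.
x = np.array([3.0, -1.0, 0.5, -4.0])
groups = [np.array([0, 1]), np.array([2, 3])]
alpha = 1.0

h = soft_threshold(x, alpha)          # entries 2, 0, 0, -3
value = norm_inf_2(h, groups)         # max(2, 3)
# Inequality used later in the proof: ||x||_2 <= ||H_alpha(x)||_2 + sqrt(len(x)) * alpha
gap = np.linalg.norm(h) + np.sqrt(x.size) * alpha - np.linalg.norm(x)
print(h, value, gap)
```

In the KKT display, the term λ z₁ has entries bounded by λ in absolute value, so soft-thresholding at level λ leaves at most the contribution of λ_g z₂, whose group norms are at most λ_g.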

(62), Lemma 8 Part 1 and the previous inequality together imply that

ℙ ({‖ H_{(1 + \frac{1}{10}) λ} (X^{⊤} X h) ‖}_{\infty, 2} \leq (1 + \frac{1}{10}) λ_{g}) \geq 1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) - e^{- n},

(63)

where $h = \hat{β} - β^{*}$ . By the definition of $\hat{β}$ , we have

{‖ y - X \hat{β} ‖}_{2}^{2} + λ {‖ \hat{β} ‖}_{1} + λ_{g} {‖ \hat{β} ‖}_{1, 2} \leq {‖ y - X β^{*} ‖}_{2}^{2} + λ {‖ β^{*} ‖}_{1} + λ_{g} {‖ β^{*} ‖}_{1, 2} .

(32) and the previous inequality show that

{‖ X h ‖}_{2}^{2} + λ {‖ h_{T^{c}} ‖}_{1} + λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2} \leq 2 〈 X h, ε 〉 - λ \cdot sgn {(β_{T}^{*})}^{⊤} h_{T} - λ_{g} \sum_{j \in G} \frac{β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} + 2 λ {‖ β_{T^{c}}^{*} ‖}_{1} + 2 λ_{g} {‖ β_{T^{c}}^{*} ‖}_{1, 2} .

(64)

First, we consider 〈Xh, ε〉. Denote $P = X_{T} {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤}$ , since $X h = X_{T} h_{T} + X_{T^{c}} h_{T^{c}}$ and (I_n − P)X_T = 0,

| 〈 X h, ε 〉 | \leq | 〈 P X h, ε 〉 | + | 〈 (I_{n} - P) X h, ε 〉 | = | 〈 X_{T}^{⊤} X h, {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε 〉 | + | 〈 (I_{n} - P) X_{T^{c}} h_{T^{c}}, ε 〉 | = | 〈 X_{T}^{⊤} X h, {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε 〉 | + | 〈 X_{T^{c}} h_{T^{c}}, (I_{n} - P) ε 〉 | .

(65)

Therefore, to give an upper bound of |〈Xh, ε〉|, we only need to bound $| 〈 X_{T}^{⊤} X h, {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε 〉 |$ and $| 〈 X_{T^{c}} h_{T^{c}}, (I_{n} - P) ε 〉 |$ , respectively. By Part 1 of Lemma 7 and also notice that c_min ≤ σ_min (Σ) ≤ σ_max (Σ) ≤ C_max,

ℙ (‖ {(\frac{1}{n} X_{T}^{⊤} X_{T})}^{- 1} ‖ \geq \frac{2}{c_{min}}) \leq ℙ (‖ \frac{1}{n} X_{T}^{⊤} X_{T} - Σ_{T, T} ‖ \geq \frac{c_{min}}{2}) \leq ℙ (‖ \frac{1}{n} X_{T}^{⊤} X_{T} Σ_{T, T}^{- 1} - I_{s} ‖ \geq \frac{c_{min}}{2 C_{max}}) \leq 2 exp (C s - c n) \leq 2 exp (- c n) .

(66)

(66), Lemma 9 and Cauchy-Schwarz inequality together imply that with probability at least $1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - 2 exp (- c n)$ ,

{‖ {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε ‖}_{1} \leq \frac{2}{c_{min}} \frac{\sqrt{s}}{n} {‖ X_{T}^{⊤} ε ‖}_{2} \leq \frac{2}{c_{min}} \frac{s}{n} {‖ X_{T}^{⊤} ε ‖}_{\infty} \leq C \frac{s}{n} \sqrt{n \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s} σ^{2}} \leq C \frac{s}{n} λ,

{‖ {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε ‖}_{1, 2} \leq \sqrt{s_{g}} {‖ {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε ‖}_{2} \leq \frac{2}{c_{min}} \frac{\sqrt{s_{g}}}{n} {‖ X_{T}^{⊤} ε ‖}_{2} \leq C \frac{\sqrt{s \cdot s_{g}}}{n} λ .

Combining Lemma 8 Part 2, (63) and the previous two inequalities, we have, with probability at least $1 - 2 exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - 3 e^{- c n}$ ,

| 〈 X_{T}^{⊤} X h, {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε 〉 | \leq \frac{11}{10} λ {‖ {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε ‖}_{1} + \frac{11}{10} λ_{g} {‖ {(X_{T}^{⊤} X_{T})}^{- 1} X_{T}^{⊤} ε ‖}_{1, 2} \leq C \frac{s}{n} λ^{2} .

(67)

Similarly to the proof of (62), and noticing that ∥(I_n − P)ε∥₂ ≤ ∥ε∥₂ and that $X_{(G^{c})}$ is independent of I_n − P, we have

ℙ ({‖ H_{\frac{1}{10} λ} (X_{(G^{c})}^{⊤} (I_{n} - P) ε) ‖}_{\infty, 2} \geq \frac{1}{10} λ_{g}) \leq ℙ (\exists j \in G^{c}, {‖ H_{\frac{1}{10} λ} (X_{(j)}^{⊤} (I_{n} - P) ε) ‖}_{2} \geq \frac{1}{10} λ_{g} ∣ {‖ (I_{n} - P) ε ‖}_{2} \leq 5 \sqrt{n σ^{2}}) + ℙ (‖ ε ‖_{2} \geq 5 \sqrt{n σ^{2}}) \leq exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) + e^{- n} .

By Lemma 8 Part 2 and (62), with probability at least $1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) - e^{- n}$ ,

| 〈 X_{(G^{c})} h_{(G^{c})}, (I_{n} - P) ε 〉 | = | 〈 h_{(G^{c})}, X_{(G^{c})}^{⊤} (I_{n} - P) ε 〉 | \leq \frac{1}{10} λ {‖ h_{(G^{c})} ‖}_{1} + \frac{1}{10} λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2} .
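The duality-type bound used in this step can be checked numerically. The sketch below assumes that the inequality invoked from Lemma 8 Part 2 has the form |〈a, h〉| ≤ α∥h∥₁ + γ∥h∥_1,2 whenever ∥H_α(a)∥_∞,2 ≤ γ, with H_α the soft-thresholding operator; the lemma's exact statement is not reproduced in this section, and the random vectors are purely illustrative.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Entrywise soft-thresholding: sgn(x_i) * max(|x_i| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

rng = np.random.default_rng(0)
d, b = 6, 5                                    # hypothetical group layout
groups = [np.arange(k * b, (k + 1) * b) for k in range(d)]
a = rng.normal(size=d * b)                     # plays the role of X^T (I - P) eps
h = rng.normal(size=d * b)                     # plays the role of the error vector
alpha = 0.3

# Smallest gamma for which ||H_alpha(a)||_{inf,2} <= gamma holds.
gamma = max(np.linalg.norm(soft_threshold(a[g], alpha)) for g in groups)

lhs = abs(a @ h)
rhs = alpha * np.abs(h).sum() + gamma * sum(np.linalg.norm(h[g]) for g in groups)
print(lhs, rhs)
```

The bound follows by splitting a into H_α(a), handled group by group with the Cauchy-Schwarz inequality, and a remainder whose entries are at most α in absolute value.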

Notice that $X_{T^{c} \ (G^{c})}$ and I_n − P are independent and |T^c\(G^c)| ≤ |G| ≤ s_gb, by Lemma 9, with probability at least $1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - e^{- n}$ ,

| 〈 X_{T^{c} \ (G^{c})} h_{T^{c} \ (G^{c})}, (I_{n} - P) ε 〉 | \leq {‖ h_{T^{c} \ (G^{c})} ‖}_{1} {‖ X_{T^{c} \ (G^{c})}^{⊤} (I_{n} - P) ε ‖}_{\infty} \leq C \sqrt{n \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s} σ^{2}} {‖ h_{T^{c} \ (G^{c})} ‖}_{1} \leq \frac{1}{10} λ {‖ h_{T^{c} \ (G^{c})} ‖}_{1} .

Combining the previous two inequalities, we have

| 〈 X_{T^{c}} h_{T^{c}}, (I_{n} - P) ε 〉 | \leq | 〈 X_{(G^{c})} h_{(G^{c})}, (I_{n} - P) ε 〉 | + | 〈 X_{T^{c} \ (G^{c})} h_{T^{c} \ (G^{c})}, (I_{n} - P) ε 〉 | \leq \frac{1}{10} λ {‖ h_{T^{c}} ‖}_{1} + \frac{1}{10} λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2}

(68)

with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n}$ . Combining (65), (67) and (68), we know that with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n}$ ,

| 〈 X h, ε 〉 | \leq C \frac{s}{n} λ^{2} + \frac{1}{10} λ {‖ h_{T^{c}} ‖}_{1} + \frac{1}{10} λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2} .

(69)

Moreover, by the proof of Theorem 1, with probability at least 1 − C exp(−cn/s), there exists an approximate dual certificate $u \in ℝ^{p}$ in the row span of X satisfying (18), and ${‖ v_{T} - sgn (β_{T}^{*}) ‖}_{2} \leq \frac{1}{8}$ , where v is defined in (29). Similarly to (33), we have

sgn {(β_{T}^{*})}^{⊤} h_{T} + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} \geq - {‖ v_{T} - sgn (β_{T}^{*}) ‖}_{2} \cdot {‖ h_{T} ‖}_{2} - {‖ h_{T^{c}} ‖}_{1} / 2 - \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2 + 〈 h, u 〉 \geq - \frac{c_{min}}{8} \cdot {‖ h_{T} ‖}_{2} - {‖ h_{T^{c}} ‖}_{1} / 2 - \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2 + 〈 h, u 〉 .

By Lemma 10, with probability at least $1 - C e^{- c n / s}$ , u = X^⊤ w with ${‖ w ‖}_{2} \leq C \sqrt{s / n}$ . Therefore, with probability at least $1 - C e^{- c n / s}$ ,

| 〈 h, u 〉 | = | 〈 X h, w 〉 | \leq {‖ X h ‖}_{2} {‖ w ‖}_{2} \leq C \sqrt{s / n} {‖ X h ‖}_{2} .

The two previous inequalities together imply that

sgn {(β_{T}^{*})}^{⊤} h_{T} + \sum_{j \in G} \frac{\sqrt{s_{0}} β_{T, (j)}^{* ⊤} h_{(j)}}{{‖ β_{T, (j)}^{*} ‖}_{2}} \geq - \frac{c_{min}}{8} \cdot {‖ h_{T} ‖}_{2} - {‖ h_{T^{c}} ‖}_{1} / 2 - \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} / 2 - C \sqrt{s / n} {‖ X h ‖}_{2}

(70)

with probability at least $1 - C e^{- c n / s}$ .

Combining (64), (69) and (70), we obtain, with probability at least $1 - C e^{- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}} - C e^{- c n / s}$ ,

{‖ X h ‖}_{2}^{2} + \frac{3}{10} λ {‖ h_{T^{c}} ‖}_{1} + \frac{3}{10} λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2} \leq C \frac{s}{n} λ^{2} + \frac{c_{min}}{8} λ {‖ h_{T} ‖}_{2} + C \sqrt{s / n} λ {‖ X h ‖}_{2} + 2 λ {‖ β_{T^{c}}^{*} ‖}_{1} + 2 λ_{g} {‖ β_{T^{c}}^{*} ‖}_{1, 2} .

(71)

By (42), (63) and (66), with probability at least $1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) - C e^{- c n}$ ,

{‖ h_{T} ‖}_{2} \leq ‖ {(X_{T}^{⊤} X_{T})}^{- 1} ‖ {‖ X_{T}^{⊤} X_{T} h_{T} ‖}_{2} \leq \frac{2}{c_{min} n} {‖ X_{T}^{⊤} X h - X_{T}^{⊤} X_{T^{c}} h_{T^{c}} ‖}_{2} \leq \frac{2}{c_{min} n} ({‖ X_{T}^{⊤} X h ‖}_{2} + {‖ X_{T}^{⊤} X_{T^{c}} h_{T^{c}} ‖}_{2}) \leq \frac{2}{c_{min} n} ({‖ H_{\frac{11}{10} λ} (X_{T}^{⊤} X h) ‖}_{2} + \frac{11}{10} \sqrt{s} λ + n \sum_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} | h_{i} |) \leq \frac{2}{c_{min} n} (\sqrt{s_{g}} {‖ H_{\frac{11}{10} λ} (X_{T}^{⊤} X h) ‖}_{\infty, 2} + \frac{11}{10} \sqrt{s} λ + n max_{i \in T^{c}} {‖ X_{T}^{⊤} X_{i} / n ‖}_{2} {‖ h_{T^{c}} ‖}_{1}) \leq \frac{2}{c_{min} n} (\sqrt{s_{g}} \frac{11}{10} λ_{g} + \frac{11}{10} \sqrt{s} λ + \frac{n}{2} {‖ h_{T^{c}} ‖}_{1}) \leq \frac{5}{c_{min}} \frac{\sqrt{s}}{n} λ + \frac{1}{c_{min}} {‖ h_{T^{c}} ‖}_{1} .

(72)

The fourth inequality comes from ${‖ x ‖}_{2} \leq {‖ H_{α} (x) ‖}_{2} + \sqrt{s} α$ for $x \in ℝ^{s}$ ; the fifth inequality holds since ${‖ X_{T}^{⊤} X h ‖}_{0, 2} \leq s_{g}$ . (71) and (72) together imply that

{‖ X h ‖}_{2}^{2} + \frac{7}{40} λ {‖ h_{T^{c}} ‖}_{1} + \frac{3}{10} λ_{g} {‖ h_{(G^{c})} ‖}_{1, 2} \leq C \frac{s}{n} λ^{2} + C \sqrt{s / n} λ {‖ X h ‖}_{2} + 2 λ {‖ β_{T^{c}}^{*} ‖}_{1} + 2 λ_{g} {‖ β_{T^{c}}^{*} ‖}_{1, 2}

with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n / s}$ . Also notice that

C \sqrt{s / n} λ {‖ X h ‖}_{2} \leq {‖ X h ‖}_{2}^{2} + C \frac{s}{n} λ^{2},

and therefore, with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n / s}$ ,

{‖ h_{T^{c}} ‖}_{1} + \sqrt{s_{0}} {‖ h_{(G^{c})} ‖}_{1, 2} \leq C (\frac{s}{n} λ + {‖ β_{T^{c}}^{*} ‖}_{1} + \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) .

(73)

From the proof of Lemma 1, we know that (36) and (41) hold with probability at least $1 - 2 e^{- c n}$ . By Lemma 8 Part 2 and (63), with probability at least $1 - exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) - e^{- n}$ ,

| 〈 X_{\tilde{T}} h_{\tilde{T}}, X h 〉 | = | 〈 h_{\tilde{T}}, X_{\tilde{T}}^{⊤} X h 〉 | \leq \frac{11}{10} (λ {‖ h_{\tilde{T}} ‖}_{1} + λ_{g} {‖ h_{\tilde{T}} ‖}_{1, 2}) \leq \frac{11}{10} (λ \cdot \sqrt{3 s} {‖ h_{\tilde{T}} ‖}_{2} + λ_{g} \sqrt{2 s_{g}} {‖ h_{\tilde{T}} ‖}_{2}) \leq 4 λ \sqrt{s} {‖ h_{\tilde{T}} ‖}_{2} .

(74)

The second inequality is due to ${‖ h_{\tilde{T}} ‖}_{0} \leq 3 s$ , ${‖ h_{\tilde{T}} ‖}_{0, 2} \leq 2 s_{g}$ and Cauchy-Schwarz inequality.

Combining (36), (41), (73) and (74), with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n / s}$ , we have

\frac{c_{min}}{2} {‖ h_{\tilde{T}} ‖}_{2}^{2} \leq \frac{1}{n} 4 λ \sqrt{s} {‖ h_{\tilde{T}} ‖}_{2} + \sqrt{3} C_{max} {‖ h_{\tilde{T}} ‖}_{2} (\sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2}) \leq \frac{1}{n} 4 λ \sqrt{s} {‖ h_{\tilde{T}} ‖}_{2} + \sqrt{3} C_{max} {‖ h_{\tilde{T}} ‖}_{2} \cdot \frac{C}{\sqrt{s}} \cdot (\frac{s}{n} λ + {‖ β_{T^{c}}^{*} ‖}_{1} + \sqrt{s_{0}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) \leq C (\frac{\sqrt{s}}{n} λ + \frac{1}{\sqrt{s}} {‖ β_{T^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) {‖ h_{\tilde{T}} ‖}_{2} .

Therefore, with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}) - C e^{- c n / s}$ ,

{‖ h_{\tilde{T}} ‖}_{2} \leq C (\frac{\sqrt{s}}{n} λ + \frac{1}{\sqrt{s}} {‖ β_{T^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) .

(75)

By (39), (40), (73) and the previous inequality, also notice that $e^{- c n / s} \leq e^{- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}}$ , with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ ,

‖ h ‖_{2} \leq {‖ h_{\tilde{T}} ‖}_{2} + \sum_{i \geq 2} {‖ h_{T_{i}} ‖}_{2} + \sum_{j \geq 2} {‖ h_{R_{j}} ‖}_{2} \leq {‖ h_{\tilde{T}} ‖}_{2} + \sqrt{2} s^{- 1 / 2} {‖ h_{T^{c}} ‖}_{1} + s_{g}^{- 1 / 2} {‖ h_{(G^{c})} ‖}_{1, 2} \leq C (\frac{\sqrt{s}}{n} λ + \frac{1}{\sqrt{s}} {‖ β_{T^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}),

(76)

i.e., with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ ,

{‖ h ‖}_{2} \leq C (\sqrt{\frac{σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b))}{n}} + \frac{1}{\sqrt{s}} {‖ β_{T^{c}}^{*} ‖}_{1} + \frac{1}{\sqrt{s_{g}}} {‖ β_{T^{c}}^{*} ‖}_{1, 2}) .

Moreover, if β* is (s, s_g)-sparse, then ${‖ β_{T^{c}}^{*} ‖}_{1} = {‖ β_{T^{c}}^{*} ‖}_{1, 2} = 0$ . Therefore, with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ ,

{‖ h ‖}_{2}^{2} \leq \frac{C σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b))}{n} .

□
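The estimator analyzed in this proof minimizes ∥y − Xβ∥₂² + λ∥β∥₁ + λ_g∥β∥_1,2. One common way to compute it, not taken from the paper, is proximal gradient descent. The following is a minimal sketch under stated assumptions: the solver, the simulated data, and the constant 2.0 standing in for C in λ are all hypothetical. It relies on the known fact that the proximal map of the combined penalty is soft-thresholding followed by group-wise shrinkage.

```python
import numpy as np

def soft_threshold(x, alpha):
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

def prox_sgl(v, groups, t_lam, t_lam_g):
    """Prox of t*(lam*||.||_1 + lam_g*||.||_{1,2}): soft-threshold, then shrink each group."""
    u = soft_threshold(v, t_lam)
    out = np.zeros_like(u)
    for g in groups:
        norm_g = np.linalg.norm(u[g])
        if norm_g > t_lam_g:
            out[g] = (1.0 - t_lam_g / norm_g) * u[g]
    return out

def objective(X, y, beta, groups, lam, lam_g):
    r = y - X @ beta
    penalty = lam * np.abs(beta).sum() + lam_g * sum(np.linalg.norm(beta[g]) for g in groups)
    return r @ r + penalty

def sparse_group_lasso(X, y, groups, lam, lam_g, n_iter=300):
    """Proximal gradient with step 1/L, L = 2 * sigma_max(X)^2."""
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    history = [objective(X, y, beta, groups, lam, lam_g)]
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ beta - y)
        beta = prox_sgl(beta - step * grad, groups, step * lam, step * lam_g)
        history.append(objective(X, y, beta, groups, lam, lam_g))
    return beta, np.array(history)

# Simulated (hypothetical) data: d groups of size b, s_g active groups, s nonzeros.
rng = np.random.default_rng(0)
n, d, b, s_g, s, sigma = 80, 10, 6, 2, 6, 0.5
groups = [np.arange(k * b, (k + 1) * b) for k in range(d)]
beta_star = np.zeros(d * b)
beta_star[groups[0][:3]] = 1.0
beta_star[groups[1][:3]] = -1.0
X = rng.normal(size=(n, d * b))
y = X @ beta_star + sigma * rng.normal(size=n)

# Tuning as in the proof, with 2.0 as a stand-in for the unspecified constant C.
lam = 2.0 * sigma * np.sqrt(n * (s * np.log(np.e * s_g * b) + s_g * np.log(d / s_g)) / s)
lam_g = np.sqrt(s / s_g) * lam
beta_hat, history = sparse_group_lasso(X, y, groups, lam, lam_g)
print(np.linalg.norm(beta_hat - beta_star))
```

With step size 1/L the objective is non-increasing along the iterations, which gives a simple convergence diagnostic; the printed error can be compared informally with the rate in the theorem, although the constants are not calibrated.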

G. Proof of Theorem 5

First, we consider the case that d ≥ 3s_g and b ≥ 3s/s_g. Let ω⁽¹⁾, …, ω^(N) be vectors drawn uniformly at random from

A = {ω \in {0, 1}^{d b} ∣ \sum_{j} 1_{{ω_{(j)} \neq 0}} = s_{g}, {‖ ω_{(j)} ‖}_{0} = ⌊ s / s_{g} ⌋ if ω_{(j)} \neq 0} .

Denote $Ω^{(i)} = {j ∣ ω_{j}^{(i)} \neq 0}$ , $Ω_{(k)}^{(i)} = {j ∣ j \in (k), ω_{j}^{(i)} \neq 0}$ and β⁽ⁱ⁾ = τω⁽ⁱ⁾, for all 1 ≤ i ≤ N, 1 ≤ k ≤ d, where τ is a parameter that will be specified later. Obviously, ∥β⁽ⁱ⁾∥₀ = s_g⌊s/s_g⌋ ≤ s, therefore ${‖ β^{(i)} - β^{(j)} ‖}_{2}^{2} \leq 2 s_{g} ⌊ s / s_{g} ⌋ τ^{2} \leq 2 s τ^{2}$ .

Moreover, if |Ω⁽ⁱ⁾ ∩ Ω^(j)| ≥ 8s_g⌊s/s_g⌋/9, then we must have

| {k | ω_{(k)}^{(i)}, ω_{(k)}^{(j)} \neq 0, | Ω_{(k)}^{(i)} \cap Ω_{(k)}^{(j)} | \geq 2 ⌊ s / s_{g} ⌋ / 3} | \geq 2 s_{g} / 3,

otherwise $| Ω^{(i)} \cap Ω^{(j)} | \leq \frac{2 s_{g}}{3} ⌊ s / s_{g} ⌋ + \frac{s_{g}}{3} 2 ⌊ s / s_{g} ⌋ / 3 \leq 8 s_{g} ⌊ s / s_{g} ⌋ / 9$ , which is a contradiction.
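The arithmetic in this counting step is easy to verify exactly. The snippet below is a small illustrative check with a hypothetical helper name: if at most 2s_g/3 groups overlap in all m = ⌊s/s_g⌋ positions and the remaining s_g/3 groups overlap in at most 2m/3 positions, the total overlap is at most 8 s_g m / 9.

```python
from fractions import Fraction

def overlap_upper_bound(s_g, m):
    """(2 s_g / 3) * m + (s_g / 3) * (2 m / 3), computed exactly."""
    return Fraction(2 * s_g, 3) * m + Fraction(s_g, 3) * Fraction(2 * m, 3)

example = overlap_upper_bound(6, 5)   # s_g = 6, m = 5
print(example)
```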

Therefore,

ℙ ({‖ β^{(i)} - β^{(j)} ‖}_{2}^{2} \leq 2 s_{g} ⌊ s / s_{g} ⌋ τ^{2} / 9) = ℙ (| Ω^{(i)} \cap Ω^{(j)} | \geq 8 s_{g} ⌊ s / s_{g} ⌋ / 9) \leq ℙ (| {k | ω_{(k)}^{(i)}, ω_{(k)}^{(j)} \neq 0, | Ω_{(k)}^{(i)} \cap Ω_{(k)}^{(j)} | \geq 2 ⌊ s / s_{g} ⌋ / 3} | \geq 2 s_{g} / 3) \leq \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) \cdot {[\sum_{t = ⌈ 2 ⌊ s / s_{g} ⌋ / 3 ⌉}^{⌊ s / s_{g} ⌋} (\begin{matrix} ⌊ s / s_{g} ⌋ \\ t \end{matrix}) (\begin{matrix} b - ⌊ s / s_{g} ⌋ \\ ⌊ s / s_{g} ⌋ - t \end{matrix})]}^{l} \cdot {(\begin{matrix} b \\ ⌊ s / s_{g} ⌋ \end{matrix})}^{s_{g} - l} (\begin{matrix} d - l \\ s_{g} - l \end{matrix}) / [(\begin{matrix} d \\ s_{g} \end{matrix}) {(\begin{matrix} b \\ ⌊ s / s_{g} ⌋ \end{matrix})}^{s_{g}}] = \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) \frac{(\begin{matrix} d - l \\ s_{g} - l \end{matrix})}{(\begin{matrix} d \\ s_{g} \end{matrix})} \cdot {[\sum_{t = ⌈ 2 ⌊ s / s_{g} ⌋ / 3 ⌉}^{⌊ s / s_{g} ⌋} (\begin{matrix} ⌊ s / s_{g} ⌋ \\ t \end{matrix}) \frac{(\begin{matrix} b - ⌊ s / s_{g} ⌋ \\ ⌊ s / s_{g} ⌋ - t \end{matrix})}{(\begin{matrix} b \\ ⌊ s / s_{g} ⌋ \end{matrix})}]}^{l} .

(77)

Note that

\frac{(\begin{array}{l} d - l \\ s_{g} - l \end{array})}{(\begin{array}{l} d \\ s_{g} \end{array})} = \frac{\frac{(d - l) \dots (d - s_{g} + 1)}{(s_{g} - l)!}}{\frac{d (d - 1) \dots (d - s_{g} + 1)}{s_{g}!}} = \frac{s_{g} (s_{g} - 1) \dots (s_{g} - l + 1)}{d (d - 1) \dots (d - l + 1)} \leq {(\frac{s_{g}}{d})}^{l},

where the inequality holds since $\frac{s_{g} - i}{d - i} \leq \frac{s_{g}}{d}$ , for all 1 ≤ i ≤ s_g. Similarly, for 1 ≤ t ≤ ⌊s/s_g⌋,

\frac{(\begin{matrix} b - ⌊ s / s_{g} ⌋ \\ ⌊ s / s_{g} ⌋ - t \end{matrix})}{(\begin{matrix} b \\ ⌊ s / s_{g} ⌋ \end{matrix})} \leq \frac{(\begin{matrix} b - t \\ ⌊ s / s_{g} ⌋ - t \end{matrix})}{(\begin{matrix} b \\ ⌊ s / s_{g} ⌋ \end{matrix})} \leq {(\frac{⌊ s / s_{g} ⌋}{b})}^{t} .
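The two binomial-ratio bounds can be confirmed numerically for concrete sizes. This is a small sketch with hypothetical helper names and arbitrary values of d, s_g, b and m = ⌊s/s_g⌋ (chosen so that d ≥ 3s_g and b ≥ 3m).

```python
import math

def ratio_groups(d, s_g, l):
    """C(d - l, s_g - l) / C(d, s_g)."""
    return math.comb(d - l, s_g - l) / math.comb(d, s_g)

def ratio_within(b, m, t):
    """C(b - m, m - t) / C(b, m), with m = floor(s / s_g)."""
    return math.comb(b - m, m - t) / math.comb(b, m)

d, s_g, b, m = 30, 5, 40, 6
group_ok = all(ratio_groups(d, s_g, l) <= (s_g / d) ** l + 1e-15 for l in range(s_g + 1))
within_ok = all(ratio_within(b, m, t) <= (m / b) ** t + 1e-15 for t in range(1, m + 1))
print(group_ok, within_ok)
```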

Combining (77) with the previous two inequalities, we have

ℙ ({‖ β^{(i)} - β^{(j)} ‖}_{2}^{2} \leq 2 s_{g} ⌊ s / s_{g} ⌋ τ^{2} / 9) \leq \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) {(\frac{s_{g}}{d})}^{l} \cdot {[\sum_{t = ⌈ 2 ⌊ s / s_{g} ⌋ / 3 ⌉}^{⌊ s / s_{g} ⌋} (\begin{matrix} ⌊ s / s_{g} ⌋ \\ t \end{matrix}) {(\frac{⌊ s / s_{g} ⌋}{b})}^{t}]}^{l} \leq \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) {(\frac{s_{g}}{d})}^{l} \cdot {[\sum_{t = ⌈ 2 ⌊ s / s_{g} ⌋ / 3 ⌉}^{⌊ s / s_{g} ⌋} (\begin{matrix} ⌊ s / s_{g} ⌋ \\ t \end{matrix}) {(\frac{⌊ s / s_{g} ⌋}{b})}^{2 ⌊ s / s_{g} ⌋ / 3}]}^{l} \leq \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) {(\frac{s_{g}}{d})}^{l} \cdot {[2^{⌊ s / s_{g} ⌋} {(\frac{⌊ s / s_{g} ⌋}{b})}^{2 ⌊ s / s_{g} ⌋ / 3}]}^{l} \leq \sum_{l = ⌈ 2 s_{g} / 3 ⌉}^{s_{g}} (\begin{matrix} s_{g} \\ l \end{matrix}) {(\frac{s_{g}}{d})}^{2 s_{g} / 3} \cdot {[{(\frac{2 \sqrt{2} ⌊ s / s_{g} ⌋}{b})}^{2 ⌊ s / s_{g} ⌋ / 3}]}^{2 s_{g} / 3} \leq {(\frac{2 \sqrt{2} s_{g}}{d})}^{2 s_{g} / 3} \cdot {(\frac{2 \sqrt{2} ⌊ s / s_{g} ⌋}{b})}^{2 s / 9} .

(78)

Set $N = ⌊ {(\frac{d}{2 \sqrt{2} s_{g}})}^{s_{g} / 3} {(\frac{b}{2 \sqrt{2} ⌊ s / s_{g} ⌋})}^{s / 9} ⌋$ , then

ℙ (\forall 1 \leq i \neq j \leq N, {‖ β^{(i)} - β^{(j)} ‖}_{2}^{2} > 2 s_{g} ⌊ s / s_{g} ⌋ τ^{2} / 9) \geq 1 - \frac{N (N - 1)}{2} {(\frac{2 \sqrt{2} s_{g}}{d})}^{2 s_{g} / 3} \cdot {(\frac{2 \sqrt{2} ⌊ s / s_{g} ⌋}{b})}^{2 s / 9} > 0.

That is, the probability that β⁽¹⁾, …, β^(N); Ω⁽¹⁾, ⋯, Ω^(N) satisfy

\frac{s}{9} τ^{2} < 2 s_{g} ⌊ s / s_{g} ⌋ τ^{2} / 9 < min_{i \neq j} {‖ β^{(i)} - β^{(j)} ‖}_{2}^{2} \leq 2 s τ^{2},

(79)

| Ω^{(i)} \cap Ω^{(j)} | < 8 s_{g} ⌊ s / s_{g} ⌋ / 9, \forall 1 \leq i < j \leq N

(80)

is positive. For convenience, we fix β⁽¹⁾, …, β^(N) to be the vectors satisfying (79).
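The choice of N is exactly what makes the union bound over pairs stay below one. The sketch below evaluates both quantities for hypothetical problem sizes; the function names are illustrative.

```python
import math

def packing_size(d, b, s, s_g):
    """N = floor((d / (2 sqrt(2) s_g))^(s_g/3) * (b / (2 sqrt(2) m))^(s/9)), m = floor(s/s_g)."""
    m = s // s_g
    root8 = 2.0 * math.sqrt(2.0)
    return math.floor((d / (root8 * s_g)) ** (s_g / 3) * (b / (root8 * m)) ** (s / 9))

def pair_collision_bound(d, b, s, s_g):
    """Right-hand side of (78): bound on the probability that one pair is too close."""
    m = s // s_g
    root8 = 2.0 * math.sqrt(2.0)
    return (root8 * s_g / d) ** (2 * s_g / 3) * (root8 * m / b) ** (2 * s / 9)

d, b, s, s_g = 200, 100, 30, 6        # hypothetical sizes with d >= 3 s_g, b >= 3 s / s_g
N = packing_size(d, b, s, s_g)
union = N * (N - 1) / 2 * pair_collision_bound(d, b, s, s_g)
print(N, union)
```

Since the pairwise bound equals the inverse square of the quantity inside the floor, the union bound is at most 1/2, so a well-separated family exists by the probabilistic method.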

Denote y⁽ⁱ⁾ = Xβ⁽ⁱ⁾ + ε for all 1 ≤ i ≤ N. We consider the Kullback-Leibler divergence between different distribution pairs:

D_{K L} ((y^{(i)}, X), (y^{(j)}, X)) = E_{(y^{(j)}, X)} [log (\frac{p (y^{(i)}, X)}{p (y^{(j)}, X)})],

where p(y⁽ⁱ⁾, X) is the probability density of (y⁽ⁱ⁾, X). Conditioning on X, we have

E_{(y^{(j)}, X)} [log (\frac{p (y^{(i)}, X)}{p (y^{(j)}, X)}) ∣ X] = \frac{{‖ X (β^{(i)} - β^{(j)}) ‖}_{2}^{2}}{2 σ^{2}} .

Thus for 1 ≤ i ≠ j ≤ N,

D_{K L} ((y^{(i)}, X), (y^{(j)}, X)) = E_{X} \frac{{‖ X (β^{(i)} - β^{(j)}) ‖}_{2}^{2}}{2 σ^{2}} = \frac{n {(β^{(i)} - β^{(j)})}^{⊤} Σ (β^{(i)} - β^{(j)})}{2 σ^{2}} \leq \frac{3 n {‖ β^{(i)} - β^{(j)} ‖}_{2}^{2}}{4 σ^{2}} \leq \frac{3 n s τ^{2}}{2 σ^{2}} .

(81)

In the first inequality, we used $σ_{max} (Σ) \leq \frac{3}{2}$ . By the generalized Fano’s lemma,

inf_{\hat{β}} sup_{β \in F_{s, s_{g}}} E {‖ \hat{β} - β ‖}_{2} \geq \frac{\sqrt{s τ^{2} / 9}}{2} (1 - \frac{\frac{3 n s τ^{2}}{2 σ^{2}} + log 2}{log N}) .

Since $log N ≍ s_{g} log (\frac{d}{s_{g}}) + s log (\frac{e s_{g} b}{s})$ , by setting $τ = c \sqrt{\frac{σ^{2} (s_{g} log (\frac{d}{s_{g}}) + s log (\frac{e s_{g} b}{s}))}{n s}}$ , we have

inf_{\hat{β}} sup_{β \in F_{s, s_{g}}} E {‖ \hat{β} - β ‖}_{2}^{2} \geq {(inf_{\hat{β}} sup_{β \in F_{s, s_{g}}} E {‖ \hat{β} - β ‖}_{2})}^{2} \geq c \frac{σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b / s))}{n} .
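The way τ balances the two factors in the Fano bound can be seen numerically. The following sketch is illustrative only: it picks τ so that the KL term equals one quarter of log N, which is one hypothetical way to fix the unspecified small constant c, and the problem sizes are arbitrary.

```python
import math

def log_packing_size(d, b, s, s_g):
    """log of (d / (2 sqrt(2) s_g))^(s_g/3) * (b / (2 sqrt(2) m))^(s/9), m = floor(s/s_g)."""
    m = s // s_g
    root8 = 2.0 * math.sqrt(2.0)
    return (s_g / 3) * math.log(d / (root8 * s_g)) + (s / 9) * math.log(b / (root8 * m))

def fano_lower_bound(n, s, sigma, tau, log_N):
    """sqrt(s tau^2 / 9) / 2 * (1 - (3 n s tau^2 / (2 sigma^2) + log 2) / log N)."""
    kl = 3.0 * n * s * tau ** 2 / (2.0 * sigma ** 2)
    return math.sqrt(s * tau ** 2 / 9.0) / 2.0 * (1.0 - (kl + math.log(2.0)) / log_N)

def bound_for_sample_size(n, d, b, s, s_g, sigma):
    log_N = log_packing_size(d, b, s, s_g)
    tau = math.sqrt(sigma ** 2 * log_N / (6.0 * n * s))   # makes the KL term equal log_N / 4
    return fano_lower_bound(n, s, sigma, tau, log_N)

d, b, s, s_g, sigma = 200, 100, 30, 6, 1.0
bound_n = bound_for_sample_size(100, d, b, s, s_g, sigma)
bound_4n = bound_for_sample_size(400, d, b, s, s_g, sigma)
print(bound_n, bound_4n)
```

Quadrupling n halves the bound on the ℓ₂ risk, in line with the σ² log N / n scaling of the squared error.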

If d < 3s_g or b < 3s/s_g, let $s_{g}^{'} = [s_{g} / 3] \lor 1 \geq s_{g} / 5$ , $s^{'} = [s / 15] \lor s_{g}^{'}$ , then $d \geq 3 s_{g}^{'}$ and $b \geq 3 s^{'} / s_{g}^{'}$ . Similarly to (56), we have

inf_{\hat{β}} sup_{β \in F_{s, s_{g}}} E {‖ \hat{β} - β ‖}_{2}^{2} \geq inf_{\hat{β}} sup_{β \in F_{s^{'}, s_{g}^{'}}} E {‖ \hat{β} - β ‖}_{2}^{2} \geq c \frac{σ^{2} (s_{g}^{'} log (d / s_{g}^{'}) + s^{'} log (e s_{g}^{'} b / s^{'}))}{n} \geq c^{'} \frac{σ^{2} (s_{g} log (d / s_{g}) + s log (e s_{g} b / s))}{n} .

□

H. Proof of Theorem 6

The proof of Theorem 6 relies on the following key lemma, which shows that Σ⁻¹ is in the feasible set of the optimization problem (23) with high probability by choosing appropriate α and γ.

Lemma 4: By setting $α = C \sqrt{\frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s n}}$ , $γ = \sqrt{\frac{s}{s_{g}}} α$ in (23), we have

ℙ ({max}_{1 \leq i \leq p} {‖ H_{α} (e_{i} - \frac{1}{n} X^{⊤} X Σ^{- 1} e_{i}) ‖}_{\infty, 2} \leq γ) \geq 1 - 4 exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) .

Since Y = Xβ* + ε, we have

\sqrt{n} ({\hat{β}}^{u} - β^{*}) = \sqrt{n} (\hat{β} - β^{*} + \frac{1}{n} \hat{M} X^{⊤} (Y - X \hat{β})) = \sqrt{n} (I - \frac{1}{n} \hat{M} X^{⊤} X) (\hat{β} - β^{*}) + \frac{1}{\sqrt{n}} \hat{M} X^{⊤} ε .
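The decomposition above is a purely algebraic identity and holds for any initial estimator and any matrix in place of $\hat{M}$. The sketch below checks it numerically; the perturbed estimator and the perturbed identity matrix are hypothetical stand-ins, not the sparse group Lasso solution or the solution of (23).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 15
X = rng.normal(size=(n, p))
beta_star = rng.normal(size=p)
eps = rng.normal(size=n)
Y = X @ beta_star + eps

beta_hat = beta_star + 0.1 * rng.normal(size=p)        # stand-in for any initial estimator
M_hat = np.eye(p) + 0.05 * rng.normal(size=(p, p))     # stand-in for the matrix from (23)

# Debiased estimator: beta_hat + (1/n) M_hat X^T (Y - X beta_hat)
beta_u = beta_hat + M_hat @ X.T @ (Y - X @ beta_hat) / n

lhs = np.sqrt(n) * (beta_u - beta_star)
bias = np.sqrt(n) * (np.eye(p) - M_hat @ X.T @ X / n) @ (beta_hat - beta_star)
noise = M_hat @ X.T @ eps / np.sqrt(n)
print(np.max(np.abs(lhs - (bias + noise))))
```

The first term is the remaining bias, which the proof controls through the feasibility constraint in (23); the second term is the Gaussian part that drives the inference.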

Since $ε_{i} \overset{i.i.d.}{\sim} N (0, σ^{2})$ , we know that

\frac{1}{\sqrt{n}} \hat{M} X^{⊤} ε ∣ X \sim N (0, \hat{M} \hat{Σ} {\hat{M}}^{⊤}) .

Denote $h = \hat{β} - β^{*}$ . Since β* is (s, s_g)-sparse, by (73), (76) and Cauchy-Schwarz inequality, with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ ,

{‖ h ‖}_{1} \leq {‖ h_{T} ‖}_{1} + {‖ h_{T^{c}} ‖}_{1} \leq \sqrt{s} {‖ h_{T} ‖}_{2} + {‖ h_{T^{c}} ‖}_{1} \leq \sqrt{s} {‖ h ‖}_{2} + {‖ h_{T^{c}} ‖}_{1} \leq C \frac{s}{n} λ .

{‖ h ‖}_{1, 2} \leq {‖ h_{(G)} ‖}_{1, 2} + {‖ h_{(G^{c})} ‖}_{1, 2} \leq \sqrt{s_{g}} {‖ h_{(G)} ‖}_{2} + {‖ h_{(G^{c})} ‖}_{1, 2} \leq \sqrt{s_{g}} {‖ h ‖}_{2} + {‖ h_{(G^{c})} ‖}_{1, 2} \leq C \frac{\sqrt{s \cdot s_{g}}}{n} λ .

In addition, Lemma 4 shows that Σ⁻¹ is in the feasible set of (23) with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ . By the definition of $\hat{M}$ ,

max_{i} {‖ H_{α} (e_{i} - \hat{Σ} {\hat{M}}^{⊤} e_{i}) ‖}_{\infty, 2} = max_{i} {‖ H_{α} (e_{i} - \hat{Σ} {\hat{m}}_{i}) ‖}_{\infty, 2} \leq γ .

(82)

Combining these facts, by Lemma 8 Part 2, we must have

{‖ (I - \frac{1}{n} \hat{M} X^{⊤} X) (\hat{β} - β^{*}) ‖}_{\infty} = {max}_{i} | 〈 e_{i} - \hat{Σ} {\hat{M}}^{⊤} e_{i}, h 〉 | \leq α {‖ h ‖}_{1} + γ {‖ h ‖}_{1, 2} \leq C \frac{s}{n} α λ + C \frac{\sqrt{s \cdot s_{g}}}{n} γ λ = \frac{C (s log (e s_{g} b) + s_{g} log (d / s_{g}))}{n} σ

with probability at least $1 - C exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s})$ . This completes the proof of (24).

Next, we consider ${\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i}$ . By (82) and Lemma 8 Part 2, we have

1 - 〈 e_{i}, \hat{Σ} {\hat{m}}_{i} 〉 = 〈 e_{i}, e_{i} - \hat{Σ} {\hat{m}}_{i} 〉 \leq α {‖ e_{i} ‖}_{1} + γ {‖ e_{i} ‖}_{1, 2} = α + γ .

Therefore, for any c ≥ 0,

{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} \geq {\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} + c (1 - α - γ) - c 〈 e_{i}, \hat{Σ} {\hat{m}}_{i} 〉 \geq {min}_{m} {m^{⊤} \hat{Σ} m + c (1 - α - γ) - c 〈 e_{i}, \hat{Σ} m 〉} .

Since m = ce_i/2 achieves the minimum of the right hand side, we have

{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} \geq c (1 - α - γ) - \frac{c^{2}}{4} {\hat{Σ}}_{i, i} .

If ${\hat{Σ}}_{i i} > 0$ for all 1 ≤ i ≤ p, by setting $c = 2 (1 - α - γ) / {\hat{Σ}}_{i, i}$ , we have

{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} \geq \frac{{(1 - α - γ)}^{2}}{{\hat{Σ}}_{i, i}}, \forall 1 \leq i \leq p .

(83)
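The minimization and the choice of c leading to (83) can be verified numerically. The sketch below uses a sample covariance from a standard Gaussian design and hypothetical values of α and γ. It checks that m = c e_i / 2 attains the value (1 − α − γ)² / Σ̂_{i,i} at the stated c, and that random perturbations of m never fall below it.

```python
import numpy as np

def objective(m, Sigma_hat, i, c, alpha, gamma):
    """m' S m + c (1 - alpha - gamma) - c <e_i, S m>."""
    return m @ Sigma_hat @ m + c * (1 - alpha - gamma) - c * (Sigma_hat @ m)[i]

rng = np.random.default_rng(1)
n, p, i = 50, 6, 2
X = rng.standard_normal((n, p))
Sigma_hat = X.T @ X / n
alpha, gamma = 0.05, 0.10                      # hypothetical tuning values

c = 2 * (1 - alpha - gamma) / Sigma_hat[i, i]  # the maximizing choice of c
m_star = np.zeros(p)
m_star[i] = c / 2                              # claimed minimizer c * e_i / 2

value_at_min = objective(m_star, Sigma_hat, i, c, alpha, gamma)
closed_form = (1 - alpha - gamma) ** 2 / Sigma_hat[i, i]

# Random perturbations should never go below the claimed minimum,
# because the objective is a convex quadratic with zero gradient at m_star.
perturbed = [
    objective(m_star + 0.3 * rng.standard_normal(p), Sigma_hat, i, c, alpha, gamma)
    for _ in range(200)
]
```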

Moreover, by Lemma 6 Part 2 with u = v = e_i, we have

ℙ (| {\hat{Σ}}_{i, i} - Σ_{i, i} | \geq \frac{c_{min}}{2}) \leq 2 exp (- c n) .

By the union bound,

ℙ (\exists 1 \leq i \leq p, | {\hat{Σ}}_{i, i} - Σ_{i, i} | \geq \frac{c_{min}}{2}) \leq \sum_{i = 1}^{p} ℙ (| {\hat{Σ}}_{i, i} - Σ_{i, i} | \geq \frac{c_{min}}{2}) \leq d b \cdot 2 exp (- c n) \leq 2 exp (- c n) .

Therefore, with probability at least 1 – 2 exp (−cn),

\frac{c_{min}}{2} \leq {\hat{Σ}}_{i, i} \leq C_{max} + \frac{c_{min}}{2}, \forall 1 \leq i \leq p .

(83) and the previous inequality together imply that with probability at least 1 – 2 exp (−cn),

{\hat{m}}_{i}^{⊤} \hat{Σ} {\hat{m}}_{i} \geq \frac{1}{2 C_{max}}, \forall 1 \leq i \leq p .

(24) and the previous inequality together imply (25). □

Fig. 2. — Average estimation error in the noisy case

Acknowledgments

The research of Tony Cai was supported in part by NSF grants DMS-1712735 and DMS-2015259 and NIH grants R01-GM129781 and R01-GM123056. The research of Anru R. Zhang and Yuchen Zhou was supported in part by NSF Grants CAREER-2203741 and DMS-1811868 and NIH grant R01-GM131399-01.

Biographies

T. Tony Cai received the Ph.D. degree from Cornell University, Ithaca, NY, in 1996. His research interests include high-dimensional statistics, machine learning, large-scale inference, nonparametric function estimation, functional data analysis, and statistical decision theory. He is the Daniel H. Silberberg Professor of Statistics and Data Science at the Wharton School of the University of Pennsylvania, Philadelphia, PA, USA. Dr. Cai is the recipient of the 2008 COPSS Presidents Award and a fellow of the Institute of Mathematical Statistics. He is a past editor of the Annals of Statistics.

Anru R. Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor in the Department of Biostatistics & Bioinformatics and Associate Professor in the Departments of Computer Science, Mathematics, and Statistical Science at Duke University. He was an assistant professor of statistics at the University of Wisconsin-Madison in 2015–2021. He obtained his bachelor’s degree from Peking University in 2010 and his Ph.D. from the University of Pennsylvania in 2015. His work focuses on high-dimensional statistical inference, non-convex optimization, statistical tensor analysis, computational complexity, and applications in genomics, microbiome, electronic health records, and computational imaging. He received the IMS Tweedie Award (2022), ASA Gottfried E. Noether Junior Award (2021), Bernoulli Society New Researcher Award (2021), ICSA Outstanding Young Researcher Award (2021), and NSF CAREER Award (2020).

Yuchen Zhou is a postdoctoral researcher in the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania. He received the B.E. degree from Peking University in 2016 and the Ph.D. degree in statistics from the University of Wisconsin-Madison in 2021. His research interests include high-dimensional statistical inference, tensor data analysis, reinforcement learning and statistical learning theory.

Appendix

We collect all additional technical lemmas and their proofs in this section.

Lemma 5 (Bernstein-type Inequality for Soft-thresholded Sub-Gaussian Vectors):

Suppose the rows of $X \in ℝ^{n \times p}$ are independent sub-Gaussian vectors satisfying Assumption 1. Let $w \in ℝ^{n}$ be a fixed vector and Ω a subset of {1, …, p} with |Ω| = r. Then

ℙ ({‖ \sum_{k = 1}^{n} w_{k} X_{k, Ω} ‖}_{2} \geq \sqrt{C_{max}} κ {‖ w ‖}_{2} \cdot (\sqrt{r} + \sqrt{2 t})) \leq exp (- t) .

(84)

For any fixed vector $w \in ℝ^{n}$ and fixed index subset Ω ⊆ {1, …, p} with |Ω| = r,

ℙ ({‖ H_{(δ ‖ w ‖_{2})} (w^{⊤} X_{Ω}) ‖}_{2} \geq t ‖ w ‖_{2}) \leq (\begin{matrix} r \\ ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}) \cdot exp (- {(t / (κ \sqrt{C_{max}}) - (t / δ) \land \sqrt{r})}_{+}^{2} / 2) + (\begin{matrix} r \\ ⌈ {(t / δ)}^{2} ⌉ \end{matrix}) \cdot exp (- {(t / (κ \sqrt{C_{max}}) - \sqrt{⌈ {(t / δ)}^{2} ⌉})}_{+}^{2} / 2) .

(85)

In particular, for any b ≥ r, if $\bar{λ} = C {‖ w ‖}_{2} \sqrt{\frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}}$ , ${\bar{λ}}_{g} = \sqrt{s / s_{g}} \bar{λ}$ , we have

ℙ ({‖ H_{\bar{λ}} (w^{⊤} X_{Ω}) ‖}_{2} \geq {\bar{λ}}_{g}) \leq exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) .

(86)
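A small Monte Carlo experiment illustrates the first tail bound (84). The sketch assumes a standard Gaussian design with Σ = I, so that one may take κ = 1 and C_max = 1. The values of n, r, t and the weight vector are hypothetical and chosen only for illustration.

```python
import math
import random

def exceed_frequency(n, r, t, trials, seed=0):
    """Fraction of trials with ||sum_k w_k X_{k,Omega}||_2 >= ||w||_2 (sqrt(r) + sqrt(2t))."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(n)]          # fixed weight vector
    w_norm = math.sqrt(sum(x * x for x in w))
    threshold = w_norm * (math.sqrt(r) + math.sqrt(2 * t))
    hits = 0
    for _ in range(trials):
        # coordinates of sum_k w_k X_{k,Omega}, with X_{k,j} i.i.d. N(0, 1)
        z = [sum(wk * rng.gauss(0, 1) for wk in w) for _ in range(r)]
        if math.sqrt(sum(x * x for x in z)) >= threshold:
            hits += 1
    return hits / trials

freq = exceed_frequency(n=20, r=5, t=1.0, trials=2000)
bound = math.exp(-1.0)                                  # right-hand side of (84) at t = 1
```

In this Gaussian setting the empirical frequency sits well below exp(−t), which is consistent with (84) being an upper bound rather than a sharp tail estimate.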

Proof of Lemma 5. We only need to focus on the case where ∥w∥₂ = 1. Let $W_{Ω} = X_{Ω} Σ_{Ω, Ω}^{- 1 / 2}$ ; then the rows $W_{1, Ω}, \dots, W_{n, Ω}$ are isotropic sub-Gaussian vectors. Hence, for any fixed w, w^⊤W_Ω is also an isotropic sub-Gaussian vector such that for any $α \in ℝ^{r}$ ,

E exp (w^{⊤} W_{Ω} α) = E exp (w^{⊤} X_{Ω} Σ_{Ω, Ω}^{- 1 / 2} α) = E exp (w^{⊤} X Σ^{- 1 / 2} {(Σ^{1 / 2})}_{\cdot, Ω} Σ_{Ω, Ω}^{- 1 / 2} α) \leq exp (κ^{2} {‖ {(Σ^{1 / 2})}_{\cdot, Ω} Σ_{Ω, Ω}^{- 1 / 2} α ‖}_{2}^{2} / 2) = exp (κ^{2} ‖ α ‖_{2}^{2} / 2) .

The last equality holds since ${(Σ^{1 / 2})}_{Ω, \cdot} {(Σ^{1 / 2})}_{\cdot, Ω} = {(Σ^{1 / 2} Σ^{1 / 2})}_{Ω, Ω} = Σ_{Ω, Ω}$ .

By the tail inequality of sub-Gaussian quadratic form ([60, Theorem 2.1]),

ℙ ({‖ w^{⊤} W_{Ω} ‖}_{2}^{2} \geq κ^{2} (r + 2 \sqrt{r t} + 2 t)) \leq exp (- t) .

Taking the square root in the previous inequality, and using $r + 2 \sqrt{r t} + 2 t \leq {(\sqrt{r} + \sqrt{2 t})}^{2}$ together with ∥w∥₂ = 1, we have

ℙ ({‖ w^{⊤} W_{Ω} ‖}_{2} \geq κ {‖ w ‖}_{2} \cdot (\sqrt{r} + \sqrt{2 t})) \leq exp (- t) .

Also note that

{‖ w^{⊤} X_{Ω} ‖}_{2} = {‖ w^{⊤} W_{Ω} Σ_{Ω, Ω}^{1 / 2} ‖}_{2} \leq ‖ Σ_{Ω, Ω}^{1 / 2} ‖ {‖ w^{⊤} W_{Ω} ‖}_{2} \leq {‖ Σ ‖}^{1 / 2} {‖ w^{⊤} W_{Ω} ‖}_{2} \leq \sqrt{C_{max}} {‖ w^{⊤} W_{Ω} ‖}_{2},

we obtain (84).

For the second part of the proof, note that

ℙ (‖ H_{δ} (w^{⊤} X_{Ω}) ‖ \geq t) \leq ℙ (\exists Λ \subseteq Ω, such that all entries of | w^{⊤} X_{Λ} | \geq δ and {‖ w^{⊤} X_{Λ} ‖}_{2} \geq t) \leq ℙ (\exists Λ \subseteq Ω, \sqrt{| Λ |} δ \leq t, {‖ w^{⊤} X_{Λ} ‖}_{2} \geq t) + ℙ (\exists Λ \subseteq Ω, \sqrt{| Λ |} δ > t, all entries of | w^{⊤} X_{Λ} | \geq δ) \leq \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}} ℙ ({‖ w^{⊤} X_{Λ} ‖}_{2} \geq t) + \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌈ {(t / δ)}^{2} ⌉ \end{matrix}} ℙ ({‖ w^{⊤} X_{Λ} ‖}_{2} \geq t) .
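The first inequality in the display above rests on a deterministic fact: if Λ collects the coordinates of a vector z with |z_i| ≥ δ, then H_δ(z) vanishes off Λ and ‖H_δ(z)‖₂ ≤ ‖z_Λ‖₂. The following sketch checks this on random vectors. The helper names (`soft_threshold`, `witness_set`) and the test values are hypothetical.

```python
import math
import random

def soft_threshold(z, delta):
    """Entrywise H_delta(z) = sgn(z) (|z| - delta)_+."""
    return [math.copysign(max(abs(x) - delta, 0.0), x) for x in z]

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def witness_set(z, delta):
    """Indices Lambda on which |z_i| >= delta."""
    return [i for i, x in enumerate(z) if abs(x) >= delta]

random.seed(2)
delta = 1.0
checks = []
for _ in range(500):
    z = [random.gauss(0, 1.5) for _ in range(12)]
    lam = witness_set(z, delta)
    z_lam = [z[i] for i in lam]
    # ||H_delta(z)||_2 is carried by Lambda and is at most ||z_Lambda||_2
    checks.append(l2(soft_threshold(z, delta)) <= l2(z_lam) + 1e-12)
```

The proof then splits on whether √|Λ| δ ≤ t, which is what produces the two sums over subsets of fixed cardinality.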

By the first part of this lemma,

ℙ ({‖ w^{⊤} X_{Λ} ‖}_{2} \geq \sqrt{C_{max}} κ {‖ w ‖}_{2} t) \leq exp (- {(t - \sqrt{| Λ |})}_{+}^{2} / 2) .

Plugging this into the previous inequality, one has

ℙ (‖ H_{δ} (w^{⊤} X_{Ω}) ‖ \geq t) \leq (\begin{matrix} r \\ ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}) \cdot exp (- {(t / (κ \sqrt{C_{max}}) - (t / δ) \land \sqrt{r})}_{+}^{2} / 2) + (\begin{matrix} r \\ ⌈ {(t / δ)}^{2} ⌉ \end{matrix}) \cdot exp (- {(t / (κ \sqrt{C_{max}}) - \sqrt{⌈ {(t / δ)}^{2} ⌉})}_{+}^{2} / 2) .

Specifically, if $δ = C \sqrt{\frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s}}$ , $t = \sqrt{s / s_{g}} δ$ ,

t / (κ \sqrt{C_{max}}) - \sqrt{⌈ {(t / δ)}^{2} ⌉} \geq C \sqrt{\frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}} - \sqrt{2 \frac{s}{s_{g}}} \geq C \sqrt{\frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}} .

Therefore, (85) shows that

ℙ ({‖ H_{\bar{λ}} (w^{⊤} X_{Ω}) ‖}_{2} \geq {\bar{λ}}_{g}) \leq r^{⌊ {(t / δ)}^{2} ⌋} exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) + r^{⌈ {(t / δ)}^{2} ⌉} exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq 2 r^{2 s / s_{g}} exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq exp (log 2 + \frac{2 s log (e b)}{s_{g}} - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) .

□

Lemma 6 (sub-Gaussian quadratic form concentrations): Suppose $Z \in ℝ^{p}$ is a sub-Gaussian vector satisfying Assumption 1.

For any fixed $u, v \in ℝ^{p}$ , u, v ≠ 0, u^⊤ZZ^⊤v is sub-exponential such that for every t > 0,
$ℙ (| u^{⊤} Z Z^{⊤} v - E u^{⊤} Z Z^{⊤} v | \geq t {‖ u ‖}_{2} {‖ v ‖}_{2}) \leq C exp (- c t / κ^{2}) .$ (87)
In addition, suppose $X = {[X_{1}^{⊤}, \dots, X_{n}^{⊤}]}^{⊤} \in ℝ^{n \times p}$ is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1,
$ℙ (| \frac{1}{n} \sum_{k = 1}^{n} u^{⊤} X_{k} X_{k}^{⊤} v - u^{⊤} Σ v | \geq t {‖ u ‖}_{2} {‖ v ‖}_{2}) \leq 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .$ (88)
More generally, for any fixed matrix $U \in ℝ^{p \times r}$ , the following concentration inequality in spectral norm holds,
$ℙ ({‖ \frac{1}{n} \sum_{k = 1}^{n} U^{⊤} X_{k} X_{k}^{⊤} v - U^{⊤} Σ v ‖}_{2} \geq t ‖ U ‖ {‖ v ‖}_{2}) \leq 2 exp (C r - c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .$ (89)
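The concentration in (88) can be illustrated by simulation. The sketch below assumes a Gaussian design with a randomly generated positive definite Σ (a hypothetical choice). It does not attempt to recover the constant c, and only shows that the deviation of u^⊤ Σ̂ v from u^⊤ Σ v shrinks as n grows.

```python
import numpy as np

def mean_deviation(n, Sigma_half, u, v, trials, rng):
    """Average of |u' Sigma_hat v - u' Sigma v| over independent Gaussian designs."""
    Sigma = Sigma_half @ Sigma_half
    devs = []
    for _ in range(trials):
        X = rng.standard_normal((n, Sigma.shape[0])) @ Sigma_half   # rows ~ N(0, Sigma)
        devs.append(abs(u @ (X.T @ X / n) @ v - u @ Sigma @ v))
    return float(np.mean(devs))

rng = np.random.default_rng(3)
p = 5
A = rng.standard_normal((p, p))
S = A @ A.T / p + np.eye(p)                                         # hypothetical covariance
evals, evecs = np.linalg.eigh(S)
Sigma_half = evecs @ np.diag(np.sqrt(evals)) @ evecs.T              # symmetric square root
u, v = np.eye(p)[0], np.eye(p)[1]

dev_small_n = mean_deviation(50, Sigma_half, u, v, trials=100, rng=rng)
dev_large_n = mean_deviation(5000, Sigma_half, u, v, trials=100, rng=rng)
```

For small t the bound in (88) behaves like exp(−c n t² / κ⁴), so typical deviations are expected to scale roughly like 1/√n; the two averages above are one way to see that trend.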

Proof of Lemma 6. Since we can rescale u and v without essentially changing the problem, without loss of generality we assume ∥u∥₂ = ∥v∥₂ = 1. Let A = uv^⊤; then u^⊤ZZ^⊤v = Z^⊤uv^⊤Z = Z^⊤AZ. By Assumption 1, $E Z = 0$ and ${‖ 〈 Z, e_{i} 〉 ‖}_{ψ_{2}} \leq C κ$ . By the Hanson-Wright inequality ([61, Theorem 1.1]),

ℙ (| u^{⊤} Z Z^{⊤} v - E u^{⊤} Z Z^{⊤} v | \geq t) = ℙ (| Z^{⊤} A Z - E Z^{⊤} A Z | \geq t) \leq 2 exp [- c min (\frac{t^{2}}{κ^{4} {‖ A ‖}_{H S}^{2}}, \frac{t}{κ^{2} ‖ A ‖})] \leq 2 exp [- c min (\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}})],

where

{‖ A ‖}_{H S} = {(\sum_{i, j} {| a_{i, j} |}^{2})}^{1 / 2} = {(\sum_{i, j} {| u_{i} v_{j} |}^{2})}^{1 / 2} = {‖ u ‖}_{2} {‖ v ‖}_{2} = 1, ‖ A ‖ = max_{{‖ x ‖}_{2} \leq 1} {‖ A x ‖}_{2} = max_{{‖ x ‖}_{2} \leq 1} {‖ u v^{⊤} x ‖}_{2} = {‖ u ‖}_{2} max_{{‖ x ‖}_{2} \leq 1} | v^{⊤} x | = {‖ u ‖}_{2} {‖ v ‖}_{2} = 1.
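Both norm identities for the rank-one matrix A = uv^⊤ are easy to confirm numerically. The sketch below uses arbitrary (not unit-norm) vectors, so it checks the general statements ‖A‖_HS = ‖u‖₂‖v‖₂ and ‖A‖ = ‖u‖₂‖v‖₂; the dimension and the random vectors are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
u = rng.standard_normal(6)
v = rng.standard_normal(6)
A = np.outer(u, v)                       # A = u v^T

hs_norm = np.linalg.norm(A, "fro")       # Hilbert-Schmidt (Frobenius) norm
op_norm = np.linalg.norm(A, 2)           # spectral norm (largest singular value)
product = np.linalg.norm(u) * np.linalg.norm(v)
```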

Therefore, for every t ≥ κ²,

ℙ (| u^{⊤} Z Z^{⊤} v - E u^{⊤} Z Z^{⊤} v | \geq t) \leq 2 exp (- c t / κ^{2}) .

Thus, there exists a constant c < log 2 such that, for every t ≥ 0 (when t < κ² the right-hand side below exceeds 1, so the bound holds trivially),

ℙ (| u^{⊤} Z Z^{⊤} v - E u^{⊤} Z Z^{⊤} v | \geq t) \leq 2 exp (- c t / κ^{2}) .

Since $E u^{⊤} X_{k} X_{k}^{⊤} v = u^{⊤} Σ v$ for all 1 ≤ k ≤ n, the Bernstein-type concentration inequality (cf. [62, Proposition 5.16]) gives

ℙ (| \frac{1}{n} \sum_{k = 1}^{n} u^{⊤} X_{k} X_{k}^{⊤} v - u^{⊤} Σ v | \geq t) \leq 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .

This finishes the proof of (88).

Finally, we consider (89), which follows from an ε-net argument and the result in (88). For any $w \in ℝ^{r}$ with ∥w∥₂ = 1, setting u = Uw in (88) gives

ℙ (| \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \geq \frac{t}{2} {‖ U w ‖}_{2} {‖ v ‖}_{2}) \leq 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .

By [62, Lemma 5.3], we can find a $\frac{1}{2}$ -net $𝒩_{\frac{1}{2}}$ of $S^{r - 1} = {x ∣ x \in ℝ^{r}, {‖ x ‖}_{2} = 1}$ with $| 𝒩_{\frac{1}{2}} | \leq 5^{r}$ . By the union bound,

ℙ (\exists w \in 𝒩_{\frac{1}{2}}, | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \geq \frac{t}{2} {‖ U w ‖}_{2} {‖ v ‖}_{2}) \leq 5^{r} \cdot 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .

(90)

For any $g \in ℝ^{r}$ , g ≠ 0, set $x = \frac{g}{{‖ g ‖}_{2}} \in {arg max}_{w \in ℝ^{r}, {‖ w ‖}_{2} = 1} | w^{⊤} g |$ ; then we can find $y \in 𝒩_{\frac{1}{2}}$ such that ${‖ x - y ‖}_{2} \leq \frac{1}{2}$ . By the triangle inequality,

{‖ g ‖}_{2} - | y^{⊤} g | = | x^{⊤} g | - | y^{⊤} g | \leq | x^{⊤} g - y^{⊤} g | \leq {‖ x - y ‖}_{2} {‖ g ‖}_{2} \leq \frac{1}{2} {‖ g ‖}_{2} .

Therefore,

sup_{w \in ℝ^{r}, {‖ w ‖}_{2} = 1} | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \leq 2 sup_{w \in 𝒩_{\frac{1}{2}}} | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | .
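The net argument can be made concrete in dimension r = 2, where a 1/2-net of the unit circle is easy to write down. The sketch below uses seven equally spaced points, an illustrative construction that is not taken from [62], and checks that for random g the net always contains a point y with |y^⊤ g| ≥ ‖g‖₂ / 2. This is exactly what yields the factor 2 in the display above.

```python
import math
import random

def circle_net(m):
    """m equally spaced points on the unit circle S^1."""
    return [(math.cos(2 * math.pi * k / m), math.sin(2 * math.pi * k / m)) for k in range(m)]

def covering_radius(m):
    """Largest Euclidean distance from a point of S^1 to the m-point net."""
    return 2 * math.sin(math.pi / (2 * m))

net = circle_net(7)                       # 7 points give a 1/2-net of S^1
radius = covering_radius(7)

random.seed(5)
ratios = []
for _ in range(1000):
    g = (random.gauss(0, 1), random.gauss(0, 1))
    g_norm = math.hypot(*g)
    best = max(abs(y[0] * g[0] + y[1] * g[1]) for y in net)
    ratios.append(best / g_norm)          # should be at least 1/2
```

The net size 7 is well within the general bound 5^r = 25 for r = 2.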

Combining (90) with the previous inequality, and noting that $‖ U ‖ = {sup}_{w \in ℝ^{r}, {‖ w ‖}_{2} = 1} {‖ U w ‖}_{2}$ , we have

ℙ (sup_{w \in ℝ^{r}, {‖ w ‖}_{2} = 1} | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \geq t ‖ U ‖ {‖ v ‖}_{2}) \leq ℙ (sup_{w \in 𝒩_{\frac{1}{2}}} | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \geq \frac{t}{2} ‖ U ‖ {‖ v ‖}_{2}) \leq ℙ (\exists w \in 𝒩_{\frac{1}{2}}, | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v | \geq \frac{t}{2} {‖ U w ‖}_{2} {‖ v ‖}_{2}) \leq 5^{r} \cdot 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .

(91)

Finally, since

{‖ \frac{1}{n} \sum_{k = 1}^{n} U^{⊤} X_{k} X_{k}^{⊤} v - U^{⊤} Σ v ‖}_{2} = sup_{w \in ℝ^{r}, {‖ w ‖}_{2} = 1} | \frac{1}{n} \sum_{k = 1}^{n} w^{⊤} U^{⊤} X_{k} X_{k}^{⊤} v - w^{⊤} U^{⊤} Σ v |,

we have proved (89). □

We collect the random matrix properties of X in the following lemma. These properties will be extensively used in the main content of the paper.

Lemma 7: Suppose $X = {[X_{1}^{⊤}, \dots, X_{n}^{⊤}]}^{⊤} \in ℝ^{n \times p}$ is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1.

Suppose T ⊆ {1, …, p} has cardinality s. Then,
$ℙ (‖ \frac{1}{n} X_{T}^{⊤} X_{T} Σ_{T, T}^{- 1} - I_{s} ‖ \geq t) \leq 2 exp (C s - c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}});$ (92)

For any fixed vector $α \in ℝ^{s}$ , δ > 0, fixed index subset Ω ⊆ T^c satisfying |Ω| = r, and $t \geq δ \geq C ({max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}) {‖ α ‖}_{2}$ ,

ℙ ({‖ H_{δ} (α^{⊤} X_{T}^{⊤} X_{Ω} / n) ‖}_{2} \geq t) \leq (\begin{matrix} r \\ ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}) \cdot exp (C ⌊ {(t / δ)}^{2} ⌋ \land r - c n min {\frac{t^{2}}{κ^{4} {‖ α ‖}_{2}^{2}}, \frac{t}{κ^{2} {‖ α ‖}_{2}}}) + {(\begin{matrix} r \\ ⌈ {(t / δ)}^{2} ⌉ \end{matrix})}_{+} \cdot exp (C ⌈ {(t / δ)}^{2} ⌉ - c n min {\frac{t^{2}}{κ^{4} {‖ α ‖}_{2}^{2}}, \frac{t}{κ^{2} {‖ α ‖}_{2}}}) .

(93)

Here, H_λ(·) is the soft-thresholding estimator at level λ.

Proof of Lemma 7.

The first statement is proved via an ε-net argument. Denote $W_{T} = X_{T} Σ_{T, T}^{- 1 / 2}$ ; then the rows of W_T are independent isotropic sub-Gaussian vectors. For any fixed vector $x \in S^{s - 1} = {x : x \in ℝ^{s}, {‖ x ‖}_{2} = 1}$ , by [62, Lemma 5.5], $Z_{i} = 〈 {(W_{T})}_{i \cdot}^{⊤}, x 〉$ are independent sub-Gaussian random variables with $E Z_{i}^{2} = 1$ and ${‖ Z_{i} ‖}_{ψ_{2}} \leq C κ$ . Therefore, by Remark 5.18 and Lemma 5.14 in [62], ${‖ Z_{i}^{2} - 1 ‖}_{ψ_{1}} \leq 2 {‖ Z_{i}^{2} ‖}_{ψ_{1}} \leq 4 {‖ Z_{i} ‖}_{ψ_{2}}^{2} \leq C κ^{2}$ . The Bernstein-type inequality shows that
$ℙ (| \frac{1}{n} {‖ W_{T} x ‖}_{2}^{2} - 1 | \geq \frac{t}{2}) = ℙ (| \frac{1}{n} \sum_{i = 1}^{n} (Z_{i}^{2} - 1) | \geq \frac{t}{2}) \leq 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .$

By [62, Lemma 5.2], we can find a $\frac{1}{4}$ -net $𝒩_{\frac{1}{4}}$ of $S^{s - 1} = {x : x \in ℝ^{s}, {‖ x ‖}_{2} = 1}$ with $| 𝒩_{\frac{1}{4}} | \leq 9^{s}$ . The union bound tells us
$ℙ (max_{x \in 𝒩_{\frac{1}{4}}} | \frac{1}{n} {‖ W_{T} x ‖}_{2}^{2} - 1 | \geq \frac{t}{2}) \leq 9^{s} \cdot 2 exp (- c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .$ (94)

By [62, Lemma 5.4],
$‖ \frac{1}{n} W_{T}^{⊤} W_{T} - I_{s} ‖ \leq 2 max_{x \in 𝒩_{\frac{1}{4}}} | 〈 (\frac{1}{n} W_{T}^{⊤} W_{T} - I_{s}) x, x 〉 | = 2 max_{x \in 𝒩_{\frac{1}{4}}} | \frac{1}{n} {‖ W_{T} x ‖}_{2}^{2} - 1 | .$ (95)

Since c_min ≤ σ_min(Σ) ≤ σ_max(Σ) ≤ C_max, we have $‖ Σ_{T, T}^{1 / 2} ‖ \leq \sqrt{C_{max}}$ and $‖ Σ_{T, T}^{- 1 / 2} ‖ \leq 1 / \sqrt{c_{min}}$ . Therefore,
$‖ \frac{1}{n} X_{T}^{⊤} X_{T} Σ_{T, T}^{- 1} - I_{s} ‖ = ‖ Σ_{T, T}^{1 / 2} (\frac{1}{n} W_{T}^{⊤} W_{T} - I_{s}) Σ_{T, T}^{- 1 / 2} ‖ \leq ‖ Σ_{T, T}^{1 / 2} ‖ ‖ \frac{1}{n} W_{T}^{⊤} W_{T} - I_{s} ‖ ‖ Σ_{T, T}^{- 1 / 2} ‖ \leq \sqrt{\frac{C_{max}}{c_{min}}} ‖ \frac{1}{n} W_{T}^{⊤} W_{T} - I_{s} ‖ .$ (96)
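The similarity-transform identity behind (96), together with the resulting norm bound, can be checked numerically. In the sketch below the covariance block `Sigma_TT` is a randomly generated positive definite matrix (a hypothetical stand-in), and its extreme eigenvalues play the roles of C_max and c_min.

```python
import numpy as np

def sym_power(S, power):
    """Symmetric matrix power via the eigendecomposition of a positive definite S."""
    evals, evecs = np.linalg.eigh(S)
    return evecs @ np.diag(evals ** power) @ evecs.T

rng = np.random.default_rng(6)
n, s = 200, 4
B = rng.standard_normal((s, s))
Sigma_TT = B @ B.T / s + np.eye(s)                  # hypothetical covariance block
X_T = rng.standard_normal((n, s)) @ sym_power(Sigma_TT, 0.5)
W_T = X_T @ sym_power(Sigma_TT, -0.5)               # whitened design

lhs = X_T.T @ X_T @ np.linalg.inv(Sigma_TT) / n - np.eye(s)
core = W_T.T @ W_T / n - np.eye(s)
rhs = sym_power(Sigma_TT, 0.5) @ core @ sym_power(Sigma_TT, -0.5)

evals = np.linalg.eigvalsh(Sigma_TT)
cond_factor = np.sqrt(evals.max() / evals.min())    # plays the role of sqrt(C_max / c_min)
```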

Combining (94), (95) and (96), we arrive at (92).

Now we consider the proof of (93). Note that ${‖ H_{δ} (α^{⊤} X_{T}^{⊤} X_{Ω} / n) ‖}_{2} \geq t$ implies that there exists Λ ⊆ Ω such that all entries of $| α^{⊤} X_{T}^{⊤} X_{Λ} / n |$ are greater than δ, and ${‖ | α^{⊤} X_{T}^{⊤} X_{Λ} / n | - δ ‖}_{2} \geq t$ . Thus,

ℙ ({‖ H_{δ} (α^{⊤} X_{T}^{⊤} X_{Ω} / n) ‖}_{2} \geq t) \leq ℙ (\exists Λ \subseteq Ω, such that all entries of | α^{⊤} X_{T}^{⊤} X_{Λ} / n | \geq δ, and {‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) \leq ℙ (\exists Λ \subseteq Ω, \sqrt{| Λ |} δ \leq t, {‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) + ℙ (\exists Λ \subseteq Ω, \sqrt{| Λ |} δ > t, all entries of | α^{⊤} X_{T}^{⊤} X_{Λ} / n | \geq δ) \leq \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌈ {(t / δ)}^{2} ⌉ \end{matrix}} ℙ (all entries of | α^{⊤} X_{T}^{⊤} X_{Λ} / n | \geq δ) + \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}} ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) \leq \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌈ {(t / δ)}^{2} ⌉ \end{matrix}} ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) + \sum_{\begin{matrix} Λ \subseteq Ω \\ | Λ | = ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}} ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) .

(97)

Since $t \geq δ \geq C {max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} {‖ α ‖}_{2}$ , we know that whether |Λ| = ⌊(t/δ)²⌋ ∧ r or |Λ| = ⌈(t/δ)²⌉,

2 C_{max} \sqrt{⌈ {(t / δ)}^{2} ⌉} ({max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}) ‖ α ‖_{2} \leq 2 C_{max} \sqrt{2} (t / δ) ({max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2}) ‖ α ‖_{2} \leq t .

By Part 3 of Lemma 6, for any Λ ⊆ Ω and $t \geq 2 C_{max} \sqrt{| Λ |} {max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} {‖ α ‖}_{2}$ , we have

ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t) \leq ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t - {‖ E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2}) \leq ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t - {‖ Σ_{Λ, T} α ‖}_{2}) = ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t - {(\sum_{i \in Λ} {(Σ_{i, T} α)}^{2})}^{1 / 2}) \leq ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t - \sqrt{| Λ |} max_{i \in T^{c}} | Σ_{i, T} α |) \leq ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t - \sqrt{| Λ |} {max}_{i \in T^{c}} {‖ Σ_{i, T} Σ_{T, T}^{- 1} ‖}_{2} ‖ Σ_{T, T} ‖ {‖ α ‖}_{2}) \leq ℙ ({‖ α^{⊤} X_{T}^{⊤} X_{Λ} / n - E α^{⊤} X_{T}^{⊤} X_{Λ} / n ‖}_{2} \geq t / 2) \leq 2 exp (C | Λ | - c n min {\frac{t^{2}}{κ^{4} {‖ α ‖}_{2}^{2}}, \frac{t}{κ^{2} {‖ α ‖}_{2}}}) .

Combining (97) and the previous inequality, one obtains

ℙ ({‖ H_{δ} (α^{⊤} X_{T}^{⊤} X_{Ω} / n) ‖}_{2} \geq t) \leq (\begin{matrix} r \\ ⌊ {(t / δ)}^{2} ⌋ \land r \end{matrix}) \cdot exp (C ⌊ {(t / δ)}^{2} ⌋ \land r - c n min {\frac{t^{2}}{κ^{4} {‖ α ‖}_{2}^{2}}, \frac{t}{κ^{2} {‖ α ‖}_{2}}}) + {(\begin{matrix} r \\ ⌈ {(t / δ)}^{2} ⌉ \end{matrix})}_{+} \cdot exp (C ⌈ {(t / δ)}^{2} ⌉ - c n min {\frac{t^{2}}{κ^{4} {‖ α ‖}_{2}^{2}}, \frac{t}{κ^{2} {‖ α ‖}_{2}}}) .

□

Lemma 8 (Properties of Soft-thresholding):

Suppose a, b > 0, $x, y \in ℝ$ , and H_a(·) is the soft-thresholding operator satisfying H_a(x) = sgn(x) · (|x| − a)₊. Then the following triangle inequality holds,
$| H_{a + b} (x + y) | \leq | H_{a} (x) | + | H_{b} (y) | .$ (98)
Suppose a, b > 0, $x, y \in ℝ^{p}$ , if ∥H_a(x)∥_∞,2 ≤ b, then
$| 〈 x, y 〉 | \leq a {‖ y ‖}_{1} + b {‖ y ‖}_{1, 2} .$ (99)

Proof of Lemma 8.

| H_{a + b} (x + y) | = {(| x + y | - a - b)}_{+} \leq {(| x | - a + | y | - b)}_{+} \leq {(| x | - a)}_{+} + {(| y | - b)}_{+} = | H_{a} (x) | + | H_{b} (y) | .

| 〈 x, y 〉 | \leq | 〈 H_{a} (x), y 〉 | + | 〈 x - H_{a} (x), y 〉 | = | \sum_{j = 1}^{d} 〈 {[H_{a} (x)]}_{(j)}, y_{(j)} 〉 | + | 〈 x - H_{a} (x), y 〉 | \leq \sum_{j = 1}^{d} {‖ {[H_{a} (x)]}_{(j)} ‖}_{2} {‖ y_{(j)} ‖}_{2} + {‖ x - H_{a} (x) ‖}_{\infty} {‖ y ‖}_{1} \leq {‖ H_{a} (x) ‖}_{\infty, 2} {‖ y ‖}_{1, 2} + a {‖ y ‖}_{1} \leq b {‖ y ‖}_{1, 2} + a {‖ y ‖}_{1} .

□
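Both parts of Lemma 8 are deterministic, so they can be tested directly on random inputs. The sketch below is illustrative only: the group layout, the thresholds, and the helper names (`soft_threshold`, `norm_inf_2`, `norm_1_2`) are hypothetical, and in the check of (99) the constant b is taken to be the smallest admissible value ‖H_a(x)‖_{∞,2}.

```python
import math
import random

def soft_threshold(x, a):
    """Entrywise H_a(x) = sgn(x) (|x| - a)_+."""
    return [math.copysign(max(abs(t) - a, 0.0), t) for t in x]

def group_norms(x, groups):
    return [math.sqrt(sum(x[i] ** 2 for i in g)) for g in groups]

def norm_inf_2(x, groups):        # largest group-wise Euclidean norm
    return max(group_norms(x, groups))

def norm_1_2(x, groups):          # sum of group-wise Euclidean norms
    return sum(group_norms(x, groups))

random.seed(7)
groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
a, b = 0.8, 0.5
part1, part2 = [], []
for _ in range(500):
    x = [random.gauss(0, 1) for _ in range(12)]
    y = [random.gauss(0, 1) for _ in range(12)]
    # (98), entrywise: |H_{a+b}(x+y)| <= |H_a(x)| + |H_b(y)|
    hxy = soft_threshold([s + t for s, t in zip(x, y)], a + b)
    hx, hy = soft_threshold(x, a), soft_threshold(y, b)
    part1.append(all(abs(p) <= abs(q) + abs(r) + 1e-12 for p, q, r in zip(hxy, hx, hy)))
    # (99), with b replaced by the smallest admissible value ||H_a(x)||_{inf,2}
    b_star = norm_inf_2(hx, groups)
    inner = abs(sum(s * t for s, t in zip(x, y)))
    part2.append(inner <= a * sum(abs(t) for t in y) + b_star * norm_1_2(y, groups) + 1e-12)
```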

Lemma 9: Suppose $X = {[X_{1}^{⊤}, \dots, X_{n}^{⊤}]}^{⊤}$ is a random matrix with independent random sub-Gaussian rows satisfying Assumption 1, and $ε_{i} \overset{i . i . d .}{~} N (0, σ^{2})$ . Suppose T ⊆ {1, …, p} has cardinality s, and $P \in ℝ^{n \times n}$ is a projection matrix independent of X_T. Then, for any t ≥ log(es),

ℙ ({‖ X_{T}^{⊤} P ε ‖}_{\infty} \geq C κ \sqrt{n t σ^{2}}) \leq e^{- n} + e^{- C t} .

Proof of Lemma 9. For a fixed vector $w \in ℝ^{n}$ , since Assumption 2 is satisfied, for i ∈ T, X_1i, …, X_ni are independent sub-Gaussian random variables such that

E exp (t X_{j i}) = E exp (t e_{i}^{⊤} Σ^{1 / 2} Σ^{- 1 / 2} X_{j \cdot}^{⊤}) \leq exp (\frac{κ^{2} {‖ Σ^{1 / 2} e_{i} ‖}_{2}^{2} t^{2}}{2}) \leq exp (\frac{κ^{2} Σ_{i, i} t^{2}}{2}) \leq exp (\frac{C_{max} κ^{2} t^{2}}{2}) .

By Hoeffding-type inequality,

ℙ (| X_{\cdot i}^{⊤} w | \geq t {‖ w ‖}_{2}) \leq 2 exp (- c \frac{t^{2}}{κ^{2}}) .

(100)

Moreover, by [63, Lemma 1], for any x ≥ 0,

ℙ (\sum_{i = 1}^{n} ε_{i}^{2} \geq (n + 2 \sqrt{n x} + 2 x) σ^{2}) \leq e^{- x} .

Setting x = n in the last inequality and noting that $n + 2 \sqrt{n \cdot n} + 2 n = 5 n$ , we have

ℙ ({‖ ε ‖}_{2} \geq \sqrt{5 n σ^{2}}) \leq e^{- n} .

(101)
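The bound (101) can be illustrated by simulation. The sketch below uses hypothetical values of n and σ, confirms the arithmetic n + 2√(n·n) + 2n = 5n, and estimates how often ‖ε‖₂² exceeds 5nσ² for Gaussian noise.

```python
import math
import random

def tail_frequency(n, sigma, trials, seed=0):
    """Fraction of trials with ||eps||_2 >= sqrt(5 n sigma^2), eps_i i.i.d. N(0, sigma^2)."""
    rng = random.Random(seed)
    threshold_sq = 5 * n * sigma ** 2
    hits = 0
    for _ in range(trials):
        if sum(rng.gauss(0, sigma) ** 2 for _ in range(n)) >= threshold_sq:
            hits += 1
    return hits / trials

n, sigma = 3, 2.0
# With x = n: n + 2 sqrt(n * n) + 2 n = 5 n
threshold_check = n + 2 * math.sqrt(n * n) + 2 * n
freq = tail_frequency(n, sigma, trials=20000)
bound = math.exp(-n)                      # right-hand side of (101)
```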

Combining (100) and (101), and noting that ∥Pε∥₂ ≤ ∥ε∥₂, we have

ℙ ({‖ X_{T}^{⊤} P ε ‖}_{\infty} \geq C κ \sqrt{n t σ^{2}}) \leq \sum_{i \in T} ℙ (| X_{\cdot i}^{⊤} P ε | \geq C κ \sqrt{n t σ^{2}}) \leq ℙ ({‖ P ε ‖}_{2} \geq \sqrt{5 n σ^{2}}) + \sum_{i \in T} ℙ (| X_{\cdot i}^{⊤} P ε | \geq C κ \sqrt{n t σ^{2}}, {‖ P ε ‖}_{2} \leq \sqrt{5 n σ^{2}}) \leq ℙ ({‖ ε ‖}_{2} \geq \sqrt{5 n σ^{2}}) + \sum_{i \in T} ℙ (| X_{\cdot i}^{⊤} P ε | \geq C κ \sqrt{n t σ^{2}} ∣ {‖ P ε ‖}_{2} \leq \sqrt{5 n σ^{2}}) \leq e^{- n} + s \cdot 2 exp (- C t) \leq e^{- n} + e^{- C t} .

□

Lemma 10: With probability at least $1 - C e^{- c n / s}$ , the approximate dual certificate defined in (45) can be written as u = X^⊤w, where ${‖ w ‖}_{2} \leq C \sqrt{s / n}$ .

Proof of Lemma 10. By (45), we have u = X^⊤w, where $w = {(w_{1}^{⊤}, \dots, w_{l_{max}}^{⊤})}^{⊤}$ and $w_{l} = \frac{1}{n_{l}} X_{I_{l}, T} Σ_{T, T}^{- 1} q_{l - 1}$ . Thus ${‖ w ‖}_{2}^{2} = \sum_{l = 1}^{l_{max}} {‖ w_{l} ‖}_{2}^{2}$ . Also note that

\frac{1}{n_{l}} {‖ X_{I_{l}, T} Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2}^{2} = 〈 \frac{1}{n_{l}} X_{I_{l}, T}^{⊤} X_{I_{l}, T} Σ_{T, T}^{- 1} q_{l - 1}, Σ_{T, T}^{- 1} q_{l - 1} 〉 = 〈 (\frac{1}{n_{l}} X_{I_{l}, T}^{⊤} X_{I_{l}, T} Σ_{T, T}^{- 1} - I_{| T |}) q_{l - 1}, Σ_{T, T}^{- 1} q_{l - 1} 〉 + {‖ Σ_{T, T}^{- 1 / 2} q_{l - 1} ‖}_{2}^{2} = 〈 - q_{l}, Σ_{T, T}^{- 1} q_{l - 1} 〉 + {‖ Σ_{T, T}^{- 1 / 2} q_{l - 1} ‖}_{2}^{2} \leq {‖ q_{l} ‖}_{2} {‖ Σ_{T, T}^{- 1} q_{l - 1} ‖}_{2} + {‖ Σ_{T, T}^{- 1 / 2} q_{l - 1} ‖}_{2}^{2} \leq \frac{1}{c_{min}} {‖ q_{l} ‖}_{2} {‖ q_{l - 1} ‖}_{2} + \frac{1}{c_{min}} {‖ q_{l - 1} ‖}_{2}^{2} \leq \frac{2}{c_{min}} {‖ q_{l - 1} ‖}_{2}^{2}

By (53), with probability at least 1 − C exp(−cn/s),

{‖ w ‖}_{2}^{2} \leq \sum_{l = 1}^{l_{max}} \frac{C}{n_{l}} {‖ q_{l - 1} ‖}_{2}^{2} \leq \frac{C}{n} {(2 \sqrt{s})}^{2} + \frac{C}{n} {(2 \sqrt{s / log (e s)})}^{2} + \frac{C log (e s)}{n} \sum_{l = 3}^{l_{max}} {(2^{4 - l} \sqrt{s} / log (e s))}^{2} \leq C \frac{s}{n} .

□

Proof of Lemma 4. For any 1 ≤ i ≤ p, 1 ≤ j ≤ d, Λ ⊆ (j), |Λ| = k, by Lemma 6 with

v = Σ^{- 1} e_{i}, U \in ℝ^{p \times k}, U_{[Λ, :]} = I, U_{[Λ^{c}, :]} = 0,

we have

ℙ ({‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq t) \leq 2 exp (C k - c n min {\frac{t^{2}}{κ^{4}}, \frac{t}{κ^{2}}}) .

(102)

By the same method as in Lemma 5 Part 2,

ℙ ({‖ H_{α} ({(e_{i})}_{(j)} - \frac{1}{n} X_{(j)}^{⊤} X Σ^{- 1} e_{i}) ‖}_{2} \geq γ) \leq ℙ (\exists Λ \subseteq (j), all entries of | {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} | \geq α and {‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq γ) \leq ℙ (\exists Λ \subseteq (j), \sqrt{| Λ |} α \leq γ, {‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq γ) + ℙ (\exists Λ \subseteq (j), \sqrt{| Λ |} α > γ, all entries of | {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} | \geq α) \leq \sum_{\begin{matrix} Λ \subseteq (j) \\ | Λ | = ⌈ s / s_{g} ⌉ \end{matrix}} ℙ (all entries of | {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} | \geq α) + \sum_{\begin{matrix} Λ \subseteq (j) \\ | Λ | = ⌊ s / s_{g} ⌋ \end{matrix}} ℙ ({‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq γ) \leq \sum_{\begin{matrix} Λ \subseteq (j) \\ | Λ | = ⌊ s / s_{g} ⌋ \end{matrix}} ℙ ({‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq γ) + \sum_{\begin{matrix} Λ \subseteq (j) \\ | Λ | = ⌈ s / s_{g} ⌉ \end{matrix}} ℙ ({‖ {(e_{i})}_{Λ} - \frac{1}{n} X_{Λ}^{⊤} X Σ^{- 1} e_{i} ‖}_{2} \geq γ) .

Combining (102) and the previous inequality, we have

ℙ ({‖ H_{α} ({(e_{i})}_{(j)} - \frac{1}{n} X_{(j)}^{⊤} X Σ^{- 1} e_{i}) ‖}_{2} \geq γ) \leq (\begin{matrix} b_{j} \\ ⌊ s / s_{g} ⌋ \end{matrix}) \cdot 2 exp (C ⌊ s / s_{g} ⌋ - c n \cdot C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g} n}) + (\begin{matrix} b_{j} \\ ⌈ s / s_{g} ⌉ \end{matrix}) \cdot 2 exp (C ⌈ s / s_{g} ⌉ - c n \cdot C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g} n}) \leq 4 {(\frac{2 e s_{g} b}{s})}^{2 s / s_{g}} \cdot exp (C s / s_{g} - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq 4 exp (\frac{2 s}{s_{g}} log (\frac{2 e s_{g} b}{s}) + C s / s_{g} - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) .

(103)

By (103) and the union bound, we have

ℙ ({max}_{1 \leq i \leq p} {‖ H_{α} (e_{i} - \frac{1}{n} X^{⊤} X Σ^{- 1} e_{i}) ‖}_{\infty, 2} \geq γ) \leq \sum_{i = 1}^{p} \sum_{j = 1}^{d} ℙ ({‖ H_{α} ({(e_{i})}_{(j)} - \frac{1}{n} X_{(j)}^{⊤} X Σ^{- 1} e_{i}) ‖}_{2} \geq γ) \leq d^{2} b \cdot 4 exp (\frac{2 s}{s_{g}} log (\frac{2 e s_{g} b}{s}) + C s / s_{g} - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq 4 exp (2 log (s_{g}) + 2 log (d / s_{g}) + \frac{3 s}{s_{g}} log (2 e b) + C s / s_{g} - C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) \leq 4 exp (- C \frac{s log (e s_{g} b) + s_{g} log (d / s_{g})}{s_{g}}) .

□

Footnotes

https://cran.r-project.org/web/packages/SGL/index.html

References

[1] Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, Tai E-S, Teo Y-Y, and Montana G, “Pathways-driven sparse regression identifies pathways and genes associated with high-density lipoprotein cholesterol in two asian cohorts,” PLoS genetics, vol. 9, no. 11, p. e1003939, 2013.
[2] Vidyasagar M, “Machine learning methods in the computational biology of cancer,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 470, no. 2167, p. 20140081, 2014.
[3] Allahyar A and De Ridder J, “Feral: network-based classifier with application to breast cancer outcome prediction,” Bioinformatics, vol. 31, no. 12, pp. i311–i319, 2015.
[4] Rao N, Nowak R, Cox C, and Rogers T, “Classification with the sparse group lasso,” IEEE Transactions on Signal Processing, vol. 64, no. 2, pp. 448–463, 2015.
[5] Chatterjee S, Steinhaeuser K, Banerjee A, Chatterjee S, and Ganguly A, “Sparse group lasso: Consistency and climate applications,” in Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 2012, pp. 47–58.
[6] Wang W, Liang Y, and Xing E, “Block regularized lasso for multivariate multi-response linear regression,” in Artificial Intelligence and Statistics, 2013, pp. 608–617.
[7] Lounici K, Pontil M, Tsybakov A, and Van De Geer S, “Taking advantage of sparsity in multi-task learning,” in COLT 2009-The 22nd Conference on Learning Theory, 2009.
[8] Lozano AC and Swirszcz G, “Multi-level lasso for sparse multi-task regression,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 595–602.
[9].Zhou HH, Zhang Y, Ithapu VK, Johnson SC, and Singh V, “When can multi-site datasets be pooled for regression? hypothesis tests, ℓ₂-consistency and neuroscience applications,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 4170–4179. [PMC free article] [PubMed] [Google Scholar]
[10].Friedman J, Hastie T, and Tibshirani R, “A note on the group lasso and a sparse group lasso,” arXiv preprint arXiv:1001.0736, 2010. [Google Scholar]
[11].Simon N, Friedman J, Hastie T, and Tibshirani R, “A sparse-group lasso,” Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231–245, 2013. [Google Scholar]
[12].Li Y, Nan B, and Zhu J, “Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure,” Biometrics, vol. 71, no. 2, pp. 354–363, 2015. [DOI] [PMC free article] [PubMed] [Google Scholar]
[13].Tibshirani R, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996. [Google Scholar]
[14].Bickel PJ, Ritov Y, and Tsybakov AB, “Simultaneous analysis of lasso and dantzig selector,” The Annals of Statistics, vol. 37, no. 4, pp. 1705–1732, 2009. [Google Scholar]
[15].Verzelen N, “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons,” Electronic Journal of Statistics, vol. 6, pp. 38–90, 2012. [Google Scholar]
[16].Yuan M and Lin Y, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006. [Google Scholar]
[17].Lounici K, Pontil M, Van De Geer S, and Tsybakov AB, “Oracle inequalities and optimal inference under group sparsity,” The Annals of Statistics, vol. 39, no. 4, pp. 2164–2204, 2011. [Google Scholar]
[18] Bunea F, Lederer J, and She Y, “The group square-root lasso: Theoretical properties and fast algorithms,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 1313–1325, 2013.
[19] Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
[20] Ma Z, “Sparse principal component analysis and iterative thresholding,” The Annals of Statistics, vol. 41, no. 2, pp. 772–801, 2013.
[21] Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7311–7338, 2018.
[22] Wang M and Li L, “Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality,” arXiv preprint arXiv:1811.05076, 2018.
[23] Oymak S, Jalali A, Fazel M, Eldar YC, and Hassibi B, “Simultaneously structured models with application to sparse and low-rank matrices,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2886–2908, 2015.
[24] Hao B, Zhang A, and Cheng G, “Sparse and low-rank tensor estimation via cubic sketchings,” arXiv preprint arXiv:1801.09326, 2018.
[25] Zhang A and Han R, “Optimal sparse singular value decomposition for high-dimensional high-order data,” Journal of the American Statistical Association, 2018.
[26] Jaganathan K, Oymak S, and Hassibi B, “Sparse phase retrieval: Convex algorithms and limitations,” in 2013 IEEE International Symposium on Information Theory. IEEE, 2013, pp. 1022–1026.
[27] Shechtman Y, Beck A, and Eldar YC, “GESPAR: Efficient phase retrieval of sparse signals,” IEEE Transactions on Signal Processing, vol. 62, no. 4, pp. 928–938, 2014.
[28] Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow,” The Annals of Statistics, vol. 44, pp. 2221–2251, 2016.
[29] Oymak S, Jalali A, Fazel M, and Hassibi B, “Noisy estimation of simultaneously structured models: Limitations of convex relaxation,” in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013, pp. 6019–6024.
[30] Zhang C-H and Zhang SS, “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 1, pp. 217–242, 2014.
[31] Javanmard A and Montanari A, “Confidence intervals and hypothesis testing for high-dimensional regression,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2869–2909, 2014.
[32] Mitra R and Zhang C-H, “The benefit of group sparsity in group inference with de-biased scaled group lasso,” Electronic Journal of Statistics, vol. 10, no. 2, pp. 1829–1873, 2016.
[33] Cai TT and Guo Z, “Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity,” The Annals of Statistics, vol. 45, no. 2, pp. 615–646, 2017.
[34] Negahban SN, Ravikumar P, Wainwright MJ, and Yu B, “A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.
[35] Stojnic M, Xu W, and Hassibi B, “Compressed sensing-probabilistic analysis of a null-space characterization,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 3377–3380.
[36] Candès E, Romberg J, and Tao T, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
[37] Candès E and Tao T, “The Dantzig selector: Statistical estimation when p is much larger than n,” The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.
[38] Cai TT and Zhang A, “Compressed sensing and affine rank minimization under restricted isometry,” IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3279–3290, 2013.
[39] ——, “Sharp RIP bound for sparse signal and low-rank matrix recovery,” Applied and Computational Harmonic Analysis, vol. 35, no. 1, pp. 74–93, 2013.
[40] ——, “Sparse representation of a polytope and recovery of sparse signals and low-rank matrices,” IEEE Transactions on Information Theory, vol. 60, no. 1, pp. 122–132, 2014.
[41] Poignard B, “Asymptotic theory of the adaptive sparse group lasso,” Annals of the Institute of Statistical Mathematics, pp. 1–32, 2018.
[42] Rao N, Cox C, Nowak R, and Rogers TT, “Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis,” in Advances in Neural Information Processing Systems, 2013, pp. 2202–2210.
[43] Ahsen ME and Vidyasagar M, “Error bounds for compressed sensing algorithms with group sparsity: A unified approach,” Applied and Computational Harmonic Analysis, vol. 43, no. 2, pp. 212–232, 2017.
[44] Candès EJ and Plan Y, “A probabilistic and RIPless theory of compressed sensing,” IEEE Transactions on Information Theory, vol. 57, no. 11, pp. 7235–7254, 2011.
[45] Javanmard A and Montanari A, “Debiasing the lasso: Optimal sample size for Gaussian designs,” The Annals of Statistics, vol. 46, no. 6A, pp. 2593–2622, 2018.
[46] Foucart S and Rauhut H, A mathematical introduction to compressive sensing. Birkhäuser Basel, 2013, vol. 1, no. 3.
[47] Candès E and Tao T, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
[48] Bertsekas D and Nedic A, “Convex analysis and optimization (conservative),” 2003.
[49] Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational Mathematics, vol. 9, no. 6, p. 717, 2009.
[50] Gross D, “Recovering low-rank matrices from few coefficients in any basis,” IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1548–1566, 2011.
[51] Candès EJ, Li X, Ma Y, and Wright J, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
[52] Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
[53] Van de Geer S, Bühlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014.
[54] Sun T and Zhang C-H, “Scaled sparse linear regression,” Biometrika, vol. 99, no. 4, pp. 879–898, 2012.
[55] Tibshirani R, Saunders M, Rosset S, Zhu J, and Knight K, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
[56] Rinaldo A, “Properties and refinements of the fused lasso,” The Annals of Statistics, vol. 37, no. 5B, pp. 2922–2952, 2009.
[57] Jalali A and Fazel M, “A convex method for learning d-valued models,” in 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 1123–1126.
[58] Jalali A, Javanmard A, and Fazel M, “New computational and statistical aspects of regularized regression with application to rare feature selection and aggregation,” arXiv preprint arXiv:1904.05338, 2019.
[59] Sprechmann P, Ramirez I, Sapiro G, and Eldar Y, “Collaborative hierarchical sparse modeling,” in 2010 44th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2010, pp. 1–6.
[60] Hsu D, Kakade S, and Zhang T, “A tail inequality for quadratic forms of subgaussian random vectors,” Electronic Communications in Probability, vol. 17, 2012.
[61] Rudelson M and Vershynin R, “Hanson-Wright inequality and subgaussian concentration,” Electronic Communications in Probability, vol. 18, 2013.
[62] Vershynin R, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
[63] Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” The Annals of Statistics, pp. 1302–1338, 2000.

