Author manuscript; available in PMC: 2023 Sep 1.
Published in final edited form as: IEEE Trans Inf Theory. 2022 May 16;68(9):5975–6002. doi: 10.1109/tit.2022.3175455

Sparse Group Lasso: Optimal Sample Complexity, Convergence Rate, and Statistical Inference

T Tony Cai 1, Anru R Zhang 2,4, Yuchen Zhou 3,4
PMCID: PMC9974176  NIHMSID: NIHMS1830781  PMID: 36865503

Abstract

We study sparse group Lasso for high-dimensional double sparse linear regression, where the parameter of interest is simultaneously element-wise and group-wise sparse. This problem is an important instance of the simultaneously structured model – an actively studied topic in statistics and machine learning. In the noiseless case, matching upper and lower bounds on sample complexity are established for the exact recovery of sparse vectors and for stable estimation of approximately sparse vectors, respectively. In the noisy case, upper and matching minimax lower bounds for estimation error are obtained. We also consider the debiased sparse group Lasso and investigate its asymptotic property for the purpose of statistical inference. Finally, numerical studies are provided to support the theoretical results.

Index Terms: approximate dual certificate, convex optimization, sparsity, sparse group Lasso, simultaneously structured model

I. Introduction

Consider the high-dimensional double sparse regression with simultaneously group-wise and element-wise sparsity structures

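$$y = X\beta^* + \varepsilon, \quad \text{or equivalently} \quad y_i = X_i^\top \beta^* + \varepsilon_i, \quad i = 1, \ldots, n. \tag{1}$$

Here, the covariates $X \in \mathbb{R}^{n \times p}$ and the parameter $\beta^* \in \mathbb{R}^p$ are divided into $d$ known groups, where the $j$th group contains $b_j$ variables,

$$X = [X_{(1)} \cdots X_{(d)}], \quad \beta^* = \big((\beta^*_{(1)})^\top, \ldots, (\beta^*_{(d)})^\top\big)^\top, \quad X_{(j)} \in \mathbb{R}^{n \times b_j}, \quad \beta^*_{(j)} \in \mathbb{R}^{b_j}; \tag{2}$$

$\beta^*$ is an $(s, s_g)$-sparse vector in the sense that

$$\|\beta^*\|_{0,2} := \sum_{j=1}^d 1\{\beta^*_{(j)} \neq 0\} \le s_g \quad \text{and} \quad \|\beta^*\|_0 := \sum_{i=1}^p 1\{\beta^*_i \neq 0\} \le s. \tag{3}$$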

The focus of this paper is on the estimation of and inference for β* based on (y, X). This problem has great importance in a variety of applications. For example in genome-wide association studies (GWAS) [1], the genes can be grouped into pathways and it is believed that only a small portion of the pathways contain causal single nucleotide polymorphisms (SNPs), and the number of causal SNPs is much less than the one of non-causal SNPs in a causal pathway. The sparse group Lasso has been applied to identify causal genes or SNPs associated with a certain trait [1]. Other examples include cancer diagnosis and therapy [2], [3], classification [4], and climate prediction [5] among many others. The problem can also be viewed as a prototype of various problems in statistics and machine learning, such as the sparse multiple response regression [6] and multiple task learning [7], [8], [9].

The sparse group Lasso [10], [11], [12] provides a classic and straightforward estimator for β*:

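$$\hat\beta = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 + \lambda_g\|\beta\|_{1,2}. \tag{4}$$

Here, $\|\beta\|_1 = \sum_{i=1}^p |\beta_i|$ and $\|\beta\|_{1,2} = \sum_j \|\beta_{(j)}\|_2$ are the $\ell_1$ and $\ell_{1,2}$ convex regularizers that account for element-wise and group-wise sparsity structures, respectively, and $\lambda, \lambda_g \ge 0$ are tuning parameters. In the noiseless setting where $\varepsilon = 0$, one can instead apply the constrained $\ell_1 + \ell_{1,2}$ minimization to estimate $\beta^*$:

$$\hat\beta = \arg\min \; \lambda\|\beta\|_1 + \lambda_g\|\beta\|_{1,2} \quad \text{subject to} \quad y = X\beta. \tag{5}$$

In fact, when $\lambda, \lambda_g$ tend to zero while $\lambda/\lambda_g$ is held constant, the sparse group Lasso (4) tends to the $\ell_1 + \ell_{1,2}$ minimization (5).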
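As a minimal numerical sketch of (4), assuming non-overlapping groups supplied as index arrays, one can use proximal gradient descent (ISTA). This is a standard generic technique and not necessarily the solver used by the authors; it relies on the known closed form of the proximal operator of the combined penalty (element-wise soft-thresholding followed by group-wise shrinkage). The names `prox_sgl`, `sgl_objective`, `sgl_ista`, `groups`, and `n_iter` are illustrative.

```python
import numpy as np

def prox_sgl(v, groups, t_l1, t_grp):
    """Prox of t_l1*||.||_1 + t_grp*||.||_{1,2}: soft-threshold, then shrink each group."""
    z = np.sign(v) * np.maximum(np.abs(v) - t_l1, 0.0)
    out = np.zeros_like(z)
    for idx in groups:
        norm = np.linalg.norm(z[idx])
        if norm > t_grp:
            out[idx] = (1.0 - t_grp / norm) * z[idx]
    return out

def sgl_objective(X, y, beta, groups, lam, lam_g):
    """Value of the objective in (4)."""
    r = y - X @ beta
    group_pen = sum(np.linalg.norm(beta[g]) for g in groups)
    return r @ r + lam * np.abs(beta).sum() + lam_g * group_pen

def sgl_ista(X, y, groups, lam, lam_g, n_iter=500):
    """Proximal gradient descent on ||y - X b||_2^2 + lam*||b||_1 + lam_g*||b||_{1,2}."""
    # Step size 1/L, where L = 2 * ||X||_op^2 is the Lipschitz constant of the gradient.
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        beta = prox_sgl(beta - step * grad, groups, step * lam, step * lam_g)
    return beta
```

With the step size fixed at the inverse Lipschitz constant, each iteration does not increase the objective; accelerated or block-coordinate variants would typically converge faster.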

When β* is only element-wise sparse, the regular Lasso [13]

$$\hat\beta^{L} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 \tag{6}$$

can be applied and its theoretical properties have been well studied. See, for example, [14], [15]. When β* is only group-wise sparse, the group Lasso

$$\hat\beta^{GL} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda_g\|\beta\|_{1,2} \tag{7}$$

and its variations have been widely investigated [16], [17], [18]. However, to estimate the simultaneously element-wise and group-wise sparse vector β*, despite many empirical successes of sparse group Lasso in practice, the theoretical properties, including optimal rate of convergence and sample complexity, are still unclear so far to the best of our knowledge.

A. Simultaneously Structured Models

More broadly, simultaneously structured models, in which the parameter of interest has multiple structures at the same time, have attracted enormous attention in many fields including statistics, applied mathematics, and machine learning. In addition to the high-dimensional double sparse regression, other simultaneously structured models include sparse principal component analysis [19], [20], tensor singular value decomposition [21], [22], simultaneously sparse and low-rank matrix/tensor recovery [23], [24], sparse matrix/tensor SVD [25], and sparse phase retrieval [26], [27], [28]. As shown in [29], [23], by minimizing multi-objective regularizers with norms associated with these structures (such as the $\ell_1$ norm for element-wise sparsity, the nuclear norm for low-rankness, and the total variation norm for piecewise constant structures), one usually cannot do better than applying an algorithm that only exploits one structure. They particularly illustrated that a simultaneously sparse and low-rank matrix cannot be well estimated by penalizing with the $\ell_1$ and nuclear norm regularizers. Instead, non-convex methods were proposed and shown to achieve better performance.

However, based on their results, it remains an open question whether convex regularization, such as the sparse group Lasso or $\ell_1 + \ell_{1,2}$ minimization, can achieve good performance in estimating a parameter with two types of sparsity structures, as in the aforementioned high-dimensional double sparse regression. Specifically, as illustrated in Section II-B, a direct application of [23] does not provide a sample complexity lower bound for exact recovery that matches our upper bound.

B. Optimality and Related Literature

This paper fills the void of statistical limits of sparse group Lasso and provides an affirmative answer to the aforementioned question: by exploiting both element-wise and group-wise sparsity structures, the $\ell_1 + \ell_{1,2}$ regularization does provide better performance in high-dimensional double sparse regression. In the noiseless case, it is shown that $(s, s_g)$-sparse vectors can be exactly recovered and approximately $(s, s_g)$-sparse vectors can be stably estimated with high probability whenever the sample size satisfies $n \gtrsim s_g \log(d/s_g) + s \log(e s_g b)$, where $b = \max_{1 \le i \le d} b_i$. On the other hand, we prove that exact recovery cannot be achieved by $\ell_1 + \ell_{1,2}$ regularization, and stable estimation of approximately $(s, s_g)$-sparse vectors is impossible in general, unless $n \gtrsim s_g \log(d/s_g) + s \log(e s_g b/s)$. We then consider the noisy case and develop matching upper and lower bounds on the convergence rate of the estimation error. Simulation studies are carried out and the results support our theoretical findings. In addition, statistical inference for the individual coordinates of $\beta^*$ is studied. A confidence interval is constructed based on the debiased sparse group Lasso estimator and its asymptotic property. The results show that by exploiting the simultaneous element-wise and group-wise sparsity structures, the debiased sparse group Lasso requires a smaller sample size than the debiased Lasso and debiased group Lasso in the literature [30], [31], [32], [33].

The theoretical analysis of sparse group Lasso and $\ell_1 + \ell_{1,2}$ minimization is highly non-trivial. First, the regularizer $\lambda\|\cdot\|_1 + \lambda_g\|\cdot\|_{1,2}$ is not decomposable with respect to the support of $\beta^*$, so the classic techniques of decomposable regularizers [34] and the null space property [35] may not be suitable here. Despite a substantial body of literature on high-dimensional element-wise sparse vector estimation based on the restricted isometry property (RIP) [36], [37], [38], [39], [40] and the restricted eigenvalue [14], these techniques cannot provide nearly optimal results for sparse group Lasso, as it is technically difficult to partition general vectors into simultaneously element-wise and group-wise sparse ones in a way that preserves some ordering structure. Departing from the previous literature, our theoretical analysis relies on a novel construction of an approximate dual certificate. See Section II-C for further details. Although our results mostly focus on the performance of the sparse group Lasso and $\ell_1 + \ell_{1,2}$ estimators, the techniques of approximate dual certificates on multi-norm structures can also be of independent interest.

The statistical properties of sparse group Lasso and related estimators have been studied previously. For example, [5] developed consistency results for estimators with general tree-structured norm regularizers, of which the sparse group Lasso is a special case. [41] analyzed the asymptotic behavior of the adaptive sparse group Lasso estimator. [4], [42] studied multi-task learning and classification problems based on a variant of the sparse group Lasso estimator. [12] studied multivariate linear regression via sparse group Lasso. [43] provided a theoretical framework for developing error bounds of the group Lasso, sparse group Lasso, and group Lasso with tree-structured overlapping groups. Specifically, their results imply that a group-wise sparse signal can be exactly recovered with high probability by solving (5) if the sample size satisfies $n \gtrsim s_g(b + \log d)$. Different from previous results, this paper focuses on both the required sample size and the convergence rate of the estimation error of sparse group Lasso. To the best of our knowledge, this is the first paper that provides optimal theoretical guarantees for both the sample complexity and the estimation error of sparse group Lasso.

C. Organization of the Paper

The rest of the article is organized as follows. After a brief introduction to notation and preliminaries in Section II-A, the main theoretical results on constrained $\ell_1 + \ell_{1,2}$ minimization in the noiseless setting are presented in Section II-B and the key proof ideas are explained in Section II-C. Results for sparse group Lasso in the noisy setting are discussed in Section III. In particular, the optimal rate of estimation error and statistical inference are studied in Sections III-A and III-B, respectively. In Section IV-A, we introduce a practical scheme to select tuning parameters. In Section IV-B, we provide simulation results in both noiseless and noisy cases to support our theoretical findings. The proofs of technical results are given in Section VI. All technical lemmas and their proofs can be found in Appendix A.

II. $\ell_1 + \ell_{1,2}$ Minimization in Noiseless Case

A. Notation and Preliminaries

The following notation will be used throughout the paper. We denote $a \wedge b = \min\{a, b\}$ and $a \vee b = \max\{a, b\}$. Let $\mathrm{sgn}(\cdot)$ be the sign function, i.e., $\mathrm{sgn}(x) = 1$, $0$, or $-1$ if $x > 0$, $x = 0$, or $x < 0$, respectively. $H_\alpha(\cdot)$ is the soft-thresholding function such that $H_\alpha(x) = \mathrm{sgn}(x) \cdot \{(|x| - \alpha) \vee 0\}$ for any $x$. We say $a \lesssim b$ and $a \gtrsim b$ if $a \le Cb$ and $b \le Ca$ for some uniform constant $C > 0$, respectively; $a \asymp b$ means that $a \lesssim b$ and $a \gtrsim b$ both hold. Let the uppercase $C, C_1, C_0, \ldots$ and lowercase $c, c_1, c_0, \ldots$ denote large and small positive constants, respectively, whose actual values may vary from line to line. Throughout the paper, we focus on the parameter index set $\{1, \ldots, p\}$ partitioned into $d$ groups. Denote by $(1), \ldots, (d) \subseteq \{1, \ldots, p\}$ the index sets belonging to each group. Additionally, for any group index subset $G \subseteq \{1, \ldots, d\}$, define $(G) = \cup_{j \in G}(j)$ and $(G^c) = \cup_{j \notin G}(j)$. For any vector $\gamma$ and index subset $T$, $\gamma_T \in \mathbb{R}^{|T|}$ represents the sub-vector of $\gamma$ with index set $T$. In particular, $\gamma_{(G)}$ represents the sub-vector of $\gamma$ in the union of groups $j \in G$. Define the $\ell_q$ norm of any vector $\gamma$ as $\|\gamma\|_q = (\sum_i |\gamma_i|^q)^{1/q}$. For any vector $\gamma \in \mathbb{R}^p$ with group structure, we also define the $\ell_{q_1,q_2}$ norm for any $0 \le q_1, q_2 \le \infty$ as

$$\|\gamma\|_{q_1,q_2} = \Big(\sum_{j=1}^d \|\gamma_{(j)}\|_{q_2}^{q_1}\Big)^{1/q_1} = \Big\{\sum_{j=1}^d \Big(\sum_{i \in (j)} |\gamma_i|^{q_2}\Big)^{q_1/q_2}\Big\}^{1/q_1}.$$

In particular, $\|\gamma\|_{0,2} = \sum_{j=1}^d 1\{\gamma_{(j)} \neq 0\}$ is the number of non-zero groups of $\gamma$, $\|\gamma\|_{\infty,2} = \max_j \|\gamma_{(j)}\|_2$ is the maximum $\ell_2$ norm among all groups of $\gamma$, and $\|\gamma\|_{1,2} = \sum_{j=1}^d \|\gamma_{(j)}\|_2$ is the group-wise $\ell_1$ penalty. With a slight abuse of notation, we write $\|\gamma_T\|_{q_1,q_2} = \|u\|_{q_1,q_2}$, where $u \in \mathbb{R}^p$ equals $\gamma_T$ on the subset $T$ and $0$ on $T^c$.
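The mixed norms and the soft-thresholding function above are straightforward to compute. The following sketch assumes that the groups are given as a list of index arrays; the function names `mixed_norm` and `soft_threshold` are illustrative. It relies on `numpy.linalg.norm`, whose `ord=0` option for vectors counts non-zero entries and whose `ord=np.inf` option returns the largest absolute value.

```python
import numpy as np

def mixed_norm(gamma, groups, q1, q2):
    """||gamma||_{q1,q2}: the l_{q1} 'norm' of the vector of group-wise l_{q2} norms."""
    per_group = np.array([np.linalg.norm(gamma[g], ord=q2) for g in groups])
    return np.linalg.norm(per_group, ord=q1)

def soft_threshold(x, alpha):
    """H_alpha(x) = sgn(x) * max(|x| - alpha, 0), applied element-wise."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

# Three groups of size two; the group-wise l_2 norms are 5, 0 and 1.
gamma = np.array([3.0, 4.0, 0.0, 0.0, 1.0, 0.0])
groups = [np.arange(0, 2), np.arange(2, 4), np.arange(4, 6)]
n_nonzero_groups = mixed_norm(gamma, groups, 0, 2)      # ||gamma||_{0,2}
group_penalty = mixed_norm(gamma, groups, 1, 2)         # ||gamma||_{1,2}
largest_group = mixed_norm(gamma, groups, np.inf, 2)    # ||gamma||_{inf,2}
```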

The focus of this paper is on simultaneously element-wise and group-wise sparse vectors defined as follows.

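Definition 1 (Simultaneous element-wise and group-wise sparsity): Assume $\beta^* \in \mathbb{R}^p$ is associated with the group partition $(1), \ldots, (d)$. For positive integers $s, s_g$ satisfying $s_g \le d$ and $s_g \le s \le \max_{\Omega \subseteq \{1,\ldots,d\},\, |\Omega| = s_g} \sum_{i \in \Omega} b_i$, we say $\beta^*$ is $(s, s_g)$-sparse if

$$\|\beta^*\|_{0,2} = \sum_{j=1}^d 1\{\beta^*_{(j)} \neq 0\} \le s_g, \qquad \|\beta^*\|_0 = \sum_i 1\{\beta^*_i \neq 0\} \le s.$$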

B. Noiseless Case and Sample Complexity

To analyze the performance of sparse group Lasso and $\ell_1 + \ell_{1,2}$ minimization, we first introduce the following assumption on the design matrix $X$.

Assumption 1 (Sub-Gaussian assumption): Suppose all rows of $X$ are i.i.d. centered sub-Gaussian. Specifically, $\mathbb{E}X_i = 0$, $\mathrm{Var}(X_i) = \Sigma$, and for any $\alpha \in \mathbb{R}^p$, we have $\mathbb{E}\exp(\alpha^\top \Sigma^{-1/2} X_i) \le \exp(\kappa^2\|\alpha\|_2^2/2)$ for a constant $\kappa > 0$. We also assume there exist two constants $C_{\max} \ge c_{\min} > 0$ such that $c_{\min} \le \sigma_{\min}(\Sigma) \le \sigma_{\max}(\Sigma) \le C_{\max}$, where $\sigma_{\max}(\Sigma)$ and $\sigma_{\min}(\Sigma)$ are the largest and smallest eigenvalues of $\Sigma$, respectively.

Clearly, a random matrix $X$ with i.i.d. standard normal entries satisfies this assumption. This design is referred to as the Gaussian ensemble and has been considered a benchmark setting in the compressed sensing and high-dimensional regression literature [44], [45].

The following theorem shows that the $\ell_1 + \ell_{1,2}$ minimization achieves exact recovery with high probability when $\beta^*$ is simultaneously element-wise and group-wise sparse, the columns of $X$ are weakly dependent in the sense of condition (9) below, and Assumption 1 holds. The theorem also provides a more general upper bound on the estimation error when $\beta^*$ is approximately element-wise and group-wise sparse.

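Theorem 1 ($\ell_1 + \ell_{1,2}$ minimization in noiseless case): Suppose one observes $y = X\beta^*$, where $X$ has the group structure (2) and satisfies Assumption 1, $\beta^*$ is $(s, s_g)$-sparse, and $b = \max_{1 \le i \le d} b_i$. Let $T$ be the support of $\beta^*$. Suppose there exist uniform constants $C, c > 0$ such that

$$n \ge C\big(s_g \log(d/s_g) + s \log(e s_g b)\big), \tag{8}$$
$$\max_{i \in T^c} \|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2 \le c/\sqrt{s}. \tag{9}$$

Then the constrained $\ell_1 + \ell_{1,2}$ minimization (5) with $\lambda_g = \sqrt{s/s_g}\,\lambda$ achieves exact recovery with probability at least $1 - C\exp(-cn/s)$.

Moreover, if $\beta^* \in \mathbb{R}^p$ is a general vector and $\hat\beta$ is the solution to the constrained $\ell_1 + \ell_{1,2}$ minimization (5) with $\lambda_g = \sqrt{s/s_g}\,\lambda$, then

$$\|\hat\beta - \beta^*\|_2 \lesssim \min_{S \in \mathcal{S}} \Big(\frac{1}{\sqrt{s}}\|\beta^*_{S^c}\|_1 + \frac{1}{\sqrt{s_g}}\|\beta^*_{S^c}\|_{1,2}\Big) \tag{10}$$

with probability at least $1 - C\exp(-cn/s)$. Here,

$$\mathcal{S} = \Big\{S : \|\beta^*_S\|_0 \le s,\ \|\beta^*_S\|_{0,2} \le s_g,\ \max_{i \in S^c} \|\Sigma_{i,S}\Sigma_{S,S}^{-1}\|_2 \le c/\sqrt{s}\Big\}.$$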

Remark 1 (Interpretation and comparison): In Theorem 1, the required sample size for achieving exact recovery contains two terms: sg log(d/sg) and s log(esgb). Intuitively speaking, sg log(d/sg) corresponds to the complexity of identifying sg non-zero groups and s log(esgb) corresponds to the complexity of estimating s non-zero elements of β in sg known groups.

When $\beta^*$ is only element-wise or only group-wise sparse, one can apply the classic $\ell_1$ or $\ell_{1,2}$ minimization, respectively, to recover $\beta^*$:

$$\hat\beta_{\ell_1} = \arg\min_{\beta} \|\beta\|_1 \quad \text{subject to} \quad y = X\beta, \tag{11}$$
$$\hat\beta_{\ell_{1,2}} = \arg\min_{\beta} \|\beta\|_{1,2} \quad \text{subject to} \quad y = X\beta. \tag{12}$$

The $\ell_1$ minimization and $\ell_{1,2}$ minimization here are the special forms of the regular Lasso and group Lasso, respectively (obtained by letting $\lambda, \lambda_g \to 0^+$ in (6) and (7)). In particular, if the group sizes satisfy $b_1 \asymp \cdots \asymp b_d \asymp b$, then to ensure exact recovery in the noiseless setting with high probability, (11) requires $n \ge Cs\log(ebd/s)$ [46] and group Lasso requires $n \gtrsim s_g(b + \log(ed/s_g))$. The $\ell_1 + \ell_{1,2}$ minimization (5) has provable advantages over both regular and group Lasso when $b \gg \log(d) \gg \log(e s_g b)$ and $s_g b/\log(e s_g b) \gg s \gg s_g$. In particular, when $s_g = s$, the double sparse regression reduces to the vanilla sparse linear regression, and the upper bound (10) matches the classic upper bound for $\ell_1$ minimization [44].
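To make the comparison concrete, the sketch below evaluates the three sample-size rates quoted above with all unspecified constants set to one, which is an assumption made purely for illustration (the theory only fixes the rates up to constants). The function names and the particular values of $s$, $s_g$, $d$, $b$ are hypothetical; the values are chosen to lie in the regime $b \gg \log d \gg \log(e s_g b)$ and $s_g b/\log(e s_g b) \gg s \gg s_g$.

```python
import math

def rate_sgl(s, s_g, d, b):
    """s_g*log(d/s_g) + s*log(e*s_g*b): rate in (8) for l1 + l_{1,2} minimization."""
    return s_g * math.log(d / s_g) + s * math.log(math.e * s_g * b)

def rate_lasso(s, d, b):
    """s*log(e*b*d/s): rate for l1 minimization (11)."""
    return s * math.log(math.e * b * d / s)

def rate_group_lasso(s_g, d, b):
    """s_g*(b + log(e*d/s_g)): rate for l_{1,2} minimization (12)."""
    return s_g * (b + math.log(math.e * d / s_g))

s, s_g, d, b = 40, 2, 10**12, 1000
rates = {
    "sparse group Lasso": rate_sgl(s, s_g, d, b),
    "Lasso": rate_lasso(s, d, b),
    "group Lasso": rate_group_lasso(s_g, d, b),
}
for name, value in rates.items():
    print(f"{name}: {value:.0f}")
```

In this regime the combined rate is the smallest of the three, because the $\log d$ factor multiplies only $s_g$ rather than $s$, and the group size $b$ enters only through a logarithm.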

In addition, Condition (9) is an important technical condition we used in our theoretical analysis.

Next, we consider the sample complexity lower bound. Suppose $b_1 = b_2 = \cdots = b_d$ and $d \ge 2s_g$. Recall that one observes $y = X\beta^*$ without noise and aims to estimate the $(s, s_g)$-sparse vector $\beta^*$ based on $y$ and $X$. As indicated by classic results in compressed sensing [47], with sufficient computing power, the $\ell_0$ minimization below achieves exact recovery of $\beta^*$

$$\hat\beta_{\ell_0} = \arg\min \|\beta\|_0 \quad \text{subject to} \quad X\beta = y \tag{13}$$

as long as $X$ is non-degenerate and $n \ge 2s$. This bound is actually sharp: when $n < 2s$, for any set $T \subseteq \{1, \ldots, db\}$ with cardinality $2s$, one can find a non-zero vector $\gamma$ such that $\mathrm{supp}(\gamma) \subseteq T$ and $X\gamma = 0$. By choosing an appropriate $T$, we can split the support of $\gamma$ to obtain two $(s, s_g)$-sparse vectors $\beta_1, \beta_2$ satisfying $\beta_1 + \beta_2 = \gamma$. Then $X\beta_1 = X(-\beta_2)$, and there is no way to distinguish $\beta_1$ from $-\beta_2$ based merely on $X$ and $y = X\beta_1 = X(-\beta_2)$.
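The non-identifiability argument for $n < 2s$ can be reproduced numerically. The sketch below uses hypothetical sizes ($n = 5$, $s = 3$, one group of size $b = 10$ containing $T$, so that both vectors are $(3, 1)$-sparse) and a Gaussian design; a null vector of the $n \times 2s$ submatrix is read off from its singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
n, s, d, b = 5, 3, 2, 10              # n < 2s
p = d * b
X = rng.standard_normal((n, p))

T = np.arange(2 * s)                  # 2s indices, all inside the first group
# X[:, T] has more columns than rows, so its null space is non-trivial.
_, _, Vt = np.linalg.svd(X[:, T])     # full SVD: Vt is (2s) x (2s)
gamma = np.zeros(p)
gamma[T] = Vt[-1]                     # unit vector with X @ gamma = 0

# Split the support of gamma into two s-sparse pieces with beta1 + beta2 = gamma.
beta1 = np.zeros(p)
beta2 = np.zeros(p)
beta1[T[:s]] = gamma[T[:s]]
beta2[T[s:]] = gamma[T[s:]]

# beta1 and -beta2 are different s-sparse vectors that produce the same observations.
same_observations = np.allclose(X @ beta1, X @ (-beta2))
```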

However, the $\ell_0$ minimization (13) is computationally infeasible in practice, and a larger sample size is required for more practical methods. The following theorem shows that by performing the convex $\ell_1$ regularization, $\ell_{1,2}$ regularization, or any weighted combination of them, one requires at least $\Omega(s_g \log(d/s_g) + s \log(e s_g b/s))$ observations to ensure exact recovery of $(s, s_g)$-sparse vectors.

Theorem 2 (Sample complexity lower bound for exact recovery): Suppose $b_1 = \cdots = b_d = b$ and $d, b \ge 3$. Suppose $X$ is an $n$-by-$(db)$ matrix. If every $(2s, 2s_g)$-sparse vector $\beta \in \mathbb{R}^{db}$ is a minimizer of the following program for some $(\lambda, \lambda_g) \in \{(\lambda, \lambda_g) : \lambda, \lambda_g \ge 0,\ \lambda + \lambda_g > 0\}$:

$$\min_{z}\ \lambda\|z\|_1 + \lambda_g\|z\|_{1,2} \quad \text{subject to} \quad Xz = y = X\beta,$$

in other words, if the $\ell_1 + \ell_{1,2}$ minimization exactly recovers all $(2s, 2s_g)$-sparse vectors $\beta$, then we must have $n \gtrsim s_g \log(d/s_g) + s \log(e s_g b/s)$.

The following sample complexity lower bound shows that for arbitrary methods, to ensure stable estimation of all approximately sparse vectors, one requires at least Ω(sg log(d/sg) + s log(esgb/s)) observations.

Theorem 3 (Sample complexity lower bound for stable estimation): Suppose $b_1 = \cdots = b_d = b$ and $b, d \ge 3$. Assume there exist a matrix $X \in \mathbb{R}^{n \times (bd)}$, a map $\Delta : \mathbb{R}^n \to \mathbb{R}^{bd}$ ($\Delta$ may depend on $X$), and a constant $C > 0$ satisfying

$$\|\beta - \Delta(X\beta)\|_2 \le C\Big(\frac{\|\beta\|_1}{\sqrt{s}} + \frac{\|\beta\|_{1,2}}{\sqrt{s_g}}\Big) \tag{14}$$

for all $\beta \in \mathbb{R}^p$ and some $s, s_g$ satisfying $d \ge s_g$, $s_g b \ge s \ge s_g$. There exist constants $C_0$ and $c_0$ that depend only on $C$ such that whenever $s_g \ge C_0$, we must have

$$n \ge c_0\big(s_g \log(d/s_g) + s \log(e s_g b/s)\big).$$

Remark 2 (Optimality and comparison with previous results): Theorems 2 and 3 show that the sample complexity upper bound in Theorem 1 is rate-optimal under a weak condition: $\log(e s_g b) \asymp \log(e s_g b) - \log(s)$ or $\log(d) \ge 2s\log(s)/s_g$. Oymak et al. [23] provided a general analysis for convex regularization of simultaneously structured parameter estimation. Specifically for the high-dimensional double sparse regression, a direct application of their Theorem 3.2 and Corollary 3.1 implies that if the $\ell_1 + \ell_{1,2}$ minimization can exactly recover an $(s, s_g)$-sparse vector $\beta^*$ with constant probability, one must have $n \gtrsim s$. Theorem 2 therefore provides a sharper lower bound on sample complexity.

In addition, by setting $s_g = s$, the lower bound in Theorems 2 and 3 reduces to $n \gtrsim s\log(p/s)$, which matches the optimal sample complexity lower bound for exact recovery of $s$-sparse vectors [46, Theorem 10.11, Proposition 10.7]. By setting $s = s_g b$, we obtain a sample complexity lower bound $n \gtrsim s_g(b + \log(d/s_g))$ for (approximate) $s_g$-group-wise sparse vector recovery and stable estimation. To the best of our knowledge, this is the first sample complexity lower bound for group Lasso.

C. Proof Sketches

We briefly discuss the proof sketches of the main technical results in this section. The detailed proofs are postponed to Section VI.

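The proof of Theorem 1 is based on a novel dual certificate scheme. The dual certificate [48] has been used in the theoretical analysis of various convex optimization methods in high-dimensional problems, such as matrix completion [49], [50], compressed sensing [44], robust PCA [51], tensor completion [52], etc. The high-dimensional double sparse linear regression exhibits different aspects from these previous works due to the simultaneous sparsity structure. In particular, we can show that if the vector $u_{\mathrm{et}}$ defined below is in the row space of $X$, it can be used as an exact dual certificate for the recovery of an $(s, s_g)$-sparse vector $\beta^*$:

$$u_{\mathrm{et}} = v_{\mathrm{et}} + w_{\mathrm{et}} \in \mathbb{R}^p, \quad \begin{cases} (v_{\mathrm{et}})_{(j)} = \sqrt{s/s_g}\,\beta^*_{(j)}/\|\beta^*_{(j)}\|_2, & j \in G; \\ \|(v_{\mathrm{et}})_{(j)}\|_2 < \sqrt{s/s_g}, & j \in G^c; \end{cases} \quad \begin{cases} (w_{\mathrm{et}})_T = \mathrm{sgn}(\beta^*_T); \\ \|(w_{\mathrm{et}})_{T^c}\|_\infty < 1. \end{cases} \tag{15}$$

Here, $T$ and $G$ are the element-wise and group-wise supports of $\beta^*$:

$$T = \{i : \beta^*_i \neq 0\} \subseteq \{1, \ldots, p\}, \qquad G = \{j : \beta^*_{(j)} \neq 0\} \subseteq \{1, \ldots, d\}.$$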

Roughly speaking, $u_{\mathrm{et}}$ is a sub-gradient of the objective function of (5) evaluated at $\beta = \beta^*$. If $u_{\mathrm{et}}$ is in the row space of $X$, the sub-gradient is perpendicular to the feasible set of (5), which implies that $\beta^*$ is the unique minimizer of the $\ell_1 + \ell_{1,2}$ minimization (5).
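The optimality part of this argument can be written out in a few lines. The display below is a sketch under the normalization $\lambda = 1$, $\lambda_g = \sqrt{s/s_g}$; the symbols $h$ (an arbitrary vector with $Xh = 0$, so that $\beta^* + h$ is feasible) and $z$ (a vector with $u = X^\top z$, where $u$ denotes the exact dual certificate of (15)) are introduced here for illustration only. The first step is the sub-gradient inequality for the two convex norms.

```latex
% u = v + w is the exact dual certificate of (15); Xh = 0 and u = X^T z.
\begin{aligned}
\|\beta^* + h\|_1 + \sqrt{s/s_g}\,\|\beta^* + h\|_{1,2}
  &\ge \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + \langle w + v,\, h \rangle \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + \langle X^\top z,\, h \rangle \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2} + z^\top X h \\
  &= \|\beta^*\|_1 + \sqrt{s/s_g}\,\|\beta^*\|_{1,2}.
\end{aligned}
```

Hence no feasible point has a smaller objective value than $\beta^*$. The strict inequalities in (15) are what is used to upgrade this to uniqueness of the minimizer.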

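For a more general vector $\beta^*$ that does not necessarily have a sparse support $T$ or $G$, we consider the following $(s, s_g)$-sparse approximation:

$$\beta^{\mathrm{ap}} = \arg\min_{S}\ \frac{1}{\sqrt{s}}\|\beta^*_{S^c}\|_1 + \frac{1}{\sqrt{s_g}}\|\beta^*_{S^c}\|_{1,2} \quad \text{subject to} \quad \|\beta^*_S\|_0 \le s,\ \|\beta^*_S\|_{0,2} \le s_g,\ \max_{i \in S^c}\|\Sigma_{i,S}\Sigma_{S,S}^{-1}\|_2 \le c/\sqrt{s}. \tag{16}$$

Let $T = \{i : \beta^{\mathrm{ap}}_i \neq 0\}$ and $G = \{j : (\beta^{\mathrm{ap}})_{(j)} \neq 0\}$ be the element-wise and group-wise supports of $\beta^{\mathrm{ap}}$. Define

$$\tilde u_0 = \tilde v_0 + \tilde w_0 \in \mathbb{R}^p, \quad \begin{cases} (\tilde v_0)_{(j)} = \sqrt{s/s_g}\,\beta^*_{T,(j)}/\|\beta^*_{T,(j)}\|_2, & j \in G; \\ \|(\tilde v_0)_{(j)}\|_2 < \sqrt{s/s_g}, & j \in G^c; \end{cases} \quad \begin{cases} (\tilde w_0)_T = \mathrm{sgn}(\beta^*_T); \\ \|(\tilde w_0)_{T^c}\|_\infty < 1. \end{cases} \tag{17}$$

Here $\beta^*_{T,(j)} \in \mathbb{R}^{b_j}$ is the subvector of $\beta^*$ restricted to the $j$-th group with all entries in $T^c$ set to zero. Similarly to the exactly sparse case, if $\tilde u_0$ is in the row space of $X$ and the true $\beta^*$ is approximately $(s, s_g)$-sparse, the minimizer of (5) will be close to $\beta^*$.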

However, it is often difficult to find an exact dual certificate that lies in the row space of X and satisfies stringent conditions in (15) or (17). We instead propose to analyze via the approximate dual certificate defined as (18) in the following lemma.

Lemma 1 (Approximate dual certificate for sparse group Lasso): Suppose $T$, $G$ are the element-wise and group-wise supports defined via (16), and $\tilde u_0$ is defined in (17). Assume $X$ satisfies $\sigma_{\min}(X_T^\top X_T/n) \ge c_{\min}/2$. If there exists $u \in \mathbb{R}^p$ in the row span of $X$ satisfying

$$\|u_T - (\tilde u_0)_T\|_2 \cdot \max_{i \in T^c}\|X_T^\top X_i/n\|_2 \le c_{\min}/8, \qquad \|H_{1/2}(u_{(G^c)})\|_{\infty,2} \le \sqrt{s_0}/2, \qquad \|u_{(G)\setminus T}\|_\infty \le 1/2, \tag{18}$$

then the conclusion (10) of Theorem 1 holds with probability at least $1 - 2e^{-cn}$. Here, $H_{1/2}(\cdot)$ is the soft-thresholding operator defined at the beginning of Section II.

If we additionally assume $\beta^*$ is $(s, s_g)$-sparse, then $\beta^*$ is the unique solution to the $\ell_1 + \ell_{1,2}$ minimization (5) with probability at least $1 - 2e^{-cn}$.

Lemma 1 shows that the conclusion of Theorem 1 holds if there exists an approximate dual certificate u satisfying the condition (18). The following lemma shows that, under the assumptions in Theorem 1, one can find such an approximate dual certificate with high probability.

Lemma 2: Suppose $X$ has group structure (2) and satisfies Assumption 1. Recall that $\sigma_{\min}(X_T^\top X_T/n)$ is the least eigenvalue of $X_T^\top X_T/n$. Then $\sigma_{\min}(X_T^\top X_T/n) \ge 1/2$ and (18) holds with probability at least $1 - Ce^{-cn/s}$, where $T$ is defined via (16).

Another key technical tool to the proof of Theorem 1 is the following Lemma, which shows that X satisfies the restricted isometry property for all simultaneously element-wise and group-wise sparse vectors with high probability when there are enough samples.

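Lemma 3: If $n \ge C(s_g \log(d/s_g) + s \log(e s_g b))$, then

$$\frac{c_{\min}}{2}\|\gamma\|_2^2 \le \frac{1}{n}\|X\gamma\|_2^2 \le \Big(C_{\max} + \frac{c_{\min}}{2}\Big)\|\gamma\|_2^2, \qquad \forall \gamma \in \{\gamma \in \mathbb{R}^p : \|\gamma\|_0 \le 2s,\ \|\gamma\|_{0,2} \le 2s_g\} \tag{19}$$

with probability at least $1 - 2e^{-cn}$.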
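A simple way to get a feel for Lemma 3 is to probe the ratio $\|X\gamma\|_2^2/(n\|\gamma\|_2^2)$ on randomly drawn doubly sparse vectors. The sketch below does this for a Gaussian design with equal group sizes; it only samples vectors at random and therefore does not verify the uniform statement (19), which concerns all such vectors simultaneously. The function names and sizes are illustrative assumptions.

```python
import numpy as np

def random_double_sparse(rng, d, b, s, s_g):
    """Draw a vector with s non-zeros located inside s_g randomly chosen groups of size b."""
    gamma = np.zeros(d * b)
    chosen = rng.choice(d, size=s_g, replace=False)
    candidates = np.concatenate([np.arange(j * b, (j + 1) * b) for j in chosen])
    support = rng.choice(candidates, size=s, replace=False)
    gamma[support] = rng.standard_normal(s)
    return gamma

def isometry_ratios(X, vectors):
    """Return ||X g||_2^2 / (n ||g||_2^2) for each vector g."""
    n = X.shape[0]
    return np.array([np.sum((X @ g) ** 2) / (n * np.sum(g ** 2)) for g in vectors])

rng = np.random.default_rng(1)
n, d, b = 400, 20, 10
X = rng.standard_normal((n, d * b))          # Gaussian ensemble, Sigma = I
vectors = [random_double_sparse(rng, d, b, s=8, s_g=2) for _ in range(200)]
ratios = isometry_ratios(X, vectors)
print(ratios.min(), ratios.max())            # both should be close to 1 when n is large
```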

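Next we briefly discuss the proof of Theorem 2. Consider the quotient space $\mathbb{R}^{db}/\ker(X) = \{[\gamma] := \gamma + \ker(X),\ \gamma \in \mathbb{R}^{db}\}$ and define an associated norm as $\|[\gamma]\| = \inf_{v \in \ker(X)}\{\lambda\|\gamma - v\|_1 + \lambda_g\|\gamma - v\|_{1,2}\}$. We show that there exist $N$ different $(s, s_g)$-sparse vectors $\beta^{(1)}, \ldots, \beta^{(N)}$ such that $\log(N) \asymp s\log(e s_g b/s) + s_g\log(d/s_g)$ and $\|[\beta^{(i)}]\| = 1$, $\|[\beta^{(i)}] - [\beta^{(j)}]\| \ge 2/9$ for all $1 \le i \neq j \le N$. By a property of the packing number and the fact that $\dim(\mathbb{R}^{db}/\ker(X)) \le n$, we must have $N \le 10^n$. Thus $n \gtrsim \log(N) \asymp s\log(e s_g b/s) + s_g\log(d/s_g)$.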
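The constant 10 in this sketch is consistent with the standard volumetric packing bound, which the display below spells out as a worked check. Here $\delta$ denotes the separation and $m$ the dimension of the normed space; both symbols are introduced only for this illustration.

```latex
% N points in the unit ball of an m-dimensional normed space, pairwise at distance >= delta:
% the balls of radius delta/2 around them are disjoint and lie in the ball of radius 1 + delta/2.
N \left(\frac{\delta}{2}\right)^{m} \le \left(1 + \frac{\delta}{2}\right)^{m}
\quad\Longrightarrow\quad
N \le \left(1 + \frac{2}{\delta}\right)^{m}.
% With delta = 2/9 and m = dim(R^{db}/ker(X)) <= n:
N \le \left(1 + 9\right)^{m} \le 10^{n},
\qquad\text{so}\qquad
n \ge \frac{\log N}{\log 10}.
```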

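We prove Theorem 3 by contradiction. Assume that

$$n < c_0\big(s\log(e s_g b/s) + s_g\log(d/s_g)\big) \tag{20}$$

for a sufficiently small constant $c_0$. Let $\|\cdot\| = \|\cdot\|_1 + \sqrt{s/s_g}\,\|\cdot\|_{1,2}$ and let $B = \{x \in \mathbb{R}^{db} : \|x\| \le 1\}$ be the unit ball associated with $\|\cdot\|$. Define

$$d^n(B, \mathbb{R}^p) = \inf_{\substack{L^n \text{ is a subspace of } \mathbb{R}^p \\ \text{with } \dim(\mathbb{R}^p/L^n) \le n}} \Big\{\sup_{\beta \in B \cap L^n} \|\beta\|_2\Big\}.$$

We have $d^n(B, \mathbb{R}^p) \le C/\sqrt{s}$ by the assumption of this theorem. We can also show that there exists a uniform constant $c > 0$ such that

$$d^n(B, \mathbb{R}^p) \ge c\min\left\{\frac{1}{\sqrt{s_0}},\ \left[\left(\frac{s_g}{s}\log\Big(\frac{cs}{s_g}\cdot\frac{d\log(e s_g b/s)}{n}\Big) + \log(e s_g b/s)\right)\Big/ n\right]^{1/2}\right\}.$$

The previous two inequalities and (20) together imply that

$$n \ge c\left(s_g\log\Big(\frac{cs}{s_g}\cdot\frac{d\log(e s_g b/s)}{n}\Big) + s\log(e s_g b/s)\right) \ge c_0\big(s\log(e s_g b/s) + s_g\log(d/s_g)\big) > n.$$

This contradiction shows that

$$n \ge c_0\big(s\log(e s_g b/s) + s_g\log(d/s_g)\big).$$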
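For completeness, here is a sketch of why the stability assumption (14) yields the upper bound $C/\sqrt{s}$ on the width used in this argument. It takes the subspace to be $\ker(X)$, which has codimension at most $n$, and uses the norm $\|\cdot\| = \|\cdot\|_1 + \sqrt{s/s_g}\,\|\cdot\|_{1,2}$ from the proof sketch; the symbol $z_0 = \Delta(0)$ is introduced here for illustration.

```latex
% For beta in ker(X) with ||beta|| <= 1: X beta = X(-beta) = 0, so Delta(X beta) = Delta(0) =: z_0.
% Note that C( ||beta||_1/sqrt(s) + ||beta||_{1,2}/sqrt(s_g) ) = (C/sqrt(s)) ||beta||.
\begin{aligned}
2\|\beta\|_2
  &\le \|\beta - z_0\|_2 + \|-\beta - z_0\|_2
  && \text{(triangle inequality)} \\
  &\le \frac{C}{\sqrt{s}}\,\|\beta\| + \frac{C}{\sqrt{s}}\,\|-\beta\|
  && \text{(apply (14) to } \beta \text{ and } -\beta) \\
  &\le \frac{2C}{\sqrt{s}}.
\end{aligned}
% Taking the supremum over beta in B \cap ker(X) gives d^n(B, R^p) <= C / sqrt(s).
```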

III. Sparse Group Lasso in Noisy Case

We now turn to the noisy case.

A. Optimal Rate of Estimation Error of Sparse Group Lasso

When observations are noisy, we have the following theoretical guarantee for the sparse group Lasso.

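Theorem 4 (Upper bound of estimation error): Suppose $y = X\beta^* + \varepsilon$, $X$ satisfies Assumption 1, $n \ge C(s_g\log(d/s_g) + s\log(e s_g b))$ for some uniform constant $C > 0$, $\varepsilon_i \overset{iid}{\sim} N(0, \sigma^2)$, and $b = \max_{1 \le i \le d} b_i$. Then the sparse group Lasso estimator (4) with

$$\lambda = C\sigma\sqrt{\big(s\log(e s_g b) + s_g\log(ed/s_g)\big)\,n/s}$$

and

$$\lambda_g = \sqrt{s/s_g}\,\lambda$$

satisfies

$$\|\hat\beta - \beta^*\|_2 \lesssim \min_{S \in \mathcal{S}}\left\{\sqrt{\frac{\sigma^2\big(s_g\log(d/s_g) + s\log(e s_g b)\big)}{n}} + \frac{\|\beta^*_{S^c}\|_1}{\sqrt{s}} + \frac{\|\beta^*_{S^c}\|_{1,2}}{\sqrt{s_g}}\right\}$$

with probability at least

$$1 - C\exp\left(-C\,\frac{s\log(e s_g b) + s_g\log(d/s_g)}{s}\right).$$

Here,

$$\mathcal{S} = \Big\{S : \|\beta^*_S\|_0 \le s,\ \|\beta^*_S\|_{0,2} \le s_g,\ \max_{i \in S^c}\|\Sigma_{i,S}\Sigma_{S,S}^{-1}\|_2 \le c/\sqrt{s}\Big\}.$$

In particular, if $\beta^*$ is exactly $(s, s_g)$-sparse and

$$\max_{i \in T^c}\|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2 \le c/\sqrt{s}$$

holds, then

$$\|\hat\beta - \beta^*\|_2^2 \lesssim \frac{\sigma^2\big(s_g\log(d/s_g) + s\log(e s_g b)\big)}{n} \tag{21}$$

with probability at least

$$1 - C\exp\left(-C\,\frac{s\log(e s_g b) + s_g\log(d/s_g)}{s}\right).$$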

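In addition, we focus on the following class of simultaneously element-wise and group-wise sparse vectors,

$$\mathbb{F}_{s,s_g} = \{\beta : \|\beta\|_0 \le s,\ \|\beta\|_{0,2} \le s_g\}.$$

The following minimax lower bound on the estimation error holds.

Theorem 5 (Lower bound of estimation error): Suppose $X$ satisfies Assumption 1, $b_1 = \cdots = b_d = b$, and $d, b \ge 3$. Then we have

$$\inf_{\hat\beta}\ \sup_{\beta \in \mathbb{F}_{s,s_g}} \mathbb{E}\|\hat\beta - \beta\|_2^2 \gtrsim \frac{\sigma^2\big(s_g\log(ed/s_g) + s\log(e s_g b/s)\big)}{n}.$$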

Remark 3: Theorems 4 and 5 together show that the sparse group Lasso yields the minimax optimal rate of convergence as long as the following condition holds: log(esgb) ≍ log(esgb) − log(s) or log(d) ≳ s log(s)/sg.

Remark 4: We briefly discuss the main proof ideas of Theorem 5 here. First, we randomly generate a series of subsets $\Omega^{(i)} \subseteq \{1, \ldots, p\}$ as feasible supports of $(s, s_g)$-sparse vectors. Then, we prove by a probabilistic argument that there exist $N \gtrsim (s_g\log(d/s_g) + s\log(e s_g b/s))$ subsets $\{\Omega^{(i)}\}_{i=1}^N$ such that $|\Omega^{(i)} \cap \Omega^{(j)}| < 8 s_g\lfloor s/s_g\rfloor/9$ for any $i < j$. Next, we construct a series of candidate $(s, s_g)$-sparse vectors $\beta^{(i)}$ such that $\beta^{(i)}_k = \tau 1\{k \in \Omega^{(i)}\}$. Intuitively speaking, by this construction the vectors $\{\beta^{(i)}\}_{i=1}^N$ are hard to distinguish based only on the observations $(y, X)$. Theorem 5 then follows by choosing an appropriate $\tau$ and applying the generalized Fano's lemma.

B. Statistical Inference via Debiased Sparse Group Lasso

We further consider statistical inference for $\beta^*$ under the double sparse linear regression model. First, let $\hat\beta$ be the sparse group Lasso estimator given by (4). Inspired by recent advances in inference for high-dimensional linear regression [30], [53], [31], [33], we propose the following debiased sparse group Lasso estimator,

$$\hat\beta^u = \hat\beta + \frac{1}{n}\hat M X^\top (y - X\hat\beta). \tag{22}$$

Here, $\hat\Sigma = \frac{1}{n}\sum_{k=1}^n X_k X_k^\top$ is the sample covariance matrix and $\hat M = [\hat m_1 \cdots \hat m_p]^\top$ is an approximation of the inverse covariance matrix $\Sigma^{-1}$, where $\hat m_i$ is the solution to the following convex optimization problem,

$$\text{minimize} \quad m^\top \hat\Sigma m \quad \text{subject to} \quad \|H_\alpha(\hat\Sigma m - e_i)\|_{\infty,2} \le \gamma. \tag{23}$$

Here, $H_\alpha$ is the soft-thresholding operator with thresholding level $\alpha$ defined at the beginning of Section II and $e_i$ is the $i$-th vector in the canonical basis of $\mathbb{R}^p$. The following theorem establishes an asymptotic result for the debiased sparse group Lasso.
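The correction step (22) itself is a single matrix-vector computation once $\hat M$ is available. The sketch below takes $\hat M$ as a given input; computing it from (23) requires a convex solver and is not attempted here. As a sanity check under an added low-dimensional assumption ($n > p$ with $\hat M = \hat\Sigma^{-1}$, which is not the regime of interest in the paper), the corrected estimator coincides with ordinary least squares regardless of the initial estimate. The function name `debias` is illustrative.

```python
import numpy as np

def debias(beta_hat, X, y, M):
    """One-step correction (22): beta_hat + M X^T (y - X beta_hat) / n."""
    n = X.shape[0]
    return beta_hat + M @ X.T @ (y - X @ beta_hat) / n

# Low-dimensional sanity check: with M = inverse sample covariance, (22) returns OLS.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, 0.0, -2.0]) + 0.1 * rng.standard_normal(n)
Sigma_hat = X.T @ X / n
M = np.linalg.inv(Sigma_hat)
ols = np.linalg.lstsq(X, y, rcond=None)[0]
from_zero = debias(np.zeros(p), X, y, M)
from_random = debias(rng.standard_normal(p), X, y, M)
```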

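Theorem 6 (Asymptotic distribution of debiased sparse group Lasso): Suppose $\beta^* \in \mathbb{R}^p$ is $(s, s_g)$-sparse, $X \in \mathbb{R}^{n \times p}$ satisfies Assumption 1, and $\max_{i \in T^c}\|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2 \le c/\sqrt{s}$. Set $\lambda = C\sigma\sqrt{(s\log(e s_g b) + s_g\log(d/s_g))\,n/s}$ and $\lambda_g = \sqrt{s/s_g}\,\lambda$ in (4), and $\alpha = \lambda/(n\sigma)$, $\gamma = \sqrt{s/s_g}\,\lambda/(n\sigma)$ in (23). Then with probability at least $1 - C\exp\big(-C\,(s\log(e s_g b) + s_g\log(d/s_g))/s\big)$, the debiased sparse group Lasso estimator $\hat\beta^u$ can be decomposed as $\sqrt{n}(\hat\beta^u - \beta^*) = \Delta + w$, where

$$\|\Delta\|_\infty \le \frac{C\big(s\log(e s_g b) + s_g\log(ed/s_g)\big)}{\sqrt{n}}\,\sigma, \qquad w \mid X \sim N(0, \sigma^2\hat M\hat\Sigma\hat M^\top). \tag{24}$$

In particular, if $\sqrt{n} \gg s\log(e s_g b) + s_g\log(ed/s_g)$, then for any $1 \le i \le p$,

$$\frac{\sqrt{n}(\hat\beta^u_i - \beta^*_i)}{\sqrt{\hat m_i^\top\hat\Sigma\hat m_i}} \xrightarrow{d} N(0, \sigma^2). \tag{25}$$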

Remark 5: (25) provides a method to construct confidence intervals for $\beta^*$. Specifically, if $\hat\sigma$ is a consistent estimator of $\sigma$, such as the one from the scaled sparse group Lasso to be discussed in Section V, then

$$\left[\hat\beta^u_i - \Phi^{-1}(1 - \alpha/2)\,\hat\sigma\sqrt{\frac{\hat m_i^\top\hat\Sigma\hat m_i}{n}},\ \ \hat\beta^u_i + \Phi^{-1}(1 - \alpha/2)\,\hat\sigma\sqrt{\frac{\hat m_i^\top\hat\Sigma\hat m_i}{n}}\right]$$

is an asymptotic $(1 - \alpha)$-confidence interval for $\beta^*_i$. The debiased sparse group Lasso estimator thus has a provable advantage in sample complexity ($n \gg (s\log(e s_g b) + s_g\log(ed/s_g))^2$) over the debiased Lasso ($\sqrt{n} \gg s\log p$, see [30], [31], [33]) and the debiased group Lasso ($n \gg (s_g b + s_g\log p)^2$, see [32]) for constructing asymptotic confidence intervals for $\beta^*$.
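Given the debiased coordinate, the corresponding row $\hat m_i$, the sample covariance, and a noise-level estimate, the interval above is a one-line computation. The sketch below assumes all of these are already available as inputs; the function name `debiased_ci` and its argument names are illustrative, and the normal quantile comes from the Python standard library.

```python
from statistics import NormalDist
import numpy as np

def debiased_ci(beta_u_i, m_i, Sigma_hat, sigma_hat, n, alpha=0.05):
    """Asymptotic (1 - alpha) interval: beta_u_i -/+ z * sigma_hat * sqrt(m_i' Sigma_hat m_i / n)."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)          # Phi^{-1}(1 - alpha/2)
    half_width = z * sigma_hat * np.sqrt(m_i @ Sigma_hat @ m_i / n)
    return beta_u_i - half_width, beta_u_i + half_width

# Toy inputs: identity covariance, m_i = e_1, sigma_hat = 1, n = 100.
lower, upper = debiased_ci(0.3, np.array([1.0, 0.0]), np.eye(2), 1.0, 100)
```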

IV. Simulation Studies

In this section, we investigate the numerical performance of the sparse group Lasso and $\ell_1 + \ell_{1,2}$ minimization for double sparse regression. The results support our theoretical findings in Sections II and III. We first discuss the practical choice of the tuning parameters used in the proposed algorithms.

A. Practical Selection of Tuning Parameters

By introducing $\tau$ as a surrogate for $(\lambda_g/\lambda)^2$, we can rewrite the $\ell_1 + \ell_{1,2}$ minimization and the sparse group Lasso as

$$\hat\beta = \arg\min\ \|\beta\|_1 + \sqrt{\tau}\,\|\beta\|_{1,2} \quad \text{subject to} \quad y = X\beta, \tag{26}$$
$$\hat\beta = \arg\min_{\beta}\ \|y - X\beta\|_2^2 + \lambda\|\beta\|_1 + \lambda\sqrt{\tau}\,\|\beta\|_{1,2}. \tag{27}$$

As suggested by Theorems 1 and 4, the theoretical choice of the tuning parameters $(\lambda, \tau)$ relies on $\sigma$, $s$, and $s_g$ in sparse group Lasso and $\ell_1 + \ell_{1,2}$ minimization for double sparse regression. These values, however, are usually unknown in practice. In addition, the theoretical values of the tuning parameters may not achieve the best finite-sample numerical performance. We thus introduce in this section a data-driven approach to tuning parameter selection using $K$-fold cross-validation.

We first discuss how to select $\tau$ in the $\ell_1 + \ell_{1,2}$ minimization (26). Recall that $n$ is the sample size, $p$ is the total number of covariates, $d$ is the number of groups, $b_1, \ldots, b_d$ are the numbers of covariates in each group, and $b = \max_j b_j$. Since the theoretical value is $\tau = s/s_g$ and $s/s_g$ must satisfy $1 \le s/s_g \le b$, for a given integer $L \ge 2$, we introduce a grid

$$S_0 = \{b^{(l-1)/(L-1)} : 1 \le l \le L\} \tag{28}$$

as a set of candidate values for $\tau$. Here, the grid size $L$ can be set to a typical value of 10, or a larger value if more computing power is available. We split the data $\{X_i, y_i\}_{i=1}^n$ into $K$ groups. For $1 \le k \le K$, let $J_k \subset \{1, \ldots, n\}$ be the index set of the $k$th group and $J_k^c = \{1, \ldots, n\}\setminus J_k$. For each $\tau \in S_0$, we solve

$$\hat\beta^{(k)}(\tau) = \arg\min\ \|\beta\|_1 + \sqrt{\tau}\,\|\beta\|_{1,2} \quad \text{subject to} \quad y_{J_k^c} = X_{[J_k^c,:]}\beta$$

and calculate the prediction error

$$\hat R(\tau) = \sum_{k=1}^K \sum_{j \in J_k} \big(y_j - X_{[j,:]}\hat\beta^{(k)}(\tau)\big)^2.$$

Let $\tau^*$ be the minimizer of the prediction error: $\tau^* = \arg\min_{\tau \in S_0}\hat R(\tau)$. Then, the final estimator $\hat\beta$ is calculated using (26) with $\tau^*$.
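The selection procedure above can be sketched as follows. The solver for (26) is passed in as a callback `solve(X_train, y_train, tau)`, which is a hypothetical interface (any routine returning a coefficient vector will do); the fold assignment by a seeded random permutation is likewise an assumption, since the text does not specify how the data are split.

```python
import numpy as np

def tau_grid(b, L=10):
    """S_0 = {b^((l-1)/(L-1)) : l = 1, ..., L}; requires L >= 2."""
    return np.array([b ** ((l - 1) / (L - 1)) for l in range(1, L + 1)])

def cv_select_tau(X, y, taus, solve, K=10, seed=0):
    """Return the tau in `taus` minimising the K-fold prediction error, plus all errors."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), K)
    errors = []
    for tau in taus:
        err = 0.0
        for J_k in folds:
            train = np.setdiff1d(np.arange(n), J_k)          # J_k^c
            beta_k = solve(X[train], y[train], tau)          # fit on J_k^c
            err += np.sum((y[J_k] - X[J_k] @ beta_k) ** 2)   # evaluate on J_k
        errors.append(err)
    errors = np.array(errors)
    return taus[int(np.argmin(errors))], errors
```

The same loop extends to the two-parameter case by iterating over pairs $(\tau, \lambda)$ instead of $\tau$ alone.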

Next we consider the sparse group Lasso (27), which includes two tuning parameters $(\tau, \lambda)$. We again use $S_0$ in (28) as a grid of candidate values of $\tau$. Following the idea in [11, Section 3.3], for each $\tau \in S_0$, we begin with a value $\lambda_{\max}(\tau)$ large enough that $\hat\beta$, the outcome of sparse group Lasso (27) with tuning parameters $(\tau, \lambda_{\max}(\tau))$, is zero (this can be achieved with the SGL package). Let $\lambda_{\min}(\tau)$ be a small fraction of $\lambda_{\max}(\tau)$ (e.g., $\lambda_{\min} = 0.1\lambda_{\max}$ as suggested in [11, Section 5]). Then we define

$$\Lambda(\tau) = \big\{\{\lambda_{\min}(\tau)\}^{(L-l)/(L-1)}\{\lambda_{\max}(\tau)\}^{(l-1)/(L-1)} : l = 1, \ldots, L\big\}.$$
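The set $\Lambda(\tau)$ is a geometric (log-spaced) grid between $\lambda_{\min}(\tau)$ and $\lambda_{\max}(\tau)$. A minimal sketch, taking $\lambda_{\max}$ as a given input and using the illustrative names `lambda_grid` and `ratio`:

```python
import numpy as np

def lambda_grid(lam_max, L=10, ratio=0.1):
    """Lambda(tau): L values spaced geometrically from ratio*lam_max up to lam_max."""
    lam_min = ratio * lam_max
    return np.array([
        lam_min ** ((L - l) / (L - 1)) * lam_max ** ((l - 1) / (L - 1))
        for l in range(1, L + 1)
    ])

grid = lambda_grid(5.0, L=10)
# The same grid is produced by NumPy's built-in geometric spacing.
same_as_geomspace = np.allclose(grid, np.geomspace(0.5, 5.0, 10))
```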

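Next, we split the data $\{X_i, y_i\}_{i=1}^n$ into $K$ groups. For $1 \le k \le K$, let $J_k \subset \{1, \ldots, n\}$ be the index set of the $k$th group and $J_k^c = \{1, \ldots, n\}\setminus J_k$. For each $\tau \in S_0$, $\lambda \in \Lambda(\tau)$, and $k \in \{1, \ldots, K\}$, we solve

$$\hat\beta^{(k)}(\tau, \lambda) = \arg\min_{\beta}\Big\{\|y_{J_k^c} - X_{[J_k^c,:]}\beta\|_2^2 + \lambda\|\beta\|_1 + \lambda\sqrt{\tau}\,\|\beta\|_{1,2}\Big\}$$

and calculate the prediction error

$$\hat R(\tau, \lambda) = \sum_{k=1}^K\sum_{j \in J_k}\big(y_j - X_{[j,:]}\hat\beta^{(k)}(\tau, \lambda)\big)^2.$$

Let $(\tau^*, \lambda^*)$ be the minimizer of the prediction error:

$$(\tau^*, \lambda^*) = \arg\min_{\tau \in S_0,\ \lambda \in \Lambda(\tau)}\hat R(\tau, \lambda).$$

The final estimator $\hat\beta$ is calculated using (27) with $(\tau^*, \lambda^*)$.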

In our simulation studies next, we will examine the performance of this cross-validation scheme with K = L = 10, λmin = 0.1λmax.

B. Numerical Results

We begin by considering the sample complexity for exact recovery in the noiseless case. Suppose all group sizes are equal ($b_1 = \cdots = b_d = b$) and the number of observations $n$ varies from 5 to 200. We consider four simulation designs with (1) $d = 60$, $b = 20$, $s_g = 1$; (2) $d = 100$, $b = 30$, $s_g = 2$; (3) $d = b = 20$, $s_g = 1$; and (4) $d = b = 40$, $s_g = 1$. For each setting, we randomly draw $X \in \mathbb{R}^{n \times db}$ with i.i.d. standard normal entries, construct the fixed vector $\beta^* \in \mathbb{R}^{db}$ satisfying

$$\beta^*_{(j)} = \begin{cases} (1, 2, 3, 4, 5, 0, \ldots, 0)^\top \in \mathbb{R}^b, & j = 1, \ldots, s_g; \\ 0, & j = s_g + 1, \ldots, d, \end{cases}$$

and generate $y = X\beta^* = \sum_{j=1}^{s_g} X_{(j)}\beta^*_{(j)}$. We implement the $\ell_1 + \ell_{1,2}$ minimization (5) with $\lambda_g = \sqrt{s/s_g}\,\lambda$ (SGL), the $\ell_1$ minimization (11) (Lasso), the $\ell_{1,2}$ minimization (12) (Group Lasso), and the $\ell_1 + \ell_{1,2}$ minimization (5) with the tuning parameter $\lambda_g/\lambda$ selected using the cross-validation scheme discussed in Section IV-A (SGL_CV). An exact recovery of $\beta^*$ is considered successful if $\|\hat\beta - \beta^*\|_2 \le 10^{-4}$. The successful recovery rate based on 100 replicates is shown in Figure 1. It can be seen that SGL and SGL_CV have comparable performance and both methods perform significantly better than Lasso and Group Lasso. This is in line with our theoretical results.
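The data-generating step of this experiment can be sketched as follows. The function names are illustrative, the random seed is an arbitrary choice, and the sketch covers only the generation of $(X, \beta^*, y)$ and the success criterion, not the estimators being compared.

```python
import numpy as np

def make_noiseless_design(n, d, b, s_g, rng):
    """Gaussian design with d groups of size b (b >= 5); the first s_g groups carry (1,2,3,4,5,0,...,0)."""
    X = rng.standard_normal((n, d * b))
    beta = np.zeros(d * b)
    for j in range(s_g):
        beta[j * b : j * b + 5] = np.arange(1, 6)
    y = X @ beta
    return X, beta, y

def is_recovered(beta_hat, beta, tol=1e-4):
    """Success criterion used in the text: ||beta_hat - beta||_2 <= 1e-4."""
    return np.linalg.norm(beta_hat - beta) <= tol

# Design (1) at one particular sample size.
X, beta_star, y = make_noiseless_design(n=100, d=60, b=20, s_g=1, rng=np.random.default_rng(0))
```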

Fig. 1. Exact recovery rate in the noiseless case

Then we consider the noisy case and focus on average estimation errors of different methods. We generate

$$y = X\beta^* + \varepsilon = \sum_{j=1}^{s_g} X_{(j)}\beta^*_{(j)} + \varepsilon,$$

where $X$ and $\beta^*$ are generated as in the previous setting and $\varepsilon \overset{\text{iid}}{\sim} N(0, 0.1^2)$. We consider four designs: (i) $d = 60$, $b = 20$, $s_g = 1$; (ii) $d = 100$, $b = 30$, $s_g = 2$; (iii) $d = b = 20$, $s_g = 1$; and (iv) $d = b = 40$, $s_g = 2$. For each case, the number of observations $n$ is chosen from an equally spaced sequence from 5 to 200 and the simulation is replicated 500 times. We compare the average estimation error of (a) SGL_CV1: sparse group Lasso with the theoretical value $\lambda_g = \sqrt{s/s_g}\,\lambda$ and $\lambda$ selected via cross validation; (b) SGL_package: sparse group Lasso via the SGL package² in R with the option of automatic tuning parameter selection; (c) Lasso: regular Lasso with tuning parameter selected via cross validation; (d) group Lasso: group Lasso with tuning parameter selected via cross validation; (e) SGL_CV2: sparse group Lasso with both $\lambda$ and $\lambda_g$ selected using the proposed cross-validation scheme. The proposed method SGL_CV2 achieves smaller estimation error than all other methods, including SGL_CV1, the focus of our theory. These experimental results support our theory and the applicability of the proposed cross-validation scheme.

V. Discussions

In this paper, we study the high-dimensional double sparse regression and investigate the theoretical properties of the sparse group Lasso and $\ell_1 + \ell_{1,2}$ minimization. In particular, we develop matching upper and lower bounds on the sample complexity for $\ell_1 + \ell_{1,2}$ minimization in the noiseless case. We also prove that the sparse group Lasso achieves the minimax optimal rate of convergence in a range of settings in the noisy case. Our results give an affirmative answer to the open question for high-dimensional statistical inference for simultaneously structured models: by introducing both $\ell_1$ and $\ell_{1,2}$ penalties, one can achieve better performance on estimation and statistical inference for simultaneously element-wise and group-wise sparse vectors.

In addition to $\beta^*$, the estimation of and inference for the noise level $\sigma$ is another important task in high-dimensional double sparse regression. Motivated by the recent development of the scaled Lasso [54], one may consider the following scaled sparse group Lasso estimator:

$$\{\hat\beta^{s}, \hat\sigma\} = \arg\min_{\beta \in \mathbb{R}^p,\ \sigma > 0} \left\{ \frac{\|y - X\beta\|_2^2}{\sigma} + n\sigma + \tilde\lambda\|\beta\|_1 + \tilde\lambda_g\|\beta\|_{1,2} \right\},$$

where $\tilde\lambda$ and $\tilde\lambda_g$ are tuning parameters that do not rely on $\sigma$. The consistency of $\hat\sigma$ can be established based on ideas similar to those for the scaled Lasso in the literature [54], [31] and the approximate dual certificate in this work.

Moreover, our technical results can be useful in a variety of other problems with simultaneous sparsity structures. For example, [55], [56] considered the estimation of piece-wise constant sparse signals, i.e., both the signal vector and the difference between successive entries of the signal vector are sparse. [57], [58] discussed the estimation of structured parameters where both the number of non-zero elements and the number of distinct values of the parameter vectors are small. [59] considered the estimation of matrices with simultaneous sparsity structures within each block and among different blocks. It is interesting to further study the statistical limits, including the sample complexity and minimax optimal rate of convergence for these problems. In particular, based on the specific sparsity structures of each problem, we can introduce corresponding multi-objective regularizers and the convex regularization methods. The corresponding approximate dual certificates can be proposed, constructed, and analyzed to provide strong theoretical guarantees.

VI. Proofs

We collect the proofs of technical results in this section.

A. Proof of Lemma 1

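Let $T$ satisfy (16). For convenience, we denote $s_0 = s/s_g$ and decompose $u$ as

$$u = v + w, \qquad v_i = \begin{cases} u_i - \sqrt{s_0}\,\beta_i^*/\|\beta^*_{T,(j)}\|_2, & i \in T,\ i \in (j); \\ u_i, & i \in (G) \setminus T; \\ u_i - H_{1/2}(u_i), & i \in (G^c), \end{cases} \qquad w_{(j)} = \begin{cases} \sqrt{s_0}\,\beta^*_{T,(j)}/\|\beta^*_{T,(j)}\|_2, & j \in G; \\ H_{1/2}(u_{(j)}), & j \notin G. \end{cases} \tag{29}$$

Note that $|H_{1/2}(x) - x| \le 1/2$ for any $x$. By property (18), $\|u_{(G) \setminus T}\|_\infty \le 1/2$, and then

$$\max_{i \in T^c} |v_i| \le 1/2, \qquad \|v_T - \mathrm{sgn}(\beta_T^*)\|_2 = \|u_T - (\tilde u_0)_T\|_2 \le \frac{c_{\min}}{8 \max_{i \in T^c} \|X_T^\top X_i/n\|_2}; \tag{30}$$

$$w_{(j)} = \sqrt{s_0}\,\beta^*_{T,(j)}/\|\beta^*_{T,(j)}\|_2 \ \text{ if } j \in G; \qquad \|w_{(j)}\|_2 \le \sqrt{s_0}/2 \ \text{ if } j \notin G. \tag{31}$$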
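The bound $|H_{1/2}(x) - x| \le 1/2$ used above is easy to check numerically. The sketch below assumes that $H_\alpha$ denotes the entrywise soft-thresholding operator $H_\alpha(x) = \mathrm{sgn}(x)(|x| - \alpha)_+$ (its definition lies outside this section, so this is an assumption); the function name `soft_threshold` is hypothetical.

```python
import numpy as np

def soft_threshold(x, alpha):
    """Entrywise soft-thresholding: sgn(x) * max(|x| - alpha, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - alpha, 0.0)

# On a grid of inputs, the part removed by H_{1/2} never exceeds 1/2 in absolute value.
x = np.linspace(-3.0, 3.0, 601)
gap = np.abs(soft_threshold(x, 0.5) - x)
print(gap.max())
```

This is the reason the coordinates $v_i = u_i - H_{1/2}(u_i)$ on $(G^c)$ satisfy $|v_i| \le 1/2$ in (30).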

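Suppose $\hat\beta$ is the minimizer of (5) and $h = \hat\beta - \beta^*$. Then, based on the sub-differentials of $\|\beta\|_1$ and $\|\beta\|_{1,2}$, we have

$$\begin{aligned} \mathcal{P}(\hat\beta) &= \|\hat\beta\|_1 + \sqrt{s_0}\|\hat\beta\|_{1,2} = \|\beta^* + h\|_1 + \sqrt{s_0}\|\beta^* + h\|_{1,2} \\ &\ge \|\beta_T^*\|_1 + \mathrm{sgn}(\beta_T^*)^\top h_T + \|h_{T^c}\|_1 + \sqrt{s_0}\Big( \|\beta_T^*\|_{1,2} + \sum_{j \in G} \frac{\beta_{T,(j)}^{*\top} h_{(j)}}{\|\beta^*_{T,(j)}\|_2} + \sum_{j \notin G} \|h_{(j)}\|_2 \Big) - \|\beta^*_{T^c}\|_1 - \sqrt{s_0}\|\beta^*_{T^c}\|_{1,2} \\ &\ge \mathcal{P}(\beta^*) + \|h_{T^c}\|_1 + \sqrt{s_0}\|h_{(G^c)}\|_{1,2} + \mathrm{sgn}(\beta_T^*)^\top h_T + \sum_{j \in G} \frac{\sqrt{s_0}\,\beta_{T,(j)}^{*\top} h_{(j)}}{\|\beta^*_{T,(j)}\|_2} - 2\|\beta^*_{T^c}\|_1 - 2\sqrt{s_0}\|\beta^*_{T^c}\|_{1,2}. \end{aligned} \tag{32}$$

The last inequality comes from $\|\beta^*\|_1 = \|\beta_T^*\|_1 + \|\beta^*_{T^c}\|_1$ and $\|\beta^*\|_{1,2} \le \|\beta_T^*\|_{1,2} + \|\beta^*_{T^c}\|_{1,2}$.

In particular, given $Xh = 0$ and that $u$ lies in the row span of $X$, we have $v^\top h + w^\top h = u^\top h = 0$. Therefore,

$$\begin{aligned} &\mathrm{sgn}(\beta_T^*)^\top h_T + \sum_{j \in G} \frac{\sqrt{s_0}\,\beta_{T,(j)}^{*\top} h_{(j)}}{\|\beta^*_{T,(j)}\|_2} \\ &\quad = \mathrm{sgn}(\beta_T^*)^\top h_T - v^\top h + \sum_{j \in G} \frac{\sqrt{s_0}\,\beta_{T,(j)}^{*\top} h_{(j)}}{\|\beta^*_{T,(j)}\|_2} - w^\top h \\ &\quad = -\big(v_T - \mathrm{sgn}(\beta_T^*)\big)^\top h_T - v_{T^c}^\top h_{T^c} - \sum_{j \in G} \Big( w_{(j)} - \sqrt{s_0}\,\beta^*_{T,(j)}/\|\beta^*_{T,(j)}\|_2 \Big)^\top h_{(j)} - (w_{(G^c)})^\top h_{(G^c)} \\ &\quad \ge -\|v_T - \mathrm{sgn}(\beta_T^*)\|_2 \|h_T\|_2 - \|v_{T^c}\|_\infty \|h_{T^c}\|_1 - \max_{j \in G} \Big\{ \big\| w_{(j)} - \sqrt{s_0}\,\beta^*_{T,(j)}/\|\beta^*_{T,(j)}\|_2 \big\|_2 \Big\} \|h_{(G)}\|_{1,2} - \|w_{(G^c)}\|_{\infty,2} \|h_{(G^c)}\|_{1,2} \\ &\quad \overset{(30)(31)}{\ge} -\|v_T - \mathrm{sgn}(\beta_T^*)\|_2 \|h_T\|_2 - \|h_{T^c}\|_1/2 - \sqrt{s_0}\|h_{(G^c)}\|_{1,2}/2. \end{aligned} \tag{33}$$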

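Next, note that $h = h_T + h_{T^c}$ and $Xh = 0$, so we must have $X_T h_T = -X_{T^c} h_{T^c}$. Then

$$\|h_T\|_2 = \big\| (X_T^\top X_T/n)^{-1} X_T^\top X_T h_T/n \big\|_2 \le \sigma_{\min}^{-1}(X_T^\top X_T/n)\, \big\| X_T^\top X_{T^c} h_{T^c}/n \big\|_2 \le \frac{2}{c_{\min}} \max_{i \in T^c} \|X_T^\top X_i/n\|_2\, \|h_{T^c}\|_1. \tag{34}$$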

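Combining (30), (33), and (34), one obtains

$$\mathrm{sgn}(\beta_T^*)^\top h_T + \sum_{j \in G} \frac{\sqrt{s_0}\,\beta_{T,(j)}^{*\top} h_{(j)}}{\|\beta^*_{T,(j)}\|_2} \ge -\frac{3}{4}\|h_{T^c}\|_1 - \sqrt{s_0}\|h_{(G^c)}\|_{1,2}/2.$$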
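The constant $3/4$ in the last display follows from a one-line computation. Writing $M := \max_{i \in T^c} \|X_T^\top X_i/n\|_2$ (a shorthand introduced here only for this remark), the bounds (30) and (34) give:

```latex
\begin{aligned}
\|v_T - \mathrm{sgn}(\beta_T^*)\|_2 \, \|h_T\|_2
  &\le \frac{c_{\min}}{8M} \cdot \frac{2M}{c_{\min}} \, \|h_{T^c}\|_1
   = \frac{1}{4}\|h_{T^c}\|_1, \\
-\|v_T - \mathrm{sgn}(\beta_T^*)\|_2 \, \|h_T\|_2 - \frac{1}{2}\|h_{T^c}\|_1
  &\ge -\Big(\frac{1}{4} + \frac{1}{2}\Big)\|h_{T^c}\|_1
   = -\frac{3}{4}\|h_{T^c}\|_1 .
\end{aligned}
```

Substituting this into the right-hand side of (33) yields the stated inequality; the factor $M$ cancels, which is why (30) is normalized by $\max_{i \in T^c} \|X_T^\top X_i/n\|_2$.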

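Plugging this inequality into (32), we finally have

$$\mathcal{P}(\hat\beta) \ge \mathcal{P}(\beta^*) + \|h_{T^c}\|_1/4 + \sqrt{s_0}\|h_{(G^c)}\|_{1,2}/2 - 2\|\beta^*_{T^c}\|_1 - 2\sqrt{s_0}\|\beta^*_{T^c}\|_{1,2}.$$

Since $\hat\beta$ is the minimizer of (5), we must have $\mathcal{P}(\hat\beta) \le \mathcal{P}(\beta^*)$, and hence

$$\|h_{T^c}\|_1/4 + \sqrt{s_0}\|h_{(G^c)}\|_{1,2}/2 \le 2\|\beta^*_{T^c}\|_1 + 2\sqrt{s_0}\|\beta^*_{T^c}\|_{1,2}. \tag{35}$$

If $\beta^*$ is $(s, s_g)$-sparse, (35) immediately gives $h_{T^c} = 0$. Then $0 = X_T^\top X h = (X_T^\top X_T) h_T$. Since $\sigma_{\min}(X_T^\top X_T/n) \ge c_{\min}/2$, the matrix $X_T^\top X_T/n$ is non-singular, and thus $h_T = 0$.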

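Now, we consider the general case. Without loss of generality, suppose $G = \{1, \ldots, g\}$, where $g \le s_g$. Denote $T_1$ as the indices of the $s$ largest entries of $h_{(G) \setminus T}$, $T_2$ as the indices of the $s$ largest entries of $h_{(G) \setminus [T \cup T_1]}$, and so on. For $g + 1 \le i \le d$, denote $S_{i,1}$ as the indices of the $\lfloor s/s_g \rfloor$ largest entries of $h_{(i)}$, $S_{i,2}$ as the indices of the $\lfloor s/s_g \rfloor$ largest entries of $h_{(i) \setminus S_{i,1}}$, and so on. Let $\tilde S_1, \ldots, \tilde S_{\sum_{i=g+1}^{d} \lceil b_i/\lfloor s/s_g \rfloor \rceil}$ be an arrangement of $S_{i,j}$ ($1 \le j \le \lceil b_i/\lfloor s/s_g \rfloor \rceil$, $g + 1 \le i \le d$) such that $\|h_{\tilde S_1}\|_2^2 \ge \cdots \ge \|h_{\tilde S_{\sum_{i=g+1}^{d} \lceil b_i/\lfloor s/s_g \rfloor \rceil}}\|_2^2$. Let $R_1 = \bigcup_{i=1}^{s_g} \tilde S_i$, $R_2 = \bigcup_{i=s_g+1}^{2s_g} \tilde S_i$, and so on. Then $(T_1, T_2, \ldots, R_1, R_2, \ldots)$ is a partition of $T^c$, and $|T_i|, |R_j| \le s$, $|g(T_i)|, |g(R_j)| \le s_g$, where $g(S) = \{i_1, \ldots, i_k\}$ if $S \subseteq \bigcup_{j=1}^{k} (i_j)$ and $S \cap (i_j)$ are not empty for all $1 \le j \le k$. Let $\tilde T = T \cup T_1 \cup R_1$. If (19) holds, then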

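$$\frac{c_{\min}}{2}\|h_{\tilde T}\|_2^2 \le \frac{1}{n}\|X_{\tilde T} h_{\tilde T}\|_2^2 = \frac{1}{n}\langle X_{\tilde T} h_{\tilde T}, Xh \rangle - \frac{1}{n}\langle X_{\tilde T} h_{\tilde T}, X_{\tilde T^c} h_{\tilde T^c} \rangle. \tag{36}$$

Since $Xh = 0$, we have

$$\langle X_{\tilde T} h_{\tilde T}, Xh \rangle = 0. \tag{37}$$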

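Now, we consider $|\langle X_{\tilde T} h_{\tilde T}, X_{\tilde T^c} h_{\tilde T^c} \rangle|$. By the triangle inequality,

$$|\langle X_{\tilde T} h_{\tilde T}, X_{\tilde T^c} h_{\tilde T^c} \rangle| \le |\langle X_T h_T, X_{\tilde T^c} h_{\tilde T^c} \rangle| + |\langle X_{T_1} h_{T_1}, X_{\tilde T^c} h_{\tilde T^c} \rangle| + |\langle X_{R_1} h_{R_1}, X_{\tilde T^c} h_{\tilde T^c} \rangle|.$$

The triangle inequality also shows that

$$|\langle X_T h_T, X_{\tilde T^c} h_{\tilde T^c} \rangle| \le \sum_{i \ge 2} |\langle X_T h_T, X_{T_i} h_{T_i} \rangle| + \sum_{j \ge 2} |\langle X_T h_T, X_{R_j} h_{R_j} \rangle|.$$

Combining the parallelogram identity with (19), we have

$$|\langle X_T h_T, X_{T_i} h_{T_i} \rangle| \le C_{\max}\, n\, \|h_T\|_2 \|h_{T_i}\|_2, \qquad |\langle X_T h_T, X_{R_j} h_{R_j} \rangle| \le C_{\max}\, n\, \|h_T\|_2 \|h_{R_j}\|_2.$$

Thus,

$$|\langle X_T h_T, X_{\tilde T^c} h_{\tilde T^c} \rangle| \le C_{\max}\, n\, \|h_T\|_2 \Big( \sum_{i \ge 2} \|h_{T_i}\|_2 + \sum_{j \ge 2} \|h_{R_j}\|_2 \Big). \tag{38}$$

By (3.10) in [37], we have

$$\sum_{i \ge 2} \|h_{T_i}\|_2 \le s^{-1/2} \|h_{(G) \setminus T}\|_1, \tag{39}$$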

and

j2hRj2=j2(i=(j1)sg+1jsghS˜i22)1/2j2sghS˜(j1)sg2j2sgi=(j2)sg+1(j1)sghS˜i2/sg=sg1/2khS˜k2=sg1/2i=g+1djhSi,j2.

For all g +1 ≤ id, apply (3.10) in [37] again,

j2hSi,j2(s/sg)1/2h(i)12(s/sg)1/2h(i)1.

Moreover, by the definition of Si,1,

i=g+1dhSi,12i=g+1dh(i)2=h(Gc)1,2.

Therefore,

j2hRj2sg1/2(i=g+1d2(s/sg)1/2h(i)1)+sg1/2h(Gc)1,2=2s1/2h(Gc)1+sg1/2h(Gc)1,2. (40)

Combine (38), (39) and (40) together, if (19) holds, we have

|XThT,XT˜chT˜c|(s1/2h(G)\T1+2s1/2h(Gc)1+sg1/2h(Gc)1,2)CmaxnhT2CmaxnhT2(2s1/2hTc1+sg1/2h(Gc)1,2).

Similarly, if (19) holds, then

|XT1hT1,XT˜chT˜c|CmaxnhT12(2s1/2hTc1+sg1/2h(Gc)1,2)

and

|XR1hR1,XT˜chT˜c|CmaxnhR12(2s1/2hTc1+sg1/2h(Gc)1,2).

Thus, with probability at least 1 – 2ecn,

|XT˜hT˜,XT˜chT˜c|Cmaxn(hT2+hT12+hR12)(2s1/2hTc1+sg1/2h(Gc)1,2)3CmaxnhT˜2(2s1/2hTc1+sg1/2h(Gc)1,2). (41)

The last inequality holds due to Cauchy-Schwarz inequality. Combine (36), (37), (41) and Lemma 3 together, we know that with probability at least 1 – 2ecn,

cmin2hT˜223CmaxhT˜2(2s1/2hTc1+sg1/2h(Gc)1,2),

i.e., with probability at least 1 – 2ecn,

hT˜223Cmaxcmin(2s1/2hTc1+sg1/2h(Gc)1,2).

Finally, by (35), (39), (40) and the previous inequality, with probability at least 1 – 2ecn,

h2hT˜2+i2hTi2+j2hRj223Cmaxcmin(2s1/2hTc1+sg1/2h(Gc)1,2)+2s1/2hTc2+sg1/2h(Gc)1,2C(1sβTc*1+1sgβTc*1,2).

In summary, we have finished the proof of this lemma. □

B. Proof of Lemma 2

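Let $T$ satisfy (16). Given $\|\beta_T^*\|_{0,2} \le s_g$, without loss of generality we assume that

$$\beta^*_{T,(s_g+1)}, \ldots, \beta^*_{T,(d)} = 0.$$

We also denote $T_{(j)}$ as the support of $\beta^*_{T,(j)}$. First, by Lemma 6 Part 3 with

$$v \in \mathbb{R}^p, \quad v_k = \begin{cases} 1, & k = i; \\ 0, & k \ne i; \end{cases} \qquad U \in \mathbb{R}^{p \times |T|} = \mathbb{R}^{(\sum_{i=1}^{d} b_i) \times |T|}, \quad U_{[T,:]} = I; \quad U_{[T^c,:]} = 0,$$

and noticing that $x \log(eu/x) \ge \log(eu)$ for all $1 \le x \le u$, we have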

(maxiTcXTXi/n21/2)iTc(XTXi/n21/2)iTc(XTXi/nEXTXi/n2+EXTXi/n21/2)iTc(XTXi/nEXTXi/n21/2ΣT,TΣi,TΣT,T12)iTc(XTXi/nEXTXi/n21/4)dbC exp(Csn)C exp( log(d)+log(b)+Csn)C exp(sg log(ed/sg)+s log(esgb/s)+Csn)C exp(cn) (42)

provided that nC (s log(esgb/s) + sg log(d/sg)) for some large constant C > 0. Note that the fourth inequality comes from the facts that ‖ΣT,T‖ ≤ ‖Σ‖ ≤ Cmax and Σi,TΣT,T12c/s1/(4Cmax). By Lemma 7 Part 1, we also know

(σmin(XTXT/n)cmin/2)(XTXT/nΣT,Tcmin/2)(XTXTΣT,T1/nI|T|ΣT,Tcmin/2)(XTXTΣT,T1/nI|T|cmin/(2Cmax))C exp(Cscn)C exp(cn)

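Next, we apply the well-regarded golfing scheme [50], [44] to find an approximate dual certificate $u$ that satisfies (18). Let $u_0 \in \mathbb{R}^p$ be such that

$$(u_0)_{(j)} = \begin{cases} \sqrt{s/s_g}\, \dfrac{\beta^*_{T,(j)}}{\|\beta^*_{T,(j)}\|_2} + \mathrm{sgn}(\beta^*_{T,(j)}), & j \in G; \\ 0, & j \in G^c. \end{cases} \tag{43}$$

Immediately we have $(u_0)_T = (\tilde u_0)_T$. We divide the $n$ rows of $X$ into non-overlapping batches, say $X_{[I_1,:]}, X_{[I_2,:]}, \ldots$, with $|I_l| = n_l$. Here, $n_1, n_2, \ldots$ will be specified later. Consider the following sequences:

$$\alpha_0 = u_0, \qquad \gamma_l = X_{[I_l,:]}^\top X_{[I_l,T]} \Sigma_{T,T}^{-1}/n_l \cdot (\alpha_{l-1})_T, \qquad \alpha_l = \alpha_{l-1} - \gamma_l, \qquad l = 1, 2, \ldots, l_{\max}. \tag{44}$$

Finally, the approximate dual certificate is defined as

$$u = \sum_{l=1}^{l_{\max}} \gamma_l = \sum_{l=1}^{l_{\max}} X_{[I_l,:]}^\top X_{[I_l,T]} \Sigma_{T,T}^{-1}/n_l \cdot (\alpha_{l-1})_T. \tag{45}$$

From the inductive definition we can see

$$(\alpha_l)_T = \big( I - X_{[I_l,T]}^\top X_{[I_l,T]} \Sigma_{T,T}^{-1}/n_l \big) (\alpha_{l-1})_T, \qquad (\gamma_l)_{T^c} = X_{[I_l,T^c]}^\top X_{[I_l,T]} \Sigma_{T,T}^{-1}/n_l \cdot (\alpha_{l-1})_T, \qquad l = 1, 2, \ldots.$$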
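The recursion (44) and the certificate (45) can be sketched numerically as follows. This is a simplified illustration under stated assumptions, not the construction analysed in the proof: the function name `golfing_certificate` is hypothetical, $\Sigma_{T,T}^{-1}$ is passed in by the caller, and the batch sizes are left to the user rather than set to the specific $n_l$ chosen later in the proof.

```python
import numpy as np

def golfing_certificate(X, T, u0, batches, Sigma_TT_inv):
    """Run alpha_l = alpha_{l-1} - gamma_l over row batches I_1, I_2, ... and sum the gamma_l.

    X: (n, p) design, T: index array of the support, u0: starting vector in R^p,
    batches: list of disjoint row-index arrays, Sigma_TT_inv: (|T|, |T|) matrix.
    """
    alpha = u0.astype(float).copy()
    u = np.zeros(X.shape[1])
    residual_norms = [np.linalg.norm(alpha[T])]           # ||q_0||_2
    for I in batches:
        XI = X[I]                                          # rows of the l-th batch
        q = alpha[T]                                       # q_{l-1} = (alpha_{l-1})_T
        gamma = XI.T @ (XI[:, T] @ (Sigma_TT_inv @ q)) / len(I)
        u += gamma                                         # u accumulates sum_l gamma_l
        alpha = alpha - gamma
        residual_norms.append(np.linalg.norm(alpha[T]))   # ||q_l||_2
    return u, alpha, residual_norms
```

By construction $u$ is a linear combination of rows of $X$, and the telescoping sum gives $u_T = (u_0)_T - (\alpha_{l_{\max}})_T$, which is the identity used later in the proof. When each batch is large relative to $|T|$, the residual norms $\|q_l\|_2$ typically shrink from one batch to the next.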

Next, we apply the random matrix results (Lemmas 7 and 6) and obtain the following tail probabilities.

  • if nlC stl for large constant C > 0 and tlC, by Part 1 of Lemma 7,
    (X[Il,T]X[Il,T]ΣT,T1/nlI|T|Cstlnl)C exp(Csnl min{stlnl,(stlnl)1/2})C exp(cstl); (46)
  • Suppose ql1=(αl1)T|T| is independent of X[Il,:]. If nlC(s0 log(esgb/s)+logd)min{s0δl2,s0δl} for δlC max iTcΣi,TΣT,T12C(maxiTcΣi,TΣT,T12)ΣT,T1ql12/ql12, by Lemma 7 Part 2,
    (max jGcHql12δl(X[Il,(j)]X[Il,T]ΣT,T1/nlql1)2s0ql12δl)jGc(Hql12δl(X[Il,(j)]X[Il,T](ΣT,T1ql1)/nl)2s0ql12δl)d(bs0)exp(Cs0cnl min{s0ql122δl2κ4ΣT,T1ql122,s0ql12δlκ2ΣT,T1ql12})+d(bs0)exp(Cs0cnl min{s0ql122δl2κ4ΣT,T1ql122,s0ql12δlκ2ΣT,T1ql12})2d(ebs0)s0exp(Cs0cnl min{s0δl2,s0δl})C exp( log(d)+Cs0 log(2esgb/s)+Cs0cnl min{s0δl2,s0δl})C exp(cnl min{s0δl2,s0δl}); (47)

    The third inequality comes from T,T11cmin.

  • Suppose ql1=(αl1)T|T| is fixed. If nl min{θl2,θl}C log(esgb),θl2 max iTcΣi,TΣT,T12, by Lemma 6 Part 2,
    (X[Il,(G)\T]X[Il,T]ΣT,T1/nlql1θlql12)i(G)\T(|X[Il,i]X[Il,T]/nl(ΣT,T1ql1)|θlql12)i(G)\T(|X[Il,i]X[Il,T]/nl(ΣT,T1ql1)Σi,TΣT,T1ql1|θlql12|Σi,TΣT,T1ql1|)i(G)\T(|X[Il,i]X[Il,T]/nl(ΣT,T1ql1)Σi,TΣT,T1ql1|θlql12Σi,TΣT,T12ql12)i(G)\T(|X[Il,i]X[Il,T]/nl(ΣT,T1ql1)Σi,TΣT,T1ql1|12θlql12)i(G)\T(|X[Il,i]X[Il,T]/nl(ΣT,T1ql1)Σi,TΣT,T1ql1|cmin2θlΣT,T1ql12)sgbC exp(cnl min{θl2,θl})=C exp( log(sgb)cnl min{θl2,θl})C exp(cnl min{θl2,θl}). (48)

Then we specify {nl, tl, δl, θl}l ≥ 1 as follows,

  • n1 = n2C(s log(esgb) + sg log(d/sg)), t1 = t2 = cn1/(s log(es)) ≥ C, δ1=δ2=1/(16s), θ1=θ2=1/(16s);

  • n3==nlmaxn1lmax2C(s log(esgb)+sg log(d/sg))/ log(es), t3==tlmax=cn3/sC, δ3==δlmax= log(es)/(16s)max {( log(es)/s)1/2/16, log(es)s0/(16s)}, θ3==θlmax=( log(es)/s)1/2/16, with lmax = ⌈C log(es)⌉ + 2.

We can see the following events happen

X[Il,T]X[Il,T]ΣT,T1/nlI|T|Cstl/nl1/ log(es),l=1,2;X[Il,T]X[Il,T]ΣT,T1/nlI|T|Cstl/nl1/2,l=3,,lmax; (49)
maxjGcHql12/(16s)(X[Il,(j)]X[Il,T]ΣT,T1/nlql1)2s0ql12/(16s),l=1,2;maxjGcHql12 log(es)/(16s)(X[Il,(j)]X[Il,T]ΣT,T1/nlql1)2s0ql12 log(es)/(16s),l=3,,lmax; (50)
X[Il,(G)\T]X[Il,T]ΣT,T1/nlql1ql12/(16s),l=1,2X[Il,(G)\T]X[Il,T]ΣT,T1/nlql1ql12( log(es)/s)1/2/16,l=3,,lmax. (51)

with probability at least 1C log(es)exp(cn log(es))C log(es)exp(cnsg)C log(es)exp(cns). By triangle inequality, u0 satisfies

u02s/sg(jGβT,(j)*βT,(j)*222)1/2+sgn(βT*)22s. (52)

When maxiTcXTXi/n212 and (49)–(52) hold, we have

q022s,q12=(I|T|XI1,TXI1,TΣT,T1/n1)q0I|T|XI1,TXI1,TΣT,T1/n1q022s/ log(es);similarly, q22q12/ log(es)2s/( log(es));ql2ql12/2q2/2l223ls/( log(es)),l3. (53)

For large constant C>0,qlmax223C log(es)s/ log(es)cmin/8. Notice that

uT=(l=1lmaxγl)T=(l=1lmax(αl1αl))T=(α0αlmax)T=(u˜0)T(qlmax)T,

we know that

uT(u˜0)T2max iTcXTXi/n2=qlmax2max iTcXTXi/n2cmin812<cmin8.

In addition,

u(G)\Tl=1lmaxX[Il,(G)\T]X[Il,T]ΣT,T1/nl(αl1)Tq02/(16s)+q12/(16s)+l=3lmaxql12( log(es)/s)1/2/161/8+1/8+l=324l/161/2.

Since

q02/(16s)+q12/(16s)+l=3lmaxql12 log(es)/(16s)18+18+l=3lmax24ls/( log(es)) log(es)/(16s)12,
H1/2(u(Gc)),2Hν(u(Gc)),2l=3lmaxHql12 log(es)/(16s)(X[Il,(Gc)]X[Il,Tc]ql1),2+l=12Hql12/(16s)(X[Il,(Gc)]X[Il,Tc]ql1),2l=12s0ql12/(16s)+l=3lmaxs0ql12 log(es)/(16s)s0/2,

where

ν=q02/(16s)+q12/(16s)+l=3lmaxql12 log(es)/(16s).

Thus, the constructed $u$ satisfies all the conditions required in Lemma 1 with probability at least $1 - C\exp(-cn/s)$. This finishes the proof of this lemma. □

C. Proof of Lemma 3

Let $g(S)$ be the group support of a set $S$, that is, $g(S) = \{i_1, \ldots, i_k\}$ if $S \subseteq \bigcup_{j=1}^{k} (i_j)$ and $S \cap (i_j)$ are not empty for all $1 \le j \le k$. Lemma 7 Part 1 and the union bound show that

(γp,γ02s,γ0,22sg1nXγ22[cmin2γ22,(Cmin+cmin2)γ22])=(x2sp,S1,,p,|S|=2sp,|g(S)|2sg1nXSx22[cmin2γ22,(Cmin+cmin2)γ22])S{1,,p},|S|=2sp,|g(S)|2sg(x2sp,1nXSx22[cmin2γ22(Cmin+cmin2)γ22])S{1,,p},|S|=2sp,|g(S)|2sg(1nXSXSΣS,Scmin2)S{1,,p},|S|=2sp,|g(S)|2sg(1nXSXSΣS,S1I|S|cmin2Cmax)[(d2sg)1](2sgb2s)2 exp(Cscn)(ed2sg)2sg(e2sgb2s)2s2 exp(Cscn)2 exp(2s log(esgb/s)+2sg log(ed/sg)+Cscn)2ecn.

D. Proof of Theorem 2

If d ≥ 3sg and b ≥ 3s/sg, by (80), we can find Ω(1), …, Ω(N) ⊂ {1, …, db} such that |Ω(i)| = sgs/sg⌋, |Ω(k)(i)|=s/sg1{Ω(k)(i)is not empty}  for all 1 ≤ iN, ≤ kd, and

|Ω(i)Ω(j)|8sgs/sg/9,1ijN, (54)
|{k|Ω(k)(i)Ω(k)(j)|2s/sg/3}|2sg/3,1ijN, (55)

where N=|(d22sg)sg/3(b22s/sg)s/9|. For any 1 ≤ jdb, 1 ≤ iN, define

βj(i)={1λsgs/sg+λgsgs/sg,jΩ(i)0jΩ(i),

then ∥β(i)0s, ∥β(i)0,2sg. We consider the quotient space

db/ker(X)={[x]x+ker(X),xdb}.

Then the dimension of db/ker(X) is rank(X) ≤ n. Define the norm ∥[x]∥ = infv∈ker(X){λxv1 + λgxv1,2}. For any vector xdb satisfying ∥x0 ≤ 2s, ∥x0,2 ≤ 2sg, note that xv with v ∈ ker(X) satisfies X(xv) = Xx, by our assumption, we have ∥[x]∥ = λx1 +λgx1,2. Thus ∥[β(1)]∥ = ⋯ = ∥[β(N)]∥ = 1. Moreover, by (54) and (55),

β(i)β(j)1=1λsgs/sg+λgsgs/sg(|Ω(i)|+|Ω(j)|2|Ω(i)Ω(j)|)2sgs/sg9(λsgs/sg+λgsgs/sg),

and

β(i)β(j)1,2=k=1dβ(k)(i)β(k)(j)2kSi,jβ(k)(i)β(k)(j)21λsgs/sg+λgsgs/sg2s/sg3|Si,j|1λsgs/sg+λgsgs/sg2s/sg3sg3,

where

Si,j={k|Ω(k)(i),Ω(k)(j) are not empty sets,|Ω(k)(i)Ω(k)(j)|<2s/sg/3|}

Since β(i)β(j) is (2s, 2sg)-sparse,

[β(i)][β(j)]=[β(i)β(j)]=λβ(i)β(j)1+λgβ(i)β(j)1,22/9.

By [46, Proposition C.3], we have N ≤ 10rank(X) ≤ 10n. Therefore we have

(d22sg)sg/3(b22s/sg)s/910n,

which means that nc(sg log(d/sg) + s log(esgb/s)).

If d < 3sg or b < 3s/sg, let sg=[sg/3]1sg/5, s=[s/15]sg, then d3sg and b3s/sg. Since all (2s, 2sg)-sparse vectors can be exactly recovered by the 1 + 1,2 minimization and s′ ≤ s, sgsg, we know that the 1 + 1,2 minimization exactly recover all (2s′, 2sg)-sparse vectors. Therefore, we have

nc(sg log(d/sg)+s log(esgb/s))c(sg5 log(dsg)+s15 log(eb(sg/5)s/15eb))c(sg log(d/sg)+s log(esgb/s)). (56)

E. Proof of Theorem 3

We prove Theorem 3 by contradiction. Let

c=min{18,c,c256},c0=min{c2e,c22C2,16c2},C0=max {C2c2,132c2},

where c′ is a uniform constant such that nc′(s log(esgb/s) + sg log(d/sg)) if the conditions in Theorem 2 are satisfied. Assume for contradiction that

n<c0(s log(esgb/s)+sg log(d/sg)). (57)

Let s0 = s/sg, define the norm =1+s01,2. Let B={xpx1},

dn(B,p)=infLn is a subspace of p with dim(p/Ln)n{supβBLnβ2}.

By [46, Theorem 10.4], we have

dn(B,p)supβBβΔ(Xβ)2CssupβB(β1+s0β1,2)=Cs. (58)

If

dn(B,p)c min{1s0,[(sgs log(cssgd log(esgb/s)n)+log(esgb/s))/n]1/2}, (59)

since

CscC0scsgs=cs0,

(58) and (59) together imply that

nc2C2(sg log(cssgd log(esgb/s)n)+s log(esgb/s)). (60)

By (57),

cssgd log(esgb/s)n>cssgd log(esgb/s)c0(s log(esgb/s)+sg log(d/sg))2essgd log(esgb/s)s log(esgb/s)+sg log(d/sg)min{essgd log(esgb/s)s log(esgb/s),essgd log(esgb/s)sg log(ed/sg)}min{edsg,edsg log(edsg)}(edsg)1/2. (61)

In the last inequality, we used x1/2 ≥ log(x)/2 for all x ≥ 1. Combine (60) and (61) together, we have

nc22C2(s log(esgb/s)+sg log(d/sg))c0(s log(esgb/s)+sg log(d/sg))>n,

which is a contradiction.

Thus, we only need to prove (59) based on (57). We again argue by contradiction. If

dn(B,p)<c min{1s0,[(sgs log(cssgd log(esgb/s)n)+log(esgb/s))/n]1/2}=:μ,

then there exists a subspace Ln of p with dim(p/Ln)n such that for all vLn\{0},

v2<μ(v1+s0v1,2).

Let Bn×p satisfying ker(B) = Ln. Let s=132μ2, sg=s/s0, by (57) and (61),

18s01/2cs01/2μc min{C0s,(sg2s log(d/sg)+log(esgb/s)c0(sg log(d/sg)+s log(esgb/s)))1/2}142s1/2,

which means that

1ss,1sgsg.

Moreover, we have 164μ2<s132μ2. For any (2s′, 2sg)-sparse β with support set T and group support set G, and v ∈ ker(A), by Cauchy-Schwarz inequality,

vT1+s0v(G)1,22svT2+s02sgvT222svT2<22142μμ(v1+s0v1,2)=12(v1+s0v1,2),

i.e.,

vT1+s0v(G)1,2<vTc1+s0v(Gc)1,2.

Based on Cauchy-Schwarz inequality and the sub-differential of ∥β1 and ∥β1,2, we have

β+v1+s0β+v1,2β1+sgn(β)vT+vTc1+s0(β1,2+jGβ(j)v(j)β(j)2+jGcvj2)β1vT1+vTc1+s0(β1,2v(G)2+v(Gc)2)>β1+s0β1,2.

By Theorem 2,

nc(s log(esgb/s)+sg log(d/sg))cs{ log(esgb2s)+12s0 log(s0d/s)}cs log(esgb2s).

Thus

ncs( log(esgb2s)+sgs log(cssgd log(esgb/s)n))>c64μ2(14 log(esgb/s)+sgs log(cssgd log(esgb/s)n))n

provided that c=min{18,c,c256}, contradiction! This means that (59) holds if (57) is true.

Therefore, we have finished the proof of Theorem 3. □

F. Proof of Theorem 4

Let $\lambda = C\sigma\sqrt{\dfrac{\big(s \log(e s_g b) + s_g \log(d/s_g)\big)\, n}{s}}$ and $\lambda_g = \sqrt{s/s_g}\,\lambda$. By (86) in Lemma 5 and (101), one has

(H110λ(Xε),2110λg)(1jd,H110λ(X(j)ε)2110λg,ε25nσ2)+(ε25nσ2)j=1d(H110λ(X(j)ε)2|110λg|ε25nσ2)+(ε25nσ2)d exp(Cs log(esgb)+sg log(d/sg)sg)+en=exp( log(sg)+log(d/sg)Cs log(esgb)+sg log(d/sg)sg)+enexp(Cs log(esgb)+sg log(d/sg)sg)+en. (62)

By the definition of β^ and KKT condition, we have

X(yXβ^)+λz1+λgz2=0,

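where

$$\begin{cases} (z_1)_i = \mathrm{sgn}(\hat\beta_i), & \hat\beta_i \ne 0; \\ |(z_1)_i| \le 1, & \hat\beta_i = 0; \end{cases} \qquad \begin{cases} (z_2)_{(j)} = \dfrac{\hat\beta_{(j)}}{\|\hat\beta_{(j)}\|_2}, & \hat\beta_{(j)} \ne 0; \\ \|(z_2)_{(j)}\|_2 \le 1, & \hat\beta_{(j)} = 0. \end{cases}$$

Therefore,

$$\big\| H_\lambda\big( X^\top (X\hat\beta - y) \big) \big\|_{\infty,2} \le \lambda_g.$$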
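The thresholded optimality condition above can be observed numerically with a small proximal-gradient solver. The sketch below is an illustration under stated assumptions rather than the solver used in the paper: it minimizes $\tfrac12\|y - X\beta\|_2^2 + \lambda\|\beta\|_1 + \lambda_g\|\beta\|_{1,2}$ (the factor $\tfrac12$ is a normalization chosen here so that the gradient is exactly $X^\top(X\beta - y)$), it takes $H_\lambda$ to be entrywise soft-thresholding, and all function names are hypothetical.

```python
import numpy as np

def soft(x, a):
    """Entrywise soft-thresholding H_a."""
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def prox_sgl(z, groups, a, a_g):
    """Prox of a*||.||_1 + a_g*||.||_{1,2}: soft-threshold, then shrink each group."""
    w = soft(z, a)
    out = np.zeros_like(w)
    for g in groups:
        nrm = np.linalg.norm(w[g])
        if nrm > a_g:
            out[g] = (1.0 - a_g / nrm) * w[g]
    return out

def sgl_prox_grad(X, y, groups, lam, lam_g, iters=5000):
    """Proximal gradient descent with step 1 / ||X||_2^2."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ beta - y)
        beta = prox_sgl(beta - step * grad, groups, step * lam, step * lam_g)
    return beta

def kkt_gap(X, y, beta, groups, lam, lam_g):
    """max_j ||H_lam(X_(j)^T (X beta - y))||_2 - lam_g; nonpositive at a minimizer."""
    r = X.T @ (X @ beta - y)
    return max(np.linalg.norm(soft(r[g], lam)) for g in groups) - lam_g
```

At an exact minimizer the gap is at most zero, with equality attained on groups where $\hat\beta_{(j)} \ne 0$, so a value near zero up to numerical error indicates that the iteration has converged.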

(62), Lemma 8 Part 1 and the previous inequality together imply that

(H(1+110)λ(XXh),2(1+110)λg)1exp(Cs log(esgb)+sg log(d/sg)sg)en, (63)

where h=β^β*. By the definition of β^, we have

yXβ^22+λβ^1+λgβ^1,2yXβ*22+λβ*1+λgβ*1,2.

(32) and the previous inequality show that

Xh22+λhTc1+λgh(Gc)1,22Xh,ελsgn(βT*)hTλgjGβT,(j)*h(j)βT,(j)*2+2λβTc*1+2λgβTc*1,2. (64)

First, we consider 〈Xh, ε〉. Denote P=XT(XTXT)1XT, since Xh=XThT+XTchTc and (InP)XT = 0,

|Xh,ε||PXh,ε|+|(InP)Xh,ε|=|XTXh,(XTXT)1XTε|+|(InP)XTchTc,ε|=|XTXh,(XTXT)1XTε|+|XTchTc,(InP)ε|. (65)

Therefore, to give an upper bound of |〈Xh, ε〉|, we only need to bound |XTXh,(XTXT)1XTε| and |XTchTc,(InP)ε|, respectively. By Part 1 of Lemma 7 and also notice that cminσmin (Σ) ≤ σmax (Σ) ≤ Cmax,

((1nXTXT)12cmin)(1nXTXTΣT,Tcmin2)(1nXTXTΣT,T1Iscmin2Cmax)2 exp(Cscn)2 exp(cn). (66)

(66), Lemma 9 and Cauchy-Schwarz inequality together imply that with probability at least 1exp(Cs log(esgb)+sg log(d/sg)s)2 exp(cn),

(XTXT)1XTε12cminsnXTε22cminsnXTεCsnns log(esgb)+sg log(d/sg)sσ2Csnλ,
(XTXT)1XTε1,2sg(XTXT)1XTε22cminsgnXTε2Cssgnλ.

Combine Lemma 8 Part 2, (63) and the previous two inequalities together, with probability at least 12 exp(Cs log(esgb)+sg log(d/sg)s)3ecn,

|XTXh,(XTXT)1XTε|1110λ(XTXT)1XTε1+1110λg(XTXT)1XTε1,2Csnλ2. (67)

Similarly to the proof of (62), also notice that ∥(InP)ε2 ≤ ∥ε2 and X(Gc) is independent of InP, we have

(H110λ(X(Gc)(InP)ε),2110λg)(jGc,H110λ(X(j)(InP)ε)2110λg|(InP)ε25nσ2)+(ε25nσ2)exp(Cs log(esgb)+sg log(d/sg)sg)+en.

By Lemma 8 Part 2 and (62), with probability at least 1exp(Cs log(esgb)+sg log(d/sg)sg)eη,

|X(Gc)h(Gc),(InP)ε|=|h(Gc),X(Gc)(InP)ε|110λh(Gc)1+110λgh(Gc)1,2.

Notice that XTc\(Gc) and InP are independent and |Tc\(Gc)| ≤ |G| ≤ sgb, by Lemma 9, with probability at least 1exp(Cs log(esgb)+sg log(d/sg)s)en,

|XTc\(Gc)hTc\(Gc),(InP)ε|hTc\(Gc)1XTc\(Gc)(InP)εCns log(esgb)+sg log(d/sg)sσ2hTc\(Gc)1110λhTc\(Gc)1.

Combine the previous two inequalities together, we have

|XTchTc,(InP)ε||X(Gc)h(Gc),(InP)ε|+|XTc\(Gc)hTc\(Gc),(InP)ε|110λhTc1+110λgh(Gc)1,2 (68)

with probability 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn. Combine (65), (67) and (68) together, we know that with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn,

|Xh,ε|Csnλ2+110λhTc1+110λgh(Gc)1,2. (69)

Moreover, by the proof of Theorem 1, with probability at least 1 − C exp(−cn/s), there exists an approximate dual certificate up in the row span of X satisfying (18), and vTsgn(βT*)218, where v is defined in (29). Similarly to (33), we have

sgn(βT*)hT+jGs0βT,(j)*h(j)βT,(j)*2vTsgn(βT*)2hT2hTc1/2s0h(Gc)1,2/2+h,ucmin8hT2hTc1/2s0h(Gc)1,2/2+h,u.

By Lemma 10, with probability at least 1 − Cecn/s, u = X w with w2Cs/n. Therefore, with probability at least 1 − Cecn/s,

|h,u|=|Xh,w|Xh2w2Cs/nXh2.

The two previous inequalities together imply that

sgn(βT*)hT+jGs0βT,(j)*h(j)βT,(j)*2cmin8hT2hTc1/2s0h(Gc)1,2/2Cs/nXh2 (70)

with probability at least 1 − Cecn/s.

Combine (64), (69) and (70) together, with probability at least 1CeCs log(esgb)+sg log(d/sg)sCecn/s,

Xh22+310λhTc1+310λgh(Gc)1,2Csnλ2+cmin8λhT2+Cs/nλXh2+2λβTc*1+2λgβTc*1,2. (71)

By (42), (63) and (66), with probability at least 1exp(Cs log(esgb)+sg log(d/sg)sg)Cecn,

hT2(XTXT)1XTXThT22cmin nXTXhXTXTchTc22cmin n(XTXh2+XTXTchTc2)2cmin n(H1110λ(XTXh)2+1110sλ+niTcXTXi/n2|hi|)2cmin n(sgH1110λ(XTXh),2+1110sλ+nmaxiTcXTXi/n2hTc1)2cmin n(sg1110λg+1110sλ+n2hTc1)5cminsnλ+1cminhTc1. (72)

The fourth inequality comes from x2Hα(x)2+sα for xs; the fifth inequality holds since XTXh0,2sg. (71) and (72) together imply that

Xh22+740λhTc1+310λgh(Gc)1,2Csnλ2+Cs/nλXh2+2λβTc*1+2λgβTc*1,2

with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn/s. Also notice that

Cs/nλXh2Xh22+Csnλ2,

with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn/s,

hTc1+s0h(Gc)1,2C(snλ+βTc*1+s0βTc*1,2). (73)

From the proof of Lemma 1, we know that (36) and (41) hold with probability at least 1 − 2ecn. By Lemma 8 Part 2 and (63), with probability at least 1exp(Cs log(esgb)+sg log(d/sg)sg)en,

|XT˜hT˜,Xh|=|hT˜,XT˜Xh|1110(λhT˜1+λghT˜1,2)1110(λ3shT˜2+λg2sghT˜2)4λshT˜2. (74)

The second inequality is due to hT˜03s, hT˜0,22sg and Cauchy-Schwarz inequality.

Combine (36), (41), (73) and (74) together, with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn/s, we have

cmin2hT˜221n4λshT˜2+3CmaxhT˜2(2s1/2hTc1+sg1/2h(Gc)1,2)1n4λshT˜2+3CmaxhT˜2Cs(snλ+βTc*1+s0βTc*1,2)C(snλ+1sβTc*1+1sgβTc*1,2)hT˜2.

Therefore, with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s)Cecn/s,

hT˜2C(snλ+1sβTc*1+1sgβTc*1,2). (75)

By (39), (40), (73) and the previous inequality, also notice that ecn/seCs log(esgb)+sg log(d/sg)s, with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s),

h2hT˜2+i2hTi2+j2hRj2hT˜2+2s1/2hTc2+sg1/2h(Gc)1,2C(snλ+1sβTc*1+1sgβTc*1,2), (76)

i.e., with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s),

h2C(σ2(sg log(d/sg)+s log(esgb))n+1sβTc*1+1sgβTc*1,2).

Moreover, if β* is (s, sg)-sparse, then βTc*1=βTc*1,2=0. Therefore, with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s),

h22Cσ2(sg log(d/sg)+s log(esgb))n.

G. Proof of Theorem 5

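First, we consider the case that $d \ge 3s_g$ and $b \ge 3s/s_g$. Let $\omega^{(1)}, \ldots, \omega^{(N)}$ be vectors drawn uniformly at random from

$$A = \Big\{ \omega \in \{0,1\}^{db} : \sum_{j} 1_{\{\omega_{(j)} \ne 0\}} = s_g,\ \|\omega_{(j)}\|_0 = \lfloor s/s_g \rfloor \text{ if } \omega_{(j)} \ne 0 \Big\}.$$

Denote $\Omega^{(i)} = \{j : \omega_j^{(i)} \ne 0\}$, $\Omega^{(i)}_{(k)} = \{j : j \in (k),\ \omega_j^{(i)} \ne 0\}$ and $\beta^{(i)} = \tau\omega^{(i)}$, for all $1 \le i \le N$, $1 \le k \le d$, where $\tau$ is a parameter that will be specified later. Obviously, $\|\beta^{(i)}\|_0 = s_g \lfloor s/s_g \rfloor \le s$, and therefore $\|\beta^{(i)} - \beta^{(j)}\|_2^2 \le 2 s_g \lfloor s/s_g \rfloor \tau^2 \le 2s\tau^2$.

Moreover, if $|\Omega^{(i)} \cap \Omega^{(j)}| \ge 8 s_g \lfloor s/s_g \rfloor/9$, then we must have

$$\Big| \Big\{ k : \omega^{(i)}_{(k)}, \omega^{(j)}_{(k)} \ne 0,\ |\Omega^{(i)}_{(k)} \cap \Omega^{(j)}_{(k)}| \ge 2\lfloor s/s_g \rfloor/3 \Big\} \Big| \ge 2s_g/3,$$

since otherwise $|\Omega^{(i)} \cap \Omega^{(j)}| < \frac{2s_g}{3}\lfloor s/s_g \rfloor + \frac{s_g}{3} \cdot 2\lfloor s/s_g \rfloor/3 = 8 s_g \lfloor s/s_g \rfloor/9$, which is a contradiction.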

Therefore,

(β(i)β(j)222sgs/sgτ2/9)=(|Ω(i)Ω(j)|8sgs/sg/9)(|{k|ω(k)(i),ω(k)(j)0,|Ω(k)(i)Ω(k)(j)|2s/sg/3}|2sg/3)l=sg/3sg(sgl)[t=2s/sg/3s/sg(s/sgt)(bs/sgs/sgt]l(bs/sg)sgl(dlsgl)/(dsg)(bs/sg)sg=l=2sg/3sg(sgl)(dlsgl)(dsg)[t=2s/sg/3s/sg(s/sgt)(bs/sgs/sgt)(bs/sg)]l. (77)

Note that

(dlsgl)(dsg)=(dl)(dsg+1)(sgl)!d(d1)(dsg+1)sg!=sg(sg1)(sgl+1)d(d1)(dl+1)(sgd)l,

The inequality holds since sgidisgd, for all 1 ≤ isg. Similarly, for 1 ≤ t ≤ ⌊s/sg⌋,

(bs/sgs/sgt)(bs/sg)(bts/sgt)(bs/sg)(s/sgb)t.

Combine (77) and the previous two inequalities together, we have

(β(i)β(j)222sgs/sgτ2/9)l=2sg/3sg(sgl)(sgd)l[t=2s/sg/3s/sg(s/sgt)(s/sgb)t]ll=2sg/3sg(sgl)(sgd)l[t=2s/sg/3s/sg(s/sgt)(s/sgb)2s/sg/3]ll=2sg/3sg(sgl)(sgd)l[2s/sg(s/sgb)2s/sg/3]ll=2sg/3sg(sgl)(sgd)2sg/3[(22s/sgb)2s/sg/3]2sg/3(22sgd)2sg/3(22s/sgb)2s/9. (78)

Set N=[(d22sg)sg/3(b22s/sg)s/9], then

(1ijN,β(i)β(j)22>2sgs/sgτ2/9)1N(N1)2(22sgd)2sg/3(22s/sgb)2s/9>0.

i.e., the probability that β(1), …, β(N); Ω(1), ⋯, Ω(N) satisfy

s9τ2<2sgs/sgτ2/9<min ijβ(i)β(j)222sτ2, (79)
|Ω(i)Ω(j)|<8sgs/sg/9,1i<jN (80)

is positive. For convenience, we fix β(1), …, β(N) to be the vectors satisfying (79).

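Denote $y^{(i)} = X\beta^{(i)} + \varepsilon$ for all $1 \le i \le N$. We consider the Kullback-Leibler divergence between different distribution pairs:

$$D_{KL}\big( (y^{(i)}, X), (y^{(j)}, X) \big) = \mathbb{E}_{(y^{(j)}, X)} \left[ \log\left( \frac{p(y^{(i)}, X)}{p(y^{(j)}, X)} \right) \right],$$

where $p(y^{(i)}, X)$ is the probability density of $(y^{(i)}, X)$. Conditioning on $X$, we have

$$\mathbb{E}_{(y^{(j)}, X)} \left[ \log\left( \frac{p(y^{(i)}, X)}{p(y^{(j)}, X)} \right) \,\Big|\, X \right] = \frac{\|X(\beta^{(i)} - \beta^{(j)})\|_2^2}{2\sigma^2}.$$

Thus, for $1 \le i \ne j \le N$,

$$D_{KL}\big( (y^{(i)}, X), (y^{(j)}, X) \big) = \mathbb{E}_X \frac{\|X(\beta^{(i)} - \beta^{(j)})\|_2^2}{2\sigma^2} = \frac{n(\beta^{(i)} - \beta^{(j)})^\top \Sigma (\beta^{(i)} - \beta^{(j)})}{2\sigma^2} \le \frac{3n\|\beta^{(i)} - \beta^{(j)}\|_2^2}{4\sigma^2} \le \frac{3ns\tau^2}{2\sigma^2}. \tag{81}$$

In the first inequality, we used $\sigma_{\max}(\Sigma) \le \frac{3}{2}$; the second follows from $\|\beta^{(i)} - \beta^{(j)}\|_2^2 \le 2s\tau^2$. By the generalized Fano's Lemma,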

infβ^supβFs,sgEβ^β2sτ2/92(13nsτ22σ2+log 2 log N).

Since  log Nsg log(dsg)+s log(esgbs), by setting τ=cσ2(sg log(dsg)+s log(esgbs))ns, we have

infβ^supβFs,sgEβ^β22(infβ^supβFs,sgEβ^β2)2cσ2(sg log(d/sg)+s log(esgb/s))n.

If d < 3sg or b < 3s/sg, let sg=[sg/3]1sg/5, s=[s/15]sg, then d3sg and b3s/sg. Similarly to (56), we have

infβ^supβFs,sgEβ^β22infβ^supβFs,sgEβ^β22cσ2(sg log(d/sg)+s log(esgb/s))ncσ2(sg log(d/sg)+s log(esgb/s))n.

H. Proof of Theorem 6

The proof of Theorem 6 relies on the following key lemma, which shows that Σ−1 is in the feasible set of the optimization problem (23) with high probability by choosing appropriate α and γ.

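Lemma 4: By setting $\alpha = C\sqrt{\dfrac{s \log(e s_g b) + s_g \log(d/s_g)}{sn}}$ and $\gamma = \sqrt{\dfrac{s}{s_g}}\,\alpha$ in (23), we have

$$\mathbb{P}\left( \max_{1 \le i \le p} \Big\| H_\alpha\Big( e_i - \frac{1}{n} X^\top X \Sigma^{-1} e_i \Big) \Big\|_{\infty,2} \le \gamma \right) \ge 1 - 4\exp\left( -C\, \frac{s \log(e s_g b) + s_g \log(d/s_g)}{s_g} \right).$$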

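Since $Y = X\beta^* + \varepsilon$, we have

$$\sqrt{n}(\hat\beta^u - \beta^*) = \sqrt{n}\Big( \hat\beta - \beta^* + \frac{1}{n}\hat M X^\top (Y - X\hat\beta) \Big) = \sqrt{n}\Big( I - \frac{1}{n}\hat M X^\top X \Big)(\hat\beta - \beta^*) + \frac{1}{\sqrt{n}}\hat M X^\top \varepsilon.$$

Since $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and $\hat\Sigma = X^\top X/n$, we know that

$$\frac{1}{\sqrt{n}}\hat M X^\top \varepsilon \,\Big|\, X \sim N\big(0, \sigma^2 \hat M \hat\Sigma \hat M^\top\big).$$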
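The bias-plus-noise decomposition above is an exact algebraic identity, which the following Python sketch makes concrete. It is illustrative only: the names `debias` and `decomposition` are hypothetical, and the matrix `M` is an arbitrary stand-in for $\hat M$ (the solution of the optimization problem (23), which is not implemented here).

```python
import numpy as np

def debias(beta_hat, X, y, M):
    """Debiased estimator: beta_hat + (1/n) * M X^T (y - X beta_hat)."""
    n = X.shape[0]
    return beta_hat + M @ (X.T @ (y - X @ beta_hat)) / n

def decomposition(beta_hat, beta_star, X, eps, M):
    """Split sqrt(n) * (beta_u - beta*) into a bias term and a noise term."""
    n, p = X.shape
    Sigma_hat = X.T @ X / n
    bias = np.sqrt(n) * (np.eye(p) - M @ Sigma_hat) @ (beta_hat - beta_star)
    noise = M @ (X.T @ eps) / np.sqrt(n)
    return bias, noise
```

The identity holds for any `M` and any pilot estimate `beta_hat`; the role of (23) is to choose $\hat M$ so that the bias term is uniformly small while the noise term keeps a controlled variance.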

Denote h=β^β*. Since β* is (s, sg)-sparse, by (73), (76) and Cauchy-Schwarz inequality, with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s),

h1hT1+hTc1shT2+hTc1sh2+hTc1Csnλ.
h1,2h(G)1,2+h(Gc)1,2sgh(G)2+h(Gc)1,2sgh2+h(Gc)1,2Cssgnλ.

In addition, Lemma 4 shows that Σ−1 is in the feasible set of (23) with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s). By the definition of M^,

max iHα(eiΣ^M^ei),2=max iHα(eiΣ^m^i),2γ. (82)

Combining these facts, by Lemma 8 Part 2, we must have

(I1nM^XX)(β^β*)=max i|eiΣ^M^ei,h|αh1+γh1,2Csnαλ+Cssgnγλ=C(s log(esgb)+sg log(d/sg))nσ

with probability at least 1C exp(Cs log(esgb)+sg log(d/sg)s). This has finished the proof of (24).

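Next, we consider $\hat m_i^\top \hat\Sigma \hat m_i$. By (82) and Lemma 8 Part 2, we have

$$1 - \langle e_i, \hat\Sigma \hat m_i \rangle = \langle e_i, e_i - \hat\Sigma \hat m_i \rangle \le \alpha\|e_i\|_1 + \gamma\|e_i\|_{1,2} = \alpha + \gamma.$$

Therefore, for any $c \ge 0$,

$$\hat m_i^\top \hat\Sigma \hat m_i \ge \hat m_i^\top \hat\Sigma \hat m_i + c(1 - \alpha - \gamma) - c\langle e_i, \hat\Sigma \hat m_i \rangle \ge \min_{m} \big\{ m^\top \hat\Sigma m + c(1 - \alpha - \gamma) - c\langle e_i, \hat\Sigma m \rangle \big\}.$$

Since $m = c e_i/2$ achieves the minimum of the right-hand side, we have

$$\hat m_i^\top \hat\Sigma \hat m_i \ge c(1 - \alpha - \gamma) - \frac{c^2}{4}\hat\Sigma_{i,i}.$$

If $\hat\Sigma_{i,i} > 0$ for all $1 \le i \le p$, by setting $c = 2(1 - \alpha - \gamma)/\hat\Sigma_{i,i}$, we have

$$\hat m_i^\top \hat\Sigma \hat m_i \ge \frac{(1 - \alpha - \gamma)^2}{\hat\Sigma_{i,i}}, \qquad 1 \le i \le p. \tag{83}$$

Moreover, by Lemma 6 Part 2 with $u = v = e_i$, we have

$$\mathbb{P}\Big( |\hat\Sigma_{i,i} - \Sigma_{i,i}| \ge \frac{c_{\min}}{2} \Big) \le 2\exp(-cn).$$

By the union bound,

$$\mathbb{P}\Big( \exists\, 1 \le i \le p,\ |\hat\Sigma_{i,i} - \Sigma_{i,i}| \ge \frac{c_{\min}}{2} \Big) \le \sum_{i=1}^{p} \mathbb{P}\Big( |\hat\Sigma_{i,i} - \Sigma_{i,i}| \ge \frac{c_{\min}}{2} \Big) \le db \cdot 2\exp(-cn) \le 2\exp(-cn).$$

Therefore, with probability at least $1 - 2\exp(-cn)$,

$$\frac{c_{\min}}{2} \le \hat\Sigma_{i,i} \le C_{\max} + \frac{c_{\min}}{2}, \qquad 1 \le i \le p.$$

(83) and the previous inequality together imply that with probability at least $1 - 2\exp(-cn)$,

$$\hat m_i^\top \hat\Sigma \hat m_i \ge \frac{1}{2C_{\max}}, \qquad 1 \le i \le p.$$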

(24) and the previous inequality together imply (25). □

Fig. 2. Average estimation error in the noisy case

Acknowledgments

The research of Tony Cai was supported in part by NSF grants DMS-1712735 and DMS-2015259 and NIH grants R01-GM129781 and R01-GM123056. The research of Anru R. Zhang and Yuchen Zhou was supported in part by NSF Grants CAREER-2203741 and DMS-1811868 and NIH grant R01-GM131399-01.

Biographies

T. Tony Cai received the Ph.D. degree from Cornell University, Ithaca, NY, in 1996. His research interests include high-dimensional statistics, machine learning, large-scale inference, nonparametric function estimation, functional data analysis, and statistical decision theory. He is the Daniel H. Silberberg Professor of Statistics and Data Science at the Wharton School of the University of Pennsylvania, Philadelphia, PA, USA. Dr. Cai is the recipient of the 2008 COPSS Presidents' Award and a fellow of the Institute of Mathematical Statistics. He is a past editor of the Annals of Statistics.

Anru R. Zhang is the Eugene Anson Stead, Jr. M.D. Associate Professor in the Department of Biostatistics & Bioinformatics and Associate Professor in the Departments of Computer Science, Mathematics, and Statistical Science at Duke University. He was an assistant professor of statistics at the University of Wisconsin-Madison in 2015–2021. He obtained his bachelor’s degree from Peking University in 2010 and his Ph.D. from the University of Pennsylvania in 2015. His work focuses on high-dimensional statistical inference, non-convex optimization, statistical tensor analysis, computational complexity, and applications in genomics, microbiome, electronic health records, and computational imaging. He received the IMS Tweedie Award (2022), ASA Gottfried E. Noether Junior Award (2021), Bernoulli Society New Researcher Award (2021), ICSA Outstanding Young Researcher Award (2021), and NSF CAREER Award (2020).

Yuchen Zhou is a postdoctoral researcher in the Department of Statistics and Data Science, The Wharton School, University of Pennsylvania. He received the B.E. degree from Peking University in 2016 and the Ph.D. degree in statistics from the University of Wisconsin-Madison in 2021. His research interests include high-dimensional statistical inference, tensor data analysis, reinforcement learning and statistical learning theory.

Appendix

We collect all additional technical lemmas and their proofs in this section.

Lemma 5 (Bernstein-type Inequality for Soft-thresholded Sub-Gaussian Vectors):

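Suppose the rows of $X \in \mathbb{R}^{n \times p}$ are independent sub-Gaussian vectors satisfying Assumption 1, $w \in \mathbb{R}^n$ is a fixed vector, and $\Omega$ is a subset of $\{1, \ldots, p\}$ with $|\Omega| = r$. Then

$$\mathbb{P}\left( \Big\| \sum_{k=1}^{n} w_k X_{k,\Omega} \Big\|_2 \ge \sqrt{C_{\max}}\, \kappa \|w\|_2 \big( \sqrt{r} + \sqrt{2t} \big) \right) \le \exp(-t). \tag{84}$$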

For any fixed vector wn and fixed index subset Ω ⊆ {1, …, p} with |Ω| = r,

(H(δw2)(wXΩ)2tw2)(r(t/δ)2r)exp((t/(κCmax)(t/δ)r)+2/2)+(r(t/δ)2)exp((t/(κCmax)(t/δ)2)+2/2). (85)

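In particular, for any $b \ge r$, if $\bar\lambda = C\|w\|_2 \sqrt{\dfrac{s \log(e s_g b) + s_g \log(d/s_g)}{s}}$ and $\bar\lambda_g = \sqrt{s/s_g}\,\bar\lambda$, we have

$$\mathbb{P}\Big( \big\| H_{\bar\lambda}(w^\top X_\Omega) \big\|_2 \ge \bar\lambda_g \Big) \le \exp\left( -C\, \frac{s \log(e s_g b) + s_g \log(d/s_g)}{s_g} \right). \tag{86}$$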

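Proof of Lemma 5. We only need to focus on the case where $\|w\|_2 = 1$. Let $W_\Omega = X_\Omega \Sigma_{\Omega,\Omega}^{-1/2}$; immediately we know that the rows $W_{1,\Omega}, \ldots, W_{n,\Omega}$ are isotropic sub-Gaussian. Then for any fixed $w$, $w^\top W_\Omega$ is also an isotropic sub-Gaussian vector such that for any $\alpha \in \mathbb{R}^r$,

$$\mathbb{E}\exp(w^\top W_\Omega \alpha) = \mathbb{E}\exp\big( w^\top X_\Omega \Sigma_{\Omega,\Omega}^{-1/2} \alpha \big) = \mathbb{E}\exp\big( w^\top X \Sigma^{-1/2} (\Sigma^{1/2})_{\cdot,\Omega} \Sigma_{\Omega,\Omega}^{-1/2} \alpha \big) \le \exp\big( \kappa^2 \| (\Sigma^{1/2})_{\cdot,\Omega} \Sigma_{\Omega,\Omega}^{-1/2} \alpha \|_2^2/2 \big) = \exp\big( \kappa^2 \|\alpha\|_2^2/2 \big).$$

The last equation holds since $(\Sigma^{1/2})_{\Omega,\cdot} (\Sigma^{1/2})_{\cdot,\Omega} = (\Sigma^{1/2}\Sigma^{1/2})_{\Omega,\Omega} = \Sigma_{\Omega,\Omega}$.

By the tail inequality for sub-Gaussian quadratic forms ([60, Theorem 2.1]),

$$\mathbb{P}\Big( \|w^\top W_\Omega\|_2^2 \ge \kappa^2 \big( r + 2\sqrt{rt} + 2t \big) \Big) \le \exp(-t).$$

Since $r + 2\sqrt{rt} + 2t \le (\sqrt{r} + \sqrt{2t})^2$, taking the square root in the previous inequality gives

$$\mathbb{P}\Big( \|w^\top W_\Omega\|_2 \ge \kappa \|w\|_2 \big( \sqrt{r} + \sqrt{2t} \big) \Big) \le \exp(-t).$$

Also note that

$$\|w^\top X_\Omega\|_2 = \big\| w^\top W_\Omega \Sigma_{\Omega,\Omega}^{1/2} \big\|_2 \le \big\| \Sigma_{\Omega,\Omega}^{1/2} \big\| \cdot \|w^\top W_\Omega\|_2 \le \big\| \Sigma^{1/2} \big\| \cdot \|w^\top W_\Omega\|_2 \le \sqrt{C_{\max}}\, \|w^\top W_\Omega\|_2,$$

and we obtain (84).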
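As a sanity check of (84), one can compare the empirical tail frequency with $\exp(-t)$ in the simplest special case. The sketch below assumes i.i.d. standard Gaussian entries, for which one may take $\kappa = 1$ and $C_{\max} = 1$ (an assumption about the constants in Assumption 1, which is not restated in this section); the function name `tail_frequency` is hypothetical.

```python
import numpy as np

def tail_frequency(n, r, t, reps=5000, seed=0):
    """Empirical P(||w^T X_Omega||_2 >= sqrt(r) + sqrt(2t)) for unit-norm w, Gaussian X."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)                         # fixed unit vector
    X = rng.standard_normal((reps, n, r))          # reps independent copies of X_Omega
    proj = np.einsum('k,mkr->mr', w, X)            # w^T X_Omega for each copy
    norms = np.linalg.norm(proj, axis=1)
    return float(np.mean(norms >= np.sqrt(r) + np.sqrt(2.0 * t)))
```

For example, `tail_frequency(20, 10, 1.0)` can be compared against `np.exp(-1.0)`; in the Gaussian case the bound is far from tight, so the empirical frequency is much smaller than the right-hand side.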

For the second part of proof, note that

(Hδ(wXΩ)t)(ΛΩ, such that all entries of |wXΛ|δ and wXΛ2t)(ΛΩ,|Λ|δt,wXΛ2t)+(ΛΩ,|Λ|δ>t, all entries of |wXΛ|δ)ΛΩ|Λ|=(t/δ)2r(wXΛ2t)+ΛΩ|Λ|=(t/δ)2(wXΛ2t).

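By the first part of this lemma,

$$\mathbb{P}\left(\|w^\top X_\Lambda\|_2 \ge \sqrt{C_{\max}}\,\kappa\|w\|_2\,t\right) \le \exp\left(-\left(t-\sqrt{|\Lambda|}\right)_+^2/2\right).$$

Plugging this into the previous inequality, one has

$$\mathbb{P}\left(\|H_\delta(w^\top X_\Omega)\|_2 \ge t\right) \le \binom{r}{\lfloor (t/\delta)^2\rfloor \wedge r}\exp\left(-\left(\frac{t}{\kappa\sqrt{C_{\max}}} - (t/\delta)\wedge\sqrt{r}\right)_+^2\Big/2\right) + \binom{r}{\lceil (t/\delta)^2\rceil}\exp\left(-\left(\frac{t}{\kappa\sqrt{C_{\max}}} - \sqrt{\lceil (t/\delta)^2\rceil}\right)_+^2\Big/2\right).$$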

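Specifically, if $\delta = C\sqrt{\frac{s\log(es_g b)+s_g\log(d/s_g)}{s}}$ and $t = \sqrt{s/s_g}\,\delta$,

$$\frac{t}{\kappa\sqrt{C_{\max}}} - \sqrt{\lceil (t/\delta)^2\rceil} \ge C\sqrt{\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}} - \sqrt{\frac{2s}{s_g}} \ge C\sqrt{\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}}.$$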

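Therefore, (85) shows that

$$\begin{aligned}
\mathbb{P}\left(\|H_{\bar\lambda}(w^\top X_\Omega)\|_2 \ge \bar\lambda_g\right)
&\le r^{\lfloor (t/\delta)^2\rfloor}\exp\left(-C\,\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}\right) + r^{\lceil (t/\delta)^2\rceil}\exp\left(-C\,\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}\right)\\
&\le 2r^{2s/s_g}\exp\left(-C\,\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}\right)\\
&\le \exp\left(\log 2 + \frac{2s\log(eb)}{s_g} - C\,\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}\right)\\
&\le \exp\left(-C\,\frac{s\log(es_g b)+s_g\log(d/s_g)}{s_g}\right).
\end{aligned}$$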

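Lemma 6 (sub-Gaussian quadratic form concentrations): Suppose $Z \in \mathbb{R}^p$ is a sub-Gaussian vector satisfying Assumption 1.

  1. For any fixed $u, v \in \mathbb{R}^p$, u, v ≠ 0, $u^\top ZZ^\top v$ is sub-exponential such that for every t > 0,
    $$\mathbb{P}\left(\left|u^\top ZZ^\top v - \mathbb{E}u^\top ZZ^\top v\right| \ge t\|u\|_2\|v\|_2\right) \le C\exp\left(-ct/\kappa^2\right).$$ (87)
  2. In addition, suppose $X = [X_1, \dots, X_n]^\top \in \mathbb{R}^{n\times p}$ is a random matrix with independent sub-Gaussian rows satisfying Assumption 1. Then
    $$\mathbb{P}\left(\left|\frac{1}{n}\sum_{k=1}^n u^\top X_kX_k^\top v - u^\top\Sigma v\right| \ge t\|u\|_2\|v\|_2\right) \le 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$ (88)
  3. More generally, for any fixed matrix $U \in \mathbb{R}^{p\times r}$, the following concentration inequality in spectral norm holds,
    $$\mathbb{P}\left(\left\|\frac{1}{n}\sum_{k=1}^n U^\top X_kX_k^\top v - U^\top\Sigma v\right\|_2 \ge t\|U\|\,\|v\|_2\right) \le 2\exp\left(Cr - cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$ (89)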

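Proof of Lemma 6. Since we can rescale u and v without essentially changing the problem, we assume without loss of generality that $\|u\|_2 = \|v\|_2 = 1$. Let $A = uv^\top$; then $u^\top ZZ^\top v = Z^\top uv^\top Z = Z^\top AZ$. By Assumption 1, $\mathbb{E}Z = 0$ and $\|\langle Z, e_i\rangle\|_{\psi_2} \le C\kappa$. By the Hanson-Wright inequality ([61, Theorem 1.1]),

$$\begin{aligned}
\mathbb{P}\left(\left|u^\top ZZ^\top v - \mathbb{E}u^\top ZZ^\top v\right| \ge t\right) &= \mathbb{P}\left(\left|Z^\top AZ - \mathbb{E}Z^\top AZ\right| \ge t\right)\\
&\le 2\exp\left[-c\min\left(\frac{t^2}{\kappa^4\|A\|_{HS}^2}, \frac{t}{\kappa^2\|A\|}\right)\right] = 2\exp\left[-c\min\left(\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right)\right],
\end{aligned}$$

where

$$\begin{aligned}
\|A\|_{HS} &= \Big(\sum_{i,j}|a_{i,j}|^2\Big)^{1/2} = \Big(\sum_{i,j}|u_iv_j|^2\Big)^{1/2} = \|u\|_2\|v\|_2 = 1,\\
\|A\| &= \max_{\|x\|_2\le 1}\|Ax\|_2 = \max_{\|x\|_2\le 1}\|uv^\top x\|_2 = \|u\|_2\max_{\|x\|_2\le 1}|v^\top x| = \|u\|_2\|v\|_2 = 1.
\end{aligned}$$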

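Therefore, for every $t \ge \kappa^2$,

$$\mathbb{P}\left(\left|u^\top ZZ^\top v - \mathbb{E}u^\top ZZ^\top v\right| \ge t\right) \le 2\exp\left(-ct/\kappa^2\right).$$

Thus, there exists a constant c < log 2 such that, for every t ≥ 0,

$$\mathbb{P}\left(\left|u^\top ZZ^\top v - \mathbb{E}u^\top ZZ^\top v\right| \ge t\right) \le 2\exp\left(-ct/\kappa^2\right).$$

Indeed, for $t < \kappa^2$ the bound is trivial, because its right-hand side is larger than $2e^{-c} > 1$.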

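Notice that $\mathbb{E}u^\top X_kX_k^\top v = u^\top\Sigma v$ for all 1 ≤ k ≤ n. By the Bernstein-type concentration inequality (cf. [62, Proposition 5.16]),

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{k=1}^n u^\top X_kX_k^\top v - u^\top\Sigma v\right| \ge t\right) \le 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$

This finishes the proof of (88).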
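The Hanson-Wright step above relies on the fact that the rank-one matrix A = uv^⊤ has Hilbert-Schmidt norm and operator norm both equal to ‖u‖₂‖v‖₂. The following is a minimal numerical sanity check of that identity, not part of the proof; the dimension 6 and the random vectors are arbitrary choices made for illustration.

```python
import numpy as np

# Hypothetical example vectors; any nonzero u, v would serve equally well.
rng = np.random.default_rng(0)
u = rng.standard_normal(6)
v = rng.standard_normal(6)

A = np.outer(u, v)                    # rank-one matrix A = u v^T
hs_norm = np.linalg.norm(A, "fro")    # Hilbert-Schmidt (Frobenius) norm
op_norm = np.linalg.norm(A, 2)        # operator (spectral) norm
target = np.linalg.norm(u) * np.linalg.norm(v)

# The three quantities should agree up to floating-point rounding.
print(hs_norm, op_norm, target)
```

After normalizing u and v to unit length, both norms equal 1, which is why the two arguments of the minimum in the Hanson-Wright bound simplify to t²/κ⁴ and t/κ².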

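Finally, we consider (89), which follows from an ε-net argument together with (88). For any $w \in \mathbb{R}^r$ with $\|w\|_2 = 1$, setting u = Uw in (88) gives

$$\mathbb{P}\left(\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \ge \frac{t}{2}\|Uw\|_2\|v\|_2\right) \le 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$

By [62, Lemma 5.3], we can find a $\frac{1}{2}$-net $\mathcal{N}_{1/2}$ of $S^{r-1} = \{x : x \in \mathbb{R}^r, \|x\|_2 = 1\}$ with $|\mathcal{N}_{1/2}| \le 5^r$. By the union bound,

$$\mathbb{P}\left(\exists\, w \in \mathcal{N}_{1/2},\ \left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \ge \frac{t}{2}\|Uw\|_2\|v\|_2\right) \le 5^r\cdot 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$ (90)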

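For any $g \in \mathbb{R}^r$, g ≠ 0, set $x = \frac{g}{\|g\|_2} \in \arg\max_{w\in\mathbb{R}^r, \|w\|_2=1}|w^\top g|$; we can find $y \in \mathcal{N}_{1/2}$ such that $\|x - y\|_2 \le \frac{1}{2}$. By the triangle inequality,

$$\|g\|_2 - |y^\top g| = |x^\top g| - |y^\top g| \le |x^\top g - y^\top g| \le \|x - y\|_2\|g\|_2 \le \frac{1}{2}\|g\|_2.$$

Therefore,

$$\sup_{w\in\mathbb{R}^r, \|w\|_2=1}\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \le 2\sup_{w\in\mathcal{N}_{1/2}}\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right|.$$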

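Combining (90) with the previous inequality, and noting that $\|U\| = \sup_{w\in\mathbb{R}^r, \|w\|_2=1}\|Uw\|_2$, we have

$$\begin{aligned}
&\mathbb{P}\left(\sup_{w\in\mathbb{R}^r, \|w\|_2=1}\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \ge t\|U\|\,\|v\|_2\right)\\
&\quad\le \mathbb{P}\left(\sup_{w\in\mathcal{N}_{1/2}}\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \ge \frac{t}{2}\|U\|\,\|v\|_2\right)\\
&\quad\le \mathbb{P}\left(\exists\, w\in\mathcal{N}_{1/2},\ \left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right| \ge \frac{t}{2}\|Uw\|_2\|v\|_2\right)\\
&\quad\le 5^r\cdot 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).
\end{aligned}$$ (91)

Finally, since

$$\left\|\frac{1}{n}\sum_{k=1}^n U^\top X_kX_k^\top v - U^\top\Sigma v\right\|_2 = \sup_{w\in\mathbb{R}^r, \|w\|_2=1}\left|\frac{1}{n}\sum_{k=1}^n w^\top U^\top X_kX_k^\top v - w^\top U^\top\Sigma v\right|,$$

we have proved (89). □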

We collect the random matrix properties of X in the following lemma. These properties will be extensively used in the main content of the paper.

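Lemma 7: Suppose $X = [X_1, \dots, X_n]^\top \in \mathbb{R}^{n\times p}$ is a random matrix with independent sub-Gaussian rows satisfying Assumption 1.

  1. Suppose T ⊆ {1, …, p} has cardinality s. Then,
    $$\mathbb{P}\left(\left\|\frac{1}{n}X_T^\top X_T\Sigma_{T,T}^{-1} - I_s\right\| \ge t\right) \le 2\exp\left(Cs - cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right);$$ (92)
  2. For any fixed vector $\alpha \in \mathbb{R}^s$, δ > 0, fixed index subset $\Omega \subseteq T^c$ satisfying |Ω| = r, and $t \ge \delta \ge C\left(\max_{i\in T^c}\|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2\right)\|\alpha\|_2$,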
    (Hδ(αXTXΩ/n)2t)(r(t/δ)2r)exp(C(t/δ)2rcn min{t2κ4α22,tκ2α2})+(r(t/δ)2)+exp(C(t/δ)2cn min{t2κ4α22,tκ2α2}). (93)

Here, $H_\lambda(\cdot)$ is the soft-thresholding estimator at level λ.

Proof of Lemma 7.

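  1. The first statement follows from an ε-net argument. Denote $W_T = X_T\Sigma_{T,T}^{-1/2}$; then the rows of $W_T$ are independent isotropic sub-Gaussian vectors. For any fixed vector $x \in S^{s-1} = \{x : x \in \mathbb{R}^s, \|x\|_2 = 1\}$, by [62, Lemma 5.5], $Z_i = \langle (W_T)_{i\cdot}, x\rangle$ are independent sub-Gaussian random variables with $\mathbb{E}Z_i^2 = 1$ and $\|Z_i\|_{\psi_2} \le C\kappa$. Therefore, by Remark 5.18 and Lemma 5.14 in [62], $\|Z_i^2 - 1\|_{\psi_1} \le 2\|Z_i^2\|_{\psi_1} \le 4\|Z_i\|_{\psi_2}^2 \le C\kappa^2$. A Bernstein-type inequality shows that
    $$\mathbb{P}\left(\left|\frac{1}{n}\|W_Tx\|_2^2 - 1\right| \ge \frac{t}{2}\right) = \mathbb{P}\left(\left|\frac{1}{n}\sum_{i=1}^n (Z_i^2 - 1)\right| \ge \frac{t}{2}\right) \le 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$
    By [62, Lemma 5.2], we can find a $\frac{1}{4}$-net $\mathcal{N}_{1/4}$ of $S^{s-1} = \{x : x \in \mathbb{R}^s, \|x\|_2 = 1\}$ with $|\mathcal{N}_{1/4}| \le 9^s$. The union bound tells us
    $$\mathbb{P}\left(\max_{x\in\mathcal{N}_{1/4}}\left|\frac{1}{n}\|W_Tx\|_2^2 - 1\right| \ge \frac{t}{2}\right) \le 9^s\cdot 2\exp\left(-cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$ (94)
    By [62, Lemma 5.4],
    $$\left\|\frac{1}{n}W_T^\top W_T - I_s\right\| \le 2\max_{x\in\mathcal{N}_{1/4}}\left|\left\langle\left(\frac{1}{n}W_T^\top W_T - I_s\right)x, x\right\rangle\right| = 2\max_{x\in\mathcal{N}_{1/4}}\left|\frac{1}{n}\|W_Tx\|_2^2 - 1\right|.$$ (95)
    Since $c_{\min} \le \sigma_{\min}(\Sigma) \le \sigma_{\max}(\Sigma) \le C_{\max}$, we have $\|\Sigma_{T,T}^{1/2}\| \le \sqrt{C_{\max}}$ and $\|\Sigma_{T,T}^{-1/2}\| \le 1/\sqrt{c_{\min}}$. Therefore,
    $$\left\|\frac{1}{n}X_T^\top X_T\Sigma_{T,T}^{-1} - I_s\right\| = \left\|\Sigma_{T,T}^{1/2}\left(\frac{1}{n}W_T^\top W_T - I_s\right)\Sigma_{T,T}^{-1/2}\right\| \le \left\|\Sigma_{T,T}^{1/2}\right\|\left\|\frac{1}{n}W_T^\top W_T - I_s\right\|\left\|\Sigma_{T,T}^{-1/2}\right\| \le \sqrt{\frac{C_{\max}}{c_{\min}}}\left\|\frac{1}{n}W_T^\top W_T - I_s\right\|.$$ (96)

    Combining (94), (95) and (96), we arrive at the conclusion.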
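To make the content of (92) concrete, the sketch below simulates the spectral-norm deviation of the normalized Gram matrix from the identity. It is an illustration under simplifying assumptions that are not in the paper: the rows are standard Gaussian (one particular sub-Gaussian case), so that Σ_{T,T} = I_s, and the helper name `cov_deviation`, the value s = 5 and the sample sizes are hypothetical choices.

```python
import numpy as np

def cov_deviation(n, s, rng):
    """Spectral norm of X_T^T X_T / n - I_s for an n x s standard Gaussian X_T.

    Assumption for illustration only: rows are N(0, I_s), so Sigma_{T,T} = I_s
    and the quantity in (92) reduces to the matrix computed here.
    """
    X_T = rng.standard_normal((n, s))
    return np.linalg.norm(X_T.T @ X_T / n - np.eye(s), 2)

rng = np.random.default_rng(0)
s = 5
devs = {n: cov_deviation(n, s, rng) for n in (200, 2000, 20000)}
for n, dev in devs.items():
    print(f"n = {n:6d}   deviation = {dev:.4f}")
```

In this simulated setting the deviation typically shrinks on the order of √(s/n) as n grows, which matches the regime in which the exponent Cs − cn·min{t²/κ⁴, t/κ²} in (92) becomes negative.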

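  2. Now we consider the proof for (93). Note that $\|H_\delta(\alpha^\top X_T^\top X_\Omega)\|_2 \ge t$ implies that there exists Λ ⊂ Ω such that all entries of $|\alpha^\top X_T^\top X_\Lambda|$ are greater than δ, and $\left\||\alpha^\top X_T^\top X_\Lambda| - \delta\right\|_2 \ge t$. Thus,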
    (Hδ(αXTXΩ/n)2t)(ΛΩ, such that all entries of |αXTXΛ/n|δ, and αXTXΛ/n2t)(ΛΩ,|Λ|δt,αXTXΛ/n2t)+(ΛΩ,|Λ|δ>t, all entries of |αXTXΛ/n|δ)ΛΩ|Λ|=(t/δ)2( all entries of |αXTXΛ/n|δ)++ΛΩ|Λ|=(t/δ)2r(αXTXΛ/n2t)ΛΩ|Λ|=(t/δ)2(αXTXΛ/n2t)+ΛΩ|Λ|=(t/δ)2r(αXTXΛ/n2t). (97)
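    Since $t \ge \delta \ge C\max_{i\in T^c}\|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2\|\alpha\|_2$, we know that whether $|\Lambda| = \lfloor (t/\delta)^2\rfloor \wedge r$ or $\lceil (t/\delta)^2\rceil$,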
    2Cmax(t/δ)2(max iTcΣi,TΣT,T12)α22Cmax2(t/δ)(max iTcΣi,TΣT,T12)α2t.
    By Part 3 of Lemma 6, for any Λ ⊆ Ω and $t \ge 2C_{\max}\sqrt{|\Lambda|}\max_{i\in T^c}\|\Sigma_{i,T}\Sigma_{T,T}^{-1}\|_2\|\alpha\|_2$, we have
    (αXTXΛ/n2t)(αXTXΛ/nEαXTXΛ/n2tEαXTXΛ/n2)(αXTXΛ/nEαXTXΛ/n2tΣΛ,Tα2)=(αXTXΛ/nEαXTXΛ/n2t(iΛ(Σi,Tα)2)1/2)(αXTXΛ/nEαXTXΛ/n2t|Λ| max iTc|Σi,Tα|)(αXTXΛ/nEαXTXΛ/n2t|Λ|max iTcΣi,TΣT,T12ΣT,Tα2)(αXTXΛ/nEαXTXΛ/n2t/2)2 exp(C|Λ|cn min{t2κ4α22,tκ2α2}).
    Combine (97) and the previous inequality, one obtains
    (Hδ(αXTXΩ/n)2t)(r(t/δ)2r)exp(C(t/δ)2rcn min{t2κ4α22,tκ2α2})+(r(t/δ)2)+exp(C(t/δ)2cn min{t2κ4α22,tκ2α2}).

Lemma 8 (Properties of Soft-thresholding):

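  1. Suppose a, b > 0, $x, y \in \mathbb{R}$, and $H_a(\cdot)$ is the soft-thresholding operator satisfying $H_a(x) = \mathrm{sgn}(x)\cdot(|x| - a)_+$. Then the following triangle inequality holds,
    $$|H_{a+b}(x+y)| \le |H_a(x)| + |H_b(y)|.$$ (98)
  2. Suppose a, b > 0 and $x, y \in \mathbb{R}^p$. If $\|H_a(x)\|_{\infty,2} \le b$, then
    $$|\langle x, y\rangle| \le a\|y\|_1 + b\|y\|_{1,2}.$$ (99)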

Proof of Lemma 8.

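1)

$$|H_{a+b}(x+y)| = (|x+y| - a - b)_+ \le (|x| - a + |y| - b)_+ \le (|x| - a)_+ + (|y| - b)_+ = |H_a(x)| + |H_b(y)|.$$

2)

$$\begin{aligned}
|\langle x, y\rangle| &\le |\langle H_a(x), y\rangle| + |\langle x - H_a(x), y\rangle| = \Big|\sum_{j=1}^d\langle [H_a(x)]_{(j)}, y_{(j)}\rangle\Big| + |\langle x - H_a(x), y\rangle|\\
&\le \sum_{j=1}^d\left\|[H_a(x)]_{(j)}\right\|_2\left\|y_{(j)}\right\|_2 + \|x - H_a(x)\|_\infty\|y\|_1 \le \|H_a(x)\|_{\infty,2}\|y\|_{1,2} + a\|y\|_1 \le b\|y\|_{1,2} + a\|y\|_1.
\end{aligned}$$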
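The two properties in Lemma 8 are easy to check numerically. The sketch below implements the soft-thresholding operator H_a entrywise and tests (98) and (99) on random vectors; the group partition into three blocks of four coordinates, the thresholds, and the helper names `soft_threshold`, `norm_inf_2` and `norm_1_2` are hypothetical choices for illustration, not taken from the paper.

```python
import numpy as np

def soft_threshold(x, a):
    """Entrywise soft-thresholding H_a(x) = sgn(x) * (|x| - a)_+."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * np.maximum(np.abs(x) - a, 0.0)

def norm_inf_2(x, groups):
    """Mixed (infinity, 2) norm: largest Euclidean norm over the groups."""
    return max(np.linalg.norm(x[g]) for g in groups)

def norm_1_2(x, groups):
    """Mixed (1, 2) norm: sum of Euclidean norms over the groups."""
    return sum(np.linalg.norm(x[g]) for g in groups)

rng = np.random.default_rng(0)
p = 12
# Hypothetical group structure: three consecutive blocks of four coordinates.
groups = [np.arange(0, 4), np.arange(4, 8), np.arange(8, 12)]
x = rng.standard_normal(p)
y = rng.standard_normal(p)
a = 0.5

# Check (98) entrywise with thresholds a and b98.
b98 = 0.3
lhs98 = np.abs(soft_threshold(x + y, a + b98))
rhs98 = np.abs(soft_threshold(x, a)) + np.abs(soft_threshold(y, b98))
ok_98 = bool(np.all(lhs98 <= rhs98 + 1e-12))

# Check (99): take b equal to the (infinity, 2) norm of H_a(x),
# so the hypothesis ||H_a(x)||_{inf,2} <= b holds with equality.
b = norm_inf_2(soft_threshold(x, a), groups)
lhs99 = abs(x @ y)
rhs99 = a * np.abs(y).sum() + b * norm_1_2(y, groups)
ok_99 = bool(lhs99 <= rhs99 + 1e-12)

print(ok_98, ok_99)
```

Inequality (99) is what allows an inner product to be controlled by the ℓ1 norm and the ℓ1,2 group norm jointly, mirroring the two penalty terms of the sparse group Lasso.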

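Lemma 9: Suppose $X = [X_1, \dots, X_n]^\top$ is a random matrix with independent sub-Gaussian rows satisfying Assumption 1, and $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$. Suppose T ⊆ {1, …, p} has cardinality s, and $P \in \mathbb{R}^{n\times n}$ is a projection matrix independent of $X_T$. Then, for any t ≥ log(es),

$$\mathbb{P}\left(\left\|X_T^\top P\varepsilon\right\|_\infty \ge C\kappa\sqrt{nt\sigma^2}\right) \le e^{-n} + e^{-Ct}.$$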

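Proof of Lemma 9. For a fixed vector $w \in \mathbb{R}^n$, since Assumption 2 is satisfied, for $i \in T$, $X_{1i}, \dots, X_{ni}$ are independent sub-Gaussian random variables such that

$$\mathbb{E}\exp(tX_{ji}) = \mathbb{E}\exp\left(te_i^\top\Sigma^{1/2}\Sigma^{-1/2}X_j\right) \le \exp\left(\frac{\kappa^2\|\Sigma^{1/2}e_i\|_2^2t^2}{2}\right) = \exp\left(\frac{\kappa^2\Sigma_{i,i}t^2}{2}\right) \le \exp\left(\frac{C_{\max}\kappa^2t^2}{2}\right).$$

By the Hoeffding-type inequality,

$$\mathbb{P}\left(\left|X_{\cdot i}^\top w\right| \ge t\|w\|_2\right) \le 2\exp\left(-\frac{ct^2}{\kappa^2}\right).$$ (100)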

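Moreover, by [63, Lemma 1], for any x ≥ 0,

$$\mathbb{P}\left(\sum_{i=1}^n \varepsilon_i^2 \ge \left(n + 2\sqrt{nx} + 2x\right)\sigma^2\right) \le e^{-x}.$$

Setting x = n in the last inequality, we have

$$\mathbb{P}\left(\|\varepsilon\|_2^2 \ge 5n\sigma^2\right) \le e^{-n}.$$ (101)

Combining (100) and (101), and noting that $\|P\varepsilon\|_2 \le \|\varepsilon\|_2$, we have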
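The constant 5 in (101) comes from a one-line substitution into the chi-square tail bound of [63, Lemma 1]. The worked step below only makes that arithmetic explicit, using the same symbols as the proof.

```latex
\[
\left(n + 2\sqrt{n x} + 2x\right)\sigma^2\Big|_{x = n}
  = \left(n + 2n + 2n\right)\sigma^2
  = 5n\sigma^2,
\qquad\text{so}\qquad
\mathbb{P}\left(\|\varepsilon\|_2^2 \ge 5n\sigma^2\right) \le e^{-n}.
\]
```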

(XTPεCκntσ2)iT(|XiPε|Cκntσ2)(Pε25nσ2)+iT(|XiPε|Cκntσ2,Pε25nσ2)(ε25nσ2)+iT(|X.iPε|Cκntσ2Pε25nσ2)en+s2 exp(Ct)en+eCt.

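Lemma 10: With probability at least $1 - Ce^{-cn/s}$, the approximate dual certificate defined in (45) can be written as $u = X^\top w$, where $\|w\|_2 \le C\sqrt{s/n}$.

Proof of Lemma 10. By (45), we have $u = X^\top w$, where $w = (w_1^\top, \dots, w_{l_{\max}}^\top)^\top$ and $w_l = \frac{1}{n_l}X_{I_l,T}\Sigma_{T,T}^{-1}q_{l-1}$. Thus $\|w\|_2^2 = \sum_{l=1}^{l_{\max}}\|w_l\|_2^2$. Also note that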

1nlXIl,TΣT,T1ql122=1nlXIl,TXIl,TΣT,T1ql1,ΣT,T1ql1=(1nlXIl,TXIl,TΣT,T1I|T|)ql1,ΣT,T1ql1+ΣT,T1/2ql122=ql,ΣT,T1ql1+ΣT,T1/2ql122ql2ΣT,T1ql12+ΣT,T1/2ql1221cminql2ql12+1cminql1222cminql122

By (53), with probability at least 1 − C exp(−cn/s),

w22l=1lmaxCnlql122Cn(2s)2+Cn(2s/ log(es))2+C log(es)nl=3lmax(24ls/ log(es))2Csn.

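Proof of Lemma 4. For any 1 ≤ i ≤ p, 1 ≤ j ≤ d, and Λ ⊆ (j) with |Λ| = k, by Lemma 6 with

$$v = \Sigma^{-1}e_i,\quad U \in \mathbb{R}^{p\times k},\quad U_{[\Lambda,:]} = I,\quad U_{[\Lambda^c,:]} = 0,$$

we have

$$\mathbb{P}\left(\left\|(e_i)_\Lambda - \frac{1}{n}X_\Lambda^\top X\Sigma^{-1}e_i\right\|_2 \ge t\right) \le 2\exp\left(Ck - cn\min\left\{\frac{t^2}{\kappa^4}, \frac{t}{\kappa^2}\right\}\right).$$ (102)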

By the same method in Lemma 5 Part 2,

(Hα((ei)(j)1nX(j)XΣ1ei)2γ)(Λ(j), all entries of |(ei)Λ1nXΛXΣ1ei|α and (ei)Λ1nXΛXΣ1ei2γ)(Λ(j),|Λ|αγ,(ei)Λ1nXΛXΣ1ei2γ)+(Λ(j),|Λ|α>γ, all entries of |(ei)Λ1nXΛXΣ1ei|α)Λ(j)|Λ|=s/sg(all entries of |(ei)Λ1nXΛXΣ1ei|α)+Λ(j)|Λ|=s/sg((ei)Λ1nXΛXΣ1ei2γ)Λ(j)|Λ|=s/sg((ei)Λ1nXΛXΣ1ei2γ)+Λ(j)|Λ|=s/sg((ei)Λ1nXΛXΣ1ei2γ).

Combine (102) and the previous inequality together, we have

(Hα((ei)(j)1nX(j)XΣ1ei)2γ)(bjs/sg)2 exp(Cs/sgcnCs log(esgb)+sg log(d/sg)sgn)+(bjs/sg)2 exp(Cs/sgcnCs log(esgb)+sg log(d/sg)sgn)4(2esgbs)2s/sgexp(Cs/sgCs log(esgb)+sg log(d/sg)sg)4 exp(2ssg log(2esgbs)+Cs/sgCs log(esgb)+sg log(d/sg)sg). (103)

By (103) and the union bound, we have

(max 1ipHα(ei1nXXΣ1ei),2γ)i=1pj=1d(Hα((ei)(j)1nX(j)XΣ1ei)2γ)d2b4 exp(2ssg log(2esgbs)+Cs/sgCs log(esgb)+sg log(d/sg)sg)4 exp(2 log(sg)+2 log(d/sg)+3ssg log(2eb)+Cs/sgCs log(esgb)+sg log(d/sg)sg)4 exp(Cs log(esgb)+sg log(d/sg)sg).

References

  • [1] Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, Tai E-S, Teo Y-Y, and Montana G, “Pathways-driven sparse regression identifies pathways and genes associated with high-density lipoprotein cholesterol in two asian cohorts,” PLoS genetics, vol. 9, no. 11, p. e1003939, 2013.
  • [2] Vidyasagar M, “Machine learning methods in the computational biology of cancer,” Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 470, no. 2167, p. 20140081, 2014.
  • [3] Allahyar A and De Ridder J, “Feral: network-based classifier with application to breast cancer outcome prediction,” Bioinformatics, vol. 31, no. 12, pp. i311–i319, 2015.
  • [4] Rao N, Nowak R, Cox C, and Rogers T, “Classification with the sparse group lasso,” IEEE Transactions on Signal Processing, vol. 64, no. 2, pp. 448–463, 2015.
  • [5] Chatterjee S, Steinhaeuser K, Banerjee A, Chatterjee S, and Ganguly A, “Sparse group lasso: Consistency and climate applications,” in Proceedings of the 2012 SIAM International Conference on Data Mining. SIAM, 2012, pp. 47–58.
  • [6] Wang W, Liang Y, and Xing E, “Block regularized lasso for multivariate multi-response linear regression,” in Artificial Intelligence and Statistics, 2013, pp. 608–617.
  • [7] Lounici K, Pontil M, Tsybakov A, and Van De Geer S, “Taking advantage of sparsity in multi-task learning,” in COLT 2009-The 22nd Conference on Learning Theory, 2009.
  • [8] Lozano AC and Swirszcz G, “Multi-level lasso for sparse multi-task regression,” in Proceedings of the 29th International Conference on Machine Learning. Omnipress, 2012, pp. 595–602.
  • [9] Zhou HH, Zhang Y, Ithapu VK, Johnson SC, and Singh V, “When can multi-site datasets be pooled for regression? hypothesis tests, ℓ2-consistency and neuroscience applications,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 2017, pp. 4170–4179.
  • [10] Friedman J, Hastie T, and Tibshirani R, “A note on the group lasso and a sparse group lasso,” arXiv preprint arXiv:1001.0736, 2010.
  • [11] Simon N, Friedman J, Hastie T, and Tibshirani R, “A sparse-group lasso,” Journal of Computational and Graphical Statistics, vol. 22, no. 2, pp. 231–245, 2013.
  • [12] Li Y, Nan B, and Zhu J, “Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure,” Biometrics, vol. 71, no. 2, pp. 354–363, 2015.
  • [13] Tibshirani R, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288, 1996.
  • [14] Bickel PJ, Ritov Y, and Tsybakov AB, “Simultaneous analysis of lasso and dantzig selector,” The Annals of Statistics, vol. 37, no. 4, pp. 1705–1732, 2009.
  • [15] Verzelen N, “Minimax risks for sparse regressions: Ultra-high dimensional phenomenons,” Electronic Journal of Statistics, vol. 6, pp. 38–90, 2012.
  • [16] Yuan M and Lin Y, “Model selection and estimation in regression with grouped variables,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 68, no. 1, pp. 49–67, 2006.
  • [17] Lounici K, Pontil M, Van De Geer S, and Tsybakov AB, “Oracle inequalities and optimal inference under group sparsity,” The Annals of Statistics, vol. 39, no. 4, pp. 2164–2204, 2011.
  • [18] Bunea F, Lederer J, and She Y, “The group square-root lasso: Theoretical properties and fast algorithms,” IEEE Transactions on Information Theory, vol. 60, no. 2, pp. 1313–1325, 2013.
  • [19] Johnstone IM and Lu AY, “On consistency and sparsity for principal components analysis in high dimensions,” Journal of the American Statistical Association, vol. 104, no. 486, pp. 682–693, 2009.
  • [20] Ma Z, “Sparse principal component analysis and iterative thresholding,” The Annals of Statistics, vol. 41, no. 2, pp. 772–801, 2013.
  • [21] Zhang A and Xia D, “Tensor SVD: Statistical and computational limits,” IEEE Transactions on Information Theory, vol. 64, no. 11, pp. 7311–7338, 2018.
  • [22] Wang M and Li L, “Learning from binary multiway data: Probabilistic tensor decomposition and its statistical optimality,” arXiv preprint arXiv:1811.05076, 2018.
  • [23] Oymak S, Jalali A, Fazel M, Eldar YC, and Hassibi B, “Simultaneously structured models with application to sparse and low-rank matrices,” IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2886–2908, 2015.
  • [24] Hao B, Zhang A, and Cheng G, “Sparse and low-rank tensor estimation via cubic sketchings,” arXiv preprint arXiv:1801.09326, 2018.
  • [25] Zhang A and Han R, “Optimal sparse singular value decomposition for high-dimensional high-order data,” Journal of the American Statistical Association, 2018.
  • [26] Jaganathan K, Oymak S, and Hassibi B, “Sparse phase retrieval: Convex algorithms and limitations,” in 2013 IEEE International Symposium on Information Theory. IEEE, 2013, pp. 1022–1026.
  • [27] Shechtman Y, Beck A, and Eldar YC, “Gespar: Efficient phase retrieval of sparse signals,” IEEE transactions on signal processing, vol. 62, no. 4, pp. 928–938, 2014.
  • [28] Cai TT, Li X, and Ma Z, “Optimal rates of convergence for noisy sparse phase retrieval via thresholded Wirtinger flow,” The Annals of Statistics, vol. 44, pp. 2221–2251, 2016.
  • [29] Oymak S, Jalali A, Fazel M, and Hassibi B, “Noisy estimation of simultaneously structured models: Limitations of convex relaxation,” in Decision and Control (CDC), 2013 IEEE 52nd Annual Conference on. IEEE, 2013, pp. 6019–6024.
  • [30] Zhang C-H and Zhang SS, “Confidence intervals for low dimensional parameters in high dimensional linear models,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 76, no. 1, pp. 217–242, 2014.
  • [31] Javanmard A and Montanari A, “Confidence intervals and hypothesis testing for high-dimensional regression,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 2869–2909, 2014.
  • [32] Mitra R and Zhang C-H, “The benefit of group sparsity in group inference with de-biased scaled group lasso,” Electronic Journal of Statistics, vol. 10, no. 2, pp. 1829–1873, 2016.
  • [33] Cai TT and Guo Z, “Confidence intervals for high-dimensional linear regression: Minimax rates and adaptivity,” The Annals of statistics, vol. 45, no. 2, pp. 615–646, 2017.
  • [34] Negahban SN, Ravikumar P, Wainwright MJ, and Yu B, “A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers,” Statistical Science, vol. 27, no. 4, pp. 538–557, 2012.
  • [35] Stojnic M, Xu W, and Hassibi B, “Compressed sensing-probabilistic analysis of a null-space characterization,” in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 3377–3380.
  • [36] Candes E, Romberg J, and Tao T, “Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information,” IEEE Transactions on Information Theory, vol. 52, no. 2, pp. 489–509, 2006.
  • [37] Candes E and Tao T, “The dantzig selector: Statistical estimation when p is much larger than n,” The Annals of Statistics, vol. 35, no. 6, pp. 2313–2351, 2007.
  • [38] Cai TT and Zhang A, “Compressed sensing and affine rank minimization under restricted isometry,” IEEE Transactions on Signal Processing, vol. 61, no. 13, pp. 3279–3290, 2013.
  • [39] ——, “Sharp rip bound for sparse signal and low-rank matrix recovery,” Applied and Computational Harmonic Analysis, vol. 35, no. 1, pp. 74–93, 2013.
  • [40] ——, “Sparse representation of a polytope and recovery of sparse signals and low-rank matrices,” IEEE transactions on information theory, vol. 60, no. 1, pp. 122–132, 2014.
  • [41] Poignard B, “Asymptotic theory of the adaptive sparse group lasso,” Annals of the Institute of Statistical Mathematics, pp. 1–32, 2018.
  • [42] Rao N, Cox C, Nowak R, and Rogers TT, “Sparse overlapping sets lasso for multitask learning and its application to fmri analysis,” in Advances in neural information processing systems, 2013, pp. 2202–2210.
  • [43] Ahsen ME and Vidyasagar M, “Error bounds for compressed sensing algorithms with group sparsity: A unified approach,” Applied and Computational Harmonic Analysis, vol. 43, no. 2, pp. 212–232, 2017.
  • [44] Candes EJ and Plan Y, “A probabilistic and ripless theory of compressed sensing,” IEEE transactions on information theory, vol. 57, no. 11, pp. 7235–7254, 2011.
  • [45] Javanmard A and Montanari A, “Debiasing the lasso: Optimal sample size for gaussian designs,” The Annals of Statistics, vol. 46, no. 6A, pp. 2593–2622, 2018.
  • [46] Foucart S and Rauhut H, A mathematical introduction to compressive sensing. Birkhäuser Basel, 2013, vol. 1, no. 3.
  • [47] Candes E and Tao T, “Decoding by linear programming,” IEEE Transactions on Information Theory, vol. 51, no. 12, pp. 4203–4215, 2005.
  • [48] Bertsekas D and Nedic A, “Convex analysis and optimization (conservative),” 2003.
  • [49] Candès EJ and Recht B, “Exact matrix completion via convex optimization,” Foundations of Computational mathematics, vol. 9, no. 6, p. 717, 2009.
  • [50] Gross D, “Recovering low-rank matrices from few coefficients in any basis,” IEEE Transactions on Information Theory, vol. 57, no. 3, pp. 1548–1566, 2011.
  • [51] Candès EJ, Li X, Ma Y, and Wright J, “Robust principal component analysis?” Journal of the ACM (JACM), vol. 58, no. 3, p. 11, 2011.
  • [52] Yuan M and Zhang C-H, “On tensor completion via nuclear norm minimization,” Foundations of Computational Mathematics, vol. 16, no. 4, pp. 1031–1068, 2016.
  • [53] Van de Geer S, Bühlmann P, Ritov Y, and Dezeure R, “On asymptotically optimal confidence regions and tests for high-dimensional models,” The Annals of Statistics, vol. 42, no. 3, pp. 1166–1202, 2014.
  • [54] Sun T and Zhang C-H, “Scaled sparse linear regression,” Biometrika, vol. 99, no. 4, pp. 879–898, 2012.
  • [55] Tibshirani R, Saunders M, Rosset S, Zhu J, and Knight K, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.
  • [56] Rinaldo A, “Properties and refinements of the fused lasso,” The Annals of Statistics, vol. 37, no. 5B, pp. 2922–2952, 2009.
  • [57] Jalali A and Fazel M, “A convex method for learning d-valued models,” in 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 2013, pp. 1123–1126.
  • [58] Jalali A, Javanmard A, and Fazel M, “New computational and statistical aspects of regularized regression with application to rare feature selection and aggregation,” arXiv preprint arXiv:1904.05338, 2019.
  • [59] Sprechmann P, Ramirez I, Sapiro G, and Eldar Y, “Collaborative hierarchical sparse modeling,” in 2010 44th Annual Conference on Information Sciences and Systems (CISS). IEEE, 2010, pp. 1–6.
  • [60] Hsu D, Kakade S, and Zhang T, “A tail inequality for quadratic forms of subgaussian random vectors,” Electronic Communications in Probability, vol. 17, 2012.
  • [61] Rudelson M and Vershynin R, “Hanson-wright inequality and subgaussian concentration,” Electronic Communications in Probability, vol. 18, 2013.
  • [62] Vershynin R, “Introduction to the non-asymptotic analysis of random matrices,” arXiv preprint arXiv:1011.3027, 2010.
  • [63] Laurent B and Massart P, “Adaptive estimation of a quadratic functional by model selection,” Annals of Statistics, pp. 1302–1338, 2000.
