Author manuscript; available in PMC: 2019 Nov 1.
Published in final edited form as: J Multivar Anal. 2018 Aug 23;168:334–351. doi: 10.1016/j.jmva.2018.08.007

Broken adaptive ridge regression and its asymptotic properties

Linlin Dai a, Kani Chen b, Zhihua Sun c, Zhenqiu Liu d, Gang Li e,*
PMCID: PMC6430210  NIHMSID: NIHMS1504510  PMID: 30911202

Abstract

This paper studies the asymptotic properties of a sparse linear regression estimator, referred to as broken adaptive ridge (BAR) estimator, resulting from an L0-based iteratively reweighted L2 penalization algorithm using the ridge estimator as its initial value. We show that the BAR estimator is consistent for variable selection and has an oracle property for parameter estimation. Moreover, we show that the BAR estimator possesses a grouping effect: highly correlated covariates are naturally grouped together, which is a desirable property not known for other oracle variable selection methods. Lastly, we combine BAR with a sparsity-restricted least squares estimator and give conditions under which the resulting two-stage sparse regression method is selection and estimation consistent in addition to having the grouping property in high- or ultrahigh-dimensional settings. Numerical studies are conducted to investigate and illustrate the operating characteristics of the BAR method in comparison with other methods.

Keywords: Feature selection, Grouping effect, L0-penalized regression, Oracle estimator, Sparsity recovery

1. Introduction

Simultaneous variable selection and parameter estimation is an essential task in statistics and its applications. A natural approach to variable selection is L0-penalized regression, which directly penalizes the cardinality of a model through well-known information criteria such as Mallows' Cp [35], Akaike's information criterion (AIC) [1], the Bayesian information criterion (BIC) [9, 38], and risk inflation criteria (RIC) [17]. It has also been shown to have an optimality property for variable selection and parameter estimation [40]. However, the L0-penalization problem is nonconvex and finding its global optima requires exhaustive combinatorial best subset search, which is NP-hard and computationally infeasible even for data in moderate dimension. Moreover, it can be unstable for variable selection [6]. A popular alternative is L1-penalized regression, or Lasso [43], which is known to be consistent for variable selection [36, 48] although not consistent for parameter estimation [13, 51]. During the past two decades, much effort has been devoted to improving Lasso using various variants of the L1 penalty, which are not only consistent for variable selection, but also consistent for parameter estimation [13, 15, 24, 27, 47, 51]. Yet another approach is to approximate the L0 penalty using a surrogate L0 penalty function, which could lead to a more sparse model with favorable numerical properties [11, 25, 40, 41].

This paper concerns a different L0-based approach to variable selection and parameter estimation which performs iteratively reweighted L2 penalized regressions in order to approximate an L0 penalized regression [19, 33]. Unlike a surrogate L0 penalization method such as the truncated Lasso penalization (TLP) method [41], the L0-based iteratively reweighted L2 penalization method can be viewed as performing a sequence of surrogate L0 penalizations, where each reweighted L2 penalty serves as an adaptive surrogate L0 penalty and the approximation of L0 penalization improves with each iteration. It is also computationally appealing since each iteration only involves a reweighted L2 penalization. The idea of iteratively reweighted penalization has its roots in the well-known Lawson algorithm [29] in classical approximation theory, which later found applications in Ld minimization for values of d ∈ (0, 1) [37], sparse signal reconstruction [21], compressive sensing [7, 8, 10, 20, 45], and variable selection [19, 33]. However, despite its increasing popularity in applications and favorable numerical performance, theoretical studies of the iteratively reweighted L2 penalization method have so far focused on its non-asymptotic numerical convergence properties [10, 19, 33]. Its statistical properties such as selection consistency and estimation consistency remain unexplored.

The primary purpose of this paper is to study rigorously the asymptotic properties of a specific version of the L0-based iteratively reweighted L2-penalization method for the linear model. We will refer to the resulting estimator as the broken adaptive ridge (BAR) estimator to reflect the fact that it is obtained by fitting adaptively reweighted ridge regressions which eventually break part of the ridge, with the corresponding weights diverging to infinity, to produce sparsity. The BAR algorithm uses an L2-penalized (or ridge) estimator as its initial estimator and updates the estimator iteratively using reweighted L2 penalizations, where the weight for each coefficient at each iteration step is the inverse of its squared estimate from the previous iteration, as defined later in Eqs. (2) and (3) of Section 2.

Our key theoretical contribution is to establish two large-sample properties for the BAR estimator. First, we will show that the BAR estimator possesses an oracle property: when the true model is sparse, with probability tending to 1, it estimates the zero coefficients as zeros and estimates the non-zero coefficients as well as if the true sub-model were known in advance. To this end, we stress that because the BAR estimator is the limit of an iterative penalization algorithm, commonly used techniques for deriving oracle properties of a single-step penalized regression method do not apply directly. As pointed out by a reviewer, the common strategies used in analyzing approximate solutions to nonconvex penalized methods such as SCAD and MCP usually rely on an initial estimator that is close to the optimum, and then show that some kind of iterative refinement of the initial estimator leads to a good local estimator [15, 34]. Our derivations of the asymptotic oracle properties for BAR are inspired by this idea, but the technical details are substantially different from those for other variable selection methods in the literature. A unique feature of the BAR algorithm is that at each reweighted L2 iteration, the resulting estimator is not sparse. We show that with a good initial ridge estimator, each BAR iteration shrinks the estimates of the zero coefficients towards zero and drives the estimates of the nonzero coefficients towards an oracle ridge estimator. Consequently, sparsity and selection consistency are achieved only in the limit of the BAR algorithm, and the estimation consistency of the ridge estimator for the nonzero coefficients is preserved. Second, we will show that the BAR estimator has a grouping effect: highly correlated variables are naturally grouped together with similar coefficients, which is a desirable property inherited from the ridge estimator. Therefore, our results show that the BAR estimator enjoys the best of the L0 and L2 penalized regressions, while avoiding some of their limitations. Lastly, we consider high- or ultrahigh-dimensional settings where the dimension is larger or much larger than the sample size. Frommlet and Nuel [19] suggested using the mBIC penalty for the BAR method in high-dimensional settings, but its asymptotic consistency has yet to be studied rigorously. We propose a different extension of the BAR method to ultrahigh-dimensional settings by combining it with a sparsity-restricted least squares estimation method and show rigorously that the resulting two-stage estimator is asymptotically consistent for variable selection and parameter estimation, and that it retains the grouping property.

In Section 2, we review the BAR estimator. In Section 3, we state our main results on its oracle property and grouping property for the general non-orthogonal case; we also develop rigorously a two-stage BAR method for high- or ultrahigh-dimensional settings. Section 4 illustrates the finite-sample operating characteristics of BAR in comparison with some popular methods for variable selection, parameter estimation, prediction, and grouping. Section 5 presents illustrations involving prostate cancer data and gut microbiota data. Concluding remarks are given in Section 6. Proofs of the theoretical results are deferred to the Appendix.

2. Broken adaptive ridge estimator

Consider the linear regression model

$$y = \sum_{j=1}^{p_n}x_j\beta_j + \varepsilon,$$

where $y \in \mathbb{R}^n$ is a response vector, $x_1,\ldots,x_{p_n} \in \mathbb{R}^n$ are predictors, and $\beta = (\beta_1,\ldots,\beta_{p_n})^\top$ is the vector of regression coefficients. Assume that $\beta_0$ is the true parameter. Without loss of generality, we suppose that $X = (x_1,\ldots,x_{p_n}) = (w_1,\ldots,w_n)^\top$ and $y$ are centered. In this paper, $\|\cdot\|$ denotes the Euclidean norm for a vector and the spectral norm for a matrix. Let

$$\hat\beta^{(0)} = \arg\min_\beta\Big\{\|y - X\beta\|^2 + \xi_n\sum_{j=1}^{p_n}\beta_j^2\Big\} = (X^\top X + \xi_nI)^{-1}X^\top y \tag{1}$$

be the ridge estimator, where ξn ≥ 0 is a tuning parameter. For any integer k ≥ 1, define

$$\hat\beta^{(k)} = g\{\hat\beta^{(k-1)}\}, \tag{2}$$

where

$$g(\tilde\beta) = \arg\min_\beta\Big\{\|y - X\beta\|^2 + \lambda_n\sum_{j=1}^{p_n}\beta_j^2/\tilde\beta_j^2\Big\} = \{X^\top X + \lambda_nD(\tilde\beta)\}^{-1}X^\top y, \tag{3}$$

and $D(\tilde\beta) = \mathrm{diag}(\tilde\beta_1^{-2},\ldots,\tilde\beta_{p_n}^{-2})$. The broken adaptive ridge (BAR) estimator is defined as

$$\hat\beta^* = \lim_{k\to\infty}\hat\beta^{(k)}.$$

Remark 1. [Weight selection and arithmetic overflow] The initial ridge estimator $\hat\beta^{(0)}$ and its subsequent updates $\hat\beta^{(k)}$ do not yield any zero coefficient. Thus, the weights $\tilde\beta_j^{-2}$ in each iteration are well defined. However, as the iterates converge to their limit, the weight matrix $D\{\hat\beta^{(k-1)}\}$ will inevitably involve division by a very small nonzero value, which could cause an arithmetic overflow. A commonly used solution to this problem is to introduce a small perturbation, or more specifically, to replace $D(\tilde\beta) = \mathrm{diag}(\tilde\beta_1^{-2},\ldots,\tilde\beta_{p_n}^{-2})$ by $\mathrm{diag}\{1/(\tilde\beta_1^2+\delta^2),\ldots,1/(\tilde\beta_{p_n}^2+\delta^2)\}$ for some $\delta > 0$ to avoid numerical instability [10, 19]. Here we provide a different solution by rewriting formula (3) as

$$g(\tilde\beta) = \Gamma(\tilde\beta)\{\tilde X(\tilde\beta)^\top\tilde X(\tilde\beta) + \lambda_nI_{p_n}\}^{-1}\tilde X(\tilde\beta)^\top y, \tag{4}$$

where $\Gamma(\tilde\beta) = \mathrm{diag}(\tilde\beta_1,\ldots,\tilde\beta_{p_n})$ and $\tilde X(\tilde\beta) = X\Gamma(\tilde\beta)$. Because the right-hand side of (4) involves only multiplication, instead of division, by the $\tilde\beta_j$'s, numerical stability is achieved without having to introduce a perturbation $\delta$. We also note that the BAR algorithm based on (4) is equivalent to the algorithm of Liu and Li [33].
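To make the iteration concrete, the following is a minimal NumPy sketch of the BAR algorithm defined by (1)–(4), using the multiplication-only form (4); the function name, default tuning values, and stopping rule are ours and purely illustrative, not part of the paper.

```python
import numpy as np

def bar_estimator(X, y, xi_n=1.0, lam_n=1.0, max_iter=500, tol=1e-10):
    """Minimal sketch of the broken adaptive ridge (BAR) iteration.

    Starts from the ridge estimator (1) and applies the multiplication-only
    update (4), g(b) = Gamma(b) {Xt'Xt + lam I}^{-1} Xt'y with Xt = X Gamma(b),
    which avoids dividing by near-zero coefficients.
    """
    n, p = X.shape
    # Initial ridge estimator, Eq. (1).
    beta = np.linalg.solve(X.T @ X + xi_n * np.eye(p), X.T @ y)
    for _ in range(max_iter):
        Xt = X * beta                      # X * Gamma(beta): scale column j by beta_j
        inner = np.linalg.solve(Xt.T @ Xt + lam_n * np.eye(p), Xt.T @ y)
        beta_new = beta * inner            # Gamma(beta) times the inner solution, Eq. (4)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```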

Remark 2. [Orthogonal case] To appreciate how the BAR method works, it is helpful to examine the orthogonal case, which admits a closed form for the BAR estimator and which has been previously studied in [19]. Assume that $X^\top X/n = I_{p_n}$. Then, for each $j \in \{1,\ldots,p_n\}$, the $j$th component of $g(\tilde\beta)$ defined by (3) is equal to

$$g_j(\tilde\beta_j) = \frac{\tilde\beta_j^2}{\tilde\beta_j^2 + \lambda_n/n}\,\hat\beta_j^{(\mathrm{OLS})},$$

where $\hat\beta^{(\mathrm{OLS})} = n^{-1}X^\top y$ is the ordinary least squares (OLS) estimate. Hence the BAR algorithm can be implemented component by component. First, we show the existence and uniqueness of the BAR estimator $\hat\beta^*$. Without loss of generality, assume $\hat\beta_j^{(\mathrm{OLS})} > 0$, so that $0 \le \hat\beta_j^{(k)} \le \hat\beta_j^{(\mathrm{OLS})}$ for every integer $k$. Also note that the map $x \mapsto g(x) = x^2\hat\beta_j^{(\mathrm{OLS})}/(x^2 + \lambda_n/n)$ is increasing in $x$ on $(0,\infty)$. Set $\phi(x) = g(x) - x$. For fixed $n$ and $\lambda_n$, the following facts can be verified easily.

  • (i)

    If $\hat\beta_j^{2(\mathrm{OLS})} < 4\lambda_n/n$, then $\phi(x) < 0$. Therefore, $\hat\beta_j^{(k+1)} < \hat\beta_j^{(k)}$ for all $k$ and $\hat\beta_j^* = 0$.

  • (ii)

    If $\hat\beta_j^{2(\mathrm{OLS})} = 4\lambda_n/n$, then $\phi(x) \le 0$. This leads to $\hat\beta_j^{(k+1)} \le \hat\beta_j^{(k)}$ for all $k$. Then,
    $$\hat\beta_j^* = \begin{cases} 0 & \text{if } \hat\beta_j^{(0)} < \hat\beta_j^{(\mathrm{OLS})}/2,\\ \hat\beta_j^{(\mathrm{OLS})}/2 & \text{otherwise.} \end{cases}$$

  • (iii)

    If $\hat\beta_j^{2(\mathrm{OLS})} > 4\lambda_n/n$, then $\phi(x) \ge 0$ when $x \in [x_1, x_2]$, where $x_1 = \hat\beta_j^{(\mathrm{OLS})}/2 - \{\hat\beta_j^{2(\mathrm{OLS})}/4 - \lambda_n/n\}^{1/2}$ and $x_2 = \hat\beta_j^{(\mathrm{OLS})}/2 + \{\hat\beta_j^{2(\mathrm{OLS})}/4 - \lambda_n/n\}^{1/2}$, and $\phi(x) < 0$ otherwise. As a result,
    $$\hat\beta_j^* = \begin{cases} 0 & \text{if } \hat\beta_j^{(0)} < x_1,\\ \hat\beta_j^{(\mathrm{OLS})}/2 + \{\hat\beta_j^{2(\mathrm{OLS})}/4 - \lambda_n/n\}^{1/2} & \text{otherwise.} \end{cases}$$

In conclusion, the BAR algorithm always converges, and its limit depends on the initial value. Since we choose the ridge estimator as the initial value, which is close to the OLS estimator under mild conditions, the BAR algorithm for the $j$th component converges to the following limit:

$$\hat\beta_j^* = \begin{cases} 0 & \text{if } \hat\beta_j^{(\mathrm{OLS})} \in (-2\sqrt{\lambda_n/n},\, 2\sqrt{\lambda_n/n}),\\ \hat\beta_j^{(\mathrm{OLS})}/2 + \{\hat\beta_j^{2(\mathrm{OLS})}/4 - \lambda_n/n\}^{1/2} & \text{if } \hat\beta_j^{(\mathrm{OLS})} \notin (-2\sqrt{\lambda_n/n},\, 2\sqrt{\lambda_n/n}). \end{cases} \tag{5}$$

Hence, the BAR estimator $\hat\beta^*$ shrinks the small values of $\hat\beta^{(\mathrm{OLS})}$ to 0 and keeps its large values essentially unchanged provided that $\lambda_n/n$ is small. Figure 1 depicts the BAR thresholding function (5), along with some well-known thresholding functions (L0, Lasso, Adaptive Lasso, and SCAD), suggesting that BAR gives a good approximation to an L0-penalized regression. The existence and uniqueness of the BAR estimator for a general design matrix $X$ are nontrivial, although they have been verified empirically in our extensive numerical studies; a slightly weaker theoretical result is provided in Theorem 1(i) of Section 3. Overall, BAR is expected to behave like an L0-penalized regression, but is more stable and easily scalable to high-dimensional settings.

Figure 1: Thresholding functions of BAR, L0, Lasso, Adaptive Lasso, SCAD, and L0.5.
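The closed-form limit (5) is easy to evaluate directly. Below is a minimal NumPy sketch of the componentwise BAR thresholding rule; the sign handling extends the without-loss-of-generality positive case discussed above, the initial value is assumed to lie above the lower root $x_1$ (as with a ridge/OLS start), and the function name is ours.

```python
import numpy as np

def bar_threshold(beta_ols, lam_over_n):
    """Componentwise BAR limit (5) in the orthogonal case X'X/n = I.

    Components with |beta_ols| below 2*sqrt(lam/n) are set to zero; the
    others converge to the nonzero fixed point of x = g(x) farther from zero.
    """
    beta_ols = np.asarray(beta_ols, dtype=float)
    cutoff = 2.0 * np.sqrt(lam_over_n)
    keep = np.abs(beta_ols) >= cutoff
    out = np.zeros_like(beta_ols)
    # Outer root of x**2 - beta_ols*x + lam/n = 0 for the kept components.
    out[keep] = beta_ols[keep] / 2.0 + np.sign(beta_ols[keep]) * np.sqrt(
        beta_ols[keep] ** 2 / 4.0 - lam_over_n)
    return out
```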

3. Large-sample properties of BAR

3.1. Oracle property

We next develop the oracle properties of the BAR estimator for the general case without requiring the design matrix to be orthogonal.

Let $\beta_0 = (\beta_{01},\ldots,\beta_{0p_n})^\top$ be the true parameter value, with dimension $p_n$ diverging to infinity but $p_n < n$. For simplicity, write $\beta_0 = (\beta_{01}^\top, \beta_{02}^\top)^\top$, where $\beta_{01} \in \mathbb{R}^{q_n}$ and $\beta_{02} \in \mathbb{R}^{p_n-q_n}$. Assume $\beta_{01} \ne 0$ and $\beta_{02} = 0$. Let $\hat\beta^* = (\hat\beta_1^{*\top}, \hat\beta_2^{*\top})^\top$ denote the BAR estimator of $\beta_0$. Set $X_1 = (x_1,\ldots,x_{q_n})$, $\Sigma_{n1} = X_1^\top X_1/n$ and $\Sigma_n = X^\top X/n$.

The following conditions are needed to derive the asymptotic properties of the BAR estimator.

  • (C1)

    The errors ε1, …, εn are independent and identically distributed random variables with mean zero and variance 0 < σ2 < ∞.

  • (C2)

    There exists a constant C > 1 such that 0 < 1/C < λmin (Σn) ≤ λmax (Σn) < C < ∞ for every integer n.

  • (C3)

    Let $a_{0n} = \min_{1\le j\le q_n}|\beta_{0j}|$ and $a_{1n} = \max_{1\le j\le q_n}|\beta_{0j}|$. As $n \to \infty$: $p_nq_n/n \to 0$, $\lambda_n/n \to 0$, $\lambda_n/p_n \to \infty$, $\xi_na_{1n}/\sqrt{n} \to 0$, $(p_n/n)^{1/2}/a_{0n} \to 0$, and $\lambda_na_{1n}(q_n/n)^{1/2}/a_{0n}^2 \to 0$.

Theorem 1 (Oracle property). Assume that conditions (C1)–(C3) are satisfied. For any $q_n$-vector $b_n$ satisfying $\|b_n\| \le 1$, define $s_n^2 = \sigma^2b_n^\top\Sigma_{n1}^{-1}b_n$. Then the fixed point of $f(\alpha) = \{X_1^\top X_1 + \lambda_nD_1(\alpha)\}^{-1}X_1^\top y$ exists and is unique, where $D_1(\alpha) = \mathrm{diag}(\alpha_1^{-2},\ldots,\alpha_{q_n}^{-2})$. Moreover, with probability tending to 1,

  • (i)

    the BAR estimator $\hat\beta^* = (\hat\beta_1^{*\top}, \hat\beta_2^{*\top})^\top$ exists and is unique, where $\hat\beta_2^* = 0$ and $\hat\beta_1^*$ is the unique fixed point of $f(\alpha)$;

  • (ii)

    $\sqrt{n}\,s_n^{-1}b_n^\top(\hat\beta_1^* - \beta_{01}) \to \mathcal{N}(0,1)$ in distribution.

3.2. Grouping effect

When the true model has a natural group structure, it is desirable to have all coefficients within a group clustered together. For example, by virtue of gene network relationships, some genes are strongly correlated and are referred to as grouped genes [39]. The next theorem shows that the BAR method has a grouping property.

Theorem 2. Assume that the columns of the matrix $X$ are standardized so that $x_{1j} + \cdots + x_{nj} = 0$ and $\|x_j\|_2^2 = 1$ for all $j \in \{1,\ldots,p_n\}$. Let $\hat\beta^*$ be the BAR estimator. Then, with probability tending to 1, for any $i < j$,

$$\big|\hat\beta_i^{*-1} - \hat\beta_j^{*-1}\big| \le \frac{1}{\lambda_n}\|y\|\sqrt{2(1-\rho_{ij})},$$

provided that $\hat\beta_i^*\hat\beta_j^* \ne 0$, where $\rho_{ij} = x_i^\top x_j$ is the sample correlation of $x_i$ and $x_j$.

The above grouping property of BAR is similar to that of the Elastic Net. In particular, Zou and Hastie [52] showed that if $\hat\beta_i^{(\mathrm{EN})}\hat\beta_j^{(\mathrm{EN})} > 0$, then

$$\big|\hat\beta_i^{(\mathrm{EN})} - \hat\beta_j^{(\mathrm{EN})}\big| \le \frac{1}{\lambda}\|y\|\sqrt{2(1-\rho_{ij})},$$

where $\hat\beta_i^{(\mathrm{EN})}$ is the Elastic Net estimate of $\beta_i$ and $\lambda$ is a tuning parameter. As seen in the proof of Theorem 2, the reason for stating the grouping effect of the BAR estimator in terms of $\hat\beta_i^{*-1}$ is the weight $\hat\beta_i^{*-2}$ used in the reweighted L2 penalization.

3.3. High-dimensional setting

Like most oracle variable selection methods, the oracle property of the BAR estimator in Theorem 1 is derived for $p_n < n$. In high- or ultrahigh-dimensional problems where the number of variables $p_n$ can be much larger than the sample size $n$, it is not clear whether BAR would still enjoy the oracle property under similar conditions. A commonly used strategy for ultrahigh-dimensional problems is to add a screening step to reduce the dimension before performing variable selection and parameter estimation [14]. During the past decades, many variable screening procedures have been developed with the sure screening property that, with probability tending to 1, the screened model retains all important variables; see, e.g., [12, 14, 15, 22, 28, 30–32, 44, 46, 49, 50] and the references therein. Our results in Theorems 3 and 4 guarantee that, when combined with a sure screening procedure, the BAR estimator has the oracle and grouping properties in ultrahigh-dimensional settings.

Inspired by Xu and Chen [46], who studied a sparse maximum likelihood estimation method for generalized linear models, we introduce below a sparsity-restricted least squares estimate for screening joint effects in the linear model with unspecified error distribution and establish its sure screening property. Specifically, for a given $k$, define the following sparsity-restricted least squares estimate (sLSE):

$$\hat\beta_{\mathrm{sLSE}} = \arg\min_\beta\big\{\ell(\beta): \|\beta\|_0 \le k\big\}, \tag{6}$$

where $\ell(\beta) = (y - X\beta)^\top(y - X\beta)$. Solving for $\hat\beta_{\mathrm{sLSE}}$ can be computationally demanding even for moderately large $k$ since it is equivalent to a best subset search. Here we adopt the iterative hard-thresholding (IHT) algorithm of Blumensath and Davies [4] to approximately solve the optimization problem (6). Define a local quadratic approximation of $\ell(\beta)$ by

$$h(\gamma,\beta) = \ell(\beta) + (\gamma - \beta)^\top S(\beta) + \frac{u}{2}\|\gamma - \beta\|_2^2,$$

where $S(\beta) = \ell'(\beta) = -2X^\top(y - X\beta)$ and $u > 0$ is a scaling parameter. The IHT algorithm solves (6) via the following iterative scheme:

$$\beta^{(t+1)} = \arg\min_\gamma\big\{h(\gamma,\beta^{(t)}): \|\gamma\|_0 \le k\big\}, \tag{7}$$

where $\beta^{(0)} = 0$. Note that, without the sparsity constraint, (7) has the explicit solution

$$\gamma = \beta^{(t)} + \frac{2}{u}X^\top(y - X\beta^{(t)}).$$

With the sparsity constraint, the solution of (7) is then given by

$$\beta^{(t+1)} = \big(\gamma_1\mathbf{1}\{|\gamma_1| \ge r\},\ldots,\gamma_{p_n}\mathbf{1}\{|\gamma_{p_n}| \ge r\}\big)^\top,$$

with $r$ being the $k$th largest component of $|\gamma| = (|\gamma_1|,\ldots,|\gamma_{p_n}|)^\top$. The update at each iteration of the IHT algorithm is therefore straightforward; a minimal sketch is given below. Theorem 3 below shows that, with an appropriate scaling parameter $u$, the sequence $\{\beta^{(t)}\}$ obtained from this IHT algorithm decreases the objective function at every iteration and converges to a local solution of (6).
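For concreteness, here is a minimal NumPy sketch of the IHT update (7), combining the gradient step and hard thresholding described above; the choice $u = 2\lambda_{\max}(X^\top X)$ follows Theorem 3 below, while the function name, iteration cap, and tolerance are ours and only illustrative.

```python
import numpy as np

def sparse_lse_iht(X, y, k, max_iter=200, tol=1e-8):
    """Minimal sketch of the IHT algorithm (7) for the sLSE problem (6).

    Each step takes a gradient step gamma = beta + (2/u) X'(y - X beta)
    and then keeps only the k largest components in absolute value.
    """
    n, p = X.shape
    u = 2.0 * np.linalg.eigvalsh(X.T @ X).max()   # scaling parameter of Theorem 3
    beta = np.zeros(p)
    for _ in range(max_iter):
        gamma = beta + (2.0 / u) * (X.T @ (y - X @ beta))
        beta_new = np.zeros(p)
        top = np.argsort(np.abs(gamma))[-k:]      # indices of the k largest |gamma_j|
        beta_new[top] = gamma[top]
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```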

Theorem 3. Let $\{\beta^{(t)}\}$ be the sequence defined by (7). Assume that $u \ge 2\lambda_{\max}(X^\top X)$, where $\lambda_{\max}(X^\top X)$ denotes the largest eigenvalue of $X^\top X$. Then

  • (a)

    $\ell\{\beta^{(t)}\} \ge \ell\{\beta^{(t+1)}\}$ for every integer $t$.

  • (b)

    Assume that the parameter space $\mathcal{B}$ is a compact subset of $\mathbb{R}^{p_n}$ and contains the true parameter value $\beta_0$. If $X_s^\top X_s$ is positive definite for all submodels $s$ such that $\|s\|_0 \le k$, then $\{\beta^{(t)}\}$ converges to a local solution of (6).

Let $s_0 = \{j: \beta_{0j} \ne 0\}$ denote the true model and $\hat s_{\mathrm{sLSE}} = \{j: \hat\beta_{\mathrm{sLSE},j} \ne 0\}$ the screened model obtained from the sparsity-restricted least squares estimate $\hat\beta_{\mathrm{sLSE}}$. The following conditions are needed to establish the sure screening property of $\hat s_{\mathrm{sLSE}}$.

  • (D1)

    There exist $\omega_1, \omega_2 > 0$ and nonnegative constants $\tau_1, \tau_2$ such that $\tau_1 + \tau_2 < 1/2$, $\min_{j\in s_0}|\beta_{0j}| \ge \omega_1n^{-\tau_1}$, and $q_n < k \le \omega_2n^{\tau_2}$.

  • (D2)

    $\ln(p_n) = O(n^{d})$ for some $d \in [0,\, 1 - 2\tau_1 - 2\tau_2)$.

  • (D3)

    There exist constants $c_1 > 0$ and $\delta_1 > 0$ such that, for sufficiently large $n$, $\lambda_{\min}(n^{-1}X_s^\top X_s) \ge c_1$ for all $\beta_s \in \{\beta_s: \|\beta_s - \beta_{s_0}\|_2 \le \delta_1\}$ and all $s \in S_{+2k} = \{s: s_0 \subset s,\ \|s\|_0 \le 2k\}$, where $\lambda_{\min}$ denotes the smallest eigenvalue of a matrix.

  • (D4)

    Let $\sigma_i^2 = \mathrm{var}(y_i)$. There exist positive constants $c_2, c_3, c_4, \sigma$ such that $|x_{ij}| \le c_2$, $|w_i^\top\beta_0| \le c_3$, $\sigma_i \le \sigma$, and, for sufficiently large $n$,

$$\max_{1\le j\le p_n}\max_{1\le i\le n}\Big\{\frac{x_{ij}^2}{\sum_{i=1}^nx_{ij}^2\sigma_i^2}\Big\} \le c_4n^{-1}.$$

Theorem 4 (Sure joint screening property). Under conditions (D1)–(D4), $\Pr(s_0 \subset \hat s_{\mathrm{sLSE}}) \to 1$ as $n \to \infty$.

Finally, let $\hat\beta_{\mathrm{sLSE\text{-}BAR}}$ denote the BAR estimator for the screened submodel

$$y = \sum_{j\in\hat s_{\mathrm{sLSE}}}x_j\beta_j + \varepsilon. \tag{8}$$

It then follows immediately from Theorems 1, 2 and 4 that, under conditions (D1)–(D4), $\hat\beta_{\mathrm{sLSE\text{-}BAR}}$ has the oracle properties for variable selection and parameter estimation and the grouping property for highly correlated covariates, provided that Model (8) satisfies conditions (C1)–(C3).
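As an illustration of the two-stage procedure, the following sketch chains the IHT-based sLSE screening step with a BAR fit on the screened submodel (8); it reuses the hypothetical `sparse_lse_iht` and `bar_estimator` sketches given earlier and is not the authors' implementation.

```python
import numpy as np

def slse_bar(X, y, k, xi_n=1.0, lam_n=1.0):
    """Sketch of the two-stage sLSE-BAR estimator.

    Stage 1: screen with the sparsity-restricted least squares estimate
    (IHT approximation) to obtain a submodel of size at most k.
    Stage 2: run BAR on the screened submodel (8) and map the fitted
    coefficients back to the full p-dimensional vector.
    """
    p = X.shape[1]
    beta_screen = sparse_lse_iht(X, y, k)          # sketch defined earlier
    keep = np.nonzero(beta_screen)[0]              # screened submodel
    beta_sub = bar_estimator(X[:, keep], y, xi_n=xi_n, lam_n=lam_n)
    beta_full = np.zeros(p)
    beta_full[keep] = beta_sub
    return beta_full
```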

4. Simulation study

4.1. Variable selection, estimation, and prediction

We first present a simulation to illustrate the performance of BAR for variable selection, parameter estimation, and prediction in comparison with some L1-based variable selection methods (Lasso, adaptive Lasso, SCAD, and MCP) and a surrogate L0 method (TLP) [40] in low-dimensional settings where $p_n < n$. Here we consider two linear models, viz.

Model 1: true parameter $\beta_0 = (2, 3, 0, 0, 4, 0, \ldots, 0)^\top$;
Model 2: true parameter $\beta_0 = (2, 3, 0, 0, 4, 0.2, 0.3, 0, 0, 0.4, 0, \ldots, 0)^\top$;

where in both models the $p$-dimensional covariate vector $X$ is generated from a multivariate Gaussian distribution with mean 0 and variance–covariance matrix $\Sigma = (r^{|i-j|})$, and the error $\varepsilon$ is Gaussian and independent of the covariates. Model 1 has sparse strong signals and Model 2 contains a mix of both strong and weak signals. Five-fold CV was used to select tuning parameters for each method. For BAR, we used ten grid points evenly spaced on the interval $[10^{-4}, 100]$ for $\xi_n$ and ten equally spaced grid points on $[1, 10]$ for $\lambda_n$. We observed from our limited experience that BAR is insensitive to the choice of $\xi_n$ and its solution path is essentially flat over a broad range of $\xi_n$, so the selection of $\lambda_n$ is more relevant. In this simulation, we used only ten grid points to reduce the computational burden; in practice, a more refined grid for $\lambda_n$ would generally be preferred. We used the R packages glmnet [18] for Lasso and adaptive Lasso, and ncvreg [5] for SCAD and MCP, in our simulations. The TLP method [40] requires specifying a resolution tuning parameter $\tau$. To illustrate the impact of $\tau$, we considered two $\tau$ values (0.15 and 0.5), although a data-driven $\tau$ would be used in practice as demonstrated in [40]. We considered a variety of scenarios by varying $n$, $p$, and $r$ for each of the above two models, but only present partial results in Table 1 due to space limitations.
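For reference, a minimal sketch of the Model 1 data-generating process is given below, assuming the AR(1)-type covariance $\Sigma_{ij} = r^{|i-j|}$ described above; the particular values of $r$, the noise scale, and the seed are illustrative only and are not taken from the paper.

```python
import numpy as np

def simulate_model1(n=100, p=10, r=0.5, sigma=1.0, seed=0):
    """Sketch of the Model 1 data-generating process in Section 4.1.

    Covariates are multivariate Gaussian with AR(1)-type covariance
    Sigma_{ij} = r**|i-j|; the error is independent Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    beta0 = np.zeros(p)
    beta0[[0, 1, 4]] = [2.0, 3.0, 4.0]             # true parameter of Model 1
    idx = np.arange(p)
    Sigma = r ** np.abs(idx[:, None] - idx[None, :])
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta0 + sigma * rng.normal(size=n)
    return X, y, beta0
```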

Table 1:

(Low-dimensional settings: n = 100, p ∈ {10, 50, 100}.) Comparison of BAR, Lasso, SCAD, MCP, Adaptive Lasso (ALasso), and TLP based on 100 Monte Carlo samples. (Misclassification = mean number of misclassified non-zeros and zeros; FP = mean number of false positives (non-zeros); FN = mean number of false negatives (zeros); TM = probability that the selected model is exactly the true model; MAB = mean absolute bias; MSPE = mean squared prediction error from five-fold CV.)

Model p Method Misclassification FP FN TM MAB MSPE
1 10 BAR 0.08 0.08 0 93% 0.26 1.02
Lasso 3.28 3.28 0 6% 0.56 1.08
SCAD 0.90 0.90 0 62% 0.34 1.03
MCP 0.72 0.72 0 68% 0.33 1.03
ALasso 0.26 0.26 0 89% 0.29 1.03
TLP(0.15) 0.54 0.54 0 63% 0.34 1.04
TLP(0.5) 0.64 0.64 0 77% 0.31 1.04
50 BAR 0.26 0.26 0 86% 0.32 1.04
Lasso 7.75 7.75 0 2% 0.95 1.18
SCAD 1.81 1.81 0 65% 0.37 1.05
MCP 0.97 0.97 0 73% 0.34 1.05
ALasso 3.99 3.99 0 40% 0.68 1.03
TLP(0.15) 0.98 0.98 0 60% 0.45 1.05
TLP(0.5) 1.51 1.51 0 77% 0.36 1.08
100 BAR 0.44 0.44 0 80% 0.35 1.02
Lasso 10.38 10.38 0 2% 1.08 1.2
SCAD 1.41 1.41 0 65% 0.31 1.03
MCP 1.04 1.04 0 66% 0.32 1.03
ALasso 1.51 1.51 0 32% 0.42 1.08
TLP(0.15) 1.31 1.31 0 55% 0.46 1.02
TLP(0.5) 2.21 2.21 0 70% 0.38 1.05
2 10 BAR 1.14 0.12 1.02 30% 0.67 1.11
Lasso 2.70 2.61 0.09 5% 0.76 1.11
SCAD 1.87 1.58 0.29 14% 0.74 1.1
MCP 1.69 1.27 0.42 19% 0.74 1.1
ALasso 1.55 1.11 0.44 18% 0.69 1.07
TLP(0.15) 1.02 0.40 0.62 33% 0.66 1.07
TLP(0.5) 2.27 1.87 0.4 9% 0.74 1.09
50 BAR 2.18 0.68 1.5 4% 0.94 1.17
Lasso 10.56 10.09 0.47 0% 1.51 1.29
SCAD 5.36 4.32 1.04 1% 1.06 1.19
MCP 3.33 2.1 1.23 4% 0.99 1.19
ALasso 8.22 7.38 0.84 0% 1.41 1.1
TLP(0.15) 2.25 1.36 1.19 7% 1.03 1.1
TLP(0.5) 6.07 5.13 0.94 3% 1.09 1.2
100 BAR 2.67 0.94 1.73 2% 1.05 1.17
Lasso 14.63 14.02 0.61 0% 1.79 1.34
SCAD 4.24 3.08 1.16 1% 1.01 1.21
MCP 3.34 1.99 1.35 3% 0.96 1.2
ALasso 4.42 2.27 2.15 1% 1.26 1.32
TLP(0.15) 3.14 1.88 1.26 2% 1.03 1.08
TLP(0.5) 7.67 6.71 0.96 1% 1.17 1.19

From Table 1, it appears that the BAR and TLP methods generally produce more sparse models than the L1-based methods, with BAR having the lowest number of false positives (non-zeros). When the signals are strong (Model 1), BAR demonstrates better performance in most categories, with (i) better variable selection performance (the lowest number of misclassified non-zeros and zeros, the lowest number of false positives, the highest probability of selecting the true model (TM), and an equally low number of false negatives, close to 0); (ii) better estimation performance (the smallest estimation bias, MAB); and (iii) equally good prediction performance (similar mean squared prediction error, MSPE).

Similar trends are observed for Model 2, in which there are both strong and weak signals. BAR still produces the sparsest model with the lowest false positive rate and the lowest number of misclassifications, but it tends to miss the small effects, yielding more false negatives. For estimation, BAR has comparable or smaller bias, but no method dominates the others across all scenarios. For prediction, all methods had comparable performance with similar MSPE.

Because some signals are weak and the sample size n = 100 is relatively small, it is challenging for all methods to select exactly the true model (TM close to zero), but for different reasons: BAR tends to miss some small signals, yielding more false negatives (zeros), whereas the other methods tend to include some zeros as signals, yielding more false positives. Moreover, as a surrogate L0 method, TLP generally produces a more sparse model than an L1-based method. In principle, the TLP method with a smaller τ value gives a closer approximation to L0 penalization and thus leads to a more sparse model, as shown in Table A.1. In practice, however, τ cannot be too small since the TLP estimate obtained from Algorithm 1 in [40] would be no more sparse than its initial Lasso estimator.

In summary, BAR shows an advantage of more accurately selecting and estimating strong signals. On the other hand, if weak signals are also of interest and the sample size is relatively small, then an L1-based method with smaller false negatives might be preferred.

4.2. Grouping effect

We now present a simulation to illustrate the grouping effect of BAR for highly correlated covariates. Assume $y = X\beta_0 + \varepsilon$ with $p_n = 9$ and $\beta_0 = (2, 2, 2, 0.4, 0.4, 0.4, 0, 0, 0)^\top$, where $\varepsilon \sim \mathcal{N}(0, 0.2^2I_n)$ and

$$x_i = z_1 + e_i,\ z_1 \sim \mathcal{N}(0, 30^2I_n),\ \ i \in \{1,2,3\};\qquad x_i = z_2 + e_i,\ z_2 \sim \mathcal{N}(0, 30^2I_n),\ \ i \in \{4,5,6\};\qquad x_i \sim \mathcal{N}(0, 30^2I_n),\ \ i \in \{7,8,9\},$$

where the $e_i$ are independent $\mathcal{N}(0, 0.1^2I_n)$ vectors. In this setting, $x_1$, $x_2$, and $x_3$ form a group with latent factor $z_1$, and $x_4$, $x_5$, and $x_6$ form another group with latent factor $z_2$. The within-group correlations are almost 1 and the between-group correlations are almost 0; a sketch of this design is given below. Figure 2 plots the solution paths of BAR, Elastic Net, Lasso, Adaptive Lasso, SCAD, and MCP for a random sample of size $n = 200$. The estimated coefficients and cross-validation errors, together with the corresponding optimal tuning parameters, are provided in Table 2.
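A minimal sketch of this grouped-covariate design follows; the function name and seed are ours, and the variance scales are taken from the description above.

```python
import numpy as np

def simulate_grouped(n=200, seed=0):
    """Sketch of the grouped-covariate design of Section 4.2.

    Columns 1-3 share the latent factor z1, columns 4-6 share z2, and
    columns 7-9 are independent; within-group correlations are near 1.
    """
    rng = np.random.default_rng(seed)
    z1 = rng.normal(scale=30.0, size=n)
    z2 = rng.normal(scale=30.0, size=n)
    X = np.empty((n, 9))
    for j in range(3):
        X[:, j] = z1 + rng.normal(scale=0.1, size=n)      # group 1
        X[:, j + 3] = z2 + rng.normal(scale=0.1, size=n)  # group 2
        X[:, j + 6] = rng.normal(scale=30.0, size=n)      # ungrouped noise covariates
    beta0 = np.array([2, 2, 2, 0.4, 0.4, 0.4, 0, 0, 0], dtype=float)
    y = X @ beta0 + rng.normal(scale=0.2, size=n)
    return X, y, beta0
```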

Figure 2: Path plots of BAR, Elastic Net, Lasso, SCAD, Adaptive Lasso, and MCP for a linear model with two groups of highly correlated covariates, with true coefficients $\beta_j = 2$ for $j \in \{1, 2, 3\}$, $\beta_j = 0.4$ for $j \in \{4, 5, 6\}$, and $\beta_j = 0$ for $j \ge 7$, as described in Section 4.2. For BAR, we fixed $\xi_n = 100$; for Elastic Net, the fixed tuning parameter is $\lambda_2 = 0.001$; SCAD uses $\gamma = 3.7$, Adaptive Lasso uses $\xi = 18.68$, and MCP uses $\gamma = 3$. The dashed vertical lines mark the optimal tuning parameters obtained by five-fold CV (lowest CV error); the corresponding estimated coefficients are provided in Table 2.

Table 2:

(Grouping effects) Estimated coefficients of BAR, Elastic Net, Lasso, SCAD, Adaptive Lasso, and MCP for a linear model with optimal tuning parameters selected by five-fold CV. The full solution paths are depicted in Figure 2.

Parameter BAR Elastic Net Lasso SCAD Adaptive Lasso MCP
(Intercept) −0.008 −0.004 −0.005 0.031 −0.009 0.029
β1 1.902 2.000 2.137 6.001 1.585 6.001
β2 2.016 1.998 1.441 1.908
β3 2.081 1.999 2.421 2.506
β4 0.499 0.406 0.894 1.200 1.153 1.200
β5 0.469 0.401 0.292 0.047
β6 0.233 0.392 0.013
β7 −0.001
β8
β9
Tuning ξn = 100 λ1 = 0.017 λ = 0.017 γ = 3.7, ξ = 18.68 γ = 3,
parameters λn = 0.043 λ2 = 0.001 λ = 0.044 λ = 0.001 λ = 10.861
CV Error 0.038 0.045 0.331 0.048 0.043 0.330

It is seen from Figure 2 and Table 2 that both BAR and Elastic Net selected all three variables in each of the two non-zero groups with almost evenly distributed coefficients within each group, whereas Lasso, adaptive Lasso, SCAD, and MCP failed to reveal any grouping information with disorganized paths and erratic coefficient estimates.

4.3. High-dimensional settings (p ≫ n)

We finally present a simulation to illustrate the performance of BAR in comparison with some other variable selection methods when combined with the sparsity-restricted least squares estimate-based screening method in a high-dimensional setting. Specifically, we consider Models 1 and 2 of Section 4.1, but increase p to 2000. The simulation results for the two-stage estimators with n = 100 are presented in Table 3.

Table 3:

(High-dimensional settings: n = 100, p = 2000.) Comparison of BAR, Lasso, SCAD, MCP, Adaptive Lasso (ALasso), and TLP following the sparsity-restricted least squares estimate (sLSE). (Misclassification = mean number of misclassified non-zeros and zeros; FP = mean number of false positives (non-zeros); FN = mean number of false negatives (zeros); TM = probability that the selected model is exactly the true model; MAB = mean absolute bias; MSPE = mean squared prediction error from five-fold CV.)

Model Method Misclassification FP FN TM MAB MSPE
1 sLSE-BAR 0.32 0.18 0.14 75.72% 0.55 1.04
sLSE-Lasso 5.10 4.96 0.14 3.8% 0.97 1.13
sLSE-SCAD 0.94 0.80 0.14 64.60% 0.57 1.05
sLSE-MCP 0.67 0.53 0.14 68.32% 0.57 1.05
sLSE-ALasso 0.92 0.78 0.14 63.88% 0.61 1.05
sLSE-TLP(0.15) 0.74 0.60 0.14 56.56% 0.64 1.06
sLSE-TLP(0.5) 0.89 0.75 0.14 67.72% 0.57 1.06
2 sLSE-BAR 0.30 0.18 0.12 0% 1.45 1.04
sLSE-Lasso 5.09 4.98 0.11 0% 1.87 1.13
sLSE-SCAD 0.95 0.83 0.12 0% 1.47 1.05
sLSE-MCP 0.66 0.54 0.12 0% 1.47 1.05
sLSE-ALasso 0.93 0.81 0.12 0% 1.51 1.05
sLSE-TLP(0.15) 0.72 0.60 0.12 0% 1.53 1.06
sLSE-TLP(0.5) 0.91 0.79 0.12 0% 1.47 1.07

It is seen that for both Models 1 and 2, sLSE-BAR outperformed other methods for variable selection and parameter estimation with a lower number of misclassifications, fewer false positives, and a smaller estimation bias (MAB). Consistent with the low-dimensional settings, BAR shows similar prediction performance to the other methods. Again it is challenging for all methods to select exactly the true model under Model 2 when there are both strong and weak signals, but our limited simulations (not presented here) showed that TM (the probability of selecting the correct model) will improve, especially for BAR, as n increases.

5. Data examples

5.1. Prostate cancer data

As an illustration, we analyze the prostate cancer data described in [42]. There are eight clinical predictors, namely ln(cancer volume) (lcavol), ln(prostate weight) (lweight), age, ln(amount of benign prostatic hyperplasia) (lbph), seminal vesicle invasion (svi), ln(capsular penetration) (lcp), Gleason score (gleason), and percentage of Gleason scores 4 or 5 (pgg45). We consider the problem of selecting important predictors for the logarithm of prostate-specific antigen (lpsa). We randomly split the data set into a training set of size 67 and a test set of size 30, and compare six variable selection methods (BAR, Lasso, adaptive Lasso, SCAD, MCP, and Elastic Net) by selecting nonzero features on the training set and computing the test error as the mean squared prediction error on the test set. We performed an analysis with p = 36 features consisting of all the main effects and two-way interactions. The five-fold CV method was used to select the optimal ξn and λn values for BAR, with 100 evenly spaced grid points on the interval [10−4, 100] for ξn and 100 equally spaced grid points on [1, 100] for λn. The estimated coefficients are given in Table 4, together with a summary of the number of selected features and test errors for each method.

Table 4:

Coefficients of selected features and test mean squared prediction error for the prostate cancer data (with p = 36 main effects and two-way interactions)

Variable BAR Lasso SCAD MCP Adaptive Lasso Elastic Net
lcavol 0.573 0.560 0.637 0.627 0.594 0.264
lweight 0.020 0.056
Age −0.028 −0.073 −0.149 −0.077 −0.053
lbph 0.070
svi 0.125 0.422 0.440 0.141 0.090
lcp 0.001
gleason 0.013
(lcavol)×(lweight) 0.068
(lcavol)×(age) 0.036
(lcavol)×(svi) 0.341 0.216 0.233 0.150
(lcavol)×(gleason) 0.196
(lweight)×(gleason) 0.315 0.238 0.243 0.305 0.259 0.171
(age)×(lbph) 0.184 0.277 0.282 0.194 0.098
(age)×(svi) 0.045
(lbph)×(gleason) 0.021
(svi)×(gleason) 0.030 0.090
(lcp)×(pgg45) −0.019 −0.067
Tuning ξn = 44.45 λ = 0.08 γ = 3.7 γ = 3 ξ = 1.16 λ1 = 0.06
Parameters λn = 0.1 λ = 0.08 λ = 0.07 λ = 0.21 λ2 = 0.1
Number of Selected Features 3 7 6 6 7 16
Test Error 0.461 0.484 0.587 0.583 0.513 0.498

It is seen that BAR produces the sparsest model with three nonzero features as compared with at least six nonzero features by the other methods. Its prediction performance is comparable to other methods with a slightly lower test error. This is consistent with the simulation results in Section 4.1.

5.2. Gut microbiota data

In this example, we illustrate our method on a metagenomic data set on gut microbiota that was downloaded from dbGaP under study ID phs000258 [53]. After aligning the 16S rRNA sequences of gut microbiota to reference sequence and taxonomy databases, we had a total of 240 taxa at the genus level. Here we are interested in identifying BMI-associated taxa. Because taxa with small read counts are not reliable and are subject to noise and measurement error, we first dropped the taxa with average reads below 2, resulting in 62 features. For each taxon, we created a dummy variable 1(value of taxon > 25th percentile of the taxon). We used 304 patients after removing missing values. As in the previous example, we randomly split the data set into a training set of size 228 to select nonzero features and a test set of size 76 to compute the mean squared prediction error (MSPE); a sketch of this preprocessing is given below. Table 5 gives the selected taxa, together with a summary of the number of selected features and test errors for each of six methods: BAR, Lasso, adaptive Lasso, SCAD, MCP, and Elastic Net.
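A minimal pandas sketch of the preprocessing just described is given below; the data-frame names `counts` and `bmi` and the helper function are hypothetical, not objects from the dbGaP study, and the train/test split fraction simply reproduces the 228/76 split mentioned above.

```python
import numpy as np
import pandas as pd

def preprocess_taxa(counts, bmi, seed=0):
    """Sketch of the preprocessing described in Section 5.2.

    `counts` is a hypothetical (samples x taxa) data frame of genus-level
    reads and `bmi` the matching response series; both names are ours.
    """
    kept = counts.loc[:, counts.mean(axis=0) >= 2]          # drop low-read taxa
    # Dummy variable: 1 if the taxon's value exceeds its 25th percentile.
    X = (kept > kept.quantile(0.25, axis=0)).astype(float)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.75 * len(X))                             # roughly 228 of 304
    train, test = idx[:n_train], idx[n_train:]
    return X.iloc[train], bmi.iloc[train], X.iloc[test], bmi.iloc[test]
```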

Table 5:

Coefficients of selected BMI-related features and test mean squared prediction error for the gut microbiota metagenomic data

Taxon BAR Lasso SCAD MCP Adaptive Lasso Elastic Net
Acetanaerobacterium −0.064
Acidaminococcus −0.014
Akkermansia 0.233
Anaerotruncus −0.105
Asteroleplasma −0.058
Bacteroides −0.110
Barnesiella 0.195
Butyricicoccus −0.046
Butyrivibrio 0.205
Catenibacterium 0.198
Clostridium −0.259
Collinsella 1.066 0.669 0.689 0.867 0.908 0.708
Coprobacillus −0.165 −0.133 −0.064 −0.216 −0.295
Coprococcus 0.014
Desulfovibrio −0.082 −0.049 −0.122 −0.308
Escherichia Shigella 0.081
Eubacterium 0.021
Faecalibacterium −0.610 −0.389 −0.352 −0.352 −0.584 −0.591
Hydrogenoanaerobacterium 0.099 0.044 0.223 0.409
Lachnobacterium 0.294
Lactobacillus 0.115
Megasphaera 0.761 0.458 0.410 0.470 0.716 0.639
Odoribacter −0.024
Parabacteroides −0.043
Prevotella 0.335
Rikenella −0.065
Robinsoniella −0.113 −0.088 −0.050 −0.124 −0.344
Sporobacter −0.102
Streptococcus 0.050
Subdoligranulum −0.018
Succiniclasticum 0.119 0.080 0.211 0.344
Succinivibrio 0.238
Sutterella −0.001 −0.224
Treponema −0.115
Turicibacter −0.996 −0.622 −0.615 −0.807 −0.846 −0.611
Xylanibacter 0.160
Tuning ξn = 77.78 λ = 0.54 γ = 3.7 γ=3 ξ = 14.19 λ1 = 0.19
Parameters λn = 38 λ = 0.57 λ = 0.66 λ = 3.67 λ2 = 0.26
Number of Selected Taxa 4 10 9 6 9 36
Test Error 28.630 28.189 28.311 28.230 27.926 27.062

Consistent with the simulation results in Section 4, BAR yields the sparsest model (four nonzero features as compared to at least six nonzero features by the other methods) with similar test errors to the other methods. Among the four taxa identified by BAR, Faecalibacterium is well known to be associated with obesity. For example, some recent studies showed that Faecalibacterium increases significantly in lean people and has a negative association with BMI [2, 23]. The other three taxa (Collinsella, Megasphaera, and Turicibacter) have also been shown to be correlated with obesity and BMI in animal models and clinical trials [3, 16, 26].

6. Discussion

We have established an oracle property and a grouping property for the BAR estimator, a sparse regression estimator resulting from L0-based iteratively reweighted L2-penalized regressions using a ridge estimator as the initial value. Hence, BAR enjoys the best of both L0- and L2-penalized regressions while avoiding their pitfalls. For example, BAR shares the consistency properties of L0-penalized regression for variable selection and parameter estimation, avoids its instability, and is computationally more scalable to high-dimensional settings. BAR can also be viewed as a sparse version of the ridge estimator, obtained by shrinking the small ridge coefficients to zero via an approximate L0-penalized regression, and consequently inherits the stability and grouping effect of the ridge estimator. In addition to the low-dimensional setting (p < n), we have extended the BAR method to high- or ultrahigh-dimensional settings, where p is larger and potentially much larger than n, by combining it with a sparsity-restricted least squares estimate, and provided conditions under which the two-stage estimator possesses the oracle property and grouping property.

We have conducted extensive simulations to study the operating characteristics of the L0-based BAR method in comparison with some popular L1-based variable selection methods. It was observed that, as an L0-based method, BAR tends to produce a more sparse model with a lower false positive rate and comparable prediction performance, and generally selects and estimates strong signals more accurately, with a lower misclassification rate and a smaller estimation bias. On the other hand, it is more likely to shrink weak signals to zero when the sample size is not very large. Therefore, in practice, one may choose between an L1-based method and an L0-based method such as BAR based on whether or not weak signals are of interest.

Our theoretical developments for the asymptotic properties of the L0-based BAR estimator can be easily modified to analyze the large-sample properties of an Ld-based BAR estimator obtained by replacing $\tilde\beta_j^{-2}$ with $|\tilde\beta_j|^{d-2}$ in Eq. (3), for any $d \in [0, 1]$; a sketch of this one-line modification is given below. Our simulations (not reported in this version of the paper) suggest that there is a trade-off between the false positive rate and the false negative rate as $d$ varies. Specifically, as $d$ increases from 0 to 1, the Ld-based BAR estimator becomes less sparse, with an increased false positive rate but a decreased false negative rate.
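To illustrate how small the required change is, the sketch below implements one reweighted L2 step of such an Ld-based variant; the small perturbation `delta` is our addition for numerical safety (cf. Remark 1), and the function name and defaults are ours.

```python
import numpy as np

def ld_bar_step(X, y, beta_prev, lam_n, d=0.0, delta=1e-8):
    """One reweighted L2 step of an Ld-based BAR variant, d in [0, 1].

    The weight |beta_j|**(d-2) replaces beta_j**(-2) of Eq. (3); d = 0
    recovers the standard BAR update, d = 1 gives adaptive-Lasso-like weights.
    """
    p = X.shape[1]
    w = (np.abs(beta_prev) + delta) ** (d - 2.0)   # adaptive weights
    return np.linalg.solve(X.T @ X + lam_n * np.diag(w), X.T @ y)
```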

Finally, this paper focuses on the linear model. Extensions to more complex models and applications such as generalized linear models and survival data are currently under investigation by our team.

Acknowledgments

The authors thank the Editor-in-Chief, an Associate Editor, and two anonymous referees for their constructive and insightful comments and suggestions, which have led to significant improvements of the paper. The research of Gang Li was partly supported by National Institutes of Health Grants P30 CA-16042, UL1TR000124–02, and P50 CA211015.

Appendix A. Proofs of the theorems

To prove Theorems 1–4, we need to introduce some notation and lemmas. Write $\beta = (\alpha^\top, \gamma^\top)^\top$, where $\alpha$ is a $q_n \times 1$ vector and $\gamma$ is a $(p_n - q_n) \times 1$ vector. Analogously, write $\hat\beta^{(k)} = (\hat\alpha^{(k)\top}, \hat\gamma^{(k)\top})^\top$ and

$$g(\beta) = \{X^\top X + \lambda_nD(\beta)\}^{-1}X^\top y = \{\alpha^*(\beta)^\top, \gamma^*(\beta)^\top\}^\top. \tag{A.1}$$

For simplicity, we write $\alpha^*(\beta)$ and $\gamma^*(\beta)$ as $\alpha^*$ and $\gamma^*$ hereafter. We further partition $\Sigma_n^{-1}$ as

$$\Sigma_n^{-1} = \begin{pmatrix} A_{11} & A_{12}\\ A_{12}^\top & A_{22}\end{pmatrix},$$

where $A_{11}$ is a $q_n \times q_n$ matrix. Then, multiplying Eq. (A.1) by $(X^\top X)^{-1}\{X^\top X + \lambda_nD(\beta)\}$ gives

$$\begin{pmatrix} \alpha^* - \beta_{01}\\ \gamma^*\end{pmatrix} + \frac{\lambda_n}{n}\begin{pmatrix} A_{11}D_1(\alpha)\alpha^* + A_{12}D_2(\gamma)\gamma^*\\ A_{12}^\top D_1(\alpha)\alpha^* + A_{22}D_2(\gamma)\gamma^*\end{pmatrix} = (X^\top X)^{-1}X^\top\varepsilon, \tag{A.2}$$

where $D_1(\alpha) = \mathrm{diag}(\alpha_1^{-2},\ldots,\alpha_{q_n}^{-2})$ and $D_2(\gamma) = \mathrm{diag}(\gamma_1^{-2},\ldots,\gamma_{p_n-q_n}^{-2})$.

Lemma 1. Let $\delta_n$ be a sequence of positive real numbers satisfying $\delta_n \to \infty$ and $p_n\delta_n^2/\lambda_n \to 0$. Define $H_n = \{\beta \in \mathbb{R}^{p_n}: \|\beta - \beta_0\| \le \delta_n\sqrt{p_n/n}\}$ and $H_{n1} = \{\alpha \in \mathbb{R}^{q_n}: \|\alpha - \beta_{01}\| \le \delta_n\sqrt{p_n/n}\}$. Assume conditions (C1)–(C3) hold. Then, with probability tending to 1, we have

  • (a)

    $\sup_{\beta\in H_n}\|\gamma^*\|/\|\gamma\| < 1/C_0$ for some constant $C_0 > 1$;

  • (b)

    $g$ is a mapping from $H_n$ to itself.

Proof. We first prove part (a). Because $\lambda_n/n \to 0$ and $p_n\delta_n^2/\lambda_n \to 0$, we have $\delta_n\sqrt{p_n/n} \to 0$. By (C2), $n\mathrm{E}(\|\hat\beta^{(\mathrm{OLS})} - \beta_0\|^2) = n\mathrm{E}\{\|(X^\top X)^{-1}X^\top\varepsilon\|^2\} = \sigma^2\,\mathrm{trace}(\Sigma_n^{-1}) = O(p_n)$. Hence, $\|\hat\beta^{(\mathrm{OLS})} - \beta_0\|^2 = O_p(p_n/n)$. It then follows from (A.2) that

$$\sup_{\beta\in H_n}\big\|\gamma^* + \lambda_nA_{12}^\top D_1(\alpha)\alpha^*/n + \lambda_nA_{22}D_2(\gamma)\gamma^*/n\big\| = O_p(\sqrt{p_n/n}). \tag{A.3}$$

Note that $\|\alpha - \beta_{01}\| \le \delta_n(p_n/n)^{1/2}$ and $\|\alpha^*\| \le \|g(\beta)\| \le \|\hat\beta^{(\mathrm{OLS})}\| = O_p(a_{1n}\sqrt{p_n})$. By assumptions (C2) and (C3), we have

$$\sup_{\beta\in H_n}\big\|\lambda_nA_{12}D_1(\alpha)\alpha^*/n\big\| \le \frac{\lambda_n}{n}\,\|A_{12}\|\sup_{\beta\in H_n}\big\|D_1(\alpha)\alpha^*\big\| \le \frac{2C\lambda_n}{na_{0n}^{2}}\sup_{\beta\in H_n}\|\alpha^*\| = O_p\Big(\frac{\lambda_na_{1n}\sqrt{p_n}}{na_{0n}^{2}}\Big) = o_p(\sqrt{p_n/n}), \tag{A.4}$$

where the second inequality uses the fact that $\|A_{12}\| \le C$, which follows from $\|A_{12}\|^2 = \|A_{12}A_{12}^\top\| \le \|A_{11}^2 + A_{12}A_{21}\| \le \|\Sigma_n^{-2}\| < C^2$. Combining (A.3) and (A.4) gives

$$\sup_{\beta\in H_n}\big\|\gamma^* + \lambda_nA_{22}D_2(\gamma)\gamma^*/n\big\| = O_p(\sqrt{p_n/n}). \tag{A.5}$$

Note that $A_{22}$ is positive definite and, by the spectral decomposition, $A_{22} = \sum_{i=1}^{p_n-q_n}\tau_{2i}u_{2i}u_{2i}^\top$, where the $\tau_{2i}$ and $u_{2i}$ are the eigenvalues and eigenvectors of $A_{22}$. Then, since $1/C < \tau_{2i} < C$, we have

$$\frac{\lambda_n}{n}\big\|A_{22}D_2(\gamma)\gamma^*\big\| = \frac{\lambda_n}{n}\Big\|\sum_{i=1}^{p_n-q_n}\tau_{2i}u_{2i}u_{2i}^\top D_2(\gamma)\gamma^*\Big\| = \frac{\lambda_n}{n}\Big\{\sum_{i=1}^{p_n-q_n}\tau_{2i}^2\big|u_{2i}^\top D_2(\gamma)\gamma^*\big|^2\Big\}^{1/2} \ge \frac{\lambda_n}{n}\,\frac{1}{C}\Big\{\sum_{i=1}^{p_n-q_n}\big|u_{2i}^\top D_2(\gamma)\gamma^*\big|^2\Big\}^{1/2} = \frac{1}{C}\,\lambda_n\big\|D_2(\gamma)\gamma^*\big\|/n.$$

This, together with (A.5) and (C2), implies that with probability tending to 1,

$$\frac{1}{C}\,\lambda_n\big\|D_2(\gamma)\gamma^*\big\|/n - \|\gamma^*\| \le \delta_n\sqrt{p_n/n}. \tag{A.6}$$

Let $d_{\gamma^*/\gamma} = (\gamma_1^*/\gamma_1,\ldots,\gamma_{p_n-q_n}^*/\gamma_{p_n-q_n})^\top$. Because $\|\gamma\| \le \delta_n\sqrt{p_n/n}$, we have

$$\frac{1}{C}\frac{\lambda_n}{n}\big\|D_2(\gamma)\gamma^*\big\| = \frac{1}{C}\frac{\lambda_n}{n}\big\|\{D_2(\gamma)\}^{1/2}d_{\gamma^*/\gamma}\big\| \ge \frac{1}{C}\frac{\lambda_n}{n}\,\frac{\sqrt{n}}{\delta_n\sqrt{p_n}}\,\big\|d_{\gamma^*/\gamma}\big\| \tag{A.7}$$

and

$$\|\gamma^*\| = \big\|\{D_2(\gamma)\}^{-1/2}d_{\gamma^*/\gamma}\big\| \le \delta_n\sqrt{\frac{p_n}{n}}\,\big\|d_{\gamma^*/\gamma}\big\|. \tag{A.8}$$

Combining (A.6), (A.7) and (A.8), we have that with probability tending to 1,

$$\big\|d_{\gamma^*/\gamma}\big\| \le \frac{1}{\lambda_n/(p_n\delta_n^2C) - 1} < 1/C_0 \tag{A.9}$$

for some constant $C_0 > 1$, provided that $\lambda_n/(p_n\delta_n^2) \to \infty$. It is worth noting that $\|d_{\gamma^*/\gamma}\| \to 0$ in probability as $n \to \infty$.

Furthermore, with probability tending to 1,

$$\|\gamma^*\| \le \big\|d_{\gamma^*/\gamma}\big\|\max_{1\le j\le p_n-q_n}|\gamma_j| \le \big\|d_{\gamma^*/\gamma}\big\|\times\|\gamma\| \le \|\gamma\|/C_0.$$

This proves part (a).

Next we prove part (b). First, it is easy to see from (A.8) and (A.9) that, as n → ∞,

$$\Pr\big(\|\gamma^*\| \le \delta_n\sqrt{p_n/n}\big) \to 1. \tag{A.10}$$

Then, by (A.2), we have

$$\sup_{\beta\in H_n}\big\|\alpha^* - \beta_{01} + \lambda_nA_{11}D_1(\alpha)\alpha^*/n + \lambda_nA_{12}D_2(\gamma)\gamma^*/n\big\| = O_p(\sqrt{p_n/n}). \tag{A.11}$$

Similar to (A.4), it is easy to verify that

$$\sup_{\beta\in H_n}\big\|\lambda_nA_{11}D_1(\alpha)\alpha^*/n\big\| = o_p(\sqrt{p_n/n}). \tag{A.12}$$

Moreover, with probability tending to 1,

$$\sup_{\beta\in H_n}\big\|\lambda_nA_{12}D_2(\gamma)\gamma^*/n\big\| \le \frac{\lambda_n}{n}\sup_{\beta\in H_n}\big\|D_2(\gamma)\gamma^*\big\|\times\|A_{12}\| \le 2\sqrt{2}\,C^2\delta_n\sqrt{p_n/n}, \tag{A.13}$$

where the last step follows from (A.6), (A.10), and the fact that $\|A_{12}\| \le C$. It follows from (A.11), (A.12) and (A.13) that, with probability tending to 1,

$$\sup_{\beta\in H_n}\big\|\alpha^* - \beta_{01}\big\| \le (2\sqrt{2}\,C^2 + 1)\,\delta_n\,n^{-1/2}\sqrt{p_n}. \tag{A.14}$$

Because $\delta_n\sqrt{p_n/n} \to 0$, we have, as $n \to \infty$,

$$\Pr(\alpha^* \in H_{n1}) \to 1. \tag{A.15}$$

Combining (A.10) and (A.15) completes the proof of part (b).

Lemma 2. Assume that (C1)–(C3) hold. For any $q_n$-vector $b_n$ satisfying $\|b_n\| \le 1$, define $s_n^2 = \sigma^2b_n^\top\Sigma_{n1}^{-1}b_n$ as in Theorem 1. Define

$$f(\alpha) = \{X_1^\top X_1 + \lambda_nD_1(\alpha)\}^{-1}X_1^\top y. \tag{A.16}$$

Then, with probability tending to 1,

  • (a)

    $f(\alpha)$ is a contraction mapping from $B_n \equiv \{\alpha \in \mathbb{R}^{q_n}: \|\alpha - \beta_{01}\| \le \delta_n\sqrt{p_n/n}\}$ to itself;

  • (b)

    $\sqrt{n}\,s_n^{-1}b_n^\top(\hat\alpha - \beta_{01}) \to \mathcal{N}(0,1)$ in distribution, where $\hat\alpha$ is the unique fixed point of $f(\alpha)$ defined by

$$\hat\alpha = \{X_1^\top X_1 + \lambda_nD_1(\hat\alpha)\}^{-1}X_1^\top y.$$

Proof. We first prove part (a). Note that (A.16) can be rewritten as

$$f(\alpha) - \beta_{01} + \lambda_n\Sigma_{n1}^{-1}D_1(\alpha)f(\alpha)/n = (X_1^\top X_1)^{-1}X_1^\top\varepsilon.$$

Thus,

$$\sup_{\alpha\in B_n}\big\|f(\alpha) - \beta_{01} + \lambda_n\Sigma_{n1}^{-1}D_1(\alpha)f(\alpha)/n\big\| = O_p(\sqrt{q_n/n}). \tag{A.17}$$

Similar to (A.4), it can be shown that

$$\sup_{\alpha\in B_n}\big\|\lambda_n\Sigma_{n1}^{-1}D_1(\alpha)f(\alpha)/n\big\| = o_p(\sqrt{q_n/n}). \tag{A.18}$$

It follows from (A.17) and (A.18) that

$$\sup_{\alpha\in B_n}\big\|f(\alpha) - \beta_{01}\big\| \le \delta_n\sqrt{q_n/n},$$

where $\delta_n$ is a sequence of real numbers satisfying $\delta_n \to \infty$ and $\delta_n\sqrt{q_n/n} \to 0$. This implies that, as $n \to \infty$,

$$\Pr\{f(\alpha) \in B_n\} \to 1. \tag{A.19}$$

In other words, f is a mapping from the region Bn to itself.

Rewriting (A.16) as $\{X_1^\top X_1 + \lambda_nD_1(\alpha)\}f(\alpha) = X_1^\top y$ and differentiating with respect to $\alpha$, we have

$$\{\Sigma_{n1} + \lambda_nD_1(\alpha)/n\}\dot f(\alpha) + \frac{\lambda_n}{n}\times\mathrm{diag}\{-2f_j(\alpha)/\alpha_j^3\} = 0,$$

where $\dot f(\alpha) = \partial f(\alpha)/\partial\alpha^\top$ and $\mathrm{diag}\{-2f_j(\alpha)/\alpha_j^3\} = \mathrm{diag}\{-2f_1(\alpha)/\alpha_1^3,\ldots,-2f_{q_n}(\alpha)/\alpha_{q_n}^3\}$. This, together with the assumption that $\lambda_n/n \to 0$, implies that

$$\sup_{\alpha\in B_n}\big\|\{\Sigma_{n1} + \lambda_nD_1(\alpha)/n\}\dot f(\alpha)\big\| = \sup_{\alpha\in B_n}\Big\|\frac{2\lambda_n}{n}\,\mathrm{diag}\{f_j(\alpha)/\alpha_j^3\}\Big\| = o_p(1). \tag{A.20}$$

Note that $\Sigma_{n1}$ is positive definite. Write $\Sigma_{n1} = \sum_{i=1}^{q_n}\tau_{1i}u_{1i}u_{1i}^\top$, where the $\tau_{1i}$ and $u_{1i}$ are the eigenvalues and eigenvectors of $\Sigma_{n1}$. Then, by (C2), $\tau_{1i} \in (1/C, C)$ for all $i$ and

$$\big\|\Sigma_{n1}\dot f(\alpha)\big\| = \sup_{\|x\|=1,\,x\in\mathbb{R}^{q_n}}\big\|\Sigma_{n1}\dot f(\alpha)x\big\| = \sup_{\|x\|=1,\,x\in\mathbb{R}^{q_n}}\Big\|\sum_{i=1}^{q_n}\tau_{1i}u_{1i}u_{1i}^\top\dot f(\alpha)x\Big\| = \sup_{\|x\|=1,\,x\in\mathbb{R}^{q_n}}\Big\{\sum_{i=1}^{q_n}\tau_{1i}^2\big|u_{1i}^\top\dot f(\alpha)x\big|^2\Big\}^{1/2} \ge \sup_{\|x\|=1,\,x\in\mathbb{R}^{q_n}}\frac{1}{C}\Big\{\sum_{i=1}^{q_n}\big|u_{1i}^\top\dot f(\alpha)x\big|^2\Big\}^{1/2} = \sup_{\|x\|=1,\,x\in\mathbb{R}^{q_n}}\frac{1}{C}\big\|\dot f(\alpha)x\big\| = \frac{1}{C}\big\|\dot f(\alpha)\big\|. \tag{A.21}$$

Therefore, it follows from $\alpha \in B_n$, (A.21) and (C2) that

$$\big\|\{\Sigma_{n1} + \lambda_nD_1(\alpha)/n\}\dot f(\alpha)\big\| \ge \big\|\Sigma_{n1}\dot f(\alpha)\big\| - \lambda_n\big\|D_1(\alpha)\dot f(\alpha)\big\|/n \ge \frac{1}{C}\big\|\dot f(\alpha)\big\| - \frac{\lambda_n}{na_{0n}^2}\big\|\dot f(\alpha)\big\|.$$

This, together with (A.20) and (C2), implies that

$$\sup_{\alpha\in B_n}\big\|\dot f(\alpha)\big\| = o_p(1). \tag{A.22}$$

Finally, the conclusion in part (a) follows from (A.19) and (A.22).

Next we prove part (b). Write

$$\sqrt{n}\,s_n^{-1}b_n^\top(\hat\alpha - \beta_{01}) = \sqrt{n}\,s_n^{-1}b_n^\top\big[\{\Sigma_{n1} + \lambda_nD_1(\hat\alpha)/n\}^{-1}\Sigma_{n1} - I_{q_n}\big]\beta_{01} + n^{-1/2}s_n^{-1}b_n^\top\{\Sigma_{n1} + \lambda_nD_1(\hat\alpha)/n\}^{-1}X_1^\top\varepsilon \equiv I_1 + I_2. \tag{A.23}$$

By the first order resolvent expansion formula, viz. (H + Δ)−1 = H−1H−1 Δ (H + Δ)−1, the first term on the right-hand side of Eq. (A.23) can be written as

$$I_1 = -\sqrt{n}\,s_n^{-1}b_n^\top\Sigma_{n1}^{-1}\frac{\lambda_n}{n}D_1(\hat\alpha)\{\Sigma_{n1} + \lambda_nD_1(\hat\alpha)/n\}^{-1}\Sigma_{n1}\beta_{01}.$$

Hence, by assumptions (C2) and (C3), we have

$$|I_1| \le \frac{\lambda_n}{\sqrt{n}}\,s_n^{-1}a_{0n}^{-2}\big\|\Sigma_{n1}^{-1}\big\|\,\|\beta_{01}\| = O_p\Big(\frac{\lambda_na_{1n}}{a_{0n}^{2}}\sqrt{\frac{q_n}{n}}\Big) \to 0. \tag{A.24}$$

Furthermore, writing $X_1^\top = (\tilde w_1,\ldots,\tilde w_n)$ and applying the first-order resolvent expansion formula, it can be shown that

$$I_2 = \frac{s_n^{-1}}{\sqrt{n}}\sum_{i=1}^nb_n^\top\Sigma_{n1}^{-1}\tilde w_i\varepsilon_i + o_p(1), \tag{A.25}$$

which converges in distribution to N(0, 1) by the Lindeberg–Feller Central Limit Theorem. Finally, combining (A.23), (A.24), and (A.25), we can conclude the proof of part (b).

Proof of Theorem 1. Note that for the initial ridge estimator β^(0) defined by (1), we have

$$\big\|\hat\beta^{(0)} - \beta_0\big\| \le \big\|\{(\Sigma_n + \xi_nI_{p_n}/n)^{-1}\Sigma_n - I_{p_n}\}\beta_0\big\| + \big\|(\Sigma_n + \xi_nI_{p_n}/n)^{-1}X^\top\varepsilon/n\big\| \equiv T_1 + T_2.$$

By the first-order resolvent expansion formula and the fact that $\xi_n/n \to 0$,

$$T_1 = \big\|\Sigma_n^{-1}\xi_n(\Sigma_n + \xi_nI_{p_n}/n)^{-1}\Sigma_n\beta_0/n\big\| \le C^3\xi_na_{1n}\sqrt{p_n}/n = o_p(\sqrt{p_n/n}).$$

It is easy to see that $T_2 = O_p(\sqrt{p_n/n})$. Thus $\|\hat\beta^{(0)} - \beta_0\| = O_p\{(p_n/n)^{1/2}\}$. This, combined with part (a) of Lemma 1, implies that

$$\Pr\Big(\lim_{k\to\infty}\hat\gamma^{(k)} = 0\Big) \to 1. \tag{A.26}$$

Hence, to prove part (i) of Theorem 1, it is sufficient to show that

$$\Pr\Big(\lim_{k\to\infty}\big\|\hat\alpha^{(k)} - \hat\alpha\big\| = 0\Big) \to 1, \tag{A.27}$$

where $\hat\alpha$ is the fixed point of $f(\alpha)$ defined in part (b) of Lemma 2.

Define $\gamma^* = 0$ if $\gamma = 0$. It is easy to see from (A.2) that, for any $\alpha \in B_n$,

$$\lim_{\gamma\to0}\gamma^*(\alpha,\gamma) = 0. \tag{A.28}$$

Combining (A.28) with the fact

$$\begin{pmatrix} X_1^\top X_1 + \lambda_nD_1(\alpha) & X_1^\top X_2\\ X_2^\top X_1 & X_2^\top X_2 + \lambda_nD_2(\gamma)\end{pmatrix}\begin{pmatrix}\alpha^*\\ \gamma^*\end{pmatrix} = \begin{pmatrix}X_1^\top y\\ X_2^\top y\end{pmatrix},$$

we find that, for any $\alpha \in B_n$,

$$\lim_{\gamma\to0}\alpha^*(\alpha,\gamma) = \{X_1^\top X_1 + \lambda_nD_1(\alpha)\}^{-1}X_1^\top y = f(\alpha). \tag{A.29}$$

Therefore, $g$ is continuous and thus uniformly continuous on the compact set $H_n$. This, together with (A.26) and (A.29), implies that, as $k \to \infty$,

$$\eta_k \equiv \sup_{\alpha\in B_n}\big\|f(\alpha) - \alpha^*(\alpha,\hat\gamma^{(k)})\big\| \to 0 \tag{A.30}$$

with probability tending to 1. Note that

$$\big\|\hat\alpha^{(k+1)} - \hat\alpha\big\| = \big\|\alpha^*(\hat\beta^{(k)}) - \hat\alpha\big\| \le \big\|\alpha^*(\hat\beta^{(k)}) - f(\hat\alpha^{(k)})\big\| + \big\|f(\hat\alpha^{(k)}) - \hat\alpha\big\| \le \eta_k + \frac{1}{C}\big\|\hat\alpha^{(k)} - \hat\alpha\big\|,$$

where the last step follows from $\|f(\hat\alpha^{(k)}) - \hat\alpha\| = \|f(\hat\alpha^{(k)}) - f(\hat\alpha)\| \le \|\hat\alpha^{(k)} - \hat\alpha\|/C$. Let $a_k = \|\hat\alpha^{(k)} - \hat\alpha\|$ for every integer $k \ge 0$. From (A.30), we can deduce that, with probability tending to 1, for any $\epsilon > 0$ there exists a positive integer $N$ such that, for every integer $k > N$, $|\eta_k| < \epsilon$ and

$$a_{k+1} \le \frac{a_{k-1}}{C^2} + \frac{\eta_{k-1}}{C} + \eta_k \le \cdots \le \frac{a_1}{C^k} + \frac{\eta_1}{C^{k-1}} + \cdots + \frac{\eta_N}{C^{k-N}} + \Big(\frac{\eta_{N+1}}{C^{k-N-1}} + \cdots + \frac{\eta_{k-1}}{C} + \eta_k\Big) \le (a_1 + \eta_1 + \cdots + \eta_N)\frac{1}{C^{k-N}} + \epsilon\,\frac{1 - (1/C)^{k-N}}{1 - 1/C},$$

and the right-hand side can be made arbitrarily small by letting $k \to \infty$ and then $\epsilon \to 0$. This proves (A.27).

Therefore, it follows immediately from (A.26) and (A.27) that, with probability tending to 1, $\lim_{k\to\infty}\hat\beta^{(k)} = \lim_{k\to\infty}(\hat\alpha^{(k)\top}, \hat\gamma^{(k)\top})^\top = (\hat\alpha^\top, 0)^\top$, which completes the proof of part (i). This, in addition to part (b) of Lemma 2, proves part (ii) of Theorem 1.

Proof of Theorem 2. Recall that $\hat\beta^* = \lim_{k\to\infty}\hat\beta^{(k+1)}$ and $\hat\beta^{(k+1)} = \arg\min_\beta Q(\beta\mid\hat\beta^{(k)})$, where

$$Q(\beta\mid\hat\beta^{(k)}) = \|y - X\beta\|^2 + \lambda_n\sum_{l=1}^{p_n}\beta_l^2/\{\hat\beta_l^{(k)}\}^2.$$

If $\beta_\ell \ne 0$ for $\ell \in \{i, j\}$, then $\hat\beta^{(k+1)}$ must satisfy the following normal equations for $\ell \in \{i, j\}$:

$$-2x_\ell^\top\{y - X\hat\beta^{(k+1)}\} + 2\lambda_n\hat\beta_\ell^{(k+1)}/\{\hat\beta_\ell^{(k)}\}^2 = 0.$$

Thus, for ℓ ∈ {i, j},

$$\hat\beta_\ell^{(k+1)}/\{\hat\beta_\ell^{(k)}\}^2 = x_\ell^\top\hat\varepsilon^{(k+1)}/\lambda_n, \tag{A.31}$$

where $\hat\varepsilon^{(k+1)} = y - X\hat\beta^{(k+1)}$. Moreover, because

$$\big\|\hat\varepsilon^{(k+1)}\big\|^2 + \lambda_n\sum_{l=1}^{p_n}\frac{\{\hat\beta_l^{(k+1)}\}^2}{\{\hat\beta_l^{(k)}\}^2} = Q(\hat\beta^{(k+1)}\mid\hat\beta^{(k)}) \le Q(0\mid\hat\beta^{(k)}) = \|y\|^2,$$

we have

$$\big\|\hat\varepsilon^{(k+1)}\big\| \le \|y\|. \tag{A.32}$$

Letting $k \to \infty$ in (A.31) and (A.32), we obtain $\|\hat\varepsilon^*\| \le \|y\|$ and, for $\ell \in \{i, j\}$, $\hat\beta_\ell^{*-1} = x_\ell^\top\hat\varepsilon^*/\lambda_n$, where $\hat\varepsilon^* = y - X\hat\beta^*$.

Therefore,

$$\big|\hat\beta_i^{*-1} - \hat\beta_j^{*-1}\big| \le \frac{1}{\lambda_n}\|y\|\times\|x_i - x_j\| \le \frac{1}{\lambda_n}\|y\|\sqrt{2(1-\rho_{ij})},$$

and the proof is complete.

Proof of Theorem 3. (a) It is easy to check that

$$\ell\{\beta^{(t)}\} = h\{\beta^{(t)},\beta^{(t)}\} \ge h\{\beta^{(t+1)},\beta^{(t)}\} = \ell\{\beta^{(t)}\} + \{\beta^{(t+1)} - \beta^{(t)}\}^\top S\{\beta^{(t)}\} + \frac{u}{2}\big\|\beta^{(t+1)} - \beta^{(t)}\big\|_2^2 = \ell\{\beta^{(t+1)}\} + \frac{u}{2}\big\|\beta^{(t+1)} - \beta^{(t)}\big\|_2^2 + \big[-\ell\{\beta^{(t+1)}\} + \ell\{\beta^{(t)}\} + \{\beta^{(t+1)} - \beta^{(t)}\}^\top S\{\beta^{(t)}\}\big] = \ell\{\beta^{(t+1)}\} + \frac{u}{2}\big\|\beta^{(t+1)} - \beta^{(t)}\big\|_2^2 - \big\|X(\beta^{(t+1)} - \beta^{(t)})\big\|_2^2 \ge \ell\{\beta^{(t+1)}\} + \frac{u}{2}\big\|\beta^{(t+1)} - \beta^{(t)}\big\|_2^2 - \lambda_{\max}(X^\top X)\big\|\beta^{(t+1)} - \beta^{(t)}\big\|_2^2 \ge \ell\{\beta^{(t+1)}\}, \tag{A.33}$$

where the last step follows from the assumption that $u \ge 2\lambda_{\max}(X^\top X)$.

(b) The proof of the convergence of $\{\beta^{(t)}\}$ to a local solution of (6) essentially mimics the proof of Theorem 1 of Xu and Chen [46] and is thus omitted here.

Proof of Theorem 4. Define $S_{+k} = \{s: s_0 \subset s,\ \|s\|_0 \le k\}$ and $S_{-k} = \{s: s_0 \not\subset s,\ \|s\|_0 \le k\}$ to be the collections of over-fitted and under-fitted models, respectively. Let $\hat\beta_s$ be the estimate of $\beta$ based on the model $s$. Clearly, the result holds if $\Pr(\hat s_{\mathrm{sLSE}} \in S_{+k}) \to 1$. Thus, it suffices to show that, as $n \to \infty$,

$$\Pr\Big\{\min_{s\in S_{-k}}\ell(\hat\beta_s) \le \max_{s\in S_{+k}}\ell(\hat\beta_s)\Big\} \to 0.$$

For any $s \in S_{-k}$, let $s' = s \cup s_0 \in S_{+2k}$. Consider $\beta_{s'}$ with $\|\beta_{s'} - \beta_{s_0}\|_2 = \omega_1n^{-\tau_1}$; by a Taylor expansion and conditions (D1) and (D3), we have

$$\ell(\beta_{s_0}) - \ell(\beta_{s'}) = (\beta_{s_0} - \beta_{s'})^\top S(\beta_{s_0}) - \tfrac{1}{2}(\beta_{s'} - \beta_{s_0})^\top\{2X_{s'}^\top X_{s'}\}(\beta_{s'} - \beta_{s_0}) \le (\beta_{s_0} - \beta_{s'})^\top S(\beta_{s_0}) - c_1n\|\beta_{s'} - \beta_{s_0}\|_2^2 \le \omega_1n^{-\tau_1}\big\|S(\beta_{s_0})\big\|_2 - c_1\omega_1^2n^{1-2\tau_1}.$$

Thus, upon setting $c = \omega_2^{-1/2}c_1\omega_1/2$, we have

$$\Pr\{\ell(\beta_{s_0}) - \ell(\beta_{s'}) \ge 0\} \le \Pr\big\{\|S(\beta_{s_0})\|_2 \ge c_1\omega_1n^{1-\tau_1}\big\} \le \sum_{j\in s'}\Pr\Big\{S_j^2(\beta_{s_0}) \ge \frac{1}{2k}(c_1\omega_1)^2n^{2-2\tau_1}\Big\} = \sum_{j\in s'}\Pr\big\{|S_j(\beta_{s_0})| \ge (2k)^{-1/2}c_1\omega_1n^{1-\tau_1}\big\} \le \sum_{j\in s'}\Pr\big\{|S_j(\beta_{s_0})| \ge cn^{1-\tau_1-\tau_2/2}\big\}. \tag{A.34}$$

Let $t_{ni} = x_{ij}\big(\sum_{i=1}^nx_{ij}^2\sigma_i^2\big)^{-1/2}$, so that $\sum_{i=1}^nt_{ni}^2\sigma_i^2 = 1$. Under condition (D4), we have $\max_i(t_{ni}^2) \le c_4/n$ and $\sum_{i=1}^nx_{ij}^2\sigma_i^2 \le c_2^2n\sigma^2$ for some positive constant $\sigma$. Then,

$$\Pr\big\{S_j(\beta_{s_0}) \ge cn^{1-\tau_1-\tau_2/2}/2\big\} = \Pr\Big\{-2\sum_{i=1}^nx_{ij}(y_i - w_i^\top\beta_{s_0}) \ge cn^{1-\tau_1-\tau_2/2}/2\Big\} \le \Pr\Big\{2\Big|\sum_{i=1}^nt_{ni}(y_i - w_i^\top\beta_{s_0})\Big| \ge \frac{c}{2c_2\sigma}n^{1/2-\tau_1-\tau_2/2}\Big\} \le \exp(-c'n^{1-2\tau_1-\tau_2}),$$

where the last step uses Lemma A.1 in [46] and $c'$ is a positive constant. Similarly,

$$\Pr\big\{S_j(\beta_{s_0}) \le -cn^{1-\tau_1-\tau_2/2}\big\} \le \exp(-c'n^{1-2\tau_1-\tau_2}). \tag{A.35}$$

Combining (A.34) and (A.35), we have $\Pr\{\ell(\beta_{s_0}) - \ell(\beta_{s'}) \ge 0\} \le 4k\exp(-c'n^{1-2\tau_1-\tau_2})$. Hence,

$$\Pr\Big\{\min_{s\in S_{-k}}\ell(\beta_s) \le \ell(\beta_{s_0})\Big\} \le \sum_{s\in S_{-k}}\Pr\big\{\ell(\beta_s) \le \ell(\beta_{s_0})\big\} \le 4kp_n^k\exp(-c'n^{1-2\tau_1-\tau_2}) = 4\exp\big(\ln k + k\ln p_n - c'n^{1-2\tau_1-\tau_2}\big) \le 4\omega_2\exp\big\{\tau_2\ln n + O(n^{\tau_2+d}) - c'n^{1-2\tau_1-\tau_2}\big\} = o(1),$$

where the last two steps follow from conditions (D1)–(D2). Because $\ell(\beta_s)$ is convex in $\beta_s$, the above result holds for any $\beta_s$ such that $\|\beta_s - \beta_{s_0}\|_2 \ge \omega_1n^{-\tau_1}$.

For any $s \in S_{-k}$, let $\check\beta_s$ be $\hat\beta_s$ augmented with zeros corresponding to the elements of $s'\setminus s$. By condition (D1), we have $\|\check\beta_s - \beta_{s_0}\|_2 \ge \omega_1n^{-\tau_1}$. Thus,

$$\Pr\Big\{\min_{s\in S_{-k}}\ell(\hat\beta_s) \le \max_{s\in S_{+k}}\ell(\hat\beta_s)\Big\} \le \Pr\Big\{\min_{s\in S_{-k}}\ell(\check\beta_s) \le \ell(\beta_{s_0})\Big\} = o(1).$$

This concludes the argument.

References

  • [1].Akaike H, A new look at the statistical model identification, IEEE Trans. Automat. Contr 19 (1974) 716–723. [Google Scholar]
  • [2].Andoh A, Nishida A, Takahashi K, Inatomi O, Imaeda H, Bamba S, Kito K, Sugimoto M, Kobayashi T, Comparison of the gut microbial community between obese and lean peoples using 16s gene sequencing in a Japanese population, J. Clin. Biochem. Nutr 59 (2016) 65–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Basso N, Soricelli E, Castagneto-Gissey L, Casella G, Albanese D, Fava F, Donati C, Tuohy K, Angelini G, La Neve F, Severino A,Kamvissi-Lorenz V, Birkenfeld AL, Bornstein S, Manco M, Mingrone G, Insulin resistance, microbiota, and fat distribution changes by a new model of vertical sleeve gastrectomy in obese rats, Diabetes 65 (2016) 2990–3001. [DOI] [PubMed] [Google Scholar]
  • [4].Blumensath T, Davies ME, Iterative hard thresholding for compressed sensing, Appl. Comput. Harmon. Anal 27 (2009) 265–274. [Google Scholar]
  • [5].Breheny P, Huang J, Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection, Ann. Appl. Statist 5 (2011) 232–253. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [6].Breiman L, Heuristics of instability and stabilization in model selection, Ann. Statist 24 (1996) 2350–2383. [Google Scholar]
  • [7].Candès EJ, Wakin MB, Boyd SP, Enhancing sparsity by reweighted ℓ1 minimization, J. Fourier Anal. Appl 14 (2008) 877–905. [Google Scholar]
  • [8].Chartrand R, Yin W, Iteratively reweighted algorithms for compressive sensing, In: Proceedings of Int. Conf. on Acoustics, Speech, Signal Processing (ICASSP) (2008) 3869–3872. [Google Scholar]
  • [9].Chen J, Chen Z, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771. [Google Scholar]
  • [10].Daubechies I, DeVore R, Fornasier M, Güntürk CS, Iteratively reweighted least squares minimization for sparse recovery, Commun. Pure Appl. Math 63 (2010) 1–38. [Google Scholar]
  • [11].Dicker L, Huang B, Lin X, Variable selection and estimation with the seamless-ℓ0 penalty, Stat. Sinica 23 (2013) 929–962. [Google Scholar]
  • [12].Fan J, Feng Y, Song R, Nonparametric independence screening in sparse ultra-high-dimensional additive models, J. Amer. Statist. Assoc 106 (2011) 544–557. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties, J. Amer. Statist. Assoc 96 (2001) 1348–1360. [Google Scholar]
  • [14].Fan J, Lv J, Sure independence screening for ultrahigh dimensional feature space, J. R. Stat. Soc. Ser. B Stat. Methodol 70 (2008) 849–911. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Fan J, Xue L, Zou H, Strong oracle optimality of folded concave penalized estimation, Ann. Statist 42 (2014) 819–849. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [16].Federico A, Dallio M, Tolone S, Gravina AG, Patrone V, Romano M, Tuccillo C, Mozzillo AL, Amoroso V, Misso G, Morelli L, Docimo L, Loguercio C, Gastrointestinal hormones, intestinal microbiota and metabolic homeostasis in obese patients: Effect of bariatric surgery, In Vivo 30 (2016) 321–330. [PubMed] [Google Scholar]
  • [17].Foster DP, George EI, The risk inflation criterion for multiple regression, Ann. Statist 22 (1994) 1947–1975. [Google Scholar]
  • [18].Friedman J, Hastie T, Tibshirani RJ, Regularization paths for generalized linear models via coordinate descent, J. Stat. Softw 33 (2010) 1. [PMC free article] [PubMed] [Google Scholar]
  • [19].Frommlet F, Nuel G, An adaptive ridge procedure for L0 regularization, PloS One 11 (2016) e0148620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Gasso G, Rakotomamonjy A, Canu S, Recovering sparse signals with a certain family of nonconvex penalties and DC programming, IEEE Trans. Signal Process 57 (2009) 4686–4698. [Google Scholar]
  • [21].Gorodnitsky IF, Rao BD, Sparse signal reconstruction from limited data using focuss: A re-weighted minimum norm algorithm, IEEE Trans. Signal Process 45 (1997) 600–616. [Google Scholar]
  • [22].He X, Wang L, Hong HG, Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data, Ann. Statist 41 (2013) 342–369. [Google Scholar]
  • [23].Hippe B, Remely M, Aumueller E, Pointner A, Magnet U, Haslberger A, Faecalibacterium prausnitzii phylotypes in type two diabetic, obese, and lean control subjects, Benef. Microbes 7 (2016) 511–517. [DOI] [PubMed] [Google Scholar]
  • [24].Huang J, Horowitz JL, Ma S, Asymptotic properties of bridge estimators in sparse high-dimensional regression models, Ann. Statist 36 (2008) 587–613. [Google Scholar]
  • [25].Johnson BA, Long Q, Huang Y, Chansky K, Redman M, Log-penalized least squares, iteratively reweighted lasso, and variable selection for censored lifetime medical cost, Department of Biostatistics and Bioinformatics, Technical Report, Emory University, Atlanta, GA, 2012.
  • [26].Jung M-J, Lee J, Shin N-R, Kim M-S, Hyun D-W, Yun J-H, Kim PS, Whon TW, Bae J-W, Chronic repression of mTOR complex 2 induces changes in the gut microbiota of diet-induced obese mice, Sci. Rep 6 (2016) 30887. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Knight K, Fu W, Asymptotics for lasso-type estimators, Ann. Statist 28 (2000) 1356–1378. [Google Scholar]
  • [28].Lai P, Liu Y, Liu Z, Wan Y, Model free feature screening for ultrahigh dimensional data with responses missing at random, Comput. Statist. Data Anal 105 (2017) 201–216. [Google Scholar]
  • [29].Lawson CL, Contribution to the theory of linear least maximum approximation, Doctoral dissertation, Univ. Calif. at Los Angeles, Los Angeles, CA, 1961. [Google Scholar]
  • [30].Li G, Peng H, Zhang J, Zhu L, Robust rank correlation based screening, Ann. Statist 40 (2012) 1846–1877. [Google Scholar]
  • [31].Lin L, Sun J, Adaptive conditional feature screening, Comput. Statist. Data Anal 94 (2016) 287–301. [Google Scholar]
  • [32].Liu J, Li R, Wu R, Feature selection for varying coefficient models with ultrahigh-dimensional covariates, J. Amer. Statist. Assoc 109 (2014) 266–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Liu Z, Li G, Efficient regularized regression with L0 penalty for variable selection and network construction, Comput. Math. Methods Med 2016 (2016) ID 3456153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Loh P-L, Wainwright MJ, Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima, J. Mach. Learn. Res 16 (2015) 559–616. [Google Scholar]
  • [35].Mallows CL, Some comments on Cp, Technometrics 15 (1973) 661–675. [Google Scholar]
  • [36].Meinshausen N, Bühlmann P, High-dimensional graphs and variable selection with the lasso, Ann. Statist 34 (2006) 1436–1462. [Google Scholar]
  • [37].Osborne MR, Finite algorithms in optimization and data analysis, Wiley, 1985. [Google Scholar]
  • [38].Schwarz G, Estimating the dimension of a model, Ann. Statist 6 (1978) 461–464. [Google Scholar]
  • [39].Segal MR, Dahlquist KD, Conklin BR, Regression approaches for microarray data analysis, J. Comput. Biol 10 (2003) 961–980. [DOI] [PubMed] [Google Scholar]
  • [40].Shen X, Pan W, Zhu Y, Likelihood-based selection and sharp parameter estimation, J. Amer. Statist. Assoc 107 (2012) 223–232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [41].Shen X, Pan W, Zhu Y, Zhou H, On constrained and regularized high-dimensional regression, Ann. Inst. Statist. Math 65 (2013) 807–832. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Stamey TA, Kabalin JN, McNeal JE, Johnstone IM, Freiha F, Redwine EA, Yang N, Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II: Radical prostatectomy treated patients, J. Urol 141 (1989) 1076–1083. [DOI] [PubMed] [Google Scholar]
  • [43].Tibshirani RJ, Regression shrinkage and selection via the lasso, J. R. Stat. Soc. Series B Stat. Methodol 58 (1996) 267–288. [Google Scholar]
  • [44].Wang H, Forward regression for ultra-high dimensional variable screening, J. Amer. Statist. Assoc 104 (2009) 1512–1524. [Google Scholar]
  • [45].Wipf D, Nagarajan S, Iterative reweighted ℓ1 and ℓ2 methods for finding sparse solutions, IEEE J. Sel. Topics Signal Process 4 (2) (2010) 317–329. [Google Scholar]
  • [46].Xu C, Chen J, The sparse mle for ultrahigh-dimensional feature screening, J. Amer. Statist. Assoc 109 (2014) 1257–1269. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [47].Zhang C-H, Nearly unbiased variable selection under minimax concave penalty, Ann. Statist 38 (2010) 894–942. [Google Scholar]
  • [48].Zhao P, Yu B, On model selection consistency of lasso, J. Mach. Learn. Res 7 (2006) 2541–2563. [Google Scholar]
  • [49].Zhong W, Zhu L, Li R, Cui H, Regularized quantile regression and robust feature screening for single index models, Stat. Sinica 26 (2016) 69–95. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [50].Zhu L-P, Li L, Li R, Zhu L-X, Model-free feature screening for ultrahigh-dimensional data, J. Amer. Statist. Assoc 106 (2011) 1464–1475. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [51].Zou H, The adaptive lasso and its oracle properties, J. Amer. Statist. Assoc 101 (2006) 1418–1429. [Google Scholar]
  • [52].Zou H, Hastie T, Regularization and variable selection via the Elastic Net, J. R. Stat. Soc. Series B Stat. Methodol 67 (2005) 301–320. [Google Scholar]
  • [53].Zupancic ML, Cantarel BL, Liu Z, Drabek EF, Ryan KA, Cirimotich S, Jones C, Knight R, Walters WA, Knights D, Mongodin EF, Horenstein RB, Mitchell BD, Steinle N, Snitker S, Shuldiner AR, Fraser CM, Analysis of the gut microbiota in the old order amish and its relation to the metabolic syndrome, PloS one 7 (2012) e43052. [DOI] [PMC free article] [PubMed] [Google Scholar]
