Author manuscript; available in PMC: 2019 Oct 30.
Published in final edited form as: Appl Stoch Models Bus Ind. 2018 May 25;35(2):283–298. doi: 10.1002/asmb.2340

Weak signals in high-dimension regression: detection, estimation and prediction

Yanming Li 1, Hyokyoung G Hong 2,*, S Ejaz Ahmed 3, Yi Li 1
PMCID: PMC6821396  NIHMSID: NIHMS1015641  PMID: 31666801

Summary.

Regularization methods, including Lasso, group Lasso and SCAD, typically focus on selecting variables with strong effects while ignoring weak signals. This may result in biased prediction, especially when weak signals outnumber strong signals. This paper aims to incorporate weak signals in variable selection, estimation and prediction. We propose a two-stage procedure, consisting of variable selection and post-selection estimation. The variable selection stage involves a covariance-insured screening for detecting weak signals, while the post-selection estimation stage involves a shrinkage estimator for jointly estimating strong and weak signals selected from the first stage. We term the proposed method the covariance-insured screening-based post-selection shrinkage estimator. We establish asymptotic properties for the proposed method and show, via simulations, that incorporating weak signals can improve estimation and prediction performance. We apply the proposed method to predict the annual gross domestic product (GDP) rates based on various socioeconomic indicators for 82 countries.

Keywords: high-dimensional data, Lasso, post-selection shrinkage estimation, variable selection, weak signal detection

1. Introduction

Given n independent samples, we consider a high-dimensional linear regression model

y=Xβ+ε, (1)

where y = (y_1, … , y_n)^T is an n-vector of responses, X = (X_ij)_{n×p} is an n × p random design matrix, β = (β_1, …, β_p)^T is a p-vector of regression coefficients, and ε = (ε_1, … , ε_n)^T is an n-vector of independently and identically distributed random errors with mean 0 and variance σ². Let β* = (β_1^*, …, β_p^*)^T denote the true value of β. We write X = (x_(1), …, x_(n))^T = (x_1, …, x_p), where x_(i) = (X_i1, … , X_ip)^T is the i-th row of X and x_j is the j-th column of X, for i = 1, …, n and j = 1, …, p. Dropping the subject index i, we write y, X_j and ε for the random variables underlying y_i, X_ij and ε_i, respectively. We assume that each X_j is independent of ε. We write x for the random vector underlying x_(i) and assume that x follows a p-dimensional multivariate sub-Gaussian distribution with mean zero, variance proxy σ_x², and covariance matrix Σ. Sub-Gaussian distributions include a wide class of distributions, such as Gaussian, binary and all bounded random variables. Therefore, our proposed framework accommodates a broader range of data types than the conventional Gaussian assumption.

We assume that model (1) is sparse; that is, the number of nonzero components of β* is less than n. When p > n, the essential problem is to recover the set of predictors with nonzero coefficients. The past two decades have seen many regularization methods developed for variable selection and estimation in high-dimensional settings, including Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), group Lasso (Yuan and Lin, 2006), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), among many others. Most regularization methods assume the restrictive β-min condition, which requires that the strength of the nonzero β_j^*'s exceed a certain noise level (Zhang and Zhang, 2014). Hence, regularization methods may fail to detect weak signals with nonzero but small β_j^*'s, which results in biased estimates and inaccurate predictions, especially when weak signals outnumber strong signals.

Detection of weak signals is challenging. However, if weak signals are partially correlated with strong signals which satisfy the β-min condition, they may be more reliably detected. To elaborate on this idea, first notice that the regression coefficient βj* can be written as

\beta_j^* = \sum_{1 \le j' \le p} \Omega_{jj'} \, \mathrm{cov}(X_{j'}, y), \qquad j = 1, \dots, p, \qquad (2)

where Ω_{jj′} is the (j, j′)-th entry of Ω = Σ^{-1}, the precision matrix of x. Let ρ_{jj′} be the partial correlation of X_j and X_{j′}, i.e., the correlation between the residuals of X_j and X_{j′} after regressing them on all the other X variables. It can be shown that ρ_{jj′} = −Ω_{jj′}/(Ω_{jj}Ω_{j′j′})^{1/2}. Hence, X_j and X_{j′} being partially uncorrelated is equivalent to Ω_{jj′} = 0. Assume that Ω is a sparse matrix with only a few nonzero entries. When the right-hand side of (2) can be accurately evaluated, weak signals can be distinguished from noise. In high-dimensional settings, it is impossible to evaluate Σ_{1≤j′≤p} Ω_{jj′} cov(X_{j′}, y) accurately in full. However, under the faithfulness condition that will be introduced in Section 3, a variable, say indexed by j′, satisfying the β-min condition will have a nonzero cov(X_{j′}, y). Once we identify such strong signals, we set out to discover variables that are partially correlated with them.
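To make identity (2) concrete, the following minimal Python sketch (our illustration, not code from the paper) simulates a small sparse linear model and checks that applying the inverse sample covariance to the vector of marginal sample covariances approximately recovers β*; the sample size, dimension and correlation level are arbitrary choices.

```python
import numpy as np

# Minimal numerical check of identity (2): beta* = Omega x cov(x, y).
rng = np.random.default_rng(0)
n, p, rho = 5000, 10, 0.5

# AR(1)-type covariance so that some predictors are correlated.
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

beta = np.zeros(p)
beta[[0, 3]] = [2.0, 0.3]           # one strong and one weak coefficient
y = X @ beta + rng.standard_normal(n)

Sigma_hat = X.T @ X / n             # sample covariance of the mean-zero design
cov_xy = X.T @ y / n                # marginal sample covariances cov^(X_j, y)
beta_recovered = np.linalg.inv(Sigma_hat) @ cov_xy

print(np.round(beta_recovered, 2))  # approximately equal to beta for large n
```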

For brevity, we term weak signals which are partially correlated with strong signals "weak but correlated" (WBC) signals. This paper aims to incorporate WBC signals in variable selection, estimation and prediction. We propose a two-stage procedure which consists of variable selection and post-selection estimation. The variable selection stage involves a covariance-insured screening for detecting weak signals, and the post-selection estimation stage involves a shrinkage estimator for jointly estimating strong and weak signals selected from the first stage. We call the proposed method the covariance-insured screening-based post-selection shrinkage estimator (CIS-PSE). Our simulation studies demonstrate that, by incorporating WBC signals, CIS-PSE improves estimation and prediction accuracy. We also establish the asymptotic selection consistency of CIS-PSE.

The paper is organized as follows. We outline the proposed CIS-PSE method in Section 2 and investigate its asymptotic properties in Section 3. We evaluate the finite-sample performance of CIS-PSE via simulations in Section 4, and apply the proposed method to predict the annual gross domestic product (GDP) rates based on socioeconomic indicators for 82 countries in Section 5. We conclude the paper with a brief discussion in Section 6. All technical proofs are provided in the Appendix.

2. Methods

2.1. Notation

We use scripted upper-case letters, such as S, to denote subsets of {1, … , p}. Denote by |S| the cardinality of S and by S^c the complement of S. For a vector v, we denote by v_S the subvector of v indexed by S. Let X_S = (x_j, j ∈ S) be the submatrix of the design matrix X restricted to the columns indexed by S. For the symmetric covariance matrix Σ, denote by Σ_{SS′} its submatrix with row indices restricted to S and column indices restricted to S′. When S = S′, we write Σ_S = Σ_{SS} for short. The same notation applies to its sample version Σ̂.

Denote by G(V, E; Ω) the graph induced by Ω, where the node set is V = {1, …, p} and the set of edges is denoted by E. An edge is a pair of nodes, say k and k′, with Ω_{kk′} ≠ 0. For a subset V_l ⊂ V, denote by Ω_l the principal submatrix of Ω with its row and column indices restricted to V_l, and by E_l the corresponding edge set. The subgraph G(V_l, E_l; Ω_l) is a connected component of G(V, E; Ω) if (i) any two nodes in V_l are connected by edges in E_l; and (ii) for any node k′ ∈ V_l^c, there exists a node k ∈ V_l such that k and k′ cannot be connected by any edges in E.

For a symmetric matrix A, denote by tr(A) the trace of A, and by λ_min(A) and λ_max(A) its minimum and maximum eigenvalues. We define the operator norm and the Frobenius norm as ‖A‖ = λ_max^{1/2}(A^T A) and ‖A‖_F = tr(A^T A)^{1/2}, respectively. For a p-vector v, denote its L_q norm by ‖v‖_q = (Σ_{j=1}^p |v_j|^q)^{1/q} with q ≥ 1. For two real numbers a and b, denote a ∨ b = max(a, b).

Denote the sample covariance matrix and the marginal sample covariance between Xj and y, j = 1, …, p, by

\hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n} x_{(i)} x_{(i)}^T \quad \text{and} \quad \widehat{\mathrm{cov}}(X_j, y) = \frac{1}{n}\sum_{i=1}^{n} X_{ij} y_i.

For a vector V = (V_1, …, V_p)^T, denote cov̂(V, y) = (cov̂(V_1, y), …, cov̂(V_p, y))^T.

2.2. Defining strong and weak signals

Consider a low-dimensional linear regression model where p < n. The ordinary least squares (OLS) estimator β̂^OLS = Σ̂^{-1} cov̂(x, y) = Ω̂ cov̂(x, y), where Ω̂ = Σ̂^{-1} is the empirical precision matrix, minimizes the prediction error. It is also known that β̂^OLS is an unbiased estimator of β* and yields the best outcome prediction ŷ_best = X Ω̂ cov̂(x, y).

However, when p > n, Σ̂ becomes non-invertible, and thus β cannot be estimated using all X variables. Let S_0 = {j : β_j^* ≠ 0} be the true signal set and assume that |S_0| < n. If S_0 were known, the predicted outcome ŷ_best = X_{S_0} Σ̂_{S_0}^{-1} cov̂(x_{S_0}, y) would have the smallest prediction error. In practice, S_0 is unknown and some variable selection method must first be applied to identify it. We define the set of strong signals as

S_1 = \left\{ j : |\beta_j^*| > c\sqrt{\log p / n} \ \text{for some } c > 0, \ 1 \le j \le p \right\} \qquad (3)

and let S2=S0\S1 be the set of weak signals. Then, the OLS estimator and the best outcome prediction are given by

\hat{\beta}^{OLS} = \begin{pmatrix} \hat{\beta}^{OLS}_{S_1} \\ \hat{\beta}^{OLS}_{S_2} \end{pmatrix} = \begin{pmatrix} \hat{\Omega}_{11}\,\widehat{\mathrm{cov}}(x_{S_1}, y) + \hat{\Omega}_{12}\,\widehat{\mathrm{cov}}(x_{S_2}, y) \\ \hat{\Omega}_{21}\,\widehat{\mathrm{cov}}(x_{S_1}, y) + \hat{\Omega}_{22}\,\widehat{\mathrm{cov}}(x_{S_2}, y) \end{pmatrix}

and

\hat{y}_{\mathrm{best}} = X_{S_1}\hat{\Omega}_{11}\widehat{\mathrm{cov}}(x_{S_1}, y) + X_{S_2}\hat{\Omega}_{21}\widehat{\mathrm{cov}}(x_{S_1}, y) + X_{S_1}\hat{\Omega}_{12}\widehat{\mathrm{cov}}(x_{S_2}, y) + X_{S_2}\hat{\Omega}_{22}\widehat{\mathrm{cov}}(x_{S_2}, y),

where \begin{pmatrix} \hat{\Omega}_{11} & \hat{\Omega}_{12} \\ \hat{\Omega}_{21} & \hat{\Omega}_{22} \end{pmatrix} = \begin{pmatrix} \hat{\Sigma}_{S_1} & \hat{\Sigma}_{S_1 S_2} \\ \hat{\Sigma}_{S_2 S_1} & \hat{\Sigma}_{S_2} \end{pmatrix}^{-1} is the partitioned empirical precision matrix. We observe that the partial correlations between the variables in S_1 and S_2 contribute to the estimation of β_{S_1} and β_{S_2}, and to the outcome prediction. Therefore, incorporating WBC signals helps reduce the estimation bias and prediction error.
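As a numerical illustration of the block formulas above (our own sketch, with made-up set sizes and coefficients), the snippet below compares the full partitioned estimator of β_{S_1} with the restricted estimator Σ̂_{S_1}^{-1} cov̂(x_{S_1}, y) that drops the cross-block term Ω̂_{12} cov̂(x_{S_2}, y); when S_1 and S_2 are correlated, only the former is approximately unbiased.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000
# S1 = {0}, S2 = {1, 2}; all three predictors share pairwise correlation 0.6.
Sigma = np.full((3, 3), 0.6) + 0.4 * np.eye(3)
X = rng.multivariate_normal(np.zeros(3), Sigma, size=n)
beta = np.array([2.0, 0.3, 0.3])           # one strong signal and two weak ones
y = X @ beta + rng.standard_normal(n)

S1, S2 = [0], [1, 2]
Sigma_hat = X.T @ X / n
cov_xy = X.T @ y / n

Omega_hat = np.linalg.inv(Sigma_hat)       # partitioned empirical precision matrix
beta_S1_full = (Omega_hat[np.ix_(S1, S1)] @ cov_xy[S1]
                + Omega_hat[np.ix_(S1, S2)] @ cov_xy[S2])
beta_S1_restricted = np.linalg.inv(Sigma_hat[np.ix_(S1, S1)]) @ cov_xy[S1]

print(beta_S1_full)        # close to 2.0
print(beta_S1_restricted)  # biased upward: the correlated weak signals are ignored
```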

We now further divide S_2 into S_WBC and S_2^*. Here, S_WBC is the set of weak signals that have nonzero partial correlations with the signals in S_1, and S_2^* is the set of weak signals that are not partially correlated with signals in S_1. Formally, with c given in (3),

S_{\mathrm{WBC}} = \left\{ j : 0 < |\beta_j^*| < c\sqrt{\log p / n} \ \text{and} \ \Omega_{jj'} \ne 0 \ \text{for some } j' \in S_1, \ 1 \le j \le p \right\}

and

S_2^* = \left\{ j : 0 < |\beta_j^*| < c\sqrt{\log p / n} \ \text{and} \ \Omega_{jj'} = 0 \ \text{for all } j' \in S_1, \ 1 \le j \le p \right\}.

Thus, the p predictors can be partitioned as S_1 ∪ S_WBC ∪ S_2^* ∪ S_null = {1, …, p}, where S_null = {j : β_j^* = 0}. We write |S_1| = p_1, |S_WBC| = p_WBC and |S_2^*| = p_2^*.

2.3. Covariance-insured screening based post-selection shrinkage estimator (CIS-PSE)

Our proposed CIS-PSE method consists of a variable selection step and a post-selection shrinkage estimation step.

Variable selection:

First, we detect strong signals by regularization methods such as Lasso or adaptive Lasso. Denote by Ŝ_1 the set of detected strong signals. To identify WBC signals, we evaluate (2) for each j ∈ Ŝ_1^c. When there is no confusion, we use j′ to denote a strong signal.

Though estimating cov(X_{j′}, y) for every 1 ≤ j′ ≤ p is straightforward, identifying and estimating the nonzero entries of Ω is still challenging in high-dimensional settings. However, for identifying WBC signals, it is unnecessary to estimate the whole Ω matrix. Leveraging intra-feature correlations among predictors, we introduce a computationally efficient method for detecting the nonzero Ω_{jj′}'s.

Variables that are partially correlated with signals in Ŝ_1 form the connected components of G(V, E; Ω) that contain at least one element of Ŝ_1. Therefore, for detecting WBC signals, it suffices to focus on such connected components. Under the sparsity assumptions on β* and Ω, the sizes of such connected components are relatively small. For example, as shown in Figure 1, the first two diagonal blocks, which are of moderate size, are the ones relevant for the detection of WBC signals.

Figure 1: An illustrative example of marginally strong signals and their connected components in G(V, E; Ω). Left panel: structure of Ω; middle panel: structure of Ω after properly reordering its row and column indices; right panel: the corresponding graph structure and the connected components of the strong signals. Signals in S_1 are colored red, signals in S_2^* orange, and WBC signals in S_WBC yellow.

Under the sparsity assumption on Ω, the connected components of Ω can be inferred from those of the thresholded sample covariance matrix (Mazumder and Hastie, 2012; Bickel and Levina, 2008; Fan et al., 2011; Shao et al., 2011), which is much easier to estimate and can be computed in a parallel manner. Denote by Σ̃^α the thresholded sample covariance matrix with thresholding parameter α, where Σ̃^α_{kk′} = Σ̂_{kk′} 1{|Σ̂_{kk′}| ≥ α}, 1 ≤ k, k′ ≤ p, with 1(·) being the indicator function. Denote by G(V, Ẽ; Σ̃^α) the graph corresponding to Σ̃^α. For a variable k, 1 ≤ k ≤ p, denote by C_{[k]} the vertex set of the connected component in G(V, E; Ω) containing k. If variables k and k′ belong to the same connected component, 1 ≤ k ≠ k′ ≤ p, then C_{[k]} = C_{[k′]}. For example, C_{[14]} = C_{[16]} = C_{[24]} = C_{[26]} = C_{[29]} in the right panel of Figure 1. Clearly, since Ω_{kk′} = 0 whenever k′ ∉ C_{[k]}, evaluating (2) is equivalent to estimating

\beta_j^* = \sum_{j' \in C_{[j]}} \Omega_{jj'} \, \mathrm{cov}(X_{j'}, y), \qquad j = 1, \dots, p. \qquad (4)

Correspondingly, for a variable k, 1 ≤ k ≤ p, denote by Ĉ_{[k]} the vertex set of the connected component in G(V, Ẽ; Σ̃^α) containing k. For a multivariate Gaussian x, Mazumder and Hastie (2012) showed that the C_{[k]}'s can be exactly recovered from the Ĉ_{[k]}'s with a properly chosen α. For a multivariate sub-Gaussian x, we refer to the following lemma.

Lemma 2.1. Suppose that the maximum size of a connected component in Ω containing a variable in S_0 is of order O(exp(n^ξ)) for some 0 < ξ < 1. Then, under Assumption (A7) specified in Section 3, with α = O(n^{ξ−1}) and for any variable k, 1 ≤ k ≤ p, we have

\mathrm{P}\big(C_{[k]} = \hat{C}_{[k]}\big) \ge 1 - C_1 n^{\xi} \exp\!\big(-C_2 n^{1+\xi}\big) \to 1 \qquad (5)

for some positive constants C1 and C2.

We summarize the variable selection procedure for S_1, S_WBC and S_2^* below.

Step 1: Detection of S_1. Obtain a candidate subset Ŝ_1 of strong signals using a penalized regression method. Here, we consider the penalized least squares (PLS) estimator of Gao et al. (2017):

\hat{\beta}^{PLS} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \sum_{j=1}^{p} \mathrm{Pen}_{\lambda}(\beta_j) \right\}, \qquad (6)

where Pen_λ(β_j) is a penalty on each individual β_j that shrinks weak effects toward zero and selects the strong signals, with the tuning parameter λ > 0 controlling the size of the candidate subset Ŝ_1. Commonly used penalties are Pen_λ(β_j) = λ|β_j| for Lasso and Pen_λ(β_j) = λω_j|β_j| for adaptive Lasso, where ω_j > 0 is a known weight.
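A minimal sketch of Step 1 using scikit-learn's Lasso is given below; it is one possible implementation of (6), not the authors' code, and it tunes λ by cross-validation rather than the BIC mentioned in Section 2.4.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def detect_strong_signals(X, y, random_state=0):
    """Step 1: return the index set S1_hat of candidate strong signals."""
    # LassoCV chooses lambda by cross-validation; BIC could be used instead.
    lasso = LassoCV(cv=5, random_state=random_state).fit(X, y)
    S1_hat = np.flatnonzero(lasso.coef_ != 0)
    return S1_hat, lasso.coef_
```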

Step 2: Detection of S_WBC. First, for a given threshold α, construct the thresholded sample covariance matrix Σ̃^α. Next, for each selected variable j′ ∈ Ŝ_1 from Step 1, detect Ĉ_{[j′]}, the node set of its connected component in G(V, Ẽ; Σ̃^α). Let U(α, Ŝ_1) = ∪_{j′∈Ŝ_1} Ĉ_{[j′]} be the union of the vertex sets of the detected connected components. Then, according to (4), it suffices to identify WBC signals within U(α, Ŝ_1). Specifically, for each j ∈ Ŝ_1^c ∩ U(α, Ŝ_1), let Σ̃^α_{[j]} be the submatrix of Σ̃^α restricted to Ĉ_{[j]}. We then evaluate (4) and select WBC variables by

\hat{S}_{\mathrm{WBC}} = \Big\{ j \in \hat{S}_1^c \cap U(\alpha, \hat{S}_1) : \Big| \sum_{j' \in \hat{C}_{[j]}} \big(\tilde{\Sigma}^{\alpha}_{[j]}\big)^{-1}_{jj'} \, \widehat{\mathrm{cov}}(X_{j'}, y) \Big| \ge \nu_n \Big\} \qquad (7)

for some pre-specified ν_n. Here, (Σ̃^α_{[j]})^{-1}_{jj′} denotes the entry of (Σ̃^α_{[j]})^{-1} corresponding to variables j and j′. In our numerical studies, we rank each variable j ∈ Ŝ_1^c ∩ U(α, Ŝ_1) by the magnitude of |Σ_{j′∈Ĉ_{[j]}} (Σ̃^α_{[j]})^{-1}_{jj′} cov̂(X_{j′}, y)| and select up to the first n − |Ŝ_1| variables.
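The sketch below outlines Step 2 under stated simplifications: it thresholds the sample correlation matrix (as the authors do in their numerical studies), extracts the connected components with scipy, and screens the candidates in components containing Ŝ_1 by the statistic in (7). For brevity it inverts the unthresholded sample covariance within each component rather than the thresholded submatrix; the function and variable names are ours.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def detect_wbc_signals(X, y, S1_hat, alpha, nu_n):
    """Step 2: screen weak-but-correlated (WBC) signals via the statistic in (7)."""
    n, p = X.shape
    corr = np.corrcoef(X, rowvar=False)
    adj = (np.abs(corr) >= alpha) & ~np.eye(p, dtype=bool)  # thresholded graph
    _, labels = connected_components(csr_matrix(adj), directed=False)

    cov_xy = X.T @ y / n
    Sigma_hat = X.T @ X / n
    s1_set = set(map(int, S1_hat))
    scores = {}
    for comp in np.unique(labels[list(S1_hat)]):     # components containing strong signals
        members = np.flatnonzero(labels == comp)
        Omega_block = np.linalg.pinv(Sigma_hat[np.ix_(members, members)])
        stat = Omega_block @ cov_xy[members]         # evaluates (4) within the block
        for j, s in zip(members, stat):
            if int(j) not in s1_set:
                scores[int(j)] = abs(s)
    S_wbc_hat = np.array(sorted(j for j, s in scores.items() if s >= nu_n))
    return S_wbc_hat, scores
```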

Step 3: Detection of S_2^*. To identify Ŝ_2^*, we first solve a regression problem with a ridge penalty imposed only on the variables in Ŝ_{1WBC}^c, where Ŝ_{1WBC} = Ŝ_1 ∪ Ŝ_WBC. That is,

\hat{\beta}^{r} = \arg\min_{\beta} \left\{ \|y - X\beta\|_2^2 + \tilde{\lambda}_n \big\| \beta_{\hat{S}_{1WBC}^c} \big\|_2^2 \right\}, \qquad (8)

where λ̃_n > 0 is a tuning parameter controlling the overall strength of the variables in Ŝ_{1WBC}^c. Then the post-selection weighted ridge (WR) estimator β̂^WR has the form

\hat{\beta}^{WR}_j = \begin{cases} \hat{\beta}^{r}_j, & j \in \hat{S}_{1WBC}, \\ \hat{\beta}^{r}_j \, 1\big(|\hat{\beta}^{r}_j| > a_n\big), & j \in \hat{S}_{1WBC}^c, \end{cases} \qquad (9)

where an is a thresholding parameter. Then the candidate subset S^2* is obtained by

\hat{S}_2^* = \big\{ j \in \hat{S}_{1WBC}^c : \hat{\beta}^{WR}_j \ne 0, \ 1 \le j \le p \big\}. \qquad (10)
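A sketch of Step 3 under our reading of (8)-(10): the ridge penalty is applied only to the coefficients outside Ŝ_{1WBC}, which admits the closed-form solution below, and the resulting estimates are thresholded at a_n. The helper name and the generalized-inverse fallback are our own choices.

```python
import numpy as np

def weighted_ridge_step(X, y, S_1wbc, lam_tilde, a_n):
    """Step 3: weighted ridge (8)-(9) and the candidate set S2*_hat in (10)."""
    n, p = X.shape
    penalized = np.ones(p)
    penalized[S_1wbc] = 0.0                      # no penalty on S1 and WBC signals
    # Closed form of argmin ||y - Xb||^2 + lam_tilde * ||b restricted to S_1wbc^c||^2
    A = X.T @ X + lam_tilde * np.diag(penalized)
    beta_r = np.linalg.pinv(A) @ X.T @ y         # pinv guards against a singular A

    beta_wr = beta_r.copy()
    mask_c = penalized.astype(bool)              # indices in S_1wbc^c
    beta_wr[mask_c] = np.where(np.abs(beta_r[mask_c]) > a_n, beta_r[mask_c], 0.0)
    S2_star_hat = np.flatnonzero(mask_c & (beta_wr != 0))
    return beta_wr, S2_star_hat
```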

Post-selection shrinkage estimation:

We consider the following two cases when performing the post-selection shrinkage estimation.

Case 1: p̂_1 + p̂_WBC + p̂_2^* < n. We obtain the CIS-PSE on Ŝ_0 by

\hat{\beta}^{CIS\text{-}PSE}_{\hat{S}_0} = \hat{\Sigma}_{\hat{S}_0}^{-1} \, \widehat{\mathrm{cov}}\big(x_{\hat{S}_0}, y\big), \qquad (11)

where Ŝ_0 = Ŝ_1 ∪ Ŝ_WBC ∪ Ŝ_2^*. Then β̂^{CIS-PSE}_{Ŝ_1} and β̂^{CIS-PSE}_{Ŝ_WBC} are obtained by restricting β̂^{CIS-PSE}_{Ŝ_0} to Ŝ_1 and Ŝ_WBC, respectively.

Case 2: p̂_1 + p̂_WBC + p̂_2^* ≥ n. Recall that β̂^{WR}_{Ŝ_{1WBC}} = (β̂^r_j, j ∈ Ŝ_{1WBC})^T and β̂^{WR}_{Ŝ_2^*} = (β̂^r_j 1(|β̂^r_j| > a_n), j ∈ Ŝ_2^*)^T. We obtain the CIS-PSE of β_{Ŝ_{1WBC}} by

\hat{\beta}^{CIS\text{-}PSE}_{\hat{S}_{1WBC}} = \hat{\beta}^{WR}_{\hat{S}_{1WBC}} - (\hat{s}_2 - 2)\,\hat{T}_n^{-1}\big( \hat{\beta}^{WR}_{\hat{S}_{1WBC}} - \hat{\beta}^{RE}_{\hat{S}_{1WBC}} \big), \qquad (12)

where ŝ_2 = |Ŝ_2^*| and T̂_n is defined by

\hat{T}_n = \big(\hat{\beta}^{WR}_{\hat{S}_2^*}\big)^T \big( X_{\hat{S}_2^*}^T M_{\hat{S}_{1WBC}} X_{\hat{S}_2^*} \big) \hat{\beta}^{WR}_{\hat{S}_2^*} \big/ \hat{\sigma}^2, \qquad (13)

where M_{Ŝ_{1WBC}} = I_n − X_{Ŝ_{1WBC}}(X_{Ŝ_{1WBC}}^T X_{Ŝ_{1WBC}})^{-1} X_{Ŝ_{1WBC}}^T and σ̂² = Σ_{i=1}^n (y_i − x_{(i),Ŝ_2}^T β̂^{WR}_{Ŝ_2})² / (n − |Ŝ_2|). If X_{Ŝ_{1WBC}}^T X_{Ŝ_{1WBC}} is singular, we replace (X_{Ŝ_{1WBC}}^T X_{Ŝ_{1WBC}})^{-1} with a generalized inverse. Then β̂^{CIS-PSE}_{Ŝ_1} and β̂^{CIS-PSE}_{Ŝ_WBC} are obtained by restricting β̂^{CIS-PSE}_{Ŝ_{1WBC}} to Ŝ_1 and Ŝ_WBC, respectively.

Set Ŝ_null = (Ŝ_1 ∪ Ŝ_WBC ∪ Ŝ_2^*)^c. The final CIS-PSE estimator β̂^{CIS-PSE} is defined as

\hat{\beta}^{CIS\text{-}PSE} = \Big( \big(\hat{\beta}^{CIS\text{-}PSE}_{\hat{S}_1}\big)^T, \big(\hat{\beta}^{CIS\text{-}PSE}_{\hat{S}_{WBC}}\big)^T, \big(\hat{\beta}^{WR}_{\hat{S}_2^*}\big)^T, \mathbf{0}_{\hat{S}_{\mathrm{null}}} \Big)^T. \qquad (14)
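For Case 2, a sketch of the shrinkage step (12)-(13) as we read it is given below. The restricted estimator β̂^{RE}_{Ŝ_{1WBC}} is not spelled out in this section; following the RE used in Figure 3, we take it to be Σ̂_{Ŝ_{1WBC}}^{-1} cov̂(x_{Ŝ_{1WBC}}, y), and we compute σ̂² from the residuals of the weighted-ridge fit on Ŝ_2^*. Both choices, and the use of a pseudo-inverse, are our assumptions rather than the authors' code.

```python
import numpy as np

def cis_pse_case2(X, y, S_1wbc, S2_star, beta_wr):
    """Stein-type shrinkage (12)-(13) of the weighted-ridge estimator on S_1wbc."""
    n = X.shape[0]
    X1, X2 = X[:, S_1wbc], X[:, S2_star]
    b1_wr, b2_wr = beta_wr[S_1wbc], beta_wr[S2_star]

    # Restricted estimator on S_1wbc (assumed form, cf. the RE in Figure 3).
    b1_re = np.linalg.pinv(X1.T @ X1 / n) @ (X1.T @ y / n)

    # M projects onto the orthogonal complement of the column space of X1.
    M = np.eye(n) - X1 @ np.linalg.pinv(X1.T @ X1) @ X1.T
    resid = y - X2 @ b2_wr
    sigma2_hat = resid @ resid / (n - len(S2_star))
    T_n = b2_wr @ (X2.T @ M @ X2) @ b2_wr / sigma2_hat

    s2_hat = len(S2_star)
    return b1_wr - (s2_hat - 2) / T_n * (b1_wr - b1_re)    # equation (12)
```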

2.4. Selection of tuning parameters

When selecting strong signals, the tuning parameter λ in Lasso or adaptive Lasso can be chosen by BIC (Zou, 2006). To choose ν_n for the selection of WBC signals according to (7), we rank the variables j ∈ Ŝ_1^c ∩ U(α, Ŝ_1) by the magnitude of |Σ_{j′∈Ĉ_{[j]}} (Σ̃^α_{[j]})^{-1}_{jj′} cov̂(X_{j′}, y)| and take the first r(n − |Ŝ_1|) variables as Ŝ_WBC. Here, r is chosen so that Ŝ_{1WBC} minimizes the prediction error on an independent validation dataset. For the thresholding parameter α, we set α = c_3 log(n) for some positive constant c_3, as suggested in Shao et al. (2011). Our empirical experiments show that this choice tends to give more true positives and fewer false positives in identifying WBC variables. Figure 7 in the Appendix reveals that, to find the optimal α that minimizes the prediction error on a validation dataset, it suffices to conduct a grid search over only a few proposed values of α. In our numerical studies, instead of thresholding the sample covariance matrix, we threshold the sample correlation matrix; as correlations range between −1 and 1, it is easier to set a target range for α. To detect signals in S_2^*, we follow Gao et al. (2017) and use cross-validation to choose λ̃_n and a_n in (8) and (9), respectively. In particular, we set λ̃_n = c_1 a_n^2 (log log n)^3 log(n ∨ p) and a_n = c_2 n^{−1/8} for some positive constants c_1 and c_2. We fix the tuning parameters and fit the model on the training dataset, and compute the prediction error of the fitted model on the validation dataset. We repeat this procedure for various c_1 and c_2 and choose the pair that gives the smallest prediction error on the validation dataset.
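The validation-based search described above might look like the following sketch (our illustration; the candidate grids and the `fit_cis_pse` fitting routine are placeholders rather than functions from the paper, and the formulas for λ̃_n and a_n follow the choices stated above).

```python
import numpy as np
from itertools import product

def tune_on_validation(fit_cis_pse, X_tr, y_tr, X_val, y_val,
                       alpha_grid, r_grid, c1_grid, c2_grid):
    """Pick (alpha, r, c1, c2) minimizing the validation prediction error."""
    best, best_err = None, np.inf
    n_tr, p = X_tr.shape
    for alpha, r, c1, c2 in product(alpha_grid, r_grid, c1_grid, c2_grid):
        a_n = c2 * n_tr ** (-1 / 8)
        lam_tilde = c1 * a_n ** 2 * np.log(np.log(n_tr)) ** 3 * np.log(max(n_tr, p))
        beta_hat = fit_cis_pse(X_tr, y_tr, alpha=alpha, r=r,
                               lam_tilde=lam_tilde, a_n=a_n)  # placeholder routine
        err = np.mean((y_val - X_val @ beta_hat) ** 2)
        if err < best_err:
            best, best_err = (alpha, r, c1, c2), err
    return best, best_err
```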

Figure 7: Sum of squared prediction error (SSPE) corresponding to different α's.

3. Asymptotic properties

To investigate the asymptotic properties of CIS-PSE, we assume the following.

(A1) The random error ϵ has a finite kurtosis.

(A2) log(p) = O(n^ν) for some 0 < ν < 1.

(A3) There are positive constants κ_1 and κ_2 such that 0 < κ_1 < λ_min(Σ) ≤ λ_max(Σ) < κ_2 < ∞.

(A4) Sparse Riesz condition (SRC): For the random design matrix X, any S ⊂ {1, …, p} with |S| = q, q ≤ p, and any vector v ∈ ℝ^q, there exist 0 < c_* < c^* < ∞ such that c_* ≤ ‖X_S v‖_2² / ‖v‖_2² ≤ c^* holds with probability tending to 1.

(A5) Faithfulness Assumption: Suppose that

\max\big|\Sigma_{S_1^c S_1}\beta^*_{S_1}\big| + \max\big|\Sigma_{S_1^c S_2}\beta^*_{S_2}\big| + \min\big|\Sigma_{S_1 S_2}\beta^*_{S_2}\big| < \min\big|\Sigma_{S_1 S_1}\beta^*_{S_1}\big|,

where the absolute value function | · | is applied component-wise to its argument vector. The max and min operators are with respect to all individual components in the argument vectors.

(A6) Denote by C_max = max_{1≤l≤B} |V_l| the maximum size of the connected components in the graph G(V, E; Ω) that contain at least one signal in S_1, where B is the number of such connected components. Assume C_max = O(n^ξ) for some ξ ∈ (0, 1).

(A7) Assume min_{(k,k′)∈E} |Σ_{kk′}| ≥ C n^{ξ−1} for some constant C > 0 and max_{(k,k′)∈E} |Σ_{kk′}| = O(n^{ξ−1}), for the ξ in (A6).

(A8) For any subset V_l ⊂ {1, …, p} with |V_l| = O(n), sup_j E|X_{ij} 1(j ∈ V_l)|^{2ν} < C_ν < ∞ for some constant ν > 0.

(A9) Assume that ‖β^*_{S_2^*}‖_2 = o(n^τ) for some 0 < τ < 1, where ‖·‖_2 is the Euclidean norm.

(A1), a technical assumption for the asymptotic proofs, is satisfied by many parametric distributions such as the Gaussian. The assumption is mild, as we do not assume any parametric distribution for ε beyond finite moments. (A2) and (A3) are commonly assumed in the high-dimensional literature. (A4) guarantees that S_1 can be recovered with probability tending to 1 as n → ∞ (Zhang and Huang, 2008). (A5) ensures that min_{j∈S_1} |cov̂(X_j, y)| > max_{j∈S_1^c} |cov̂(X_j, y)| holds with probability tending to 1 (Lemma 4 in Genovese et al., 2012). (A6) implies that the size of each connected component containing a strong signal, i.e., C_{[j]} for j ∈ S_1, cannot exceed the order of exp(n^ξ) for some ξ ∈ (0, 1); this assumption is required for estimating sparse covariance matrices. (A7) guarantees that, with a properly chosen thresholding parameter α, X_k and X_{k′} have nonzero thresholded sample covariances for (k, k′) ∈ E and zero thresholded sample covariances for (k, k′) ∉ E. As a result, the connected components of the thresholded sample covariance matrix and those of the precision matrix can be detected with adequate accuracy. (A8) ensures that the precision matrix can be accurately estimated by inverting the thresholded sample covariance matrix; see Shao et al. (2011) and Bickel and Levina (2008) for details. (A9), which bounds the total size of the weak signals in S_2^*, is required for selection consistency on S_2^* (Gao et al., 2017).

We show that given a consistently selected S1, we have selection consistency for SWBC.

Theorem 3.1. With (A1)-(A3) and (A6)-(A8),

\lim_{n\to\infty} \mathrm{P}\big(\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}} \,\big|\, \hat{S}_1 = S_1\big) = 1.

The following corollary shows that Theorem 3.1, together with Theorem 2 in Zhang and Huang (2008) and Corollary 2 in Gao et al. (2017), further implies selection consistency for S_1 ∪ S_WBC ∪ S_2^*.

Corollary 3.2. Under Assumptions (A1)-(A9), we have

\lim_{n\to\infty} \mathrm{P}\big(\{\hat{S}_1 = S_1\} \cap \{\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}}\} \cap \{\hat{S}_2^* = S_2^*\}\big) = 1.

Corollary 3.2 implies that CIS-PSE can recover the true set asymptotically. Thus, when |S0|<n, CIS-PSE gives an OLS estimator with probability going to 1 and has the minimum prediction error asymptotically, among all the unbiased estimators.

4. Simulation studies

We conduct simulations to compare the performance of the proposed CIS-PSE and the post-shrinkage estimator (PSE) of Gao et al. (2017). The key difference between CIS-PSE and PSE is that PSE focuses only on S_1, whereas CIS-PSE considers S_1 ∪ S_WBC.

Data are generated according to (1) with

\beta^* = \big(\underbrace{20, 20, 20}_{S_1}, \underbrace{0.5, \dots, 0.5}_{S_{\mathrm{WBC}}:\,30}, \underbrace{0.5, \dots, 0.5}_{S_2^*:\,30}, \underbrace{0, \dots, 0}_{S_{\mathrm{null}}:\,p-63}\big)^T. \qquad (15)

The random errors ϵi are independently generated from N(0,1). We consider the following examples.

Example 1: The first three variables, which belong to S_1, are marginally N(0, 1). The first ten, next ten and last ten signals in S_WBC belong to the connected components of X_1, X_2 and X_3, respectively. These three connected components are independent of each other, and S_2^* is independent of S_1 and S_WBC. Each connected component within S_1 ∪ S_WBC, as well as S_2^*, is generated from a multivariate normal distribution with mean zero, unit variance, and a compound symmetric (CS) correlation matrix with correlation coefficient 0.7. Variables in S_null are independently generated from N(0, 1).

Example 2: This example is the same as Example 1 except that the three connected components within S_1 ∪ S_WBC and the S_2^* block follow a first-order autoregressive (AR(1)) correlation structure with correlation coefficient 0.7.

Example 3: This example is the same as Example 1 except that 30 variables in S_null (namely X_64–X_93) are set to be correlated with signals in S_1. That is, X_64–X_73 are correlated with X_1, X_74–X_83 with X_2, and X_84–X_93 with X_3. These three connected components within S_1 ∪ S_null have a CS correlation structure with correlation coefficient 0.7.

For each example, we conduct 500 independent experiments with p=200, 300, 400 and 500. We generate a training dataset of size n = 200, a test dataset of size n = 100 to assess the prediction performance, and an independent validation dataset of size n = 100 for tuning parameter selection.
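For concreteness, the following sketch generates data under our reading of Example 1: three independent 11-variable compound-symmetric blocks (one strong signal plus its ten WBC neighbors each), one 30-variable block for S_2^*, independent null variables, and the coefficient vector in (15). It is our reconstruction of the simulation design, not the authors' code.

```python
import numpy as np

def cs_block(rng, n, size, rho=0.7):
    """Draw n samples from N(0, Sigma) with compound-symmetric correlation rho."""
    Sigma = np.full((size, size), rho) + (1 - rho) * np.eye(size)
    return rng.multivariate_normal(np.zeros(size), Sigma, size=n)

def simulate_example1(n=200, p=200, seed=0):
    rng = np.random.default_rng(seed)
    blocks = [cs_block(rng, n, 11) for _ in range(3)]  # strong signal + 10 WBC each
    s2_star = cs_block(rng, n, 30)                     # the S2* block
    nulls = rng.standard_normal((n, p - 63))           # independent null variables

    # Order the columns as in (15): S1, then WBC, then S2*, then the nulls.
    strong = np.column_stack([b[:, 0] for b in blocks])
    wbc = np.column_stack([b[:, 1:] for b in blocks])
    X = np.column_stack([strong, wbc, s2_star, nulls])

    beta = np.concatenate([np.full(3, 20.0), np.full(60, 0.5), np.zeros(p - 63)])
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta
```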

First, we compare CIS-PSE and PSE in selecting S_0 under Examples 1–2. We use Lasso and adaptive Lasso to select Ŝ_1. Since both Lasso and adaptive Lasso give similar results, we report only the Lasso results in this section and present the adaptive Lasso results in the Appendix. We report the number of correctly identified variables (TP) in S_0 and the number of incorrectly selected variables (FP) in S_0^c. Table 1 shows that CIS-PSE outperforms PSE in identifying signals in S_0. We observe that the performance of PSE deteriorates as p increases, whereas CIS-PSE selects S_0 signals consistently even as p increases.

Table 1:

The performance of variable selection on S0

p = 200 p = 300 p = 400 p = 500
Example 1 TP CIS-PSE 59.6 (1.9) 58.7 (2.1) 57.9 (2.3) 57.7 (2.4)
PSE 41.2 (4.9) 34.4 (5.1) 25.7 (6.0) 22.6 (5.8)
FP CIS-PSE 3.7 (2.4) 5.1 (2.7) 6.9 (3.1) 8.8 (3.3)
PSE 13.3 (4.6) 18.9 (5.2) 21.7 (5.9) 26.1 (6.0)
Example 2 TP CIS-PSE 63.0 (0) 62.9 (0.1) 62.9 (0.1) 62.9 (0.1)
PSE 43.9 (3.9) 37.0 (4.2) 32.8 (5.0) 31.5 (4.3)
FP CIS-PSE 3.5 (2.4) 5.0 (2.7) 6.3 (3.3) 8.1 (3.1)
PSE 12.7 (4.2) 19.5 (5.1) 22.1 (6.3) 27.4 (6.4)

TP=true positive; FP= false positive.

Next, we evaluate the estimation accuracy on the targeted sub-model S_1 ∪ S_WBC using the mean squared error (MSE) as the criterion under Examples 1–2. Figure 2 indicates that the proposed CIS-PSE detects WBC signals and provides more accurate and precise estimates. Figure 3 shows that CIS-PSE also improves the estimation of β_{S_1} compared with PSE.

Figure 2: The mean squared error (MSE) of β̂_{S_WBC} for different p's under Example 1 (left panel) and Example 2 (right panel). Solid lines represent CIS-PSE, dashed lines PSE, and dotted lines Lasso.

Figure 3: The mean squared error (MSE) of β̂_{S_1} for different p's under Example 1 (left panel) and Example 2 (right panel). Solid lines represent CIS-PSE, dashed lines PSE, dotted lines the Lasso restricted estimator (RE), defined as β̂^{RE}_{Ŝ_1} = Σ̂_{Ŝ_1}^{-1} cov̂(x_{Ŝ_1}, y), dot-dashed lines Lasso, and long-dashed lines the WR estimator in (9).

We explore the prediction performance under Examples 1–2 using the mean squared prediction error (MSPE), defined as ‖ŷ^◊ − y_test‖_2²/n_test = ‖X_test β̂^◊ − y_test‖_2²/n_test, where β̂^◊ is obtained from the training data, y_test is the response vector of the test dataset, n_test is the size of the test dataset, and ◊ represents either the proposed CIS-PSE or PSE. Table 2, which summarizes the results, shows that CIS-PSE outperforms PSE, suggesting that incorporating WBC signals helps improve the prediction accuracy.

Table 2:

Mean squared prediction error (MSPE) of the predicted outcomes

p 200 300 400 500
Example 1 CIS-PSE 3.17 (0.80) 3.19 (0.78) 3.25 (0.77) 3.32 (0.77)
PSE 4.19 (0.83) 4.93 (1.02) 5.28 (1.07) 5.50 (1.16)
Lasso 10.28 (5.68) 10.02 (5.58) 9.77 (4.96) 9.78 (4.54)
Example 2 CIS-PSE 0.65 (0.14) 0.92 (0.19) 1.30 (0.16) 2.43 (0.64)
PSE 2.89 (0.61) 3.55 (0.75) 4.09 (0.68) 4.34 (0.97)
Lasso 4.20 (0.79) 4.50 (0.88) 4.68 (0.90) 4.73 (0.97)

Lastly, we consider the setting where a subset of S_null is correlated with a subset of S_1; see Example 3. The results summarized in Table 3 show that, compared with Example 1, the number of false positives increases only slightly when some variables in S_null are correlated with variables in S_1.

Table 4:

Estimation results of S1 and SWBC from the growth rate data

When Ŝ_1 is selected by Lasso
Ŝ_1   β̂_{S_1}^Lasso   β̂_{S_1}^PSE   β̂_{S_1}^CIS-PSE   Ŝ_WBC   β̂_{S_WBC}^CIS-PSE   corr̂
TOT 0.06 2.85 3.73 - - -
LFERT 1.66 1.75 2.55 LLIFE 1.88 −.85
NOM60 0.12 0.84
NOF60 −0.10 0.83
LGDPNOM60 −0.02 0.83
PRIF60 −0.001 −0.002 −0.12 LGDPPRIF60 0.02 0.99
LGDPPRIM60 −0.02 0.93
LGDPNOF60 0.02 −0.90
When Ŝ_1 is selected by adaptive Lasso
Ŝ_1   β̂_{S_1}^Ada-Lasso   β̂_{S_1}^PSE   β̂_{S_1}^CIS-PSE   Ŝ_WBC   β̂_{S_WBC}^CIS-PSE   corr̂
LFERT 1.98 2.04 2.54 LLIFE 1.77 −0.85
NOM60 0.08 0.84
NOF60 −0.07 0.83
LGDPNOM60 −0.01 0.83
* corr̂ stands for the sample correlations between variables in Ŝ_1 and Ŝ_WBC.

* TOT = terms-of-trade shock; LFERT = log of fertility rate (children per woman) averaged over 1960–1985; LLIFE = log of life expectancy at age 0 averaged over 1960–1985; NOM60 = percentage of no schooling in the male population in 1960; NOF60 = percentage of no schooling in the female population in 1960; LGDP60 = log GDP per capita in 1960 (1985 price); PRIF60 = percentage of primary schooling attained in the female population in 1960; PRIM60 = percentage of primary schooling attained in the male population in 1960; LGDPNOM60 = LGDP60×NOM60; LGDPPRIF60 = LGDP60×PRIF60; LGDPPRIM60 = LGDP60×PRIM60; LGDPNOF60 = LGDP60×NOF60.

5. A real data example

We apply the proposed CIS-PSE method to analyze the gross domestic product (GDP) growth data studied in Gao et al. (2017) and Barro and Lee (1994). Our goal is to identify factors that are associated with the long-run GDP growth rate. The dataset includes the GDP growth rates and 45 socioeconomic variables for 82 countries from 1960 to 1985. We consider the following model:

GR_i = \beta_0 + \beta_1 \log(GDP60_i) + z_i^T\beta_2 + 1(GDP60_i < 2898)\big(\delta_0 + \delta_1 \log(GDP60_i) + z_i^T\delta_2\big) + \varepsilon_i, \qquad (16)

where i = 1, …, 82 indexes countries, GR_i is the annualized GDP growth rate of country i from 1960 to 1985, GDP60_i is the GDP per capita in 1960, and z_i collects 45 socioeconomic covariates, the details of which can be found in Gao et al. (2017). Here β_1 and β_2 are the coefficients of log(GDP60) and the socioeconomic predictors, respectively; δ_0 is the shift in intercept for countries whose GDP per capita in 1960 is below the threshold of 2898; δ_1 is the additional coefficient of log(GDP60) for such countries; and δ_2 contains the additional coefficients of the socioeconomic predictors for such countries, i.e., the interactions between 1(GDP60_i < 2898) and the predictors.
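Model (16) is a linear model once the main effects are stacked with their below-threshold interactions. The sketch below shows one way to build that design matrix; the argument names are hypothetical, since we do not have the original data file, and the intercept β_0 is assumed to be added separately by the fitting routine.

```python
import numpy as np

def build_design(gdp60, Z, threshold=2898.0):
    """Design matrix for model (16): main effects plus below-threshold interactions.

    gdp60 : (n,) array of GDP per capita in 1960
    Z     : (n, 45) array of socioeconomic covariates
    """
    log_gdp60 = np.log(gdp60)
    below = (gdp60 < threshold).astype(float)     # indicator 1(GDP60_i < 2898)
    main = np.column_stack([log_gdp60, Z])        # beta_1 and beta_2 terms
    shifted = np.column_stack([np.ones_like(below), log_gdp60, Z])
    interact = below[:, None] * shifted           # delta_0, delta_1, delta_2 terms
    return np.column_stack([main, interact])
```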

We apply the proposed CIS-PSE and the PSE of Gao et al. (2017) to detect S_1. Additionally, CIS-PSE is used to further identify S_WBC. Effects of covariates in Ŝ_1 are estimated by Lasso, adaptive Lasso, PSE and CIS-PSE, and effects of covariates in Ŝ_WBC are estimated by CIS-PSE. The sample correlations between variables in Ŝ_1 and Ŝ_WBC are also provided. Table 4 reports the selected variables and their estimated coefficients.

Next, we evaluate the accuracy of the predicted GR using leave-one-out cross-validation: each country in turn serves as the test set, while all other countries form the training set. We apply Lasso, adaptive Lasso, PSE and CIS-PSE, with all tuning parameters selected as described in Section 4. The prediction results in Figure 4 show that CIS-PSE has the smallest prediction errors compared with PSE, Lasso and adaptive Lasso, with Ŝ_1 detected by either Lasso or adaptive Lasso.

Figure 4: Prediction errors from post-selection shrinkage estimators: CIS-PSE, PSE and two penalized estimators (Lasso and adaptive Lasso). Ŝ_1 is detected by Lasso in the left panel and by adaptive Lasso in the right panel.

6. Discussion

To improve the estimation and prediction accuracy in high-dimensional linear regression, we introduce the concept of weak but correlated (WBC) signals, which are commonly missed by Lasso-type variable selection methods. We show that these variables can be easily detected with the help of their partial correlations with strong signals. We propose the CIS-PSE procedure for high-dimensional variable selection and estimation, particularly for WBC signal detection and estimation, and show that incorporating WBC signals significantly improves estimation and prediction accuracy.

An alternative approach to weak signal detection would be to group weak signals according to a known group structure and then select them by their grouped effects (Bodmer and Bonilla, 2008; Li and Leal, 2008; Wu et al., 2011; Yuan and Lin, 2006). However, grouping strategies require prior knowledge of the group structure and, in some situations, may not amplify the grouped effects of weak signals. For example, as pointed out in Buhlmann et al. (2013) and Shah and Samworth (2013), when a pair of highly negatively correlated variables are grouped together, they cancel out each other's effects. Our CIS-PSE method, by contrast, is based on detecting partial correlations and can accommodate such "canceling out" scenarios. Hence, when the grouping structure is known, it is worth combining the grouping strategy with CIS-PSE for weak signal detection. We will pursue this in future work.

Table 3:

Comparison of false positives (standard deviations in parentheses) between Examples 1 and 3

p = 200 p = 300 p = 400 p = 500
Example 1 3.7 (2.4) 5.1 (2.7) 6.9 (3.1) 8.8 (3.3)
Example 3 4.6 (3.5) 6.1 (3.4) 7.8 (3.8) 10.9 (4.2)

7. Appendix

We provide technical proofs for Theorem 3.1, Corollary 3.2 and lemmas in this section. We first list some definitions and auxiliary lemmas.

Definition 7.1. A random vector Z = (Z_1, …, Z_p) is sub-Gaussian with mean vector μ and variance proxy σ_z² if, for any a ∈ ℝ^p, E[exp{a^T(Z − μ)}] ≤ exp(σ_z²‖a‖_2²/2).

Let Z be a sub-Gaussian random variable with variance proxy σ_z². The sub-Gaussian tail inequality states that, for any t > 0,

\mathrm{P}(Z > t) \le e^{-\frac{t^2}{2\sigma_z^2}} \quad \text{and} \quad \mathrm{P}(Z < -t) \le e^{-\frac{t^2}{2\sigma_z^2}}.

The following Lemma 7.2 ensures that the set of signals with non-vanishing marginal sample correlations with y coincides with S_1 with probability tending to 1. Therefore, evaluating (4) for a covariate j amounts to estimating the nonzero Ω_{jj′}'s for every j′ ∈ S_1. Let r_j be the rank of variable j according to the magnitude of |cov̂(X_j, y)|, j = 1, …, p. Denote by S̃_1(k) = {j : r_j ≤ k} the set of the first k covariates with the largest absolute marginal correlations with y, for k = 1, …, p. Write s_1 = |S_1|.

Lemma 7.2. Under Assumption (A5), we have

\lim_{n\to\infty} \mathrm{P}\big(\tilde{S}_1(s_1) = S_1\big) = 1.

Proof of Lemma 7.2. By the definition of S̃_1(s_1), it suffices to show that, with probability tending to 1 as n → ∞,

\max\Big|\tfrac{1}{n}X_{S_1^c}^T y\Big| < \min\Big|\tfrac{1}{n}X_{S_1}^T y\Big|.

Since y = Xβ* + ε = X_{S_0}β^*_{S_0} + ε = X_{S_1}β^*_{S_1} + X_{S_2}β^*_{S_2} + ε, we have

\tfrac{1}{n}X_{S_1^c}^T y = \tfrac{1}{n}X_{S_1^c}^T X_{S_1}\beta^*_{S_1} + \tfrac{1}{n}X_{S_1^c}^T X_{S_2}\beta^*_{S_2} + \tfrac{1}{n}X_{S_1^c}^T \varepsilon.

Notice that for each j ∈ S_1^c, (1/n) x_j^T ε → cov(X_j, ε) = 0 in probability; hence max|(1/n) X_{S_1^c}^T ε| = o_P(1) as n → ∞.

It follows that when n → ∞,

\max\Big|\tfrac{1}{n}X_{S_1^c}^T y\Big| \le \max\big|\Sigma_{S_1^c S_1}\beta^*_{S_1}\big| + \max\big|\Sigma_{S_1^c S_2}\beta^*_{S_2}\big| + o_P(1).

Similarly, when n → ∞,

\min\Big|\tfrac{1}{n}X_{S_1}^T y\Big| \ge \min\big|\Sigma_{S_1 S_1}\beta^*_{S_1}\big| - \min\big|\Sigma_{S_1 S_2}\beta^*_{S_2}\big| + o_P(1).

Lemma 7.2 is concluded by combining the above two inequalities with the faithfulness condition. ◻

Bickel and Levina (2008) showed that, for α = O(log|C_{[j]}|/n), ‖Σ̃^α_{[j]} − Σ_{[j]}‖ = O_P(ρ_n), where ρ_n = O(n^{−1}C_max^{1/4}) with C_max given in (A6). Furthermore, Bickel and Levina (2008) and Fan et al. (2011) showed that the estimation error for each connected component of the precision matrix is bounded as

\big\|\hat{\Omega}_{[j]} - \Omega_{[j]}\big\| = \big\|\big(\tilde{\Sigma}^{\alpha}_{[j]}\big)^{-1} - \Sigma_{[j]}^{-1}\big\| = O_P(\rho_n). \qquad (17)

Here we adopt the recursive labeling algorithm of Shapiro and Stockman (2002) to detect the connected components of the thresholded sample covariance matrix.
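Any standard connected-component labeling routine serves the same purpose; the breadth-first-search sketch below is our own generic version, not Shapiro and Stockman's exact pseudocode.

```python
import numpy as np
from collections import deque

def label_components(adj):
    """Label the connected components of a symmetric boolean adjacency matrix by BFS."""
    p = adj.shape[0]
    labels = np.full(p, -1)
    current = 0
    for start in range(p):
        if labels[start] != -1:
            continue
        labels[start] = current
        queue = deque([start])
        while queue:
            k = queue.popleft()
            for kk in np.flatnonzero(adj[k]):
                if labels[kk] == -1:
                    labels[kk] = current
                    queue.append(kk)
        current += 1
    return labels
```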

Without loss of generality, suppose that the strong signals in Ŝ_1 belong to distinct connected components of Σ̃^α. We rearrange the indices in Ŝ_1 as {1, …, |Ŝ_1|} and write the submatrix of Σ̃^α corresponding to the component of j, 1 ≤ j ≤ |Ŝ_1|, as Σ̃^α_j. For notational convenience, we rewrite

\hat{\Omega} = \mathrm{diag}\Big(\big(\tilde{\Sigma}^{\alpha}_{1}\big)^{-1}, \dots, \big(\tilde{\Sigma}^{\alpha}_{|\hat{S}_1|}\big)^{-1}, 0\Big)_{p \times p}.

The following Lemma 7.3 is useful for controlling the size of C^[j].

Lemma 7.3. Under (A6)-(A7), when x follows a multivariate sub-Gaussian distribution, we have P(|Ĉ_{[j]}| ≤ O(n^ξ)) ≥ 1 − C_1 n^ξ exp(−C_2 n^{1+ξ}) for some positive constants C_1 and C_2.

Lemma 7.3 follows directly from Lemma 2.1 and Assumption (A6). Next we prove Theorem 3.1.

Proof of Theorem 3.1. Notice that β_j^* = Σ_{j′∈C_{[j]}} Ω_{jj′} cov(X_{j′}, y) and β̂_j = Σ_{j′∈Ĉ_{[j]}} Ω̂_{jj′} cov̂(X_{j′}, y). Consider a sequence of thresholding parameters ν_n = O(n^{3ξ/2}) and a decreasing sequence of positive numbers u_n = 1 + n^{−ξ/4}, so that lim_{n→∞} u_n = 1. Then

\mathrm{P}\Big(\{j \notin S_1 : |\beta_j^*| > \nu_n u_n\} \subseteq \{j \notin S_1 : |\hat{\beta}_j| > \nu_n\}\Big) \ge 1 - \mathrm{P}\Big(\bigcup_{j \notin S_1,\, |\beta_j^*| > \nu_n u_n} \{|\hat{\beta}_j| \le \nu_n\}\Big). \qquad (18)

Moreover, on the event {|β̂_j| ≤ ν_n} ∩ {|β_j^*| > ν_n u_n}, we have |β̂_j − β_j^*| ≥ ν_n(u_n − 1). As a result,

1 - \mathrm{P}\Big(\bigcup_{j \notin S_1,\, |\beta_j^*| > \nu_n u_n} \{|\hat{\beta}_j| \le \nu_n\}\Big) \ge 1 - \sum_{j \notin S_1,\, |\beta_j^*| > \nu_n u_n} \mathrm{P}\big(|\hat{\beta}_j - \beta_j^*| \ge \nu_n(u_n - 1)\big). \qquad (19)

Notice that, from Lemma 2.1, P(C_{[j]} = Ĉ_{[j]}) ≥ 1 − C_1 n^ξ exp(−C_2 n^ξ) for some positive constants C_1, C_2 and 0 < ξ < 1. Therefore, we further have

\begin{aligned}
\mathrm{P}\big(|\hat{\beta}_j - \beta_j^*| \ge \nu_n(u_n-1)\big)
&= \mathrm{P}\Big(\Big|\sum_{j' \in \hat{C}_{[j]}}\hat{\Omega}_{jj'}\,\widehat{\mathrm{cov}}(X_{j'}, y) - \sum_{j' \in C_{[j]}}\Omega_{jj'}\,\mathrm{cov}(X_{j'}, y)\Big| \ge \nu_n(u_n-1)\Big) \\
&\le \mathrm{P}\Big(\Big|\sum_{j' \in \hat{C}_{[j]}}\hat{\Omega}_{jj'}\,\widehat{\mathrm{cov}}(X_{j'}, y) - \sum_{j' \in \hat{C}_{[j]}}\Omega_{jj'}\,\mathrm{cov}(X_{j'}, y)\Big| \ge \nu_n(u_n-1)\Big) + C_1 n^{\xi}\exp(-C_2 n^{\xi}) \\
&= \mathrm{P}\Big(\Big|\sum_{j' \in \hat{C}_{[j]}}\big[(\hat{\Omega}_{jj'} - \Omega_{jj'})\widehat{\mathrm{cov}}(X_{j'}, y) + \Omega_{jj'}\big(\widehat{\mathrm{cov}}(X_{j'}, y) - \mathrm{cov}(X_{j'}, y)\big)\big]\Big| \ge \nu_n(u_n-1)\Big) + C_1 n^{\xi}\exp(-C_2 n^{\xi}) \\
&\le \mathrm{P}\Big(\sum_{j' \in \hat{C}_{[j]}}\big|(\hat{\Omega}_{jj'} - \Omega_{jj'})\widehat{\mathrm{cov}}(X_{j'}, y)\big| \ge \nu_n(u_n-1)/2\Big) + \mathrm{P}\Big(\sum_{j' \in \hat{C}_{[j]}}\big|\Omega_{jj'}\big(\widehat{\mathrm{cov}}(X_{j'}, y) - \mathrm{cov}(X_{j'}, y)\big)\big| \ge \nu_n(u_n-1)/2\Big) + C_1 n^{\xi}\exp(-C_2 n^{\xi}). \qquad (20)
\end{aligned}

By (17) and Assumption (A6), ‖Ω̂ − Ω‖ ≤ M n^{−(1−ξ/4)} for some 0 < M < ∞. As n → ∞, the first term in (20) can be bounded as follows:

\begin{aligned}
\mathrm{P}\Big(\sum_{j' \in \hat{C}_{[j]}}\big|(\hat{\Omega}_{jj'} - \Omega_{jj'})\,\widehat{\mathrm{cov}}(X_{j'}, y)\big| > \nu_n(u_n-1)/2\Big)
&\le \mathrm{P}\Big(\tfrac{1}{n^2}\, y^T X_{\hat{C}_{[j]}} (\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}}^T (\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}} X_{\hat{C}_{[j]}}^T y > \tfrac{\nu_n^2(u_n-1)^2}{4}\Big) \\
&\le \mathrm{P}\Big(\lambda_{\max}\big(\tfrac{1}{n} X_{\hat{C}_{[j]}} (\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}}^T (\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}} X_{\hat{C}_{[j]}}^T\big)\,\tfrac{1}{n} y^T y > \tfrac{\nu_n^2(u_n-1)^2}{4}\Big) \\
&\le \mathrm{P}\Big(\lambda_{\max}\big(\tfrac{1}{n} X_{\hat{C}_{[j]}} X_{\hat{C}_{[j]}}^T\big)\,\lambda_{\max}\big((\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}}^T (\hat{\Omega} - \Omega)_{j\hat{C}_{[j]}}\big)\,\tfrac{1}{n} y^T y > \tfrac{\nu_n^2(u_n-1)^2}{4}\Big). \qquad (21)
\end{aligned}

Notice that E[y^T y/n] = E(y²) = σ² + (β*)^T Σ β* ≤ σ² + λ_max(Σ)‖β*‖² < C̃_1 for some positive constant C̃_1. For sufficiently large n, λ_max(X_{Ĉ_{[j]}} X_{Ĉ_{[j]}}^T/n) ≤ λ_max(Σ̂) ≤ κ_2 + λ_max(Σ̂ − Σ) = κ_2 + ‖Σ̂ − Σ‖^{1/2} = κ_2 + O(ρ_n^{1/2}) < C̃_2 for some positive constant C̃_2, and λ_max((Ω̂ − Ω)_{jĈ_{[j]}}^T (Ω̂ − Ω)_{jĈ_{[j]}}) ≤ ‖Ω̂ − Ω‖² ≤ M² n^{−(2−ξ/2)}. Therefore, from (21), for sufficiently large n,

\begin{aligned}
\mathrm{P}\Big(\sum_{j' \in \hat{C}_{[j]}}\big|(\hat{\Omega}_{jj'} - \Omega_{jj'})\,\widehat{\mathrm{cov}}(X_{j'}, y)\big| > \nu_n(u_n-1)/2\Big)
&\le \mathrm{P}\Big(\tilde{C}_2 M^2\,\tfrac{1}{n} y^T y > n^{2-\xi/2}\,\tfrac{\nu_n^2(u_n-1)^2}{4}\Big) \\
&\le \mathrm{P}\Big(\tfrac{1}{n} y^T y > \tfrac{1}{\tilde{C}_2 M^2}\,\tfrac{n^2}{4}\Big) \le \tfrac{4\tilde{C}_2 M^2\, E[y^T y/n]}{n^2} \le \tfrac{4\tilde{C}_1\tilde{C}_2 M^2}{n^2} \to 0, \qquad (22)
\end{aligned}

where the second-to-last step follows from applying the Markov inequality to the positive random variable y^T y/n.

For the second term in (20), let z̃ = (Z̃_1, …, Z̃_p)^T = (x_1^T y/n − cov(X_1, y), …, x_p^T y/n − cov(X_p, y))^T. Then we have

\begin{aligned}
\mathrm{P}\Big(\sum_{j' \in \hat{C}_{[j]}}\big|\Omega_{jj'}\big(x_{j'}^T y/n - \mathrm{cov}(X_{j'}, y)\big)\big| \ge \nu_n(u_n-1)/2\Big)
&\le \mathrm{P}\Big(\lambda_{\max}(\Omega^T\Omega)\,\big\|\tilde{z}_{\hat{C}_{[j]}}\big\|_2^2 \ge \tfrac{\nu_n^2(u_n-1)^2}{4}\Big) \\
&\le \mathrm{P}\Big(|\tilde{Z}_{j'}| > \tfrac{\nu_n \kappa_1 (u_n-1)}{2|\hat{C}_{[j]}|}\Big)
\end{aligned}

for some j′ ∈ Ĉ_{[j]}. Notice that E[Z̃_{j′}] = E[X_{j′}y − cov(X_{j′}, y)] = 0. Also, E[Z̃_{j′}²] = Var[X_{j′}y] ≤ E[X_{j′}² y²] ≤ (E(X_{j′}⁴) + E(y⁴))/2 < C̃_3 for some positive constant C̃_3, as E(X_{j′}⁴) ≤ λ_max²(Σ) and E(y⁴) ≤ (λ_max(Σ)‖β*‖²)² + E(ε⁴) < ∞. Therefore, from Jensen's inequality, E[|Z̃_{j′}|] = E[(Z̃_{j′}²)^{1/2}] ≤ (E[Z̃_{j′}²])^{1/2} ≤ C̃_3^{1/2}. Then, using Lemma 7.3 to control the size of Ĉ_{[j]} and applying the Markov inequality to |Z̃_{j′}|, we have

\mathrm{P}\Big(|\tilde{Z}_{j'}| > \tfrac{\nu_n \kappa_1 (u_n-1)}{2|\hat{C}_{[j]}|}\Big) \le \mathrm{P}\Big(|\tilde{Z}_{j'}| > \tfrac{\kappa_1 n^{\xi/4}}{2}\Big) \le \tfrac{2E[|\tilde{Z}_{j'}|]}{\kappa_1 n^{\xi/4}} \le \tfrac{2\tilde{C}_3^{1/2}}{\kappa_1 n^{\xi/4}} \to 0. \qquad (23)

Plugging (22) and (23) into (20) and then plugging (20) into (19) gives

\lim_{n\to\infty}\mathrm{P}\Big(\{j \notin S_1 : |\beta_j^*| > \nu_n u_n\} \subseteq \{j \notin S_1 : |\hat{\beta}_j| > \nu_n\}\Big) = 1. \qquad (24)

By a similar argument, we also have

\lim_{n\to\infty}\mathrm{P}\Big(\{j \notin S_1 : |\beta_j^*| > \nu_n/u_n\} \supseteq \{j \notin S_1 : |\hat{\beta}_j| > \nu_n\}\Big) = 1. \qquad (25)

Combining (24) and (25), we have

\lim_{n\to\infty}\mathrm{P}\big(\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}} \,\big|\, \hat{S}_1 = S_1\big) = 1. \qquad \square

Proof of Corollary 3.2. Notice that

\begin{aligned}
\mathrm{P}\big(\{\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}}\} \cap \{\hat{S}_2^* = S_2^*\} \cap \{\hat{S}_1 = S_1\}\big)
&= \mathrm{P}\big(\hat{S}_2^* = S_2^* \,\big|\, \{\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}}\} \cap \{\hat{S}_1 = S_1\}\big)\,\mathrm{P}\big(\{\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}}\} \cap \{\hat{S}_1 = S_1\}\big) \\
&= \mathrm{P}\big(\hat{S}_2^* = S_2^* \,\big|\, \hat{S}_{1WBC} = S_{1WBC}\big)\,\mathrm{P}\big(\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}} \,\big|\, \hat{S}_1 = S_1\big)\,\mathrm{P}\big(\hat{S}_1 = S_1\big). \qquad (26)
\end{aligned}

Under the SRC in (A4), by Lemma 1 in Gao et al. (2017) or Theorem 2 in Zhang and Huang (2008),

\lim_{n\to\infty}\mathrm{P}\big(\hat{S}_1 = S_1\big) = 1. \qquad (27)

From Theorem 3.1,

\lim_{n\to\infty}\mathrm{P}\big(\hat{S}_{\mathrm{WBC}} = S_{\mathrm{WBC}} \,\big|\, \hat{S}_1 = S_1\big) = 1. \qquad (28)

Equations (27) and (28) together give lim_{n→∞} P(Ŝ_WBC = S_WBC) = 1. This further implies that P(S_1 ⊆ Ŝ_{1WBC} ⊆ S_{1WBC} ∪ S_2) → 1. Then, by Corollary 2 in Gao et al. (2017), we also have

\lim_{n\to\infty}\mathrm{P}\big(\hat{S}_2^* = S_2^* \,\big|\, \hat{S}_{1WBC} = S_{1WBC}\big) = 1. \qquad (29)

Combining (27), (28), (29) and (26) completes the proof. ◻

The following Tables 5–6 and Figures 5–6 give the selection, estimation and prediction results under Examples 1 and 2 when Ŝ_1 is selected by adaptive Lasso.

Table 5:

The performance of variable selection on S0 when S1^ is selected by adaptive Lasso

p = 200 p = 300 p = 400 p = 500
Example 1 TP CIS-PSE 61.4 (2.4) 61.1 (2.5) 61.0 (2.6) 61.1 (2.6)
PSE 41.2 (5.0) 34.4 (5.1) 25.7 (6.1) 22.6 (5.8)
FP CIS-PSE 2.5 (2.1) 4.6 (2.5) 6.7 (3.2) 8.6 (3.0)
PSE 15.1 (4.9) 19.7 (5.0) 22.3 (5.4) 28.0 (6.2)
Example 2 TP CIS-PSE 62.5 (0.8) 58.0 (2.2) 54.6 (2.9) 52.0 (3.5)
PSE 43.9 (3.8) 37 (4.2) 32.8 (4.3) 31.5 (4.2)
FP CIS-PSE 3.4 (2.6) 5.2 (3.0) 6.0 (3.5) 7.7 (4.1)
PSE 13.1 (3.7) 18.6 (4.7) 21.9 (5.5) 28.0 (6.1)

Table 6:

Mean squared prediction error (MSPE) of the predicted outcomes when S1^ is selected by adaptive Lasso

p 200 300 400 500
Example 1 CIS-PSE 2.16 (0.57) 3.06 (0.63) 3.30 (0.72) 3.60 (0.81)
PSE 3.91 (0.71) 4.85 (0.86) 5.71 (1.02) 6.27 (1.17)
adaptive Lasso 15.02 (4.37) 14.67 (3.90) 14.49 (4.00) 14.74 (4.29)
Example 2 CIS-PSE 2.69 (0.58) 2.96 (0.72) 3.23 (0.81) 3.31 (0.83)
PSE 3.77 (0.77) 4.76 (1.08) 5.50 (1.17) 5.82 (0.18)
adaptive Lasso 4.48 (0.79) 4.81 (0.91) 4.94 (0.97) 4.97 (1.01)

Figure 5: The mean squared error (MSE) of β̂_{S_WBC} for different p's when Ŝ_1 is selected by adaptive Lasso, under Example 1 (left panel) and Example 2 (right panel). Solid lines represent CIS-PSE, dashed lines PSE, and dotted lines adaptive Lasso.

Figure 6: The mean squared error (MSE) of β̂_{S_1} for different p's when Ŝ_1 is selected by adaptive Lasso, under Example 1 (left panel) and Example 2 (right panel). Solid lines represent CIS-PSE, dashed lines PSE, dotted lines the adaptive Lasso restricted estimator (RE), defined as β̂^{RE}_{Ŝ_1} = Σ̂_{Ŝ_1}^{-1} cov̂(x_{Ŝ_1}, y), dot-dashed lines adaptive Lasso, and long-dashed lines the WR estimator in (9).

Figure 7 shows the averaged sum of squared prediction error (SSPE) on the validation datasets across 500 independent experiments for different α’s.

References

  1. Barro R and Lee J (1994). Data set for a panel of 139 countries. http://admin.nber.org/pub/barro.lee/.
  2. Bickel P and Levina E (2008). Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604.
  3. Bodmer W and Bonilla C (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet., 40(6):695–701.
  4. Buhlmann P, Rütimann P, van de Geer S, and Zhang CH (2013). Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835–1858.
  5. Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc., 96:1348–1360.
  6. Fan J, Liao Y, and Min M (2011). High-dimensional covariance matrix estimation in approximate factor models. Annals of Statistics, 39(6):3320–3356.
  7. Gao X, Ahmed SE, and Feng Y (2017). Post selection shrinkage estimation for high-dimensional data analysis. Applied Stochastic Models in Business and Industry, 33(2):97–120.
  8. Li B and Leal S (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet., 83(3):311–321.
  9. Mazumder R and Hastie T (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 12:723–736.
  10. Shah RD and Samworth RJ (2013). Discussion of 'Correlated variables in regression: clustering and sparse estimation'. Journal of Statistical Planning and Inference, 143:1866–1868.
  11. Shao J, Wang Y, Deng X, and Wang S (2011). Sparse linear discriminant analysis by thresholding for high-dimensional data. Annals of Statistics, 39(2):1241–1265.
  12. Shapiro L and Stockman G (2002). Computer Vision. Prentice Hall.
  13. Tibshirani R (1996). Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B, 58:267–288.
  14. Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet., 89(1):82–93.
  15. Yuan M and Lin Y (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68:49–67.
  16. Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):1567–1594.
  17. Zhang CH and Huang J (2008). The sparsity and bias of the LASSO selection in high dimensional linear regression. Annals of Statistics, 36:1567–1594.
  18. Zhang CH and Zhang S (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76(1):217–242.
  19. Zou H (2006). The adaptive Lasso and its oracle properties. J. Am. Statist. Assoc., 101:1418–1429.
