Weak signals in high-dimension regression: detection, estimation and prediction

Yanming Li; Hyokyoung G Hong; S Ejaz Ahmed; Yi Li

doi:10.1002/asmb.2340

. Author manuscript; available in PMC: 2019 Oct 30.

Published in final edited form as: Appl Stoch Models Bus Ind. 2018 May 25;35(2):283–298. doi: 10.1002/asmb.2340

Weak signals in high-dimension regression: detection, estimation and prediction

Yanming Li ¹, Hyokyoung G Hong ^2,^*, S Ejaz Ahmed ³, Yi Li ¹

PMCID: PMC6821396 NIHMSID: NIHMS1015641 PMID: 31666801

Summary.

Regularization methods, including Lasso, group Lasso and SCAD, typically focus on selecting variables with strong effects while ignoring weak signals. This may result in biased prediction, especially when weak signals outnumber strong signals. This paper aims to incorporate weak signals in variable selection, estimation and prediction. We propose a two-stage procedure, consisting of variable selection and post-selection estimation. The variable selection stage involves a covariance-insured screening for detecting weak signals, while the post-selection estimation stage involves a shrinkage estimator for jointly estimating strong and weak signals selected from the first stage. We term the proposed method as the covariance-insured screening based post-selection shrinkage estimator. We establish asymptotic properties for the proposed method and show, via simulations, that incorporating weak signals can improve estimation and prediction performance. We apply the proposed method to predict the annual gross domestic product (GDP) rates based on various socioeconomic indicators for 82 countries.

Keywords: high-dimensional data, Lasso, post-selection shrinkage estimation, variable selection, weak signal detection

1. Introduction

Given n independent samples, we consider a high-dimensional linear regression model

y = X β + ε,

(1)

where y = (y₁, … , y_n)^T is an n-vector of responses, X = (X_ij)_n×p is an n × p random design matrix, β = (β₁, …, β_p)^T is a p-vector of regression coefficients and ε = (ε₁, … , ε_n)^T n is an n-vector of independently and identically distributed random errors with mean 0 and variance σ². Let $β^{*} = {(β_{1}^{*}, \dots, β_{p}^{*})}^{T}$ denote the true value of β. We write X = (x⁽¹⁾, …, x⁽ⁿ⁾)^T = (x₁, …, x_p), where x⁽ⁱ⁾ = (X_ij, … , X_ip)^T is the i-th row of X and x_j is the j-th column of X, for i = 1, …, n and j = 1, …, p. Without the subject index i, we write y, X_j and ε as the random variables underlying y_i, X_ij and ε_i, respectively. We assume that each X_j is independent of ε. We write x as the random vector underlying x⁽ⁱ⁾ and assume that x follows a p-dimensional multivariate sub-Gaussian distribution with mean zeros, variance proxy $σ_{x}^{2}$ , and covariance matrix Σ. Sub-Gaussian distributions contain a wide class of distributions such as Gaussian, binary and all bounded random variables. Therefore, our proposed framework can accommodate more data types, as opposed to the conventional Gaussian distributions.

We assume that model (1) is sparse. That is, the number of nonzero β* components is less than n. When p > n, the essential problem is to recover the set of predictors with nonzero coefficients. The past two decades have seen many regularization methods developed for variable selection and estimation in high-dimensional settings, including Lasso (Tibshirani, 1996), adaptive Lasso (Zou, 2006), group Lasso (Yuan and Lin, 2006), SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), among many others. Most regularization methods assume the restrictive β-min condition which requires that the strength of nonzero $β_{j}^{*}$ ′s is larger than a certain noise level (Zhang and Zhang, 2014). Hence, regularization methods may fail to detect weak signals with nonzero but small $β_{j}^{*}$ ′s, and this will result in biased estimates and inaccurate predictions, especially when weak signals outnumber strong signals.

Detection of weak signals is challenging. However, if weak signals are partially correlated with strong signals which satisfy the β-min condition, they may be more reliably detected. To elaborate on this idea, first notice that the regression coefficient $β_{j}^{*}$ can be written as

β_{j}^{*} = \sum_{1 \leq j^{'} \leq p} Ω_{j j^{'}} cov (X_{j^{'}}, y), j = 1, \dots, p,

(2)

where Ω_jj′ is the jj^′-th entry of of Ω = Σ⁻¹, the precision matrix of x. Let ρ_jj′ be the partial correlation of X_j and X_j′, i.e. the correlation between the residuals of X_j and X_j′ after regressing them on all the other X variables. It can be shown that $ρ_{j j^{'}} = - Ω_{j j^{'}} / \sqrt{Ω_{j j} Ω_{j^{'} j^{'}}}$ . Hence, that X_j and X_j′ are partially uncorrelated is equivalent to Ω_jj′ = 0. Assume that Ω is a sparse matrix with only a few nonzero entries in Ω. When the right hand side of (2) can be accurately evaluated, weak signals can be distinguished from those of noises. In high-dimensional settings, it is impossible to accurately evaluate $\sum_{1 \leq j^{'} \leq p} Ω_{j j^{'}} cov (X_{j^{'}}, y)$ . However, under the faithfulness condition that will be introduced in Section 3, a variable, say, indexed by j′, satisfing the β-min condition will have a nonzero cov(X_j′, y). Once we identify such strong signals, we set to discover variables that are partially correlated with them.

For brevity, we term weak signals which are partially correlated with strong signals as “weak but correlated” (WBC) signals. This paper aims to incorporate WBC signals in variable selection, estimation and prediction. We propose a two-stage procedure which consists of variable selection and post-selection estimation. The variable selection stage involves a covariance-insured screening for detecting weak signals, and the post-selection estimation stage involves a shrinkage estimator for jointly estimating strong and weak signals selected from the first stage. We call the proposed method as the covariance-insured screening based post-selection shrinkage estimator (CIS-PSE). Our simulation studies demonstrate that by incorporating WBC signals, CIS-PSE improves estimation and prediction accuracy. We also establish the asymptotic selection consistency of CIS-PSE.

The paper is organized as follows. We outline the proposed CIS-PSE method in Section 2 and investigate its asymptotic properties in Section 3. We evaluate the finite-sample performance of CIS-PSE via simulations in Section 4, and apply the proposed method to predict the annual gross domestic product (GDP) rates based on the socioeconomic status for 82 countries in Section 5. We conclude the paper with a brief discussion in Section 6. All technical proofs are provided in Appendix.

2. Methods

2.1. Notation

We use scripted upper-case letters, such as S, to denote the subsets of {1, … , p}. Denote by $| S |$ the cardinality of $S$ and by $S^{c}$ the complement of $S$ . For a vector v, we denote a subvector of v indexed by $S$ by $v_{S}$ . Let $X_{S} = (x_{j}, j \in S)$ be a submatrix of the design matrix X restricted to the columns indexed by $S$ . For the symmetric covariance matrix Σ, denote by $Σ_{S S^{'}}$ its submatix with the row and column indices restricted to subsets $S$ and $S^{'}$ , respectively. When $S = S^{'}$ , we write $Σ_{S} = Σ_{S S^{'}}$ for short. The notation also applies to its sample version $\hat{Σ}$ .

Denote by $G (V, E; Ω)$ the graph induced by Ω, where the node set is $V = {1, \dots, p}$ and the set of edges is denoted by $E$ . An edge is a pair of nodes, say, k and k′, with Ω_kk′ ≠ 0. For a subset $V_{l} \subset V$ , denote by Ω_l the principal submatrix of Ω with its row and column indices restricted to $V_{l}$ and by $E_{l}$ the corresponding edge set. The subgraph $G (V_{l}, E_{l}; Ω_{l})$ is a connected component of $G (V, E; Ω)$ if (i) any two nodes in $V_{l}$ are connected by edges in $E_{l}$ ; (ii) for any node $k \in V_{l}^{c}$ , there exists a node $k^{'} \in V_{l}$ such that k, k′ cannot be connected by any edges in $E$ .

For a symmetric matrix A, denote by tr(A) the trace of A, and denote by λ_min(A) and λ_max(A) the minimum and maximum eigenvalues of A. We define the operator norm and the Frobenius norm as $‖ A ‖ = λ_{max}^{1 / 2} (A^{T} A)$ and $‖ A ‖_{F} = t r {(A^{T} A)}^{1 / 2}$ , respectively. For a p-vector v, denote its L_q norm by $‖ v ‖_{q} = {(\sum_{j = 1}^{p} {| v_{j} |}^{q})}^{1 / q}$ with q ≥ 1. For two real numbers a and b, denote a∨b = max(a, b).

Denote the sample covariance matrix and the marginal sample covariance between X_j and y, j = 1, …, p, by

\hat{Σ} = \frac{1}{n} \sum_{i = 1}^{n} x^{(i)} {(x^{(i)})}^{T} and \hat{cov} (X_{j}, y) = \frac{1}{n} \sum_{i = 1}^{n} X_{i j} y_{i} .

For a vector V = (V₁, …,V_p)^T, denote $\hat{cov} (V, y) = {(\hat{cov} (V_{1}, y), \dots, \hat{cov} (V_{p}, y))}^{T}$ .

2.2. Defining strong and weak signals

Consider a low-dimensional linear regression model where p < n. The ordinary least squares (OLS) estimator ${\hat{β}}^{OLS} = {\hat{Σ}}^{- 1} \hat{cov} (x, y) = \hat{Ω} \hat{cov} (x, y)$ minimizes the prediction error, where $\hat{Ω} = {\hat{Σ}}^{- 1}$ is the empirical precision matrix. It is also known that ${\hat{β}}^{OLS}$ is an unbiased estimator of β* and yields the best outcome prediction ${\hat{y}}^{best} = X \hat{Ω} \hat{cov} (x, y)$ with the minimal prediction error.

However, when p > n, $\hat{Σ}$ becomes non-invertible, and thus β Cannot be estimated using all X variables. Let $S_{0} = {j : β_{j}^{*} \neq 0}$ be the true signal set and assume that $| S_{0} | < n$ . If $S$ were known, the predicted outcome, ${\hat{y}}^{best} = X_{S_{0}} {\hat{Σ}}_{S_{0}}^{- 1} \hat{cov} (x_{S_{0}}, y)$ , would have the smallest prediction error. In practice, $S_{0}$ is unknown and some variable selection method must be applied first to identify $S_{0}$ . We define the set of strong signals as

S_{1} = {j : | β_{j}^{*} | > c \sqrt{log p / n} for some c > 0, 1 \leq j \leq p}

(3)

and let $S_{2} = S_{0} \ S_{1}$ be the set of weak signals. Then, the OLS estimator and the best outcome prediction are given by

{\hat{β}}^{OLS} = (\begin{matrix} {\hat{β}}_{S_{1}}^{OLS} \\ {\hat{β}}_{S_{2}}^{OLS} \end{matrix}) = (\begin{matrix} {\hat{Ω}}_{11} \hat{cov} (x_{S_{1}}, y) + {\hat{Ω}}_{12} \hat{cov} (x_{S_{2}}, y) \\ {\hat{Ω}}_{21} \hat{cov} (x_{S_{1}}, y) + {\hat{Ω}}_{22} \hat{cov} (x_{S_{2}}, y) \end{matrix})

and

{\hat{y}}^{best} = X_{S_{1}} {\hat{Ω}}_{11} \hat{cov} (x_{S_{1}}, y) + X_{S_{2}} {\hat{Ω}}_{21} \hat{cov} (x_{S_{1}}, y) + X_{S_{1}} {\hat{Ω}}_{12} \hat{cov} (x_{S}_{2}, y) + X_{S_{2}} {\hat{Ω}}_{22} \hat{cov} (x_{S_{2}}, y),

where $(\begin{matrix} {\hat{Ω}}_{11} & {\hat{Ω}}_{12} \\ {\hat{Ω}}_{21} & {\hat{Ω}}_{22} \end{matrix}) = {(\begin{matrix} {\hat{Σ}}_{S_{1}} & {\hat{Σ}}_{S_{1} S_{2}} \\ {\hat{Σ}}_{S_{2} S_{1}} & {\hat{Σ}}_{S_{2}} \end{matrix})}^{- 1}$ is the partitioned empirical precision matrix. We observe that the partial Correlations between the variables in $S_{1}$ and $S_{2}$ contribute to the estimation of $β_{S_{1}}$ and $β_{S_{2}}$ , and outcome prediction. Therefore, incorporating WBC signals helps reduce the estimation bias and prediction error.

We now further divide $S_{2}$ into $S_{WBC}$ and $S_{2 *}$ . Here, $S_{WBC}$ is the set of weak signals which have nonzero partial correlations with the signals in $S_{1}$ and $S_{2 *}$ is the set of weak signals which are not partially orrelated with signals in $S_{1}$ . Formally, with c given in (3),

S_{WBC} = {j : 0 < | β_{j}^{*} | < c \sqrt{log p / n} and Ω_{j j^{'}} \neq 0 for some j^{'} \in S_{1}, 1 \leq j \leq p}

and

S_{2 *} = {j : 0 < | β_{j}^{*} | < c \sqrt{log p / n} and Ω_{j j^{'}} = 0 for any j^{'} \in S_{1}, 1 \leq j \leq p} .

Thus, p predictors can be partitioned as $S_{1} \cup S_{WBC} \cup S_{2 *} \cup S_{null} = {1, \dots, p}$ , where $S_{null} = {j : β_{j}^{*} = 0}$ . We assume that $| S_{1} | = p_{1}, | S_{WBC} | = p_{WBC}$ and $| S_{2 *} | = p_{2 *}$ .

2.3. Covariance-insured screening based post-selection shrinkage estimator (CIS-PSE)

Our proposed CIS-PSE method consists of the variable selection and post-shrinkage estimation steps.

Variable selection:

First, we detect strong signals by regularization methods such as Lasso or adaptive Lasso. Denote by $\hat{S_{1}}$ the set of detected strong signals. To identify WBC signals, we evaluate (2) for each $j \in {\hat{S}}_{1}^{c}$ . When there is no confusion, we use a j′ to denote a strong signal.

Though estimating cov(X_j′, y) for every 1 ≤ j′ ≤ p can be easily done, identifying and estimating nonzero entries in Ω is still challenging in high-dimensional settings. However, for identifying WBC signals, it is unnecessary to estimate the whole Ω matrix. Leveraging intra-feature correlations among predictors, we introduce a computationally efficient method for detecting nonzero Q_jj′’s.

Variables that are partially correlated with signals in $\hat{S_{1}}$ form the connected components of $G (V, E; Ω)$ that contain at least one element of $\hat{S_{1}}$ . Therefore, for detecting WBC signals, it suffices to focus on such connected components. Under the sparsity assumptions of β* and Ω, the size of such connected components is relatively small. For example, as shown in Figure 1, the first two diagonal blocks of a moderate size are relevant for detection of WBC signals.

Figure 1: — An illustrative example of marginally strong signals and their connected components in $G (V, E; Ω)$ . Left panel: structure of Ω; Middle panel: structure of Ω after properly reordering the row and column indices of Ω; Right panel: the corresponding graph structure and connected components of the strong signals. Signals in $S_{1}$ are colored red. Signals in $S_{2 *}$ are colored orange. WBC signals in $S_{WBC}$ are colored yellow.

Under the sparsity assumption of Ω, the connected components of Ω can be inferred from those of the thresholded sample covariance matrix (Mazumder and Hastie, 2012; Bickel and Levina, 2008; Fan et al., 2011; Shao et al., 2011), which is much easier to estimate and can be calculated in a parallel manner. Denote by ${\tilde{Σ}}^{α}$ the thresholded sample covariance matrix with a thresholding parameter α, where ${\tilde{Σ}}_{k k^{'}}^{α} = {\hat{Σ}}_{k k^{'}} 1 {| {\hat{Σ}}_{k k^{'}} | \geq α}, 1 \leq k, k^{'} \leq p$ with 1(·) being the indicator function. Denote by $G (V, \tilde{E}; {\tilde{Σ}}^{α})$ the graph corresponding to ${\tilde{Σ}}^{α}$ . For variable k, 1 ≤ k ≤ p, denote by $C_{[k]}$ the vertex set of the connected component in $G (V, E; Ω)$ containing k. If variables k and k′ belong to the same connected component, 1 ≤ k ≠ k′ ≤ p, then $C_{[k]} = C_{[k^{'}]}$ . For example, $C_{[14]} = C_{[16]} = C_{[24]} = C_{[26]} = C_{[29]}$ in the third panel of Figure 1. Clearly, when $k^{'} \notin C_{[k]}, Ω_{k k^{'}} = 0$ , evaluating (2) is equivalent to estimating

β_{j}^{*} = \sum_{j^{'} \in C_{[j]}} Ω_{j j^{'}} cov (X_{j^{'}}, y), j = 1, \dots, p .

(4)

Correspondingly, for a variable k, 1 ≤ k ≤ p, denote by ${\hat{C}}_{[k]}$ the vertex set of the connected component in $G (V, \tilde{E}; {\tilde{Σ}}^{α})$ containing k. For a multivariate Gaussian x, Mazumder and Hastie (2012) showed that $C_{[k]}$ ′s can be exactly recovered from $C_{[k]}$ ′s with a properly chosen α. For a multivariate sub-Gaussian x, we refer to the following lemma.

Lemma 2.1. Suppose that the maximum size of a connected component in Ω containing a variable in $S_{0}$ is of order O(exp(n^ξ)), for some 0 < ξ < 1, then under Assumption (A7) specified in Section 3, with an $α = O (\sqrt{n^{ξ - 1}})$ and for any variable k, 1 ≤ k ≤ p, we have

P (C_{[k]} = {\hat{C}}_{[k]}) \geq 1 - C_{1} n^{ξ} exp (- C_{2} n^{1 + ξ}) \to 1

(5)

for some positive constants C₁ and C₂.

We summarize the variable selection procedure for $S_{1}$ and $S_{WBC}$ .

Step S1: Detection of $S_{1}$ . Obtain a candidate subset $\hat{S_{1}}$ of strong signals using a penalized regression method. Here, we consider the penalized least squares (PLS) estimator from Gao et al. (2017):

{\hat{β}}^{PLS} = \underset{β}{argmin} {‖ y - X β ‖_{2}^{2} + \sum_{j = 1}^{p} {Pen}_{λ} (β_{j})},

(6)

where Pen_λ(β_j) is a penalty on each individual β_j to shrink the weak effects toward zeros and select the strong signals, with the tuning parameter λ > 0 controlling the size of the candidate subset $\hat{S_{1}}$ . Commonly used penalties are Pen_λ(β_j) = λ|β_j| and Pen_λ(β_j) = λω_j |β_j| for Lasso and adaptive Lasso, where ω_j > 0 is a known weight.

Step 2: Detection of $S_{W B C}$ . First, for a given threshold α, construct a sparse estimate of the covariance matrix ${\tilde{Σ}}^{α}$ . Next, for each selected variable j′ in $\hat{S_{1}}$ from Step S1, detect ${\hat{C}}_{[j^{'}]}$ , the node set, corresponding to its connected component in $G (V, \tilde{E}; {\tilde{Σ}}^{α})$ . Let $U (α, {\hat{S}}_{1})$ be the union of the vertex sets corresponding to those connected components detected. That is, $U (α, {\hat{S}}_{1}) = \cap_{j^{'} \in {\hat{S}}_{1}} {\hat{C}}_{[j^{'}]}$ . Then according to (4), it suffices to identify WBC signals within $U (α, {\hat{S}}_{1})$ . Specifically, for each $j \in {\hat{S}}_{1}^{c} \cap U (α, {\hat{S}}_{1})$ , let ${\tilde{Σ}}_{[j]}^{α}$ be the submatrix from restricting ${\tilde{Σ}}^{α}$ on ${\hat{C}}_{[j]}$ . We then evaluate (4) and select WBC variables by

{\hat{S}}_{WBC} = {j \in {\hat{S}}_{1}^{c} \cap U (α, {\hat{S}}_{1}) : | \sum_{j^{'} \in {\hat{C}}_{[j]}} {({\tilde{Σ}}_{[j]}^{α})}_{j j^{'}}^{- 1} \hat{cov} (X_{j^{'}}, y) | \geq ν_{n}}

(7)

for some pre-specified ν_n. Here ${({\tilde{Σ}}_{[j]}^{α})}_{j j^{'}}^{- 1}$ denotes the entry of ${({\tilde{Σ}}_{[j]}^{α})}^{- 1}$ corresponding to variables j and j′. In our numerical studies, we rank the magnitude of variable $j \in {\hat{S}}_{1}^{c} \cap U (α, {\hat{S}}_{1})$ according to the magnitude of $| \sum_{j^{'} \in {\hat{C}}_{[j]}} {({\tilde{Σ}}_{[j]}^{α})}_{j j^{'}}^{- 1} \hat{cov} (X_{j^{'}}, y) |$ and select up to the first $n - | {\hat{S}}_{1} |$ variables.

Step 3: Detection of $S_{2 *}$ . To identify ${\hat{S}}_{2 *}$ , we first solve a regression problem with a ridge penalty only on variables in ${\hat{S}}_{1 WBC}^{c}$ , where ${\hat{S}}_{1 WBC} = {\hat{S}}_{1} \cup {\hat{S}}_{WBC}$ . That is,

{\hat{β}}^{r} = \underset{β}{argmin} {‖ y - X β ‖_{2}^{2} + {\tilde{λ}}_{n} ‖ β_{{\hat{S}}_{1}^{c}_{WBC}} ‖_{2}^{2}},

(8)

where ${\tilde{λ}}_{n} > 0$ is a tuning parameter controlling the overall strength of the variables selected in ${\hat{S}}_{1 WBC}^{c}$ . Then a post-selection weighted ridge (WR) estimator ${\hat{β}}^{WR}$ has the form

{\hat{β}}_{j}^{WR} = {\begin{cases} {\hat{β}}_{j}^{r}, & j \in {\hat{S}}_{1 WBC}, \\ {\hat{β}}_{j}^{r} 1 (| {\hat{β}}_{j}^{r} | > a_{n}), & j \in {\hat{S}}_{1 WBC}^{c}, \end{cases}

(9)

where a_n is a thresholding parameter. Then the candidate subset ${\hat{S}}_{2 *}$ is obtained by

{\hat{S}}_{2 *} = {j \in {\hat{S}}_{1 WBC}^{c} : {\hat{β}}_{j}^{WR} \neq 0, 1 \leq j \leq p} .

(10)

Post-selection shrinkage estimation:

We consider the following two cases when performing the post-selection shrinkage estimation.

Case 1: ${\hat{p}}_{1} + {\hat{p}}_{W B C} + {\hat{p}}_{2 *} < n$ . We obtain the CIS-PSE on ${\hat{S}}_{0}$ by

{\hat{β}}_{{\hat{S}}_{0}}^{CIS - PSE} = {\hat{Σ}}_{{\hat{S}}_{0}}^{- 1} \hat{cov} (x_{{\hat{S}}_{0}}, y),

(11)

where ${\hat{S}}_{0} = {\hat{S}}_{1} \cup {\hat{S}}_{WBC} \cup {\hat{S}}_{2 *}$ . Then ${\hat{β}}_{{\hat{S}}_{1}}^{ClS - PSE}$ and ${\hat{β}}_{{\hat{S}}_{WBC}}^{CIS - PSE}$ can be obtained by restricting ${\hat{β}}_{{\hat{S}}_{0}}^{CIS - PSE}$ to ${\hat{S}}_{1}$ and ${\hat{S}}_{WBC}$ , respectively.

Case 2: ${\hat{p}}_{1} + {\hat{p}}_{W B C} + {\hat{p}}_{2 *} > n$ . Recall that ${\hat{β}}_{{\hat{S}}_{1 WBC}}^{WR} = {({\hat{β}}_{j}^{r}, j \in {\hat{S}}_{1 WBC})}^{T}$ and ${\hat{β}}_{S_{2}}^{WR} = {({\hat{β}}_{j}^{r} 1 (| {\hat{β}}_{j}^{r} | > a_{n}), j \in {\hat{S}}_{2 *})}^{T}$ . We obtain the CIS-PSE of $β_{\hat{S}}_{_{1 WBC}}$ by

{\hat{β}}_{{\hat{S}}_{1 WBC}}^{CIS-PSE} = {\hat{β}}_{{\hat{S}}_{1 WBC}}^{WR} - (\frac{{\hat{s}}_{2} - 2}{{\hat{T}}_{n}} \land 1) ({\hat{β}}_{{\hat{S}}_{1 WBC}}^{WR} - {\hat{β}}_{{\hat{S}}_{1 WBC}}^{RE}),

(12)

Where ${\hat{s}}_{2} = | {\hat{S}}_{2 *} |$ and ${\hat{T}}_{n}$ is as defined by

{\hat{T}}_{n} = {({\hat{β}}_{{\hat{S}}_{2 *}}^{WR})}^{T} (X_{{\hat{S}}_{2 *}}^{T} M_{{\hat{S}}_{1 WBC}} X_{{\hat{S}}_{2 *}}) {\hat{β}}_{{\hat{S}}_{2 *}}^{WR} / {\hat{σ}}^{2},

(13)

where $M_{{\hat{S}}_{1 WBC}} = I_{n} - X_{{\hat{S}}_{1 WBC}} {(X_{{\hat{S}}_{1 WBC}}^{T} X_{{\hat{S}}_{1 WBC}})}^{- 1} X_{{\hat{S}}_{1 WBC}}^{T}$ and ${\hat{σ}}^{2} = \sum_{i = 1}^{n} {(y_{i} - X_{{\hat{S}}_{2}}^{T} {\hat{β}}_{{\hat{S}}_{2}}^{WR})}^{2} / (n - {\hat{S}}_{2})$ . If $X_{{\hat{S}}_{1 WBC}}^{T} X_{{\hat{S}}_{1 WBC}}$ is singular, we replace ${(X_{{\hat{S}}_{1 WBC}}^{T} X_{{\hat{S}}_{1 WBC}})}^{- 1}$ with a generalized inverse. Then ${\hat{β}}_{{\hat{S}}_{1}}^{CIS - PSE}$ and ${\hat{β}}_{{\hat{S}}_{WBC}}^{CIS - PSE}$ can be obtained by restricting ${\hat{β}}_{{\hat{S}}_{1}}^{CIS - PSE}$ to $\hat{S_{1}}$ and ${\hat{S}}_{WBC}$ , respectively.

Set ${\hat{S}}_{null} = {({\hat{S}}_{1} \cup {\hat{S}}_{WBC} \cup {\hat{S}}_{2 *})}^{c}$ . The final CIS-PSE estimator ${\hat{β}}^{CIS - PSE}$ is defined as

{\hat{β}}^{CIS - PSE} = {({({\hat{β}}_{{\hat{S}}_{1}}^{CIS - PSE})}^{T}, {({\hat{β}}_{{\hat{S}}_{WBC}}^{CIS - PSE})}^{T}, {({\hat{β}}_{{\hat{S}}_{2 *}}^{WR})}^{T}, 0_{{\hat{S}}_{null}})}^{T} .

(14)

2.4. Selection of tuning parameters

When selecting strong signals, the tuning parameter λ in Lasso or adaptive Lasso can be chosen by BIC (Zou, 2006). To choose ν_n for the selection of WBC signals according to (7), we rank variables $j \in {\hat{S}}_{1}^{c} \cap U (α, {\hat{S}}_{1})$ according to the magnitude of $| \sum_{j^{'} \in {\hat{C}}_{[j]}} {({\tilde{Σ}}_{[j]}^{α})}_{j j^{'}}^{- 1} \hat{cov} (X_{j^{'}}, y) |$ and select the first $r \leq n - | {\hat{S}}_{1} |$ variables to be ${\hat{S}}_{WBC}$ . Here, r is chosen so that ${\hat{S}}_{1 WBC}$ minimizes prediction errors on an independent validation dataset. For tuning parameter α, we set α = c₃ log(n), for some positive constant c₃, as suggested in Shao et al. (2011). Our empirical experiments show that α = c₃ log(n) tends to give the larger true positives and the smaller false positives in identifying WBC variables. Figure 7 in Appendix reveals that in order to find the optimal α that minimizes the prediction error on a validation dataset, it suffices to conduct a grid search with only a few proposed values of α. In our numerical studies, instead of thresholding the sample covariance matrix, we threshold the sample correlation matrix. As correlations are ranged between −1 and 1, it is easier to set a target range for α. To detect signals in $S_{2 *}$ , we follow Gao et al. (2017) to use cross-validation to choose ${\tilde{λ}}_{n}$ and a_n in (8) and (9), respectively. In particular, we set and ${\tilde{λ}}_{n} = c_{1} a_{n}^{- 2} {(loglog n)}^{3} log (n \lor p)$ a_n = c₂n^−1/8 for some positive constants c₁ and c₂. In the training dataset we fix the tuning parameters and fit the model, and in the validation dataset we compute the prediction error of the model. We repeat this procedure for various c₁ and c₂, and choose a pair that gives the smallest prediction error on the validation dataset.

Figure 7: — Sum of squared prediction error (SSPE) corresponding to different α’s.

3. Asymptotic properties

To investigate the asymptotic properties of CIS-PSE, we assume the following.

(A1) The random error ϵ has a finite kurtosis.

(A2) log(p) = O(n^ν) for some 0 < ν < 1.

(A3) There are positive constants κ₁ and κ₂ such that 0 < κ₁ < λ_miη(Σ) ≤ λ_max(Σ) < κ₂ < ∞.

(A4) Sparse Riesz condition (SRC): For the random design matrix X, any $S \subset {1, \dots, p}$ with $| S | = q, q \leq p$ , and any vector $v \in ℝ^{q}$ , there exist 0 < c_* < c* < ∞ such that $c_{*} \leq ‖ X_{S}^{T} v ‖_{2}^{2} / ‖ v ‖_{2}^{2} \leq c *$ holds with probability tending to 1.

(A5) Faithfulness Assumption: Suppose that

max | Σ_{S_{1}^{c} S_{1}} β_{S_{1}}^{*} | + max | Σ_{S_{1}^{c} S_{2}} β_{S_{2}}^{*} | + min | Σ_{S_{1} S_{2}} β_{S_{2}}^{*} | < min | Σ_{S_{1} S_{1}} β_{S_{1}}^{*} |,

where the absolute value function | · | is applied component-wise to its argument vector. The max and min operators are with respect to all individual components in the argument vectors.

(A6) Denote by $C_{max} = {max}_{1 \leq l \leq B} | V_{l} |$ the maximum size of the connected components in graph $G (V, E; Ω)$ that contains at least one signal in $S_{1}$ , where B is the number of such connected components. Assume C_max = O(η^ξ) for some ξ ∈ (0,1).

(A7) Assume ${min}_{(k, k^{'}) \in E} | Σ_{k k^{'}} | \geq C \sqrt{n^{ξ - 1}}$ for some constant C > 0 and ${max}_{(k, k^{'}) \notin E} | Σ_{k k^{'}} | = O (\sqrt{n^{ξ - 1}})$ for the ξ in (A6).

(A8) For any subset $V_{l} \subset {1, \dots, p}$ with $| V_{l} | = O (n), {sup}_{j} E {| X_{i j} 1 (j \in V_{l}) |}^{2 ν} < C_{ν} < \infty$ for some constant ν > 0.

(A9) Assume that $‖ β_{S_{2 *}}^{*} ‖_{2} = o (n^{τ})$ for some 0 < τ < 1, where || · ||₂ is the Euclidean norm.

(A1), a technical assumption for the asymptotic proofs, is satisfied by many parametric distributions such as Gaussian. The assumption is mild as we do not assume any parametric distributions for ε except that it has finite moments. (A2) and (A3) are commonly assumed in the high-dimensional literature. (A4) guarantees that $S_{1}$ can be recovered with probability tending to 1 as n → ∞ (Zhang and Huang, 2008). (A5) ensures that for all $j \in S_{1}$ , ${min}_{j \in S_{1}} | \hat{cov} (X_{j}, y) | > {max}_{j \in S_{1}^{c}} | \hat{cov} (X_{j}, y) |$ holds with probability tending to 1 (Lemma 4 in Genovese et al., 2012). (A6) implies that the size of each connected component of a strong signal, i.e., $C_{[j^{'}]}, j^{'} \in S_{1}$ , cannot exceed the order of exp(n^ξ) for some ξ ∈ (0,1). This assumption is required for estimating sparse covariance matrices. (A7) guarantees that with a properly chosen thresholding parameter α, X_k and X_k′ have non-zero thresholded sample covariances for $(k, k^{'}) \in E$ , and have zero thresholded sample covariances for $(k, k^{'}) \notin E$ . As a result, the connected components of the thresholded sample covariance matrix and those of the precision matrix can be detected with adequate accuracy. (A8) ensures that the precision matrix can be accurately estimated by inverting the thresholded sample covariance matrix; see Shao et al. (2011) and Bickel and Levina (2008) for details. (A9), which bounds the total size of weak signals on $S_{2 *}$ , is required for selection consistency on $S_{2 *}$ (Gao et al., 2017).

We show that given a consistently selected $S_{1}$ , we have selection consistency for $S_{WBC}$ .

Theorem 3.1. With (A1)-(A3) and (A6)-(A8),

lim_{n \to \infty} P ({\hat{S}}_{W B C} = S_{W B C} | {\hat{S}}_{1} = S_{1}) = 1.

The following corollary shows that Theorem 3.1, together with Theorem 2 in Zhang and Huang (2008) and Corollary 2 in Gao et al. (2017), further implies selection consistency for $S_{1} \cup S_{WBC} \cup S_{2 *}$ .

Corollary 3.2. Under Assumptions (A1)-(A9), we have

lim_{n \to \infty} P ({{\hat{S}}_{1} = S_{1}} \cap {{\hat{S}}_{W B C} = S_{W B C}} \cap {{\hat{S}}_{2 *} = S_{2 *}}) = 1.

Corollary 3.2 implies that CIS-PSE can recover the true set asymptotically. Thus, when $| S_{0} | < n$ , CIS-PSE gives an OLS estimator with probability going to 1 and has the minimum prediction error asymptotically, among all the unbiased estimators.

4. Simulation studies

We conduct simulations to compare the performance of the proposed CIS-PSE and the post-shrinkage estimator (PSE) by Gao et al. (2017). The key difference between CIS-PSE and PSE lies in that PSE focuses only on $S_{1}$ whereas CIS-PSE considers $S_{1} \cup S_{WBC}$ .

Data are generated according to (1) with

β^{*} = {(\overset{S_{1}}{\overset{︷}{20, 20, 20}}, \underset{30}{\underset{︸}{\overset{S_{WBC}}{\overset{︷}{0.5, \dots, 05}}}}, \underset{30}{\underset{︸}{\overset{S_{2*}}{\overset{︷}{0.5, \dots, 05}}}}, \underset{p - 63}{\underset{︸}{\overset{S_{null}}{\overset{︷}{0, \dots, 0}}}})}^{T} .

(15)

The random errors ϵ_i are independently generated from N(0,1). We consider the following examples.

Example 1: The first three variables, which belong to $S_{1}$ , are independently generated from N(0,1). The first ten, next ten and the last ten signals in $S_{WBC}$ belong to the connected component of X₁, X₂ and X₃, respectively. These three connected components are independent of each other. $S_{2 *}$ is independent of $S_{1}$ and $S_{WBC}$ . Each connected component within $S_{1} \cup S_{WBC}$ and $S_{2 *}$ are generated from a multivariate normal distribution with mean zeros, variance 1, and a compound symmetric (CS) correlation matrix with correlation coefficient of 0.7. Variables in $S_{null}$ are independently generated from N(0, 1).

Example 2: This example is the same as Example 1 except that the three connected components within $S_{1} \cup S_{WBC}$ and $S_{2 *}$ follow the first order autocorrelation (AR(1)) structure with correlation coefficient of 0.7.

Example 3: This example is the same as Example 1 except that there are 30 variables in $S_{null}$ (i.e., variables X₆₄-X₉₃) are set to be correlated with signals in $S_{1}$ . That is, X₆₄-X₇₃ are correlated with X₁, X₇₄-X₈₃ are correlated with X₂, and X₈₄-X₉₃ are correlated with X₃. These three connected components within $S_{1} \cup S_{null}$ have a CS correlation structure with correlation coefficient of 0.7.

For each example, we conduct 500 independent experiments with p=200, 300, 400 and 500. We generate a training dataset of size n = 200, a test dataset of size n = 100 to assess the prediction performance, and an independent validation dataset of size n = 100 for tuning parameter selection.

First, we compare CIS-PSE and PSE in selecting $S_{0}$ under Examples 1–2. We use Lasso and adaptive Lasso to select ${\hat{S}}_{1}$ . Since both Lasso and adaptive Lasso give similar results, we report only the Lasso results in this section and present the results of adaptive Lasso in Appendix. We report the number of correctly identified variables (TP) in $S_{0}$ and the number of incorrectly selected variables (FP) in $S_{0}^{c}$ . Table 1 shows that CIS-PSE outperforms PSE in identifying signals in $S_{0}$ . We observe that the performance of PSE deteriorates as p increases, whereas CIS-PSE selects $S_{0}$ signals consistently even when p increases.

Table 1:

The performance of variable selection on $S_{0}$

			p = 200	p = 300	p = 400	p = 500
Example 1	TP	CIS-PSE	59.6 (1.9)	58.7 (2.1)	57.9 (2.3)	57.7 (2.4)
	TP	PSE	41.2 (4.9)	34.4 (5.1)	25.7 (6.0)	22.6 (5.8)
	FP	CIS-PSE	3.7 (2.4)	5.1 (2.7)	6.9 (3.1)	8.8 (3.3)
	FP	PSE	13.3 (4.6)	18.9 (5.2)	21.7 (5.9)	26.1 (6.0)
Example 2	TP	CIS-PSE	63.0 (0)	62.9 (0.1)	62.9 (0.1)	62.9 (0.1)
	TP	PSE	43.9 (3.9)	37.0 (4.2)	32.8 (5.0)	31.5 (4.3)
	FP	CIS-PSE	3.5 (2.4)	5.0 (2.7)	6.3 (3.3)	8.1 (3.1)
	FP	PSE	12.7 (4.2)	19.5 (5.1)	22.1 (6.3)	27.4 (6.4)

Open in a new tab

TP=true positive; FP= false positive.

Next, we evaluate estimation accuracy on the targeted sub-model $S_{1} \cup S_{WBC}$ using the mean squared error (MSE) as the criterion under Examples 1–2. Figure 2 indicates that the proposed CIS-PSE detects WBC signals and provides more accurate and precise estimates. Figure 3 shows that CIS-PSE also improves the estimation of $β_{S_{1}}$ Compared to PSE.

Figure 2: — The mean squared error (MSE) of ${\hat{β}}_{S_{WBC}}$ for different p’s under Example 1 (Left panel) and Example 2 (Right panel). Solid lines represent CIS-PSE, dashed lines are for PSE, and dotted lines indicate Lasso.

Figure 3: — The mean squared error (MSE) of ${\hat{β}}_{S_{1}}$ for different p’s under Example 1 (Left panel) and Example 2 (Right panel). Solid lines represent CIS-PSE, dashed lines are for PSE, dotted lines indicate Lasso RE defined as ${\hat{β}}_{{\hat{S}}_{1}}^{RE} = {\hat{Σ}}_{{\hat{S}}_{1}}^{- 1} \hat{cov} (x_{{\hat{S}}_{1}}, y)$ , dot-dashed lines represent Lasso, and long-dashed lines are for WR in (9).

We explore the prediction performanCe under Examples 1–2 using the mean squared prediction error (MSPE), defined as $‖ \hat{y} - y^{test} ‖_{2}^{2} / n_{test} = ‖ X {\hat{β}}^{⋄} - y^{test} ‖_{2}^{2} / n_{test}$ , where ${\hat{β}}^{⋄}$ is obtained from the training data, y^test is the response variable for the test dataset, n_test is the size of test dataset, and ◊ represents either the proposed CIS-PSE or PSE. Table 2, whiCh summarizes the results, shows that CIS-PSE outperforms PSE, suggesting incorporating WBC signals helps to improve the prediction accuracy.

Table 2:

Mean squared prediction error (MSPE) of the predicted outcomes

	p	200	300	400	500
Example 1	CIS-PSE	3.17 (0.80)	3.19 (0.78)	3.25 (0.77)	3.32 (0.77)
	PSE	4.19 (0.83)	4.93 (1.02)	5.28 (1.07)	5.50 (1.16)
	Lasso	10.28 (5.68)	10.02 (5.58)	9.77 (4.96)	9.78 (4.54)
Example 2	CIS-PSE	0.65 (0.14)	0.92 (0.19)	1.30 (0.16)	2.43 (0.64)
	PSE	2.89 (0.61)	3.55 (0.75)	4.09 (0.68)	4.34 (0.97)
	Lasso	4.20 (0.79)	4.50 (0.88)	4.68 (0.90)	4.73 (0.97)

Open in a new tab

Lastly, we consider the setting where a subset of $S_{null}$ is correlated with a subset of $S_{1}$ ; see Example 3. Compared to Example 1, the results that are summarized in Table 4 show that the number of false positives only slightly increases, when some variables in $S_{null}$ are correlated with variables in $S_{1}$ .

Table 4:

Estimation results of $S_{1}$ and $S_{WBC}$ from the growth rate data

When $\hat{S_{1}}$ is selected by Lasso
$\hat{S_{1}}$	${\hat{β}}_{S_{1}}^{Lasso}$	${\hat{β}}_{S_{1}}^{PSE}$	${\hat{β}}_{S_{1}}^{CIS - PSE}$	${\hat{S}}_{WBC}$	${\hat{β}}_{S_{WBC}}^{CIS - PSE}$	$\hat{corr}$
TOT	0.06	2.85	3.73	-	-	-
LFERT	1.66	1.75	2.55	LLIFE	1.88	−.85
				NOM60	0.12	0.84
				NOF60	−0.10	0.83
				LGDPNOM60	−0.02	0.83
PRIF60	−0.001	−0.002	−0.12	LGDPPRIF60	0.02	0.99
				LGDPPRIM60	−0.02	0.93
				LGDPNOF60	0.02	−0.90
When $\hat{S_{1}}$ selected by adaptive Lasso
$\hat{S_{1}}$	${\hat{β}}_{S_{1}}^{Ada-Lasso}$	${\hat{β}}_{S_{1}}^{PSE}$	${\hat{β}}_{S_{1}}^{CIS - PSE}$	${\hat{S}}_{WBC}$	${\hat{β}}_{S_{WBC}}^{CIS - PSE}$	$\hat{corr}$
LFERT	1.98	2.04	2.54	LLIFE	1.77	−0.85
				NOM60	0.08	0.84
				NOF60	−0.07	0.83
				LGDPNOM60	−0.01	0.83

Open in a new tab

$\hat{corr}$ stands for the sample correlations between variables in $\hat{S_{1}}$ and ${\hat{S}}_{WBC}$ .

TOT=The term of trade shock; LFERT=log of fertility rate (children per woman) averaged over 19601985; LLIFE=log of life expectancy at age 0 averaged over 1960–1985; NOM60=Percentage of no schooling in the male population in 1960; NOF60=Percentage of no schooling in the female population in 1960; LGDP60=log GDP per capita in 1960 (1985 price); PRIF60=Percentage of primary schooling attained in female population in 1960; PRIM60=Percentage of primary schooling attained in male population in 1960; LGDPNORM60=LGDP60×NOM60; LGDPPRIF60=LGDP60×PRIF60; LGDPPRIM60=LGDP60×PRIM60; LGDPNOF60=LGDP60×NOF60.

5. A real data example

We apply the proposed CIS-PSE method to analyze the gross domestic product (GDP) growth data studied in Gao et al. (2017) and Barro and Lee (1994). Our goal is to identify factors that are associated with the long-run GDP growth rate. The dataset includes the GDP growth rates and 45 socioeconomic variables for 82 countries from 1960 to 1985. We consider the following model:

{GR}_{i} = β_{0} + β_{1} log (GDP 60_{i}) + z_{i}^{T} β_{2} + 1 (GDP 60_{i} < 2898) (δ_{0} + δ_{1} log (GDP 60_{i}) + z_{i}^{T} δ_{2}) + ε_{i},

(16)

where i is the country indicator, i = 1, …, 82, GR_i is the annualized GDP growth rate of country i from 1960 to 1985, GDP60_i is the GDP per capita in 1960, and z_i are 45 socioeconomic covariates, the details of which can be found in Gao et al. (2017). The β₁ and β₂ represent the coefficients of log(GDP60) and socioeconomic predictors, respectively. The δ₀ represents the coefficient of whether the GDP per capita in 1960 is below a threshold (=2898) or not. The δ₁ represents the coefficient of log(GDP60) when GDP per capita in 1960 is below 2898. The δ₂ represent the coefficients of the interactions between the GDP60_i < 2898 and the socioeconomic predictors when GDP per capita in 1960 is below 2898.

We apply the proposed CIS-PSE and PSE by Gao et al. (2017) to detect $S_{1}$ . Additionally, CISPSE is used to further identify $S_{WBC}$ . Effects of covariates in ${\hat{S}}_{1}$ are estimated by Lasso, adaptive Lasso, PSE and CIS-PSE. Effects of covariates in ${\hat{S}}_{WBC}$ are estimated by CIS-PSE. The sample correlations between variables in ${\hat{S}}_{1}$ and ${\hat{S}}_{WBC}$ are also provided. Table 4 reports the selected variables and their estimated coefficients.

Next, we evaluate the accuracy of predicted GR using a leave-one-out cross-validation. For each country, we treat it itself as the test set while using all other countries as the training set. We apply Lasso, adaptive Lasso, PSE and CIS-PSE. All tuning parameters are selected as described in Section 4. The prediction results in Figure 4 show that CIS-PSE has the smallest prediction errors compared to PSE, Lasso and adaptive Lasso, with ${\hat{S}}_{1}$ detected by either Lasso or adaptive Lasso.

Figure 4: — Prediction errors from post-selection shrinkage estimators: CIS-PSE, PSE and two penalized estimators (Lasso and adaptive Lasso). $\hat{S_{1}}$ is detected by Lasso in the left panel and by adaptive Lasso in the right panel.

6. Discussion

To improve the estimation and prediction accuracy in high-dimensional linear regressions, we introduce the concept of weak but correlated (WBC) signals, which are commonly missed by the Lasso-type variable selection methods. We show that these variables can be easily detected with the help of their partial correlations with strong signals. We propose a CIS-PSE procedure for high-dimensional variable selection and estimation, particularly for WBC signal detection and estimation. We show that, by incorporating WBC signals, it significantly improves the estimation and prediction accuracy.

An alternative approach to weak signal detection would be to group them according to a known group structure and then select by their grouped effects (Bodmer and Bonilla, 2008; Li and Leal, 2008; Wu et al., 2011; Yuan and Lin, 2006). However, grouping strategies require prior knowledge on the group structure, and, in some situations, may not amplify the grouped effects of weak signals. For example, as pointed out in Buhlmann et al. (2013) and Shah and Samworth (2013), when a pair of highly negatively correlated variables are grouped together, they cancel out each other’s effect. On the other hand, our CIS-PSE method is based on detecting partial correlations and can accommodate the “canceling out” scenarios. Hence, when the grouping structure is known, it is worth combining the grouping strategy and CIS-PSE for weak signal detection. We will pursue this in the future.

Table 3:

Comparison of false positives (standard deviations in parentheses) between Examples 1 and 3

	p = 200	p = 300	p = 400	p = 500
Example 1	3.7 (2.4)	5.1 (2.7)	6.9 (3.1)	8.8 (3.3)
Example 3	4.6 (3.5)	6.1 (3.4)	7.8 (3.8)	10.9 (4.2)

Open in a new tab

7. Appendix

We provide technical proofs for Theorem 3.1, Corollary 3.2 and lemmas in this section. We first list some definitions and auxiliary lemmas.

Definition 7.1. A random vector Z = (Z₁, …, Z_p) is a sub-Gaussian with mean vector μ and variance proxy $σ_{z}^{2}$ , if for any $a \in ℝ^{p}, E [exp {a^{T} (Z - μ)}] \leq exp (σ_{z}^{2} ‖ a ‖_{2}^{2} / 2)$ .

Let Z be a sub-Gaussian random variable with variance proxy $σ_{z}^{2}$ . The sub-Gaussian tail inequality is given as, for any t > 0,

P (Z > t) \leq e^{- \frac{t^{2}}{2 σ_{z}^{2}}} and P (Z < - t) \leq e^{- \frac{t^{2}}{2 σ_{z}^{2}}} .

The following Lemma 7.2 ensures that the set of signals with non-vanishing marginal sample correlations with y coincides with $S_{1}$ with probability tending to 1. Therefore, evaluating condition (4) for a covariate j is equivalent to estimating nonzero Ω_jj’s for every $j^{'} \in S_{1}$ . Let r_j′ be the rank of variable j′ according to the magnitude of $| \hat{cov} (X_{j^{'}}, y) |, j^{'} = 1, \dots, p$ . Denote by ${\tilde{S}}_{1} (k) = {j^{'} : r_{j^{'}} \leq k}$ the first k covariates with the largest absolute marginal correlations with y, for k = 1, …, p. Recall that $s_{1} = | S_{1} |$ .

Lemma 7.2. Under Assumption (A5), we have

lim_{n \to \infty} P ({\tilde{S}}_{1} (s_{1}) = S_{1}) = 1.

Proof of Lemma 7.2. By the definition of ${\tilde{S}}_{1} (s_{1})$ , it is suffice to show that with probability tending to 1, as n → ∞,

max | \frac{1}{n} X_{S_{1}^{c}}^{T} y | < min | \frac{1}{n} X_{S_{1}}^{T} y | .

Since $y = X β^{*} + ε = X_{S_{0}} β_{S_{0}}^{*} + ε = X_{S_{1}} β_{S_{1}}^{*} + X_{S_{2}} β_{S_{2}}^{*} + ε$ , we have

\frac{1}{n} X_{S_{1}^{c}}^{T} y = \frac{1}{n} X_{S_{1}^{c}}^{T} X_{S_{1}} β_{S_{1}}^{*} + \frac{1}{n} X_{S_{1}^{c}}^{T} X_{S_{2}} β_{S_{2}}^{*} + \frac{1}{n} X_{S_{1}^{c}}^{T} ε .

Notice that for each $j^{'} \in S_{1}^{c}, \frac{1}{n} x_{j^{'}}^{T} ε \to cov (X_{j^{'}}, ε) = 0$ in probability, then $max | \frac{1}{n} X_{S_{1}^{c}}^{T} ε | = o_{P} (1)$ as n → ∞.

It follows that when n → ∞,

max | \frac{1}{n} X_{S_{1}^{c}}^{T} y | \leq max | Σ_{S_{1}^{c} S_{1}} β_{S_{1}}^{*} | + max | Σ_{S_{1}^{c} S_{2}} β_{S_{2}}^{*} | + o_{P} (1) .

Similarly, when n → ∞,

min | \frac{1}{n} X_{S_{1}}^{T} y | \geq min | Σ_{S_{1} S_{1}} β_{S_{1}}^{*} | - min | Σ_{S_{1} S_{2}} β_{S_{2}}^{*} | + o_{P} (1) .

Lemma 7.2 is concluded by combining the above two inequalities with the faithfulness condition. ◻

Bickel and Levina (2008) showed that for $α = O (\sqrt{log | S_{[j]} | / n)}, ‖ {\tilde{Σ}}_{[j]}^{α} - Σ_{[j]} ‖ = O_{P} (ρ_{n})$ , where $ρ_{n} = O (n^{- 1} C_{max}^{1 / 4})$ with $C_{max}^{1 / 4}$ given in (A6). Furthermore, Bickel and Levina (2008) and Fan et al. (2011) showed that the estimation error for each connected component of the precision matrix is bounded by

‖ {\hat{Ω}}_{[j]} - Ω_{[j]} ‖ = ‖ {({\tilde{Σ}}_{[j]}^{α})}^{- 1} - Σ_{[j]}^{- 1} ‖ = O_{P} (ρ_{n}) .

(17)

Here we adopt the recursive labeling Algorithm in Shapiro and Stockman (2002) to detect the connected components of the thresholded sample covariance matrix.

Without loss of generality, suppose that the strong signals in ${\hat{S}}_{1}$ belong to distinct connected components of ${\tilde{Σ}}^{α}$ . We rearrange the indices in ${\hat{S}}_{1}$ as ${1, \dots, | {\hat{S}}_{1} |}$ and write the submatrix of ${\tilde{Σ}}^{α}$ corresponding to $j^{'}, 1 \leq j^{'} \leq | {\hat{S}}_{1} |$ , as ${\tilde{Σ}}_{j^{'}}^{α}$ . For notational convenience, we rewrite

\hat{Ω} = diag {({({\tilde{Σ}}_{1}^{α})}^{- 1}, \dots, {({\tilde{Σ}}_{| {\hat{S}}_{1}^{(α)} |}^{α})}^{- 1}, 0)}_{p \times p} .

The following Lemma 7.3 is useful for controlling the size of ${\hat{C}}_{[j]}$ .

Lemma 7.3. Under (A6)-(A7), when x is from a multivariate sub-Gaussian distribution, we have $P (| {\hat{C}}_{[j]} | \leq O (n^{ξ})) \geq 1 - n^{ξ} C_{1} exp (- C_{2} n^{1 + ξ})$ for some positive constants C₁ and C₂.

Lemma 7.3 is a direct conclusion of Lemma 2.1 and Assumption (A6). Next we prove Theorem 3.1.

Proof of Theorem 3.1. Notice that $β_{j}^{*} = \sum_{j^{'} \in C_{[j]}} Ω_{j j^{'}} cov (X_{j^{'}}, y)$ and ${\hat{β}}_{j} = \sum_{j^{'} \in {\hat{C}}_{[j]}} {\hat{Ω}}_{j j^{'}} \hat{cov} (X_{j^{'}}, y)$ . Consider a sequence of thresholding parameters ν_n = O(n^3ξ/2) with a decreasing series of positive numbers u_n = 1 + n^−ξ/4 such that lim_n→∞ u_n = 1,

P ({j \notin S_{1} : | β_{j}^{*} | > ν_{n} u_{n}} \subseteq {j \notin S_{1} : | {\hat{β}}_{j} | > ν_{n}}) \geq 1 - P (\underset{j \notin S_{1}, | β_{j}^{*} | > ν_{n} u_{n}}{\cup} {| {\hat{β}}_{j} | \leq ν_{n}}) .

(18)

Moreover, since ${| {\hat{β}}_{j} | \leq ν_{n}}$ and ${| β_{j}^{*} | > ν_{n} u_{n}}$ , we have $| {\hat{β}}_{j} - β_{j}^{*} | \geq ν_{n} (u_{n} - 1)$ . As a result,

1 - P (\underset{j \notin S_{1}, | β_{j}^{*} | > ν_{n} u_{n}}{\cup} {| {\hat{β}}_{j} | \leq ν_{n}}) \geq 1 - \sum_{j \notin S_{1}, | β_{j}^{*} | > ν_{n} u_{n}} P (| {\hat{β}}_{j} - β_{j}^{*} | \geq ν_{n} (u_{n} - 1)) .

(19)

Notice that from Lemma 2.1, $P (C_{[j]} \subseteq {\hat{C}}_{[j]}) \geq 1 - C_{1} n^{ξ} exp (- C_{2} n^{ξ})$ for some positive constants C₁ and C₂ and 0 < ξ < 1. Therefore, we further have

P (| {\hat{β}}_{j} - β_{j}^{*} | \geq ν_{n} (u_{n} - 1)) = P (| \sum_{j^{'} \in C_{[j]}} {\hat{Ω}}_{j j^{'}} \hat{cov} (X_{j^{'}}, y) - \sum_{j^{'} \in E_{j}} Ω_{j j^{'}} cov (X_{j^{'}}, y) | \geq ν_{n} (u_{n} - 1)) = P (| \sum_{j^{'} \in C_{[j]}} {\hat{Ω}}_{j j^{'}} \hat{cov} (X_{j^{'}}, y) - \sum_{j^{'} \in {\hat{C}}_{[j]}} Ω_{j j^{'}} cov (X_{j^{'}}, y) | \geq ν_{n} (u_{n} - 1)) + C_{1} n^{ξ} exp (- C_{2} n^{ξ}) \leq P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | ({\hat{Ω}}_{j j^{'}} - Ω_{j j^{'}}) \hat{cov} (X_{j^{'}}, y) + Ω_{j j^{'}} (\hat{cov} (X_{j^{'}}, y) - cov (X_{j^{'}}, y)) | \geq ν_{n} (u_{n} - 1)) + C_{1} n^{ξ} exp (- C_{2} n^{ξ}) \leq P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | ({\hat{Ω}}_{j j^{'}} - Ω_{j j^{'}}) \hat{cov} (X_{j^{'}}, y) | \geq ν_{n} (u_{n} - 1) / 2) + P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | Ω_{j j^{'}} (\hat{cov} (X_{j^{'}}, y) - cov (X_{j^{'}}, y)) | \geq ν_{n} (u_{n} - 1) / 2) + C_{1} n^{ξ} exp (- C_{2} n^{ξ}) .

(20)

By (17) and Assumption (A6), $‖ \hat{Ω} - Ω ‖ \leq M n^{- (1 - ξ / 4)}$ , for some 0 < M < ∞. As n → ∞, the first term in (20) can be shown as:

P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | ({\hat{Ω}}_{j j^{'}} - Ω_{j j^{'}}) \hat{cov} (X_{j^{'}}, y) | > ν_{n} (u_{n} - 1) / 2) \leq P (\frac{1}{n^{2}} y^{T} x_{\hat{C} [j]} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}}^{T} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}} x_{{\hat{C}}_{[j]}}^{T} y > \frac{ν_{n}^{2} {(u_{n} - 1)}^{2}}{4}) \leq P (λ_{max} (\frac{1}{n} x_{{\hat{C}}_{[j]}} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}}^{T} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}} x_{{\hat{C}}_{[j]}}^{T}) \frac{1}{n} y^{T} y > \frac{ν_{n}^{2} {(u_{n} - 1)}^{2}}{4}) . \leq P (λ_{max} (\frac{1}{n} x_{{\hat{C}}_{[j]}} x_{{\hat{C}}_{[j]}}^{T}) λ_{max} ({(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}}^{T} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}}) \frac{1}{n} y^{T} y > \frac{ν_{n}^{2} {(u_{n} - 1)}^{2}}{4}) .

(21)

Notice that $E [y^{T} y / n] = E (y^{2}) = σ^{2} + {(β^{*})}^{T} Σ β^{*} \leq σ^{2} + λ_{max} (Σ) ‖ β^{*} ‖^{2} < {\tilde{C}}_{1}$ for some positive constant ${\tilde{C}}_{1}$ . For sufficiently large n, $λ_{max} (x_{{\hat{C}}_{[j]}} x_{{\hat{C}}_{[j]}}^{T} / n) \leq λ_{max} (\hat{Σ}) \leq κ_{2} + λ_{max} (\hat{Σ} - Σ) = κ_{2} + ‖ \hat{Σ} - Σ ‖^{1 / 2} = κ_{2} + O (ρ_{n}^{1 / 2}) < {\tilde{C}}_{2}$ for some positive constant ${\tilde{C}}_{2}$ . And $λ_{max} ({(\hat{Ω} - Ω)}_{j {\hat{c}}_{[j]}}^{T} {(\hat{Ω} - Ω)}_{j {\hat{C}}_{[j]}}) \leq ‖ \hat{Ω} - Ω ‖^{2} \leq M^{2} n^{- (2 - ξ / 2)}$ Therefore, from (21), for suffitiently large n,

P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | ({\hat{Ω}}_{j j^{'}} - Ω_{j j^{'}}) \hat{cov} (X_{j^{'}}, y) | > ν_{n} (u_{n} - 1) / 2) \leq P ({\tilde{C}}_{2} M^{2} \frac{1}{n} y^{T} y > n^{(2 - ξ / 2)} \frac{ν_{n}^{2} {(u_{n} - 1)}^{2}}{4}) \leq P (\frac{1}{n} y^{T} y > \frac{1}{{\tilde{C}}_{2} M^{2}} \frac{n^{2}}{4}) \leq \frac{4 {\tilde{C}}_{2} M^{2} E [y^{T} y / n]}{n^{2}} \leq \frac{4 {\tilde{C}}_{1} {\tilde{C}}_{2} M^{2}}{n^{2}} \to 0,

(22)

where the second last step is from applying Markov inequality to the positively valued random variable y^Ty/n.

For the second term in (20), let $\tilde{z} = {({\tilde{Z}}_{1}, \dots, {\tilde{Z}}_{p})}^{T} = {(x_{1}^{T} y / n - cov (X_{1}, y), \dots, x_{p}^{T} y / n - cov (X_{p}, y))}^{T}$ , then we have

P (\sum_{j^{'} \in {\hat{C}}_{[j]}} | Ω_{j j^{'}} (x_{j^{'}}^{T} y / n - cov (X_{j^{'}}, y)) | \geq ν_{n} (u_{n} - 1) / 2) \leq P (λ_{max} (Ω^{T} Ω) ‖ {\tilde{z}}_{{\hat{C}}_{[j]}} ‖_{2}^{2} \geq \frac{ν_{n}^{2} {(u_{n} - 1)}^{2}}{4}) \leq P (| {\tilde{Z}}_{j^{'}} | > \frac{ν_{n} κ_{1} (u_{n} - 1)}{2 | {\hat{C}}_{[j]} |})

for some $j^{'} \in {\hat{C}}_{[j]}$ . Notice that $E [{\tilde{Z}}_{j^{'}}] = E [X_{j^{'}} y - cov (X_{j^{'}}, y)] = 0$ . Also $E [{\tilde{Z}}_{j^{'}}^{2}] = V a r [X_{j^{'}} y] \leq E [X_{j^{'}}^{2} y^{2}] \leq (E (X_{j^{'}}^{4}) + E (y^{4})) / 2 < {\tilde{C}}_{3}$ for some positive Constant ${\tilde{C}}_{3}$ as $E (X_{j^{'}}^{4}) \leq λ_{max}^{2} (Σ)$ and $E (y^{4}) \leq {(λ_{max} (Σ) ‖ β ‖^{2})}^{2} + E (ε^{4}) < \infty$ . Therefore, from Jensen’s inequality, $E [| {\tilde{Z}}_{j^{'}} |] = E [{({\tilde{Z}}_{j^{'}}^{2})}^{1 / 2}] \leq {(E [{\tilde{Z}}_{j^{'}}^{2}])}^{1 / 2} \leq {\tilde{C}}_{3}^{1 / 2}$ . Then using Lemma 7.3 to Control the size of ${\hat{C}}_{[j]}$ and applying Markov inequality on $| {\tilde{Z}}_{j^{'}} |$ , we have

P (| {\tilde{Z}}_{j^{'}} | > \frac{ν_{n} κ_{1} (u_{n} - 1)}{2 | {\hat{C}}_{[j]} |}) \leq P (| {\tilde{Z}}_{j^{'}} | > \frac{κ_{1} n^{ξ / 4}}{2}) \leq \frac{2 E [| {\tilde{Z}}_{j^{'}} |]}{κ_{1} n^{ξ / 4}} \leq \frac{2 {\tilde{C}}_{3}^{1 / 2}}{κ_{1} n^{ξ / 4}} \to 0.

(23)

Plugging (22) and (23) into (20) and then plugging (20) into (19) gives

lim_{n \to \infty} P ({j \notin S_{1} : | β_{j}^{*} | > ν_{n} u_{n}} \subseteq {j \notin S_{1} : | {\hat{β}}_{j} | > ν_{n}}) = 1.

(24)

By a similar argument, we also have

lim_{n \to \infty} P ({j \notin S_{1} : | β_{j}^{*} | > ν_{n} / u_{n}} \subseteq {j \notin S_{1} : | {\hat{β}}_{j} | > ν_{n}}) = 1.

(25)

Combining (24) and (25), we have

lim_{n \to \infty} P ({\hat{S}}_{WBC} = S_{WBC} | {\hat{S}}_{1} = S_{1}) .

◻

Proof of Corollary 3.2. Notice that

P ({{\hat{S}}_{WBC} = S_{WBC}} \cap {{\hat{S}}_{2 *} = S_{2 *}} \cap {{\hat{S}}_{1} = S_{1}}) = P ({\hat{S}}_{2 *} = S_{2 *} | {{\hat{S}}_{WBC} = S_{WBC}} \cap {{\hat{S}}_{1} = S_{1}}) P ({{\hat{S}}_{WBC} = S_{WBC}} \cap {{\hat{S}}_{1} = S_{1}}) = P ({\hat{S}}_{2 *} = S_{2 *} | {\hat{S}}_{1 WBC} = S_{1 WBC}) P ({\hat{S}}_{WBC} = S_{WBC} | {\hat{S}}_{1} = S_{1}) P ({\hat{S}}_{1} = S_{1}) .

(26)

Under the SRC in (A4), by Lemma 1 in Gao et al. (2017) or Theorem 2 in Zhang and Huang (2008),

lim_{n \to \infty} P ({\hat{S}}_{1} = S_{1}) = 1.

(27)

From Theorem 3.1,

lim_{n \to \infty} P ({\hat{S}}_{WBC} = S_{WBC} | {\hat{S}}_{1} = S_{1}) = 1.

(28)

Equations (27) and (28) together give ${lim}_{n \to \infty} P ({\hat{S}}_{WBC} = S_{WBC}) = 1$ . This further gives that $P (S_{1} \subset {\hat{S}}_{1 WBC} \subset {S_{1 WBC} \cup S_{2}}) \to 1$ . Then by Corollary 2 in Gao et al. (2017), we also have

lim_{n \to \infty} P ({{\hat{S}}_{2 *} = S_{2 *}} | {{\hat{S}}_{1 WBC} = S_{1 WBC}}) = 1.

(29)

Combining (27), (28), (29) and (26) completes the proof. ◻

The following Tables 5–6 and Figures 5–6 give the selection, estimation and prediction results under Examples 1 and 2 when $\hat{S_{1}}$ is selected by adaptive Lasso.

Table 5:

The performance of variable selection on $S_{0}$ when $\hat{S_{1}}$ is selected by adaptive Lasso

			p = 200	p = 300	p = 400	p = 500
Example 1	TP	CIS-PSE	61.4 (2.4)	61.1 (2.5)	61.0 (2.6)	61.1 (2.6)
	TP	PSE	41.2 (5.0)	34.4 (5.1)	25.7 (6.1)	22.6 (5.8)
	FP	CIS-PSE	2.5 (2.1)	4.6 (2.5)	6.7 (3.2)	8.6 (3.0)
	FP	PSE	15.1 (4.9)	19.7 (5.0)	22.3 (5.4)	28.0 (6.2)
Example 2	TP	CIS-PSE	62.5 (0.8)	58.0 (2.2)	54.6 (2.9)	52.0 (3.5)
	TP	PSE	43.9 (3.8)	37 (4.2)	32.8 (4.3)	31.5 (4.2)
	FP	CIS-PSE	3.4 (2.6)	5.2 (3.0)	6.0 (3.5)	7.7 (4.1)
	FP	PSE	13.1 (3.7)	18.6 (4.7)	21.9 (5.5)	28.0 (6.1)

Open in a new tab

Table 6:

Mean squared prediction error (MSPE) of the predicted outcomes when $\hat{S_{1}}$ is selected by adaptive Lasso

	p	200	300	400	500
Example 1	CIS-PSE	2.16 (0.57)	3.06 (0.63)	3.30 (0.72)	3.60 (0.81)
	PSE	3.91 (0.71)	4.85 (0.86)	5.71 (1.02)	6.27 (1.17)
	adaptive Lasso	15.02 (4.37)	14.67 (3.90)	14.49 (4.00)	14.74 (4.29)
Example 2	CIS-PSE	2.69 (0.58)	2.96 (0.72)	3.23 (0.81)	3.31 (0.83)
	PSE	3.77 (0.77)	4.76 (1.08)	5.50 (1.17)	5.82 (0.18)
	adaptive Lasso	4.48 (0.79)	4.81 (0.91)	4.94 (0.97)	4.97 (1.01)

Open in a new tab

Figure 5: — The mean squared error (MSE) of ${\hat{β}}_{S_{WBC}}$ for different p’s when $\hat{S_{1}}$ is selected by adaptive Lasso under Example 1 (Left panel) and Example 2 (Right panel). Solid lines represent CIS-PSE, dashed lines are for PSE, and dotted lines indicate adaptive Lasso.

Figure 6: — The mean squared error (MSE) of ${\hat{β}}_{S_{1}}$ for different p’s when $\hat{S_{1}}$ is selected by adaptive Lasso under Example 1 (Left panel) and Example 2 (Right panel). Solid lines represent CIS-PSE, dashed lines are for PSE, dotted lines indicate adaptive Lasso RE defined as ${\hat{β}}_{{\hat{S}}_{1}}^{RE} = {\hat{Σ}}_{{\hat{S}}_{1}}^{- 1} \hat{cov} (x_{{\hat{S}}_{1}}, y)$ dot-dashed lines represent adaptive Lasso, and long-dashed lines are for WR in (9).

Figure 7 shows the averaged sum of squared prediction error (SSPE) on the validation datasets across 500 independent experiments for different α’s.

References

Barro R and Lee J (1994). Data set for a panel of 139 countries. http://admin.nber.org/pub/barro.lee/.
Bickel P and Levina E (2008). Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604. [Google Scholar]
Bodmer W and Bonilla C (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet, 40(6):695–701. [DOI] [PMC free article] [PubMed] [Google Scholar]
Buhlmann P, Rütimann P, van de Geer S, and Zhang CH (2013). Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835–1858. [Google Scholar]
Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc, 96:1348–1360. [Google Scholar]
Fan J, Liao Y, and Min M (2011). High-dimensional covariance matrix estimation in approxiamte factor models. Annals of Statistics, 39(6):3320–3356. [DOI] [PMC free article] [PubMed] [Google Scholar]
Gao X, Ahmed SE, and Feng Y (2017). Post selection shrinkage estimation for highdimensional data analysis. Applied Stochastic Models in Business and Industry, 33(2):97–120. [Google Scholar]
Li B and Leal S (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet, 83(3):311–321. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mazumder R and Hastie T (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 12:723–736. [PMC free article] [PubMed] [Google Scholar]
Shah RD and Samworth RJ (2013). Discussion of ‘correlated variables in regression: clustering and sparse estimation’. Journal of Statistical Planning and Inference, 143:1866–1868. [Google Scholar]
Shao J, Wang Y, Deng X, and Wang S (2011). Sparse linear discriminant analysis by thresholding for high-dimensional data. Annals of Statistics, 39(2):1241–1265. [Google Scholar]
Shapiro L and Stockman G (2002). Computer Vision. Prentice Hall. [Google Scholar]
Tibshirani R (1996). Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B, 58:267–288. [Google Scholar]
Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89(1):82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yuan M and Lin Y (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68:49–67. [Google Scholar]
Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):1567–1594. [Google Scholar]
Zhang CH and Huang J (2008). The sparsity and bias of the LASSO selection in high dimensional linear regression. Annals of Statistics, 36:1567–1594. [Google Scholar]
Zhang CH and Zhang S (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76(1):217–242. [Google Scholar]
Zou H (2006). The adaptive Lasso and its oracle properties. J. Am. Statist. Assoc, 101:1418–1429. [Google Scholar]

[R1] Barro R and Lee J (1994). Data set for a panel of 139 countries. http://admin.nber.org/pub/barro.lee/.

[R2] Bickel P and Levina E (2008). Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604. [Google Scholar]

[R3] Bodmer W and Bonilla C (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet, 40(6):695–701. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Buhlmann P, Rütimann P, van de Geer S, and Zhang CH (2013). Correlated variables in regression: clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835–1858. [Google Scholar]

[R5] Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Statist. Assoc, 96:1348–1360. [Google Scholar]

[R6] Fan J, Liao Y, and Min M (2011). High-dimensional covariance matrix estimation in approxiamte factor models. Annals of Statistics, 39(6):3320–3356. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Gao X, Ahmed SE, and Feng Y (2017). Post selection shrinkage estimation for highdimensional data analysis. Applied Stochastic Models in Business and Industry, 33(2):97–120. [Google Scholar]

[R8] Li B and Leal S (2008). Methods for detecting associations with rare variants for common diseases: application to analysis of sequence data. Am. J. Hum. Genet, 83(3):311–321. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Mazumder R and Hastie T (2012). Exact covariance thresholding into connected components for large-scale graphical lasso. Journal of Machine Learning Research, 12:723–736. [PMC free article] [PubMed] [Google Scholar]

[R10] Shah RD and Samworth RJ (2013). Discussion of ‘correlated variables in regression: clustering and sparse estimation’. Journal of Statistical Planning and Inference, 143:1866–1868. [Google Scholar]

[R11] Shao J, Wang Y, Deng X, and Wang S (2011). Sparse linear discriminant analysis by thresholding for high-dimensional data. Annals of Statistics, 39(2):1241–1265. [Google Scholar]

[R12] Shapiro L and Stockman G (2002). Computer Vision. Prentice Hall. [Google Scholar]

[R13] Tibshirani R (1996). Regression shrinkage and selection via the Lasso. J. R. Statist. Soc. B, 58:267–288. [Google Scholar]

[R14] Wu MC, Lee S, Cai T, Li Y, Boehnke M, and Lin X (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet, 89(1):82–93. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] Yuan M and Lin Y (2006). Model selection and estimation in regression with grouped variables. J. R. Statist. Soc. B, 68:49–67. [Google Scholar]

[R16] Zhang CH (2010). Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38(2):1567–1594. [Google Scholar]

[R17] Zhang CH and Huang J (2008). The sparsity and bias of the LASSO selection in high dimensional linear regression. Annals of Statistics, 36:1567–1594. [Google Scholar]

[R18] Zhang CH and Zhang S (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Statist. Soc. B, 76(1):217–242. [Google Scholar]

[R19] Zou H (2006). The adaptive Lasso and its oracle properties. J. Am. Statist. Assoc, 101:1418–1429. [Google Scholar]

PERMALINK

Weak signals in high-dimension regression: detection, estimation and prediction

Yanming Li

Hyokyoung G Hong

S Ejaz Ahmed

Yi Li

Summary.

1. Introduction

2. Methods

2.1. Notation

2.2. Defining strong and weak signals

2.3. Covariance-insured screening based post-selection shrinkage estimator (CIS-PSE)

Variable selection:

Figure 1:

Post-selection shrinkage estimation:

2.4. Selection of tuning parameters

Figure 7:

3. Asymptotic properties

4. Simulation studies

Table 1:

Figure 2:

Figure 3:

Table 2:

Table 4:

5. A real data example

Figure 4:

6. Discussion

Table 3:

7. Appendix

Table 5:

Table 6:

Figure 5:

Figure 6:

References

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Weak signals in high-dimension regression: detection, estimation and prediction

Yanming Li

Hyokyoung G Hong

S Ejaz Ahmed

Yi Li

Summary.

1. Introduction

2. Methods

2.1. Notation

2.2. Defining strong and weak signals

2.3. Covariance-insured screening based post-selection shrinkage estimator (CIS-PSE)

Variable selection:

Figure 1:

Post-selection shrinkage estimation:

2.4. Selection of tuning parameters

Figure 7:

3. Asymptotic properties

4. Simulation studies

Table 1:

Figure 2:

Figure 3:

Table 2:

Table 4:

5. A real data example

Figure 4:

6. Discussion

Table 3:

7. Appendix

Table 5:

Table 6:

Figure 5:

Figure 6:

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases