Abstract
A partial correlation based variable selection method was proposed for normal linear regression models by Bühlmann, Kalisch and Maathuis (2010) as a comparable alternative to regularization methods for variable selection. This paper addresses two important issues related to partial correlation based variable selection: (a) whether the method is sensitive to the normality assumption, and (b) whether the method is valid when the dimension of the predictor vector increases at an exponential rate of the sample size. To address issue (a), we systematically study this method for elliptical linear regression models. Our finding indicates that the original proposal may lead to inferior performance when the marginal kurtosis of the predictors is not close to that of the normal distribution. Our simulation results further confirm this finding. To ensure good performance of the partial correlation based variable selection procedure, we propose a thresholded partial correlation (TPC) approach to select significant variables in linear regression models. We establish the selection consistency of the TPC in the presence of ultrahigh dimensional predictors. Since the TPC procedure includes the original proposal as a special case, our theoretical results address issue (b) directly. As a by-product, we obtain the sure screening property of the first step of the TPC. Numerical examples also illustrate that the TPC is competitively comparable to the commonly used regularization methods for variable selection.
Keywords and phrases: Elliptical distribution, model selection consistency, partial correlation, partial faithfulness, sure screening property, ultrahigh dimensional linear model, variable selection
1. Introduction
Variable selection via penalized least squares has been extensively studied during the last two decades. Popular penalized least squares variable selection procedures include LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and adaptive LASSO (Zou, 2006), among others. See Fan and Lv (2010) for a selective overview of this topic and the references therein for more work on variable selection via penalized least squares.
As a powerful alternative to penalized least squares for variable selection, Bühlmann, Kalisch and Maathuis (2010) proposed a variable selection procedure that ranks the partial correlations (PC) between the predictors and the response, named the PC-simple algorithm. The definition of partial correlation is given in section 2. The authors provided a stepwise algorithm for linear regression models with partial faithfulness, under which, for each predictor, if its partial correlation with the response given some subset of the other predictors is 0, then its partial correlation given all the other predictors is also 0. The PC-simple algorithm possesses model selection consistency for such linear models and is thus competitively comparable to penalized least squares variable selection approaches. Therefore, scientists have two distinct schemes for variable selection in high-dimensional linear models, which raises their confidence in predictors selected by both techniques.
This work aims to study two important issues related to the PC-simple algorithm. The first issue is that the procedure proposed in Bühlmann, Kalisch and Maathuis (2010) relies on a normality assumption on the joint distribution of the response and the predictors, although partial faithfulness itself does not require normality. Thus, it is of great interest to study the impact of the normality assumption on the variable selection procedure developed in Bühlmann, Kalisch and Maathuis (2010). The second issue is that the theoretical results established in Bühlmann, Kalisch and Maathuis (2010) require the dimension of the predictor vector to increase at a polynomial rate of the sample size. It is also of great interest to study whether the theoretical results remain valid with dimensionality increasing at an exponential rate of the sample size.
To study the issue related to the normality assumption, we consider the elliptical linear model (i.e., the response and the predictors in a linear regression model jointly follow an elliptical distribution); elliptical distributions are systematically studied in Fang, Kotz and Ng (1990). The elliptical distribution family contains a much broader class of distributions than the normal family, such as mixtures of normal distributions, the multivariate t-distribution, the multi-uniform distribution on the unit sphere and the Pearson Type II distribution, among others. It has been used as a tool to study robustness to normality in the literature on multivariate nonparametric tests (Mottonen, Oja and Tienari, 1997; Oja and Randles, 2004; Chen, Wiesel and Hero, 2011; Soloveychik and Wiesel, 2015; Wang, Peng and Li, 2015). Elliptical linear regressions have been proposed in Osiewalski (1991) and Osiewalski and Steel (1993), and have received more and more attention in the recent literature (Arellano-Valle, del Pino and Iglesias, 2006; Fan and Lv, 2008; Liang and Li, 2009; Vidal and Arellano-Valle, 2010). Furthermore, the elliptical distribution family has a variety of applications. For instance, it is crucial for modeling finance data (McNeil, Frey and Embrechts, 2005) due to its ability to accommodate tail dependence (the phenomenon of simultaneous extremes), which is highly useful in quantitative finance but is not allowed by the multivariate normal distribution (Schmidt, 2002).
By exploring the limiting distribution of the sample partial correlation under elliptical distributions, which is of interest in its own right, we find that the PC-simple algorithm tends to over-fit (under-fit) the model under those elliptical distributions whose marginal kurtosis is larger (smaller) than that of the normal distribution. To ensure good performance of partial correlation based variable selection for the elliptical distribution family, we propose a thresholded partial correlation (TPC) approach to select significant variables in linear regression models. In the same spirit as the PC-simple algorithm, the TPC is a stepwise method for variable selection, constructed by comparing each sample correlation and sample partial correlation with a threshold corresponding to a given significance level. The TPC approach relies on the limiting distribution of the sample partial correlation and coincides with the PC-simple algorithm for normal linear models. This enables us to study the asymptotic properties of the PC-simple algorithm under a broader framework and thereby address the issue of dimensionality increasing at an exponential rate.
We systematically study the sampling properties of the TPC. We first derive a concentration inequality for the sample partial correlations, without model assumptions, when the dimensionality of the covariates increases at an exponential rate of the sample size. This enables us to apply the TPC to ultrahigh dimensional linear models. We further establish the theoretical properties of the TPC, which broadens the usage of this variable selection scheme. We also develop the sure screening property of the first step of the TPC, in the terminology of Fan and Lv (2008). Note that the first step of the TPC has the same spirit as marginal screening based on the Pearson correlation (Fan and Lv, 2008). Thus, as a by-product, we obtain the sure screening property of the marginal Pearson correlation screening procedure under assumptions different from those of Fan and Lv (2008).
This paper is organized as follows. In section 2, we propose the TPC for elliptical linear models and establish its asymptotic properties. Numerical studies are conducted in section 3. A brief conclusion is given in section 4, and all technical proofs are given in the Appendix.
2. Thresholded partial correlation (TPC) approach
In this section, we introduce the linear model with elliptical response and predictors, and motivate the TPC approach by studying the limiting distribution of the sample partial correlation for elliptical distributions. We then propose the TPC approach based on this limiting distribution and discuss its theoretical properties.
2.1 Elliptical linear model and its partial correlation estimation
Consider a linear model
y = xTβ + ε,    (2.1)
where y is the response variable, x = (x1, · · ·, xp)T is the covariate vector, β = (β1, …, βp)T is the coefficient vector, and ε is the random error with E(ε) = 0 and var(ε) = σ². Throughout this paper, it is assumed without loss of generality that E(x) = 0 and E(y) = 0, so that there is no intercept in model (2.1). In practice, it is common that x and y are marginally standardized before performing variable selection. Furthermore, (xT, y) is assumed to follow an elliptical distribution; that is, suppose that (x1T, y1), · · ·, (xnT, yn) are independent and identically distributed (iid) random samples from an elliptical distribution ECp+1(μ, Σ, ϕ), which has the characteristic function exp(itTμ)ϕ(tTΣt) for some characteristic generator ϕ(·) (Fang, Kotz and Ng, 1990).
Bühlmann, Kalisch and Maathuis (2010) proposed a variable selection method, the PC-simple algorithm, based on partial correlation learning for the linear model (2.1) with normal response and predictors. To extend this method to elliptical distributions, we first study the limiting distributions of the correlation and partial correlation under the ellipticity assumption. Denote by ρ(y, xj) and ρ̂(y, xj) the population and sample correlation between y and xj, respectively. Then, as shown in Theorem 5.1.6 of Muirhead (1982), the asymptotic distribution of ρ̂(y, xj) is
√n {ρ̂(y, xj) − ρ(y, xj)} → N(0, (1 + κ){1 − ρ²(y, xj)}²) in distribution,    (2.2)
where κ = ϕ″(0)/{ϕ′(0)}² − 1, with ϕ′(0) and ϕ″(0) being the first and second derivatives of ϕ at 0. Here κ is the marginal kurtosis parameter of the elliptical distribution ECp+1(μ, Σ, ϕ), and it equals 0 for a normal distribution Np+1(μ, Σ).
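For instance, the normal family has characteristic generator ϕ(t) = exp(−t/2), and a direct computation recovers κ = 0:

```latex
\phi'(t) = -\tfrac{1}{2}e^{-t/2}, \qquad
\phi''(t) = \tfrac{1}{4}e^{-t/2}, \qquad
\kappa = \frac{\phi''(0)}{\{\phi'(0)\}^{2}} - 1
       = \frac{1/4}{(-1/2)^{2}} - 1 = 0 .
```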
For an index set 𝒮 ⊆ {1, 2, · · ·, p}, we define 𝒮c to be 𝒮c = {1 ≤ j ≤ p : j ∉ 𝒮}, |𝒮| to be its cardinality, and x𝒮 = {xj : j ∈ 𝒮} to be a subset of covariates with index set 𝒮. Denote the truly active index set 𝒜 = {1 ≤ j ≤ p : βj ≠ 0} and the corresponding cardinality d0 = |𝒜|. Based on x𝒮, the definition of the partial correlation is given below.
Definition 1
(Partial Correlation) The partial correlation between xj and y given a set of controlling variables x𝒮, denoted by ρ(y, xj |x𝒮), is defined as the correlation between the residuals rxj,x𝒮 and ry,x𝒮 from the linear regressions of xj on x𝒮 and of y on x𝒮, respectively. The corresponding sample partial correlation between y and xj given x𝒮 is denoted by ρ̂(y, xj |x𝒮).
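To make Definition 1 concrete, the following minimal sketch computes a sample partial correlation from the two residual regressions (the helper name and interface are ours, not from the paper):

```python
import numpy as np

def sample_partial_correlation(y, xj, XS):
    """Sample partial correlation rho_hat(y, xj | x_S) per Definition 1:
    the ordinary correlation between the least-squares residuals of y on
    x_S and of xj on x_S. An empty XS (shape (n, 0)) recovers the plain
    sample correlation rho_hat(y, xj)."""
    n = len(y)
    # Include an intercept column so that both residual vectors are centered.
    D = np.column_stack([np.ones(n), XS]) if XS.size else np.ones((n, 1))
    r_y = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    r_x = xj - D @ np.linalg.lstsq(D, xj, rcond=None)[0]
    return (r_y @ r_x) / np.sqrt((r_y @ r_y) * (r_x @ r_x))
```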
In the next theorem, we study the asymptotic distribution of the sample partial correlation when the sample is drawn from an elliptical distribution; this result provides the foundation of the TPC variable selection procedure.
Theorem 1
Suppose that (x1T, y1), · · ·, (xnT, yn) are iid random samples from an elliptical distribution ECp+1(μ, Σ, ϕ) with finite fourth moments. For any j = 1, · · ·, p and 𝒮 ⊆ {j}c, if there exists a positive constant δ0 such that the smallest eigenvalue of the covariance matrix of x𝒮 is greater than δ0, then
√n {ρ̂(y, xj |x𝒮) − ρ(y, xj |x𝒮)} → N(0, (1 + κ){1 − ρ²(y, xj |x𝒮)}²) in distribution.    (2.3)
Theorem 1 seems to be a natural extension of the partial correlation result from the normal distribution to elliptical distributions, but to the best of our knowledge this result is new; its proof is given in Appendix A. Let ∅ be the empty set, and let ρ̂(y, xj |x∅) and ρ(y, xj |x∅) stand for ρ̂(y, xj) and ρ(y, xj), respectively. Then (2.3) is also valid for 𝒮 = ∅ by (2.2). The limiting distributions of the sample correlation and partial correlation given in (2.2) and (2.3) provide insights into the impact of the normality assumption on the PC-simple algorithm through the marginal kurtosis under the ellipticity assumption. This enables us to modify the PC-simple algorithm by taking the marginal kurtosis into account to ensure its good performance.
In addition, since the limiting distribution of ρ̂(y, xj |x𝒮) in (2.3) involves ρ(y, xj |x𝒮) in the asymptotic variance, we consider the Fisher Z-transformation of ρ̂(y, xj |x𝒮), whose limiting distribution no longer depends on ρ(y, xj |x𝒮). Specifically, let Ẑ(y, xj |x𝒮) and Z(y, xj |x𝒮) be the Fisher Z-transformations of ρ̂(y, xj |x𝒮) and ρ(y, xj |x𝒮), respectively. That is,
Ẑ(y, xj |x𝒮) = ½ log[{1 + ρ̂(y, xj |x𝒮)}/{1 − ρ̂(y, xj |x𝒮)}],  Z(y, xj |x𝒮) = ½ log[{1 + ρ(y, xj |x𝒮)}/{1 − ρ(y, xj |x𝒮)}].    (2.4)
Then, it follows by the delta method and Theorem 1 that
√n {Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)} → N(0, 1 + κ) in distribution.    (2.5)
The asymptotic distribution of Ẑ(y, xj |x𝒮) no longer depends on ρ(y, xj |x𝒮); thus it is easier to derive the selection threshold for Ẑ(y, xj |x𝒮) than for ρ̂(y, xj |x𝒮) directly.
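In code, the transformation and its inverse, which converts a cut-off on the Ẑ scale back to one on the ρ̂ scale, are one-liners (a sketch; the function names are ours):

```python
import numpy as np

def fisher_z(r):
    """Fisher Z-transformation (2.4): variance-stabilizes a (partial)
    correlation so that the limit in (2.5) has variance 1 + kappa."""
    return 0.5 * np.log((1.0 + r) / (1.0 - r))  # equals arctanh(r)

def inv_fisher_z(z):
    """Inverse of the Fisher Z-transformation: maps a cut-off on the
    Z scale back to a cut-off on the correlation scale."""
    return np.tanh(z)
```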
2.2 A variable selection algorithm
Based on the partial faithfulness condition, the following holds for all j ∈ {1, …, p} (Bühlmann, Kalisch and Maathuis, 2010):

ρ(y, xj |x𝒮) ≠ 0 for all 𝒮 ⊆ {j}c  if and only if  βj ≠ 0.
That is, xj is important (i.e., βj ≠ 0) if and only if the partial correlations between y and xj given all subsets 𝒮 of {j}c are nonzero. Extending the PC-simple algorithm proposed by Bühlmann, Kalisch and Maathuis (2010), we propose to identify the active predictors by iteratively testing the series of hypotheses

H0: ρ(y, xj |x𝒮) = 0 versus H1: ρ(y, xj |x𝒮) ≠ 0, for j = 1, · · ·, p and 𝒮 ⊆ {j}c,

where m̂reach = min{m : |𝒜̂[m]| ≤ m} and 𝒜̂[m] is the model index set chosen in the mth step, with cardinality |𝒜̂[m]|. Based on the limiting distribution (2.5), the rejection region at significance level α is

|Ẑ(y, xj |x𝒮)| > n−1/2 √(1 + κ̂) Φ−1(1 − α/2),

with κ̂ being a consistent estimate of κ, where Φ−1(·) is the inverse of the cumulative distribution function of the standard normal distribution. In practice, the factor n−1/2 in the rejection region is replaced by (n − |𝒮| − 1)−1/2 to account for the degrees of freedom lost in the calculation of the residuals. Therefore, an equivalent form of the rejection region with the small sample correction is
|ρ̂(y, xj |x𝒮)| > Tn(α, |𝒮|, κ̂),    (2.6)
where
Tn(α, |𝒮|, κ̂) = [exp{2γn(α, |𝒮|, κ̂)} − 1] / [exp{2γn(α, |𝒮|, κ̂)} + 1],  γn(α, |𝒮|, κ̂) = (n − |𝒮| − 1)−1/2 √(1 + κ̂) Φ−1(1 − α/2),    (2.7)
with κ being estimated by its sample counterpart:
κ̂ = p−1 Σj=1p [ n−1 Σi=1n (xij − x̄j)⁴ / {3 (n−1 Σi=1n (xij − x̄j)²)²} − 1 ],    (2.8)
where x̄j is the sample mean of the j-th element of x and xij is the j-th element of xi. In practice, the sample partial correlations can be computed recursively: for any k ∈ 𝒮,

ρ̂(y, xj |x𝒮) = [ρ̂(y, xj |x𝒮\{k}) − ρ̂(y, xk |x𝒮\{k}) ρ̂(xj, xk |x𝒮\{k})] / √[{1 − ρ̂²(y, xk |x𝒮\{k})}{1 − ρ̂²(xj, xk |x𝒮\{k})}].    (2.9)
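The recursion (2.9) reduces an order-|𝒮| partial correlation to three of order |𝒮| − 1, so only plain sample correlations are ever needed. A bare-bones sketch (labels and helper name ours; in practice one would memoize intermediate results) is:

```python
import numpy as np

def pcor_recursive(rho, a, b, S):
    """Sample partial correlation rho_hat(a, b | S) via the recursion (2.9).
    `rho` maps frozenset pairs of variable labels (e.g. 'y', 1, ..., p) to
    plain sample correlations; `S` is a sequence of controlling labels.
    This unmemoized version recomputes subproblems exponentially often."""
    S = tuple(S)
    if not S:
        return rho[frozenset((a, b))]
    k, rest = S[0], S[1:]                      # peel off one controlling variable
    r_ab = pcor_recursive(rho, a, b, rest)     # rho(a, b | S \ {k})
    r_ak = pcor_recursive(rho, a, k, rest)     # rho(a, k | S \ {k})
    r_bk = pcor_recursive(rho, b, k, rest)     # rho(b, k | S \ {k})
    return (r_ab - r_ak * r_bk) / np.sqrt((1 - r_ak**2) * (1 - r_bk**2))
```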
We summarize the TPC variable selection procedure in the following algorithm.
Algorithm 1.

Algorithm for TPC variable selection.

Step 1: Set m = 1 and 𝒮 = ∅. Obtain the marginally estimated active set

𝒜̂[1] = {1 ≤ j ≤ p : |ρ̂(y, xj)| > Tn(α, 0, κ̂)}.

Step 2: Set m = m + 1. Based on 𝒜̂[m−1], construct the mth step estimated active set

𝒜̂[m] = {j ∈ 𝒜̂[m−1] : |ρ̂(y, xj |x𝒮)| > Tn(α, |𝒮|, κ̂) for all 𝒮 ⊆ 𝒜̂[m−1]\{j} with |𝒮| = m − 1}.

Step 3: Repeat Step 2 until m = m̂reach.
Algorithm 1 results in a sequence of estimated active sets

𝒜̂[1] ⊇ 𝒜̂[2] ⊇ · · · ⊇ 𝒜̂[m̂reach].
Since κ = 0 for normal distributions, the TPC reduces to the PC-simple algorithm under the normality assumption. Thus, Theorem 1 clearly shows that the PC-simple algorithm tends to over-fit (under-fit) the model under those distributions whose kurtosis is larger (smaller) than the normal kurtosis 0. Following Bühlmann, Kalisch and Maathuis (2010), we further apply the ordinary least squares approach to estimate the coefficients of the predictors in 𝒜̂[m̂reach] after running Algorithm 1.
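To make the procedure concrete, here is a minimal, self-contained sketch of Algorithm 1 in Python. It follows our reconstruction of the threshold (2.6)-(2.7), including the (n − |𝒮| − 1) degrees-of-freedom correction, and of the kurtosis estimate (2.8); the function names and interface are ours, not the authors'.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def tpc_select(X, y, alpha=0.05):
    """Sketch of Algorithm 1 (TPC). Each pass keeps predictor j only if all
    partial correlations with y, given subsets (of the current size) of the
    other surviving predictors, clear the kurtosis-adjusted threshold."""
    n, p = X.shape

    def pcor(j, S):
        # rho_hat(y, xj | x_S) via the two residual regressions (Definition 1).
        D = np.column_stack([np.ones(n)] + [X[:, k] for k in S])
        ry = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
        rx = X[:, j] - D @ np.linalg.lstsq(D, X[:, j], rcond=None)[0]
        return (ry @ rx) / np.sqrt((ry @ ry) * (rx @ rx))

    # Pooled marginal kurtosis estimate (our reading of (2.8)); ~0 if normal.
    Xc = X - X.mean(axis=0)
    kappa = np.mean((Xc**4).mean(0) / (3.0 * (Xc**2).mean(0)**2) - 1.0)

    z = norm.ppf(1.0 - alpha / 2.0)
    def thresh(s):
        # Our reading of (2.6)-(2.7): tanh maps the Z-scale cut-off back to
        # the correlation scale; s is the size of the conditioning set.
        return np.tanh(np.sqrt(1.0 + kappa) * z / np.sqrt(n - s - 1))

    active = [j for j in range(p) if abs(pcor(j, ())) > thresh(0)]  # Step 1
    size = 1                                   # size of conditioning sets
    while size < len(active):                  # Steps 2-3: stop at m_reach
        survivors = []
        for j in active:
            others = [k for k in active if k != j]
            if all(abs(pcor(j, S)) > thresh(size)
                   for S in combinations(others, size)):
                survivors.append(j)
        active, size = survivors, size + 1
    return active
```

Setting κ̂ = 0 in the threshold recovers a PC-simple-type rule, and, as described above, ordinary least squares would then be refitted on the returned set.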
2.3 Theoretical properties
We impose the following regularity conditions to establish the asymptotic theory of the TPC. These regularity conditions may not be the weakest ones.
(D1) The joint distribution of (xT, y) satisfies partial faithfulness (Bühlmann, Kalisch and Maathuis, 2010).

(D2) (xT, y) follows ECp+1(μ, Σ, ϕ) with Σ > 0. Furthermore, there exists s0 > 0 such that for all 0 < s < s0,

E{exp(s|xj|)} < ∞, j = 1, · · ·, p, and E{exp(s|y|)} < ∞.

(D3) There exists δ > −1 such that the kurtosis satisfies κ > δ > −1.

(D4) For some cn = O(n−d), 0 < d < 1/2, the partial correlations ρ(y, xj |x𝒮) satisfy

inf{|ρ(y, xj |x𝒮)| : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ d0, ρ(y, xj |x𝒮) ≠ 0} ≥ cn.

(D5) The partial correlations ρ(y, xj |x𝒮) and ρ(xj, xk |x𝒮) satisfy:

i) sup{|ρ(y, xj |x𝒮)| : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ d0} ≤ τ < 1;

ii) sup{|ρ(xj, xk |x𝒮)| : 1 ≤ j ≠ k ≤ p, 𝒮 ⊆ {j, k}c, |𝒮| ≤ d0} ≤ τ < 1.
Condition (D1) guarantees the validity of the TPC method as a variable selection criterion. The elliptical distribution assumption in (D2) is crucial for deriving the asymptotic distribution of the sample partial correlation, and the sub-exponential tail condition ensures that the difference between the population and sample partial correlations degenerates at an exponential rate. Many elliptical distributions satisfy the sub-exponential tail condition, such as the multivariate normal distribution and the Pearson Type II distribution (Fang, Kotz and Ng, 1990). In addition, although (D2) is widely used in the literature as a sufficient condition to facilitate the theoretical proofs, it may not be the weakest condition to guarantee the validity of the TPC. Condition (D3) puts a mild restriction on the kurtosis and is used to control the Type I and Type II errors. The lower bound on the partial correlations in (D4) is used to control the Type II errors of the tests; this condition has the same spirit as the requirement of penalty-based methods that the nonzero coefficients be bounded away from 0. The upper bound on the partial correlations in condition i) of (D5) is used to control the Type I errors, and condition ii) of (D5) imposes a fixed upper bound on the population partial correlations between the covariates, which excludes perfect collinearity between the covariates.
Based on the above regularity conditions, we obtain the following consistency properties. We first consider the model selection consistency of the final estimated active set of the TPC. Since the TPC depends on the significance level α = αn, we write the final chosen model as 𝒜̂n(αn).
Theorem 2
Consider linear model (2.1). Under Conditions (D1)–(D5), there exist a sequence αn → 0 and a positive constant C such that, if d0 is fixed, then for p = o(exp(nξ)) with 0 < ξ < 1/5, the estimated active set can be identified at the following rate:
P{𝒜̂n(αn) = 𝒜} ≥ 1 − O{exp(−nν/C)},    (2.10)
where ξ < ν < 1/5; and if d0 = O(nb), 0 < b < 1/5, then for p = o(exp(nξ)), 0 < ξ < 1/5 − b, (2.10) still holds, with ξ + b < ν < 1/5.
The proof of this theorem is given in Appendix B. Theorem 2 implies that the TPC method, including the original PC-simple algorithm, enjoys model selection consistency when the dimensionality increases at an exponential rate of the sample size. Following Bühlmann, Kalisch and Maathuis (2010), one possible choice of the theoretical significance level is αn = 2{1 − Φ(n1/2cn/2)}.
Note that Bühlmann, Kalisch and Maathuis (2010) utilized the tail probability of the normal distribution to control the upper bounds of the probabilities of Type I and II errors. Thus, they had to assume that the model dimension grows at a polynomial rate of the sample size. We take a different approach from Bühlmann, Kalisch and Maathuis (2010) to establishing the model selection consistency in Theorem 2. We first derive the concentration inequality for the sample partial correlations, as in Step 1 of the proof of Theorem 2. In this step, we do not require the ellipticity assumption. With the concentration inequality, we allow the dimensionality of the covariates to increase at an exponential rate of the sample size. This enables us to apply the TPC to ultrahigh dimensional linear models.
In addition, notice that the estimated active set from the first step of the TPC, denoted by 𝒜̂[1], can be viewed as a feature screening procedure and is essentially equivalent to the sure independence screening procedure proposed by Fan and Lv (2008). Thus, we next establish the sure screening property (Fan and Lv, 2008) of this first step of the TPC under a different set of assumptions. We impose the following conditions on the population marginal correlations:
(E4) inf{|ρn(y, xj)| : j = 1, · · ·, p, ρn(y, xj) ≠ 0} ≥ cn, where cn = O(n−d) and 0 < d < 1/2.

(E5) sup{|ρn(y, xj)| : j = 1, · · ·, p} ≤ τ < 1.
Theorem 3
Consider linear model (2.1) and assume (D1)–(D3), (E4) and (E5). For p = O(exp(nξ)) with 0 < ξ < 1/5, there exists a sequence αn → 0 such that P{𝒜 ⊆ 𝒜̂[1](αn)} ≥ 1 − O{exp(−nν/C*)}, where C* is a positive constant and ξ < ν < 1/5.
The proof of this theorem is given in Appendix B. This theorem confirms the sure screening property of the marginal screening procedure based on the Pearson correlation under a different set of regularity conditions from Fan and Lv (2008).
3. Numerical Studies
In this section, we assess the finite sample performance of the TPC method and compare it with LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the PC-simple algorithm through Monte Carlo simulation studies. We also illustrate an application of the TPC to a rat eye expression data set.
3.1 Simulation studies
In our simulation study, data were generated from linear model (2.1) with β1 = 3, β2 = 1.5, β5 = 2 and βj = 0 for j ≠ 1, 2, 5. We consider p = 200, 500 and 2000, and the sample size is taken to be n = 200. Moreover, the joint distribution of (xT, ε) is taken to be either 0.9N(0, Σ) + 0.1N(0, 9Σ), which is an elliptical distribution, or the normal distribution N(0, Σ), where Σ is the (p + 1) × (p + 1) matrix with (i, j)th entry ρ|i−j|. We consider ρ = 0, 0.3 and 0.8, which correspond to uncorrelated, moderately correlated and strongly correlated covariates, respectively. Note that the estimated kurtosis of the normal mixture is around 1.5, deviating from the normal value 0 to a large extent. Thus, the mixture of normal distributions in this study illustrates a heavy-tailed situation. For each case, we conduct 1000 simulation replications.
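For concreteness, one replication of this design can be generated as follows (a sketch; the function name, seeding and the row-scaling construction of the normal mixture are our choices):

```python
import numpy as np

def gen_data(n=200, p=200, rho=0.3, elliptical=True, seed=0):
    """One replication of the Section 3.1 design: (x, eps) jointly follows
    0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma) (elliptical case) or N(0, Sigma),
    where Sigma is (p+1) x (p+1) with (i, j)th entry rho^|i-j|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p + 1)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    Z = rng.standard_normal((n, p + 1)) @ np.linalg.cholesky(Sigma).T
    if elliptical:
        # Inflate whole rows by 3 with probability 0.1: since 3^2 = 9, this
        # is exactly the scale mixture 0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma).
        Z *= np.where(rng.random(n) < 0.1, 3.0, 1.0)[:, None]
    X, eps = Z[:, :p], Z[:, p]
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]   # beta_1 = 3, beta_2 = 1.5, beta_5 = 2
    return X, X @ beta + eps, beta
```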
In our simulation, we compare the finite sample performance of LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), the PC-simple algorithm (Bühlmann, Kalisch and Maathuis, 2010) and the TPC. The following criteria are used to evaluate the performance of the variable selection procedures.
Model error: Ex[{xT(β̂ − β)}²] = (β̂ − β)T cov(x)(β̂ − β).

True positive number (TPN), defined as the average number of predictors with nonzero coefficients that are successfully detected over the 1000 simulations.

False positive number (FPN), defined as the average number of predictors with zero coefficients that are erroneously selected into the model.

Underfit percentage (UF), defined as the percentage of underfitted models, i.e., models that fail to identify at least one important predictor, in the 1000 simulations.

Correct-fit percentage (CF), defined as the percentage of correctly fitted models, i.e., models that select exactly the truly important predictors, in the 1000 simulations.

Overfit percentage (OF), defined as the percentage of overfitted models, i.e., models that identify all the important predictors but include at least one unimportant predictor, in the 1000 simulations.
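These criteria reduce to a few set operations and a quadratic form per replication; the sketch below (argument names ours; the true active set {1, 2, 5} is written in 0-based indexing) illustrates the bookkeeping:

```python
import numpy as np

def evaluate(selected, beta_hat, beta, Sigma, truth=frozenset({0, 1, 4})):
    """Evaluation criteria of Section 3.1 for a single replication; TPN/FPN
    are averaged and UF/CF/OF turned into percentages over replications."""
    sel = frozenset(selected)
    d = beta_hat - beta
    me = d @ Sigma @ d            # model error (beta_hat-beta)' cov(x) (beta_hat-beta)
    tpn = len(sel & truth)        # truly important predictors detected
    fpn = len(sel - truth)        # unimportant predictors selected
    uf = tpn < len(truth)         # misses at least one important predictor
    cf = sel == truth             # selects exactly the true model
    of = (not uf) and fpn > 0     # all important ones found, plus extras
    return me, tpn, fpn, uf, cf, of
```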
Table 3.1 depicts the simulation results for the elliptical distribution and clearly shows that the TPC performs significantly better than LASSO, SCAD and the PC-simple algorithm in most situations, regardless of low or high model dimensionality. Specifically, LASSO consistently over-fits the model under every scenario, as indicated in the literature. The models selected by SCAD are also larger in this case than those selected by the PC-based methods; thus its correct-fit rate is much lower while its over-fit rate is high. Furthermore, since the PC-simple algorithm relies on normality, it fails to capture the correct model with high probability under this elliptical distribution; in particular, when the x-variables are independent, the correct-fit rates of the PC-simple algorithm are only 25% and 7% for p = 500 and 2000, respectively. The PC-simple algorithm tends to overfit this model because the marginal kurtosis is around 1.5, larger than the normal value 0. For instance, when p = 500, the over-fit rates (OF) of the PC-simple algorithm are 0.58, 0.65 and 0.22 for ρ = 0, 0.3 and 0.8, respectively, while those of the TPC are 0.04, 0.07 and 0.00. By correcting the threshold under the ellipticity assumption, the TPC increases the probability of recovering the true model to a large degree. Furthermore, to save space, the computing time for 1000 simulations with p = 2000 is reported in Table S1 in the supplemental material. In terms of computational cost, the TPC converges much faster than the PC-simple algorithm and SCAD, and is comparable to the LARS algorithm for LASSO.
Table 3.1.
Simulation Results for Example 1: Elliptical Distribution
| p | ρ | Method | MedME (Devi) | TPN | FPN | UF | CF | OF |
|---|---|---|---|---|---|---|---|---|
| 200 | 0 | SCAD | 0.050 (0.024) | 3.00 | 4.52 | 0.00 | 0.51 | 0.49 |
| | | LASSO | 8.984 (0.219) | 3.00 | 33.63 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.082 (0.050) | 2.92 | 0.82 | 0.08 | 0.41 | 0.51 |
| | | TPC | 0.045 (0.032) | 2.84 | 0.13 | 0.16 | 0.81 | 0.03 |
| 200 | 0.3 | SCAD | 0.046 (0.023) | 3.00 | 3.90 | 0.00 | 0.50 | 0.50 |
| | | LASSO | 11.195 (0.216) | 3.00 | 30.26 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.063 (0.036) | 3.00 | 0.46 | 0.00 | 0.58 | 0.42 |
| | | TPC | 0.036 (0.024) | 2.99 | 0.04 | 0.01 | 0.96 | 0.03 |
| 200 | 0.8 | SCAD | 0.044 (0.026) | 3.00 | 2.51 | 0.00 | 0.50 | 0.50 |
| | | LASSO | 20.925 (0.158) | 3.00 | 16.44 | 0.00 | 0.02 | 0.98 |
| | | PC-simple | 0.039 (0.026) | 2.94 | 0.17 | 0.06 | 0.83 | 0.11 |
| | | TPC | 0.057 (0.040) | 2.79 | 0.20 | 0.19 | 0.80 | 0.01 |
| 500 | 0 | SCAD | 0.041 (0.022) | 3.00 | 5.57 | 0.00 | 0.41 | 0.59 |
| | | LASSO | 8.960 (0.212) | 3.00 | 45.25 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.096 (0.051) | 2.83 | 1.22 | 0.17 | 0.25 | 0.58 |
| | | TPC | 0.043 (0.031) | 2.74 | 0.21 | 0.26 | 0.70 | 0.04 |
| 500 | 0.3 | SCAD | 0.043 (0.024) | 3.00 | 7.05 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 11.172 (0.230) | 3.00 | 38.94 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.077 (0.043) | 3.00 | 0.83 | 0.00 | 0.35 | 0.65 |
| | | TPC | 0.030 (0.018) | 2.98 | 0.08 | 0.02 | 0.91 | 0.07 |
| 500 | 0.8 | SCAD | 0.042 (0.026) | 3.00 | 4.07 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 20.879 (0.187) | 3.00 | 20.86 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.049 (0.031) | 2.91 | 0.37 | 0.09 | 0.69 | 0.22 |
| | | TPC | 0.044 (0.032) | 2.73 | 0.26 | 0.25 | 0.75 | 0.00 |
| 2000 | 0 | SCAD | 0.051 (0.032) | 3.00 | 10.13 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 9.140 (0.179) | 3.00 | 66.84 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.112 (0.056) | 2.90 | 1.73 | 0.10 | 0.07 | 0.83 |
| | | TPC | 0.050 (0.037) | 2.83 | 0.35 | 0.17 | 0.67 | 0.16 |
| 2000 | 0.3 | SCAD | 0.045 (0.028) | 3.00 | 8.58 | 0.00 | 0.33 | 0.67 |
| | | LASSO | 11.345 (0.189) | 3.00 | 61.97 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.105 (0.044) | 2.99 | 1.36 | 0.01 | 0.17 | 0.82 |
| | | TPC | 0.039 (0.026) | 2.97 | 0.18 | 0.03 | 0.83 | 0.14 |
| 2000 | 0.8 | SCAD | 0.049 (0.030) | 3.00 | 7.33 | 0.00 | 0.28 | 0.72 |
| | | LASSO | 20.960 (0.136) | 3.00 | 37.81 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.077 (0.046) | 2.96 | 0.59 | 0.04 | 0.48 | 0.48 |
| | | TPC | 0.045 (0.034) | 2.82 | 0.24 | 0.17 | 0.81 | 0.02 |
MedME (Devi) denotes the median model error; the numbers in the parentheses are median absolute deviations over the 1000 simulations.
The results for the normal distribution are presented in Table S2 in the supplementary material. Recall that, in theory, the TPC is equivalent to the PC-simple algorithm in this case; thus their performances are quite similar. The median model errors are comparable for all the methods except LASSO, which yields much larger models than necessary. Overall, both LASSO and SCAD tend to select larger models and over-fit, compared with the partial-correlation-based methods for variable selection.
The elliptical distribution assumption is used only for deriving the asymptotic distribution of the partial correlations in Theorem 1, which motivates the test statistic for the series of hypotheses H0: ρ(y, xj |x𝒮) = 0. However, the model selection consistency of the TPC does not require the response and the predictors to be elliptically distributed. Therefore, the performance of the TPC is expected to be relatively robust to the elliptical assumption. To illustrate this, we consider a simulation example involving discrete predictors. Specifically, the x's with even subscripts are generated in the same fashion as before, while the x's with odd subscripts take the discrete values 0, 1 and 2 with probabilities 0.25, 0.5 and 0.25, respectively. To save space, the results are reported in Table S3 in the supplemental material. We can see from Table S3 that the TPC outperforms the other methods, especially in terms of the correct-fit rate.
3.2 An application
In this section, we demonstrate the proposed methodology with an empirical analysis of a microarray data set that was studied by Scheetz et al. (2006) and Huang et al. (2008). This data set contains 120 twelve-week-old male rats, and for each rat, 3000 sufficiently expressed gene probes with enough variation are studied. The purpose of the analysis is to identify the probes that are most relevant to the response, the expression level of probe TRIM32, which was recently proved to cause Bardet-Biedl syndrome (Chiang et al., 2006).
We apply SCAD, LASSO, the PC-simple algorithm and the TPC to this data set with one outlier deleted. Table 3.2 provides information on the gene probes chosen by the different methods. As LASSO yields a much larger model, leading to difficulty of interpretation, to save space we only report the six probes selected by SCAD, the PC-simple algorithm and the TPC, and indicate whether they are included in the 20 probes chosen by LASSO. We calculate the adjusted R2 and the prediction error (PE), computed by the leave-one-out cross-validation (LOOCV) method, for each model. From Table 3.2, we can see that the models selected by SCAD, LASSO and the TPC have very similar performance in terms of adjusted R2 and prediction error. The TPC improves on the PC-simple algorithm by including two probes, x5 and x6; these two probes lead to about a 9% reduction in prediction error from the model selected by the PC-simple algorithm to the model selected by the TPC. Note that probes 1389584_at (x1) and 1383996_at (x2) are selected by all four approaches and were also identified by Huang et al. (2008). Therefore, they are worth more comprehensive biological research. The rest of the results from the TPC are more consistent with Huang et al. (2008) than those of the other methods.
Table 3.2.
Results for Real Data Example
| Selected Probes | SCAD | LASSO | PC-simple | TPC | M6 Est (SE) | M4 Est (SE) |
|---|---|---|---|---|---|---|
| Intercept | Yes | Yes | Yes | Yes | .0147 (.0465) | .0164 (.0467) |
| 1389584_at (x1) | Yes | Yes | Yes | Yes | .3669 (.0823)*** | .4098 (.0710)*** |
| 1383996_at (x2) | Yes | Yes | Yes | Yes | .1400 (.0595)* | .1583 (.0590)** |
| 1382452_at (x3) | Yes | Yes | / | / | .2450 (.0606)*** | .2279 (.0547)*** |
| 1370429_at (x4) | / | / | Yes | Yes | .0464 (.0815) | |
| 1383110_at (x5) | / | Yes | / | Yes | .1543 (.0840) | |
| 1374106_at (x6) | / | Yes | / | Yes | .2203 (.0727)** | .2580 (.0688)*** |
| 15 more probes | / | Yes | / | / | | |
| Size | 4 | 21 | 4 | 6 | 7 | 5 |
| Adjusted R2 (%) | 69.37 | 69.55 | 66.64 | 69.10 | 74.60 | 74.38 |
| PE | 0.297 | 0.298 | 0.326 | 0.301 | 0.275 | 0.270 |
The 15 probes selected only by LASSO are omitted. “Yes” means the probe is selected by this method.
M6 stands for the linear model with six probes x1-x6; M4 for the model with four probes x1, x2, x3 and x6.
‘*’ stands for significant at level 0.05, ‘**’ for level 0.01, and ‘***’ for level 0.001.
We further conduct some exploratory analysis. We compare the model with the 20 probes selected by LASSO to the model with the six probes listed in Table 3.2 (denoted by M6 in the table) using the likelihood ratio test (LRT). The p-value of the corresponding LRT is 0.058, which implies that the model with the six probes fits the data well enough. The corresponding estimates and standard errors of the regression coefficients are listed in the second to last column of Table 3.2. The adjusted R2 and the prediction error calculated by the LOOCV method show much improvement over the models selected by the SCAD, LASSO, PC-simple and TPC methods; for example, the prediction error is reduced by about 10%. The coefficients of x4 and x5 do not appear to be significant at level 0.05. We refit the data to the model with only the four probes x1, x2, x3 and x6, and their estimates and standard errors are reported in the last column of Table 3.2. The adjusted R2 and prediction error of this model are very close to those of the model with six probes. The empirical analysis of this example implies that the two comparable schemes for variable selection (i.e., regularization methods such as SCAD and LASSO, and partial correlation based methods such as the PC-simple algorithm and the TPC) can be used to complement each other. For example, the regularization methods would miss probe x6, while the TPC would miss probe x3. Scientists may raise their confidence in the selected probes x1 and x2 since they are chosen by both techniques.
4. Conclusion
In this paper, we proposed a variable selection procedure via thresholded partial correlation (TPC) and established its model selection consistency and sure screening property in the presence of ultrahigh dimensional predictors. Our simulations and the empirical analysis of a real data example illustrate that the TPC may serve as a useful alternative to the commonly used regularization methods for high or ultrahigh dimensional linear regression models.
Supplementary Material
Acknowledgments
Li’s research was supported by National Institute on Drug Abuse (NIDA) grants P50-DA10075 and P50 DA036107. Liu’s research was supported by National Natural Science Foundation of China (NNSFC) grant 11401497 and the Fundamental Research Funds for the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. All authors equally contribute to this paper, and the authors are listed in the alphabetic order.
Appendix
Appendix A: Proof of Theorem 1
Let ui = (xiT, yi)T, i = 1, · · ·, n, and denote q = p + 1. Then u1, · · ·, un are an independent and identically distributed random sample from ECq(μ, Σ, ϕ). To study the asymptotic behavior of the partial correlation under elliptical distributions, we consider the following general partitions of ui, μ and Σ:

ui = (u1iT, u2iT)T,  μ = (μ1T, μ2T)T,  Σ = (Σ11 Σ12; Σ21 Σ22),

where u1i and μ1 are q1-dimensional, u2i and μ2 are q2-dimensional, Σ11 is a q1 × q1 matrix, and Σ22 is a q2 × q2 matrix. Here q = q1 + q2. Let U = (u1, · · ·, un)T and denote

A = n−1(U − 1nμT)T(U − 1nμT),

where 1n is an n × 1 vector with all elements being 1. Partition A in the same way as Σ, and let akl.2 be the (k, l)-element of A11.2 = A11 − A12A22−1A21. Then the sample partial correlation between uik and uil given u2i, ρ̂(uik, uil |u2i), equals akl.2/√(akk.2 all.2).
To derive the asymptotic distribution of A11.2, let

C = (Iq1 −Σ12Σ22−1; 0 Iq2),

where I stands for the identity matrix, and let vi = C(ui − μ). Using Theorem 2.16 of Fang, Kotz and Ng (1990), it follows that

vi ~ ECq(0, diag(Σ11.2, Σ22), ϕ),    (A.1)

where Σ11.2 = Σ11 − Σ12Σ22−1Σ21. Let V = (v1, · · ·, vn)T. By the definition of vi, V = (U − 1nμT)CT. Define

B = n−1VTV = CACT.

Partition B in the same way as A. Then B11 = A11 − Σ12Σ22−1A21 − A12Σ22−1Σ21 + Σ12Σ22−1A22Σ22−1Σ21, B12 = A12 − Σ12Σ22−1A22, B21 = A21 − A22Σ22−1Σ21 and B22 = A22. By direct calculation, it follows that B11.2 = B11 − B12B22−1B21 = A11.2. This enables us to derive the asymptotic distribution of A11.2 through B11.2.
Define W11 = √n(B11 − Σ11.2), W12 = √n B12, W21 = √n B21 and W22 = √n(B22 − Σ22). The assumption that all fourth moments of ui are finite implies that all fourth moments of vi are finite. Thus, it follows by the central limit theorem that Wkl, for k = 1, 2 and l = 1, 2, has an asymptotic normal distribution with mean zero and a finite covariance matrix. Then

B11.2 = B11 − B12B22−1B21 = B11 − n−1W12B22−1W21.

By the assumptions of Theorem 1, the largest eigenvalue of Σ22−1 is positive and finite. Therefore it follows that

√n(B11.2 − B11) = −n−1/2W12B22−1W21 = op(1).

This implies that A11.2 and B11 have the same asymptotic normal distribution, and hence akl.2/√(akk.2 all.2) and bkl/√(bkk bll) have the same asymptotic distribution, where bkl is the (k, l)-element of B11. Further notice that v1i ~ ECq1(0, Σ11.2, ϕ) by (A.1), where v1i consists of the first q1 elements of vi. Therefore, the asymptotic normal distribution of the sample correlation coefficient ρ̂(vik, vil), which equals bkl/√(bkk bll), can be derived from (2.2) with Σ replaced by Σ11.2. Thus, Theorem 1 holds by setting u1i = (xij, yi)T and u2i = xi𝒮.
Appendix B: Proof of Theorems 2 and 3
In this section, we introduce the following lemmas which are used repeatedly in the proofs of the theorems.
Lemma 1
(Hoeffding's Inequality) Assume the independent random variables {Xi : i = 1, · · ·, n} satisfy P(Xi ∈ [ai, bi]) = 1 for some ai and bi, i = 1, · · ·, n. Then, for any ε > 0, the sample mean X̄ satisfies

P{|X̄ − E(X̄)| ≥ ε} ≤ 2 exp{−2n²ε²/Σi=1n(bi − ai)²}.    (B.1)
Lemma 2
Suppose X is a random variable with E(ea|X|) < ∞ for some a > 0. Then there exist positive constants b and c such that for any M > 0,

P(|X| > M) ≤ b exp(−cM).    (B.2)
Lemma 3
Suppose γ̂1 and γ̂2 are estimates of the finite parameters γ1 and γ2, respectively, based on a sample of size n. Assume there exist positive constants b1, b2 and ν such that for any 0 < ε < 1,

P(|γ̂k − γk| > ε) ≤ bk exp(−nνε²/bk), k = 1, 2.    (B.3)

Then

P{|(γ̂1 + γ̂2) − (γ1 + γ2)| > ε} ≤ b3 exp(−nνε²/b3) and P(|γ̂1γ̂2 − γ1γ2| > ε) ≤ b4 exp(−nνε²/b4),

where b3 = b1 + b2, and b4 = 2b1 + b2. If γ2 ≠ 0,

P(|γ̂1/γ̂2 − γ1/γ2| > ε) ≤ b5 exp(−nνε²/b5),

where b5 = b1 + 3b2. If we further assume γ2 > 0, then

P(|√γ̂2 − √γ2| > ε) ≤ b6 exp(−nνε²/b6),

where b6 = 2b2.
The proof of Lemma 3 can be found in the supplemental material of this work.
For ease of notation, denote σ̂jy = n−1 Σi=1n xij yi − x̄j ȳ, σ̂jj = n−1 Σi=1n xij² − x̄j², and σ̂yy = n−1 Σi=1n yi² − ȳ². Then

ρ̂(y, xj) = σ̂jy/√(σ̂jj σ̂yy).    (B.4)
The proof of Theorem 2
We divide the proof into three parts.
Step 1
Study the consistency of ρ̂(y, xj). First consider x̄j. For any 0 < ε < 1 and any M > 0,

P{|x̄j − E(xj)| > ε} ≤ 2 exp(−nε²/C1M²) + C2 n exp(−M/C2),    (B.5)

for some positive constants C1 and C2. The first term above is obtained from Hoeffding's inequality in Lemma 1, applied after truncating the xij at ±M, and the second term follows from condition (D2) and Lemma 2. Take M = O(n1/5); then for large n, (B.5) simplifies to P(|x̄j − E(xj)| > ε) ≤ C3 exp(−nν/C3), where 0 < ν < 1/5 and C3 > 0. In the same fashion, there exist positive constants C4, C5, C6 and C7 such that for large n,

P{|ȳ − E(y)| > ε} ≤ C4 exp(−nν/C4),  P{|n−1 Σi=1n xij yi − E(xj y)| > ε} ≤ C5 exp(−nν/C5),

P{|n−1 Σi=1n xij² − E(xj²)| > ε} ≤ C6 exp(−nν/C6),  P{|n−1 Σi=1n yi² − E(y²)| > ε} ≤ C7 exp(−nν/C7).
Therefore, by (B.4) and Lemma 3,

P{|ρ̂(y, xj) − ρ(y, xj)| > ε} ≤ C8 exp(−nν/C8),

where the positive constant C8 is determined by C3, …, C7.
Note that

ρ(y, xj |x𝒮) = [ρ(y, xj |x𝒮\{k}) − ρ(y, xk |x𝒮\{k}) ρ(xj, xk |x𝒮\{k})] / √[{1 − ρ²(y, xk |x𝒮\{k})}{1 − ρ²(xj, xk |x𝒮\{k})}]    (B.6)

for any k ∈ 𝒮.
Under the boundedness condition (D5), applying Lemma 3 to the sample version of (B.6) and to the Z-transformation (2.4) recursively, we conclude that for some C9 > 0 and C10 > 0,

P{|ρ̂(y, xj |x𝒮) − ρ(y, xj |x𝒮)| > ε} ≤ C9 exp(−nν/C9) and P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| > ε} ≤ C10 exp(−nν/C10).
Furthermore, by the same argument, the sample kurtosis is consistent for the population version at the same rate; that is, there exists C11 > 0 such that P{|κ̂ − κ| > ε} ≤ C11 exp(−nν/C11), and hence for some C12 > 0,

P{|(1 + κ̂)−1/2 Ẑ(y, xj |x𝒮) − (1 + κ)−1/2 Z(y, xj |x𝒮)| > ε} ≤ C12 exp(−nν/C12).
Step 2
Compute P(Ej|𝒮) = P{an error occurs when testing ρ(y, xj |x𝒮) = 0}. Denote Ej|𝒮 = EIj|𝒮 ∪ EIIj|𝒮, where EIj|𝒮 is the event that a type I error occurs and EIIj|𝒮 is the event that a type II error occurs. Then, by choosing αn = 2{1 − Φ(n1/2cn/2)}, the cut-off on the Z scale is of the order cn/2, and

P(EIj|𝒮) ≤ P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| ≥ cn/2} ≤ C12 exp(−nν/C12).
Note that |Z(u)| = |½ log{(1 + u)/(1 − u)}| ≥ |u| for all u ∈ (−1, 1); then |Z(y, xj |x𝒮)| ≥ |ρn(y, xj |x𝒮)| ≥ cn under condition (D4). Thus,

P(EIIj|𝒮) ≤ P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| ≥ cn/2} ≤ C12 exp(−nν/C12).

Therefore, P(Ej|𝒮) ≤ P(EIj|𝒮) + P(EIIj|𝒮) ≤ 2C12 exp(−nν/C12).
Step 3
Study P{𝒜̂n(αn) = 𝒜}. Now consider all j = 1, · · ·, p and all 𝒮 ⊆ {j}c subject to |𝒮| ≤ mn, where mn ≤ m̂reach. Define Ej = ∪{Ej|𝒮 : 𝒮 ⊆ {j}c, |𝒮| ≤ mn}, j = 1, · · ·, p. Then

P{𝒜̂n(αn) ≠ 𝒜} ≤ P(∪j=1p Ej) ≤ p · p^mn max{P(Ej|𝒮) : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ mn} ≤ O{p^(d0+1) exp(−nν/C12)}.    (B.7)

The second inequality holds since the number of possible choices of j is p and there are at most p^mn possible choices for 𝒮. The last inequality in (B.7) is obtained because P(m̂reach = mreach) → 1 and mreach ≤ d0, by the same technique as Lemma 3 in Bühlmann, Kalisch and Maathuis (2010); thus for large n, mn ≤ m̂reach ≤ d0.
Moreover, recall that ν can be chosen arbitrarily in (0, 1/5). Therefore, if d0 is fixed, for p = o(exp(nξ)), 0 < ξ < 1/5, (B.7) is simplified as P{𝒜̂n(αn) ≠ 𝒜} ≤ O{exp(−nν/C12)}, provided ξ < ν < 1/5. If d0 = O(nb), 0 < b < 1/5, for p = o(exp(nξ)), 0 < ξ < 1/5 − b, (B.7) becomes P{𝒜̂n(αn) ≠ 𝒜} ≤ O{exp(−nν/C12)}, provided that ξ + b < ν < 1/5. This completes the proof of Theorem 2 with C = C12.
Proof of Theorem 3
We only need to consider the first step of the thresholded partial correlation approach, where, by Step 1 of the proof of Theorem 2, we have

P{|ρ̂(y, xj) − ρ(y, xj)| > ε} ≤ C13 exp(−nν/C13)

for some C13 > 0. Define Ej = {an error occurs when testing ρ(y, xj) = 0}; then, using the same technique as in the proof of Theorem 2,

P(Ej) ≤ C13 exp(−nν/C13).
Then

P{𝒜 ⊄ 𝒜̂[1](αn)} ≤ P(∪j=1p Ej) ≤ p C13 exp(−nν/C13)

for any 0 < ν < 1/5. Therefore, for p = o(exp(nξ)) and ξ < ν < 1/5, P{𝒜 ⊆ 𝒜̂[1](αn)} ≥ 1 − O{exp(−nν/C*)}. This completes the proof of Theorem 3 with C* = C13.
References
- Arellano-Valle RB, del Pino F, Iglesias P. Bayesian inference in spherical linear models: robustness and conjugate analysis. Journal of Multivariate Analysis. 2006;97:179–197.
- Bühlmann P, Kalisch M, Maathuis M. Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika. 2010;97:261–278.
- Chen Y, Wiesel A, Hero AO. Robust shrinkage estimation of high-dimensional covariance matrices. IEEE Transactions on Signal Processing. 2011;59:4097–4107.
- Chiang AP, Beck JS, Yen H-J, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim K-Y, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, Sheffield VC. Homozygosity mapping with SNP arrays identifies a novel gene for Bardet-Biedl syndrome (BBS10). Proceedings of the National Academy of Sciences. 2006;103:6287–6292. doi: 10.1073/pnas.0600158103.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
- Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20:101–148.
- Fang KT, Kotz S, Ng KW. Symmetric Multivariate and Related Distributions. New York, NY: Chapman and Hall; 1990.
- Huang J, Ma SG, Zhang CH. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Liang H, Li R. Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association. 2009;104:234–248. doi: 10.1198/jasa.2009.0127.
- McNeil AJ, Frey R, Embrechts P. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton, NJ: Princeton University Press; 2005.
- Mottonen J, Oja H, Tienari J. On the efficiency of multivariate spatial sign and rank tests. Annals of Statistics. 1997;25:542–552.
- Muirhead RJ. Aspects of Multivariate Statistical Theory. New York: Wiley; 1982.
- Oja H, Randles RH. Multivariate nonparametric tests. Statistical Science. 2004;19:598–605.
- Osiewalski J. A note on Bayesian inference in a regression model with elliptical errors. Journal of Econometrics. 1991;48:183–193.
- Osiewalski J, Steel MFJ. Robust Bayesian inference in elliptical regression models. Journal of Econometrics. 1993;57:345–363.
- Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
- Schmidt R. Tail dependence for elliptically contoured distributions. Mathematical Methods of Operations Research. 2002;55:301–327.
- Soloveychik I, Wiesel A. Performance analysis of Tyler's covariance estimator. IEEE Transactions on Signal Processing. 2015;63:418–426.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vidal I, Arellano-Valle RB. Bayesian inference for dependent elliptical measurement error models. Journal of Multivariate Analysis. 2010;101:2587–2597.
- Wang L, Peng B, Li R. A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association. 2015. doi: 10.1080/01621459.2014.988215. Accepted.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.