Abstract
A partial correlation based variable selection method was proposed for normal linear regression models by Bühlmann, Kalisch and Maathuis (2010) as a comparable alternative to regularization methods for variable selection. This paper addresses two important issues related to partial correlation based variable selection: (a) whether the method is sensitive to the normality assumption, and (b) whether the method is valid when the dimension of the predictor vector increases at an exponential rate of the sample size. To address issue (a), we systematically study this method for elliptical linear regression models. Our finding indicates that the original proposal may lead to inferior performance when the marginal kurtosis of the predictors is not close to that of the normal distribution. Our simulation results further confirm this finding. To ensure good performance of the partial correlation based variable selection procedure, we propose a thresholded partial correlation (TPC) approach to select significant variables in linear regression models. We establish the selection consistency of the TPC in the presence of ultrahigh dimensional predictors. Since the TPC procedure includes the original proposal as a special case, our theoretical results address issue (b) directly. As a by-product, we obtain the sure screening property of the first step of the TPC. Numerical examples also illustrate that the TPC is competitively comparable to the commonly used regularization methods for variable selection.
Keywords and phrases: Elliptical distribution, model selection consistency, partial correlation, partial faithfulness, sure screening property, ultrahigh dimensional linear model, variable selection
1. Introduction
Variable selection via penalized least squares has been extensively studied during the last two decades. Popular penalized least squares variable selection procedures include LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and adaptive LASSO (Zou, 2006), among others. See Fan and Lv (2010) for a selective overview of this topic and the references therein for more work on variable selection via penalized least squares.
As a powerful alternative to penalized least squares for variable selection, Bühlmann, Kalisch and Maathuis (2010) proposed a variable selection procedure that ranks the partial correlations (PC) between the predictors and the response, named the PC-simple algorithm. The definition of partial correlation is given in section 2. The authors provided a stepwise algorithm for linear regression models with partial faithfulness, under which, for each predictor, if its partial correlation with the response given some subset of the other predictors is 0, then its partial correlation given all the other predictors is also 0. The PC-simple algorithm possesses model selection consistency for such linear models and is thus competitively comparable to penalized least squares variable selection approaches. Therefore, scientists have two distinct schemes for variable selection in high-dimensional linear models, which raises their confidence in predictors selected by both techniques.
This work aims to study two important issues related to the PC-simple algorithm. The first issue is that the procedure proposed in Bühlmann, Kalisch and Maathuis (2010) relies on a normality assumption on the joint distribution of the response and the predictors, although partial faithfulness itself does not require normality. Thus, it is of great interest to study the impact of the normality assumption on the variable selection procedure developed in Bühlmann, Kalisch and Maathuis (2010). The second issue is that the theoretical results established in Bühlmann, Kalisch and Maathuis (2010) require the dimension of the predictor vector to increase at a polynomial rate of the sample size. It is also of great interest to study whether the theoretical results remain valid with dimensionality increasing at an exponential rate of the sample size.
To study the issue related to the normality assumption, we consider the elliptical linear model (i.e., the response and the predictors in a linear regression model jointly follow an elliptical distribution); elliptical distributions are systematically studied in Fang, Kotz and Ng (1990). The elliptical distribution family contains a much broader class of distributions than the normal family, such as mixtures of normal distributions, the multivariate t-distribution, the multi-uniform distribution on the unit sphere and the Pearson Type II distribution, among others. It has been used as a tool to study robustness to normality in the literature on multivariate nonparametric tests (Mottonen, Oja and Tienari, 1997; Oja and Randles, 2004; Chen, Wiesel and Hero, 2011; Soloveychik and Wiesel, 2015; Wang, Peng and Li, 2015). Elliptical linear regressions have been proposed in Osiewalski (1991) and Osiewalski and Steel (1993), and have received more and more attention in the recent literature (Arellano-Valle, del Pino and Iglesias, 2006; Fan and Lv, 2008; Liang and Li, 2009; Vidal and Arellano-Valle, 2010). Furthermore, the elliptical distribution family has a variety of applications. For instance, it is crucial for modeling finance data (McNeil, Frey and Embrechts, 2005) due to its ability to accommodate tail dependence (the phenomenon of simultaneous extremes), which is highly useful in quantitative finance but is not allowed by the multivariate normal distribution (Schmidt, 2002).
By exploring the limiting distribution of the sample partial correlation under elliptical distributions, which is of interest in its own right, we find that the PC-simple algorithm tends to over-fit (under-fit) the model under those elliptical distributions whose marginal kurtosis is larger (smaller) than that of the normal distribution. To ensure good performance of partial correlation based variable selection for the elliptical distribution family, we propose a thresholded partial correlation (TPC) approach to select significant variables in linear regression models. In the same spirit as the PC-simple algorithm, the TPC is a stepwise method for variable selection, constructed by comparing each sample correlation and sample partial correlation with a threshold corresponding to a given significance level. The TPC approach relies on the limiting distribution of the sample partial correlation and coincides with the PC-simple algorithm for normal linear models. This enables us to study the asymptotic properties of the PC-simple algorithm under a broader framework and thereby address the issue of dimensionality increasing at an exponential rate.
We systematically study the sampling properties of the TPC. We first derive a concentration inequality for the sample partial correlations, without model assumptions, when the dimensionality of the covariates increases at an exponential rate of the sample size. This enables us to apply the TPC to ultrahigh dimensional linear models. We further establish the theoretical properties of the TPC, which broadens the usage of this variable selection scheme. We also develop the sure screening property of the first step of the TPC, in the terminology of Fan and Lv (2008). Note that the first step of the TPC has the same spirit as marginal screening based on the Pearson correlation (Fan and Lv, 2008). Thus, as a by-product, we obtain the sure screening property of the marginal Pearson correlation screening procedure under assumptions different from those of Fan and Lv (2008).
This paper is organized as follows. In section 2, we propose the TPC for elliptical linear models and establish its asymptotic properties. Numerical studies are conducted in section 3. A brief conclusion is given in section 4, and all technical proofs are given in the Appendix.
2. Thresholded partial correlation (TPC) approach
In this section, we introduce the linear model with elliptical response and predictors, and motivate the TPC approach by studying the limiting distribution of the sample partial correlation for elliptical distributions. We then propose the TPC approach based on this limiting distribution and discuss its theoretical properties.
2.1 Elliptical linear model and its partial correlation estimation
Consider a linear model
y = xTβ + ε,    (2.1)
where y is the response variable, x = (x1, · · ·, xp)T is the covariate vector, β = (β1, …, βp)T is the coefficient vector, and ε is the random error with E(ε) = 0 and var(ε) = σ². Throughout this paper, it is assumed without loss of generality that E(x) = 0 and E(y) = 0, so that there is no intercept in model (2.1). In practice, it is common that x and y are marginally standardized before performing variable selection. Furthermore, (xT, y) is assumed to follow an elliptical distribution; that is, suppose that (x1T, y1), · · ·, (xnT, yn) are independent and identically distributed (iid) random samples from an elliptical distribution ECp+1(μ, Σ, ϕ), which has the characteristic function exp(itTμ)ϕ(tTΣt) for some characteristic generator ϕ(·) (Fang, Kotz and Ng, 1990).
Bühlmann, Kalisch and Maathuis (2010) proposed a variable selection method, the PC-simple algorithm, based on partial correlation learning for the linear model (2.1) with normal response and predictors. To extend this method to elliptical distributions, we first study the limiting distributions of the correlation and partial correlation under the ellipticity assumption. Denote by ρ(y, xj) and ρ̂(y, xj) the population and sample correlation between y and xj, respectively. Then, as shown in Theorem 5.1.6 of Muirhead (1982), the asymptotic distribution of ρ̂(y, xj) is
√n {ρ̂(y, xj) − ρ(y, xj)} → N(0, (1 + κ){1 − ρ²(y, xj)}²) in distribution,    (2.2)
where κ = ϕ″(0)/{ϕ′(0)}² − 1, with ϕ′(0) and ϕ″(0) being the first and second derivatives of ϕ at 0. Here κ is the marginal kurtosis parameter of the elliptical distribution ECp+1(μ, Σ, ϕ), and it equals 0 for a normal distribution Np+1(μ, Σ).
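For instance, the normal family has characteristic generator ϕ(t) = exp(−t/2), and a direct computation recovers κ = 0:

```latex
\phi'(t) = -\tfrac{1}{2}e^{-t/2}, \qquad
\phi''(t) = \tfrac{1}{4}e^{-t/2}, \qquad
\kappa = \frac{\phi''(0)}{\{\phi'(0)\}^{2}} - 1
       = \frac{1/4}{(-1/2)^{2}} - 1 = 0 .
```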
For an index set 𝒮 ⊆ {1, 2, · · ·, p}, we define 𝒮c to be 𝒮c = {1 ≤ j ≤ p : j ∉ 𝒮}, |𝒮| to be its cardinality, and x𝒮 = {xj : j ∈ 𝒮} to be a subset of covariates with index set 𝒮. Denote the truly active index set 𝒜 = {1 ≤ j ≤ p : βj ≠ 0} and the corresponding cardinality d0 = |𝒜|. Based on x𝒮, the definition of the partial correlation is given below.
Definition 1
(Partial Correlation) The partial correlation between xj and y given a set of controlling variables x𝒮, denoted by ρ(y, xj |x𝒮), is defined as the correlation between the residuals rxj,x𝒮 and ry,x𝒮 from the linear regressions of xj on x𝒮 and of y on x𝒮, respectively. The corresponding sample partial correlation between y and xj given x𝒮 is denoted by ρ̂(y, xj |x𝒮).
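To make Definition 1 concrete, the following minimal sketch computes a sample partial correlation from the two residual regressions (the helper name and interface are ours, not from the paper):

```python
import numpy as np

def sample_partial_correlation(y, xj, XS):
    """Sample partial correlation rho_hat(y, xj | x_S) per Definition 1:
    the ordinary correlation between the least-squares residuals of y on
    x_S and of xj on x_S. An empty XS (shape (n, 0)) recovers the plain
    sample correlation rho_hat(y, xj)."""
    n = len(y)
    # Include an intercept column so that both residual vectors are centered.
    D = np.column_stack([np.ones(n), XS]) if XS.size else np.ones((n, 1))
    r_y = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
    r_x = xj - D @ np.linalg.lstsq(D, xj, rcond=None)[0]
    return (r_y @ r_x) / np.sqrt((r_y @ r_y) * (r_x @ r_x))
```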
In the next theorem, we study the asymptotic distribution of the sample partial correlation when the sample is drawn from an elliptical distribution; this result provides the foundation of the TPC variable selection procedure.
Theorem 1
Suppose that (x1T, y1), · · ·, (xnT, yn) are iid random samples from an elliptical distribution ECp+1(μ, Σ, ϕ) with finite fourth moments. For any j = 1, · · ·, p and 𝒮 ⊆ {j}c, if there exists a positive constant δ0 such that the smallest eigenvalue of the covariance matrix of x𝒮 is greater than δ0, then
√n {ρ̂(y, xj |x𝒮) − ρ(y, xj |x𝒮)} → N(0, (1 + κ){1 − ρ²(y, xj |x𝒮)}²) in distribution.    (2.3)
Theorem 1 seems to be a natural extension of the partial correlation result from the normal distribution to elliptical distributions, but to the best of our knowledge this result is new; its proof is given in Appendix A. Let ∅ be the empty set, and let ρ̂(y, xj |x∅) and ρ(y, xj |x∅) stand for ρ̂(y, xj) and ρ(y, xj), respectively. Then (2.3) is also valid for 𝒮 = ∅ by (2.2). The limiting distributions of the sample correlation and partial correlation given in (2.2) and (2.3) provide insights into the impact of the normality assumption on the PC-simple algorithm through the marginal kurtosis under the ellipticity assumption. This enables us to modify the PC-simple algorithm by taking the marginal kurtosis into account to ensure its good performance.
In addition, since the limiting distribution of ρ̂(y, xj |x𝒮) in (2.3) involves ρ(y, xj |x𝒮) in the asymptotic variance, we consider the Fisher Z-transformation of ρ̂(y, xj |x𝒮), whose limiting distribution no longer depends on ρ(y, xj |x𝒮). Specifically, let Ẑ(y, xj |x𝒮) and Z(y, xj |x𝒮) be the Fisher Z-transformations of ρ̂(y, xj |x𝒮) and ρ(y, xj |x𝒮), respectively. That is,
Ẑ(y, xj |x𝒮) = ½ log[{1 + ρ̂(y, xj |x𝒮)}/{1 − ρ̂(y, xj |x𝒮)}],  Z(y, xj |x𝒮) = ½ log[{1 + ρ(y, xj |x𝒮)}/{1 − ρ(y, xj |x𝒮)}].    (2.4)
Then, it follows by the delta method and Theorem 1 that
√n {Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)} → N(0, 1 + κ) in distribution.    (2.5)
The asymptotic distribution of Ẑ(y, xj |x𝒮) no longer depends on ρ(y, xj |x𝒮); thus it is easier to derive the selection threshold for Ẑ(y, xj |x𝒮) than for ρ̂(y, xj |x𝒮) directly.
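In code, the transformation and its inverse, which converts a cut-off on the Ẑ scale back to one on the ρ̂ scale, are one-liners (a sketch; the function names are ours):

```python
import numpy as np

def fisher_z(r):
    """Fisher Z-transformation (2.4): variance-stabilizes a (partial)
    correlation so that the limit in (2.5) has variance 1 + kappa."""
    return 0.5 * np.log((1.0 + r) / (1.0 - r))  # equals arctanh(r)

def inv_fisher_z(z):
    """Inverse of the Fisher Z-transformation: maps a cut-off on the
    Z scale back to a cut-off on the correlation scale."""
    return np.tanh(z)
```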
2.2 A variable selection algorithm
Based on the partial faithfulness condition, the following holds for all j ∈ {1, …, p} (Bühlmann, Kalisch and Maathuis, 2010):

ρ(y, xj |x𝒮) ≠ 0 for all 𝒮 ⊆ {j}c  if and only if  βj ≠ 0.
That is, xj is important (i.e., βj ≠ 0) if and only if the partial correlations between y and xj given all subsets 𝒮 of {j}c are nonzero. Extending the PC-simple algorithm proposed by Bühlmann, Kalisch and Maathuis (2010), we propose to identify the active predictors by iteratively testing the series of hypotheses

H0: ρ(y, xj |x𝒮) = 0 versus H1: ρ(y, xj |x𝒮) ≠ 0, for j = 1, · · ·, p and 𝒮 ⊆ {j}c,

where m̂reach = min{m : |𝒜̂[m]| ≤ m} and 𝒜̂[m] is the model index set chosen in the mth step, with cardinality |𝒜̂[m]|. Based on the limiting distribution (2.5), the rejection region at significance level α is

|Ẑ(y, xj |x𝒮)| > n−1/2 √(1 + κ̂) Φ−1(1 − α/2),

with κ̂ being a consistent estimate of κ, where Φ−1(·) is the inverse of the cumulative distribution function of the standard normal distribution. In practice, the factor n−1/2 in the rejection region is replaced by (n − |𝒮| − 1)−1/2 to account for the degrees of freedom lost in the calculation of the residuals. Therefore, an equivalent form of the rejection region with the small sample correction is
|ρ̂(y, xj |x𝒮)| > Tn(α, |𝒮|, κ̂),    (2.6)
where
Tn(α, |𝒮|, κ̂) = [exp{2γn(α, |𝒮|, κ̂)} − 1] / [exp{2γn(α, |𝒮|, κ̂)} + 1],  γn(α, |𝒮|, κ̂) = (n − |𝒮| − 1)−1/2 √(1 + κ̂) Φ−1(1 − α/2),    (2.7)
with κ being estimated by its sample counterpart:
κ̂ = p−1 Σj=1p [ n−1 Σi=1n (xij − x̄j)⁴ / {3 (n−1 Σi=1n (xij − x̄j)²)²} − 1 ],    (2.8)
where x̄j is the sample mean of the j-th element of x and xij is the j-th element of xi. In practice, the sample partial correlations can be computed recursively: for any k ∈ 𝒮,

ρ̂(y, xj |x𝒮) = [ρ̂(y, xj |x𝒮\{k}) − ρ̂(y, xk |x𝒮\{k}) ρ̂(xj, xk |x𝒮\{k})] / √[{1 − ρ̂²(y, xk |x𝒮\{k})}{1 − ρ̂²(xj, xk |x𝒮\{k})}].    (2.9)
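The recursion (2.9) reduces an order-|𝒮| partial correlation to three of order |𝒮| − 1, so only plain sample correlations are ever needed. A bare-bones sketch (labels and helper name ours; in practice one would memoize intermediate results) is:

```python
import numpy as np

def pcor_recursive(rho, a, b, S):
    """Sample partial correlation rho_hat(a, b | S) via the recursion (2.9).
    `rho` maps frozenset pairs of variable labels (e.g. 'y', 1, ..., p) to
    plain sample correlations; `S` is a sequence of controlling labels.
    This unmemoized version recomputes subproblems exponentially often."""
    S = tuple(S)
    if not S:
        return rho[frozenset((a, b))]
    k, rest = S[0], S[1:]                      # peel off one controlling variable
    r_ab = pcor_recursive(rho, a, b, rest)     # rho(a, b | S \ {k})
    r_ak = pcor_recursive(rho, a, k, rest)     # rho(a, k | S \ {k})
    r_bk = pcor_recursive(rho, b, k, rest)     # rho(b, k | S \ {k})
    return (r_ab - r_ak * r_bk) / np.sqrt((1 - r_ak**2) * (1 - r_bk**2))
```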
We summarize the TPC variable selection procedure in the following algorithm.
Algorithm 1.

Algorithm for TPC variable selection.

Step 1: Set m = 1 and 𝒮 = ∅. Obtain the marginally estimated active set

𝒜̂[1] = {1 ≤ j ≤ p : |ρ̂(y, xj)| > Tn(α, 0, κ̂)}.

Step 2: Set m = m + 1. Based on 𝒜̂[m−1], construct the mth step estimated active set

𝒜̂[m] = {j ∈ 𝒜̂[m−1] : |ρ̂(y, xj |x𝒮)| > Tn(α, |𝒮|, κ̂) for all 𝒮 ⊆ 𝒜̂[m−1]\{j} with |𝒮| = m − 1}.

Step 3: Repeat Step 2 until m = m̂reach.
Algorithm 1 results in a sequence of estimated active sets

𝒜̂[1] ⊇ 𝒜̂[2] ⊇ · · · ⊇ 𝒜̂[m̂reach].
Since κ = 0 for normal distributions, the TPC reduces to the PC-simple algorithm under the normality assumption. Thus, Theorem 1 clearly shows that the PC-simple algorithm tends to over-fit (under-fit) the model under those distributions whose kurtosis is larger (smaller) than the normal kurtosis 0. Following Bühlmann, Kalisch and Maathuis (2010), we further apply the ordinary least squares approach to estimate the coefficients of the predictors in 𝒜̂[m̂reach] after running Algorithm 1.
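To make the procedure concrete, here is a minimal, self-contained sketch of Algorithm 1 in Python. It follows our reconstruction of the threshold (2.6)-(2.7), including the (n − |𝒮| − 1) degrees-of-freedom correction, and of the kurtosis estimate (2.8); the function names and interface are ours, not the authors'.

```python
import numpy as np
from itertools import combinations
from scipy.stats import norm

def tpc_select(X, y, alpha=0.05):
    """Sketch of Algorithm 1 (TPC). Each pass keeps predictor j only if all
    partial correlations with y, given subsets (of the current size) of the
    other surviving predictors, clear the kurtosis-adjusted threshold."""
    n, p = X.shape

    def pcor(j, S):
        # rho_hat(y, xj | x_S) via the two residual regressions (Definition 1).
        D = np.column_stack([np.ones(n)] + [X[:, k] for k in S])
        ry = y - D @ np.linalg.lstsq(D, y, rcond=None)[0]
        rx = X[:, j] - D @ np.linalg.lstsq(D, X[:, j], rcond=None)[0]
        return (ry @ rx) / np.sqrt((ry @ ry) * (rx @ rx))

    # Pooled marginal kurtosis estimate (our reading of (2.8)); ~0 if normal.
    Xc = X - X.mean(axis=0)
    kappa = np.mean((Xc**4).mean(0) / (3.0 * (Xc**2).mean(0)**2) - 1.0)

    z = norm.ppf(1.0 - alpha / 2.0)
    def thresh(s):
        # Our reading of (2.6)-(2.7): tanh maps the Z-scale cut-off back to
        # the correlation scale; s is the size of the conditioning set.
        return np.tanh(np.sqrt(1.0 + kappa) * z / np.sqrt(n - s - 1))

    active = [j for j in range(p) if abs(pcor(j, ())) > thresh(0)]  # Step 1
    size = 1                                   # size of conditioning sets
    while size < len(active):                  # Steps 2-3: stop at m_reach
        survivors = []
        for j in active:
            others = [k for k in active if k != j]
            if all(abs(pcor(j, S)) > thresh(size)
                   for S in combinations(others, size)):
                survivors.append(j)
        active, size = survivors, size + 1
    return active
```

Setting κ̂ = 0 in the threshold recovers a PC-simple-type rule, and, as described above, ordinary least squares would then be refitted on the returned set.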
2.3 Theoretical properties
We impose the following regularity conditions to establish the asymptotic theory of the TPC. These regularity conditions may not be the weakest ones.
(D1) The joint distribution of (xT, y) satisfies partial faithfulness (Bühlmann, Kalisch and Maathuis, 2010).

(D2) (xT, y) follows ECp+1(μ, Σ, ϕ) with Σ > 0. Furthermore, there exists s0 > 0 such that for all 0 < s < s0,

E{exp(s|xj|)} < ∞, j = 1, · · ·, p, and E{exp(s|y|)} < ∞.

(D3) There exists δ > −1 such that the kurtosis satisfies κ > δ > −1.

(D4) For some cn = O(n−d), 0 < d < 1/2, the partial correlations ρ(y, xj |x𝒮) satisfy

inf{|ρ(y, xj |x𝒮)| : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ d0, ρ(y, xj |x𝒮) ≠ 0} ≥ cn.

(D5) The partial correlations ρ(y, xj |x𝒮) and ρ(xj, xk |x𝒮) satisfy:

i) sup{|ρ(y, xj |x𝒮)| : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ d0} ≤ τ < 1;

ii) sup{|ρ(xj, xk |x𝒮)| : 1 ≤ j ≠ k ≤ p, 𝒮 ⊆ {j, k}c, |𝒮| ≤ d0} ≤ τ < 1.
Condition (D1) guarantees the validity of the TPC method as a variable selection criterion. The elliptical distribution assumption in (D2) is crucial for deriving the asymptotic distribution of the sample partial correlation, and the sub-exponential tail condition ensures that the difference between the population and sample partial correlations degenerates at an exponential rate. Many elliptical distributions satisfy the sub-exponential tail condition, such as the multivariate normal distribution and the Pearson Type II distribution (Fang, Kotz and Ng, 1990). In addition, although (D2) is widely used in the literature as a sufficient condition to facilitate the theoretical proofs, it may not be the weakest condition to guarantee the validity of the TPC. Condition (D3) puts a mild restriction on the kurtosis and is used to control the Type I and Type II errors. The lower bound on the partial correlations in (D4) is used to control the Type II errors of the tests; this condition has the same spirit as the requirement of penalty-based methods that the nonzero coefficients be bounded away from 0. The upper bound on the partial correlations in condition i) of (D5) is used to control the Type I errors, and condition ii) of (D5) imposes a fixed upper bound on the population partial correlations between the covariates, which excludes perfect collinearity between the covariates.
Based on the above regularity conditions, we obtain the following consistency properties. We first consider the model selection consistency of the final estimated active set of the TPC. Since the TPC depends on the significance level α = αn, we write the final chosen model as 𝒜̂n(αn).
Theorem 2
Consider linear model (2.1). Under Conditions (D1)–(D5), there exist a sequence αn → 0 and a positive constant C such that, if d0 is fixed, then for p = o(exp(nξ)) with 0 < ξ < 1/5, the estimated active set can be identified at the following rate:
P{𝒜̂n(αn) = 𝒜} ≥ 1 − O{exp(−nν/C)},    (2.10)
where ξ < ν < 1/5; and if d0 = O(nb), 0 < b < 1/5, then for p = o(exp(nξ)), 0 < ξ < 1/5 − b, (2.10) still holds, with ξ + b < ν < 1/5.
The proof of this theorem is given in Appendix B. Theorem 2 implies that the TPC method, including the original PC-simple algorithm, enjoys model selection consistency when the dimensionality increases at an exponential rate of the sample size. Following Bühlmann, Kalisch and Maathuis (2010), one possible choice of the theoretical significance level is αn = 2{1 − Φ(n1/2cn/2)}.
Note that Bühlmann, Kalisch and Maathuis (2010) utilized the tail probability of the normal distribution to control the upper bounds of the probabilities of Type I and II errors. Thus, they had to assume that the model dimension grows at a polynomial rate of the sample size. We take a different approach from Bühlmann, Kalisch and Maathuis (2010) to establishing the model selection consistency in Theorem 2. We first derive the concentration inequality for the sample partial correlations, as in Step 1 of the proof of Theorem 2. In this step, we do not require the ellipticity assumption. With the concentration inequality, we allow the dimensionality of the covariates to increase at an exponential rate of the sample size. This enables us to apply the TPC to ultrahigh dimensional linear models.
In addition, notice that the estimated active set from the first step of the TPC, denoted by 𝒜̂[1], can be viewed as a feature screening procedure and is essentially equivalent to the sure independence screening procedure proposed by Fan and Lv (2008). Thus, we next establish the sure screening property (Fan and Lv, 2008) of this first step of the TPC under a different set of assumptions. We impose the following conditions on the population marginal correlations:
(E4) inf{|ρn(y, xj)| : j = 1, · · ·, p, ρn(y, xj) ≠ 0} ≥ cn, where cn = O(n−d) and 0 < d < 1/2.

(E5) sup{|ρn(y, xj)| : j = 1, · · ·, p} ≤ τ < 1.
Theorem 3
Consider linear model (2.1) and assume (D1)–(D3), (E4) and (E5). For p = O(exp(nξ)) with 0 < ξ < 1/5, there exists a sequence αn → 0 such that P{𝒜 ⊆ 𝒜̂[1](αn)} ≥ 1 − O{exp(−nν/C*)}, where C* is a positive constant and ξ < ν < 1/5.
The proof of this theorem is given in Appendix B. This theorem confirms the sure screening property of the marginal screening procedure based on the Pearson correlation under a different set of regularity conditions from Fan and Lv (2008).
3. Numerical Studies
In this section, we assess the finite sample performance of the TPC method and compare it with LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001) and the PC-simple algorithm through Monte Carlo simulation studies. We also illustrate an application of the TPC to a rat eye expression data set.
3.1 Simulation studies
In our simulation study, data were generated from linear model (2.1) with β1 = 3, β2 = 1.5, β5 = 2 and βj = 0 for j ≠ 1, 2, 5. We consider p = 200, 500 and 2000, and the sample size is taken to be n = 200. Moreover, the joint distribution of (xT, ε) is taken to be either 0.9N(0, Σ) + 0.1N(0, 9Σ), which is an elliptical distribution, or the normal distribution N(0, Σ), where Σ is the (p + 1) × (p + 1) matrix with (i, j)th entry ρ|i−j|. We consider ρ = 0, 0.3 and 0.8, which correspond to uncorrelated, moderately correlated and strongly correlated covariates, respectively. Note that the estimated kurtosis of the normal mixture is around 1.5, deviating from the normal value 0 to a large extent. Thus, the mixture of normal distributions in this study illustrates a heavy-tailed situation. For each case, we conduct 1000 simulation replications.
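For concreteness, one replication of this design can be generated as follows (a sketch; the function name, seeding and the row-scaling construction of the normal mixture are our choices):

```python
import numpy as np

def gen_data(n=200, p=200, rho=0.3, elliptical=True, seed=0):
    """One replication of the Section 3.1 design: (x, eps) jointly follows
    0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma) (elliptical case) or N(0, Sigma),
    where Sigma is (p+1) x (p+1) with (i, j)th entry rho^|i-j|."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p + 1)
    Sigma = rho ** np.abs(np.subtract.outer(idx, idx))
    Z = rng.standard_normal((n, p + 1)) @ np.linalg.cholesky(Sigma).T
    if elliptical:
        # Inflate whole rows by 3 with probability 0.1: since 3^2 = 9, this
        # is exactly the scale mixture 0.9 N(0, Sigma) + 0.1 N(0, 9 Sigma).
        Z *= np.where(rng.random(n) < 0.1, 3.0, 1.0)[:, None]
    X, eps = Z[:, :p], Z[:, p]
    beta = np.zeros(p)
    beta[[0, 1, 4]] = [3.0, 1.5, 2.0]   # beta_1 = 3, beta_2 = 1.5, beta_5 = 2
    return X, X @ beta + eps, beta
```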
In our simulation, we compare the finite sample performance of LASSO (Tibshirani, 1996), SCAD (Fan and Li, 2001), the PC-simple algorithm (Bühlmann, Kalisch and Maathuis, 2010) and the TPC. The following criteria are used to evaluate the performance of the variable selection procedures.
Model error: Ex[{xT(β̂ − β)}²] = (β̂ − β)T cov(x)(β̂ − β).

True positive number (TPN), defined as the average number of predictors with nonzero coefficients that are successfully detected over the 1000 simulations.

False positive number (FPN), defined as the average number of predictors with zero coefficients that are erroneously selected into the model.

Underfit percentage (UF), defined as the percentage of underfitted models, i.e., models that fail to identify at least one important predictor, in the 1000 simulations.

Correct-fit percentage (CF), defined as the percentage of correctly fitted models, i.e., models that select exactly the truly important predictors, in the 1000 simulations.

Overfit percentage (OF), defined as the percentage of overfitted models, i.e., models that identify all the important predictors but include at least one unimportant predictor, in the 1000 simulations.
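These criteria reduce to a few set operations and a quadratic form per replication; the sketch below (argument names ours; the true active set {1, 2, 5} is written in 0-based indexing) illustrates the bookkeeping:

```python
import numpy as np

def evaluate(selected, beta_hat, beta, Sigma, truth=frozenset({0, 1, 4})):
    """Evaluation criteria of Section 3.1 for a single replication; TPN/FPN
    are averaged and UF/CF/OF turned into percentages over replications."""
    sel = frozenset(selected)
    d = beta_hat - beta
    me = d @ Sigma @ d            # model error (beta_hat-beta)' cov(x) (beta_hat-beta)
    tpn = len(sel & truth)        # truly important predictors detected
    fpn = len(sel - truth)        # unimportant predictors selected
    uf = tpn < len(truth)         # misses at least one important predictor
    cf = sel == truth             # selects exactly the true model
    of = (not uf) and fpn > 0     # all important ones found, plus extras
    return me, tpn, fpn, uf, cf, of
```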
Table 3.1 depicts the simulation results for the elliptical distribution and clearly shows that the TPC performs significantly better than LASSO, SCAD and the PC-simple algorithm in most situations, regardless of low or high model dimensionality. Specifically, LASSO consistently over-fits the model under every scenario, as indicated in the literature. The models selected by SCAD are also larger in this case than those selected by the PC-based methods; thus its correct-fit rate is much lower while its over-fit rate is high. Furthermore, since the PC-simple algorithm relies on normality, it fails to capture the correct model with high probability under this elliptical distribution; in particular, when the x-variables are independent, the correct-fit rates of the PC-simple algorithm are only 25% and 7% for p = 500 and 2000, respectively. The PC-simple algorithm tends to overfit this model because the marginal kurtosis is around 1.5, larger than the normal value 0. For instance, when p = 500, the over-fit rates (OF) of the PC-simple algorithm are 0.58, 0.65 and 0.22 for ρ = 0, 0.3 and 0.8, respectively, while those of the TPC are 0.04, 0.07 and 0.00. By correcting the threshold under the ellipticity assumption, the TPC increases the probability of recovering the true model to a large degree. Furthermore, to save space, the computing time for 1000 simulations with p = 2000 is reported in Table S1 in the supplemental material. In terms of computational cost, the TPC converges much faster than the PC-simple algorithm and SCAD, and is comparable to the LARS algorithm for LASSO.
Table 3.1.
Simulation Results for Example 1: Elliptical Distribution
| p | ρ | Method | MedME (Devi) | TPN | FPN | UF | CF | OF |
|---|---|---|---|---|---|---|---|---|
| 200 | 0 | SCAD | 0.050 (0.024) | 3.00 | 4.52 | 0.00 | 0.51 | 0.49 |
| | | LASSO | 8.984 (0.219) | 3.00 | 33.63 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.082 (0.050) | 2.92 | 0.82 | 0.08 | 0.41 | 0.51 |
| | | TPC | 0.045 (0.032) | 2.84 | 0.13 | 0.16 | 0.81 | 0.03 |
| 200 | 0.3 | SCAD | 0.046 (0.023) | 3.00 | 3.90 | 0.00 | 0.50 | 0.50 |
| | | LASSO | 11.195 (0.216) | 3.00 | 30.26 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.063 (0.036) | 3.00 | 0.46 | 0.00 | 0.58 | 0.42 |
| | | TPC | 0.036 (0.024) | 2.99 | 0.04 | 0.01 | 0.96 | 0.03 |
| 200 | 0.8 | SCAD | 0.044 (0.026) | 3.00 | 2.51 | 0.00 | 0.50 | 0.50 |
| | | LASSO | 20.925 (0.158) | 3.00 | 16.44 | 0.00 | 0.02 | 0.98 |
| | | PC-simple | 0.039 (0.026) | 2.94 | 0.17 | 0.06 | 0.83 | 0.11 |
| | | TPC | 0.057 (0.040) | 2.79 | 0.20 | 0.19 | 0.80 | 0.01 |
| 500 | 0 | SCAD | 0.041 (0.022) | 3.00 | 5.57 | 0.00 | 0.41 | 0.59 |
| | | LASSO | 8.960 (0.212) | 3.00 | 45.25 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.096 (0.051) | 2.83 | 1.22 | 0.17 | 0.25 | 0.58 |
| | | TPC | 0.043 (0.031) | 2.74 | 0.21 | 0.26 | 0.70 | 0.04 |
| 500 | 0.3 | SCAD | 0.043 (0.024) | 3.00 | 7.05 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 11.172 (0.230) | 3.00 | 38.94 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.077 (0.043) | 3.00 | 0.83 | 0.00 | 0.35 | 0.65 |
| | | TPC | 0.030 (0.018) | 2.98 | 0.08 | 0.02 | 0.91 | 0.07 |
| 500 | 0.8 | SCAD | 0.042 (0.026) | 3.00 | 4.07 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 20.879 (0.187) | 3.00 | 20.86 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.049 (0.031) | 2.91 | 0.37 | 0.09 | 0.69 | 0.22 |
| | | TPC | 0.044 (0.032) | 2.73 | 0.26 | 0.25 | 0.75 | 0.00 |
| 2000 | 0 | SCAD | 0.051 (0.032) | 3.00 | 10.13 | 0.00 | 0.40 | 0.60 |
| | | LASSO | 9.140 (0.179) | 3.00 | 66.84 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.112 (0.056) | 2.90 | 1.73 | 0.10 | 0.07 | 0.83 |
| | | TPC | 0.050 (0.037) | 2.83 | 0.35 | 0.17 | 0.67 | 0.16 |
| 2000 | 0.3 | SCAD | 0.045 (0.028) | 3.00 | 8.58 | 0.00 | 0.33 | 0.67 |
| | | LASSO | 11.345 (0.189) | 3.00 | 61.97 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.105 (0.044) | 2.99 | 1.36 | 0.01 | 0.17 | 0.82 |
| | | TPC | 0.039 (0.026) | 2.97 | 0.18 | 0.03 | 0.83 | 0.14 |
| 2000 | 0.8 | SCAD | 0.049 (0.030) | 3.00 | 7.33 | 0.00 | 0.28 | 0.72 |
| | | LASSO | 20.960 (0.136) | 3.00 | 37.81 | 0.00 | 0.00 | 1.00 |
| | | PC-simple | 0.077 (0.046) | 2.96 | 0.59 | 0.04 | 0.48 | 0.48 |
| | | TPC | 0.045 (0.034) | 2.82 | 0.24 | 0.17 | 0.81 | 0.02 |
MedME (Devi) denotes the median model error; the numbers in the parentheses are median absolute deviations over the 1000 simulations.
The results for the normal distribution are presented in Table S2 in the supplementary material. Recall that, in theory, the TPC is equivalent to the PC-simple algorithm in this case; thus their performances are quite similar. The median model errors are comparable for all the methods except LASSO, which yields much larger models than necessary. Overall, both LASSO and SCAD tend to select larger models and over-fit, compared with the partial-correlation-based methods for variable selection.
The elliptical distribution assumption is used only for deriving the asymptotic distribution of the partial correlations in Theorem 1, which motivates the test statistic for the series of hypotheses H0: ρ(y, xj |x𝒮) = 0. However, the model selection consistency of the TPC does not require the response and the predictors to be elliptically distributed. Therefore, the performance of the TPC is expected to be relatively robust to the elliptical assumption. To illustrate this, we consider a simulation example involving discrete predictors. Specifically, the x's with even subscripts are generated in the same fashion as before, while the x's with odd subscripts take the discrete values 0, 1 and 2 with probabilities 0.25, 0.5 and 0.25, respectively. To save space, the results are reported in Table S3 in the supplemental material. We can see from Table S3 that the TPC outperforms the other methods, especially in terms of the correct-fit rate.
3.2 An application
In this section, we demonstrate the proposed methodology with an empirical analysis of a microarray data set that was studied by Scheetz et al. (2006) and Huang et al. (2008). This data set contains 120 twelve-week-old male rats, and for each rat, 3000 sufficiently expressed gene probes with enough variation are studied. The purpose of the analysis is to identify the probes that are most relevant to the response, the expression level of probe TRIM32, which was recently proved to cause Bardet-Biedl syndrome (Chiang et al., 2006).
We apply SCAD, LASSO, the PC-simple algorithm and the TPC to this data set with one outlier deleted. Table 3.2 provides information on the gene probes chosen by the different methods. As LASSO yields a much larger model, leading to difficulty of interpretation, to save space we only report the six probes selected by SCAD, the PC-simple algorithm and the TPC, and indicate whether they are included in the 20 probes chosen by LASSO. We calculate the adjusted R2 and the prediction error (PE), computed by the leave-one-out cross-validation (LOOCV) method, for each model. From Table 3.2, we can see that the models selected by SCAD, LASSO and the TPC have very similar performance in terms of adjusted R2 and prediction error. The TPC improves on the PC-simple algorithm by including two probes, x5 and x6; these two probes lead to about a 9% reduction in prediction error from the model selected by the PC-simple algorithm to the model selected by the TPC. Note that probes 1389584_at (x1) and 1383996_at (x2) are selected by all four approaches and were also identified by Huang et al. (2008). Therefore, they are worth more comprehensive biological research. The rest of the results from the TPC are more consistent with Huang et al. (2008) than those of the other methods.
Table 3.2.
Results for Real Data Example
| Selected Probes | SCAD | LASSO | PC-simple | TPC | M6 Est (SE) | M4 Est (SE) |
|---|---|---|---|---|---|---|
| Intercept | Yes | Yes | Yes | Yes | .0147 (.0465) | .0164 (.0467) |
| 1389584_at (x1) | Yes | Yes | Yes | Yes | .3669 (.0823)*** | .4098 (.0710)*** |
| 1383996_at (x2) | Yes | Yes | Yes | Yes | .1400 (.0595)* | .1583 (.0590)** |
| 1382452_at (x3) | Yes | Yes | / | / | .2450 (.0606)*** | .2279 (.0547)*** |
| 1370429_at (x4) | / | / | Yes | Yes | .0464 (.0815) | |
| 1383110_at (x5) | / | Yes | / | Yes | .1543 (.0840) | |
| 1374106_at (x6) | / | Yes | / | Yes | .2203 (.0727)** | .2580 (.0688)*** |
| 15 more probes | / | Yes | / | / | | |
| Size | 4 | 21 | 4 | 6 | 7 | 5 |
| Adjusted R2 (%) | 69.37 | 69.55 | 66.64 | 69.10 | 74.60 | 74.38 |
| PE | 0.297 | 0.298 | 0.326 | 0.301 | 0.275 | 0.270 |
The 15 probes selected only by LASSO are omitted. “Yes” means the probe is selected by this method.
M6 stands for the linear model with six probes x1-x6; M4 for the model with four probes x1, x2, x3 and x6.
‘*’ stands for significant at level 0.05, ‘**’ for level 0.01, and ‘***’ for level 0.001.
We further conduct some exploratory analysis. We compare the model with the 20 probes selected by LASSO to the model with the six probes listed in Table 3.2 (denoted by M6 in the table) using the likelihood ratio test (LRT). The p-value of the corresponding LRT is 0.058, which implies that the model with the six probes fits the data well enough. The corresponding estimates and standard errors of the regression coefficients are listed in the second to last column of Table 3.2. The adjusted R2 and the prediction error calculated by the LOOCV method show much improvement over the models selected by the SCAD, LASSO, PC-simple and TPC methods; for example, the prediction error is reduced by about 10%. The coefficients of x4 and x5 do not appear to be significant at level 0.05. We refit the data to the model with only the four probes x1, x2, x3 and x6, and their estimates and standard errors are reported in the last column of Table 3.2. The adjusted R2 and prediction error of this model are very close to those of the model with six probes. The empirical analysis of this example implies that the two comparable schemes for variable selection (i.e., regularization methods such as SCAD and LASSO, and partial correlation based methods such as the PC-simple algorithm and the TPC) can be used to complement each other. For example, the regularization methods would miss probe x6, while the TPC would miss probe x3. Scientists may raise their confidence in the selected probes x1 and x2 since they are chosen by both techniques.
4. Conclusion
In this paper, we proposed a variable selection procedure via thresholded partial correlation (TPC) and established its model selection consistency and sure screening property in the presence of ultrahigh dimensional predictors. Our simulations and the empirical analysis of a real data example illustrate that the TPC may serve as a useful alternative to the commonly used regularization methods for high or ultrahigh dimensional linear regression models.
Supplementary Material
Acknowledgments
Li’s research was supported by National Institute on Drug Abuse (NIDA) grants P50-DA10075 and P50 DA036107. Liu’s research was supported by National Natural Science Foundation of China (NNSFC) grant 11401497 and the Fundamental Research Funds for the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry. All authors equally contribute to this paper, and the authors are listed in the alphabetic order.
Appendix
Appendix A: Proof of Theorem 1
Let ui = (xiT, yi)T, i = 1, · · ·, n, and denote q = p + 1. Then u1, · · ·, un are an independent and identically distributed random sample from ECq(μ, Σ, ϕ). To study the asymptotic behavior of the partial correlation under elliptical distributions, we consider the following general partitions of ui, μ and Σ:

ui = (u1iT, u2iT)T,  μ = (μ1T, μ2T)T,  Σ = (Σ11 Σ12; Σ21 Σ22),

where u1i and μ1 are q1-dimensional, u2i and μ2 are q2-dimensional, Σ11 is a q1 × q1 matrix, and Σ22 is a q2 × q2 matrix. Here q = q1 + q2. Let U = (u1, · · ·, un)T and denote

A = n−1(U − 1nμT)T(U − 1nμT),

where 1n is an n × 1 vector with all elements being 1. Partition A in the same way as Σ, and let akl.2 be the (k, l)-element of A11.2 = A11 − A12A22−1A21. Then the sample partial correlation between uik and uil given u2i, ρ̂(uik, uil |u2i), equals akl.2/√(akk.2 all.2).
To derive the asymptotic distribution of A11.2, let

C = (Iq1 −Σ12Σ22−1; 0 Iq2),

where I stands for the identity matrix, and let vi = C(ui − μ). Using Theorem 2.16 of Fang, Kotz and Ng (1990), it follows that

vi ~ ECq(0, diag(Σ11.2, Σ22), ϕ),    (A.1)

where Σ11.2 = Σ11 − Σ12Σ22−1Σ21. Let V = (v1, · · ·, vn)T. By the definition of vi, V = (U − 1nμT)CT. Define

B = n−1VTV = CACT.

Partition B in the same way as A. Then B11 = A11 − Σ12Σ22−1A21 − A12Σ22−1Σ21 + Σ12Σ22−1A22Σ22−1Σ21, B12 = A12 − Σ12Σ22−1A22, B21 = A21 − A22Σ22−1Σ21 and B22 = A22. By direct calculation, it follows that B11.2 = B11 − B12B22−1B21 = A11.2. This enables us to derive the asymptotic distribution of A11.2 through B11.2.
Define W11 = √n(B11 − Σ11.2), W12 = √n B12, W21 = √n B21 and W22 = √n(B22 − Σ22). The assumption that all fourth moments of ui are finite implies that all fourth moments of vi are finite. Thus, it follows by the central limit theorem that Wkl, for k = 1, 2 and l = 1, 2, has an asymptotic normal distribution with mean zero and a finite covariance matrix. Then

B11.2 = B11 − B12B22−1B21 = B11 − n−1W12B22−1W21.

By the assumptions of Theorem 1, the largest eigenvalue of Σ22−1 is positive and finite. Therefore it follows that

√n(B11.2 − B11) = −n−1/2W12B22−1W21 = op(1).

This implies that A11.2 and B11 have the same asymptotic normal distribution, and hence akl.2/√(akk.2 all.2) and bkl/√(bkk bll) have the same asymptotic distribution, where bkl is the (k, l)-element of B11. Further notice that v1i ~ ECq1(0, Σ11.2, ϕ) by (A.1), where v1i consists of the first q1 elements of vi. Therefore, the asymptotic normal distribution of the sample correlation coefficient ρ̂(vik, vil), which equals bkl/√(bkk bll), can be derived from (2.2) with Σ replaced by Σ11.2. Thus, Theorem 1 holds by setting u1i = (xij, yi)T and u2i = xi𝒮.
Appendix B: Proof of Theorems 2 and 3
In this section, we introduce the following lemmas which are used repeatedly in the proofs of the theorems.
Lemma 1
(Hoeffding's Inequality) Assume the independent random variables {Xi : i = 1, · · ·, n} satisfy P(Xi ∈ [ai, bi]) = 1 for some ai and bi, i = 1, · · ·, n. Then, for any ε > 0, the sample mean X̄ satisfies

P{|X̄ − E(X̄)| ≥ ε} ≤ 2 exp{−2n²ε²/Σi=1n(bi − ai)²}.    (B.1)
Lemma 2
Suppose X is a random variable with E(ea|X|) < ∞ for some a > 0. Then there exist positive constants b and c such that for any M > 0,

P(|X| > M) ≤ b exp(−cM).    (B.2)
Lemma 3
Suppose γ̂1 and γ̂2 are estimates of the finite parameters γ1 and γ2, respectively, based on a sample of size n. Assume there exist positive constants b1, b2 and ν such that for any 0 < ε < 1,

P(|γ̂k − γk| > ε) ≤ bk exp(−nνε²/bk), k = 1, 2.    (B.3)

Then

P{|(γ̂1 + γ̂2) − (γ1 + γ2)| > ε} ≤ b3 exp(−nνε²/b3) and P(|γ̂1γ̂2 − γ1γ2| > ε) ≤ b4 exp(−nνε²/b4),

where b3 = b1 + b2, and b4 = 2b1 + b2. If γ2 ≠ 0,

P(|γ̂1/γ̂2 − γ1/γ2| > ε) ≤ b5 exp(−nνε²/b5),

where b5 = b1 + 3b2. If we further assume γ2 > 0, then

P(|√γ̂2 − √γ2| > ε) ≤ b6 exp(−nνε²/b6),

where b6 = 2b2.
The proof of Lemma 3 can be found in the supplemental material of this work.
For ease of notation, denote σ̂jy = n−1 Σi=1n xij yi − x̄j ȳ, σ̂jj = n−1 Σi=1n xij² − x̄j², and σ̂yy = n−1 Σi=1n yi² − ȳ². Then

ρ̂(y, xj) = σ̂jy/√(σ̂jj σ̂yy).    (B.4)
The proof of Theorem 2
We divide the proof into three parts.
Step 1
Study the consistency of ρ̂(y, xj). First consider x̄j. For any 0 < ε < 1 and any M > 0,

P{|x̄j − E(xj)| > ε} ≤ 2 exp(−nε²/C1M²) + C2 n exp(−M/C2),    (B.5)

for some positive constants C1 and C2. The first term above is obtained from Hoeffding's inequality in Lemma 1, applied after truncating the xij at ±M, and the second term follows from condition (D2) and Lemma 2. Take M = O(n1/5); then for large n, (B.5) simplifies to P(|x̄j − E(xj)| > ε) ≤ C3 exp(−nν/C3), where 0 < ν < 1/5 and C3 > 0. In the same fashion, there exist positive constants C4, C5, C6 and C7 such that for large n,

P{|ȳ − E(y)| > ε} ≤ C4 exp(−nν/C4),  P{|n−1 Σi=1n xij yi − E(xj y)| > ε} ≤ C5 exp(−nν/C5),

P{|n−1 Σi=1n xij² − E(xj²)| > ε} ≤ C6 exp(−nν/C6),  P{|n−1 Σi=1n yi² − E(y²)| > ε} ≤ C7 exp(−nν/C7).
Therefore, by (B.4) and Lemma 3,

P{|ρ̂(y, xj) − ρ(y, xj)| > ε} ≤ C8 exp(−nν/C8),

where the positive constant C8 is determined by C3, …, C7.
Note that

ρ(y, xj |x𝒮) = [ρ(y, xj |x𝒮\{k}) − ρ(y, xk |x𝒮\{k}) ρ(xj, xk |x𝒮\{k})] / √[{1 − ρ²(y, xk |x𝒮\{k})}{1 − ρ²(xj, xk |x𝒮\{k})}]    (B.6)

for any k ∈ 𝒮.
Under the boundedness condition (D5), applying Lemma 3 to the sample version of (B.6) and to the Z-transformation (2.4) recursively, we conclude that for some C9 > 0 and C10 > 0,

P{|ρ̂(y, xj |x𝒮) − ρ(y, xj |x𝒮)| > ε} ≤ C9 exp(−nν/C9) and P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| > ε} ≤ C10 exp(−nν/C10).
Furthermore, by the same argument, the sample kurtosis is consistent for the population version at the same rate; that is, there exists C11 > 0 such that P{|κ̂ − κ| > ε} ≤ C11 exp(−nν/C11), and hence for some C12 > 0,

P{|(1 + κ̂)−1/2 Ẑ(y, xj |x𝒮) − (1 + κ)−1/2 Z(y, xj |x𝒮)| > ε} ≤ C12 exp(−nν/C12).
Step 2
Compute P(Ej|𝒮) = P{an error occurs when testing ρ(y, xj |x𝒮) = 0}. Denote Ej|𝒮 = EIj|𝒮 ∪ EIIj|𝒮, where EIj|𝒮 is the event that a type I error occurs and EIIj|𝒮 is the event that a type II error occurs. Then, by choosing αn = 2{1 − Φ(n1/2cn/2)}, the cut-off on the Z scale is of the order cn/2, and

P(EIj|𝒮) ≤ P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| ≥ cn/2} ≤ C12 exp(−nν/C12).
Note that |Z(u)| = |½ log{(1 + u)/(1 − u)}| ≥ |u| for all u ∈ (−1, 1); then |Z(y, xj |x𝒮)| ≥ |ρn(y, xj |x𝒮)| ≥ cn under condition (D4). Thus,

P(EIIj|𝒮) ≤ P{|Ẑ(y, xj |x𝒮) − Z(y, xj |x𝒮)| ≥ cn/2} ≤ C12 exp(−nν/C12).

Therefore, P(Ej|𝒮) ≤ P(EIj|𝒮) + P(EIIj|𝒮) ≤ 2C12 exp(−nν/C12).
Step 3
Study P{𝒜̂n(αn) = 𝒜}. Now consider all j = 1, · · ·, p and all 𝒮 ⊆ {j}c subject to |𝒮| ≤ mn, where mn ≤ m̂reach. Define Ej = ∪{Ej|𝒮 : 𝒮 ⊆ {j}c, |𝒮| ≤ mn}, j = 1, · · ·, p. Then

P{𝒜̂n(αn) ≠ 𝒜} ≤ P(∪j=1p Ej) ≤ p · p^mn max{P(Ej|𝒮) : 1 ≤ j ≤ p, 𝒮 ⊆ {j}c, |𝒮| ≤ mn} ≤ O{p^(d0+1) exp(−nν/C12)}.    (B.7)

The second inequality holds since the number of possible choices of j is p and there are at most p^mn possible choices for 𝒮. The last inequality in (B.7) is obtained because P(m̂reach = mreach) → 1 and mreach ≤ d0, by the same technique as Lemma 3 in Bühlmann, Kalisch and Maathuis (2010); thus for large n, mn ≤ m̂reach ≤ d0.
Moreover, recall that ν can be chosen arbitrarily in (0, 1/5). Therefore, if d0 is fixed, for p = o(exp(nξ)), 0 < ξ < 1/5, (B.7) is simplified as P{𝒜̂n(αn) ≠ 𝒜} ≤ O{exp(−nν/C12)}, provided ξ < ν < 1/5. If d0 = O(nb), 0 < b < 1/5, for p = o(exp(nξ)), 0 < ξ < 1/5 − b, (B.7) becomes P{𝒜̂n(αn) ≠ 𝒜} ≤ O{exp(−nν/C12)}, provided that ξ + b < ν < 1/5. This completes the proof of Theorem 2 with C = C12.
Proof of Theorem 3
We only need to consider the first step of the thresholded partial correlation approach, where, by Step 1 of the proof of Theorem 2, we have

P{|ρ̂(y, xj) − ρ(y, xj)| > ε} ≤ C13 exp(−nν/C13)

for some C13 > 0. Define Ej = {an error occurs when testing ρ(y, xj) = 0}; then, using the same technique as in the proof of Theorem 2,

P(Ej) ≤ C13 exp(−nν/C13).
Then

P{𝒜 ⊄ 𝒜̂[1](αn)} ≤ P(∪j=1p Ej) ≤ p C13 exp(−nν/C13)

for any 0 < ν < 1/5. Therefore, for p = o(exp(nξ)) and ξ < ν < 1/5, P{𝒜 ⊆ 𝒜̂[1](αn)} ≥ 1 − O{exp(−nν/C*)}. This completes the proof of Theorem 3 with C* = C13.
References
- Arellano-Valle RB, del Pino F, Iglesias P. Bayesian inference in spherical linear models: robustness and conjugate analysis. Journal of Multivariate Analysis. 2006;97:179–197.
- Bühlmann P, Kalisch M, Maathuis M. Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm. Biometrika. 2010;97:261–278.
- Chen Y, Wiesel A, Hero AO. Robust shrinkage estimation of high-dimensional covariance matrices. IEEE Transactions on Signal Processing. 2011;59:4097–4107.
- Chiang AP, Beck JS, Yen H-J, Tayeh MK, Scheetz TE, Swiderski R, Nishimura D, Braun TA, Kim K-Y, Huang J, Elbedour K, Carmi R, Slusarski DC, Casavant TL, Stone EM, Sheffield VC. Homozygosity mapping with SNP arrays identifies a novel gene for Bardet-Biedl syndrome (BBS10). Proceedings of the National Academy of Sciences. 2006;103:6287–6292. doi: 10.1073/pnas.0600158103.
- Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
- Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
- Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20:101–148.
- Fang KT, Kotz S, Ng KW. Symmetric Multivariate and Related Distributions. New York, NY: Chapman and Hall; 1990.
- Huang J, Ma SG, Zhang CH. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
- Liang H, Li R. Variable selection for partially linear models with measurement errors. Journal of the American Statistical Association. 2009;104:234–248. doi: 10.1198/jasa.2009.0127.
- McNeil AJ, Frey R, Embrechts P. Quantitative Risk Management: Concepts, Techniques and Tools. Princeton, NJ: Princeton University Press; 2005.
- Mottonen J, Oja H, Tienari J. On the efficiency of multivariate spatial sign and rank tests. Annals of Statistics. 1997;25:542–552.
- Muirhead RJ. Aspects of Multivariate Statistical Theory. New York: Wiley; 1982.
- Oja H, Randles RH. Multivariate nonparametric tests. Statistical Science. 2004;19:598–605.
- Osiewalski J. A note on Bayesian inference in a regression model with elliptical errors. Journal of Econometrics. 1991;48:183–193.
- Osiewalski J, Steel MFJ. Robust Bayesian inference in elliptical regression models. Journal of Econometrics. 1993;57:345–363.
- Scheetz TE, Kim K-YA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
- Schmidt R. Tail dependence for elliptically contoured distributions. Mathematical Methods of Operations Research. 2002;55:301–327.
- Soloveychik I, Wiesel A. Performance analysis of Tyler's covariance estimator. IEEE Transactions on Signal Processing. 2015;63:418–426.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
- Vidal I, Arellano-Valle RB. Bayesian inference for dependent elliptical measurement error models. Journal of Multivariate Analysis. 2010;101:2587–2597.
- Wang L, Peng B, Li R. A high-dimensional nonparametric multivariate test for mean vector. Journal of the American Statistical Association. 2015. doi: 10.1080/01621459.2014.988215. Accepted.
- Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.