CORRELATION PURSUIT: FORWARD STEPWISE VARIABLE SELECTION FOR INDEX MODELS

Wenxuan Zhong; Tingting Zhang; Yu Zhu; Jun S Liu

doi:10.1111/j.1467-9868.2011.01026.x

. Author manuscript; available in PMC: 2013 Nov 1.

Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2012 Apr 12;74(5):849–870. doi: 10.1111/j.1467-9868.2011.01026.x

CORRELATION PURSUIT: FORWARD STEPWISE VARIABLE SELECTION FOR INDEX MODELS

Wenxuan Zhong ^1,^*, Tingting Zhang ², Yu Zhu ³, Jun S Liu ^4,^*

PMCID: PMC3519449 NIHMSID: NIHMS346711 PMID: 23243388

Abstract

In this article, a stepwise procedure, correlation pursuit (COP), is developed for variable selection under the sufficient dimension reduction framework, in which the response variable Y is influenced by the predictors X₁, X₂, …, X_p through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, COP does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. The COP procedure selects variables that attain the maximum correlation between the transformed response and the linear combination of the variables. Various asymptotic properties of the COP procedure are established, and in particular, its variable selection performance under diverging number of predictors and sample size has been investigated. The excellent empirical performance of the COP procedure in comparison with existing methods are demonstrated by both extensive simulation studies and a real example in functional genomics.

Keywords: variable selection, projection pursuit regression, sliced inverse regression, stepwise regression, dimension reduction

1. Introduction

Advances in science and technology in the past few decades have led to an explosive growth of high dimensional data across a variety of areas such as genetics, molecular biology, cognitive sciences, environmental sciences, astrophysics, finance, internet commerce, etc. Compared to their dimensionalities, a large amount of data sets generated from these areas have relatively small sample sizes. Variable (or feature) selection and dimension reduction are more than often key steps in analyzing these data. Much progress has been made in the past few decades on variable selection for linear models (see Shao 1998 and Fan and Lv 2010 for a review). In recent years, shrinkage based procedures for simultaneously estimating regression coefficients and selecting predictors have been particularly attractive to researchers, and many promising algorithms such as LASSO (Tibshirani 1996, Zou 2006, Friedman 2007), least angle regression (LARS) (Efron et al. 2004), SCAD (Fan and Li 2001), etc., have been invented.

Let Y ∈ ℝ be a univariate response variable and X = (X₁, X₂, ⋯, X_p)′ ∈ ℝ^p a vector of p continuous predictor variables. Throughout this article, we consider the following sufficient dimension reduction (SDR) model framework as pioneered by Li (1991) and Cook (1994). Let β₁, β₂, ⋯, β_K be p-dimensional vectors with β_i = (β_1i, β_2i, ⋯, β_pi)′ for 1 ≤ i ≤ K. The SDR model assumes that Y and X are mutually independent conditional on $β_{1}^{'} X, β_{2}^{'} X, \dots, β_{K}^{'} X$ , i.e.,

Y ⊥ X | B' X,

(1)

where “⊥″ means “independent of” and B = (β₁, β₂, ⋯, β_K). Expression (1) implies that all the information X contains about Y is contained in the K projections $β_{1}^{'} X, \dots, β_{K}^{'} X$ . A predictor variable X_j (1 ≤ j ≤ p) is said to be relevant if there exists at least one i (1 ≤ i ≤ K) such that β_ji ≠ 0. Let L be the number of relevant predictor variables. When there are a large number of predictors (i.e. p is large), it is usually safe to impose the sparsity assumption, which states that only a small subset of the predictors influence Y and the others are irrelevant. In the SDR model, this assumption means that both K and L are small relative to p.

In his seminal paper on dimension reduction, Li (1991) proposed a seemingly different model of the form:

Y = f (β_{1}^{'} X, β_{2}^{'} X, \dots, β_{K}^{'} X, ε),

(2)

where f is an unknown (K + 1)-variate link function and ε is a stochastic error independent of X. It has been shown that the two models (1) and (2) are in fact equivalent (Zeng and Zhu 2010). We henceforth always refer to β₁, β₂, ⋯, β_K as the SDR directions and the space spanned by these directions as a SDR subspace. In general, SDR subspaces are not unique. To resolve this ambiguity, Cook (1994) introduced the concept of central subspace, which is the intersection of all possible SDR subspaces and is a SDR subspace itself, and showed that the central space is well-defined and unique under some general conditions. We denote the central subspace by 𝒮(B) and assume its existence throughout this article.

Various methods have been developed for estimating β₁, ⋯, β_K in the literature on SDR. One particular family of methods utilizes inverse regression, which is to regress X against Y. The sliced inversion regression (SIR) method proposed by Li (1991) is the forerunner of this family of methods. Recognizing that estimation of the SDR directions does not automatically lead to variable selection, Cook (2004) derived various χ² tests for assessing the contribution of predictor variables to the SDR directions. Based on these tests, Li et al. (2005) proposed a backward subset selection method for selecting significant predictors. Following the recent trend of using the L₁ or L₂ penalty for variable selection, Zhong et al. (2005) proposed to regularize the sample covariance matrix of the predictor variables in SIR and developed a procedure called RSIR for variable selection. Li (2007) proposed Sparse SIR (SSIR) to obtain shrinkage estimates of the SDR directions. Bondell and Li (2009) further adopted the nonnegative garrote method for estimating the SDR directions and showed that the resulting method is consistent in variable selection.

The majority of the aforementioned methods take a two-step approach to variable selection under the SDR model. The first step is to perform dimension reduction, that is, to estimate the SDR directions; and the second step is to select the relevant variables using statistical testing or shrinkage methods. Because these methods need to estimate the covariance and conditional covariance matrices of X, both of which are of dimensions p × p, the effectiveness and robustness of the two-step approach are questionable when p is large relative to n. Zhu et al. (2006) have shown that the estimation accuracy of SDR directions deteriorates as p increases. In other words, the more irrelevant variables there are, the more likely a method fails to estimate the SDR directions accurately, and the less likely the method identifies the true relevant predictor variables.

In this article, we propose correlation pursuit (COP), a stepwise procedure for simultaneous dimension reduction and variable selection under the SDR model. Similar to projection pursuit (Friedman and Tukey 1974, Huber 1985), COP defines a projection function to measure the correlation between the transformed response and the projections of X and pursues a subset of explanatory variables that maximize the projection function. It starts with an randomly selected subset and iterates between finding an explanatory variable (predictor) that significantly improves the current projection function to add to the subset and finding an insignificant predictor to remove from the subset. During each iteration step, COP needs only to consider the predictors currently in the subset and one more predictor outside the subset. Therefore, COP can avoid the estimation and inversion of p × p covariance and conditional covariance matrices of X and mitigate the curse of dimensionality. Furthermore, COP performs dimension reduction and variable selection simultaneously. Therefore, dimension reduction and variable selection can be mutually enhanced. Our theoretical investigations as well as simulation studies show that COP is a promising tool for dimension reduction and variable selection in high dimensional data analysis.

The rest of the article is organized as follows. In Section 2, we give a brief introduction to SIR, following a correlation interpretation of SIR provided by Chen and Li (1998). This interpretation was also used in Fung et al. (2002) and Zhou and He (2008) for dimension reduction via canonical correlation. In the same section, we describe the COP procedure and derive various test statistics used by the procedure. The asymptotic behavior of the COP procedure is discussed in Section 3. Several implementation issues of the procedure are discussed in Section 4. Simulation and real data examples are reported in Sections 5 and 6, respectively. Additional remarks in Section 7 conclude the article. An abbreviated version of the proofs of the theorems is provided in the Appendix..

2. Correlation Pursuit for Variable Selection

2.1. Profile Correlation and SIR

Let η be an arbitrary direction in ℝ^p. We define the profile correlation between Y and η′X, denoted by P(η), as:

P (η) = max_{T} corr (T (Y), η' X),

(3)

where the maximization is taken over all possible transformations of Y including non-monotone ones. The profile correlation P(η) reflects the largest possible correlation between a transformed response T(Y) and the projection η′X. Let η₁ be the direction that maximizes P(η) subject to η′Ση = 1, that is, η₁ = argmax_η′Ση=1 P(η). We refer to η₁ as the first principal direction for the profile correlation between Y and X and call P(η₁) the first profile correlation. Direction η₁, or its projection $η_{1}^{'} X$ , may not entirely characterize the dependency between Y and X. Using P(η) as the projection function again, we can seek for a second direction, denoted by η₂, which is uncorrelated with $η_{1}^{'} X, i . e . η_{2}^{'} Σ η_{1} = 0$ , and maximizes P(η), that is, η₂ = argmax_{η:η′Ση₁=0} P(η). We refer to η₂ as the second principal direction and P(η₂) as the second profile correlation. This procedure can be continued until no more directions can be found that are orthogonal to the obtained directions and have a nonzero profile correlation with Y. Suppose K̃ principal directions exist between Y and X, which are η₁, η₂, ⋯, and η_K̃, with the corresponding profile correlations P(η₁) ≥ P(η₂) ≥ ⋯ P(η_K̃) > 0. We need to impose the following condition to establish the connection between the principal directions and the SDR directions under the SDR model.

Condition 1 (Linearity Condition): For any η in ℝ^p, E(η′X|B′X) is linear in B′X, where B is as defined in equation (1).

Proposition 2.1. Under the SDR model and the Linearity Condition, the principal directions η₁, η₂, …, η_K̃ are in the central space 𝒮(B).

To make this article self-sufficient, we have included the proof of Proposition 2.1 in the Appendix. Based on the proposition, the principal directions are indeed SDR directions. In general, K̃ < K. When the link function f is symmetric along a direction, using correlation alone may fail to recover this direction. For example, if $Y = X_{1}^{2} + ε$ , the profile correlation between Y and X₁ will always be zero. To exclude this possibility, we follow the convention in the SDR literature to impose the following condition.

Condition 2 (Coverage Condition): The number of principal directions of profile correlation is equal to the dimensionality of the central subspace, that is, K̃ = K.

Under both the linearity and coverage conditions, the principal directions η₁, η₂, ⋯, η_K form a special basis of the central subspace 𝒮(B), that is, 𝒮(B) = span(η₁, η₂, …., η_K). This basis is uniquely defined and is the estimation target of SIR. In the rest of the article, for ease of discussion, we use β₁, β₂, …., β_k and η₁, η₂, …, η_K, interchangeably.

Chen and Li (1998) showed that, at the population level, there exits an explicit solution for the principal directions. In the proof of their Theorem 3.1, Chen and Li (1998) has derived that

P^{2} (η) = \frac{η' var [E (X | Y)] η}{η' Σ η} \equiv \frac{η' M η}{η' Σ η},

(4)

where M ≜ var[E(X|Y)] is the covariance matrix of the expectation of X given Y. Furthermore, the principal directions of profile correlation are the solutions of the following eigenvalue decomposition problem:

M υ_{i} = λ_{i} Σ υ_{i}, υ_{i}^{'} Σ υ_{i} = 1, for i = 1, 2, \dots, K;

(5)

λ_{1} \geq λ_{2} \geq \dots \geq λ_{K} > 0 .

(6)

The principal directions η₁, η₂, …, and η_K are the first K eigenvectors of Σ⁻¹ M, and their corresponding eigenvalues are exactly the squared profile correlations, that is, P²(η_i) = λ_i for i = 1, 2, …, K.

Given independent observations {(x_i, y_i)}_{i=1, ⋯, n} of (X, Y), where x_i = (x_i1, ⋯, x_ip)′, Σ can be estimated by the sample covariance matrix,

Σ̂ = \frac{\sum_{i = 1}^{n} (x_{i} - x̄) (x_{i} - x̄)'}{n - 1},

(7)

where x̄ is the sample mean of {x_i}. Li (1991) proposed the following SIR procedure to estimate M. First, the range of ${y_{i}}_{i = 1}^{n}$ is divided into H disjoint intervals, denoted as S₁, …, S_H. For h = 1, …, H, the mean vector ${x̄}_{h} = n_{h}^{- 1} \sum_{y_{i} \in S_{h}} x_{i}$ is calculated, where n_h is the number of y_i's in S_h. Then, M is estimated by

M̂ = \frac{\sum_{h = 1}^{H} n_{h} ({x̄}_{h} - x̄) ({x̄}_{h} - x̄)'}{n},

(8)

and the matrix Σ⁻¹ M is estimated by Σ̂⁻¹ M̂. The first K eigenvectors of Σ̂⁻¹ M̂, denoted by η̂₁, η̂₂, …, η̂_K, are used to estimate the first K eigenvectors of Σ⁻¹ M or, equivalently, the principal directions η₁, η₂, …, η_K, respectively. The first K eigenvalues of Σ̂⁻¹ M̂, denoted by λ̂₁, λ̂₂, …, λ̂_K, are used to estimate the eigenvalues of Σ⁻¹ M or, equivalently, the squared profile correlations λ₁, λ₂, …, λ_K, respectively.

2.2. Correlation Pursuit

The SIR method needs to estimate the two p × p covariance matrices Σ and M, and to obtain the eigenvalue decomposition of Σ̂⁻¹ M̂. When a large number of irrelevant variables are present and the sample size n is relatively small, Σ̂ and M̂ become unstable, which leads to very inaccurate estimates of principal directions η̂₁, η̂₂, …, η̂_K (Zhu et al. 2006). As a consequence, those shrinkage-based variable selection methods that rely on η̂₁, η̂₂, …, η̂_K often perform poorly for the SDR model when p is large. We here propose a stepwise SIR-based procedure for simultaneous dimension reduction (i.e., estimating the principal directions) and variable selection (i.e., identifying true predictors). Our procedure starts with a collection of randomly selected predictors and iterates between an addition step, which selects and adds a predictor to the collection, and a deletion step, which selects and deletes a predictor from the collection. The procedure terminates when no new addition or deletion occurs.

The Addition Step. Let 𝒜 denote the collection of the indices of the selected predictors and X_𝒜 the collection of the selected variables. Applying SIR to the data involving only the predictors in X_𝒜, we obtain the estimated squared profile correlations ${λ̂}_{1}^{𝒜}, {λ̂}_{2}^{𝒜}, \dots, {λ̂}_{K}^{𝒜}$ . Superscript 𝒜 indicates that the estimated squared profile correlations depend on the current subset of selected predictors. Let X_t be an arbitrary predictor outside 𝒜 and 𝒜 + t = 𝒜⋃{t}. Applying SIR to the data involving the predictors in 𝒜 + t, we obtain the estimated squared profile correlations ${λ̂}_{1}^{𝒜 + t}, {λ̂}_{2}^{𝒜 + t}, \dots, {λ̂}_{K}^{𝒜 + t}$ . Because 𝒜 ⊂ 𝒜 + t, it is easy to see that ${λ̂}_{1}^{𝒜} \leq {λ̂}_{1}^{𝒜 + t}$ . The difference ${λ̂}_{1}^{𝒜 + t} - {λ̂}_{1}^{𝒜}$ reflects the amount of improvement in the first profile correlation due to the incorporation of X_t. We standardize this difference and use the the resulting test statistic

{COP}_{1}^{𝒜 + t} = \frac{n ({λ̂}_{1}^{𝒜 + t} - {λ̂}_{1}^{𝒜})}{1 - {λ̂}_{1}^{𝒜 + t}},

(9)

to assess the significance of adding X_t to 𝒜 in improving the first profile correlation. Similarly, the contributions of adding X_t to the other profile correlations can be assessed by

{COP}_{i}^{𝒜 + t} = \frac{n ({λ̂}_{i}^{𝒜 + t} - {λ̂}_{i}^{𝒜})}{1 - {λ̂}_{i}^{𝒜 + t}},

(10)

for 2 ≤ i ≤ K. The overall contribution of adding X_t to the improvement in all the K profile correlations can be assessed by combining the statistics ${COP}_{i}^{𝒜 + t}$ into one single test statistic

{COP}_{1 : K}^{𝒜 + t} = \sum_{i = 1}^{K} {COP}_{i}^{𝒜 + t} .

(11)

We further define that

{\bar{COP}}_{1 : K}^{𝒜} = {max}_{t \in 𝒜^{c}} {COP}_{1 : K}^{𝒜 + t} .

(12)

Let X_t̄ be a predictor that attains ${\bar{COP}}_{1 : K}^{𝒜}$ , that is, ${\bar{COP}}_{1 : K}^{𝒜} = {COP}_{1 : K}^{𝒜 + t̄}$ , and let c_e be a pre-specified threshold (details about its choice are deferred to the next two sections). Then, if ${\bar{COP}}_{1 : K}^{𝒜} > c_{e}$ , we add t̄ to 𝒜; otherwise, we do not add any variable.

The deletion step. Let X_t be an arbitrary predictor in 𝒜 and define 𝒜 − t = 𝒜 − {t}. Let ${λ̂}_{1}^{𝒜 - t}, {λ̂}_{2}^{𝒜 - t}, \dots, {λ̂}_{K}^{𝒜 - t}$ be the estimated squared profile correlations based on the data involving the predictors in 𝒜 − t only. The impact of deleting X_t from 𝒜 on the ith squared profile correlation can be measured by

{COP}_{i}^{𝒜 - t} = \frac{n ({λ̂}_{i}^{𝒜} - {λ̂}_{i}^{𝒜 - t})}{1 - {λ̂}_{i}^{𝒜}},

(13)

for 1 ≤ i ≤ K. The overall impact of deleting X_t is measured by

{COP}_{1 : K}^{𝒜 - t} = \sum_{i = 1}^{K} {COP}_{i}^{𝒜 - t},

(14)

and the least impact from deleting one predictor from 𝒜 is then defined to be

{\underline{COP}}_{1 : K}^{𝒜} = {min}_{t \in 𝒜} {COP}_{1 : K}^{𝒜 - t} .

(15)

Let X_ṯ be a predictor that achieves ${\underline{COP}}_{1 : K}^{𝒜}$ , and let c_d be a pre-specified threshold for deletion. If ${\underline{COP}}_{1 : K}^{𝒜} < c_{d}$ , we delete X_ṯ from 𝒜; otherwise, no deletion happens.

The asymptotic distributions of the proposed statistics and the selection of the thresholds will be discussed in the next two sections. Because the described procedure aims to find predictors that can most significantly improve the profile correlations between Y and X, we call it the Correlation Pursuit procedure (COP). Below we summarize COP into a pseudo-code.

The COP Algorithm

Set the number of principal directions K and the threshold values c_e and c_d.
Randomly select K + 1 variables as the initial collection of selected variables 𝒜.
Iterate until no more addition or deletion of predictors can be performed:
- The addition step:
  - Find t̄ such that ${COP}_{1 : K}^{𝒜 + t̄} = {\bar{COP}}_{1 : K}^{𝒜}$ ;
  - If ${\bar{COP}}_{1 : K}^{𝒜} > c_{e}$ , add t̄ to 𝒜, that is, let 𝒜 = 𝒜 + t̄;
- The deletion step:
  - Find ṯ such that ${COP}_{1 : K}^{𝒜 - ṯ} = {\underline{COP}}_{1 : K}^{𝒜}$ ;
  - If ${\underline{COP}}_{1 : K}^{𝒜} < c_{d}$ , delete ṯ from 𝒜, that is, let 𝒜 = 𝒜 − ṯ;
Output 𝒜.

3. Theoretical Properties

3.1. Asymptotic distributions of test statistics in COP

Let us first consider an addition step. We assume that SIR uses a fixed slicing scheme relative to the number of observations n, that is, the slices S₁, S₂, ⋯, S_H are fixed (defined by the range of the response variable) but the number of observations in each slice goes to infinity. Let X_t be an arbitrary predictor in 𝒜^c. Under the null hypothesis H₀ that all the predictors in 𝒜^c are irrelevant, we have η_t1 = η_t2 = ⋯ = η_tK = 0. Recall that the statistics we propose to measure the contributions of X_t to the K profile correlations are $({COP}_{1}^{𝒜 + t}, {COP}_{2}^{𝒜 + t}, \dots, {COP}_{K}^{𝒜 + t})'$ , and to measure the overall contribution of X_t by ${COP}_{1 : K}^{𝒜 + t}$ . To establish the asymptotic distributions of these statistics, we need to impose a condition on the conditional expectation of X_t given X_𝒜.

Condition 3 (Regression Condition): E(X_t | X_𝒜) is linear in X_𝒜.

Theorem 3.1. Assume that Conditions 1 and 2 hold, Condition 3 holds for (X_𝒜, X_t) for any X_t ∈ X_𝒜^C, and the squared profile correlations λ₁, λ₂, …, λ_K are positive and different from each other. Then, for any given fixed slicing scheme, under the null hypothesis H₀ that all the predictors in 𝒜^c are irrelevant, we have that

({COP}_{1}^{𝒜 + t}, {COP}_{2}^{𝒜 + t}, \dots, {COP}_{K}^{𝒜 + t}) \to (Z_{1 t}^{2}, Z_{2 t}^{2}, \dots, Z_{Kt}^{2})

(16)

in distribution and

{COP}_{1 : K}^{𝒜 + t} \to \sum_{l = 1}^{K} Z_{lt}^{2}

(17)

in distribution as n → ∞. Here, (Z_1t, Z_2t, ⋯, Z_Kt) follows the multivariate normal distribution with mean zero and covariance matrix W_Kt. The explicit expression of W_Kt is given in the Appendix.

The asymptotic distributions in Theorem 3.1 can be much simplified if we impose the following condition on the variance of the conditional expectation of X_t given X_𝒜.

Condition 4 (Constant Variance Condition): E[(X_t − E(X_t|X_𝒜))²|X_𝒜] is a constant.

Corollary 3.1. Assume that Conditions 1 and 2 hold, Conditions 3 and 4 hold for (X_𝒜, X_t) for X_t ∈ X_𝒜^c, and the squared profile correlations λ₁, λ₂, …, λ_K are positive and different from each other. Then, for any given fixed slicing scheme, under the null hypothesis H₀ that all the predictors in 𝒜^c are irrelevant, we have that ${COP}_{1}^{𝒜 + t}, {COP}_{2}^{𝒜 + t}, \dots, {COP}_{K}^{𝒜 + t}$ are asymptotically independent and identically distributed as χ²(1), and ${COP}_{1 : K}^{𝒜 + t}$ is asymptotically χ²(K).

Theorems 3.1 and Corollary 3.1 characterize the asymptotic behaviors of the test statistics for an arbitrary X_t in 𝒜^c. In the COP procedure, however, the predictor that attains the maximum value of ${COP}_{1 : K}^{𝒜 + t}$ among t ∈ 𝒜^c, which is ${\bar{COP}}_{1 : K}^{𝒜}$ , is considered a candidate predictor to enter 𝒜. Our next theorem characterizes the joint asymptotic behavior of ${{COP}_{1 : K}^{𝒜 + t}}_{t \in 𝒜^{c}}$ as well as that of ${\bar{COP}}_{1 : K}^{𝒜}$ .

Note that the linearity, regression, and constant variance conditions together are more general than the normality assumption on X because they only need to hold for the basis of the central subspace (e.g., B or η₁, ⋯, η_K) and a given subset of predictors (e.g., 𝒜). If we require that the conditions hold for any projection and any given subset of the predictors, however, then it is equivalent to requiring that X follows a multivariate normal distribution. In order to understand the joint behavior of all the COP statistics, in what follows we impose the normality assumption on X.

Let $𝒜 = {t_{j}}_{j = 1}^{d} and 𝒜^{c} = {t_{j}}_{j = d + 1}^{p}$ denote the collection of currently selected predictors and its complement respectively. Let Σ_𝒜 = Cov(X_𝒜), Σ_𝒜^c = Cov(X_𝒜^c), Σ_𝒜𝒜^c = Cov(X_𝒜, X_𝒜^c), and ${Σ̃}_{𝒜^{c}} = Σ_{𝒜^{c}} - Σ_{𝒜^{c} 𝒜} Σ_{𝒜}^{- 1} Σ_{𝒜 𝒜^{c}}$ . Note that $Σ_{𝒜^{c} 𝒜} = Σ_{𝒜 𝒜^{c}}^{'}$ . Let ã = (ã₁, ã₂, …, ã_p−d)′ be the vector of the diagonal elements of Σ̃_𝒜^c. Define D_𝒜^c = diag(ã₁, ã₂, …, ã_p−d), and define $U_{𝒜^{c}} = D_{𝒜^{c}}^{- 1 / 2} {Σ̃}_{𝒜^{c}} D_{𝒜^{c}}^{- 1 / 2}$ .

Theorem 3.2. Assume that (a) X follows a multivariate normal distribution; (b) the coverage condition holds; and (c) the squared profile correlations λ₁, λ₂, ⋯, λ_K are nonzero and different from each other. Then, for any fixed slicing scheme, under the null hypothesis H₀ that all the predictors in 𝒜^c are irrelevant, we have

({COP}_{1 : K}^{𝒜 + t_{d + 1}}, {COP}_{1 : K}^{𝒜 + t_{d + 2}}, \dots, {COP}_{1 : K}^{𝒜 + t_{p}}) \overset{D}{\to} (\sum_{k = 1}^{K} z_{k, d + 1}^{2}, \dots, \sum_{k = 1}^{K} z_{k, p}^{2}),

(18)

and

{\bar{COP}}_{1 : K}^{𝒜} \overset{D}{\to} max_{t \in 𝒜^{c}} \sum_{k = 1}^{K} z_{k, t}^{2}

(19)

as n goes to ∞. Here z_k = (z_k,d+1, ⋯, z_k,p) for k = 1, ⋯, K are mutually independent and each z_k follows a multivariate normal with mean zero and covariance matrix U_𝒜^c.

We now consider deletion steps of the COP procedure. We let 𝒜 denote the current collection of selected predictors before a deletion step, and let X_t be an arbitrary predictor in 𝒜. Note that ${COP}_{k}^{𝒜 - t} = {COP}_{k}^{𝒜̄ + t}$ , where 𝒜̃ = 𝒜 − t for 1 ≤ k ≤ K. Therefore, results similar to those stated in Theorem 3.1 and Corollary 3.1 can be obtained for $({COP}_{1}^{𝒜 - t}, {COP}_{2}^{𝒜 - t}, \dots, {COP}_{K}^{𝒜 - t}) and {COP}_{1 : K}^{𝒜 - t}$ after some modifications described below. First, our current “null hypothesis”, denoted as H_0t, is that X_t and the predictors in 𝒜^c are irrelevant. Second, the regression and constant variance conditions need to be imposed on the conditional expectation of X_t given X_𝒜̃ instead. The asymptotic distribution of ${\underline{COP}}_{1 : K}^{𝒜}$ , however, turns out to be fairly complicated if not entirely elusive, due to the fact that there does not exist a common null hypothesis for all X_t ∈ 𝒜. In what follows, we will establish two strong results that have implications for properly selecting the thresholds c_e and c_d, as well as for the consistency of the COP procedure in selecting true predictors.

3.2. Selection consistency of COP

Let 𝒯 be the collection of the true predictors under the SDR model. The principal profile correlation directions are η₁, η₂, ⋯, η_K, which form a basis of the central subspace. Assume that S₁, ⋯, S_H is a fixed slicing scheme used by SIR.

Let p_h = P(y ∈ S_h), $v_{K} = (η_{1}^{'} X - E (η_{1}^{'} X), \dots, η_{K}^{'} X - E (η_{K}^{'} X))'$ , and

M_{H, K} = \sum_{h = 1}^{H} p_{h} L_{h, K} L_{h, K}^{'},

(20)

where L_h,K = E(v_K|Y ∈ S_h). A few more conditions are needed for the results we state in the next two theorems.

Condition 5. X follows a multivariate normal distribution with covariance matrix Σ such that τ_min ≤ λ_min(Σ_p) ≤ λ_max(Σ_p) ≤ τ_max, where τ_min and τ_max are two positive constants, and λ_min(·) and λ_max(·) are the minimum and maximum eigenvalues of a matrix, respectively.

Condition 6. There exists a constant ω_H > 0 such that λ_min(M_H,K) > ω_H.

Condition 7. There exist constants $σ_{0}^{2}$ and υ > 0 such that for any slice S_h and any two predictors X_i and X_j, $Var (X_{j} | Y \in S_{h}) \leq σ_{0}^{2} and Var (X_{i} X_{j} | y \in S_{h}) \leq σ_{0}^{2}$ for all i, j = 1, ⋯, p, and h = 1, ⋯, H. In addition,

E (| X_{j} |^{l} | Y \in S_{h}) \leq \frac{l!}{2} Var (X_{j} | Y \in S_{h}) υ^{l - 2} and

E (| X_{i} X_{j} |^{l} | Y \in S_{h}) \leq \frac{l!}{2} Var (X_{i} X_{j} | Y \in S_{h}) υ^{l - 2}, for l \geq 2 .

Condition 8. Let η^j = (η_j1, η_j2, ⋯, η_jK)′ in which η_jk is the coefficient of X_j in the kth principal correlation direction η_k. There exist a positive constant ϖ > 0 and a nonnegative constant ξ₀; such that ‖ η^j ‖²> ϖ · n^−ξ₀ for j ∈ 𝒯, where ‖ · ‖ denotes the standard L₂ − norm.

Condition 9. lim_n→∞ p = ∞ and p = o(n^ϱ₀) with ϱ₀ ≥ 0 and 2ϱ₀ + 2ξ₀ < 1.

Condition 5 ensures that the variances of the predictors are on a comparable scale and that they are not strongly correlated. Condition 6 assumes a lower bound for the eigenvalues of M_H,K, which is slightly stronger than the coverage condition that ensures SIR to recover all the SDR directions. Condition 7 imposes conditions on the moments of the conditional expectations of X given Y ∈ S_h so that the Bernstein inequalities hold for the conditional sample means. Condition 8 assumes that the coefficients of any true predictors do not decrease to zero too fast as both n and p increase; otherwise, such predictors will not be identifiable asymptotically. Condition 9 allows p to increase as n increases, but their rates are constrained. Similar conditions have been used by others for establishing variable selection results for stepwise procedures in linear regression (Wang 2009, Fan and Lv 2008).

Theorem 3.3. Let 𝒜 be the set of currently selected predictors and let 𝒯 be the set of true predictors. Let $ϑ = ϖ \cdot ω_{H} \cdot \frac{τ_{min}^{2}}{2 τ_{max}}$ . Assume that Conditions 5–9 hold. Then, we have

P (min_{𝒜 : 𝒜^{c} \cap 𝒯 \neq \emptyset} max_{t \in 𝒜^{c} \cap 𝒯} {COP}_{1 : K}^{𝒜 + t} \geq ϑ n^{1 - ξ_{0}}) \to 1,

(21)

for any fixed slicing scheme as n goes to ∞.

The probability statement in (21) is not just about one given collection of predictors. It considers all the possible collections that do not include all the true predictors yet, that is, {𝒜 : 𝒜^c ⋂ 𝒯 ≠ ∅}. In other words, it considers all the possible scenarios where the null hypothesis H₀ is not true. Further note that ${max}_{t \in 𝒜^{c} \cap 𝒯} {COP}_{1 : K}^{𝒜 + t} \neq {\bar{COP}}_{1 : K}^{𝒜}$ . Because ${max}_{t \in 𝒜^{c}} {COP}_{1 : K}^{𝒜 + t} \geq {max}_{t \in 𝒜^{c} \cap 𝒯} {COP}_{1 : K}^{𝒜 + t}$ , from equation (21), we have

P (min_{𝒜 : 𝒜^{c} \cap 𝒯 \neq \emptyset} {\bar{COP}}_{1 : K}^{𝒜} \geq ϑ n^{1 - ξ_{0}}) \to 1 .

(22)

This result implies that by setting c_e to ϑn^1−ξ₀ or smaller, if the COP procedure has not collected all the true predictors yet, then with probability going to 1 (as n goes to ∞), it will continue to select a predictor to the current collection. Thus, the addition step of COP will not stop until all the true predictors are selected. Another way to interpret (22) is that the selection power of the COP procedure converges to 1 asymptotically.

Theorem 3.4. Assume that Conditions 5–9 hold. Then we have

P (max_{𝒜 : 𝒜^{c} \cap 𝒯 \neq \emptyset} max_{t \in 𝒜^{c}} {COP}_{1 : K}^{𝒜 + t} < C n^{ϱ}) \to 1,

(23)

for ϱ > 1/2 + ϱ₀, and any positive constant C, under any fixed slicing scheme with n going to ∞.

Theorem 3.4 has two implications. The first one regards the addition step of COP. Once all the true predictors are selected, that is, 𝒜^c ⋂ 𝒯 = ∅, the probability that it will select a false predictor from 𝒜^c converges to zero. The second implication concerns the deletion step. Consider one collection of selected predictors 𝒜̃ and assume that 𝒜̃ contains all the true predictors and also some irrelevant ones, that is, 𝒜̃ ⊃ 𝒯. Clearly,

{\underline{COP}}_{1 : K}^{𝒜̃} \leq min_{t \in 𝒜̃ - 𝒯} {COP}_{1 : K}^{𝒜̃ - t} \leq max_{𝒜 : 𝒜^{c} \cap 𝒯 = \emptyset} max_{t \in 𝒜^{c}} \sum_{k = 1}^{K} {COP}_{k}^{𝒜 + t} .

(24)

Therefore,

P ({\underline{COP}}_{1 : K}^{𝒜̃} < C_{n^{ϱ}}) \to 1 .

(25)

In other words, with probability going to 1, the COP procedure will delete an irrelevant predictor from the current collection.

One possible choice of the thresholds is $χ_{e}^{2} = ϑ n^{1 - ξ_{0}} and χ_{d}^{2} = ϑ \cdot n^{1 - ξ_{0}} / 2$ . From Theorem 3.3, asymptotically, the COP algorithm will not stop selecting variables until all the true predictors are included. Moreover, once all the true predictors are included, according to Theorem 3.4, all the redundant variables will be removed from the selected variables.

4. Implementation Issues

When implementing the COP algorithm, one needs to specify the number of profile correlation directions K, the thresholds for the addition and deletion steps c_e and c_d, and the slicing scheme particularly the number of slices H. A proper specification of these tuning parameters is critical for the success of the COP algorithm.

4.1. Slicing Schemes and the Choice of H

Li (1991) suggested that in terms of estimation, the performance of SIR is robust to the number of slices in general. The COP algorithm uses SIR to derive test statistics for selecting variables. It is of interest to understand the impact of a slicing scheme on the involved testing procedures. Again, we consider an addition step in the COP procedure. Let 𝒜 be the current collection of selected predictors. Let X_t be an arbitrary predictor in 𝒜^c.

Theorem 4.1. Assume that X follows a multivariate normal distribution. Then, for any given fixed slicing scheme, we have

P (\frac{{COP}_{1 : K}^{𝒜}}{n} \geq C_{H, 𝒜 + t}) \to 1, as n \to \infty,

(26)

where

C_{H, 𝒜 + t} = ({η̃}_{t, 𝒜})' M_{H, K} {η̃}_{t, 𝒜} / σ_{t, 𝒜}^{2},

(27)

$σ_{t, 𝒜}^{2} = Var (X_{t} | X_{𝒜})$ , η̃_t,𝒜 = Cov(X_t, v_K|X_𝒜) and M_H,K is defined in equation (20).

The difference between Theorem 3.1 and Theorem 4.1 is that the latter does not assume that X_t is an irrelevant predictor. When X_t is indeed a true predictor, then η^t is not a zero vector and max_{t∈𝒜^c⋂𝒯} C_H,𝒜+t is greater than zero. The larger C_H,𝒜+t is, the more likely X_t will be added to 𝒜. The next result shows that a finer slicing scheme leads to higher power for the addition step by COP. For any two different slicing schemes S = (S₁, ⋯, S_H₁) and $S' = (S_{1}^{'}, \dots, S_{H_{2}}^{'})$ , we say that S′ is a refinement of S, denoted by S′ ≾ S, if for any $S_{h'}^{'} \in S'$ , there exists a S_h ∈ S such that $S_{h'}^{'} \subseteq S_{h}$ .

Proposition 4.1. Suppose S and S′ are two slicing schemes such that S′ ≾ S. Then, for any η ∈ ℝ^K, we have

η' M_{H_{2}, K} η \geq η' M_{H_{1}, K} η,

(28)

where M_H₂,K and M_H₁,K are defined as in (20) under the slicing schemes S′ and S, respectively.

Proposition 4.1 implies that the constant C_H,𝒜 in Theorem 4.1 becomes larger when a finer slicing scheme is used. This further suggests that the power of the COP procedure in selecting true predictors tends to increase if a slicing scheme uses a larger number of slices. On the other hand, when a slicing scheme uses a larger number of slices, the number of observations in each slice will decrease, which makes the estimate of E(X|y ∈ S_h) less accurate and further makes the estimates of M = Cov{E(X|Y)} and its eigenvalues λ₁, ⋯, λ_K less stable. The success of the COP procedure hinges on a good balance between the number of slices and the number of observations in each slice. We observed from intensive simulation studies that, with a reasonable number of observations in each slice (say ≥ 20), a large number of slices is preferred.

4.2. Choice of c_e and c_d

Section 3 has characterized the asymptotic distributions or behaviors of the test statistics involved in the COP procedure. In theory, these results (Theorems 3.4 and 3.5) can be used for choosing the thresholds c_e and c_d. In practice, however, these thresholds should be used with much caution due to the following concerns. First, the distributions obtained in Section 3 are for a single addition or deletion step and under various assumptions. Second, the distributions are valid only in an asymptotic sense. In what follows, we propose to use a cross-validation procedure for selecting c_e and c_d.

Let {α_i}_1≤i≤d be a pre-specified grid on a sub-interval in (0, 1) and ${χ_{α_{i,} K}^{2}}_{1 \leq i \leq d}$ be the collection of the 100α_ith percentile of $χ_{K}^{2}$ . For convenience, we only consider the m pairs of $c_{e} = χ_{α_{i}, K}^{2} and c_{d} = χ_{α_{i} - 0.05, K}^{2}$ for 1 ≤ i ≤ m. Note that c_d < c_e and that there is only one tuning parameter we need to determine. We follow the general 5-fold cross validation scheme to select the best pair of c_e and c_d. We randomly divide the original data into five equal-sized subsets, and then apply the COP procedure to any four subsets to generate the estimation and variable selection results. The remaining subset of the data is used to test the model and generate a performance measurement. The performance measurements are averaged and the result is used as the CV score. We choose the pair of c_e and c_d that mazimizes the CV score.

We define the performance measure used in the CV procedure as follows. Suppose 𝒜 is the collection of selected predictors and η_1,𝒜, ⋯, η_K,𝒜 are the estimates of the principal profile correlation directions produced by applying the COP procedure to the training data set. We consider the first principal profile correlation direction first. Recall η_1,𝒜 is the direction that achieves the maximum correlation of a linear projection of X and the transformed response Y, and the optimal transformation is $T_{1} (Y) = E (η_{1, 𝒜}^{'} X | Y)$ (Theorem 3.1 in Chen and Li (1998)). With η_1,𝒜 estimated by η̂_1,𝒜 using the training data, we apply LOESS proposed by Cleveland (1979) to fit T₁(Y) using the training data and denote the fitted transformation as T̂₁(·). Let X̃ and Ỹ be the data matrix and the response vector of the testing data set. Then, the squared profile correlation between X̃ and Ỹ based on the direction η̂_1,𝒜 and transformation T̂₁(·) is computed as ${corr}^{2} ({T̂}_{1} (Ỹ), {η̂}_{1, 𝒜}^{'} X̃)$ . Similarly, the squared profile correlations between X̃ and Ỹ along η̂_2,𝒜, ⋯, η̂_K,𝒜 can be calculated. The overall performance measure is defined to be

PC = \sum_{k = 1}^{K} {corr}^{2} ({T̂}_{k} (Ỹ), {η̂}_{k, 𝒜}^{'} X̃) .

(29)

The CV score for any pair (c_e, c_d) is defined to be the average PC over the five possible partitions of the training-test data sets.

4.3. Selection of the number of directions K

To determine K, the number of principal profile correlation directions, we adopt a BIC-type criterion proposed by Zhu et al. (2006). For any given k between 1 and J, where J ≤ max(n, p) is a reasonable upper bound chosen by the user, we apply the COP procedure with K = k. Suppose the resulting collection of the selected predictors is 𝒜_k and the cardinality of 𝒜_k is p_k. Using the data involving only the selected predictors, we can estimate M = Cov(E(X_{𝒜_k}|Y)) as before and denote the result as M̂. Let θ̂₁ ≥ θ̂₂ ⋯ θ̂_p be the eigenvalues of M̂ + I_{p_k}, where I_{p_k} is the p_k by p_k identity matrix, and let τ be the number of θ̂_i's that are greater than 1. Define

G (k) = - log L (k) + \frac{log (n)}{2} k (2 p_{k} - k + 1),

(30)

where $log L (k) = \sum_{i = min (τ, k) + 1}^{p} (log {θ̂}_{i} + 1 - {θ̂}_{i})$ . We choose K = argmin_1≤k≤J G(k). In the original criterion proposed by Zhu et al. (2006), the authors show that the criterion produces a consistent estimate of K for fixed p_k. Our simulation study shows that the modified criterion leads to the correct specification of K for the COP procedure and can be generally used in practice.

5. Simulation Study

We have performed extensive simulation studies to compare the COP algorithm with a few existing variable selection methods and will present three examples in this section. When implementing the COP algorithm in these examples, we use the CV procedure and the GIC criterion discussed in the previous section to select the thresholds c_e and c_d and the dimensionality K, respectively. The grids used for selecting c_e is ${χ_{0.90, K}^{2}, χ_{0.95, K}^{2}, χ_{0.99, K}^{2}, χ_{0.999, K}^{2}, χ_{0.9999, K}^{2}}$ , and the associated grid for selecting c_d is ${χ_{0.85, K}^{2}, χ_{0.90, K}^{2}, χ_{0.94, K}^{2}, χ_{0.949, K}^{2}, χ_{0.9499, K}^{2}}$ . The range used for selecting K is from 1 to 4 (i.e., J = 4). For SSIR, we use the grid {0, 0.1, …, 0.9, 1} × {0, 0.1, …, 0.9, 1} to select the pair of tuning parameters that leads to its best performance. Both COP and SSIR involve slicing the range of the response variable, for which we use the same scheme to facilitate fair comparison.

5.1. Linear models

In this example, we consider the linear model

Y = X β + σ ε,

(31)

where X = (X₁, X₂, ⋯, X_p)′ follows a p-variate normal distribution with mean zero and covariances Cov(X_i, X_j) = ρ^|i−j| for 1 ≤ i, j ≤ p, and ε is independent of X and follows N(0, 1). The variable selection methods we compare the COP procedure to include LASSO, SCAD (Fan and Li 2001), MARS, and SSIR (Li 2007). The R packages SIS, lars and mda are used to run SCAD, LASSO and MARS, respectively. The tuning parameters involved in SCAD and LASSO are selected by cross validation. We use the codes provided by the original authors to run SSIR. In this example, we consider two specifications of the linear model given below.

Scenario 1.1 : p = 8, β = (3, 1.5, 2, 0, 0, 0, 0, 0)′, σ = 3, ρ = 0.5;
Scenario 1.2 : p = 1000, β = (3, 1.5, 1, 1, 2, 1, 0.9, 1, 1, 1, 0, ⋯, 0)′, σ = 1, ρ = 0.5.

Under Scenario 1.1, Model (31) involves 3 true predictors and 5 irrelevant variables, and was originally used in Tibshirani (1996) and Fan and Li (2001) to demonstrate the empirical performances of LASSO and SCAD. We randomly generated 100 data sets from Scenario 1.1, each with 40 data points (i.e. n=40), and applied the aforementioned methods to the data sets. Two quantities were used to measure the variable selection performance of each method, which are the average number of irrelevant predictors falsely selected as true predictors (denoted by FP) and the average number of true predictors falsely excluded as irrelevant predictors (denoted by FN). Note that under Scenario 1.1, FPs and FNs range from 0 to 5 and 0 to 3, respectively, with small values indicating good performances in variable selection. The FP and FN values of the tested methods are reported in Table 1.

Table 1.

Performance comparison under linear models: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods	p = 8, n = 40, σ = 3, ρ = 0.5		p = 1000, n = 200, σ = 1, ρ = 0.5

	FP (0, 5)	FN (0, 3)	FP (0,990)	FN (0,10)
LASSO	0.77(0.093)	0.16(0.037)	8.87(0.586)	0.00(0.000)
SCAD	0.67(0.094)	0.10(0.030)	6.05(0.926)	1.16(0.150)
MARS	4.00(0.059)	0.04(0.020)	30.64(0.165)	0.00(0.000)
SSIR	0.19(0.051)	0.96(0.068)	NA*	NA*
COP	0.71(0.080)	0.56(0.066)	2.28(0.203)	0.75(0.095)

Open in a new tab

Under Scenario 1.2, Model (31) involves ten true predictors and 990 irrelevant predictors and is clearly more challenging than Scenario 1.1. We randomly generated 100 data sets each with 200 data points (i.e. n=200) from Scenario 1.2. Note that in each data set, n < p. Similar to Scenario 1.1, we applied the methods mentioned above to the data sets and report the FP and FN values of these methods in Table 1. The tuning parameters in all these methods are determined by cross validation.

From the left panel of Table 1, under Scenario 1.1, SSIR has the lowest FP value (FP=0.19), that is, the average number of irrelevant variables selected by SSIR is 0.19; and COP has the third lowest FP values (0.71). The other methods tend to have more false positives than SSIR and COP. In terms of FNs, the order of the methods ranked from the lowest to the highest is MARS, SCAD, LASSO, COP, and SSIR. The relative sub-par performance of COP and SSIR is due to the fact that these two methods are developed for variable selection under models more general than the linear model.

From the right panel of Table 1, under Scenario 1.2, COP has the lowest FP value (FP=2.28). In terms of FN, LASSO and MARS have the lowest value with COP following modestly behind. Compared with MARS, COP has a much lower FP value and a slightly higher FN value. SSIR breaks down under Scenario 1.2 because the variance-covariance matrix of X is no longer invertible. In terms of both FP and FN, COP outperformed SCAD under the scenario. One explanation for this comparison result is that SCAD involves non-convex optimization, and can be unstable in implementation.

5.2. Nonlinear multiple index models

In this example, we consider the following multiple index model,

Y = \frac{X_{1} + X_{2} + \dots + X_{d}}{0.5 + {(1.5 + X_{2} + X_{3} + X_{4})}^{2}} + σ ε,

(32)

where X₁, ⋯,, X_p are i.i.d. N(0, 1) random variables, ε is N(0, 1) and independent of X, and d and σ are parameters that need to be further specified. This model was originally used in Li (1991) for demonstrating the performance of SIR. It is not difficult to see that given the two projections X₁ + X₂ + ⋯ + X_d and X₂ + X₃ + X₄, Y and X are independent with each other. The dimensionality of the central subspace of Model (32) is two, and the collection of true predictors is {X₁, ⋯, X_d} ⋃ {X₂, X₃, X₄}. Because Model (32) is nonlinear, methods that were designed specifically for linear models such as LASSO and SCAD are clearly at disadvantage. Therefore, in this example, we only compare the performances of MARS, SSIR and COP.

By specifying p, d and σ at different values, we have the following three scenarios,

Scenario 2.1: p = 30, d = 3, σ = 0.1;
Scenario 2.2: p = 30, d = 3, σ = 2;
Scenario 2.3: p = 400, d = 8, σ = 0.1.

For each scenario, we generated 100 data sets each with 200 observations (i.e. n=200) and applied MARS, SSIR and COP to each data set. The resulting FP and FN values are reported in Table 2.

Table 2.

Performance comparison under multiple index model: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods	σ = 0.1, p = 30, d = 3		σ = 2, p = 30, d = 3		σ = 0.1, p = 400, d = 8

	FP (0, 26)	FN (0, 4)	FP (0, 26)	FN (0–4)	FP (0, 292)	FN (0, 8)
MARS	16.55(0.174)	0.03(0.017)	17.18(0.186)	0.32(0.053)	NA*	NA*
SSIR	0.12(0.033)	0.91(0.029)	4.14(0.288)	1.76(0.115)	NA*	NA*
COP	1.88(0.149)	0.83(0.038)	3.26(0.210)	1.71(0.104)	8.93(0.576)	0.18(0.081)

Open in a new tab

For Scenario 2.1, MARS achieved the lowest FN value (0.03), but its FP value was unacceptably high (16.55); SSIR had the lowest FP values, but its FN value was the highest among the three. The FP and FN values of COP were between the extremes. It appears that the performances of SSIR and COP are similar under Scenario 2.1. For Scenario 2.2, COP outperformed SSIR in terms of both FP and FN values. MARS again achieved the lowest FN value (0.32) at the expense of an unacceptable FP value (17.18). Scenario 2.3 is the most challenging among the three scenarios, in which the number of predictors exceeds the number of observations. Both MARS and SSIR broke down under this scenario. However, COP still demonstrated an excellent performance with its FP and FN values reasonably low.

5.3. Heteroscedastic models

In the previous examples, the true predictors affect only the mean response. In this example, we consider the following heteroscedastic model

Y = \frac{0.2 ε}{1.5 + \sum_{j = 1}^{p} β_{j, 1} X_{j}},

(33)

where X = (X₁, X₂, ⋯, X_p)′ follows a p-variate normal distribution with mean zero and covariances Cov(X_i, X_j) = ρ^|i−j| for 1 ≤ i, j ≤ p, ε is independent of X and follows N(0, 1), and β_j,1 = 1 for 1 ≤ j ≤ 8 and =0 for j ≥ 9. Note that the central subspace is spanned by β₁ = (β_1,1, β_2,1, …, β_p,1)′ and the number of true predictors is 8. We further specify ρ and p in (33) and consider the following three scenarios,

Scenario 3.1 : ρ = 0, p = 500;
Scenario 3.2 : ρ = 0, p = 1000;
Scenario 3.3 : ρ = 0.3, p = 1500.

For each scenario, we generated 100 data sets each with n = 1000 observations and applied MARS, SSIR and COP to the data sets. The FP and FN values of the three methods are listed in Table 3.

Table 3.

Performance comparison under heteroscedastic model: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods	ρ = 0 n = 1000, p = 500		ρ = 0 n = 1000, p = 1000		ρ = 0.3 n = 1000, p = 1500

	FP (0, 492)	FN (0, 8)	FP (0, 992)	FN (0, 8)	FP (0, 1492)	FN (0, 8)
MARS	212.15(0.428)	4.83(0.116)	230.33(0.372)	6.16(0.129)	236.60(0.524)	6.84(0.126)
SSIR	52.54(1.970)	0.88 (0.149)	NA*	NA*	NA*	NA*
COP	5.79(0.365)	1.21(0.030)	13.14(0.734)	1.29(0.037)	21.36(0.937)	1.5(0.039)

Open in a new tab

Under Scenario 3.1, both SSIR and COP outperformed MARS. The FN value of SSIR (0.99) is less than that of COP (1.21), but the FP value (52.54) is much larger than that of COP (5.71). Under both Scenarios 3.2 and 3.3, in which p is much larger than n, SSIR broke down, but COP still demonstrated excellent performances. The performances of MARS under these two scenarios were fairly poor.

6. Application: Predict Gene Expression from Sequence Using Next-generation Sequencing Data

Embryonic stem cells (ESCs) maintain self-renewal and pluripotency as they have the ability to differentiate into all cell types. To enhance the understanding of the embryonic stem cells development, predictive models, such as regression models, can be constructed in which the gene expression is regarded as the response variable and various features associated with gene-regulating transcription factors (TFs) are taken as the predictors. Examples of such features include motif scores based on position-specific weight matrices of motifs recognized by the TFs Conlon et al. (2003), and ChIP-chip log ratios.

Recently, the emerging next-generation sequencing technologies, in particular, RNA-Seq and ChIP-Seq, offer researchers an unprecedented opportunity to build predictive models for complex biological processes such as gene regulation. Compared to the traditional hybridization-based methods, such as microarray, RNA-Seq and ChIP-Seq provide more accurate quantification of gene expression and TF-DNA binding locations respectively (Mortazavi et al. 2008, Wilhelm et al. 2008, Nagalakshmi et al. 2008, Boyer et al. 2005, Johnson et al. 2007).

To quantify gene expression in RNA-Seq data, one may calculate the RPKM (reads per kilobase of exon region per million mapped reads), which has been shown to be proportional to the gene expression levels (Cloonan et al. 2008). From ChIP-Seq data, Ouyang et al. (2009) proposed a feature named TF association strength (TFAS), which was shown to explain a much higher proportion of gene expression variation than traditional predictors in predictive models. In particular, for each TF, the TFAS for each gene is computed as a weighted sum of the corresponding ChIP-Seq signal strengths, where the weights reflect the proximity of the signal to the gene. We here examine whether we can build a better predictive model for gene expressions by combining both TFASs and motif scores of TF in mouse ESCs.

To achieve this goal, we compiled a data set consisting of gene expressions, TFASs, and motif scores. In this data set, the RPKMs were calculated as gene expression levels from RNA-Seq data in mouse ESCs (Cloonan et al. (2008)). The TFASs of 12 TFs were calculated from the ChIP-Seq experiments in mouse ESCs (Chen et al. (2008)). In addition, we supplement this data set with motif scores of putative TFs of mouse. From the transcription factor database TRANSFAC, we complied a list of 300 TF binding motifs of mouse. For each gene, a matching score was calculated using the scoring system described in Zhong et al. (2005) for each TFBM. The matching score can be considered intuitively as the expected number of occurrences of a TFBM on the gene's promoter region. To build a predictive model in mouse ESC, we treat the gene expression as the response variable and the 12 TFASs as well as the 300 TF motif matching scores as predictors. More precisely, the response is a vector with 12408 entries and the data matrix is a 12408 × 312 matrix with (i, j)th entry representing the TFAS score of of the ith gene's promoter region for TF j if j ≤ 12; representing the matching score of the ith gene's promoter region for TF j if j > 12.

We have applied COP to this data set. The procedure has identified two principal directions and selected in total 42 predictors. The first squared profile correlation is λ₁ = 0.67, and the second squared profile correlation is λ₂ = 0.20. Among the 12 TFASs calculated from ChIP-Seq, eight of them were selected by COP. In particular, Oct4 is a well-known master regulator regulating the pluripotency, and Klf4 regulates differentiation (Cai et al. 2010). Evidence also suggests that at these early stages of development, STAT3 activation is required for self-renewal of ESCs (Matsuda et al. 1999). Among the 300 TF motif scores, 34 of them are selected by COP. To further understand what extra information that TF motif scores provide, we annotate the functions of the 34 TFs. It is of interest to note that 24 out of the 34 selected motifs correspond to TFs that are either regulators for development or cancer-related; see Table 4. Since ESCs are in a developmental phase, it is not surprising to have active TFs regulating general development. Some recent evidences suggest that tumor suppressors that control cancer cell proliferation also regulate stem cell self-renewal (Pardal et al. 2005). Thus, a careful study of these cancer-related TFs could lead to a better understanding of the stem cell regulatory network.

Table 4.

Motifs identified

development	COUP-TF, AP2, Sp1, CHOP C/EBpalpha, NF-AT Pax, Pax8, GABP, En1, TTF1 PITX2, NKx2-2, HIXA4, ZF5, PPAR direct repeat 1
cancer	IRF1, EVI1, NF1, GKLF, Whn VDR, POU6F1, Arnt, Cdx2
8 selected TFAS	E2F1, Mycn, ZFx, Klf4 Tcfcp2/1, Oct4, Stat3, Smad1

Open in a new tab

7. Discussion

The contribution of the COP procedure to the development of variable selection methodologies for high dimensional regression analysis is two-fold. First, it dos not impose any assumption on the relationship between the response variable and the predictors, and the sufficient dimension reduction framework that the COP procedure relies on includes fully nonparametric models as special cases. Therefore, COP can be considered a model-free variable selection procedure applicable in any high dimensional data analysis. Second, as demonstrated by our simulation studies, the COP procedure can effectively handle hundreds and thousands of predictors, which can be extremely challenging to other existing methods for variable selection beyond linear or parametric models. Like linear stepwise regression, the COP procedure may encounter issues typical to stepwise procedures as discussed in Miller (1984). Nonetheless, we believe that the COP procedure should become an indispensable member of the repository of variable selection tools and recommend its broad use. When a parametric model is postulated for the relationship between the response and the predictor variables and model-specific variable selection methods are available, we recommend to use COP together with these methods as a safeguard against possible model-misidentification. We have implemented the COP procedure in R, and the R package can be downloaded from http://cran.r-project.org/web/packages/COP/ or requested from the authors directly.

As a trade-off, the COP procedure imposes various assumptions on the distribution of the predictors, of which the linearity assumption is the most fundamental and crucial. When the linearity condition is required to hold for any lower dimensional projection, it is equivalent to requiring that the joint distribution of the predictors is elliptically-contoured (Eaton 1986). Hall and Li (1993) establishes the fact that low dimensional projections from high dimensional data approximately satisfy the linearity condition, which to a certain degree alleviates the concern of the linearity assumption and explains why SIR and the COP procedure worked well under mild violation of the assumption. When the linearity condition is heavily violated, data re-weighting schemes such as the Voronoi re-weighting scheme (Cook and Nachtsheim 1994) can be used to correct the violation. We plan to incorporate such schemes into the COP procedure in the future.

When the number of the predictors is extremely large, the performance of the COP procedure can be compromised. This is also the case for variable selection methods under the linear model. Lately, Fan and Lv (2008) advocates a two-step approach to attack the so-called ultra-high dimensionality. The first step is to perform screening to reduce the dimensionality from ultra-high to high or moderately high, and then in the second step, variable selection methods are applied to identify the true predictors. The same approach can be used for variable selection under the SDR framework. More precisely, we can apply the forward COP (FCOP) precedure, which is simply the COP procedure with the deletion step removed, to reduce the dimensionality of a problem from ultra-high to moderely high. The FCOP procedure is much easier to implement and computationally more efficient than the COP procedure. Then, the usual COP procedure is applied to the reduced data to select the true predictors. This approach is currently under investigation and the results will be reported in a future publication.

Acknowledgments

This work was supported by a NIH grant U01 ES016011, a DOE grant from the office of Science (BER) and an NSF grant DMS 1120256 to WZ, a NIH grant R01-HG02518-02 and a NSF grant DMS 1007762 to JL, a NSF grant DMS 0707004 to YZ.

Appendix

A.1 PROOF OF PROPOSITION 2.1

Let 𝒮^⊥(B) denote the space of vectors such that for any ρ ∈ 𝒮^⊥(B) and any β ∈ 𝒮(B), ρ′Σβ = 0. Let 𝒮^⊥(K̃) be the space of vectors such that for any ρ ∈ 𝒮^⊥(K̃,) ρ′Ση_k = 0 for k = 1, ⋯, K̃. We will show that 𝒮^⊥(B) ⊆ 𝒮^⊥(K̃), which means, for any ρ ∈ 𝒮^⊥(B) P(ρ) = 0. First, because for any T, T(Y)⊥η′X | B′X, then cov(T(Y), η′X) = E(T(Y)η′X) = E(E(T(Y)|B′X)E(η′X|B′X)). Due to the linearity condition, for any ρ ∈ 𝒮^⊥(B), $E (ρ' X | B' X) = c_{1} β_{1}^{'} X + \dots + c_{K} β_{K}^{'} X$ , where c₁, ⋯, c_K are linear coefficients. In addition, since $cov (ρ' X, β_{k}^{'} X) = 0$ for k = 1, ⋯, K, E(ρ′X|B′X) = 0. Consequently, ${corr}^{2} (T (Y), ρ' X) = \frac{{cov}^{2} (T (Y), ρ' X)}{var (T (Y)) var (ρ' X)} = 0$ , P(ρ) = 0 and 𝒮^⊥(B) ⊆ 𝒮^⊥(K̃). Proposition 2.1 holds.

A.1 PROOF OF THEOREM 3.1

Without loss of generality, we let 𝒜 = {1, ⋯, d} and t = d + 1. Let X^(j) be the vector of n i.i.d observations of the jth variable for j = 1, ⋯, d + 1. We assume that the predictors have been centered to have zero sample mean. Denote X_n×j = (X⁽¹⁾, ⋯, X^(j)) for j = d, d + 1. We let

{M̂}^{(j)} = \sum_{h = 1}^{H} \frac{n_{h}}{n} {({X̄}_{h}^{(j)})}^{T} {X̄}_{h}^{(j)} for j = d, d + 1

where ${X̄}_{h}^{(j)} (j = d, d + 1)$ is the average of the first j variables for those individuals whose responses fall into the hth slice S_h, h = 1, ⋯, H. Let n_h be the number of observations in the hth slice, h = 1, ⋯, H. Let ${λ̂}_{i}^{(j)}$ be the ith largest eigenvalue of ${Σ̂}_{j}^{- 1} {M̂}^{(j)}$ for j = d, d + 1 respectively, where Σ̂_j is the sample variance-covariance matrix of X_n×j. It is difficult to see the asymptotic distribution of ${λ̂}_{i}^{(d + 1)} - {λ̂}_{i}^{(d)}$ for i = 1, ⋯, K directly based on ${Σ̂}_{j}^{- 1} {M̂}^{(j)}$ for j = d, d + 1. We did some transformations such that the transformed ${Σ̂}_{j}^{- 1} {M̂}^{(d)}$ (with eigenvalues unchanged) is a sub-matrix of the transformed ${Σ̂}_{j}^{- 1} {M̂}^{(d + 1)}$ .

Let

{γ̂}_{n \times 1} = {({γ̂}_{1}, \dots, {γ̂}_{n})}^{T} = \frac{1}{σ̂} [I - X_{n \times d} {(X_{n \times d}^{T} X_{n \times d})}^{- 1} X_{n \times d}^{T}] X^{(d + 1)},

where σ̂² is the sample variance of $[I - X_{n \times d} {(X_{n \times d}^{T} X_{n \times d})}^{- 1} X_{n \times d}^{T}] X^{(d + 1)}$ . Denote ${γ̄}_{h} = n_{h}^{- 1} \sum_{y_{i} \in S_{h}} {γ̂}_{i}$ . Let γ = X_d+1 − E(X_d+1|X₁, ⋯, X_d), and γ_n×1 be the n regression error terms of the n observed X_d+1 on X₁, ⋯, X_d. Then γ_n×1 are i.i.d with mean zero and a finite variance. Under the null hypothesis H₀ : η_d+1,i = 0, i = 1, ⋯, K we have E(γ|y) = E(E(γ|X₁, ⋯, X_d)|y) = 0 for any y. Let γ̄ be the mean of γ_n×1. Then

{γ̂}_{n \times 1} = {({γ̂}_{1}, \dots, {γ̂}_{n})}^{T} = \frac{1}{σ̂} [I - X_{n \times d} {(X_{n \times d}^{T} X_{n \times d})}^{- 1} X_{n \times d}^{T}] (γ_{n \times 1} - γ̄) .

With transformations on ${Σ̂}_{j}^{- 1} {M̂}^{(d)}$ , we showed that ${λ̂}_{i}^{(d + 1)} - {λ̂}_{i}^{(d)}$ for i = 1, ⋯, K equals a squared linear combination of γ̄_h. Thus, we just need to show that (γ̄₁, ⋯, γ̄_H) converges to a multivariate normal distribution, and we complete the proof. Let $(z_{1}, \dots, z_{d})' = Σ_{d}^{- / 2} (x_{1}, \dots, x_{d})'$ . Define four matrices, A_H×H, B_H×d, E_d×d and Γ_H×d, where A_H×H=diag{var(γ|y ∈ S₁), ⋯, var(γ|y ∈ S_H)}/σ²; the (h, j)th entry of B_H×d is $\sqrt{p_{h}} cov (z_{j} γ, γ | y \in S_{h}) / σ^{2}$ , the (j, j′)th entry of E_d×d equals cov(z_j′γ, z_jγ)/σ², the (h, j)th entry of Γ_H×d is $\sqrt{p_{h}} E (z_{j} | y \in S_{h})$ and σ² = lim_n→ σ̂² = Var(γ). Let ϒ be a d by d matrix and

ϒ = Γ_{H \times d}^{T} A_{H \times H} Γ_{H \times d} - Γ_{H \times d}^{T} B_{H \times d} Γ_{H \times d}^{T} Γ_{H \times d} - Γ_{H \times d}^{T} Γ_{H \times d} B_{H \times d}^{T} Γ_{H \times d} + Γ_{H \times d}^{T} Γ_{H \times d} E_{d \times d} Γ_{H \times d}^{T} Γ_{H \times d} .

Define Q̃ to be a d × K matrix with jth column as $q_{j} / {(λ_{j}^{(d)} (1 - λ_{j}^{(d)}))}^{1 / 2}$ , where q_j is the the jth eigenvector of the limiting matrix ${lim}_{n \to \infty} {Σ̂}_{d}^{- 1 / 2} {M̂}^{(d)} {Σ̂}_{d}^{- 1 / 2}, and λ_{j}^{(d)} = {lim}_{n \to \infty} {λ̂}_{j}^{(d)}$ . Then W_Kt = Q̃^TϒQ̃.

A.1 PROOF OF Corollary 3.1

With an additional condition that E(γ²|X₁, ⋯, X_d) is constant, we can show that the asymptotic variance matrix of (γ̄₁, ⋯, γ̄_H) adopts a special form, with which the asymptotic standard chi-quare distribution can be derived.

A.3 PROOF OF THEOREM 3.2

Without loss of generality, we let 𝒜 = {X₁, ⋯, X_d}. Following the notations used in Theorem 3.1, let γ_j = X_j − E(X_j|X_i, i ∈ 𝒜) for j ∈ 𝒜^c, and

{γ̂}^{j} = ({γ̂}_{j, 1}, \dots, {γ̂}_{j, n})' = \frac{1}{{σ̂}_{j}} [I_{n} - X_{n \times d} {((X_{n \times d})' X_{n \times d})}^{- 1} (X_{n \times d})'] X^{(j)} .

Let ${γ̄}_{h}^{j} = n_{h}^{- 1} \sum_{y_{i} \in S_{h}} {γ̂}_{j, i}$ . Similar as in the proof of Theorem 3.1, we basically show ${γ̄}_{h}^{j}$ for j = d + 1, ⋯, p and h = 1, ⋯, H converge to a multivariate normal distribution.

A.3 PROOF OF THEOREM 3.3

We use the same notations defined in the proof of Theorem 3.2. Let ${γ̄}_{h}^{j} = n_{h}^{- 1} \sum_{y_{i} \in S_{h}} {γ̂}_{j, i}$ . Let ${λ̂}_{k}^{(d)}$ be defined as in the proof of Theorem 3.1. First, for any t, ${COP}_{1 : K}^{𝒜 + t} \geq n (\sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)})$ , and

| \sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | \geq | \sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - λ_{k}^{(d)}) | - | \sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - {λ̂}_{k}^{(d + 1)}) | - | \sum_{k = 1}^{K} (λ_{k}^{(d)} - {λ̂}_{k}^{(d)}) | .

Since X follows a multivariate normal distribution, from Li (1991), $λ_{k}^{(d)} = λ_{k}^{(d + 1)} = 0$ for k > K, then

\sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - λ_{k}^{(d)}) = lim_{n \to \infty} trace ({Ω̃}^{(d + 1)}) - trace ({Ω̂}^{(d)}) = lim_{n \to \infty} \sum_{h = 1}^{H} \frac{n_{h}}{n} {({γ̄}_{h}^{j})}^{2} = \sum_{h = 1}^{H} p_{h} \frac{E^{2} (γ_{j} | y \in S_{h})}{σ_{j}^{2}} .

We need to use the two Lemmas 8.1 and 8.2 stated below. The proofs of the two lemmas are omitted here. From Lemma 8.1,

max_{t \in 𝒜^{c} \cap 𝒯} n (\sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)}) \geq ϖ \cdot ω_{H} \cdot n^{1 - ξ_{0}} \frac{τ_{min}^{2}}{τ_{max}} - | \sum_{k = 1}^{K} n (λ_{k}^{(d + 1)} - {λ̂}_{k}^{(d + 1)}) | - | \sum_{k = 1}^{K} n (λ_{k}^{(d)} - {λ̂}_{k}^{(d)}) | .

Then as long as

max_{𝒜 \subseteq {1, \dots, p}} n | \sum_{k = 1}^{K} (λ_{k}^{(d)} - {λ̂}_{k}^{(d)}) | \leq ϑ n^{1 - ξ_{0}} / 2,

we have

min_{𝒜 : 𝒜^{c} \cap 𝒯 \neq \emptyset} max_{t \in 𝒜^{c} \cap 𝒯} {COP}_{1 : K}^{𝒜 + t} \geq ϑ n^{1 - ξ_{0}} .

From Lemma 8.2,

P (max_{𝒜 \subseteq {1, \dots, p}} | \sum_{k = 1}^{K} λ_{k}^{(d)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | > ϑ n^{- ξ_{0}} / 2) \leq 2 Kp (p + 1) C_{1} exp {- C_{2} n^{1 - 2 ξ_{0}} \frac{τ_{min}^{2} ϑ^{2}}{256 K^{2} p^{2}}} .

Under Condition 8, since p = o(n^ϱ₀) with 2ϱ₀ + 2ξ₀ < 1, $P ({max}_{𝒜 \subseteq {1, \dots, p}} | \sum_{k = 1}^{K} λ_{k}^{(d)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | > ϑ n^{- ξ_{0}} / 2) \to 0, and P ({min}_{𝒜 : 𝒜^{c} \cap 𝒯 \neq \emptyset} {max}_{t \in 𝒜^{c} \cap 𝒯} {COP}_{1 : K}^{𝒜 + t} \geq ϑ n^{1 - ξ_{0}}) \to 1$ .

A.3 PROOF OF THEOREM 3.4

Since

| \sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | \leq | \sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - λ_{k}^{(d)}) | + | \sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - {λ̂}_{k}^{(d + 1)}) | + | \sum_{k = 1}^{K} (λ_{k}^{(d)} - {λ̂}_{k}^{(d)}) |,

and with 𝒯 ⊆ 𝒜, $| \sum_{k = 1}^{K} (λ_{k}^{(d + 1)} - λ_{k}^{(d)}) | = 0$ , then from Lemma 8.2, $P ({max}_{𝒜 \subseteq {1, \dots, p}} | \sum_{k = 1}^{K} λ_{k}^{(d)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | > ε) \to 0$ for ε > Cn^ϱ₀−1/2 and Theorem 3.4 holds.

Lemma 8.1. Under the same conditions as in Theorem 3.3, for any 𝒜 ⊆ {1, ⋯, p} and 𝒜^c ⋂ 𝒯 ≠ ∅,

max_{j \in 𝒜^{c} \cap 𝒯} \sum_{h = 1}^{H} p_{h} E^{2} (γ_{j} | y \in S_{h}) / σ_{j}^{2} \geq τ_{min}^{2} \cdot ϖ \cdot ω_{H} \cdot n^{- ξ_{0}} / τ_{max} > 0 .

Lemma 8.2. Under the same conditions as in Lemma 8.1,

P (max_{𝒜 \subseteq {1, \dots, p}} | \sum_{k = 1}^{K} λ_{k}^{(d)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)} | > ε) \leq 2 Kp (p + 1) C_{1} exp {- C_{2} n \frac{τ_{min}^{2} ε^{2}}{64 K^{2} p^{2}}} .

A.3 PROOF OF THEOREM 4.1

For coherence, we use the same notation as defined in the proof of Theorem 3.1. Without loss of generality, let 𝒜 = {1, ⋯, d} and t = d + 1. Under the assumption that X_n×(d+1) has a multivariate normal distribution, we derive the limiting value of $(\sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)})$ as n → ∞ for fixed slices. Let Ξ_K×K be the variance-covariance matrix of v_K.

Because {X₁, ⋯, X_d+1} follow a multivariate normal distribution, we have $γ = X_{d + 1} - (ρ_{0} + \sum_{i = 1}^{d} ρ_{i} X_{i}) and γ ~ N (0, σ_{d + 1}^{2})$ where the ρ_i are the coefficients. Since we assume that the response only depends on K linear combinations of X_n×(d+1), Ω̃^(d+1) and Ω̂^(d) have at most K nonzero eigenvalues, and

\frac{\sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)}}{trace ({Ω̃}^{(d + 1)})} \overset{P}{\to} 1, \frac{\sum_{k = 1}^{K} {λ̂}_{k}^{(d)}}{trace ({Ω̂}^{(d)})} \overset{P}{\to} 1, \frac{\sum_{k = 1}^{K} {λ̂}_{k}^{(d + 1)} - \sum_{k = 1}^{K} {λ̂}_{k}^{(d)}}{trace ({Ω̃}^{(d + 1)}) - trace ({Ω̂}^{(d)})} \overset{P}{\to} 1 .

Because of the following three results:

$trace ({Ω̃}^{(d + 1)}) - trace ({Ω̂}^{(d)}) = \sum_{h = 1}^{H} n_{h} {({γ̄}_{h})}^{2} / n$ .
${γ̄}_{h} \overset{P}{\to} E (γ | y \in S_{h}) / σ_{d + 1}$ , h = 1, ⋯, H.
Since $E (γ | v_{K}) = {η̃}_{t, 𝒜} Ξ_{K \times K}^{- 1} v_{K}$ , then
$E (γ | y \in S_{h}) = E [E (γ | v_{K}) | y \in S_{h}] = {η̃}_{t, 𝒜}^{'} Ξ_{K \times K}^{- 1} L_{H, K} .$

Combining 1,2,3, we have $\sum_{h = 1}^{H} n_{h} {({γ̄}_{h})}^{2} / n \overset{P}{\to} {η̃}_{t, 𝒜}^{'} Ξ_{K \times K}^{- 1} M_{H, K} Ξ_{K \times K}^{- 1} {η̃}_{t, 𝒜}$ . Since $Ξ_{K \times K}^{- 1} = I_{K \times K}$ ,
$\sum_{k = 1}^{K} {λ̂}_{k}^{𝒜 + t} - \sum_{k = 1}^{K} {λ̂}_{k}^{𝒜} \overset{P}{\to} \frac{1}{σ_{d + 1}^{2}} {η̃}_{t, 𝒜}^{'} M_{H, K} {η̃}_{t, 𝒜},$
and Theorem 4.1 holds.

A.3 PROOF OF Proposition 4.1

Note that η′M_H,Kη = Var (E(η′v_K|y ∈ S_h)) and

Var (E (η' v_{K} | y \in S_{h'}^{'})) = Var (E (η' v_{K} | y \in S_{h})) + Var [Var (E (η' v_{K} | y \in S_{h'}^{'}) | S_{h})] .

Thus, Proposition 4.1 holds.

Contributor Information

Wenxuan Zhong, Email: wenxuan@illinois.edu, Department of Statistics, University of Illinois at Urbana Champaign, Champaign, IL 61820.

Tingting Zhang, Email: tz3b@virginia.edu, Department of Statistics, University of Virginia, Charlottesville, VA 22904.

Yu Zhu, Email: yuzhu@stat.purdue.edu, Department of Statistics, Purdue University, West Lafayette, IN 47907.

Jun S. Liu, Email: jliu@stat.harvard.edu, Department of Statistics, Harvard University, Cambridge, MA 02138.

References

Bondell H, Li L. Shrinkage inverse regression estimation for model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2009;71(1):287–299. [Google Scholar]
Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122(6):947–956. doi: 10.1016/j.cell.2005.08.020. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cai J, Xie D, Fan Z, Chipperfield H, Marden J, Wong WH, Zhong S. Modeling co-expression across species for complex traits: insights to the difference of human and mouse embryonic stem cells. PLoS Comput. Biol. 2010;6 doi: 10.1371/journal.pcbi.1000707. e1000707. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen C-H, Li K-C. Can SIR be as popular as multiple linear regression? Statistica Sinica. 1998;8:289–316. [Google Scholar]
Chen X, Xu H, Yuan P, Fang F, M H, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117. doi: 10.1016/j.cell.2008.04.043. [DOI] [PubMed] [Google Scholar]
Cleveland W. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 1979;74(368):829–836. [Google Scholar]
Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 2008;5(7):613–619. doi: 10.1038/nmeth.1223. [DOI] [PubMed] [Google Scholar]
Conlon E, Liu X, Lieb J, Liu J. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences. 2003;100(6):3339–3344. doi: 10.1073/pnas.0630591100. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cook R. Testing predictor contributions in sufficent dimension reduction. Annals of Statistics. 2004;32(3):1062–1092. [Google Scholar]
Cook RD. An Introduction to Regression Graphics. New York: Wiley; 1994. [Google Scholar]
Cook RD, Nachtsheim CJ. Reweighting to achieve elliptically contoured covariates in regression. J. Amer. Statist. Assoc. 1994;89:592–599. [Google Scholar]
Eaton ML. A characterization of spherical distributions. Journal of Multivariate Analysis. 1986;20:272–276. [Google Scholar]
Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32(2):407–499. [Google Scholar]
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96(456):1348–1360. [Google Scholar]
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20:101–148. [PMC free article] [PubMed] [Google Scholar]
Friedman JH. Pathwise coordinate optimization. Annals of applied statistics. 2007;1(2):302–332. [Google Scholar]
Friedman JH, Tukey JW. A projection pursuit algorithm for explanatory data analysis. IEEE Trans. Comput. 1974;C-23:881–889. [Google Scholar]
Fung W, He X, Liu L, Shi P. Dimension reduction based on canonical correlation. Statistica Sinica. 2002;12(4):1093–1114. [Google Scholar]
Hall P, Li K-C. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics. 1993;21:867–889. [Google Scholar]
Huber PJ. Projection pursuit. The Annals of Statistics. 1985;13:435–475. [Google Scholar]
Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–1502. doi: 10.1126/science.1141319. [DOI] [PubMed] [Google Scholar]
Li K-C. Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 1991;86:316–327. [Google Scholar]
Li L. Sparse sufficient dimension reduction. Biometrika. 2007;94(3):603–613. [Google Scholar]
Li L, Cook R, Nachtsheim C. Model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2005;67:285–299. [Google Scholar]
Matsuda T, Nakamura T, Nakao K, Arai T, Katsuki M, Heike T, Yokota T. STAT3 activation is sufficient to maintain an undifferentiated state of mouse embryonic stem cells. EMBO J. 1999;18(15):4261–4269. doi: 10.1093/emboj/18.15.4261. [DOI] [PMC free article] [PubMed] [Google Scholar]
Miller AJ. Selection of subsets of regression variables. J. Roy. Statist. Soc. Ser. A. 1984;147(3):389–425. [Google Scholar]
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5(7):621–628. doi: 10.1038/nmeth.1226. [DOI] [PubMed] [Google Scholar]
Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–1349. doi: 10.1126/science.1158441. [DOI] [PMC free article] [PubMed] [Google Scholar]
Ouyang Z, Zhou Q, Wong WH. Chip-seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. USA. 2009;106:21521–21526. doi: 10.1073/pnas.0904863106. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pardal R, Molofsky AV, He S, Morrison SJ. Stem cell self-renewal and cancer cell proliferation are regulated by common networks that balance the activation of proto-oncogenes and tumor suppressors. Cold Spring Harb Symp Quant Biol. 2005;70:177–185. doi: 10.1101/sqb.2005.70.057. [DOI] [PubMed] [Google Scholar]
Shao J. An asymptotic theory for linear model selection (with discussion) Statistica Sinica. 1998;7:221–264. [Google Scholar]
Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B. 1996;58:267–288. [Google Scholar]
Wang h. Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc. 2009;104:1512–1524. [Google Scholar]
Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453(7199):1239–U39. doi: 10.1038/nature07002. [DOI] [PubMed] [Google Scholar]
Zeng P, Zhu Y. An integral transform method for estimating the central mean and central subspaces. Journal of Multivariate Analysis. 2010;101(1):271–290. [Google Scholar]
Zhong W, Zeng P, Ma P, Liu J, Zhu Y. Regularized sliced inverse regression for motif discovery. Bioinformatics. 2005;21(22):4169–4175. doi: 10.1093/bioinformatics/bti680. [DOI] [PubMed] [Google Scholar]
Zhou J, He X. Dimension reduction based on constrained cannonical correlation and variable filtering. Ann. Statist. 2008;36(4):1649–1668. [Google Scholar]
Zhu L, Miao B, Peng H. On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc. 2006;101:630–643. [Google Scholar]
Zou H. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101(476):1418–1429. [Google Scholar]

[R1] Bondell H, Li L. Shrinkage inverse regression estimation for model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2009;71(1):287–299. [Google Scholar]

[R2] Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122(6):947–956. doi: 10.1016/j.cell.2005.08.020. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R3] Cai J, Xie D, Fan Z, Chipperfield H, Marden J, Wong WH, Zhong S. Modeling co-expression across species for complex traits: insights to the difference of human and mouse embryonic stem cells. PLoS Comput. Biol. 2010;6 doi: 10.1371/journal.pcbi.1000707. e1000707. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R4] Chen C-H, Li K-C. Can SIR be as popular as multiple linear regression? Statistica Sinica. 1998;8:289–316. [Google Scholar]

[R5] Chen X, Xu H, Yuan P, Fang F, M H, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117. doi: 10.1016/j.cell.2008.04.043. [DOI] [PubMed] [Google Scholar]

[R6] Cleveland W. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 1979;74(368):829–836. [Google Scholar]

[R7] Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 2008;5(7):613–619. doi: 10.1038/nmeth.1223. [DOI] [PubMed] [Google Scholar]

[R8] Conlon E, Liu X, Lieb J, Liu J. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences. 2003;100(6):3339–3344. doi: 10.1073/pnas.0630591100. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Cook R. Testing predictor contributions in sufficent dimension reduction. Annals of Statistics. 2004;32(3):1062–1092. [Google Scholar]

[R10] Cook RD. An Introduction to Regression Graphics. New York: Wiley; 1994. [Google Scholar]

[R11] Cook RD, Nachtsheim CJ. Reweighting to achieve elliptically contoured covariates in regression. J. Amer. Statist. Assoc. 1994;89:592–599. [Google Scholar]

[R12] Eaton ML. A characterization of spherical distributions. Journal of Multivariate Analysis. 1986;20:272–276. [Google Scholar]

[R13] Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32(2):407–499. [Google Scholar]

[R14] Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96(456):1348–1360. [Google Scholar]

[R15] Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R16] Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20:101–148. [PMC free article] [PubMed] [Google Scholar]

[R17] Friedman JH. Pathwise coordinate optimization. Annals of applied statistics. 2007;1(2):302–332. [Google Scholar]

[R18] Friedman JH, Tukey JW. A projection pursuit algorithm for explanatory data analysis. IEEE Trans. Comput. 1974;C-23:881–889. [Google Scholar]

[R19] Fung W, He X, Liu L, Shi P. Dimension reduction based on canonical correlation. Statistica Sinica. 2002;12(4):1093–1114. [Google Scholar]

[R20] Hall P, Li K-C. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics. 1993;21:867–889. [Google Scholar]

[R21] Huber PJ. Projection pursuit. The Annals of Statistics. 1985;13:435–475. [Google Scholar]

[R22] Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–1502. doi: 10.1126/science.1141319. [DOI] [PubMed] [Google Scholar]

[R23] Li K-C. Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 1991;86:316–327. [Google Scholar]

[R24] Li L. Sparse sufficient dimension reduction. Biometrika. 2007;94(3):603–613. [Google Scholar]

[R25] Li L, Cook R, Nachtsheim C. Model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2005;67:285–299. [Google Scholar]

[R26] Matsuda T, Nakamura T, Nakao K, Arai T, Katsuki M, Heike T, Yokota T. STAT3 activation is sufficient to maintain an undifferentiated state of mouse embryonic stem cells. EMBO J. 1999;18(15):4261–4269. doi: 10.1093/emboj/18.15.4261. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R27] Miller AJ. Selection of subsets of regression variables. J. Roy. Statist. Soc. Ser. A. 1984;147(3):389–425. [Google Scholar]

[R28] Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5(7):621–628. doi: 10.1038/nmeth.1226. [DOI] [PubMed] [Google Scholar]

[R29] Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–1349. doi: 10.1126/science.1158441. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R30] Ouyang Z, Zhou Q, Wong WH. Chip-seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. USA. 2009;106:21521–21526. doi: 10.1073/pnas.0904863106. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R31] Pardal R, Molofsky AV, He S, Morrison SJ. Stem cell self-renewal and cancer cell proliferation are regulated by common networks that balance the activation of proto-oncogenes and tumor suppressors. Cold Spring Harb Symp Quant Biol. 2005;70:177–185. doi: 10.1101/sqb.2005.70.057. [DOI] [PubMed] [Google Scholar]

[R32] Shao J. An asymptotic theory for linear model selection (with discussion) Statistica Sinica. 1998;7:221–264. [Google Scholar]

[R33] Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B. 1996;58:267–288. [Google Scholar]

[R34] Wang h. Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc. 2009;104:1512–1524. [Google Scholar]

[R35] Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453(7199):1239–U39. doi: 10.1038/nature07002. [DOI] [PubMed] [Google Scholar]

[R36] Zeng P, Zhu Y. An integral transform method for estimating the central mean and central subspaces. Journal of Multivariate Analysis. 2010;101(1):271–290. [Google Scholar]

[R37] Zhong W, Zeng P, Ma P, Liu J, Zhu Y. Regularized sliced inverse regression for motif discovery. Bioinformatics. 2005;21(22):4169–4175. doi: 10.1093/bioinformatics/bti680. [DOI] [PubMed] [Google Scholar]

[R38] Zhou J, He X. Dimension reduction based on constrained cannonical correlation and variable filtering. Ann. Statist. 2008;36(4):1649–1668. [Google Scholar]

[R39] Zhu L, Miao B, Peng H. On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc. 2006;101:630–643. [Google Scholar]

[R40] Zou H. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101(476):1418–1429. [Google Scholar]

PERMALINK

CORRELATION PURSUIT: FORWARD STEPWISE VARIABLE SELECTION FOR INDEX MODELS

Wenxuan Zhong

Tingting Zhang

Yu Zhu

Jun S Liu

Roles

Abstract

1. Introduction

2. Correlation Pursuit for Variable Selection

2.1. Profile Correlation and SIR

2.2. Correlation Pursuit

The COP Algorithm

3. Theoretical Properties

3.1. Asymptotic distributions of test statistics in COP

3.2. Selection consistency of COP

4. Implementation Issues

4.1. Slicing Schemes and the Choice of H

4.2. Choice of ce and cd

4.3. Selection of the number of directions K

5. Simulation Study

5.1. Linear models

Table 1.

5.2. Nonlinear multiple index models

Table 2.

5.3. Heteroscedastic models

Table 3.

6. Application: Predict Gene Expression from Sequence Using Next-generation Sequencing Data

Table 4.

7. Discussion

Acknowledgments

Appendix

A.1 PROOF OF PROPOSITION 2.1

A.1 PROOF OF THEOREM 3.1

A.1 PROOF OF Corollary 3.1

A.3 PROOF OF THEOREM 3.2

A.3 PROOF OF THEOREM 3.3

A.3 PROOF OF THEOREM 3.4

A.3 PROOF OF THEOREM 4.1

A.3 PROOF OF Proposition 4.1

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

4.2. Choice of c_e and c_d