Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2012 Apr 12;74(5):849–870. doi: 10.1111/j.1467-9868.2011.01026.x

CORRELATION PURSUIT: FORWARD STEPWISE VARIABLE SELECTION FOR INDEX MODELS

Wenxuan Zhong 1,*, Tingting Zhang 2, Yu Zhu 3, Jun S Liu 4,*

Abstract

In this article, a stepwise procedure, correlation pursuit (COP), is developed for variable selection under the sufficient dimension reduction framework, in which the response variable Y is influenced by the predictors X1, X2, …, Xp through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, COP does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. The COP procedure selects variables that maximize the correlation between the transformed response and a linear combination of the selected variables. Various asymptotic properties of the COP procedure are established, and in particular, its variable selection performance with a diverging number of predictors and a growing sample size is investigated. The excellent empirical performance of the COP procedure in comparison with existing methods is demonstrated by both extensive simulation studies and a real example in functional genomics.

Keywords: variable selection, projection pursuit regression, sliced inverse regression, stepwise regression, dimension reduction

1. Introduction

Advances in science and technology in the past few decades have led to an explosive growth of high dimensional data across a variety of areas such as genetics, molecular biology, cognitive sciences, environmental sciences, astrophysics, finance, internet commerce, etc. Compared to their dimensionalities, many of the data sets generated in these areas have relatively small sample sizes. Variable (or feature) selection and dimension reduction are more often than not key steps in analyzing these data. Much progress has been made in the past few decades on variable selection for linear models (see Shao 1998 and Fan and Lv 2010 for reviews). In recent years, shrinkage-based procedures for simultaneously estimating regression coefficients and selecting predictors have been particularly attractive to researchers, and many promising algorithms such as LASSO (Tibshirani 1996, Zou 2006, Friedman 2007), least angle regression (LARS) (Efron et al. 2004), SCAD (Fan and Li 2001), etc., have been invented.

Let Y ∈ ℝ be a univariate response variable and X = (X1, X2, ⋯, Xp)′ ∈ ℝp a vector of p continuous predictor variables. Throughout this article, we consider the following sufficient dimension reduction (SDR) model framework as pioneered by Li (1991) and Cook (1994). Let β1, β2, ⋯, βK be p-dimensional vectors with βi = (β1i, β2i, ⋯, βpi)′ for 1 ≤ i ≤ K. The SDR model assumes that Y and X are independent conditional on β1′X, β2′X, ⋯, βK′X, i.e.,

$$Y \perp X \mid B'X, \qquad (1)$$

where “⊥” means “independent of” and B = (β1, β2, ⋯, βK). Expression (1) implies that all the information X contains about Y is contained in the K projections β1′X, ⋯, βK′X. A predictor variable Xj (1 ≤ j ≤ p) is said to be relevant if there exists at least one i (1 ≤ i ≤ K) such that βji ≠ 0. Let L be the number of relevant predictor variables. When there are a large number of predictors (i.e., p is large), it is usually safe to impose the sparsity assumption, which states that only a small subset of the predictors influence Y and the others are irrelevant. In the SDR model, this assumption means that both K and L are small relative to p.

In his seminal paper on dimension reduction, Li (1991) proposed a seemingly different model of the form:

$$Y = f(\beta_1'X, \beta_2'X, \ldots, \beta_K'X, \varepsilon), \qquad (2)$$

where f is an unknown (K + 1)-variate link function and ε is a stochastic error independent of X. It has been shown that the two models (1) and (2) are in fact equivalent (Zeng and Zhu 2010). We henceforth always refer to β1, β2, ⋯, βK as the SDR directions and the space spanned by these directions as an SDR subspace. In general, SDR subspaces are not unique. To resolve this ambiguity, Cook (1994) introduced the concept of central subspace, which is the intersection of all possible SDR subspaces and is an SDR subspace itself, and showed that the central subspace is well defined and unique under some general conditions. We denote the central subspace by 𝒮(B) and assume its existence throughout this article.

Various methods have been developed for estimating β1, ⋯, βK in the literature on SDR. One particular family of methods utilizes inverse regression, which regresses X on Y. The sliced inverse regression (SIR) method proposed by Li (1991) is the forerunner of this family of methods. Recognizing that estimation of the SDR directions does not automatically lead to variable selection, Cook (2004) derived various χ2 tests for assessing the contribution of predictor variables to the SDR directions. Based on these tests, Li et al. (2005) proposed a backward subset selection method for selecting significant predictors. Following the recent trend of using the L1 or L2 penalty for variable selection, Zhong et al. (2005) proposed to regularize the sample covariance matrix of the predictor variables in SIR and developed a procedure called RSIR for variable selection. Li (2007) proposed Sparse SIR (SSIR) to obtain shrinkage estimates of the SDR directions. Bondell and Li (2009) further adopted the nonnegative garrote method for estimating the SDR directions and showed that the resulting method is consistent in variable selection.

The majority of the aforementioned methods take a two-step approach to variable selection under the SDR model. The first step is to perform dimension reduction, that is, to estimate the SDR directions; and the second step is to select the relevant variables using statistical testing or shrinkage methods. Because these methods need to estimate the covariance and conditional covariance matrices of X, both of which are of dimensions p × p, the effectiveness and robustness of the two-step approach are questionable when p is large relative to n. Zhu et al. (2006) have shown that the estimation accuracy of SDR directions deteriorates as p increases. In other words, the more irrelevant variables there are, the more likely a method fails to estimate the SDR directions accurately, and the less likely the method identifies the true relevant predictor variables.

In this article, we propose correlation pursuit (COP), a stepwise procedure for simultaneous dimension reduction and variable selection under the SDR model. Similar to projection pursuit (Friedman and Tukey 1974, Huber 1985), COP defines a projection function to measure the correlation between the transformed response and the projections of X and pursues a subset of explanatory variables that maximizes the projection function. It starts with a randomly selected subset and iterates between finding an explanatory variable (predictor) that significantly improves the current projection function to add to the subset and finding an insignificant predictor to remove from the subset. During each iteration step, COP needs only to consider the predictors currently in the subset and one more predictor outside the subset. Therefore, COP can avoid the estimation and inversion of p × p covariance and conditional covariance matrices of X and mitigate the curse of dimensionality. Furthermore, COP performs dimension reduction and variable selection simultaneously, so that the two tasks can mutually enhance each other. Our theoretical investigations as well as simulation studies show that COP is a promising tool for dimension reduction and variable selection in high dimensional data analysis.

The rest of the article is organized as follows. In Section 2, we give a brief introduction to SIR, following a correlation interpretation of SIR provided by Chen and Li (1998). This interpretation was also used in Fung et al. (2002) and Zhou and He (2008) for dimension reduction via canonical correlation. In the same section, we describe the COP procedure and derive various test statistics used by the procedure. The asymptotic behavior of the COP procedure is discussed in Section 3. Several implementation issues of the procedure are discussed in Section 4. Simulation and real data examples are reported in Sections 5 and 6, respectively. Additional remarks in Section 7 conclude the article. An abbreviated version of the proofs of the theorems is provided in the Appendix.

2. Correlation Pursuit for Variable Selection

2.1. Profile Correlation and SIR

Let η be an arbitrary direction in ℝp. We define the profile correlation between Y and η′X, denoted by P(η), as:

$$P(\eta) = \max_{T}\, \operatorname{corr}\big(T(Y), \eta'X\big), \qquad (3)$$

where the maximization is taken over all possible transformations of Y, including non-monotone ones. The profile correlation P(η) reflects the largest possible correlation between a transformed response T(Y) and the projection η′X. Let η1 be the direction that maximizes P(η) subject to η′Ση = 1, that is, η1 = argmax_{η′Ση=1} P(η). We refer to η1 as the first principal direction for the profile correlation between Y and X and call P(η1) the first profile correlation. Direction η1, or its projection η1′X, may not entirely characterize the dependency between Y and X. Using P(η) as the projection function again, we can seek a second direction, denoted by η2, which is uncorrelated with η1′X, i.e., η2′Ση1 = 0, and maximizes P(η), that is, η2 = argmax_{η: η′Ση1=0} P(η). We refer to η2 as the second principal direction and P(η2) as the second profile correlation. This procedure can be continued until no more directions can be found that are orthogonal to the obtained directions and have a nonzero profile correlation with Y. Suppose ℓ principal directions exist between Y and X, namely η1, η2, ⋯, ηℓ, with the corresponding profile correlations P(η1) ≥ P(η2) ≥ ⋯ ≥ P(ηℓ) > 0. We need to impose the following condition to establish the connection between the principal directions and the SDR directions under the SDR model.

Condition 1 (Linearity Condition): For any η in ℝp, E(η′X | B′X) is linear in B′X, where B is as defined in equation (1).

Proposition 2.1. Under the SDR model and the Linearity Condition, the principal directions η1, η2, …, ηℓ are in the central subspace 𝒮(B).

To make this article self-contained, we have included the proof of Proposition 2.1 in the Appendix. Based on the proposition, the principal directions are indeed SDR directions. In general, ℓ ≤ K. When the link function f is symmetric along a direction, using correlation alone may fail to recover this direction. For example, if Y = X1² + ε, the profile correlation between Y and X1 will always be zero. To exclude this possibility, we follow the convention in the SDR literature and impose the following condition.

Condition 2 (Coverage Condition): The number of principal directions of profile correlation is equal to the dimensionality of the central subspace, that is, ℓ = K.

Under both the linearity and coverage conditions, the principal directions η1, η2, ⋯, ηK form a special basis of the central subspace 𝒮(B), that is, 𝒮(B) = span(η1, η2, …, ηK). This basis is uniquely defined and is the estimation target of SIR. In the rest of the article, for ease of discussion, we use β1, β2, …, βK and η1, η2, …, ηK interchangeably.

Chen and Li (1998) showed that, at the population level, there exists an explicit solution for the principal directions. In the proof of their Theorem 3.1, Chen and Li (1998) derived that

$$P^2(\eta) = \frac{\eta'\,\mathrm{var}[E(X\mid Y)]\,\eta}{\eta'\Sigma\eta} \triangleq \frac{\eta' M \eta}{\eta'\Sigma\eta}, \qquad (4)$$

where M ≜ var[E(X|Y)] is the covariance matrix of the expectation of X given Y. Furthermore, the principal directions of profile correlation are the solutions of the following eigenvalue decomposition problem:

$$M\upsilon_i = \lambda_i \Sigma \upsilon_i, \quad \upsilon_i'\Sigma\upsilon_i = 1, \quad \text{for } i = 1, 2, \ldots, K; \qquad (5)$$
$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K > 0. \qquad (6)$$

The principal directions η1, η2, …, ηK are the first K eigenvectors of Σ−1M, and their corresponding eigenvalues are exactly the squared profile correlations, that is, P²(ηi) = λi for i = 1, 2, …, K.

Given independent observations {(xi, yi)}i=1, ⋯, n of (X, Y), where xi = (xi1, ⋯, xip)′, Σ can be estimated by the sample covariance matrix,

$$\hat\Sigma = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})', \qquad (7)$$

where x̄ is the sample mean of {xi}. Li (1991) proposed the following SIR procedure to estimate M. First, the range of {yi}_{i=1}^{n} is divided into H disjoint intervals, denoted as S1, …, SH. For h = 1, …, H, the mean vector x̄h = nh^{-1} Σ_{yi∈Sh} xi is calculated, where nh is the number of yi's in Sh. Then, M is estimated by

$$\hat{M} = \frac{1}{n}\sum_{h=1}^{H} n_h\,(\bar{x}_h - \bar{x})(\bar{x}_h - \bar{x})', \qquad (8)$$

and the matrix Σ−1M is estimated by Σ̂−1M̂. The first K eigenvectors of Σ̂−1M̂, denoted by η̂1, η̂2, …, η̂K, are used to estimate the first K eigenvectors of Σ−1M or, equivalently, the principal directions η1, η2, …, ηK, respectively. The first K eigenvalues of Σ̂−1M̂, denoted by λ̂1, λ̂2, …, λ̂K, are used to estimate the eigenvalues of Σ−1M or, equivalently, the squared profile correlations λ1, λ2, …, λK, respectively.
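
The slicing estimates above are straightforward to code. The following R sketch is purely illustrative (the function name sir_lambda and its arguments are ours, not part of the authors' COP package); it computes Σ̂, M̂, and the leading eigenvalues and eigenvectors of Σ̂−1M̂ from a numeric predictor matrix x and response vector y.

```r
# A minimal sketch of the SIR estimates (7)-(8), assuming a numeric n x p
# predictor matrix x and a numeric response y; the function name sir_lambda
# and its arguments are illustrative, not taken from the authors' COP package.
sir_lambda <- function(x, y, H = 10, K = 2) {
  n <- nrow(x)
  # assign observations to H slices of roughly equal size along the ordered response
  slice <- cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)
  xbar  <- colMeans(x)
  Sigma <- cov(x)                                  # Sigma-hat, equation (7)
  M <- matrix(0, ncol(x), ncol(x))
  for (h in seq_len(H)) {                          # M-hat, equation (8)
    idx <- which(slice == h)
    d   <- colMeans(x[idx, , drop = FALSE]) - xbar
    M   <- M + (length(idx) / n) * tcrossprod(d)
  }
  ev <- eigen(solve(Sigma, M))                     # eigen-decomposition of Sigma-hat^{-1} M-hat
  list(lambda = Re(ev$values)[1:K],                # estimated squared profile correlations
       eta    = Re(ev$vectors)[, 1:K, drop = FALSE])  # estimated directions
}
```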

2.2. Correlation Pursuit

The SIR method needs to estimate the two p × p covariance matrices Σ and M, and to obtain the eigenvalue decomposition of Σ̂−1M̂. When a large number of irrelevant variables are present and the sample size n is relatively small, Σ̂ and M̂ become unstable, which leads to very inaccurate estimates of the principal directions η̂1, η̂2, …, η̂K (Zhu et al. 2006). As a consequence, those shrinkage-based variable selection methods that rely on η̂1, η̂2, …, η̂K often perform poorly for the SDR model when p is large. We here propose a stepwise SIR-based procedure for simultaneous dimension reduction (i.e., estimating the principal directions) and variable selection (i.e., identifying true predictors). Our procedure starts with a collection of randomly selected predictors and iterates between an addition step, which selects and adds a predictor to the collection, and a deletion step, which selects and deletes a predictor from the collection. The procedure terminates when no new addition or deletion occurs.

The Addition Step. Let 𝒜 denote the collection of the indices of the selected predictors and X𝒜 the collection of the selected variables. Applying SIR to the data involving only the predictors in X𝒜, we obtain the estimated squared profile correlations λ̂1^{𝒜}, λ̂2^{𝒜}, …, λ̂K^{𝒜}. The superscript 𝒜 indicates that the estimated squared profile correlations depend on the current subset of selected predictors. Let Xt be an arbitrary predictor outside 𝒜 and 𝒜 + t = 𝒜 ⋃ {t}. Applying SIR to the data involving the predictors in 𝒜 + t, we obtain the estimated squared profile correlations λ̂1^{𝒜+t}, λ̂2^{𝒜+t}, …, λ̂K^{𝒜+t}. Because 𝒜 ⊂ 𝒜 + t, it is easy to see that λ̂1^{𝒜} ≤ λ̂1^{𝒜+t}. The difference λ̂1^{𝒜+t} − λ̂1^{𝒜} reflects the amount of improvement in the first profile correlation due to the incorporation of Xt. We standardize this difference and use the resulting test statistic

$$\mathrm{COP}_1^{\mathcal{A}+t} = \frac{n\big(\hat\lambda_1^{\mathcal{A}+t} - \hat\lambda_1^{\mathcal{A}}\big)}{1 - \hat\lambda_1^{\mathcal{A}+t}}, \qquad (9)$$

to assess the significance of adding Xt to 𝒜 in improving the first profile correlation. Similarly, the contributions of adding Xt to the other profile correlations can be assessed by

$$\mathrm{COP}_i^{\mathcal{A}+t} = \frac{n\big(\hat\lambda_i^{\mathcal{A}+t} - \hat\lambda_i^{\mathcal{A}}\big)}{1 - \hat\lambda_i^{\mathcal{A}+t}}, \qquad (10)$$

for 2 ≤ i ≤ K. The overall contribution of adding Xt to the improvement in all the K profile correlations can be assessed by combining the statistics COP_i^{𝒜+t} into one single test statistic

$$\mathrm{COP}_{1:K}^{\mathcal{A}+t} = \sum_{i=1}^{K}\mathrm{COP}_i^{\mathcal{A}+t}. \qquad (11)$$

We further define that

$$\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}} = \max_{t \in \mathcal{A}^c}\mathrm{COP}_{1:K}^{\mathcal{A}+t}. \qquad (12)$$

Let X_{t*} be a predictor that attains $\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, that is, $\mathrm{COP}_{1:K}^{\mathcal{A}+t^*} = \overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, and let ce be a pre-specified threshold (details about its choice are deferred to the next two sections). Then, if $\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}} > c_e$, we add X_{t*} to 𝒜; otherwise, we do not add any variable.
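
As an illustration, the addition-step statistics (9)-(12) can be computed by looping the SIR fit over candidate predictors; the sketch below reuses the hypothetical sir_lambda() helper introduced earlier and is not the authors' implementation.

```r
# Illustrative computation of the addition-step statistics (9)-(12), reusing
# the hypothetical sir_lambda() sketch above.
cop_add <- function(x, y, A, K, H = 10) {
  n     <- nrow(x)
  lam_A <- sir_lambda(x[, A, drop = FALSE], y, H, K)$lambda
  cand  <- setdiff(seq_len(ncol(x)), A)
  stat  <- sapply(cand, function(t) {
    lam_At <- sir_lambda(x[, c(A, t), drop = FALSE], y, H, K)$lambda
    sum(n * (lam_At - lam_A) / (1 - lam_At))          # COP_{1:K}^{A+t}, equations (9)-(11)
  })
  list(t = cand[which.max(stat)], value = max(stat))  # the maximizer and the statistic in (12)
}
```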

The Deletion Step. Let Xt be an arbitrary predictor in 𝒜 and define 𝒜 − t = 𝒜 − {t}. Let λ̂1^{𝒜−t}, λ̂2^{𝒜−t}, …, λ̂K^{𝒜−t} be the estimated squared profile correlations based on the data involving the predictors in 𝒜 − t only. The impact of deleting Xt from 𝒜 on the ith squared profile correlation can be measured by

$$\mathrm{COP}_i^{\mathcal{A}-t} = \frac{n\big(\hat\lambda_i^{\mathcal{A}} - \hat\lambda_i^{\mathcal{A}-t}\big)}{1 - \hat\lambda_i^{\mathcal{A}}}, \qquad (13)$$

for 1 ≤ i ≤ K. The overall impact of deleting Xt is measured by

$$\mathrm{COP}_{1:K}^{\mathcal{A}-t} = \sum_{i=1}^{K}\mathrm{COP}_i^{\mathcal{A}-t}, \qquad (14)$$

and the least impact from deleting one predictor from 𝒜 is then defined to be

$$\underline{\mathrm{COP}}_{1:K}^{\mathcal{A}} = \min_{t \in \mathcal{A}}\mathrm{COP}_{1:K}^{\mathcal{A}-t}. \qquad (15)$$

Let X_{t*} be a predictor that achieves $\underline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, and let cd be a pre-specified threshold for deletion. If $\underline{\mathrm{COP}}_{1:K}^{\mathcal{A}} < c_d$, we delete X_{t*} from 𝒜; otherwise, no deletion happens.

The asymptotic distributions of the proposed statistics and the selection of the thresholds will be discussed in the next two sections. Because the described procedure aims to find predictors that most significantly improve the profile correlations between Y and X, we call it the correlation pursuit (COP) procedure. The COP algorithm is summarized in pseudo-code below.

The COP Algorithm

  1. Set the number of principal directions K and the threshold values ce and cd.

  2. Randomly select K + 1 variables as the initial collection of selected variables 𝒜.

  3. Iterate until no more addition or deletion of predictors can be performed:
    • The addition step:
      • Find t* such that $\mathrm{COP}_{1:K}^{\mathcal{A}+t^*} = \overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$ in (12);
      • If $\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}} > c_e$, add X_{t*} to 𝒜, that is, let 𝒜 = 𝒜 + t*;
    • The deletion step:
      • Find t* such that $\mathrm{COP}_{1:K}^{\mathcal{A}-t^*} = \underline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$ in (15);
      • If $\underline{\mathrm{COP}}_{1:K}^{\mathcal{A}} < c_d$, delete X_{t*} from 𝒜, that is, let 𝒜 = 𝒜 − t*;
  4. Output 𝒜.
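
Putting the pieces together, a schematic R version of the algorithm is sketched below. It relies on the hypothetical sir_lambda() and cop_add() helpers above, defines the analogous deletion helper from (13)-(15), and adds a simple size guard (keeping at least K + 1 selected variables) that is our own choice rather than part of the original description.

```r
# Schematic version of the COP algorithm; illustrative only, not the authors' package.
cop_del <- function(x, y, A, K, H = 10) {
  n     <- nrow(x)
  lam_A <- sir_lambda(x[, A, drop = FALSE], y, H, K)$lambda
  stat  <- sapply(A, function(t) {
    lam_At <- sir_lambda(x[, setdiff(A, t), drop = FALSE], y, H, K)$lambda
    sum(n * (lam_A - lam_At) / (1 - lam_A))        # COP_{1:K}^{A-t}, equations (13)-(14)
  })
  list(t = A[which.min(stat)], value = min(stat))  # the minimizer and the statistic in (15)
}

cop <- function(x, y, K, ce, cd, H = 10) {
  A <- sample(ncol(x), K + 1)                      # step 2: random initial subset
  repeat {
    changed <- FALSE
    add <- cop_add(x, y, A, K, H)                  # addition step
    if (add$value > ce) { A <- c(A, add$t); changed <- TRUE }
    if (length(A) > K + 1) {                       # size guard (our choice)
      del <- cop_del(x, y, A, K, H)                # deletion step
      if (del$value < cd) { A <- setdiff(A, del$t); changed <- TRUE }
    }
    if (!changed) break                            # step 3: stop when nothing changes
  }
  sort(A)                                          # step 4: output the selected indices
}
```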

3. Theoretical Properties

3.1. Asymptotic distributions of test statistics in COP

Let us first consider an addition step. We assume that SIR uses a fixed slicing scheme relative to the number of observations n, that is, the slices S1, S2, ⋯, SH are fixed (defined by the range of the response variable) but the number of observations in each slice goes to infinity. Let Xt be an arbitrary predictor in 𝒜c. Under the null hypothesis H0 that all the predictors in 𝒜c are irrelevant, we have ηt1 = ηt2 = ⋯ = ηtK = 0. Recall that the statistics proposed to measure the contributions of Xt to the K profile correlations are $(\mathrm{COP}_1^{\mathcal{A}+t}, \mathrm{COP}_2^{\mathcal{A}+t}, \ldots, \mathrm{COP}_K^{\mathcal{A}+t})$, and the overall contribution of Xt is measured by $\mathrm{COP}_{1:K}^{\mathcal{A}+t}$. To establish the asymptotic distributions of these statistics, we need to impose a condition on the conditional expectation of Xt given X𝒜.

Condition 3 (Regression Condition): E(Xt | X𝒜) is linear in X𝒜.

Theorem 3.1. Assume that Conditions 1 and 2 hold, Condition 3 holds for (X𝒜, Xt) for any Xt ∈ X_{𝒜c}, and the squared profile correlations λ1, λ2, …, λK are positive and different from each other. Then, for any given fixed slicing scheme, under the null hypothesis H0 that all the predictors in 𝒜c are irrelevant, we have that

$$\big(\mathrm{COP}_1^{\mathcal{A}+t}, \mathrm{COP}_2^{\mathcal{A}+t}, \ldots, \mathrm{COP}_K^{\mathcal{A}+t}\big) \longrightarrow \big(Z_{1t}^2, Z_{2t}^2, \ldots, Z_{Kt}^2\big) \qquad (16)$$

in distribution and

$$\mathrm{COP}_{1:K}^{\mathcal{A}+t} \longrightarrow \sum_{l=1}^{K} Z_{lt}^2 \qquad (17)$$

in distribution as n → ∞. Here, (Z1t, Z2t, ⋯, ZKt) follows the multivariate normal distribution with mean zero and covariance matrix WKt. The explicit expression of WKt is given in the Appendix.

The asymptotic distributions in Theorem 3.1 can be much simplified if we impose the following condition on the conditional variance of Xt given X𝒜.

Condition 4 (Constant Variance Condition): E[(Xt − E(Xt | X𝒜))² | X𝒜] is a constant.

Corollary 3.1. Assume that Conditions 1 and 2 hold, Conditions 3 and 4 hold for (X𝒜, Xt) for Xt ∈ X_{𝒜c}, and the squared profile correlations λ1, λ2, …, λK are positive and different from each other. Then, for any given fixed slicing scheme, under the null hypothesis H0 that all the predictors in 𝒜c are irrelevant, the statistics $\mathrm{COP}_1^{\mathcal{A}+t}, \mathrm{COP}_2^{\mathcal{A}+t}, \ldots, \mathrm{COP}_K^{\mathcal{A}+t}$ are asymptotically independent and identically distributed as χ²(1), and $\mathrm{COP}_{1:K}^{\mathcal{A}+t}$ is asymptotically χ²(K).
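
Corollary 3.1 suggests using upper χ²(K) quantiles as reference thresholds, which is how the grids in Section 4.2 are built; for instance, in R (values illustrative only):

```r
# Reference thresholds from the chi-square(K) null of Corollary 3.1
K  <- 2
ce <- qchisq(0.99, df = K)   # addition threshold
cd <- qchisq(0.95, df = K)   # deletion threshold, chosen smaller than ce
```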

Theorem 3.1 and Corollary 3.1 characterize the asymptotic behavior of the test statistics for an arbitrary Xt in 𝒜c. In the COP procedure, however, the predictor that attains the maximum value of $\mathrm{COP}_{1:K}^{\mathcal{A}+t}$ among t ∈ 𝒜c, which is $\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, is considered a candidate predictor to enter 𝒜. Our next theorem characterizes the joint asymptotic behavior of $\{\mathrm{COP}_{1:K}^{\mathcal{A}+t}\}_{t\in\mathcal{A}^c}$ as well as that of $\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$.

Note that the linearity, regression, and constant variance conditions together are more general than the normality assumption on X because they only need to hold for the basis of the central subspace (e.g., B or η1, ⋯, ηK) and a given subset of predictors (e.g., 𝒜). If we require that the conditions hold for any projection and any given subset of the predictors, however, then it is equivalent to requiring that X follows a multivariate normal distribution. In order to understand the joint behavior of all the COP statistics, in what follows we impose the normality assumption on X.

Let 𝒜 = {t_j}_{j=1}^{d} and 𝒜c = {t_j}_{j=d+1}^{p} denote the collection of currently selected predictors and its complement, respectively. Let Σ𝒜 = Cov(X𝒜), Σ𝒜c = Cov(X𝒜c), Σ𝒜𝒜c = Cov(X𝒜, X𝒜c), and Σ̃𝒜c = Σ𝒜c − Σ𝒜c𝒜 Σ𝒜^{-1} Σ𝒜𝒜c. Note that Σ𝒜c𝒜 = Σ𝒜𝒜c′. Let ã = (ã1, ã2, …, ã_{p−d})′ be the vector of the diagonal elements of Σ̃𝒜c. Define D𝒜c = diag(ã1, ã2, …, ã_{p−d}), and define U𝒜c = D𝒜c^{-1/2} Σ̃𝒜c D𝒜c^{-1/2}.

Theorem 3.2. Assume that (a) X follows a multivariate normal distribution; (b) the coverage condition holds; and (c) the squared profile correlations λ1, λ2, ⋯, λK are nonzero and different from each other. Then, for any fixed slicing scheme, under the null hypothesis H0 that all the predictors in 𝒜c are irrelevant, we have

$$\big(\mathrm{COP}_{1:K}^{\mathcal{A}+t_{d+1}}, \mathrm{COP}_{1:K}^{\mathcal{A}+t_{d+2}}, \ldots, \mathrm{COP}_{1:K}^{\mathcal{A}+t_{p}}\big) \xrightarrow{D} \Big(\sum_{k=1}^{K} z_{k,d+1}^2, \ldots, \sum_{k=1}^{K} z_{k,p}^2\Big), \qquad (18)$$

and

$$\overline{\mathrm{COP}}_{1:K}^{\mathcal{A}} \xrightarrow{D} \max_{t\in\mathcal{A}^c}\sum_{k=1}^{K} z_{k,t}^2 \qquad (19)$$

as n goes to ∞. Here zk = (zk,d+1, ⋯, zk,p)′ for k = 1, ⋯, K are mutually independent, and each zk follows a multivariate normal distribution with mean zero and covariance matrix U𝒜c.
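
Since the limit in Theorem 3.2 is the maximum over predictors of sums of K squared correlated normals, its quantiles can be approximated by direct simulation once U𝒜c is available. A hedged R sketch (the function name and example usage are ours):

```r
# Monte Carlo approximation of the limiting null distribution of the maximum in (19),
# given K and the (p - d) x (p - d) correlation matrix U of Theorem 3.2.
max_cop_null <- function(U, K, nsim = 10000) {
  R <- chol(U)                                            # U = t(R) %*% R
  replicate(nsim, {
    z <- t(R) %*% matrix(rnorm(K * nrow(U)), nrow(U), K)  # columns z_1, ..., z_K ~ N(0, U)
    max(rowSums(z^2))                                     # max over t of sum_k z_{k,t}^2
  })
}
# e.g. an approximate 99% null quantile:  quantile(max_cop_null(U, K = 2), 0.99)
```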

We now consider the deletion steps of the COP procedure. Let 𝒜 denote the current collection of selected predictors before a deletion step, and let Xt be an arbitrary predictor in 𝒜. Note that $\mathrm{COP}_k^{\mathcal{A}-t} = \mathrm{COP}_k^{\tilde{\mathcal{A}}+t}$, where 𝒜̃ = 𝒜 − t, for 1 ≤ k ≤ K. Therefore, results similar to those stated in Theorem 3.1 and Corollary 3.1 can be obtained for $(\mathrm{COP}_1^{\mathcal{A}-t}, \mathrm{COP}_2^{\mathcal{A}-t}, \ldots, \mathrm{COP}_K^{\mathcal{A}-t})$ and $\mathrm{COP}_{1:K}^{\mathcal{A}-t}$ after some modifications described below. First, the current "null hypothesis", denoted as H0t, is that Xt and the predictors in 𝒜c are irrelevant. Second, the regression and constant variance conditions need to be imposed on the conditional expectation of Xt given X𝒜̃ instead. The asymptotic distribution of $\underline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, however, turns out to be fairly complicated, if not entirely elusive, because there does not exist a common null hypothesis for all Xt ∈ 𝒜. In what follows, we establish two strong results that have implications for properly selecting the thresholds ce and cd, as well as for the consistency of the COP procedure in selecting true predictors.

3.2. Selection consistency of COP

Let 𝒯 be the collection of the true predictors under the SDR model. The principal profile correlation directions are η1, η2, ⋯, ηK, which form a basis of the central subspace. Assume that S1, ⋯, SH is a fixed slicing scheme used by SIR.

Let ph = P(Y ∈ Sh), vK = (η1′X − E(η1′X), ⋯, ηK′X − E(ηK′X))′, and

$$M_{H,K} = \sum_{h=1}^{H} p_h\, L_{h,K}\, L_{h,K}', \qquad (20)$$

where Lh,K = E(vK | Y ∈ Sh). A few more conditions are needed for the results stated in the next two theorems.

Condition 5. X follows a multivariate normal distribution with covariance matrix Σ such that τmin ≤ λmin(Σ) ≤ λmax(Σ) ≤ τmax, where τmin and τmax are two positive constants, and λmin(·) and λmax(·) are the minimum and maximum eigenvalues of a matrix, respectively.

Condition 6. There exists a constant ωH > 0 such that λmin(MH,K) > ωH.

Condition 7. There exist constants σ0² and υ > 0 such that, for any slice Sh and any two predictors Xi and Xj, Var(Xj | Y ∈ Sh) ≤ σ0² and Var(XiXj | Y ∈ Sh) ≤ σ0² for all i, j = 1, ⋯, p and h = 1, ⋯, H. In addition, for l ≥ 2,

$$E\big(|X_j|^l \mid Y\in S_h\big) \le \frac{l!}{2}\,\mathrm{Var}\big(X_j \mid Y\in S_h\big)\,\upsilon^{\,l-2} \quad\text{and}\quad E\big(|X_iX_j|^l \mid Y\in S_h\big) \le \frac{l!}{2}\,\mathrm{Var}\big(X_iX_j \mid Y\in S_h\big)\,\upsilon^{\,l-2}.$$

Condition 8. Let ηj = (ηj1, ηj2, ⋯, ηjK)′, in which ηjk is the coefficient of Xj in the kth principal correlation direction ηk. There exist a positive constant ϖ and a nonnegative constant ξ0 such that ‖ηj‖² > ϖ · n^{−ξ0} for j ∈ 𝒯, where ‖·‖ denotes the standard L2-norm.

Condition 9. lim_{n→∞} p = ∞ and p = o(n^{ϱ0}) with ϱ0 ≥ 0 and 2ϱ0 + 2ξ0 < 1.

Condition 5 ensures that the variances of the predictors are on a comparable scale and that they are not strongly correlated. Condition 6 assumes a lower bound for the eigenvalues of MH,K, which is slightly stronger than the coverage condition and ensures that SIR can recover all the SDR directions. Condition 7 imposes conditions on the moments of the conditional distributions of X given Y ∈ Sh so that Bernstein-type inequalities hold for the conditional sample means. Condition 8 assumes that the coefficients of the true predictors do not decrease to zero too fast as both n and p increase; otherwise, such predictors will not be identifiable asymptotically. Condition 9 allows p to increase as n increases, but constrains their relative rates. Similar conditions have been used by others for establishing variable selection results for stepwise procedures in linear regression (Wang 2009, Fan and Lv 2008).

Theorem 3.3. Let 𝒜 be the set of currently selected predictors and let 𝒯 be the set of true predictors. Let $\vartheta = \frac{\varpi\,\omega_H\,\tau_{\min}^2}{2\tau_{\max}}$. Assume that Conditions 5–9 hold. Then, we have

$$P\Big(\min_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}\neq\emptyset}\ \max_{t\in\mathcal{A}^c\cap\mathcal{T}}\ \mathrm{COP}_{1:K}^{\mathcal{A}+t} \ge \vartheta\, n^{1-\xi_0}\Big) \to 1, \qquad (21)$$

for any fixed slicing scheme as n goes to ∞.

The probability statement in (21) is not just about one given collection of predictors. It considers all possible collections that do not yet include all the true predictors, that is, {𝒜 : 𝒜c ⋂ 𝒯 ≠ ∅}. In other words, it considers all possible scenarios in which the null hypothesis H0 is not true. Further note that $\max_{t\in\mathcal{A}^c\cap\mathcal{T}} \mathrm{COP}_{1:K}^{\mathcal{A}+t} \le \overline{\mathrm{COP}}_{1:K}^{\mathcal{A}}$, because $\max_{t\in\mathcal{A}^c} \mathrm{COP}_{1:K}^{\mathcal{A}+t} \ge \max_{t\in\mathcal{A}^c\cap\mathcal{T}} \mathrm{COP}_{1:K}^{\mathcal{A}+t}$. From equation (21), we therefore have

$$P\Big(\min_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}\neq\emptyset}\ \overline{\mathrm{COP}}_{1:K}^{\mathcal{A}} \ge \vartheta\, n^{1-\xi_0}\Big) \to 1. \qquad (22)$$

This result implies that, by setting ce to ϑn^{1−ξ0} or smaller, if the COP procedure has not yet collected all the true predictors, then with probability going to 1 (as n goes to ∞) it will continue to add a predictor to the current collection. Thus, the addition step of COP will not stop until all the true predictors are selected. Another way to interpret (22) is that the selection power of the COP procedure converges to 1 asymptotically.

Theorem 3.4. Assume that Conditions 5–9 hold. Then we have

$$P\Big(\max_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}=\emptyset}\ \max_{t\in\mathcal{A}^c}\ \mathrm{COP}_{1:K}^{\mathcal{A}+t} < C\, n^{\varrho}\Big) \to 1, \qquad (23)$$

for ϱ > 1/2 + ϱ0, and any positive constant C, under any fixed slicing scheme with n going to ∞.

Theorem 3.4 has two implications. The first one regards the addition step of COP. Once all the true predictors are selected, that is, 𝒜c ⋂ 𝒯 = ∅, the probability that it will select a false predictor from 𝒜c converges to zero. The second implication concerns the deletion step. Consider one collection of selected predictors 𝒜̃ and assume that 𝒜̃ contains all the true predictors and also some irrelevant ones, that is, 𝒜̃ ⊃ 𝒯. Clearly,

$$\underline{\mathrm{COP}}_{1:K}^{\tilde{\mathcal{A}}} \le \min_{t\in\tilde{\mathcal{A}}\setminus\mathcal{T}} \mathrm{COP}_{1:K}^{\tilde{\mathcal{A}}-t} \le \max_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}=\emptyset}\ \max_{t\in\mathcal{A}^c}\sum_{k=1}^{K}\mathrm{COP}_k^{\mathcal{A}+t}. \qquad (24)$$

Therefore,

$$P\big(\underline{\mathrm{COP}}_{1:K}^{\tilde{\mathcal{A}}} < C\, n^{\varrho}\big) \to 1. \qquad (25)$$

In other words, with probability going to 1, the COP procedure will delete an irrelevant predictor from the current collection.

One possible choice of the thresholds is ce = ϑn^{1−ξ0} and cd = ϑ·n^{1−ξ0}/2. From Theorem 3.3, asymptotically, the COP algorithm will not stop selecting variables until all the true predictors are included. Moreover, once all the true predictors are included, according to Theorem 3.4, all the redundant variables will be removed from the selected set.

4. Implementation Issues

When implementing the COP algorithm, one needs to specify the number of profile correlation directions K, the thresholds ce and cd for the addition and deletion steps, and the slicing scheme, in particular the number of slices H. A proper specification of these tuning parameters is critical for the success of the COP algorithm.

4.1. Slicing Schemes and the Choice of H

Li (1991) suggested that in terms of estimation, the performance of SIR is robust to the number of slices in general. The COP algorithm uses SIR to derive test statistics for selecting variables. It is of interest to understand the impact of a slicing scheme on the involved testing procedures. Again, we consider an addition step in the COP procedure. Let 𝒜 be the current collection of selected predictors. Let Xt be an arbitrary predictor in 𝒜c.

Theorem 4.1. Assume that X follows a multivariate normal distribution. Then, for any given fixed slicing scheme, we have

$$P\big(\mathrm{COP}_{1:K}^{\mathcal{A}+t} \ge n\, C_{H,\mathcal{A}+t}\big) \to 1, \quad \text{as } n \to \infty, \qquad (26)$$

where

$$C_{H,\mathcal{A}+t} = \tilde{\eta}_{t,\mathcal{A}}'\, M_{H,K}\, \tilde{\eta}_{t,\mathcal{A}}\,/\,\sigma_{t,\mathcal{A}}^2, \qquad (27)$$

where σ_{t,𝒜}² = Var(Xt | X𝒜), η̃_{t,𝒜} = Cov(Xt, vK | X𝒜), and MH,K is defined in equation (20).

The difference between Theorem 3.1 and Theorem 4.1 is that the latter does not assume that Xt is an irrelevant predictor. When Xt is indeed a true predictor, ηt is not a zero vector and max_{t∈𝒜c⋂𝒯} C_{H,𝒜+t} is greater than zero. The larger C_{H,𝒜+t} is, the more likely Xt will be added to 𝒜. The next result shows that a finer slicing scheme leads to higher power for the addition step of COP. For any two slicing schemes S = (S1, ⋯, S_{H1}) and S′ = (S′1, ⋯, S′_{H2}), we say that S′ is a refinement of S, denoted by S′ ≾ S, if for any S′h′ ∈ S′ there exists an Sh ∈ S such that S′h′ ⊆ Sh.

Proposition 4.1. Suppose S and Sare two slicing schemes such that S′ ≾ S. Then, for any η ∈ ℝK, we have

$$\eta'\, M_{H_2,K}\,\eta \ \ge\ \eta'\, M_{H_1,K}\,\eta, \qquad (28)$$

where M_{H2,K} and M_{H1,K} are defined as in (20) under the slicing schemes S′ and S, respectively.

Proposition 4.1 implies that the constant C_{H,𝒜+t} in Theorem 4.1 becomes larger when a finer slicing scheme is used. This further suggests that the power of the COP procedure in selecting true predictors tends to increase when a slicing scheme uses a larger number of slices. On the other hand, when a slicing scheme uses a larger number of slices, the number of observations in each slice decreases, which makes the estimate of E(X | y ∈ Sh) less accurate and further makes the estimates of M = Cov{E(X|Y)} and its eigenvalues λ1, ⋯, λK less stable. The success of the COP procedure hinges on a good balance between the number of slices and the number of observations in each slice. We observed from intensive simulation studies that, with a reasonable number of observations in each slice (say ≥ 20), a larger number of slices is preferred.
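
For example, this rule of thumb can be encoded as choosing the largest H that keeps roughly 20 observations per slice (illustrative; n denotes the sample size):

```r
# pick the largest H compatible with about 20 or more observations per slice
H <- max(2, floor(n / 20))
```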

4.2. Choice of ce and cd

Section 3 has characterized the asymptotic distributions and behaviors of the test statistics involved in the COP procedure. In theory, these results (Theorems 3.3 and 3.4) can be used for choosing the thresholds ce and cd. In practice, however, these theoretical thresholds should be used with much caution due to the following concerns. First, the distributions obtained in Section 3 are for a single addition or deletion step and under various assumptions. Second, the distributions are valid only in an asymptotic sense. In what follows, we propose to use a cross-validation procedure for selecting ce and cd.

Let {αi}_{1≤i≤m} be a pre-specified grid on a sub-interval of (0, 1) and {χ²_{αi,K}}_{1≤i≤m} be the collection of the 100αith percentiles of χ²_K. For convenience, we only consider the m pairs ce = χ²_{αi,K} and cd = χ²_{αi−0.05,K} for 1 ≤ i ≤ m. Note that cd < ce and that there is only one tuning parameter to determine. We follow the general 5-fold cross-validation scheme to select the best pair of ce and cd. We randomly divide the original data into five equal-sized subsets, and then apply the COP procedure to any four subsets to generate the estimation and variable selection results. The remaining subset of the data is used to test the model and generate a performance measurement. The performance measurements are averaged and the result is used as the CV score. We choose the pair of ce and cd that maximizes the CV score.

We define the performance measure used in the CV procedure as follows. Suppose 𝒜 is the collection of selected predictors and η̂1,𝒜, ⋯, η̂K,𝒜 are the estimates of the principal profile correlation directions produced by applying the COP procedure to the training data set. We consider the first principal profile correlation direction first. Recall that η1,𝒜 is the direction that achieves the maximum correlation between a linear projection of X and the transformed response Y, and the optimal transformation is T1(Y) = E(η1,𝒜′X | Y) (Theorem 3.1 in Chen and Li (1998)). With η1,𝒜 estimated by η̂1,𝒜 using the training data, we apply LOESS (Cleveland 1979) to fit T1(Y) on the training data and denote the fitted transformation by T̂1(·). Let X̃ and Ỹ be the data matrix and the response vector of the testing data set. Then, the squared profile correlation between X̃ and Ỹ based on the direction η̂1,𝒜 and transformation T̂1(·) is computed as corr²(T̂1(Ỹ), η̂1,𝒜′X̃). Similarly, the squared profile correlations between X̃ and Ỹ along η̂2,𝒜, ⋯, η̂K,𝒜 can be calculated. The overall performance measure is defined to be

$$PC = \sum_{k=1}^{K}\operatorname{corr}^2\big(\hat{T}_k(\tilde{Y}),\ \hat{\eta}_{k,\mathcal{A}}'\tilde{X}\big). \qquad (29)$$

The CV score for any pair (ce, cd) is defined to be the average PC over the five possible partitions of the training-test data sets.
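
A hedged R sketch of the score (29) for a single training/test split is given below; it uses R's loess() for the fitted transformations T̂_k, and all object names are illustrative rather than taken from the authors' implementation.

```r
# Profile-correlation score (29) for one training/test split; eta_hat is the
# p x K matrix of directions estimated from the training data by COP.
pc_score <- function(x_tr, y_tr, x_te, y_te, eta_hat) {
  sum(sapply(seq_len(ncol(eta_hat)), function(k) {
    proj_tr <- drop(x_tr %*% eta_hat[, k])
    fit     <- loess(proj_tr ~ y_tr)                  # fit T_k(y) = E(eta_k'X | Y) by LOESS
    T_te    <- predict(fit, newdata = data.frame(y_tr = y_te))
    cor(T_te, drop(x_te %*% eta_hat[, k]), use = "complete.obs")^2
  }))
}
# The pair (ce, cd) with the largest 5-fold average of pc_score() is retained.
```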

4.3. Selection of the number of directions K

To determine K, the number of principal profile correlation directions, we adopt a BIC-type criterion proposed by Zhu et al. (2006). For any given k between 1 and J, where J ≤ max(n, p) is a reasonable upper bound chosen by the user, we apply the COP procedure with K = k. Suppose the resulting collection of selected predictors is 𝒜k and the cardinality of 𝒜k is pk. Using the data involving only the selected predictors, we estimate M = Cov(E(X𝒜k | Y)) as before and denote the result by M̂. Let θ̂1 ≥ θ̂2 ≥ ⋯ ≥ θ̂_{pk} be the eigenvalues of M̂ + I_{pk}, where I_{pk} is the pk × pk identity matrix, and let τ be the number of θ̂i's that are greater than 1. Define

$$G(k) = -\log L(k) + \frac{\log(n)}{2}\, k\,(2p_k - k + 1), \qquad (30)$$

where $\log L(k) = \sum_{i=\min(\tau,k)+1}^{p_k}\big(\log\hat\theta_i + 1 - \hat\theta_i\big)$. We choose K = argmin_{1≤k≤J} G(k). Zhu et al. (2006) showed that their original criterion produces a consistent estimate of K for fixed pk. Our simulation study shows that the modified criterion leads to the correct specification of K for the COP procedure and can generally be used in practice.
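
The sketch below illustrates how G(k) could be evaluated in R for one candidate k, given the predictor set 𝒜k selected by COP with K = k. Because (30) is reconstructed from the text, the exact form of the criterion here should be treated as an assumption, and all names are ours.

```r
# Illustrative evaluation of G(k) for one candidate k and its selected set A_k.
G_of_k <- function(x, y, k, A_k, H = 10) {
  n <- nrow(x); p_k <- length(A_k)
  xs <- x[, A_k, drop = FALSE]
  slice <- cut(rank(y, ties.method = "first"), breaks = H, labels = FALSE)
  xbar  <- colMeans(xs)
  M <- matrix(0, p_k, p_k)
  for (h in seq_len(H)) {                               # M-hat for the selected predictors
    idx <- which(slice == h)
    d   <- colMeans(xs[idx, , drop = FALSE]) - xbar
    M   <- M + (length(idx) / n) * tcrossprod(d)
  }
  th  <- eigen(M + diag(p_k), symmetric = TRUE)$values  # eigenvalues of M-hat + I
  tau <- sum(th > 1)
  i0  <- min(tau, k) + 1
  logL <- if (i0 <= p_k) sum(log(th[i0:p_k]) + 1 - th[i0:p_k]) else 0
  -logL + (log(n) / 2) * k * (2 * p_k - k + 1)          # G(k) as in (30)
}
# K is taken as the k in 1, ..., J minimizing G_of_k().
```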

5. Simulation Study

We have performed extensive simulation studies to compare the COP algorithm with a few existing variable selection methods, and we present three examples in this section. When implementing the COP algorithm in these examples, we use the CV procedure and the BIC-type criterion discussed in the previous section to select the thresholds ce and cd and the dimensionality K, respectively. The grid used for selecting ce is {χ²_{0.90,K}, χ²_{0.95,K}, χ²_{0.99,K}, χ²_{0.999,K}, χ²_{0.9999,K}}, and the associated grid for selecting cd is {χ²_{0.85,K}, χ²_{0.90,K}, χ²_{0.94,K}, χ²_{0.949,K}, χ²_{0.9499,K}}. The range used for selecting K is from 1 to 4 (i.e., J = 4). For SSIR, we use the grid {0, 0.1, …, 0.9, 1} × {0, 0.1, …, 0.9, 1} to select the pair of tuning parameters that leads to its best performance. Both COP and SSIR involve slicing the range of the response variable, for which we use the same scheme to facilitate fair comparison.

5.1. Linear models

In this example, we consider the linear model

$$Y = X'\beta + \sigma\varepsilon, \qquad (31)$$

where X = (X1, X2, ⋯, Xp)′ follows a p-variate normal distribution with mean zero and covariances Cov(Xi, Xj) = ρ^{|i−j|} for 1 ≤ i, j ≤ p, and ε is independent of X and follows N(0, 1). The variable selection methods we compare the COP procedure with include LASSO, SCAD (Fan and Li 2001), MARS, and SSIR (Li 2007). The R packages SIS, lars and mda are used to run SCAD, LASSO and MARS, respectively. The tuning parameters involved in SCAD and LASSO are selected by cross-validation. We use the code provided by the original authors to run SSIR. In this example, we consider the two specifications of the linear model given below; a small data-generating sketch follows the scenario list.

  • Scenario 1.1 : p = 8, β = (3, 1.5, 2, 0, 0, 0, 0, 0)′, σ = 3, ρ = 0.5;

  • Scenario 1.2 : p = 1000, β = (3, 1.5, 1, 1, 2, 1, 0.9, 1, 1, 1, 0, ⋯, 0)′, σ = 1, ρ = 0.5.
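
For concreteness, one data set from Scenario 1.1 can be generated as follows (illustrative R code; the seed is arbitrary):

```r
# Generating one data set from Scenario 1.1.
set.seed(1)
p <- 8; n <- 40; rho <- 0.5; sigma <- 3
beta  <- c(3, 1.5, 2, rep(0, 5))
Sigma <- rho^abs(outer(1:p, 1:p, "-"))            # Cov(X_i, X_j) = rho^|i - j|
x <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # rows are N_p(0, Sigma)
y <- drop(x %*% beta) + sigma * rnorm(n)          # linear model (31)
```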

Under Scenario 1.1, Model (31) involves 3 true predictors and 5 irrelevant variables, and was originally used in Tibshirani (1996) and Fan and Li (2001) to demonstrate the empirical performance of LASSO and SCAD. We randomly generated 100 data sets from Scenario 1.1, each with 40 data points (i.e., n = 40), and applied the aforementioned methods to the data sets. Two quantities were used to measure the variable selection performance of each method: the average number of irrelevant predictors falsely selected as true predictors (denoted by FP) and the average number of true predictors falsely excluded as irrelevant (denoted by FN). Note that under Scenario 1.1, FP and FN range from 0 to 5 and from 0 to 3, respectively, with small values indicating good performance in variable selection. The FP and FN values of the tested methods are reported in Table 1.

Table 1.

Performance comparison under linear models: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods     p = 8, n = 40, σ = 3, ρ = 0.5        p = 1000, n = 200, σ = 1, ρ = 0.5
            FP (0, 5)       FN (0, 3)            FP (0, 990)      FN (0, 10)
LASSO       0.77 (0.093)    0.16 (0.037)         8.87 (0.586)     0.00 (0.000)
SCAD        0.67 (0.094)    0.10 (0.030)         6.05 (0.926)     1.16 (0.150)
MARS        4.00 (0.059)    0.04 (0.020)         30.64 (0.165)    0.00 (0.000)
SSIR        0.19 (0.051)    0.96 (0.068)         NA*              NA*
COP         0.71 (0.080)    0.56 (0.066)         2.28 (0.203)     0.75 (0.095)

Under Scenario 1.2, Model (31) involves ten true predictors and 990 irrelevant predictors and is clearly more challenging than Scenario 1.1. We randomly generated 100 data sets each with 200 data points (i.e. n=200) from Scenario 1.2. Note that in each data set, n < p. Similar to Scenario 1.1, we applied the methods mentioned above to the data sets and report the FP and FN values of these methods in Table 1. The tuning parameters in all these methods are determined by cross validation.

From the left panel of Table 1, under Scenario 1.1, SSIR has the lowest FP value (FP = 0.19), that is, the average number of irrelevant variables selected by SSIR is 0.19, and COP has the third lowest FP value (0.71). The other methods tend to have more false positives than SSIR and COP. In terms of FN, the methods ranked from the lowest to the highest are MARS, SCAD, LASSO, COP, and SSIR. The relatively sub-par FN performance of COP and SSIR is due to the fact that these two methods are developed for variable selection under models more general than the linear model.

From the right panel of Table 1, under Scenario 1.2, COP has the lowest FP value (FP = 2.28). In terms of FN, LASSO and MARS have the lowest values, with COP following modestly behind. Compared with MARS, COP has a much lower FP value and a slightly higher FN value. SSIR breaks down under Scenario 1.2 because the sample variance-covariance matrix of X is no longer invertible. In terms of both FP and FN, COP outperformed SCAD under this scenario. One explanation is that SCAD involves non-convex optimization and can be unstable in implementation.

5.2. Nonlinear multiple index models

In this example, we consider the following multiple index model,

$$Y = \frac{X_1 + X_2 + \cdots + X_d}{0.5 + (1.5 + X_2 + X_3 + X_4)^2} + \sigma\varepsilon, \qquad (32)$$

where X1, ⋯, Xp are i.i.d. N(0, 1) random variables, ε is N(0, 1) and independent of X, and d and σ are parameters to be further specified. This model was originally used in Li (1991) for demonstrating the performance of SIR. It is not difficult to see that, given the two projections X1 + X2 + ⋯ + Xd and X2 + X3 + X4, Y and X are independent of each other. The dimensionality of the central subspace of Model (32) is two, and the collection of true predictors is {X1, ⋯, Xd} ⋃ {X2, X3, X4}. Because Model (32) is nonlinear, methods designed specifically for linear models, such as LASSO and SCAD, are clearly at a disadvantage. Therefore, in this example, we only compare the performances of MARS, SSIR and COP.
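
For concreteness, a data set from Model (32) under Scenario 2.1 can be generated as follows (illustrative R code with an arbitrary seed):

```r
# Generating one data set from Model (32) under Scenario 2.1.
set.seed(1)
n <- 200; p <- 30; d <- 3; sigma <- 0.1
x   <- matrix(rnorm(n * p), n, p)                 # i.i.d. N(0, 1) predictors
num <- rowSums(x[, 1:d, drop = FALSE])            # first projection: X1 + ... + Xd
den <- 0.5 + (1.5 + rowSums(x[, 2:4]))^2          # second projection enters the denominator
y   <- num / den + sigma * rnorm(n)
```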

By specifying p, d and σ at different values, we have the following three scenarios,

  • Scenario 2.1: p = 30, d = 3, σ = 0.1;

  • Scenario 2.2: p = 30, d = 3, σ = 2;

  • Scenario 2.3: p = 400, d = 8, σ = 0.1.

For each scenario, we generated 100 data sets each with 200 observations (i.e. n=200) and applied MARS, SSIR and COP to each data set. The resulting FP and FN values are reported in Table 2.

Table 2.

Performance comparison under multiple index model: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods     σ = 0.1, p = 30, d = 3               σ = 2, p = 30, d = 3                σ = 0.1, p = 400, d = 8
            FP (0, 26)      FN (0, 4)            FP (0, 26)      FN (0, 4)           FP (0, 392)      FN (0, 8)
MARS        16.55 (0.174)   0.03 (0.017)         17.18 (0.186)   0.32 (0.053)        NA*              NA*
SSIR        0.12 (0.033)    0.91 (0.029)         4.14 (0.288)    1.76 (0.115)        NA*              NA*
COP         1.88 (0.149)    0.83 (0.038)         3.26 (0.210)    1.71 (0.104)        8.93 (0.576)     0.18 (0.081)

For Scenario 2.1, MARS achieved the lowest FN value (0.03), but its FP value was unacceptably high (16.55); SSIR had the lowest FP value, but its FN value was the highest among the three. The FP and FN values of COP were between the extremes. It appears that the performances of SSIR and COP are similar under Scenario 2.1. For Scenario 2.2, COP outperformed SSIR in terms of both FP and FN values. MARS again achieved the lowest FN value (0.32) at the expense of an unacceptable FP value (17.18). Scenario 2.3 is the most challenging among the three scenarios, in which the number of predictors exceeds the number of observations. Both MARS and SSIR broke down under this scenario. However, COP still demonstrated an excellent performance, with reasonably low FP and FN values.

5.3. Heteroscedastic models

In the previous examples, the true predictors affect only the mean response. In this example, we consider the following heteroscedastic model

$$Y = \frac{0.2\,\varepsilon}{1.5 + \sum_{j=1}^{p}\beta_{j,1}X_j}, \qquad (33)$$

where X = (X1, X2, ⋯, Xp)′ follows a p-variate normal distribution with mean zero and covariances Cov(Xi, Xj) = ρ^{|i−j|} for 1 ≤ i, j ≤ p, ε is independent of X and follows N(0, 1), and βj,1 = 1 for 1 ≤ j ≤ 8 and βj,1 = 0 for j ≥ 9. Note that the central subspace is spanned by β1 = (β1,1, β2,1, …, βp,1)′ and the number of true predictors is 8. We further specify ρ and p in (33) and consider the following three scenarios,

  • Scenario 3.1 : ρ = 0, p = 500;

  • Scenario 3.2 : ρ = 0, p = 1000;

  • Scenario 3.3 : ρ = 0.3, p = 1500.

For each scenario, we generated 100 data sets each with n = 1000 observations and applied MARS, SSIR and COP to the data sets. The FP and FN values of the three methods are listed in Table 3.

Table 3.

Performance comparison under heteroscedastic model: FP is the average number of irrelevant variables falsely selected by the method, and FN is the average number of true variables falsely excluded by the method; the number in (·) is the standard error of the FP or FN and NA* indicates that the corresponding algorithm broke down.

Methods     ρ = 0, n = 1000, p = 500             ρ = 0, n = 1000, p = 1000           ρ = 0.3, n = 1000, p = 1500
            FP (0, 492)      FN (0, 8)           FP (0, 992)      FN (0, 8)          FP (0, 1492)     FN (0, 8)
MARS        212.15 (0.428)   4.83 (0.116)        230.33 (0.372)   6.16 (0.129)       236.60 (0.524)   6.84 (0.126)
SSIR        52.54 (1.970)    0.88 (0.149)        NA*              NA*                NA*              NA*
COP         5.79 (0.365)     1.21 (0.030)        13.14 (0.734)    1.29 (0.037)       21.36 (0.937)    1.5 (0.039)

Under Scenario 3.1, both SSIR and COP outperformed MARS. The FN value of SSIR (0.88) is less than that of COP (1.21), but its FP value (52.54) is much larger than that of COP (5.79). Under both Scenarios 3.2 and 3.3, where p is at least as large as n, SSIR broke down, but COP still demonstrated excellent performance. The performance of MARS under these two scenarios was fairly poor.

6. Application: Predicting Gene Expression from Sequence Using Next-generation Sequencing Data

Embryonic stem cells (ESCs) maintain self-renewal and pluripotency, as they have the ability to differentiate into all cell types. To enhance the understanding of embryonic stem cell development, predictive models, such as regression models, can be constructed in which gene expression is regarded as the response variable and various features associated with gene-regulating transcription factors (TFs) are taken as the predictors. Examples of such features include motif scores based on position-specific weight matrices of motifs recognized by the TFs (Conlon et al. 2003) and ChIP-chip log ratios.

Recently, the emerging next-generation sequencing technologies, in particular RNA-Seq and ChIP-Seq, offer researchers an unprecedented opportunity to build predictive models for complex biological processes such as gene regulation. Compared to traditional hybridization-based methods such as microarrays, RNA-Seq and ChIP-Seq provide more accurate quantification of gene expression and TF-DNA binding locations, respectively (Mortazavi et al. 2008, Wilhelm et al. 2008, Nagalakshmi et al. 2008, Boyer et al. 2005, Johnson et al. 2007).

To quantify gene expression from RNA-Seq data, one may calculate the RPKM (reads per kilobase of exon region per million mapped reads), which has been shown to be proportional to gene expression levels (Cloonan et al. 2008). From ChIP-Seq data, Ouyang et al. (2009) proposed a feature named TF association strength (TFAS), which was shown to explain a much higher proportion of gene expression variation than traditional predictors in predictive models. In particular, for each TF, the TFAS of each gene is computed as a weighted sum of the corresponding ChIP-Seq signal strengths, where the weights reflect the proximity of the signals to the gene. We here examine whether we can build a better predictive model for gene expression by combining both TFASs and motif scores of TFs in mouse ESCs.

To achieve this goal, we compiled a data set consisting of gene expressions, TFASs, and motif scores. In this data set, the RPKMs were calculated as gene expression levels from RNA-Seq data in mouse ESCs (Cloonan et al. 2008). The TFASs of 12 TFs were calculated from ChIP-Seq experiments in mouse ESCs (Chen et al. 2008). In addition, we supplemented this data set with motif scores of putative mouse TFs. From the transcription factor database TRANSFAC, we compiled a list of 300 TF binding motifs (TFBMs) of mouse. For each gene, a matching score was calculated for each TFBM using the scoring system described in Zhong et al. (2005). The matching score can be interpreted intuitively as the expected number of occurrences of a TFBM in the gene's promoter region. To build a predictive model in mouse ESCs, we treat the gene expression as the response variable and the 12 TFASs as well as the 300 TF motif matching scores as predictors. More precisely, the response is a vector with 12408 entries, and the data matrix is a 12408 × 312 matrix, with the (i, j)th entry representing the TFAS of the ith gene for TF j if j ≤ 12, and the matching score of the ith gene's promoter region for the (j − 12)th motif if j > 12.

We applied COP to this data set. The procedure identified two principal directions and selected 42 predictors in total. The first squared profile correlation is λ1 = 0.67, and the second squared profile correlation is λ2 = 0.20. Among the 12 TFASs calculated from ChIP-Seq, eight were selected by COP. In particular, Oct4 is a well-known master regulator of pluripotency, and Klf4 regulates differentiation (Cai et al. 2010). Evidence also suggests that at these early stages of development, STAT3 activation is required for self-renewal of ESCs (Matsuda et al. 1999). Among the 300 TF motif scores, 34 were selected by COP. To further understand what extra information the TF motif scores provide, we annotated the functions of the corresponding 34 TFs. It is of interest to note that 24 of the 34 selected motifs correspond to TFs that are either regulators of development or cancer-related; see Table 4. Since ESCs are in a developmental phase, it is not surprising to find active TFs regulating general development. Some recent evidence suggests that tumor suppressors that control cancer cell proliferation also regulate stem cell self-renewal (Pardal et al. 2005). Thus, a careful study of these cancer-related TFs could lead to a better understanding of the stem cell regulatory network.

Table 4.

Motifs identified

development       COUP-TF, AP2, Sp1, CHOP C/EBPalpha, NF-AT, Pax, Pax8, GABP, En1, TTF1, PITX2, NKx2-2, HIXA4, ZF5, PPAR direct repeat 1
cancer            IRF1, EVI1, NF1, GKLF, Whn, VDR, POU6F1, Arnt, Cdx2
8 selected TFAS   E2F1, Mycn, ZFx, Klf4, Tcfcp2l1, Oct4, Stat3, Smad1

7. Discussion

The contribution of the COP procedure to the development of variable selection methodologies for high dimensional regression analysis is two-fold. First, it does not impose any assumption on the relationship between the response variable and the predictors, and the sufficient dimension reduction framework on which the COP procedure relies includes fully nonparametric models as special cases. Therefore, COP can be considered a model-free variable selection procedure applicable in any high dimensional data analysis. Second, as demonstrated by our simulation studies, the COP procedure can effectively handle hundreds or even thousands of predictors, which can be extremely challenging for other existing methods for variable selection beyond linear or parametric models. Like linear stepwise regression, the COP procedure may encounter issues typical of stepwise procedures, as discussed in Miller (1984). Nonetheless, we believe that the COP procedure should become an indispensable member of the repository of variable selection tools, and we recommend its broad use. When a parametric model is postulated for the relationship between the response and the predictor variables and model-specific variable selection methods are available, we recommend using COP together with these methods as a safeguard against possible model misspecification. We have implemented the COP procedure in R, and the R package can be downloaded from http://cran.r-project.org/web/packages/COP/ or requested from the authors directly.

As a trade-off, the COP procedure imposes various assumptions on the distribution of the predictors, of which the linearity assumption is the most fundamental and crucial. When the linearity condition is required to hold for any lower dimensional projection, it is equivalent to requiring that the joint distribution of the predictors is elliptically contoured (Eaton 1986). Hall and Li (1993) established that low dimensional projections of high dimensional data approximately satisfy the linearity condition, which to a certain degree alleviates the concern about the linearity assumption and explains why SIR and the COP procedure work well under mild violations of the assumption. When the linearity condition is heavily violated, data re-weighting schemes such as the Voronoi re-weighting scheme (Cook and Nachtsheim 1994) can be used to correct the violation. We plan to incorporate such schemes into the COP procedure in the future.

When the number of predictors is extremely large, the performance of the COP procedure can be compromised. This is also the case for variable selection methods under the linear model. Recently, Fan and Lv (2008) advocated a two-step approach to attack the so-called ultra-high dimensionality. The first step is to perform screening to reduce the dimensionality from ultra-high to high or moderately high, and in the second step variable selection methods are applied to identify the true predictors. The same approach can be used for variable selection under the SDR framework. More precisely, we can apply the forward COP (FCOP) procedure, which is simply the COP procedure with the deletion step removed, to reduce the dimensionality of a problem from ultra-high to moderately high. The FCOP procedure is much easier to implement and computationally more efficient than the COP procedure. Then, the usual COP procedure is applied to the reduced data to select the true predictors. This approach is currently under investigation and the results will be reported in a future publication.

Acknowledgments

This work was supported by an NIH grant U01 ES016011, a DOE grant from the Office of Science (BER), and an NSF grant DMS 1120256 to WZ; an NIH grant R01-HG02518-02 and an NSF grant DMS 1007762 to JL; and an NSF grant DMS 0707004 to YZ.

Appendix

A.1 PROOF OF PROPOSITION 2.1

Let 𝒮^⊥(B) denote the space of vectors ρ such that ρ′Σβ = 0 for any β ∈ 𝒮(B), and let 𝒮^⊥(η) be the space of vectors ρ such that ρ′Σηk = 0 for k = 1, ⋯, ℓ. We will show that 𝒮^⊥(B) ⊆ 𝒮^⊥(η), which means that P(ρ) = 0 for any ρ ∈ 𝒮^⊥(B). First, because for any transformation T, T(Y) ⊥ ρ′X | B′X, we have cov(T(Y), ρ′X) = E(T(Y)ρ′X) = E(E(T(Y) | B′X) E(ρ′X | B′X)). Due to the linearity condition, for any ρ ∈ 𝒮^⊥(B), E(ρ′X | B′X) = c1β1′X + ⋯ + cKβK′X, where c1, ⋯, cK are linear coefficients. In addition, since cov(ρ′X, βk′X) = ρ′Σβk = 0 for k = 1, ⋯, K, we have E(ρ′X | B′X) = 0. Consequently, corr²(T(Y), ρ′X) = cov²(T(Y), ρ′X)/{var(T(Y)) var(ρ′X)} = 0, so P(ρ) = 0 and 𝒮^⊥(B) ⊆ 𝒮^⊥(η). Proposition 2.1 holds.

A.2 PROOF OF THEOREM 3.1

Without loss of generality, we let 𝒜 = {1, ⋯, d} and t = d + 1. Let X^{(j)} be the vector of the n i.i.d. observations of the jth variable for j = 1, ⋯, d + 1. We assume that the predictors have been centered to have zero sample mean. Denote X_{n×j} = (X^{(1)}, ⋯, X^{(j)}) for j = d, d + 1. We let

$$\hat{M}^{(j)} = \sum_{h=1}^{H}\frac{n_h}{n}\,\bar{x}_h^{(j)}\big(\bar{x}_h^{(j)}\big)^T \quad\text{for } j = d,\ d+1,$$

where x̄_h^{(j)} (j = d, d + 1) is the average of the first j variables over those individuals whose responses fall into the hth slice Sh, h = 1, ⋯, H, and nh is the number of observations in the hth slice. Let λ̂_i^{(j)} be the ith largest eigenvalue of Σ̂_j^{-1} M̂^{(j)} for j = d, d + 1, respectively, where Σ̂_j is the sample variance-covariance matrix of X_{n×j}. It is difficult to derive the asymptotic distribution of λ̂_i^{(d+1)} − λ̂_i^{(d)} for i = 1, ⋯, K directly from Σ̂_j^{-1} M̂^{(j)} for j = d, d + 1. We therefore apply transformations such that the transformed Σ̂_d^{-1} M̂^{(d)} (with eigenvalues unchanged) is a sub-matrix of the transformed Σ̂_{d+1}^{-1} M̂^{(d+1)}.

Let

$$\hat{\gamma}_{n\times1} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_n)^T = \frac{1}{\hat{\sigma}}\Big[I - X_{n\times d}\big(X_{n\times d}^T X_{n\times d}\big)^{-1}X_{n\times d}^T\Big]X^{(d+1)},$$

where σ̂² is the sample variance of [I − X_{n×d}(X_{n×d}^T X_{n×d})^{-1} X_{n×d}^T] X^{(d+1)}. Denote γ̄_h = n_h^{-1} Σ_{y_i∈S_h} γ̂_i. Let γ = X_{d+1} − E(X_{d+1} | X_1, ⋯, X_d), and let γ_{n×1} be the vector of the n regression error terms of the observed X_{d+1} on X_1, ⋯, X_d. Then the entries of γ_{n×1} are i.i.d. with mean zero and finite variance. Under the null hypothesis H0 : η_{d+1,i} = 0, i = 1, ⋯, K, we have E(γ | y) = E(E(γ | X_1, ⋯, X_d) | y) = 0 for any y. Let γ̄ be the mean of γ_{n×1}. Then

$$\hat{\gamma}_{n\times1} = (\hat{\gamma}_1, \ldots, \hat{\gamma}_n)^T = \frac{1}{\hat{\sigma}}\Big[I - X_{n\times d}\big(X_{n\times d}^T X_{n\times d}\big)^{-1}X_{n\times d}^T\Big]\big(\gamma_{n\times1} - \bar{\gamma}\big).$$

With transformations on Σ̂_d^{-1} M̂^{(d)}, we can show that λ̂_i^{(d+1)} − λ̂_i^{(d)}, for i = 1, ⋯, K, equals a squared linear combination of the γ̄_h. Thus, we only need to show that (γ̄_1, ⋯, γ̄_H) converges to a multivariate normal distribution, which completes the proof. Let (z_1, ⋯, z_d)′ = Σ_d^{-1/2}(x_1, ⋯, x_d)′. Define four matrices A_{H×H}, B_{H×d}, E_{d×d} and Γ_{H×d}, where A_{H×H} = diag{var(γ | y ∈ S_1), ⋯, var(γ | y ∈ S_H)}/σ²; the (h, j)th entry of B_{H×d} is p_h cov(z_j γ, γ | y ∈ S_h)/σ²; the (j, j′)th entry of E_{d×d} equals cov(z_j γ, z_{j′} γ)/σ²; the (h, j)th entry of Γ_{H×d} is p_h E(z_j | y ∈ S_h); and σ² = lim_n σ̂² = Var(γ). Let ϒ be the d × d matrix

$$\Upsilon = \Gamma_{H\times d}^T A_{H\times H}\Gamma_{H\times d} - \Gamma_{H\times d}^T B_{H\times d}\,\Gamma_{H\times d}^T\Gamma_{H\times d} - \Gamma_{H\times d}^T\Gamma_{H\times d}\,B_{H\times d}^T\Gamma_{H\times d} + \Gamma_{H\times d}^T\Gamma_{H\times d}\,E_{d\times d}\,\Gamma_{H\times d}^T\Gamma_{H\times d}.$$

Define a d × K matrix Q whose jth column is q_j/(λ_j^{(d)}(1 − λ_j^{(d)}))^{1/2}, where q_j is the jth eigenvector of the limiting matrix lim_n Σ̂_d^{-1/2} M̂^{(d)} Σ̂_d^{-1/2} and λ_j^{(d)} = lim_n λ̂_j^{(d)}. Then W_{Kt} = Q^T ϒ Q.

A.3 PROOF OF COROLLARY 3.1

With the additional condition that E(γ² | X_1, ⋯, X_d) is constant, we can show that the asymptotic variance matrix of (γ̄_1, ⋯, γ̄_H) takes a special form, from which the asymptotic standard chi-square distributions can be derived.

A.4 PROOF OF THEOREM 3.2

Without loss of generality, we let 𝒜 = {1, ⋯, d}. Following the notation used in the proof of Theorem 3.1, let γ_j = X_j − E(X_j | X_i, i ∈ 𝒜) for j ∈ 𝒜c, and

$$\hat{\gamma}_j = (\hat{\gamma}_{j,1}, \ldots, \hat{\gamma}_{j,n})' = \frac{1}{\hat{\sigma}_j}\Big[I_n - X_{n\times d}\big(X_{n\times d}'X_{n\times d}\big)^{-1}X_{n\times d}'\Big]X^{(j)}.$$

Let γ̄_h^j = n_h^{-1} Σ_{y_i∈S_h} γ̂_{j,i}. As in the proof of Theorem 3.1, we essentially show that the γ̄_h^j, for j = d + 1, ⋯, p and h = 1, ⋯, H, jointly converge to a multivariate normal distribution.

A.5 PROOF OF THEOREM 3.3

We use the same notation as in the proof of Theorem 3.2. Let γ̄_h^j = n_h^{-1} Σ_{y_i∈S_h} γ̂_{j,i}, and let λ̂_k^{(d)} be defined as in the proof of Theorem 3.1. First, for any t, $\mathrm{COP}_{1:K}^{\mathcal{A}+t} \ge n\big(\sum_{k=1}^K\hat\lambda_k^{(d+1)} - \sum_{k=1}^K\hat\lambda_k^{(d)}\big)$, and

$$\Big|\sum_{k=1}^{K}\hat\lambda_k^{(d+1)} - \sum_{k=1}^{K}\hat\lambda_k^{(d)}\Big| \ge \Big|\sum_{k=1}^{K}\big(\lambda_k^{(d+1)} - \lambda_k^{(d)}\big)\Big| - \Big|\sum_{k=1}^{K}\big(\lambda_k^{(d+1)} - \hat\lambda_k^{(d+1)}\big)\Big| - \Big|\sum_{k=1}^{K}\big(\lambda_k^{(d)} - \hat\lambda_k^{(d)}\big)\Big|.$$

Since X follows a multivariate normal distribution, it follows from Li (1991) that λ_k^{(d)} = λ_k^{(d+1)} = 0 for k > K, and hence

$$\sum_{k=1}^{K}\big(\lambda_k^{(d+1)} - \lambda_k^{(d)}\big) = \lim_{n}\Big[\mathrm{trace}\big(\tilde\Omega^{(d+1)}\big) - \mathrm{trace}\big(\hat\Omega^{(d)}\big)\Big] = \lim_{n}\sum_{h=1}^{H}\frac{n_h}{n}\big(\bar\gamma_h^j\big)^2 = \sum_{h=1}^{H}p_h\,\frac{E^2\big(\gamma_j \mid y\in S_h\big)}{\sigma_j^2}.$$

We need to use the two Lemmas 8.1 and 8.2 stated below. The proofs of the two lemmas are omitted here. From Lemma 8.1,

$$\max_{t\in\mathcal{A}^c\cap\mathcal{T}}\ n\Big(\sum_{k=1}^{K}\hat\lambda_k^{(d+1)} - \sum_{k=1}^{K}\hat\lambda_k^{(d)}\Big) \ge \frac{\varpi\,\omega_H\,n^{1-\xi_0}\,\tau_{\min}^2}{\tau_{\max}} - \Big|\sum_{k=1}^{K}n\big(\lambda_k^{(d+1)} - \hat\lambda_k^{(d+1)}\big)\Big| - \Big|\sum_{k=1}^{K}n\big(\lambda_k^{(d)} - \hat\lambda_k^{(d)}\big)\Big|.$$

Then as long as

$$\max_{\mathcal{A}\subseteq\{1,\ldots,p\}}\ n\Big|\sum_{k=1}^{K}\big(\lambda_k^{(d)} - \hat\lambda_k^{(d)}\big)\Big| \le \vartheta\, n^{1-\xi_0}/2,$$

we have

$$\min_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}\neq\emptyset}\ \max_{t\in\mathcal{A}^c\cap\mathcal{T}}\ \mathrm{COP}_{1:K}^{\mathcal{A}+t} \ge \vartheta\, n^{1-\xi_0}.$$

From Lemma 8.2,

$$P\Big(\max_{\mathcal{A}\subseteq\{1,\cdots,p\}}\Big|\sum_{k=1}^K\lambda_k^{(d)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}\Big| > \vartheta n^{-\xi_0}/2\Big) \;\le\; 2Kp(p+1)\,C_1\exp\Big\{-\frac{C_2\,n^{1-2\xi_0}\tau_{\min}^2\vartheta^2}{256K^2p^2}\Big\}.$$

Under Condition 8, since $p = o(n^{\varrho_0})$ with $2\varrho_0 + 2\xi_0 < 1$, we have $P\big(\max_{\mathcal{A}\subseteq\{1,\cdots,p\}}\big|\sum_{k=1}^K\lambda_k^{(d)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}\big| > \vartheta n^{-\xi_0}/2\big)\to 0$, and therefore $P\big(\min_{\mathcal{A}:\,\mathcal{A}^c\cap\mathcal{T}\neq\emptyset}\max_{t\in\mathcal{A}^c\cap\mathcal{T}}\mathrm{COP}_{1:K}^{\mathcal{A}+t}\ge\vartheta n^{1-\xi_0}\big)\to 1$.
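To spell out the final step, note that $p = o(n^{\varrho_0})$ makes the exponent in the preceding bound diverge:

$$\frac{n^{1-2\xi_0}}{p^2}\;\gg\;n^{1-2\xi_0-2\varrho_0}\;\longrightarrow\;\infty\qquad\text{since } 2\varrho_0+2\xi_0<1,$$

so the exponential factor decays faster than any polynomial in n, while the prefactor $2Kp(p+1)C_1$ grows only polynomially; their product therefore tends to zero.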

A.3 PROOF OF THEOREM 3.4

Since

$$\Big|\sum_{k=1}^K\hat{\lambda}_k^{(d+1)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}\Big| \;\le\; \Big|\sum_{k=1}^K\big(\lambda_k^{(d+1)} - \lambda_k^{(d)}\big)\Big| + \Big|\sum_{k=1}^K\big(\lambda_k^{(d+1)} - \hat{\lambda}_k^{(d+1)}\big)\Big| + \Big|\sum_{k=1}^K\big(\lambda_k^{(d)} - \hat{\lambda}_k^{(d)}\big)\Big|,$$

and, since $\mathcal{T}\subseteq\mathcal{A}$ implies $\big|\sum_{k=1}^K(\lambda_k^{(d+1)} - \lambda_k^{(d)})\big| = 0$, it follows from Lemma 8.2 that $P\big(\max_{\mathcal{A}\subseteq\{1,\cdots,p\}}\big|\sum_{k=1}^K\lambda_k^{(d)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}\big| > \varepsilon\big)\to 0$ for any $\varepsilon > Cn^{\varrho_0-1/2}$. Hence Theorem 3.4 holds.

Lemma 8.1. Under the same conditions as in Theorem 3.3, for any $\mathcal{A}\subseteq\{1,\cdots,p\}$ with $\mathcal{A}^c\cap\mathcal{T}\neq\emptyset$,

$$\max_{j\in\mathcal{A}^c\cap\mathcal{T}}\ \sum_{h=1}^H p_h E^2(\gamma_j\mid y\in S_h)/\sigma_j^2 \;\ge\; \tau_{\min}^2\,\varpi\,\omega_H\,n^{-\xi_0}/\tau_{\max} \;>\; 0.$$

Lemma 8.2. Under the same conditions as in Lemma 8.1,

$$P\Big(\max_{\mathcal{A}\subseteq\{1,\cdots,p\}}\Big|\sum_{k=1}^K\lambda_k^{(d)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}\Big| > \varepsilon\Big) \;\le\; 2Kp(p+1)\,C_1\exp\Big\{-\frac{C_2\,n\,\tau_{\min}^2\varepsilon^2}{64K^2p^2}\Big\}.$$
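As a consistency check, substituting $\varepsilon = \vartheta n^{-\xi_0}/2$ into the bound of Lemma 8.2 gives

$$2Kp(p+1)\,C_1\exp\Big\{-\frac{C_2\,n\,\tau_{\min}^2(\vartheta n^{-\xi_0}/2)^2}{64K^2p^2}\Big\} = 2Kp(p+1)\,C_1\exp\Big\{-\frac{C_2\,n^{1-2\xi_0}\tau_{\min}^2\vartheta^2}{256K^2p^2}\Big\},$$

which is exactly the bound used in the proof of Theorem 3.3.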

A.3 PROOF OF THEOREM 4.1

For consistency, we use the same notation as defined in the proof of Theorem 3.1. Without loss of generality, let 𝒜 = {1, ⋯, d} and t = d + 1. Under the assumption that $X_{n\times(d+1)}$ has a multivariate normal distribution, we derive the limiting value of $\sum_{k=1}^K\hat{\lambda}_k^{(d+1)} - \sum_{k=1}^K\hat{\lambda}_k^{(d)}$ as n → ∞ with the slices held fixed. Let $\Xi_{K\times K}$ be the variance-covariance matrix of $v_K$.

Because $(X_1,\cdots,X_{d+1})$ follows a multivariate normal distribution, we have $\gamma = X_{d+1} - \big(\rho_0 + \sum_{i=1}^d\rho_i X_i\big)$ and $\gamma\sim N(0,\sigma_{d+1}^2)$, where the $\rho_i$ are the regression coefficients. Since the response is assumed to depend on only K linear combinations of the predictors, $\tilde{\Omega}^{(d+1)}$ and $\hat{\Omega}^{(d)}$ have at most K nonzero eigenvalues, and

$$\frac{\sum_{k=1}^K\hat{\lambda}_k^{(d+1)}}{\mathrm{trace}\big(\tilde{\Omega}^{(d+1)}\big)}\xrightarrow{P}1,\qquad \frac{\sum_{k=1}^K\hat{\lambda}_k^{(d)}}{\mathrm{trace}\big(\hat{\Omega}^{(d)}\big)}\xrightarrow{P}1,\qquad \frac{\sum_{k=1}^K\hat{\lambda}_k^{(d+1)}-\sum_{k=1}^K\hat{\lambda}_k^{(d)}}{\mathrm{trace}\big(\tilde{\Omega}^{(d+1)}\big)-\mathrm{trace}\big(\hat{\Omega}^{(d)}\big)}\xrightarrow{P}1.$$

We use the following three results:

  1. $\mathrm{trace}\big(\tilde{\Omega}^{(d+1)}\big) - \mathrm{trace}\big(\hat{\Omega}^{(d)}\big) = \sum_{h=1}^H n_h(\bar{\gamma}_h)^2/n$.

  2. $\bar{\gamma}_h\xrightarrow{P}E(\gamma\mid y\in S_h)/\sigma_{d+1}$, h = 1, ⋯, H.

  3. Since $E(\gamma\mid v_K) = \tilde{\eta}_{t,\mathcal{A}}^T\Xi_{K\times K}^{-1}v_K$, we have $E(\gamma\mid y\in S_h) = E[E(\gamma\mid v_K)\mid y\in S_h] = \tilde{\eta}_{t,\mathcal{A}}^T\Xi_{K\times K}^{-1}L_{H,K}$.

Combining 1, 2 and 3, we obtain $\sum_{h=1}^H n_h(\bar{\gamma}_h)^2/n\xrightarrow{P}\frac{1}{\sigma_{d+1}^2}\tilde{\eta}_{t,\mathcal{A}}^T\Xi_{K\times K}^{-1}M_{H,K}\Xi_{K\times K}^{-1}\tilde{\eta}_{t,\mathcal{A}}$. Since $\Xi_{K\times K}^{-1} = I_{K\times K}$,

$$\sum_{k=1}^K\hat{\lambda}_k^{\mathcal{A}+t} - \sum_{k=1}^K\hat{\lambda}_k^{\mathcal{A}}\;\xrightarrow{P}\;\frac{1}{\sigma_{d+1}^2}\tilde{\eta}_{t,\mathcal{A}}^T M_{H,K}\tilde{\eta}_{t,\mathcal{A}},$$

and Theorem 4.1 holds.

A.3 PROOF OF Proposition 4.1

Note that $\eta' M_{H,K}\eta = \mathrm{Var}\big(E(\eta' v_K\mid y\in S_h)\big)$ and

$$\mathrm{Var}\big(E(\eta' v_K\mid y)\big) = \mathrm{Var}\big(E(\eta' v_K\mid y\in S_h)\big) + E\big[\mathrm{Var}\big(E(\eta' v_K\mid y)\mid y\in S_h\big)\big].$$
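The displayed decomposition is an instance of the law of total variance,

$$\mathrm{Var}(Z) = \mathrm{Var}\big(E(Z\mid W)\big) + E\big[\mathrm{Var}(Z\mid W)\big],$$

applied with $Z = E(\eta' v_K\mid y)$ and W the slice membership of y; since the second term is nonnegative, $\eta' M_{H,K}\eta = \mathrm{Var}\big(E(\eta' v_K\mid y\in S_h)\big)$ cannot exceed $\mathrm{Var}\big(E(\eta' v_K\mid y)\big)$.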

Thus, Proposition 4.1 holds.

Contributor Information

Wenxuan Zhong, Email: wenxuan@illinois.edu, Department of Statistics, University of Illinois at Urbana Champaign, Champaign, IL 61820.

Tingting Zhang, Email: tz3b@virginia.edu, Department of Statistics, University of Virginia, Charlottesville, VA 22904.

Yu Zhu, Email: yuzhu@stat.purdue.edu, Department of Statistics, Purdue University, West Lafayette, IN 47907.

Jun S. Liu, Email: jliu@stat.harvard.edu, Department of Statistics, Harvard University, Cambridge, MA 02138.

References

  1. Bondell H, Li L. Shrinkage inverse regression estimation for model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2009;71(1):287–299.
  2. Boyer LA, Lee TI, Cole MF, Johnstone SE, Levine SS, Zucker JP, Guenther MG, Kumar RM, Murray HL, Jenner RG, Gifford DK, Melton DA, Jaenisch R, Young RA. Core transcriptional regulatory circuitry in human embryonic stem cells. Cell. 2005;122(6):947–956. doi: 10.1016/j.cell.2005.08.020.
  3. Cai J, Xie D, Fan Z, Chipperfield H, Marden J, Wong WH, Zhong S. Modeling co-expression across species for complex traits: insights to the difference of human and mouse embryonic stem cells. PLoS Comput. Biol. 2010;6:e1000707. doi: 10.1371/journal.pcbi.1000707.
  4. Chen C-H, Li K-C. Can SIR be as popular as multiple linear regression? Statistica Sinica. 1998;8:289–316.
  5. Chen X, Xu H, Yuan P, Fang F, Huss M, Vega VB, Wong E, Orlov YL, Zhang W, Jiang J, Loh YH, Yeo HC, Yeo ZX, Narang V, Govindarajan KR, Leong B, Shahab A, Ruan Y, Bourque G, Sung WK, Clarke ND, Wei CL, Ng HH. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells. Cell. 2008;133:1106–1117. doi: 10.1016/j.cell.2008.04.043.
  6. Cleveland W. Robust locally weighted regression and smoothing scatterplots. J. Amer. Statist. Assoc. 1979;74(368):829–836.
  7. Cloonan N, Forrest ARR, Kolle G, Gardiner BBA, Faulkner GJ, Brown MK, Taylor DF, Steptoe AL, Wani S, Bethel G, Robertson AJ, Perkins AC, Bruce SJ, Lee CC, Ranade SS, Peckham HE, Manning JM, McKernan KJ, Grimmond SM. Stem cell transcriptome profiling via massive-scale mRNA sequencing. Nature Methods. 2008;5(7):613–619. doi: 10.1038/nmeth.1223.
  8. Conlon E, Liu X, Lieb J, Liu J. Integrating regulatory motif discovery and genome-wide expression analysis. Proceedings of the National Academy of Sciences. 2003;100(6):3339–3344. doi: 10.1073/pnas.0630591100.
  9. Cook R. Testing predictor contributions in sufficient dimension reduction. Annals of Statistics. 2004;32(3):1062–1092.
  10. Cook RD. An Introduction to Regression Graphics. New York: Wiley; 1994.
  11. Cook RD, Nachtsheim CJ. Reweighting to achieve elliptically contoured covariates in regression. J. Amer. Statist. Assoc. 1994;89:592–599.
  12. Eaton ML. A characterization of spherical distributions. Journal of Multivariate Analysis. 1986;20:272–276.
  13. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32(2):407–499.
  14. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 2001;96(456):1348–1360.
  15. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J. Roy. Statist. Soc. Ser. B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  16. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Statistica Sinica. 2010;20:101–148.
  17. Friedman JH. Pathwise coordinate optimization. Annals of Applied Statistics. 2007;1(2):302–332.
  18. Friedman JH, Tukey JW. A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. 1974;C-23:881–889.
  19. Fung W, He X, Liu L, Shi P. Dimension reduction based on canonical correlation. Statistica Sinica. 2002;12(4):1093–1114.
  20. Hall P, Li K-C. On almost linearity of low dimensional projections from high dimensional data. The Annals of Statistics. 1993;21:867–889.
  21. Huber PJ. Projection pursuit. The Annals of Statistics. 1985;13:435–475.
  22. Johnson DS, Mortazavi A, Myers RM, Wold B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007;316(5830):1497–1502. doi: 10.1126/science.1141319.
  23. Li K-C. Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc. 1991;86:316–327.
  24. Li L. Sparse sufficient dimension reduction. Biometrika. 2007;94(3):603–613.
  25. Li L, Cook R, Nachtsheim C. Model-free variable selection. J. Roy. Statist. Soc. Ser. B. 2005;67:285–299.
  26. Matsuda T, Nakamura T, Nakao K, Arai T, Katsuki M, Heike T, Yokota T. STAT3 activation is sufficient to maintain an undifferentiated state of mouse embryonic stem cells. EMBO J. 1999;18(15):4261–4269. doi: 10.1093/emboj/18.15.4261.
  27. Miller AJ. Selection of subsets of regression variables. J. Roy. Statist. Soc. Ser. A. 1984;147(3):389–425.
  28. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods. 2008;5(7):621–628. doi: 10.1038/nmeth.1226.
  29. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Snyder M. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344–1349. doi: 10.1126/science.1158441.
  30. Ouyang Z, Zhou Q, Wong WH. ChIP-seq of transcription factors predicts absolute and differential gene expression in embryonic stem cells. Proc. Natl. Acad. Sci. USA. 2009;106:21521–21526. doi: 10.1073/pnas.0904863106.
  31. Pardal R, Molofsky AV, He S, Morrison SJ. Stem cell self-renewal and cancer cell proliferation are regulated by common networks that balance the activation of proto-oncogenes and tumor suppressors. Cold Spring Harb Symp Quant Biol. 2005;70:177–185. doi: 10.1101/sqb.2005.70.057.
  32. Shao J. An asymptotic theory for linear model selection (with discussion). Statistica Sinica. 1998;7:221–264.
  33. Tibshirani R. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B. 1996;58:267–288.
  34. Wang H. Forward regression for ultra-high dimensional variable screening. J. Amer. Statist. Assoc. 2009;104:1512–1524.
  35. Wilhelm BT, Marguerat S, Watt S, Schubert F, Wood V, Goodhead I, Penkett CJ, Rogers J, Bahler J. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature. 2008;453(7199):1239–1243. doi: 10.1038/nature07002.
  36. Zeng P, Zhu Y. An integral transform method for estimating the central mean and central subspaces. Journal of Multivariate Analysis. 2010;101(1):271–290.
  37. Zhong W, Zeng P, Ma P, Liu J, Zhu Y. Regularized sliced inverse regression for motif discovery. Bioinformatics. 2005;21(22):4169–4175. doi: 10.1093/bioinformatics/bti680.
  38. Zhou J, He X. Dimension reduction based on constrained canonical correlation and variable filtering. Ann. Statist. 2008;36(4):1649–1668.
  39. Zhu L, Miao B, Peng H. On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc. 2006;101:630–643.
  40. Zou H. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 2006;101(476):1418–1429.
