Author manuscript; available in PMC: 2011 Dec 1.
Published in final edited form as: Stat Sin. 2011;2011(21):707–730. doi: 10.5705/ss.2011.031a

ASYMPTOTIC PROPERTIES OF SUFFICIENT DIMENSION REDUCTION WITH A DIVERGING NUMBER OF PREDICTORS

Yichao Wu 1, Lexin Li 2
PMCID: PMC3228487  NIHMSID: NIHMS334794  PMID: 22140299

Abstract

We investigate asymptotic properties of a family of sufficient dimension reduction estimators when the number of predictors p diverges to infinity with the sample size. We adopt a general formulation of dimension reduction estimation through least squares regression of a set of transformations of the response. This formulation allows us to establish the consistency of reduction projection estimation. We then introduce the SCAD max penalty, along with a difference convex optimization algorithm, to achieve variable selection. We show that the penalized estimator selects all truly relevant predictors and excludes all irrelevant ones with probability approaching one, while maintaining consistent reduction basis estimation for the relevant predictors. Our work differs from most model-based selection methods in that it does not require a traditional model, and it extends existing sufficient dimension reduction and model-free variable selection approaches from the fixed p scenario to a diverging p.

Key words and phrases: Central subspace, diverging parameters, SCAD, sliced inverse regression

1. Introduction

As data with a large number of predictors prevail in many scientific fields such as computational biology, dimension reduction is becoming central to high-dimensional regression analysis of these datasets. Among many dimension reduction methodologies, research in sufficient dimension reduction (SDR), pioneered by Li (1991) and formulated by Cook (1998), has gained considerable interest in recent years. It aims to reduce the predictor dimension by a linear projection of the predictor vector while preserving full regression information. For high-dimensional data, it is often further believed that only a subset of predictors suffices to fully characterize the response-predictor relation. Toward this end, simultaneous variable selection along with dimension reduction projection can be achieved (Ni, Cook, and Tsai (2005), Ni et al. (2008), Zhou and He (2008), Bondell and Li (2009)). In this article we investigate asymptotic properties of a family of sufficient dimension reduction methods, in terms of both reduction projection estimation and variable selection, while we allow the number of predictors p to diverge as the sample size n approaches infinity.

Specifically, for regression of a univariate response $Y$ given a predictor vector $X = (X_1, \ldots, X_p)^T \in \mathbb{R}^{p \times 1}$, SDR seeks a minimum subspace $\mathcal{S}$, with a $p \times d$ basis matrix $\eta$, such that $Y \perp\!\!\!\perp X \mid \eta^T X$. Under minor conditions (Cook (1996), Yin, Li, and Cook (2008)), such a subspace uniquely exists and is a parsimonious population parameter that contains all regression information of $Y|X$. It is named the central subspace, and is denoted by $\mathcal{S}_{Y|X}$ (Cook (1998)). Since the seminal sliced inverse regression (SIR) proposed by Li (1991), there have been a variety of methods proposed to estimate $\mathcal{S}_{Y|X}$ including, for instance, sliced average variance estimation (Cook and Weisberg (1991)), directional regression (Li and Wang (2007)), constructive estimation (Xia (2007)), and sliced regression (Wang and Xia (2008)). Among those methods, SIR is perhaps the most commonly used one for estimating $\mathcal{S}_{Y|X}$, and there have been a number of elaborations on the original methodology of SIR; see, for instance, Fung et al. (2002), Yin and Cook (2002), and Cook and Ni (2006). The asymptotic properties of SIR were studied in Li (1991), Hsing and Carroll (1992), Zhu and Ng (1995), and Zhu and Fang (1996). In all those cases, however, the predictor dimension p is treated as fixed. Toward variable selection, Ni, Cook, and Tsai (2005) introduced the lasso type of penalty to SIR to select important predictors along with dimension reduction basis estimation. Zhou and He (2008) imposed the lasso penalty along with thresholding for variable filtering. Ni et al. (2008) and Bondell and Li (2009) generalized the penalized estimation idea to a family of inverse regression estimators and obtained asymptotic properties in terms of consistency in variable selection. Again, p is fixed in those studies and extension to a diverging p is by no means trivial.
Recently, there has been work on the diverging p case in the context of sufficient dimension reduction: Zhu, Miao, and Peng (2006) studied the asymptotic properties of SIR as p diverges, but their result is for SIR only, and variable selection is not studied at all; Zhu and Zhu (2009a) investigated weighted partial least squares with a diverging p, but again variable selection is not tackled; Zhu and Zhu (2009b) studied variable selection with a diverging number of predictors through inverse regression, but focused on single-index models only. By contrast, we establish asymptotic properties for a family of inverse regression estimators that includes SIR, study simultaneous dimension reduction and variable selection with a particular emphasis on the latter, and encompass more general model forms.

More specifically, we employ a general formulation of a family of SDR estimators that estimate the central subspace through least squares regression of a set of transformations of the response given the original predictors. This formulation can be viewed as a generalization of the original sliced inverse regression, and includes SIR as a special case in certain situations. Based on this formulation, we investigate the asymptotic properties of our dimension reduction basis estimator while we allow $p = p_n$ to increase with the sample size n. Under reasonable regularity conditions, we find the rate of convergence of the estimator to be $O_p(\sqrt{p/n})$.

In terms of variable selection, we adopt the SCAD type penalty that was first proposed by Fan and Li (2001), then further developed in Fan and Li (2002), Li and Liang (2008), among others, and combine it with our dimension reduction estimator. It is important to note that exclusion of a predictor in our context of reduction basis estimation requires an entire row of the corresponding basis matrix estimator to be zero simultaneously. For this purpose, we employ the SCAD max penalty. We also note that the SDR estimators generally impose no assumption on the conditional distribution $Y|X$ and thus require no traditional models. As a consequence, the penalized SDR estimators achieve variable selection in a model-free fashion. This characteristic distinguishes our result of variable selection with a diverging p from the existing literature, e.g., Fan and Peng (2004), where a parametric model and most often a homoscedastic linear model is assumed. We employ the pseudo-likelihood approach in our proof since no parametric model is imposed. Under suitable conditions, we show that our estimator achieves consistency in variable selection, i.e., the estimator selects all truly relevant predictors with probability approaching one. In addition, the basis estimator of all the relevant predictors is consistent with a $\sqrt{n}$-rate.

The rest of the article is organized as follows. In Section 2, we review a family of SDR estimators and study the convergence as p diverges. In Section 3, we propose the SCAD regularized SDR estimator for variable selection, and investigate its asymptotics with a diverging p in terms of variable selection consistency and basis estimation consistency. We also propose a difference convex algorithm for optimization. We present numerical studies in Section 4, and conclude the paper with a discussion in Section 5. Some technical proofs are given in the Appendix.

2. Dimension Reduction Basis Estimation

2.1. Dimension reduction via response transformation

Throughout, we assume the central subspace $\mathcal{S}_{Y|X}$ exists and its dimension $d = \dim(\mathcal{S}_{Y|X})$ is fixed when p → ∞. This assures that there is a well-defined population parameter as the target of our dimension reduction estimation.

By marginal standardization, if necessary, we assume $E(X) = 0$ and $\mathrm{Var}(X_j) = 1$, j = 1, …, p. Let $\Sigma = \mathrm{Cov}(X)$, and define the first moment inverse mean function $\varphi(Y) = \Sigma^{-1}E(X|Y)$. Sliced inverse regression is based upon the key observation that, if the linearity condition is satisfied, which states that $E(X \mid \eta^T X)$ is a linear function of $\eta^T X$ with $\eta$ denoting a basis of $\mathcal{S}_{Y|X}$, then $\varphi(Y) \in \mathcal{S}_{Y|X}$. The linearity condition is satisfied when X is multivariate normally distributed. Furthermore, Hall and Li (1993) proved that linear combinations of the predictors are approximately normally distributed when p → ∞ as n → ∞, which assures that the linearity condition is satisfied asymptotically. It is also interesting to note that this condition is imposed only on the marginal distribution of X, rather than the conditional distribution $Y|X$. For this reason, SIR is viewed as a model-free estimator of the central subspace.

For any function $f(Y)$ satisfying $E\{f(Y)\} = 0$, following Yin and Cook (2002) one can show that

$$E\{f(Y)\varphi(Y)\} = \Sigma^{-1}\mathrm{Cov}\{X, f(Y)\} \in \mathcal{S}_{Y|X} \tag{2.1}$$

under the linearity condition. Consequently, one can choose a series of transformation functions of the response variable, $f_1(Y), \ldots, f_h(Y)$, where h is a pre-specified number, and obtain the least squares estimates of regressing $f_k(Y)$ on X, i.e.,

$$\beta_{k0} = \arg\min_{\beta_k} E\left[\{f_k(Y) - X^T\beta_k\}^2\right], \quad k = 1, \ldots, h. \tag{2.2}$$

Write $B_0 = (\beta_{10}, \ldots, \beta_{h0}) \in \mathbb{R}^{p \times h}$; then $\mathrm{Span}(B_0) \subseteq \mathcal{S}_{Y|X}$ by (2.1). Following the usual protocol in the sufficient dimension reduction literature, we go one step further and assume the coverage condition $\mathrm{Span}(B_0) = \mathcal{S}_{Y|X}$ whenever $\mathrm{Span}(B_0) \subseteq \mathcal{S}_{Y|X}$. This condition often holds in practice; see Cook and Ni (2006) for a discussion.

There are various choices for the transformation functions $f_k(Y)$. The original SIR corresponds to choosing the slice indicator function $f_k(Y) = 1$ if Y is in slice k, and 0 otherwise, where the response Y is assumed to take h distinct values {1, …, h}. Since $E\{f_k(Y)\varphi(Y)\} = P(Y = k)\,\Sigma^{-1}E(X|Y = k)$, k = 1, …, h, we have $\mathrm{Span}(\beta_1, \ldots, \beta_h) = \Sigma^{-1}\mathrm{Span}(E(X|Y = 1), \ldots, E(X|Y = h))$, and thus it is equivalent to the traditional SIR estimator. Fung et al. (2002) suggested choosing the normalized B-spline basis functions for $f_k(Y)$; Yin and Cook (2002) suggested the normalized polynomial transformation $f_k(Y) = Y^k$ up to power h; Cook and Ni (2006) recommended choosing $f_k(Y) = Y$ if Y is in slice k, and 0 otherwise. We do not address which choice of the transformation function is the “best”; we focus on the asymptotic properties of this family of estimators in general. Moreover, the number of the transformation functions h is a tuning parameter, and although its value matters, it is generally recognized that methods based on inverse means alone are not overly sensitive to the choice of h as long as h > d (Li (1991, Remark 4.3), Cook and Ni (2006, p.71)). Since h usually takes a pre-specified small value in practice, we treat h (> d) as fixed in our asymptotic investigations.

Throughout, we assume that we have n i.i.d. realizations of the data, $\{(X_i, Y_i), i = 1, \ldots, n\}$, and h pre-specified transformation functions $f_1(\cdot), \ldots, f_h(\cdot)$ whose forms do not depend on the data. We then solve the least squares optimization

$$\hat\beta_k = \arg\min_{\beta_k} \sum_{i=1}^n \{f_k(Y_i) - X_i^T\beta_k\}^2, \quad k = 1, \ldots, h. \tag{2.3}$$

We construct the p × h matrix $\hat B = (\hat\beta_1, \ldots, \hat\beta_h)$, obtain the first d eigenvectors $(\hat v_1, \ldots, \hat v_d)$ of the matrix $h^{-1}\hat B\hat B^T$, and take $\mathrm{Span}(\hat v_1, \ldots, \hat v_d)$ as an estimate of the targeted central subspace. The structure dimension $d = \dim(\mathcal{S}_{Y|X})$ of the central subspace can be estimated by an asymptotic test (Li (1991)), a permutation test (Cook and Yin (2001)), or an information criterion (Zhu, Miao, and Peng (2006)), and as such d is treated as known in our investigation of reduction basis estimation.
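The estimation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `slice_indicators` and `sdr_basis` are hypothetical, the slices are formed from empirical quantiles of the response (an assumption), and the indicators are centered so that the sample analogue of E{f(Y)} = 0 holds.

```python
import numpy as np

def slice_indicators(y, h):
    """Centered slice-indicator transformations f_1, ..., f_h as an n x h matrix."""
    # Assumption: slices are h bins of roughly equal size, cut at empirical quantiles.
    edges = np.quantile(y, np.linspace(0.0, 1.0, h + 1)[1:-1])
    labels = np.searchsorted(edges, y, side="right")  # slice label in 0..h-1
    F = (labels[:, None] == np.arange(h)[None, :]).astype(float)
    return F - F.mean(axis=0)  # center each column

def sdr_basis(X, y, h, d):
    """Least squares estimate B_hat (p x h) and the first d eigenvectors of B_hat B_hat^T / h."""
    Xc = X - X.mean(axis=0)                          # center the predictors
    F = slice_indicators(y, h)
    B_hat, *_ = np.linalg.lstsq(Xc, F, rcond=None)   # solves (2.3) for all k at once
    evals, evecs = np.linalg.eigh(B_hat @ B_hat.T / h)
    order = np.argsort(evals)[::-1]                  # eigenvalues in decreasing order
    return B_hat, evecs[:, order[:d]]
```

Because the h least squares problems share the same design matrix, a single call to `np.linalg.lstsq` with an n × h right-hand side solves all of them together.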

2.2. Asymptotic properties

We now study the asymptotic properties of our estimator of the central subspace. We begin with a lemma that is a key for our main asymptotic result in Theorem 1.

Lemma 1

Suppose Conditions (i), (ii), and (iii) of Appendix A hold. When p(log n)/n → 0, there exists a constant $a^* > 0$ such that

$$P\left(\inf_{\|\beta\| = 1} \sum_{i=1}^n (X_i^T\beta)^2 > a^* n\right) \to 1 \quad \text{as } n \to \infty.$$

Proof of Lemma 1

For n i.i.d. standard normal errors $\varepsilon_i$, i = 1, …, n, we construct artificial data $\{(X_i, \tilde Y_i), i = 1, \ldots, n\}$, where $\tilde Y_i = X_i^T\tilde\beta + \varepsilon_i$ for some fixed $\tilde\beta \in \mathbb{R}^p$. The desired result follows by applying Lemma 3.1 of Portnoy (1984) with ψ(t) = t.

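For any function $f(\cdot)$ satisfying $E\{f(Y)\} = 0$, let

$$\beta_0 = \beta_0(f) = \arg\min_{\beta} E\{f(Y) - X^T\beta\}^2, \tag{2.4}$$
$$\hat\beta = \hat\beta(f) = \arg\min_{\beta} \sum_{i=1}^n \{f(Y_i) - X_i^T\beta\}^2. \tag{2.5}$$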

Theorem 1

Suppose Conditions (i)–(vii) of Appendix A hold. If p(log n)/n → 0, $\hat\beta$ is a consistent estimator for $\beta_0$ with $\|\hat\beta - \beta_0\| = O_p(\sqrt{p/n})$.

The proof of this theorem is given in Appendix B. It serves as a basis for our consistency result for estimating the central subspace.

Corollary 1

Consider a set of response transformation functions, $\{f_1(\cdot), \ldots, f_h(\cdot)\}$, each of which satisfies Conditions (vi) and (vii) of Appendix A. Under Conditions (i)–(v) of Appendix A, we have $\|h^{-1}(\hat B\hat B^T - B_0B_0^T)\| = O_p(\sqrt{p/n})$, with $B_0$ and $\hat B$ as at (2.2) and (2.3).

Proof of Corollary 1

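Note first that we assume h is finite and fixed. Consequently $\|B_0\| = O(1)$. Theorem 1 implies that the Frobenius norm of $\hat B - B_0$ satisfies $\|\hat B - B_0\| = O_p(\sqrt{p/n})$. Writing $\hat B\hat B^T - B_0B_0^T = (\hat B - B_0)(\hat B - B_0)^T + B_0(\hat B - B_0)^T + (\hat B - B_0)B_0^T$, we therefore have

$$\begin{aligned}
\|h^{-1}(\hat B\hat B^T - B_0B_0^T)\| &\le h^{-1}\left(\|(\hat B - B_0)(\hat B - B_0)^T\| + \|B_0(\hat B - B_0)^T\| + \|(\hat B - B_0)B_0^T\|\right) \\
&\le h^{-1}\left(\|\hat B - B_0\|\,\|(\hat B - B_0)^T\| + \|B_0\|\,\|(\hat B - B_0)^T\| + \|\hat B - B_0\|\,\|B_0^T\|\right) \\
&\le h^{-1}\left(O_p\big(\sqrt{p/n}\big)^2 + O(1)\,O_p\big(\sqrt{p/n}\big) + O(1)\,O_p\big(\sqrt{p/n}\big)\right) = O_p\big(\sqrt{p/n}\big).
\end{aligned}$$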

Remark 1

Zhu, Miao, and Peng (2006) studied the asymptotics of the original SIR estimator when p diverges. In their study, they fixed the number of sample points in each slice while letting the number of slices h → ∞ as n → ∞. In our study, this notion of fixed number of observations per slice no longer applies for a choice of transformation functions other than the indicator function. Besides, since in practice h is pre-determined, we choose to treat h as fixed in our asymptotic investigations. For these reasons, our consistency rate is not directly comparable to that obtained by Zhu, Miao, and Peng (2006) for the original SIR, while our result goes beyond SIR and applies to the entire family of SDR estimators based on the first inverse moment φ(Y), as discussed in Section 2.1.

We can further bound the estimation error of the first d eigenvectors of $h^{-1}\hat B\hat B^T$ when the first d eigenvalues of $h^{-1}B_0B_0^T$ are distinct. Let $A_0 = h^{-1}B_0B_0^T$, $\hat A = h^{-1}\hat B\hat B^T$, and $E = \hat A - A_0$. Denote the eigenvalues and eigenvectors of $A_0$ by $\{\lambda_j, v_j\}$, j = 1, …, p.

Theorem 2

Suppose $\lambda_1, \ldots, \lambda_d$ are distinct. Under the conditions of Corollary 1, the estimated eigenvectors $\hat v_j$ satisfy

$$\sqrt{1 - (v_j^T\hat v_j)^2} = O_p\big(\sqrt{p/n}\big), \quad j = 1, \ldots, d.$$

Proof of Theorem 2

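For j = 1, …, d, we rearrange the order of the first d elements and write $Q_j = (v_j, v_{j+1}, \ldots, v_d, v_1, v_2, \ldots, v_{j-1}, v_{d+1}, v_{d+2}, \ldots, v_p) = (v_j, Q_{j2})$. Then $Q_j^T A_0 Q_j = \mathrm{diag}(\lambda_j, \lambda_{j+1}, \ldots, \lambda_d, \lambda_1, \lambda_2, \ldots, \lambda_{j-1}, \lambda_{d+1}, \lambda_{d+2}, \ldots, \lambda_p)$. Next write $Q_j^T E Q_j = \begin{pmatrix} \epsilon_j & \varepsilon_j^T \\ \varepsilon_j & E_{22j} \end{pmatrix}$, where $\epsilon_j$ is a scalar and $\varepsilon_j \in \mathbb{R}^{p-1}$, and let $a_j = \min_{i \ne j} |\lambda_i - \lambda_j| > 0$. Note that $\|E\| = O_p(\sqrt{p/n})$ due to Corollary 1. By applying Theorem 8.1.12 of Golub and van Loan (1996), there exists $p_j \in \mathbb{R}^{p-1}$ satisfying $\|p_j\| \le 4\|\varepsilon_j\|/a_j$ such that $\hat v_j = (v_j + Q_{j2}p_j)/\sqrt{1 + p_j^Tp_j}$ is a unit eigenvector of $\hat A = A_0 + E$. Furthermore, $\sqrt{1 - (v_j^T\hat v_j)^2} \le 4\|\varepsilon_j\|/a_j$. Since $\|\varepsilon_j\| \le \|Q_j^TEQ_j\| \le \|Q_j^T\|\,\|E\|\,\|Q_j\| = \|E\| = O_p(\sqrt{p/n})$, the result follows.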

3. Variable Selection

3.1. Regularization via the SCAD max penalty

When the number of predictors p is large in a regression analysis, regularization is often employed to add numerical stability, to improve statistical robustness, and to achieve variable selection. In the context of model-based variable selection, there has been an extensive literature on model selection via regularization for the Lasso (Tibshirani (1996)), the SCAD (Fan and Li (2001)), the nonnegative garrote (Breiman (1995)), and the adaptive Lasso (Zou (2006)), among many others. In particular, Fan and Li (2001) first demonstrated that the SCAD penalty possesses the oracle properties in the sense that the regularized estimator correctly selects predictors with nonzero coefficients in the model, excludes those with zero coefficients with probability approaching one, and estimates those nonzero coefficients with the asymptotic distribution they would have if all the zero coefficients were known in advance. Fan and Peng (2004) established these properties of the SCAD for linear models, and Zhu and Zhu (2009b) employed the SCAD for variable selection in single-index models, both with a diverging p. Here we adopt the SCAD max penalty for the purpose of variable selection when p tends to infinity, but we do not impose any parametric or semi-parametric models.

Before pursuing variable selection in the framework of sufficient dimension reduction, we first note that the notions of relevant and irrelevant variables need to be clearly defined, since in SDR estimation no parametric model is imposed. Toward that end, Cook (2004) and Bondell and Li (2009) showed that, as long as the central subspace $\mathcal{S}_{Y|X}$ exists, there exists a unique partition of the predictors $X = (X_+^T, X_-^T)^T$, $X_+ \in \mathbb{R}^{q \times 1}$, and $X_- \in \mathbb{R}^{(p-q) \times 1}$, such that

$$Y \perp\!\!\!\perp X_- \mid X_+. \tag{3.1}$$

Thus the regression of Y on X only relies on the set of predictors $X_+$, which we call the relevant variables, while $X_-$ is irrelevant. Without loss of generality, we assume that $X_+$ consists of the first q predictors. Moreover we assume the number of relevant predictors q is fixed as p → ∞. That is, we regard all regression information as concentrated on a fixed number of predictors, with the remaining variables as nuisance information. We consider this condition reasonable, based upon the belief that, in many real applications, increasing the number of predictors after a certain stage does not necessarily induce an increasing amount of useful information. We then have a well-defined population target for the purpose of variable selection in the absence of a traditional model.

Predictor partition as in (3.1) can be directly connected with the basis $\eta$ of the central subspace; that is, the last p − q rows of $\eta$ must all be zeros (Cook and Ni (2006, Prop. 1)). It also leads to the following lemma in our context of least squares estimation of the central subspace.

Lemma 2

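For $\beta_0$ at (2.4), we have $\beta_0 = (\beta_{+0}^T, 0_{(p-q) \times 1}^T)^T$ under the partition at (3.1), where $\beta_{+0} = \arg\min_{\beta_+ \in \mathbb{R}^q} E\{f(Y) - X_+^T\beta_+\}^2$, when the linearity condition is satisfied.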

Proof of Lemma 2

Under the linearity condition, we have $\beta_0 \in \mathcal{S}_{Y|X}$, so that $\beta_0$ can be written as a linear combination of the columns of the central subspace basis $\eta$. Since the last p − q rows of $\eta$ must all be zeros, the result follows.

The class of SDR estimators studied in Section 2 yields linear combinations of all the original predictors and thus performs no variable selection. We introduce a non-concave penalty to achieve selection of relevant predictors. For a set of transformation functions, $f_1, \ldots, f_h$, define the negative pseudo loglikelihood function

$$L_k(\beta_k) = n^{-1}\sum_{i=1}^n \{f_k(Y_i) - X_i^T\beta_k\}^2.$$

Applying the max-type penalty, we propose to minimize

$$Q(B) = \sum_{k=1}^h L_k(\beta_k) + \sum_{j=1}^p p_\lambda\Big(\max_{1 \le k \le h} |\beta_{jk}|\Big) \tag{3.2}$$

over $B = (\beta_1, \ldots, \beta_h)$, where $\beta_{jk}$ is the j-th element of $\beta_k \in \mathbb{R}^{p \times 1}$, j = 1, …, p, k = 1, …, h. Here $p_\lambda(\theta)$ is a general penalty function indexed by a regularization parameter λ. For now we simply assume $p_\lambda$ is symmetric, singular at the origin, and non-decreasing and concave on [0, ∞). Later in this section, we introduce a specific non-concave form, the SCAD penalty function, for $p_\lambda(\theta)$.

Two observations are noteworthy here. First, the minimization in (3.2) is over the entire p×h matrix B, since the penalty is imposed on the maximum over each row of B. This is different from the dimension reduction basis estimation without regularization as discussed in Section 2.1, where the minimization is carried over each column βk of B individually. Second, variable selection achieved through (3.2) requires no dimension reduction basis estimation as a preprocessing step, and thus requires no knowledge of the structural dimension d either. For this reason, the penalty term in (3.2) has ph parameters rather than pd parameters. Selection is done essentially in one step instead of two steps, which to some degree mitigates the dependency of variable selection on the accuracy of reduction basis estimation, and can be particularly useful if model-free variable selection is the sole purpose of the study.

With a slight abuse of notation, we denote the minimizer of (3.2) as $\hat B = (\hat\beta_1, \ldots, \hat\beta_h)$, and denote the minimizer of the corresponding population version $\sum_{k=1}^h E\{f_k(Y) - \beta_k^TX\}^2$ as $B_0 = (\beta_{10}, \ldots, \beta_{h0})$. We use $\hat B_+ = (\hat\beta_{1+}, \ldots, \hat\beta_{h+})$ to denote the submatrix of $\hat B$ that consists of its first q rows, and similarly denote the first q rows of $B_0$ as $B_{+0} = (\beta_{1+0}, \ldots, \beta_{h+0})$. We next aim to show that $\hat\beta_{k+} \to \beta_{k+0}$ as n → ∞, and that the j-th element $\hat\beta_{jk}$ of $\hat\beta_k$ satisfies $P(\hat\beta_{jk} = 0) \to 1$ for j > q, k = 1, …, h.

3.2. Asymptotic properties

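Let $\lambda = \lambda_n$. For a general non-concave penalty function $p_{\lambda_n}(\cdot)$, let $a_n = \max_{1 \le j \le p} p'_{\lambda_n}(\max_{1 \le k \le h} |\beta_{jk0}|)$ and $b_n = \max_{1 \le j \le p} p''_{\lambda_n}(\max_{1 \le k \le h} |\beta_{jk0}|)$, where $\beta_{jk0}$ is the j-th element of $\beta_{k0}$, and $p'_{\lambda_n}(\cdot)$ and $p''_{\lambda_n}(\cdot)$ denote the first and second order derivatives, respectively.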

Lemma 3

Suppose X satisfies Conditions (iv) and (v), and that each of the response transformation functions $f_1(\cdot), \ldots, f_h(\cdot)$ satisfies Conditions (vi) and (vii) in Appendix A. If $p = o(n^{1/4})$, and the penalty function $p_{\lambda_n}(\cdot)$ has $a_n = O(1/\sqrt{n})$ and $b_n = o(1)$, then there exists a local minimizer $\hat B = (\hat\beta_1, \ldots, \hat\beta_h)$ of Q(B) in (3.2) such that $\|\hat\beta_k - \beta_{k0}\| = O_p(\sqrt{p}\,(n^{-1/2} + a_n))$, k = 1, 2, …, h.

Lemma 4

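Suppose that each of the response transformation functions, $f_1(\cdot), \ldots, f_h(\cdot)$, satisfies Conditions (vi) and (vii) in Appendix A, and that $p_{\lambda_n}$ satisfies $\liminf_{n \to \infty} \liminf_{\theta \to 0+} p'_{\lambda_n}(\theta)/\lambda_n > 0$. If $\lambda_n \to 0$, $\sqrt{p/n}/\lambda_n \to 0$, and $p = o(n^{1/4})$ as n → ∞, then for any given q × h submatrix $B_+ = (\beta_{1+}, \ldots, \beta_{h+})$ satisfying $\|\beta_{k+} - \beta_{k+0}\| = O_p(\sqrt{p/n})$, k = 1, …, h, and any (p − q) × h submatrix $B_- = (\beta_{1-}, \ldots, \beta_{h-})$ satisfying $\|\beta_{k-}\| \le C\sqrt{p/n}$ for a constant C, k = 1, …, h, with probability tending to one,

$$Q\big((B_+^T, 0_{(p-q) \times h}^T)^T\big) = \min_{\|\beta_{k-}\| \le C\sqrt{p/n}} Q\big((B_+^T, B_-^T)^T\big).$$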

The proofs of these two lemmas are given in Appendix B.

Theorem 3

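Under the conditions of Lemmas 3 and 4, with probability tending to one, the $\sqrt{p/n}$-consistent local minimizer $\hat B$ of Q(B) satisfies

  1. $\hat\beta_{jk} = 0$ for j > q and 1 ≤ k ≤ h;

  2. $\hat\beta_{jk}$ for 1 ≤ j ≤ q and 1 ≤ k ≤ h have the same asymptotic distribution as the minimizers of
    $$\tilde Q(B_+) = n^{-1}\sum_{k=1}^h\sum_{i=1}^n \big(f_k(Y_i) - X_{i+}^T\beta_{k+}\big)^2 + \sum_{j=1}^q p_{\lambda_n}\Big(\max_{1 \le k \le h} |\beta_{jk}|\Big)$$

    over $B_+ = (\beta_{1+}, \ldots, \beta_{h+})$, where $\beta_{jk}$ is the j-th element of $\beta_{k+} \in \mathbb{R}^{q \times 1}$, j = 1, …, q, k = 1, …, h, and $X_{i+}$ is the i-th observation of $X_+$.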

Proof of Theorem 3

By Lemma 3, there exists a $\sqrt{p/n}$-consistent local minimizer $\hat B$ of Q(B). Part (1) holds by Lemma 4, that is, $\hat B = (\hat B_+^T, 0_{(p-q) \times h}^T)^T$ with probability tending to one. Consequently, with probability tending to one, we are in effect minimizing $\tilde Q(\cdot)$. Then part (2) follows.

Remark 2

The asymptotic distributional result is given in a way similar to that in Knight and Fu (2000). For a non-concave max penalty, in general, the explicit asymptotic normality result, as in Fan and Li (2001) and Fan and Peng (2004), is not available because there may exist a tie $|\beta_{jk0}| = |\beta_{jk'0}|$ for some 1 ≤ j ≤ q and k ≠ k′. For some specific non-concave max penalty, the asymptotic normality result is possible, as we discuss next.

We introduce a specific form of a non-concave penalty function, the SCAD penalty first proposed by Fan and Li (2001). Define a penalty function $p_{\lambda_n}(\theta)$ through its first derivative

$$p'_{\lambda_n}(\theta) = \lambda_n\left\{I(\theta \le \lambda_n) + \frac{(a\lambda_n - \theta)_+}{(a-1)\lambda_n}\,I(\theta > \lambda_n)\right\}, \quad \theta \ge 0, \tag{3.3}$$

where a is an additional parameter. It is easy to see that this function satisfies the non-concave penalty condition. Note that $p_{\lambda_n}(\theta)$ flattens out for $|\theta| > a\lambda_n$. Consequently, $a_n = 0$ and $b_n = 0$ as long as $\lambda_n < a^{-1}\max_{1 \le j \le q,\, 1 \le k \le h} |\beta_{jk0}|$. This feature enables us to refine the result of Theorem 3, and leads to the following corollary.
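To make the shape of the SCAD penalty concrete, the following sketch evaluates the derivative in (3.3) and the penalty obtained by integrating it from 0. The function names are hypothetical, and the default a = 3.7 is simply the value used for the illustration in Figure 1; the closed-form pieces follow from elementary integration of (3.3).

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """First derivative (3.3) of the SCAD penalty, for theta >= 0."""
    theta = np.asarray(theta, dtype=float)
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty: the integral of (3.3) from 0 to |theta|."""
    t = np.abs(np.asarray(theta, dtype=float))
    linear = lam * t                                             # 0 <= t <= lam
    quad = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1.0))   # lam < t <= a*lam
    flat = (a + 1.0) * lam**2 / 2.0                              # t > a*lam
    return np.where(t <= lam, linear, np.where(t <= a * lam, quad, flat))
```

Beyond aλ the derivative is exactly zero and the penalty is constant at (a + 1)λ²/2, which is the flatness property used in the refinement that follows.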

Corollary 2

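For the SCAD penalty, $a_n = 0$ and $b_n = 0$ when $\lambda_n < a^{-1}\max_{1 \le j \le q,\, 1 \le k \le h} |\beta_{jk0}|$. Then under the conditions of Lemmas 3 and 4, with probability tending to one, the $\sqrt{p/n}$-consistent local minimizer $\hat B$ of Q(B) satisfies

  1. $\hat\beta_{jk} = 0$ for j > q and 1 ≤ k ≤ h;

  2. $\sqrt{n}\,(\hat\beta_{k+} - \beta_{k+0}) \to N(0, \Sigma_+^{-1}\Sigma_{k+}\Sigma_+^{-1})$ for k = 1, …, h, where $\Sigma_+ = \mathrm{Cov}(X_+)$ and $\Sigma_{k+} = \mathrm{Var}\{f_k(Y)X_+\}$.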

Proof of Corollary 2

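It is straightforward to verify that the SCAD penalty satisfies all the penalty-related conditions in Theorem 3. Since $\hat\beta_{k+}$, k = 1, …, h, are consistent, and $\lambda_n < a^{-1}\max_{1 \le j \le q,\, 1 \le k \le h} |\beta_{jk0}|$ asymptotically, we are optimizing $\tilde Q(B_+)$ in a neighborhood of $(\beta_{1+}, \ldots, \beta_{h+})$ satisfying $\max_{1 \le k \le h} |\beta_{jk}| > a\lambda_n$ for j = 1, …, q. Correspondingly, $p_{\lambda_n}(\max_{1 \le k \le h} |\beta_{jk}|)$ reduces to $p_{\lambda_n}(a\lambda_n) = (a+1)\lambda_n^2/2$, which does not depend on $\beta_{k+}$, k = 1, …, h. As such,

$$\arg\min_{\beta_{1+}, \ldots, \beta_{h+}} \frac{1}{n}\sum_{k=1}^h\sum_{i=1}^n \{f_k(Y_i) - X_{i+}^T\beta_{k+}\}^2 + \sum_{j=1}^q p_{\lambda_n}\Big(\max_{1 \le k \le h} |\beta_{jk}|\Big)$$

is the same as

$$\arg\min_{\beta_{1+}, \ldots, \beta_{h+}} \sum_{k=1}^h\sum_{i=1}^n \frac{1}{n}\{f_k(Y_i) - X_{i+}^T\beta_{k+}\}^2.$$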

The desired result follows.

Remark 3

Corollary 2 is a special case of Theorem 3 since the SCAD penalty function is a special case of the general non-concave penalty function. This refined result is possible because the SCAD function is flat when its argument is larger than $a\lambda_n$ in magnitude. Consequently there is no asymptotic bias in using $\hat B_+$ to estimate $B_{+0}$. This is in a similar spirit to the result of Theorem 2 of Fan and Li (2001).

Remark 4

We obtain the $\sqrt{n}$-rate for dimension reduction basis estimation after variable selection because the number of truly relevant predictors q is assumed fixed. Consequently, with the SCAD regularized estimator selecting all truly relevant predictors and excluding all irrelevant ones with probability tending to one, the basis estimation based on those relevant predictors achieves a $\sqrt{n}$-rate.

Remark 5

Our results differ from those of Fan and Li (2001) and Fan and Peng (2004), in that they require a parametric linear model and all results hinge on the model being correctly specified. By contrast, our approach does not require a traditional model, and our technical proofs are based on the pseudo-likelihood function.

3.3. Optimization algorithm

We propose an algorithm to minimize Q(B) in (3.2). Note that the SCAD type penalty is non-concave, and thus it requires a specially designed optimization algorithm. In the literature, there exist a number of such algorithms, including local quadratic approximation (Fan and Li (2001)), the minorize-maximize algorithm (Hunter and Li (2005)), local linear approximation (Zou and Li (2008)), and the difference convex algorithm (DC; An and Tao (1997), Wu and Liu (2009)). For our problem, we employ the DC algorithm, which solves a non-convex optimization problem via a sequence of convex optimizations by decomposing the objective function as the difference of two convex functions.

For the SCAD penalty, we note that its first derivative as given in (3.3) can be decomposed as $p'_\lambda(\theta) = p'_{\lambda,1}(\theta) - p'_{\lambda,2}(\theta)$, where $p'_{\lambda,1}(\theta) = \lambda$ is a constant and $p'_{\lambda,2}(\theta) = \lambda[1 - (a\lambda - \theta)_+/\{(a-1)\lambda\}]\,I(\theta > \lambda)$ is a non-decreasing function on the range θ > 0. Accordingly, the SCAD penalty function can be decomposed as $p_\lambda(\theta) = p_{\lambda 1}(\theta) - p_{\lambda 2}(\theta)$, where both $p_{\lambda 1}(\cdot)$ and $p_{\lambda 2}(\cdot)$ are convex, with $p'_{\lambda,1}(\theta)$ and $p'_{\lambda,2}(\theta)$ as the derivative, respectively. Figure 1 illustrates such a decomposition for a SCAD function with a particular set of parameters, a = 3.7 and λ = 2, where the left panel plots $p_{\lambda 1}(\theta)$, the central panel $p_{\lambda 2}(\theta)$, and the right panel $p_\lambda(\theta) = p_{\lambda 1}(\theta) - p_{\lambda 2}(\theta)$.

Figure 1. Decomposition of the SCAD penalty as $p_\lambda(\theta) = p_{\lambda 1}(\theta) - p_{\lambda 2}(\theta)$, with parameters λ = 2 and a = 3.7.
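The decomposition shown in Figure 1 can be checked numerically. The sketch below, with hypothetical function names, takes the first convex part to be λ|θ| and the second to be λ|θ| minus the SCAD penalty, written out piecewise; the closed forms are derived here by integration and are not quoted from the paper.

```python
import numpy as np

def scad_convex_parts(theta, lam, a=3.7):
    """Return (p1, p2) such that the SCAD penalty equals p1 - p2, both convex in theta."""
    t = np.abs(np.asarray(theta, dtype=float))
    p1 = lam * t                                         # L1-type convex part
    p2 = np.where(
        t <= lam,
        0.0,                                             # no correction near the origin
        np.where(
            t <= a * lam,
            (t - lam) ** 2 / (2.0 * (a - 1.0)),          # quadratic transition
            lam * t - (a + 1.0) * lam**2 / 2.0,          # linear part, slope lam
        ),
    )
    return p1, p2

def scad_p2_derivative(theta, lam, a=3.7):
    """Derivative of p2 for theta >= 0; it rises from 0 to lam, so p2 is convex."""
    theta = np.asarray(theta, dtype=float)
    inner = 1.0 - np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * inner * (theta > lam)
```

With λ = 2 and a = 3.7, evaluating `scad_convex_parts` on a grid of θ values gives the three curves described in the caption: `p1`, `p2`, and their difference.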

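We next decompose the objective function in (3.2) as $Q(B) = Q_{vex}(B) + Q_{cav}(B)$, where

$$Q_{vex}(B) = \sum_{k=1}^h L_k(\beta_k) + \sum_{j=1}^p p_{\lambda,1}\Big(\max_{1 \le k \le h} |\beta_{jk}|\Big), \qquad Q_{cav}(B) = -\sum_{j=1}^p p_{\lambda,2}\Big(\max_{1 \le k \le h} |\beta_{jk}|\Big).$$

We initialize $B = B^{(0)}$ and then update B iteratively. At the (t+1)-th step, the DC algorithm uses a linear function $-\sum_{j=1}^p p'_{\lambda,2}(\max_{1 \le k \le h} |\beta_{jk}^{(t)}|)\,(\max_{1 \le k \le h} |\beta_{jk}| - \max_{1 \le k \le h} |\beta_{jk}^{(t)}|)$ to approximate the concave part $Q_{cav}(B)$, where $\beta_{jk}^{(t)}$ denotes the (j, k)-th element of the solution $B^{(t)}$ from the t-th step. Then minimizing Q(B) amounts to solving

$$B^{(t+1)} = \arg\min_B \left\{Q_{vex}(B) - \sum_{j=1}^p p'_{\lambda,2}\Big(\max_{1 \le k \le h} |\beta_{jk}^{(t)}|\Big)\Big(\max_{1 \le k \le h} |\beta_{jk}| - \max_{1 \le k \le h} |\beta_{jk}^{(t)}|\Big)\right\}. \tag{3.4}$$

Optimization in (3.4) can be further formulated as a quadratic programming problem by letting $\xi_j^{(t)} = \max_{1 \le k \le h} |\beta_{jk}^{(t)}|$, then minimizing

$$n^{-1}\sum_{k=1}^h\sum_{i=1}^n \big(f_k(Y_i) - X_i^T\beta_k\big)^2 + \sum_{j=1}^p \big(\lambda - p'_{\lambda,2}(\xi_j^{(t)})\big)\,\xi_j$$

over $B = (\beta_1, \ldots, \beta_h)$ and $\xi_1, \ldots, \xi_p$, subject to $\xi_j \ge \beta_{jk}$ and $\xi_j \ge -\beta_{jk}$, j = 1, …, p, k = 1, …, h. Existing software is available to solve this quadratic programming problem.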
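The DC iteration can be sketched as follows. This is not the quadratic programming route described above: as a stand-in that needs only NumPy, each convex subproblem is solved by proximal gradient descent, using the fact that the proximal map of a multiple of the row-wise maximum norm equals the identity minus a projection onto an L1 ball. All function names, the iteration counts, and the choice of unpenalized least squares as the starting value are assumptions made for illustration (the latter requires p < n).

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """SCAD derivative (3.3); equals lam - p'_{lam,2}(theta), the DC weight."""
    theta = np.asarray(theta, dtype=float)
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def project_l1_ball(v, radius):
    """Euclidean projection of v onto {u : ||u||_1 <= radius} (sort-based method)."""
    if radius <= 0:
        return np.zeros_like(v)
    u = np.abs(v)
    if u.sum() <= radius:
        return v.copy()
    s = np.sort(u)[::-1]
    css = np.cumsum(s)
    k = np.arange(1, len(s) + 1)
    rho = np.nonzero(s * k > css - radius)[0][-1]
    tau = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(u - tau, 0.0)

def prox_rows_linf(V, thresholds):
    """Row-wise proximal map of thresholds[j] * max_k |V[j, k]| (Moreau decomposition)."""
    return np.vstack([v - project_l1_ball(v, t) for v, t in zip(V, thresholds)])

def dc_scad_max(X, F, lam, a=3.7, outer=50, inner=500, tol=1e-4):
    """Minimize n^{-1} ||F - X B||_F^2 + sum_j SCAD(max_k |B[j, k]|) by DC iterations."""
    n, p = X.shape
    lip = 2.0 * np.linalg.eigvalsh(X.T @ X / n)[-1]   # Lipschitz constant of the gradient
    B = np.linalg.lstsq(X, F, rcond=None)[0]          # B^(0): unpenalized least squares
    for _ in range(outer):
        xi = np.abs(B).max(axis=1)                    # xi_j^(t) = max_k |beta_jk^(t)|
        w = scad_derivative(xi, lam, a)               # weight on xi_j in the subproblem
        B_new = B.copy()
        for _ in range(inner):                        # proximal gradient on the convex subproblem
            grad = -2.0 / n * X.T @ (F - X @ B_new)
            B_new = prox_rows_linf(B_new - grad / lip, w / lip)
        converged = np.abs(B_new - B).sum() < tol     # stopping rule on the total change
        B = B_new
        if converged:
            break
    return B
```

Here `F` is the n × h matrix whose k-th column holds the transformed responses $f_k(Y_i)$. A row of the returned matrix is exactly zero whenever the proximal step shrinks it to the origin, which is how a predictor is excluded.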

Hunter and Li (2005) studied the convergence property of their minorize-maximize (MM) algorithm for the SCAD penalty. Our DC solution can also be viewed as an instance of their MM algorithm, since we replace the concave part $Q_{cav}(B)$ by its affine approximation at each iteration. As the objective function Q(B) is nonnegative, by the descent property of the MM algorithm, our DC algorithm is bound to converge to an ε-local minimizer in finitely many steps. Practically, we deem the algorithm convergent if $\sum_{k=1}^h\sum_{j=1}^p |\beta_{jk}^{(t)} - \beta_{jk}^{(t+1)}|$ is sufficiently small, e.g., less than $10^{-4}$.

4. Numerical Studies

In this section, we examine the finite sample performance of the proposed method using both simulations and a data example. We employed a BIC type criterion to select the tuning parameter λ for the SCAD penalty, $\sum_{k=1}^h\sum_{i=1}^n (f_k(y_i) - X_i^T\hat\beta_k(\lambda))^2 + n(\lambda)\log n$, where $n(\lambda) = \#\{j : \max_{1 \le k \le h} |\hat\beta_{jk}(\lambda)| > 0\}$ denotes the number of active predictors at λ. The BIC criterion has been commonly used in regularized variable selection, e.g., Wang, Li and Tsai (2007). For transformation functions, we implemented the slice indicator function that gives the usual SIR estimate, and the B-spline basis function suggested in Fung et al. (2002). For the former, we fixed the number of slices at h = 5 and, for the latter, we used a linear spline with three inner knots, which also yields h = 5.
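As a small illustration of how such a criterion could be evaluated for one candidate λ, the sketch below computes the residual sum of squares plus the number of active predictors times log n, following the criterion as displayed above. The function name `bic_criterion` is hypothetical, `F` denotes the n × h matrix of transformed responses, and `B_hat` is the p × h estimate obtained at that λ.

```python
import numpy as np

def bic_criterion(X, F, B_hat):
    """Residual sum of squares plus (number of active predictors) * log(n)."""
    n = X.shape[0]
    rss = float(np.sum((F - X @ B_hat) ** 2))
    # A predictor is active when its row of B_hat is not identically zero.
    n_active = int(np.sum(np.abs(B_hat).max(axis=1) > 0))
    return rss + n_active * np.log(n), n_active
```

In use, one would compute this value over a grid of λ values and keep the λ with the smallest criterion.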

4.1. Simulations

For Examples 4.1 and 4.2, we generated independent $X_j$ from the standard normal. We also considered correlated predictors with $\mathrm{Corr}(X_i, X_j) = 0.5^{|i-j|}$, 1 ≤ i, j ≤ p.
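One way to generate predictors with this correlation structure is to multiply independent standard normal draws by a Cholesky factor of the target correlation matrix. The sketch below assumes multivariate normal predictors with unit variances; the function name `ar1_predictors` is hypothetical.

```python
import numpy as np

def ar1_predictors(n, p, rho, rng):
    """Draw n rows from N(0, Sigma) with Sigma[i, j] = rho ** |i - j|."""
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    chol = np.linalg.cholesky(Sigma)          # Sigma = chol @ chol.T
    Z = rng.standard_normal((n, p))           # independent standard normals
    return Z @ chol.T, Sigma
```

Setting `rho = 0.5` gives the correlated design, and `rho = 0.0` recovers independent standard normal predictors.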

Example 4.1

Here

$$Y = \frac{X^T\beta_1}{0.5 + (1.5 + X^T\beta_2)^2} + 1.2\varepsilon,$$

where ε ~ Normal(0, 1) is independent of X. In this model the structural dimension d = 2. We chose $\beta_1 = (1, 1, 0, \ldots, 0)^T$ and $\beta_2 = (0, 0, 1, 1, 0, \ldots, 0)^T$. We considered n = 400, p = 20 and n = 800, p = 40. We employed the vector correlation coefficient (Hotelling (1936)) to evaluate the accuracy of the dimension reduction basis estimation; it ranges between 0 and 1, with a larger value indicating a better estimate. Results based on 100 data replications are reported in Table 1 (left half), where the mean and standard deviation (in parentheses) of the vector correlations between the true and the estimated central subspace basis are shown. We compared the usual SDR estimator without penalty and the one with the SCAD max penalty. Due to the sparse nature of the central subspace basis, the penalized SDR estimator achieved a better estimation accuracy. To evaluate the performance in terms of variable selection, we employed the true positive rate and the false positive rate, a pair of criteria that are commonly used in biomedical research. Table 2 (left half) reports the average results of the penalized SDR estimator. It is clearly seen that all truly relevant predictors were selected, while the false positive rate was low. Moreover, the two choices of transformation functions had similar empirical performance in this example.
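The simulation design and the evaluation criteria can be sketched as follows. The function names are hypothetical. The vector correlation is computed here as the product of the canonical correlations between the two spans, which is one common reading of Hotelling's coefficient and is an assumption about the exact variant used; the noise scale defaults to the value shown in the model display.

```python
import numpy as np

def simulate_example_41(n, p, rng, sigma=1.2):
    """Generate (X, y) from the model of Example 4.1 with independent predictors."""
    X = rng.standard_normal((n, p))
    beta1 = np.zeros(p)
    beta1[:2] = 1.0
    beta2 = np.zeros(p)
    beta2[2:4] = 1.0
    eps = rng.standard_normal(n)
    y = X @ beta1 / (0.5 + (1.5 + X @ beta2) ** 2) + sigma * eps
    return X, y, np.column_stack([beta1, beta2])

def vector_correlation(A, B):
    """Product of canonical correlations between Span(A) and Span(B); lies in [0, 1]."""
    Qa, _ = np.linalg.qr(A)   # orthonormal basis of Span(A)
    Qb, _ = np.linalg.qr(B)   # orthonormal basis of Span(B)
    rho = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return float(np.prod(rho))

def tpr_fpr(selected, relevant):
    """True and false positive rates of a selected set against the relevant set."""
    selected = np.asarray(selected, dtype=bool)
    relevant = np.asarray(relevant, dtype=bool)
    tpr = (selected & relevant).sum() / relevant.sum()
    fpr = (selected & ~relevant).sum() / (~relevant).sum()
    return float(tpr), float(fpr)
```

The vector correlation depends only on the two subspaces, so it is unchanged when either basis is replaced by another basis of the same span.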

Table 1.

Evaluation of dimension reduction basis estimation for Examples 4.1 and 4.2. Reported are the mean and standard deviation (in parentheses) of the vector correlation coefficients.


Example 4.1 with d = 2
Example 4.2 with d = 3
Slicing
Spline
Slicing
Spline
p = 20, n = 400
p = 20, n = 600
w/o penalty 0.92 (0.03) 0.88 (0.04) 0.88 (0.03) 0.85 (0.06)
SCAD 0.98 (0.02) 0.92 (0.11) 0.96 (0.03) 0.96 (0.04)


p = 40, n = 800
p = 40, n = 1, 200
w/o penalty 0.92 (0.02) 0.87 (0.03) 0.87 (0.02) 0.84 (0.04)
SCAD 0.99 (0.01) 0.99 (0.01) 0.98 (0.01) 0.98 (0.01)
Table 2.

Evaluation of variable selection for Examples 4.1 and 4.2. Reported are the mean and standard deviation (in parentheses) of true positive rate (TPR) and false positive rate (FPR).


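Setting | Criterion | Example 4.1 (d = 2), Slicing | Example 4.1 (d = 2), Spline | Example 4.2 (d = 3), Slicing | Example 4.2 (d = 3), Spline
p = 20; n = 400 (Example 4.1), n = 600 (Example 4.2) | TPR | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
p = 20; n = 400 (Example 4.1), n = 600 (Example 4.2) | FPR | 0.04 (0.05) | 0.01 (0.02) | 0.02 (0.05) | 0.04 (0.04)
p = 40; n = 800 (Example 4.1), n = 1,200 (Example 4.2) | TPR | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
p = 40; n = 800 (Example 4.1), n = 1,200 (Example 4.2) | FPR | 0.00 (0.00) | 0.02 (0.02) | 0.06 (0.04) | 0.00 (0.01)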

Example 4.2

Here Y = sign(XT β1) log |XT β2 + 5| + XT β3 + 0.2ε, where ε ~ Normal(0, 1) is independent of X. In this example, the structural dimension is d = 3. We chose β1 = (1, 1, 0, …, 0)T, β2 = (0, 0, 1, 1, 0, …, 0)T, and β3 = (0, 0, 0, 0, 1, 1, 0, …, 0)T, and considered n = 600, p = 20 and n = 1,200, p = 40. Results of reduction basis estimation are reported in Table 1 (right half), and results of variable selection are reported in Table 2 (right half). Again, the proposed SDR estimator with the SCAD max penalty performed well in terms of both basis estimation and variable selection.
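For readers who wish to reproduce the setting, the data-generating model of Example 4.2 can be sketched as follows. The distribution of X is not restated in this example, so the sketch assumes independent standard normal predictors; the function name `simulate_example_4_2` and the `seed` argument are hypothetical.

```python
import numpy as np

def simulate_example_4_2(n, p, seed=0):
    """Draw (X, Y) from the model of Example 4.2 and return the true basis B."""
    rng = np.random.default_rng(seed)

    beta1 = np.zeros(p)
    beta1[[0, 1]] = 1.0
    beta2 = np.zeros(p)
    beta2[[2, 3]] = 1.0
    beta3 = np.zeros(p)
    beta3[[4, 5]] = 1.0

    # ASSUMPTION: predictors are i.i.d. N(0, 1); the example does not restate this.
    X = rng.standard_normal((n, p))
    eps = rng.standard_normal(n)  # noise, independent of X

    Y = (np.sign(X @ beta1) * np.log(np.abs(X @ beta2 + 5.0))
         + X @ beta3 + 0.2 * eps)

    B = np.column_stack([beta1, beta2, beta3])  # p x 3 basis, so d = 3
    return X, Y, B

X, Y, B = simulate_example_4_2(n=600, p=20)
```

Only the first six rows of `B` are nonzero, which is the sparsity that the penalized estimator is meant to recover.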

We next consider the performance with correlated predictors. Table 3 reports the results of reduction basis estimation and variable selection when p = 40. It is seen from the table that correlation among the predictors had some bearing on the method, but the overall performance resembled the results for the case without correlation: the penalized SDR estimator improved the estimation accuracy in terms of reduction basis estimation, and achieved a high true positive rate and a low false positive rate.

Table 3.

Evaluation of dimension reduction basis estimation and variable selection for Examples 4.1 and 4.2 with correlated predictors.


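Criterion | Example 4.1 (d = 2, p = 40, n = 800), Slicing | Example 4.1 (d = 2, p = 40, n = 800), Spline | Example 4.2 (d = 3, p = 40, n = 1,200), Slicing | Example 4.2 (d = 3, p = 40, n = 1,200), Spline
Vector correlation, w/o penalty | 0.77 (0.05) | 0.73 (0.06) | 0.77 (0.03) | 0.74 (0.05)
Vector correlation, SCAD | 0.95 (0.04) | 0.86 (0.13) | 0.95 (0.03) | 0.96 (0.03)
TPR | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00) | 1.00 (0.00)
FPR | 0.00 (0.01) | 0.01 (0.01) | 0.04 (0.04) | 0.00 (0.01)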

4.2. A data example

We briefly analyze the motif discovery data of Zhong et al. (2005) to illustrate the proposed method, though our analysis is by no means comprehensive. The goal here is to identify a subset of transcription factor binding motifs that affect the gene expression values. The response variable is the expression value obtained by DNA microarray experiments, the predictors are the motif-matching scores of 414 candidate motifs, and the data consist of n = 5,970 genes as the sample observations. To bring the number of candidate predictors to the order of n, we employed univariate regression for an initial screening, following the spirit of Fan and Lv (2008). We set the cutoff p-value at 0.05, and obtained p = 118 motifs for subsequent analysis. Zhong et al. (2005) suggested that the central subspace is two-dimensional and that the predictors affect the response in some nonlinear fashion. We applied our variable selection method to these data. The slicing transformation selected 16 motifs, whereas the spline transformation selected 9 motifs, which form a subset of the 16.
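The initial screening step can be illustrated with a short sketch. It is not the authors' implementation: it assumes the univariate regression is a simple linear regression of the response on each predictor, and it replaces the t reference distribution with a normal approximation, which is reasonable only for large n. The function name `univariate_screen` and the synthetic data are hypothetical.

```python
import math
import numpy as np

def univariate_screen(X, y, alpha=0.05):
    """Keep predictors whose marginal slope has a two-sided p-value below alpha.

    Assumes no column of X is constant.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()

    # Marginal correlation of each predictor with the response.
    r = (Xc.T @ yc) / np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.clip(r, -0.999999, 0.999999)

    # t statistic of the slope in a simple linear regression.
    t = r * np.sqrt((n - 2) / (1.0 - r ** 2))

    # ASSUMPTION: normal approximation to the t distribution (large n).
    pvals = np.array([math.erfc(abs(ti) / math.sqrt(2.0)) for ti in t])
    return np.flatnonzero(pvals < alpha), pvals

# Synthetic illustration: only predictors 0 and 3 affect the response.
rng = np.random.default_rng(1)
X = rng.standard_normal((2000, 30))
y = 2.0 * X[:, 0] - X[:, 3] + rng.standard_normal(2000)
keep, pvals = univariate_screen(X, y)
```

A marginal screen of this kind can miss predictors that matter only jointly or only nonlinearly, so it is best read as a coarse first pass before the penalized estimator is applied.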

5. Discussion

There are a number of ways to extend this work. First, in our current development, we have treated the number of transformations h as fixed since it usually takes a pre-specified small value, and it helps simplify the technical derivations. For some particular transformation choices, a fixed h may result in an estimate of a proper subspace of the central subspace. As such it is of interest to extend our results to a diverging h. We speculate that the results of Corollary 1 would be modified accordingly, with the convergence rate of h−1B̂B̂T at Op(log(h)p/n), while a rigorous conclusion needs more careful study. Second, the SDR estimators discussed in Section 2.1 rely on the first inverse moment E(X|Y). When E(X|Y) = 0, the estimated subspace obtained may be a proper subspace of the central subspace. There have been proposals of SDR estimators that take advantage of the second or higher inverse moments, for instance, Cook and Weisberg (1991), Yin and Cook (2003), Li, Zha, and Chiaromonte (2005), and Li and Wang (2007). It is of interest to investigate the asymptotics of those SDR estimators with a diverging p. Finally, in many recent microarray and genetics studies, the number of predictors exceeds the number of observational units. Asymptotic properties of both dimension reduction and variable selection with p > n remain to be explored. Full investigations of those extensions are to be our future research.

Acknowledgments

Wu is supported in part by NSF grant DMS-0905561, NIH grant R01-CA149569, and NCSU Faculty Research and Professional Development Award. Li is supported in part by NSF grant DMS-0706919.

Appendix A: Technical Conditions

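  (i) There are constants $a^* > 0$ and $C > 0$ such that, for all $\beta$ with $\|\beta\| = 1$,
    $$P\Big(\sum_{i \in I} (X_i^T\beta)^2 > a^* n\Big) \to 1 \quad \text{as } n \to \infty,$$
    where $I = I(\beta, C) = \{i = 1, \ldots, n : |X_i^T\beta| \le C\}$.

  (ii) For any $\varepsilon > 0$, there exists a constant $C > 0$ such that, for all $\beta$ with $\|\beta\| = 1$,
    $$P\Big(\sum_{i \notin I} (X_i^T\beta)^2 \le \varepsilon n\Big) \to 1 \quad \text{as } n \to \infty.$$

  (iii) There is a constant $C$ such that $P(\max_{i=1,\ldots,n} \|X_i\|^2 \le C n^2) \to 1$ as $n \to \infty$.

  (iv) $E(X_j^4) < C$ for some constant $C > 0$, $j = 1, \ldots, p$.

  (v) $\Sigma = \mathrm{Cov}(X)$ is positive definite, with all its eigenvalues bounded below by a constant $c > 0$ and above by a finite constant, for all $p = p_n$.

  (vi) $E\{f(Y)\} = 0$ and $\mathrm{Var}\{f(Y) - X^T\beta_0\} < \infty$.

  (vii) The eigenvalues of the pseudo-Fisher information matrix $I(\beta_0)$ of $\beta_0(f)$ are bounded for all $p = p_n$:
    $$0 < \underline{\lambda} < \lambda_{\min}(I(\beta_0)) \le \lambda_{\max}(I(\beta_0)) < \bar{\lambda} < \infty \quad \text{for all } p = p_n,$$
    where, up to a constant,
    $$I(\beta_0) = E\big([X\{f(Y) - X^T\beta_0\}][X\{f(Y) - X^T\beta_0\}]^T\big).$$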

Remark 6

The regularity Conditions (i), (ii), and (iii) are simplified versions of Conditions X1, X2, and X3 of Portnoy (1984). Portnoy (1984) showed that these conditions hold in probability if {X1, X2, …, Xn} are i.i.d. according to a distribution satisfying his (4.3). As our Conditions (i), (ii), and (iii) are weaker, the same result applies.

Appendix B: Proofs

Proof of Theorem 1

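Let $F(\alpha) = \sum_{i=1}^{n} X_i\{f(Y_i) - X_i^T\beta_0 - X_i^T\alpha\}$ with $\alpha \in \mathbb{R}^{p \times 1}$. Due to the convexity of the squared loss and the fact that $\hat\beta = \beta_0 + \hat\alpha$, it suffices to show that there is a root $\hat\alpha$ of $F(\alpha)$ satisfying $\|\hat\alpha\|^2 = O_p(p/n)$. According to 6.3.4 of Ortega and Rheinboldt (1970), it in turn suffices to show that $\alpha^T F(\alpha) < 0$ for $\|\alpha\|^2 = Bp/n$ for some $B > 0$. Toward that end, write

$$\alpha^T F(\alpha) = \sum_{i=1}^{n} X_i^T\alpha\,\{f(Y_i) - X_i^T\beta_0\} - \sum_{i=1}^{n} (X_i^T\alpha)^2 \equiv A_1 - A_2.$$

For $A_2$, we have

$$A_2 = \sum_{i=1}^{n} (X_i^T\alpha)^2 \ge \|\alpha\|^2 \inf_{\|\beta\|=1} \sum_{i=1}^{n} (X_i^T\beta)^2 \ge a^* n \|\alpha\|^2$$

in probability for some constant $a^* > 0$, due to Lemma 1.

For $A_1$, we have $A_1 \le \|\alpha\| \, \big\|\sum_{i=1}^{n} X_i\{f(Y_i) - X_i^T\beta_0\}\big\|$. Then,

$$\begin{aligned}
E\Big\|\sum_{i=1}^{n} X_i\{f(Y_i) - X_i^T\beta_0\}\Big\|^2
&= E\Big(\sum_{j=1}^{p}\Big[\sum_{i=1}^{n} X_{ij}\{f(Y_i) - X_i^T\beta_0\}\Big]^2\Big) \\
&= E\Big(\sum_{j=1}^{p}\sum_{i=1}^{n}\sum_{i'=1}^{n} X_{ij}X_{i'j}\{f(Y_i) - X_i^T\beta_0\}\{f(Y_{i'}) - X_{i'}^T\beta_0\}\Big) \\
&= \sum_{j=1}^{p}\sum_{i=1}^{n} E\big(X_{ij}^2\{f(Y_i) - X_i^T\beta_0\}^2\big) + \sum_{j=1}^{p}\sum_{1 \le i \ne i' \le n} E_{ij}E_{i'j} \\
&\le Bnp \quad \text{for some } B > 0,
\end{aligned}$$

where $E_{ij} = E[X_{ij}\{f(Y_i) - X_i^T\beta_0\}]$. The last inequality is true because $\beta_0 = \arg\min_\beta E\{f(Y) - X^T\beta\}^2$, which implies that $E(XX^T)\beta_0 = E\{f(Y)X\}$, and thus for any $1 \le j \le p$, $\sum_{m=1}^{p} E(X_jX_m)\beta_{m0} = E\{f(Y)X_j\}$, so $E_{ij} = 0$. The cross terms therefore vanish, and the remaining sum equals, up to a constant, $n$ times the trace of $I(\beta_0)$, which is of order at most $np$ by Condition (vii). Then by Chebyshev's inequality, for any $\varepsilon > 0$, there is a constant $B^*$ such that

$$P\{A_1 \le B^*\sqrt{np}\,\|\alpha\| \ \text{for all } \alpha\} \ge 1 - \varepsilon.$$

Combining the above two results, we have

$$P\{A_1 - A_2 \le B^*\sqrt{np}\,\|\alpha\| - a^* n\|\alpha\|^2 \ \text{for all } \alpha\} \ge 1 - 2\varepsilon.$$

Set $B = (2B^*/a^*)^2$ and $\|\alpha\|^2 = Bp/n$. Then we have

$$P\Big\{\alpha^T F(\alpha) < 0 \ \text{for all } \alpha \text{ with } \|\alpha\|^2 = \frac{Bp}{n}\Big\} \ge P\Big\{A_1 - A_2 \le -\frac{1}{2}Ba^*p \ \text{for } \|\alpha\|^2 = \frac{Bp}{n}\Big\} \ge 1 - 2\varepsilon.$$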

Our desired result then follows from Ortega and Rheinboldt (1970).
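For convenience, the substitution behind the final display of this proof can be written out. It uses only the constants $a^*$, $B^*$ and the choice $B = (2B^*/a^*)^2$ made in the proof, evaluated on the sphere $\|\alpha\|^2 = Bp/n$.

```latex
\begin{align*}
B^*\sqrt{np}\,\|\alpha\| - a^* n \|\alpha\|^2
  &= B^*\sqrt{np}\,\sqrt{\frac{Bp}{n}} - a^* n \cdot \frac{Bp}{n} \\
  &= B^*\sqrt{B}\,p - a^* B p \\
  &= \tfrac{1}{2} a^* B p - a^* B p
     \qquad \text{since } B^* = \tfrac{1}{2} a^* \sqrt{B} \\
  &= -\tfrac{1}{2} a^* B p < 0 .
\end{align*}
```

The choice of $B$ is thus exactly the one that makes the linear term half the size of the quadratic term on the sphere.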

Proof of Lemma 3

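Let $\alpha_n = \sqrt{p_n}\,(n^{-1/2} + a_n)$. We need to show that for any $\varepsilon > 0$ there exists a constant $C > 0$ such that

$$P\Big\{\inf_{\|U\| = C} Q(B_0 + \alpha_n U) > Q(B_0)\Big\} \ge 1 - \varepsilon.$$

Note that

$$Q(B_0 + \alpha_n U) - Q(B_0) \ge \sum_{k=1}^{h}\{L_k(\beta_{k0} + \alpha_n u_k) - L_k(\beta_{k0})\} + \sum_{j=1}^{q}\Big(p_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0} + \alpha_n u_{jk}|\big) - p_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0}|\big)\Big) \equiv D_1 + D_2.$$

We decompose $D_1$ and $D_2$, respectively, as $D_1 = D_{11} + D_{12}$ and $D_2 = D_{21} + D_{22}$, where

$$\begin{aligned}
D_{11} &= \alpha_n \sum_{k=1}^{h} u_k^T \nabla_\beta L_k(\beta_{k0}), \\
D_{12} &= \frac{1}{2}\alpha_n^2 \sum_{k=1}^{h} u_k^T \{\nabla_\beta^2 L_k(\beta_{k0})\} u_k, \\
D_{21} &= \sum_{j=1}^{q} p'_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0}|\big)\Big(\max_{1 \le k \le h}|\beta_{jk0} + \alpha_n u_{jk}| - \max_{1 \le k \le h}|\beta_{jk0}|\Big), \\
D_{22} &= \sum_{j=1}^{q} \frac{1}{2} p''_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0}|\big)\Big(\max_{1 \le k \le h}|\beta_{jk0} + \alpha_n u_{jk}| - \max_{1 \le k \le h}|\beta_{jk0}|\Big)^2 (1 + o(1)).
\end{aligned}$$

For $D_{11}$, by Condition (vii), the eigenvalues of the pseudo-Fisher information matrix $I(\beta_{k0})$ are bounded away from both zero and infinity. Therefore we have

$$\|\nabla_\beta L_k(\beta_{k0})\| = O_p\big(\sqrt{p/n}\big). \qquad \text{(B.1)}$$

Then

$$|D_{11}| = \Big|\alpha_n \sum_{k=1}^{h} u_k^T \nabla_\beta L_k(\beta_{k0})\Big| \le \alpha_n \sum_{k=1}^{h} \|u_k\| \cdot \|\nabla_\beta L_k(\beta_{k0})\| = O_p\big(\alpha_n\sqrt{p/n}\big)\sum_{k=1}^{h}\|u_k\| = O_p(\alpha_n^2)\sum_{k=1}^{h}\|u_k\|.$$

For $D_{12}$, we note that

$$D_{12} = \frac{1}{2}\alpha_n^2 \sum_{k=1}^{h} u_k^T\Big(\frac{1}{n}\sum_{i=1}^{n} X_iX_i^T\Big)u_k = \frac{1}{2}\alpha_n^2 \sum_{k=1}^{h} u_k^T\Big\{\frac{1}{n}\sum_{i=1}^{n} X_iX_i^T - \Sigma\Big\}u_k + \frac{1}{2}\alpha_n^2 \sum_{k=1}^{h} u_k^T \Sigma u_k.$$

As in the proof of Lemma 8 of Fan and Peng (2004), by Chebyshev's inequality, for any $\varepsilon > 0$ we have

$$P\Big(\Big\|\frac{1}{n}\sum_{i=1}^{n} X_iX_i^T - \Sigma\Big\| \ge \frac{\varepsilon}{p}\Big) \le \frac{p^2}{n^2\varepsilon^2}\, E\sum_{j,m=1}^{p}\Big\{\sum_{i=1}^{n}(X_{ij}X_{im} - \sigma_{jm})\Big\}^2 = O\Big(\frac{p^4}{n}\Big) = o(1),$$

where $\sigma_{jm}$ is the $(j, m)$-element of $\Sigma$. Thus we can write

$$D_{12} = o_p(1)\,\frac{1}{2}\alpha_n^2 \sum_{k=1}^{h}\|u_k\|^2 + \frac{1}{2}\alpha_n^2 \sum_{k=1}^{h} u_k^T \Sigma u_k.$$

We have

$$\begin{aligned}
|D_{21}| &\le \sum_{j=1}^{q} a_n \sum_{k=1}^{h} \alpha_n |u_{jk}| \le a_n\alpha_n\sqrt{q}\sum_{k=1}^{h}\|u_k\|, \\
|D_{22}| &\le \sum_{j=1}^{q} \frac{1}{2}\Big|p''_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0}|\big)\Big|\Big(\sum_{k=1}^{h}\alpha_n|u_{jk}|\Big)^2 (1 + o(1)) \le \frac{1}{2}\max_{j=1,\ldots,q}\Big|p''_\lambda\big(\max_{1 \le k \le h}|\beta_{jk0}|\big)\Big|\,\alpha_n^2\, h \sum_{k=1}^{h}\|u_k\|^2 .
\end{aligned}$$

Combining the above results, $D_{12}$ is asymptotically positive and dominates the other terms. Setting $C = \|U\| = \big(\sum_{k=1}^{h}\|u_k\|^2\big)^{1/2}$ large enough, the desired result follows.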

Proof of Lemma 4

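It suffices to show that with probability tending to one as $n \to \infty$, for any given $\{\beta_{k+}, k = 1, \ldots, h\}$ satisfying $\|\beta_{k+} - \beta_{k+0}\| = O_p(\sqrt{p/n})$ and any constant $C$, for $j = q + 1, \ldots, p$,

$$\begin{aligned}
\frac{\partial}{\partial \beta_{jk}^{r}} Q(B) &> 0 \quad \text{for } 0 < \beta_{jk} < C\sqrt{p/n} \ \text{and} \ |\beta_{jk}| = \max_{m=1,\ldots,h}|\beta_{jm}|, \\
\frac{\partial}{\partial \beta_{jk}^{l}} Q(B) &< 0 \quad \text{for } -C\sqrt{p/n} < \beta_{jk} < 0 \ \text{and} \ |\beta_{jk}| = \max_{m=1,\ldots,h}|\beta_{jm}|,
\end{aligned}$$

where $\partial/\partial\beta_{jk}^{l}$ and $\partial/\partial\beta_{jk}^{r}$ denote the left and right hand partial derivatives, respectively. Since $Q$ is minimized, these signs mean that $Q$ decreases as $\beta_{jk}$ moves toward zero from either side.

By a Taylor expansion,

$$\frac{\partial}{\partial \beta_{jk}} L_k(\beta_k) = \frac{\partial}{\partial \beta_{jk}} L_k(\beta_{k0}) + \sum_{l=1}^{p} \frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0})\,(\beta_{lk} - \beta_{lk0}) \equiv E_1 + E_2.$$

Due to (B.1), we have $E_1 = O_p(\sqrt{p/n})$. Next we decompose $E_2$ as

$$E_2 = \sum_{l=1}^{p}\Big[\frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0}) - E\Big\{\frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0})\Big\}\Big](\beta_{lk} - \beta_{lk0}) + \sum_{l=1}^{p} E\Big\{\frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0})\Big\}(\beta_{lk} - \beta_{lk0}) \equiv E_{21} + E_{22}.$$

For $E_{21}$, we have

$$\begin{aligned}
|E_{21}| &\le \|\beta_k - \beta_{k0}\|\Big(\sum_{l=1}^{p}\Big[\frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0}) - E\Big\{\frac{\partial^2}{\partial \beta_{jk}\,\partial \beta_{lk}} L_k(\beta_{k0})\Big\}\Big]^2\Big)^{1/2} \\
&= \|\beta_k - \beta_{k0}\|\Big(\sum_{l=1}^{p}\Big[\frac{1}{n}\sum_{i=1}^{n} X_{ij}X_{il} - E(X_jX_l)\Big]^2\Big)^{1/2} \\
&= O_p\big(\sqrt{p/n}\big)\,O_p\big(\sqrt{p/n}\big) = O_p(p/n),
\end{aligned}$$

where the second to last equality comes from the moment Condition (iv) and the fact that all the eigenvalues of $\Sigma$ are bounded away from both 0 and $\infty$ by Condition (v).

For $E_{22}$, by the Cauchy-Schwarz inequality and $\|\beta_k - \beta_{k0}\| = O_p(\sqrt{p/n})$,

$$|E_{22}| = \Big|\sum_{l=1}^{p} E(X_jX_l)(\beta_{lk} - \beta_{lk0})\Big| \le O_p\big(\sqrt{p/n}\big)\Big(\sum_{l=1}^{p}\sigma_{jl}^2\Big)^{1/2} = O_p\big(\sqrt{p/n}\big),$$

where $\sigma_{jl}$ is the $(j, l)$-element of $\Sigma$, and the last equality comes from the fact that $\sum_{l=1}^{p}\sigma_{jl}^2 = O(1)$, which is ensured by Conditions (iv) and (v).

Combining the above two results, we have $\frac{\partial}{\partial \beta_{jk}} L_k(\beta_k) = O_p(\sqrt{p/n})$.

Finally, note that $\sqrt{p/n}/\lambda_n \to 0$ and $\liminf_{n \to \infty}\liminf_{\theta \to 0+} p'_{\lambda_n}(\theta)/\lambda_n > 0$. When $|\beta_{jk}| = \max_{m=1,\ldots,h}|\beta_{jm}|$, we have

$$\begin{aligned}
\frac{\partial Q(B)}{\partial \beta_{jk}^{r}} &= \lambda_n\Big\{\frac{p'_{\lambda_n}(\max_{m=1,\ldots,h}|\beta_{jm}|)}{\lambda_n} + O_p\Big(\frac{\sqrt{p/n}}{\lambda_n}\Big)\Big\} \quad \text{if } \beta_{jk} > 0, \\
\frac{\partial Q(B)}{\partial \beta_{jk}^{l}} &= \lambda_n\Big\{-\frac{p'_{\lambda_n}(\max_{m=1,\ldots,h}|\beta_{jm}|)}{\lambda_n} + O_p\Big(\frac{\sqrt{p/n}}{\lambda_n}\Big)\Big\} \quad \text{if } \beta_{jk} < 0.
\end{aligned}$$

In both cases, the first term dominates the second. Thus the result of Lemma 4 follows.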
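The condition $\liminf_{\theta \to 0+} p'_{\lambda_n}(\theta)/\lambda_n > 0$ used above is easy to check for the SCAD penalty, whose derivative equals $\lambda$ on $(0, \lambda]$. The sketch below writes out the standard SCAD penalty of Fan and Li (2001) and the max-type penalty $\sum_j p_\lambda(\max_k |\beta_{jk}|)$ that appears in the proofs. It is an illustration rather than the authors' code: the shape parameter a = 3.7 is the conventional default and is assumed here, and the function names are hypothetical.

```python
import numpy as np

def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(|theta|), evaluated elementwise."""
    theta = np.abs(np.asarray(theta, dtype=float))
    linear = lam * theta                                    # |theta| <= lam
    quad = -(theta ** 2 - 2 * a * lam * theta + lam ** 2) / (2 * (a - 1))
    const = (a + 1) * lam ** 2 / 2                          # |theta| > a * lam
    return np.where(theta <= lam, linear,
                    np.where(theta <= a * lam, quad, const))

def scad_derivative(theta, lam, a=3.7):
    """Derivative p'_lambda(|theta|); at zero this is the right-hand limit."""
    theta = np.abs(np.asarray(theta, dtype=float))
    tail = np.maximum(a * lam - theta, 0.0) / ((a - 1) * lam)
    return lam * np.where(theta <= lam, 1.0, tail)

def scad_max_penalty(B, lam, a=3.7):
    """Sum over predictors j of p_lambda(max_k |B[j, k]|)."""
    row_max = np.abs(B).max(axis=1)
    return float(scad_penalty(row_max, lam, a).sum())

lam = 0.5
# One small row, one zero row, one large row (rows are predictors).
B_demo = np.array([[0.1, -0.2],
                   [0.0, 0.0],
                   [3.0, 1.0]])
penalty = scad_max_penalty(B_demo, lam)
ratio_near_zero = float(scad_derivative(1e-9, lam) / lam)
```

Because the penalty acts on the row-wise maximum, a predictor is penalized as a whole across all h transformations, and the penalty stops growing once the largest coefficient in a row exceeds a times lambda.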

Contributor Information

Yichao Wu, Email: wu@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

Lexin Li, Email: li@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

References

  1. An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J Global Optimization. 1997;11:253–285.
  2. Bondell HD, Li L. Shrinkage inverse regression estimation for model free variable selection. J Roy Statist Soc Ser B. 2009;71:287–299.
  3. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
  4. Cook RD. Graphics for regressions with a binary response. J Amer Statist Assoc. 1996;91:983–992.
  5. Cook RD. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley; New York: 1998.
  6. Cook RD. Testing predictor contributions in sufficient dimension reduction. Ann Statist. 2004;32:1062–1092.
  7. Cook RD, Ni L. Using intra-slice covariances for improved estimation of the central subspace in regression. Biometrika. 2006;93:65–74.
  8. Cook RD, Weisberg S. Discussion of “Sliced inverse regression for dimension reduction,” by K.C. Li. J Amer Statist Assoc. 1991;86:328–332.
  9. Cook RD, Yin X. Dimension-reduction and visualization in discriminant analysis. Austral N Z J Statist. 2001;43:147–200.
  10. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
  11. Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30:74–99.
  12. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J Roy Statist Soc Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  13. Fan J, Peng H. On non-concave penalized likelihood with diverging number of parameters. Ann Statist. 2004;32:928–961.
  14. Fung WK, He X, Liu L, Shi PD. Dimension reduction based on canonical correlation. Statist Sinica. 2002;12:1093–1114.
  15. Golub G, van Loan C. Matrix Computations. 3rd ed. The Johns Hopkins University Press; 1996.
  16. Hall P, Li KC. On almost linearity of low dimensional projection from high dimensional data. Ann Statist. 1993;21:867–889.
  17. Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377.
  18. Hsing T, Carroll RJ. An asymptotic theory for sliced inverse regression. Ann Statist. 1992;20:1040–1061.
  19. Hunter DR, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
  20. Knight K, Fu WJ. Asymptotics for Lasso-type estimators. Ann Statist. 2000;28:1356–1378.
  21. Li B, Zha H, Chiaromonte F. Contour regression: a general approach to dimension reduction. Ann Statist. 2005;33:1580–1616.
  22. Li B, Wang S. On directional regression for dimension reduction. J Amer Statist Assoc. 2007;102:997–1008.
  23. Li KC. Sliced inverse regression for dimension reduction (with discussion). J Amer Statist Assoc. 1991;86:316–327.
  24. Li R, Liang H. Variable selection in semiparametric regression modeling. Ann Statist. 2008;36:261–286. doi: 10.1214/009053607000000604.
  25. Ni L, Cook RD, Tsai CL. A note on shrinkage sliced inverse regression. Biometrika. 2005;92:242–247.
  26. Ni L, Wang H, Tsai CL, Zhou J. Model free variable selection via adaptive lasso. Proceedings of the Joint Statistical Meetings. 2008.
  27. Ortega JM, Rheinboldt WC. Iterative Solution of Nonlinear Equations in Several Variables. Wiley; New York: 1970.
  28. Portnoy S. Asymptotic behavior of M-estimators of p regression parameters when p^2/n is large. I. Consistency. Ann Statist. 1984;12:1298–1309.
  29. Tibshirani R. Regression shrinkage and selection via the LASSO. J Roy Statist Soc Ser B. 1996;58:267–288.
  30. Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
  31. Wang H, Xia Y. Sliced regression for dimension reduction. J Amer Statist Assoc. 2008;103:811–821.
  32. Wu Y, Liu Y. Variable selection in quantile regression. Statist Sinica. 2009;19:801–817.
  33. Xia Y. A constructive approach to the estimation of dimension reduction directions. Ann Statist. 2007;35:2654–2690.
  34. Yin X, Cook RD. Dimension reduction for the conditional k-th moment in regression. J Roy Statist Soc Ser B. 2002;64:159–176.
  35. Yin X, Cook RD. Estimating central subspaces via inverse third moments. Biometrika. 2003;90:113–125.
  36. Yin X, Li B, Cook RD. Successive direction extraction for estimating the central subspace in a multiple-index regression. J Multivariate Anal. 2008;99:1733–1757.
  37. Zhong W, Zeng P, Ma P, Liu JS, Zhu Y. RSIR: Regularized Sliced Inverse Regression for Motif Discovery. Bioinformatics. 2005;21:4169–4175. doi: 10.1093/bioinformatics/bti680.
  38. Zhou J, He X. Dimension reduction based on constrained canonical correlation and variable filtering. Ann Statist. 2008;36:1649–1668.
  39. Zhu LP, Zhu LX. On distribution-weighted partial least squares with diverging number of highly correlated predictors. J Roy Statist Soc Ser B. 2009a;71:525–548.
  40. Zhu LP, Zhu LX. Nonconcave penalized inverse regression in single-index models with high dimensional predictors. J Multivariate Anal. 2009b;100:862–875.
  41. Zhu LX, Fang KT. Asymptotics for kernel estimate of sliced inverse regression. Ann Statist. 1996;24:1053–1068.
  42. Zhu LX, Miao B, Peng H. On sliced inverse regression with high dimensional covariates. J Amer Statist Assoc. 2006;101:630–643.
  43. Zhu LX, Ng KW. Asymptotics of sliced inverse regression. Statist Sinica. 1995;5:727–736.
  44. Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
  45. Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802.
