ASYMPTOTIC PROPERTIES OF SUFFICIENT DIMENSION REDUCTION WITH A DIVERGING NUMBER OF PREDICTORS

Yichao Wu; Lexin Li

doi:10.5705/ss.2011.031a

Author manuscript; available in PMC: 2011 Dec 1.

Published in final edited form as: Stat Sin. 2011;2011(21):707–730. doi: 10.5705/ss.2011.031a

PMCID: PMC3228487 NIHMSID: NIHMS334794 PMID: 22140299

Abstract

We investigate asymptotic properties of a family of sufficient dimension reduction estimators when the number of predictors p diverges to infinity with the sample size. We adopt a general formulation of dimension reduction estimation through least squares regression of a set of transformations of the response. This formulation allows us to establish the consistency of reduction projection estimation. We then introduce the SCAD max penalty, along with a difference convex optimization algorithm, to achieve variable selection. We show that the penalized estimator selects all truly relevant predictors and excludes all irrelevant ones with probability approaching one, while maintaining consistent reduction basis estimation for the relevant predictors. Our work differs from most model-based selection methods in that it does not require a traditional model, and it extends existing sufficient dimension reduction and model-free variable selection approaches from the fixed p scenario to a diverging p.

Key words and phrases: Central subspace, diverging parameters, SCAD, sliced inverse regression

1. Introduction

As data with a large number of predictors prevail in many scientific fields such as computational biology, dimension reduction is becoming central to high-dimensional regression analysis of these datasets. Among many dimension reduction methodologies, research in sufficient dimension reduction (SDR), pioneered by Li (1991) and formulated by Cook (1998), has gained considerable interest in recent years. It aims to reduce the predictor dimension by a linear projection of the predictor vector while preserving full regression information. For high-dimensional data, it is often further believed that only a subset of predictors suffices to fully characterize the response-predictor relation. Toward this end, simultaneous variable selection along with dimension reduction projection can be achieved (Ni, Cook, and Tsai (2005), Ni et al. (2008), Zhou and He (2008), Bondell and Li (2009)). In this article we investigate asymptotic properties of a family of sufficient dimension reduction methods, in terms of both reduction projection estimation and variable selection, while we allow the number of predictors p to diverge as the sample size n approaches infinity.

Specifically, for regression of a univariate response Y given a predictor vector X = (X₁, …, X_p)^T ∈ IR^{p×1}, SDR seeks a minimum subspace S of IR^p, with a p × d basis matrix η, such that Y ⫫ X | η^T X. Under minor conditions (Cook (1996), Yin, Li, and Cook (2008)), such a subspace uniquely exists and is a parsimonious population parameter that contains all regression information of Y|X. It is named the central subspace, and is denoted by $S_{Y ∣ X}$ (Cook (1998)). Since the seminal sliced inverse regression (SIR) proposed by Li (1991), a variety of methods have been proposed to estimate $S_{Y ∣ X}$, including, for instance, sliced average variance estimation (Cook and Weisberg (1991)), directional regression (Li and Wang (2007)), constructive estimation (Xia (2007)), and sliced regression (Wang and Xia (2008)). Among those methods, SIR is perhaps the most commonly used one for estimating $S_{Y ∣ X}$, and there have been a number of elaborations on the original methodology of SIR; see, for instance, Fung et al. (2002), Yin and Cook (2002), and Cook and Ni (2006). The asymptotic properties of SIR were studied in Li (1991), Hsing and Carroll (1992), Zhu and Ng (1995), and Zhu and Fang (1996). In all those cases, however, the predictor dimension p is treated as fixed. Toward variable selection, Ni, Cook, and Tsai (2005) introduced the lasso type of penalty to SIR to select important predictors along with dimension reduction basis estimation. Zhou and He (2008) imposed the lasso penalty along with thresholding for variable filtering. Ni et al. (2008) and Bondell and Li (2009) generalized the penalized estimation idea to a family of inverse regression estimators and obtained asymptotic properties in terms of consistency in variable selection. Again, p is fixed in those studies, and extension to a diverging p is by no means trivial.
Recently, there has been work on the diverging p case in the context of sufficient dimension reduction: Zhu, Miao, and Peng (2006) studied the asymptotic properties of SIR as p diverges, but their result is for SIR only, and variable selection is not studied at all; Zhu and Zhu (2009a) investigated weighted partial least squares with a diverging p, but again variable selection is not tackled; Zhu and Zhu (2009b) studied variable selection with a diverging number of predictors through inverse regression, but focused on single-index models only. By contrast, we establish asymptotic properties for a family of inverse regression estimators that includes SIR, study simultaneous dimension reduction and variable selection with a particular emphasis on the latter, and encompass more general model forms.

More specifically, we employ a general formulation of a family of SDR estimators that estimate the central subspace through least squares regression of a set of transformations of the response given the original predictors. This formulation can be viewed as a generalization of the original sliced inverse regression, and includes SIR as a special case in certain situations. Based on this formulation, we investigate the asymptotic properties of our dimension reduction basis estimator while we allow p = p_n to increase with the sample size n. Under reasonable regularity conditions, we find the rate of convergence of the estimator to be $O_{p} (\sqrt{p / n})$ .

In terms of variable selection, we adopt the SCAD type penalty that was first proposed by Fan and Li (2001), then further developed in Fan and Li (2002), Li and Liang (2008), among others, and combine it with our dimension reduction estimator. It is important to note that exclusion of a predictor in our context of reduction basis estimation requires an entire row of the corresponding basis matrix estimator be zero simultaneously. For this purpose, we employ the SCAD max penalty. We also note that the SDR estimators generally impose no assumption on the conditional distribution Y|X and thus require no traditional models. As a consequence, the penalized SDR estimators achieve variable selection in a model-free fashion. This characteristic distinguishes our result of variable selection with a diverging p from the existing literature, e.g., Fan and Peng (2004), where a parametric model and most often a homoscedastic linear model is assumed. We employ the pseudo-likelihood approach in our proof since no parametric model is imposed. Under suitable conditions, we show that our estimator achieves consistency in variable selection, i.e., the estimator selects all truly relevant predictors with probability approaching one. In addition, the basis estimator of all the relevant predictors is consistent with a $\sqrt{n}$ -rate.

The rest of the article is organized as follows. In Section 2, we review a family of SDR estimators and study the convergence as p diverges. In Section 3, we propose the SCAD regularized SDR estimator for variable selection, and investigate its asymptotics with a diverging p in terms of variable selection consistency and basis estimation consistency. We also propose a difference convex algorithm for optimization. We present numerical studies in Section 4, and conclude the paper with a discussion in Section 5. Some technical proofs are given in the Appendix.

2. Dimension Reduction Basis Estimation

2.1. Dimension reduction via response transformation

Throughout, we assume the central subspace $S_{Y ∣ X}$ exists and its dimension d = dim($S_{Y ∣ X}$) is fixed as p → ∞. This assures that there is a well-defined population parameter as the target of our dimension reduction estimation.

By marginal standardization, if necessary, we assume E(X) = 0 and Var (X_j) = 1, j = 1, …, p. Let Σ = Cov (X), and define the first moment inverse mean function φ(Y) = Σ⁻¹E(X|Y). Sliced inverse regression is based upon the key observation that, if the linearity condition is satisfied, which states that E(X | η^T X) is a linear function of η^T X with η denoting a basis of $S_{Y ∣ X}$, then φ(Y) ∈ $S_{Y ∣ X}$. The linearity condition is satisfied when X is multivariate normally distributed. Furthermore, Hall and Li (1993) proved that linear combinations of the predictors are approximately normally distributed when p → ∞ as n → ∞, which assures that the linearity condition is satisfied asymptotically. It is also interesting to note that this condition is imposed only on the marginal distribution of X, rather than the conditional distribution Y|X. For this reason, SIR is viewed as a model-free estimator of the central subspace.

For any function f(Y) satisfying E{f(Y)}= 0, following Yin and Cook (2002) one can show that

E \{f (Y) φ (Y)\} = Σ^{- 1} Cov \{X, f (Y)\} \in S_{Y ∣ X}

(2.1)

under the linearity condition. Consequently, one can choose a series of transformation functions of the response variable, f₁(Y), …, f_h(Y), where h is a pre-specified number, and obtain the least squares estimates of regressing f_k(Y) on X, i.e.,

β_{k}^{0} = \arg\min_{β_{k}} E [\{f_{k} (Y) - X^{T} β_{k}\}^{2}], k = 1, \dots, h .

(2.2)

Write $B_{0} = (β_{1}^{0}, \dots, β_{h}^{0}) \in {IR}^{p \times h}$ ; then Span(B₀) ⊆ $S_{Y ∣ X}$ by (2.1). Following the usual protocol in the sufficient dimension reduction literature, we take one step further by assuming the coverage condition Span(B₀) = $S_{Y ∣ X}$ whenever Span(B₀) ⊆ $S_{Y ∣ X}$. This condition often holds in practice; see Cook and Ni (2006) for a discussion.

There are various choices for the transformation functions f_k(Y). The original SIR corresponds to choosing the slice indicator function f_k(Y) = 1 if Y is in slice k, and 0 otherwise, where the response Y is assumed to take h distinctive values {1, …, h}. Since E{f_k(Y)φ(Y)} = P (Y = k)Σ⁻¹E(X|Y = k), k = 1, …, h, we have Span(β₁, …, β_h) = Σ⁻¹Span(E(X|Y = 1), …, E(X|Y = h)), and thus it is equivalent to the traditional SIR estimator. Fung et al. (2002) suggested choosing the normalized B-spline basis functions for f_k(Y); Yin and Cook (2002) suggested the normalized polynomial transformation f_k(Y) = Y^k up to power h; Cook and Ni (2006) recommended choosing f_k(Y) = Y if Y is in slice k, and 0 otherwise. We do not address which choice of the transformation function is the “best”; we focus on the asymptotic properties of this family of estimators in general. Moreover, the number of the transformation functions h is a tuning parameter, and although its value matters, it is generally recognized that methods based on inverse means alone are not overly sensitive to the choice of h as long as h > d (Li (1991, Remark 4.3), Cook and Ni (2006, p.71)). Since h usually takes a pre-specified small value in practice, we treat h(> d) as fixed in our asymptotic investigations.

Throughout, we assume that we have n i.i.d. realizations of the data, {(X_i, Y_i), i = 1, …, n}, and h pre-specified transformation functions f₁(·), …, f_h(·) whose forms do not depend on the data. We then solve the least squares optimization

{\hat{β}}_{k} = \arg\min_{β_{k}} \sum_{i = 1}^{n} \{f_{k} (Y_{i}) - X_{i}^{T} β_{k}\}^{2}, k = 1, \dots, h .

(2.3)

We construct the p × h matrix B̂ = (β̂₁, …, β̂_h), obtain the first d eigenvectors (υ̂₁, …, υ̂_d) of the matrix h⁻¹B̂B̂^T, and take Span(υ̂₁, …, υ̂_d) as an estimate of the targeted central subspace. The structure dimension d = dim($S_{Y ∣ X}$) of the central subspace can be estimated by an asymptotic test (Li (1991)), a permutation test (Cook and Yin (2001)), or an information criterion (Zhu, Miao, and Peng (2006)), and as such d is treated as known in our investigation of reduction basis estimation.
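As a rough illustration of (2.3) and the eigen-decomposition step, the following standard-library Python sketch uses centered slice-indicator transformations, with slices formed from the empirical quantiles of Y. The function names, the quantile slicing, the toy data, and the use of power iteration (which recovers only the first eigenvector, so it corresponds to d = 1) are assumptions made for illustration and are not taken from the paper.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b for a square matrix a by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def sdr_basis(xs, ys, h):
    """Return the p x h matrix B_hat of (2.3) for centered slice indicators."""
    n, p = len(xs), len(xs[0])
    order = sorted(range(n), key=lambda i: ys[i])
    slice_of = [0] * n
    for rank, i in enumerate(order):
        slice_of[i] = rank * h // n  # h slices of (nearly) equal size
    gram = [[sum(x[j] * x[l] for x in xs) for l in range(p)] for j in range(p)]
    columns = []
    for k in range(h):
        share = sum(1 for s in slice_of if s == k) / n
        f = [(1.0 if s == k else 0.0) - share for s in slice_of]  # centered f_k
        rhs = [sum(xs[i][j] * f[i] for i in range(n)) for j in range(p)]
        columns.append(solve_linear_system(gram, rhs))  # least squares fit of f_k on X
    return [[columns[k][j] for k in range(h)] for j in range(p)]


def leading_eigenvector(b, iterations=200):
    """Power iteration for the first eigenvector of h^{-1} B B^T."""
    p, h = len(b), len(b[0])
    a = [[sum(b[j][k] * b[l][k] for k in range(h)) / h for l in range(p)]
         for j in range(p)]
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(a[j][l] * v[l] for l in range(p)) for j in range(p)]
        norm = sum(t * t for t in w) ** 0.5
        v = [t / norm for t in w]
    return v


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(2000)]
Y = [x[0] + 0.1 * rng.gauss(0, 1) for x in X]  # toy model: only X_1 matters
B_hat = sdr_basis(X, Y, h=5)
v1 = leading_eigenvector(B_hat)
```

In this toy setting the first row of `B_hat` should dominate and `v1` should point essentially along the first coordinate axis; for d > 1 a full symmetric eigen-solver would replace the power iteration.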

2.2. Asymptotic properties

We now study the asymptotic properties of our estimator of the central subspace. We begin with a lemma that is a key for our main asymptotic result in Theorem 1.

Lemma 1

Suppose Conditions (i), (ii), and (iii) of Appendix A hold. When p(log n)/n → 0, there exists a constant a^*> 0 such that

P (\inf_{| | β | | = 1} \sum_{i = 1}^{n} {(X_{i}^{T} β)}^{2} > a^{*} n) \to 1 \text{ as } n \to \infty .

Proof of Lemma 1

For n i.i.d. standard normal errors ε_i, i = 1, …, n, we construct artificial data {(X_i, Ỹ_i), i = 1, …, n}, where ${\tilde{Y}}_{i} = X_{i}^{T} \tilde{β} + ε_{i}$ for some fixed β̃ ∈ IR^p. The desired result follows by applying Lemma 3.1 of Portnoy (1984) with ψ(t) = t.

For any function f(·) satisfying that E{f(Y)} = 0, let

β^{0} = β^{0} (f) = \underset{β}{\arg\min} E \{f (Y) - X^{T} β\}^{2},

(2.4)

\hat{β} = \hat{β} (f) = \underset{β}{\arg\min} \sum_{i = 1}^{n} \{f (Y_{i}) - X_{i}^{T} β\}^{2} .

(2.5)

Theorem 1

Suppose Conditions (i)–(vii) of Appendix A hold. If p(log n)/n → 0, β̂ is a consistent estimator for β⁰ with $| | \hat{β} - β^{0} | | = O_{p} (\sqrt{p / n})$ .

The proof of this theorem is given in Appendix B. It serves as a basis for our consistency result for estimating the central subspace.

Corollary 1

Consider a set of response transformation functions, {f₁(·), …, f_h(·)}, each of which satisfies Conditions (vi) and (vii) of Appendix A. Under Conditions (i)–(v) of Appendix A, we have $| | h^{- 1} (\hat{B} {\hat{B}}^{T} - B_{0} B_{0}^{T}) | | = O_{p} (\sqrt{p / n})$ , where B₀ and B̂ are as defined at (2.2) and (2.3).

Proof of Corollary 1

Note first that we assume h is finite and fixed. Consequently ||B₀|| = O(1). Theorem 1 implies that the Frobenius norm of B̂ − B₀ satisfies $| | \hat{B} - B_{0} | | = O_{p} (\sqrt{p / n})$ . Therefore,

\begin{array}{l} | | h^{- 1} (\hat{B} {\hat{B}}^{T} - B_{0} B_{0}^{T}) | | \leq h^{- 1} (| | (\hat{B} - B_{0}) {(\hat{B} - B_{0})}^{T} | | + | | B_{0} {(\hat{B} - B_{0})}^{T} | | + | | (\hat{B} - B_{0}) B_{0}^{T} | |) \\ \leq h^{- 1} (| | \hat{B} - B_{0} | | \, | | {(\hat{B} - B_{0})}^{T} | | + | | B_{0} | | \, | | {(\hat{B} - B_{0})}^{T} | | + | | \hat{B} - B_{0} | | \, | | B_{0}^{T} | |) \\ \leq h^{- 1} (O_{p} (\frac{p}{n}) + O (1) O_{p} (\sqrt{\frac{p}{n}}) + O (1) O_{p} (\sqrt{\frac{p}{n}})) \\ = O_{p} (\sqrt{\frac{p}{n}}) . \end{array}

Remark 1

Zhu, Miao, and Peng (2006) studied the asymptotics of the original SIR estimator when p diverges. In their study, they fixed the number of sample points in each slice while letting the number of slices h → ∞ as n → ∞. In our study, this notion of fixed number of observations per slice no longer applies for a choice of transformation functions other than the indicator function. Besides, since in practice h is pre-determined, we choose to treat h as fixed in our asymptotic investigations. For these reasons, our consistency rate is not directly comparable to that obtained by Zhu, Miao, and Peng (2006) for the original SIR, while our result goes beyond SIR and applies to the entire family of SDR estimators based on the first inverse moment φ(Y), as discussed in Section 2.1.

We can further bound estimation error of the first d eigenvectors of h⁻¹B̂B̂^T when the first d eigenvalues of $h^{- 1} B_{0} B_{0}^{T}$ are distinct. Let $A_{0} = h^{- 1} B_{0} B_{0}^{T}$ , Â = h⁻¹B̂B̂^T, and E = Â − A₀. Denote the eigenvalues and eigenvectors of A₀ by {λ_j, υ_j}, j = 1, …, p.

Theorem 2

Suppose λ₁, …, λ_d are distinct. Under the conditions of Corollary 1, the estimated eigenvectors υ̂_j satisfy

\sqrt{1 - {(v_{j}^{T} {\hat{v}}_{j})}^{2}} = O_{p} (\sqrt{\frac{p}{n}}), j = 1, \dots, d .

Proof of Theorem 2

For j = 1, …, d, we rearrange the order of the first d eigenvectors and write Q_j = (υ_j, υ_{j+1}, …, υ_d, υ₁, υ₂, …, υ_{j−1}, υ_{d+1}, υ_{d+2}, …, υ_p) = (υ_j, Q_{j2}). Then $Q_{j}^{T} A_{0} Q_{j} = diag (λ_{j}, λ_{j + 1}, \dots, λ_{d}, λ_{1}, λ_{2}, \dots, λ_{j - 1}, λ_{d + 1}, λ_{d + 2}, \dots, λ_{p})$ . Next write $Q_{j}^{T} E Q_{j} = [\begin{matrix} ε_{j j} & ε_{j}^{T} \\ ε_{j} & E_{22 j} \end{matrix}]$ , where ε_{jj} is a scalar and ε_j ∈ IR^{p−1}, and let a_j = min_{i ≠ j} |λ_i − λ_j| > 0. Note that $| | E | | = O_{p} (\sqrt{p / n})$ due to Corollary 1, so ||E|| is small relative to a_j with probability tending to one. By applying Theorem 8.1.12 of Golub and van Loan (1996), there exists p_j ∈ IR^{p−1} satisfying ||p_j|| ≤ 4||ε_j||/a_j such that ${\hat{v}}_{j} = (v_{j} + Q_{j 2} p_{j}) / \sqrt{1 + p_{j}^{T} p_{j}}$ is a unit eigenvector of Â = A₀ + E. Since υ_j is orthogonal to the columns of Q_{j2}, we have $v_{j}^{T} {\hat{v}}_{j} = 1 / \sqrt{1 + p_{j}^{T} p_{j}}$ , and therefore $\sqrt{1 - {(v_{j}^{T} {\hat{v}}_{j})}^{2}} = | | p_{j} | | / \sqrt{1 + p_{j}^{T} p_{j}} \leq | | p_{j} | | \leq 4 | | ε_{j} | | / a_{j}$ . Since $| | ε_{j} | | \leq | | Q_{j}^{T} E Q_{j} | | \leq | | Q_{j} | | \, | | E | | \, | | Q_{j} | | = | | E | | = O_{p} (\sqrt{p / n})$ , the result follows.

3. Variable Selection

3.1. Regularization via the SCAD max penalty

When the number of predictors p is large in a regression analysis, regularization is often employed to add numerical stability, to improve statistical robustness, and to achieve variable selection. In the context of model-based variable selection, there has been an extensive literature on model selection via regularization for the Lasso (Tibshirani (1996)), the SCAD (Fan and Li (2001)), the nonnegative garrote (Breiman (1995)), and the adaptive Lasso (Zou (2006)), among many others. In particular, Fan and Li (2001) first demonstrated that the SCAD penalty possesses the oracle properties in the sense that the regularized estimator correctly selects predictors with nonzero coefficients in the model, excludes those with zero coefficients with probability approaching one, and estimates those nonzero coefficients with the asymptotic distribution they would have if all the zero coefficients were known in advance. Fan and Peng (2004) established these properties of the SCAD for linear models, and Zhu and Zhu (2009b) employed the SCAD for variable selection in single-index models, both with a diverging p. Here we adopt the SCAD max penalty for the purpose of variable selection when p tends to infinity, but we do not impose any parametric or semi-parametric models.

Before pursuing variable selection in the framework of sufficient dimension reduction, we first note that the notions of relevant and irrelevant variables need to be clearly defined, since in SDR estimation no parametric model is imposed. Toward that end, Cook (2004) and Bondell and Li (2009) showed that, as long as the central subspace $S_{Y ∣ X}$ exists, there exists a unique partition of the predictors $X = {(X_{+}^{T}, X_{-}^{T})}^{T}$ , X₊ ∈ IR^{q×1}, and X₋ ∈ IR^{(p−q)×1}, such that

Y ⫫ X_{-} ∣ X_{+} .

(3.1)

Thus the regression of Y on X only relies on the set of predictors X₊, which we call the relevant variables, while X₋ is irrelevant. Without loss of generality, we assume that X₊ consists of the first q predictors. Moreover we assume the number of relevant predictors q is fixed as p → ∞. That is, we regard all regression information as concentrated on a fixed number of predictors with the rest of additional variables as nuisance information. We think this condition reasonable, based upon the belief that, in many real applications, increasing the number of predictors after a certain stage does not necessarily induce an increasing amount of useful information. We then have a well-defined population target for the purpose of variable selection in the absence of a traditional model.

The predictor partition in (3.1) can be directly connected with the basis η of the central subspace; that is, the last p − q rows of η must all be zeros (Cook and Ni (2006, Prop. 1)). It also leads to the following lemma in our context of least squares estimation of the central subspace.

Lemma 2

For β⁰ at (2.4), we have $β^{0} = {(β_{+}^{0 T}, 0_{(p - q) \times 1}^{T})}^{T}$ under the partition at (3.1), where $β_{+}^{0} = {argmin}_{β_{+} \in {IR}^{q}} E \{f (Y) - X_{+}^{T} β_{+}\}^{2}$ , when the linearity condition is satisfied.

Proof of Lemma 2

Under the linearity condition, we have β⁰ ∈ $S_{Y ∣ X}$ so that β⁰ can be written as a linear combination of the columns of the central subspace basis η. Since the last p − q rows of η must all be zeros, the result follows.

The class of SDR estimators studied in Section 2 yields linear combinations of all the original predictors and thus performs no variable selection. We introduce a non-concave penalty to achieve selection of relevant predictors. For a set of transformation functions, f₁, …, f_h, define the negative pseudo loglikelihood function

L_{k} (β_{k}) = n^{- 1} \sum_{i = 1}^{n} \{f_{k} (Y_{i}) - X_{i}^{T} β_{k}\}^{2} .

Applying the max-type penalty, we propose to minimize

Q (B) = \sum_{k = 1}^{h} L_{k} (β_{k}) + \sum_{j = 1}^{p} p_{λ} (\max_{1 \leq k \leq h} ∣ β_{j k} ∣)

(3.2)

over B = (β₁, …, β_h), where β_jk is the j-th element of β_k ∈ IR^p^×¹, j = 1, …, p, k = 1, …, h. Here p_λ (θ) is a general penalty function indexed by a regularization parameter λ. For now we simply assume p_λ is symmetric, singular at the origin, and non-decreasing and concave on [0, ∞). Later in this section, we introduce a specific non-concave form, the SCAD penalty function, for p_λ (θ).
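To make the structure of (3.2) concrete, the sketch below evaluates Q(B) for a user-supplied penalty function: the loss is summed column by column, whereas the penalty is applied once per row of B, to the largest absolute entry in that row. The function name `penalized_objective`, the nested-list data layout, and the L1-type placeholder penalty in the toy check are assumptions made for illustration only.

```python
def penalized_objective(B, X, F, penalty):
    """Evaluate Q(B) in (3.2).

    B: p x h coefficients, X: n x p predictors, F: n x h matrix with
    F[i][k] = f_k(Y_i), penalty: function theta -> p_lambda(theta).
    """
    n, p, h = len(X), len(B), len(B[0])
    loss = 0.0
    for k in range(h):
        for i in range(n):
            fitted = sum(X[i][j] * B[j][k] for j in range(p))
            loss += (F[i][k] - fitted) ** 2 / n
    # One penalty term per predictor (row of B), applied to the row maximum.
    row_penalty = sum(penalty(max(abs(b) for b in B[j])) for j in range(p))
    return loss + row_penalty


# Toy check with a placeholder L1-type penalty p_lambda(theta) = 0.5 * theta.
X = [[1.0, 0.0], [0.0, 1.0]]
F = [[1.0, 2.0], [0.0, 0.0]]
B = [[1.0, 1.0], [0.0, 0.0]]
q_value = penalized_objective(B, X, F, lambda theta: 0.5 * theta)
```

Because the penalty depends on B only through its row maxima, a predictor is dropped exactly when its entire row of B is zero.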

Two observations are noteworthy here. First, the minimization in (3.2) is over the entire p×h matrix B, since the penalty is imposed on the maximum over each row of B. This is different from the dimension reduction basis estimation without regularization as discussed in Section 2.1, where the minimization is carried over each column β_k of B individually. Second, variable selection achieved through (3.2) requires no dimension reduction basis estimation as a preprocessing step, and thus requires no knowledge of the structural dimension d either. For this reason, the penalty term in (3.2) has ph parameters rather than pd parameters. Selection is done essentially in one step instead of two steps, which to some degree mitigates the dependency of variable selection on the accuracy of reduction basis estimation, and can be particularly useful if model-free variable selection is the sole purpose of the study.

With a slight abuse of notation, we denote the minimizer of (3.2) as B̂ = (β̂₁, …, β̂_h), and denote the minimizer of the corresponding population version $\sum_{k = 1}^{h} E \{f_{k} (Y) - β_{k}^{T} X\}^{2}$ as $B_{0} = (β_{1}^{0}, \dots, β_{h}^{0})$ . We use ${\hat{B}}_{+} = ({\hat{β}}_{1 +}, \dots, {\hat{β}}_{h +})$ to denote the submatrix of B̂ that consists of its first q rows, and similarly denote the first q rows of B₀ as $B_{+}^{0} = (β_{1 +}^{0}, \dots, β_{h +}^{0})$ . We next aim to show that ${\hat{β}}_{k +} \to β_{k +}^{0}$ as n → ∞, and that the j-th element β̂_jk of β̂_k satisfies P(β̂_jk = 0) → 1 for j > q, k = 1, …, h.

3.2. Asymptotic properties

Let λ = λ_n. For a general non-concave penalty function p_{λ_n}(·), let $a_{n} = {max}_{1 \leq j \leq p} p_{λ_{n}}^{'} ({max}_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣)$ and $b_{n} = {max}_{1 \leq j \leq p} p_{λ_{n}}^{''} ({max}_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣)$ , where $β_{j k}^{0}$ is the j-th element of $β_{k}^{0}$ , and $p_{λ_{n}}^{'} (\cdot)$ and $p_{λ_{n}}^{''} (\cdot)$ denote the first and second order derivatives, respectively.

Lemma 3

Suppose X satisfies Conditions (iv) and (v), and that each of the response transformation functions f₁(·), …, f_h(·) satisfies Conditions (vi) and (vii) in Appendix A. If p = o(n^{1/4}), and the penalty function p_{λ_n}(·) has $a_{n} = O (1 / \sqrt{n})$ and b_n = o(1), then there exists a local minimizer B̂ = (β̂₁, …, β̂_h) of Q(B) in (3.2) such that $| | {\hat{β}}_{k} - β_{k}^{0} | | = O_{p} (\sqrt{p} (n^{- 1 / 2} + a_{n}))$ , k = 1, 2, …, h.

Lemma 4

Suppose that each of the response transformation functions, f₁(·), …, f_h(·), satisfies Conditions (vi) and (vii) in Appendix A, and that p_{λ_n} satisfies $\liminf_{n \to \infty} \inf_{θ \to 0^{+}} p_{λ_{n}}^{'} (θ) / λ_{n} > 0$ . If λ_n → 0, $\sqrt{p / n} / λ_{n} \to 0$ , and p = o(n^{1/4}) as n → ∞, then for any given q × h submatrix B₊ = (β₁₊, …, β_h₊) satisfying $| | β_{k +} - β_{k +}^{0} | | = O_{p} (\sqrt{p / n})$ , k = 1, …, h, and any (p − q) × h submatrix B₋ = (β₁₋, …, β_h₋) satisfying $| | β_{k -} | | \leq C \sqrt{p / n}$ for a constant C, k = 1, …, h, with probability tending to one,

Q ({(B_{+}^{T}, 0_{(p - q) \times h}^{T})}^{T}) = min_{| | β_{k -} | | \leq C \sqrt{p / n}} Q ({(B_{+}^{T}, B_{-}^{T})}^{T}) .

The proofs of these two lemmas are given in Appendix B.

Theorem 3

Under the conditions of Lemmas 3 and 4, with probability tending to one, the $\sqrt{p / n}$ -consistent local minimizer of Q(B) satisfies

(1) β̂_jk = 0 for j > q and 1 ≤ k ≤ h;
(2) β̂_jk for 1 ≤ j ≤ q and 1 ≤ k ≤ h have the same asymptotic distribution as the minimizers of
$\tilde{Q} (B_{+}) = n^{- 1} \sum_{k = 1}^{h} \sum_{i = 1}^{n} {(f_{k} (Y_{i}) - X_{i +}^{T} β_{k +})}^{2} + \sum_{j = 1}^{q} p_{λ_{n}} (\max_{1 \leq k \leq h} ∣ β_{j k} ∣)$

over B₊ = (β₁₊, …, β_h₊), where β_jk is the j-th element of β_k₊ ∈ IR^q^×¹, j = 1, …, q, k = 1, …, h, and X_i₊ is the i-th observation of X₊.

Proof of Theorem 3

By Lemma 3, there exists a $\sqrt{p / n}$ -consistent local minimizer B̂ of Q(B). Part (1) holds by Lemma 4, that is, $\hat{B} = {({\hat{B}}_{+}^{T}, 0_{(p - q) \times h}^{T})}^{T}$ with probability tending to one. Consequently, with probability tending to one, we are in effect minimizing Q̃(·). Then part (2) follows.

Remark 2

The asymptotic distributional result is given in a way similar to that in Knight and Fu (2000). For a non-concave max penalty, in general, the explicit asymptotic normality result, as in Fan and Li (2001) and Fan and Peng (2004), is not available because there may exist a tie $∣ β_{j k}^{0} ∣ = ∣ β_{j k^{'}}^{0} ∣$ for some 1 ≤ j ≤ q and k ≠ k′. For some specific non-concave max penalties, the asymptotic normality result is possible, as we discuss next.

We introduce a specific form of a non-concave penalty function, the SCAD penalty first proposed by Fan and Li (2001). Define a penalty function p_{λ_n}(θ) through its first derivative

p_{λ_{n}}^{'} (θ) = λ_{n} \{I (θ \leq λ_{n}) + \frac{{(a λ_{n} - θ)}_{+}}{(a - 1) λ_{n}} I (θ > λ_{n})\}, θ \geq 0,

(3.3)

where a > 2 is an additional tuning parameter. It is easy to see that this function satisfies the non-concave penalty condition. Note that p_{λ_n}(θ) flattens out for |θ| > aλ_n. Consequently, a_n = 0 and b_n = 0 as long as $λ_{n} < a^{- 1} {max}_{1 \leq j \leq q, 1 \leq k \leq h} ∣ β_{j k}^{0} ∣$ . This feature enables us to refine the result of Theorem 3, and leads to the following corollary.
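The derivative in (3.3) and the penalty it integrates to can be written down directly. The sketch below is an illustration only: the function names are hypothetical, the default a = 3.7 simply mirrors the value used for Figure 1, and the closed form of the penalty is obtained by integrating (3.3) from 0 under the usual convention p_λ(0) = 0.

```python
def scad_derivative(theta, lam, a=3.7):
    """First derivative p'_lambda(theta) of the SCAD penalty in (3.3), theta >= 0."""
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)


def scad_penalty(theta, lam, a=3.7):
    """Closed-form SCAD penalty obtained by integrating (3.3) from 0 to |theta|."""
    t = abs(theta)
    if t <= lam:
        return lam * t  # linear (lasso-like) near the origin
    if t <= a * lam:
        return (2.0 * a * lam * t - t * t - lam * lam) / (2.0 * (a - 1.0))
    return (a + 1.0) * lam * lam / 2.0  # flat beyond a * lambda
```

For |θ| > aλ the derivative is zero and the penalty stays at (a + 1)λ²/2, which is the flattening that the refinement of Theorem 3 relies on.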

Corollary 2

For the SCAD penalty, a_n = 0 and b_n = 0 when $λ_{n} < a^{- 1} {max}_{1 \leq j \leq q, 1 \leq k \leq h} ∣ β_{j k}^{0} ∣$ . Then under the conditions of Lemmas 3 and 4, with probability tending to one, the $\sqrt{p / n}$ -consistent local minimizer of Q(B) satisfies

(1) β̂_jk = 0 for j > q and 1 ≤ k ≤ h;
(2) $\sqrt{n} ({\hat{β}}_{k +} - β_{k +}^{0}) \to N (0, Σ_{+}^{- 1} Σ_{k +} Σ_{+}^{- 1})$ for k = 1, …, h, where Σ₊ = Cov (X₊) and Σ_{k+} = Var {f_k(Y)X₊}.

Proof of Corollary 2

It is straightforward to verify that the SCAD penalty satisfies all the penalty-related conditions in Theorem 3. Since β̂_k₊, k = 1, …, h, are consistent, and $λ_{n} < a^{- 1} {max}_{1 \leq j \leq q, 1 \leq k \leq h} ∣ β_{j k}^{0} ∣$ asymptotically, we are optimizing Q̃(B₊) in a neighborhood of (β₁₊, …, β_h₊) satisfying max_{1≤k≤h} |β_jk| > aλ_n for j = 1, …, q. Correspondingly, p_{λ_n}(max_{1≤k≤h} |β_jk|) reduces to $p_{λ_{n}} (a λ_{n}) = (a + 1) λ_{n}^{2} / 2$ , which does not depend on β_k₊, k = 1, …, h. As such,

\underset{β_{1 +}, \dots, β_{h +}}{\arg\min} \frac{1}{n} \sum_{k = 1}^{h} \sum_{i = 1}^{n} \{f_{k} (Y_{i}) - X_{i +}^{T} β_{k +}\}^{2} + \sum_{j = 1}^{q} p_{λ_{n}} (\max_{1 \leq k \leq h} ∣ β_{j k} ∣)

is the same as

\underset{β_{1 +}, \dots, β_{h +}}{\arg\min} \sum_{k = 1}^{h} \sum_{i = 1}^{n} \frac{1}{n} \{f_{k} (Y_{i}) - X_{i +}^{T} β_{k +}\}^{2} .

The desired result follows.

Remark 3

Corollary 2 is a special case of Theorem 3 since the SCAD penalty function is a special case of the general non-concave penalty function. This refined result is possible because the SCAD function is flat when its argument is larger than aλ in magnitude. Consequently there is no asymptotic bias in using B̂₊ to estimate $B_{+}^{0}$ . This is similar in spirit to the result of Theorem 2 of Fan and Li (2001).

Remark 4

We obtain the $\sqrt{n}$ -rate for dimension reduction basis estimation after variable selection because the number of truly relevant predictors q is assumed fixed. Consequently, with the SCAD regularized estimator selecting all truly relevant predictors and excluding all irrelevant ones with probability approaching one, the basis estimation based on those relevant predictors achieves a $\sqrt{n}$ -rate.

Remark 5

Our results differ from those of Fan and Li (2001) and Fan and Peng (2004), in that they require a parametric linear model and all results hinge on the model being correctly specified. By contrast, our approach does not require a traditional model, and our technical proofs are based on the pseudo-likelihood function.

3.3. Optimization algorithm

We propose an algorithm to minimize Q(B) in (3.2). Note that the SCAD type penalty is not convex, so Q(B) is a non-convex objective that requires a specially designed optimization algorithm. In the literature, there exist a number of such algorithms, including local quadratic approximation (Fan and Li (2001)), the minorize-maximize algorithm (Hunter and Li (2005)), local linear approximation (Zou and Li (2008)), and the difference convex algorithm (DC; An and Tao (1997), Wu and Liu (2009)). For our problem, we employ the DC algorithm, which solves a non-convex optimization problem via a sequence of convex optimizations by decomposing the non-convex objective function as the difference of two convex functions.

For the SCAD penalty, we note that its first derivative as given in (3.3) can be decomposed as $p_{λ}^{'} (θ) = p_{λ, 1}^{'} (θ) - p_{λ, 2}^{'} (θ)$ , where $p_{λ, 1}^{'} (θ) = λ$ is a constant and $p_{λ, 2}^{'} (θ) = λ [1 - {(a λ - θ)}_{+} / \{(a - 1) λ\}] I (θ > λ)$ is a non-decreasing function on the range θ > 0. Accordingly, the SCAD penalty function can be decomposed as p_λ(θ) = p_{λ,1}(θ) − p_{λ,2}(θ), where both p_{λ,1}(·) and p_{λ,2}(·) are convex, with $p_{λ, 1}^{'} (θ)$ and $p_{λ, 2}^{'} (θ)$ as their respective derivatives. Figure 1 illustrates such a decomposition for a SCAD function with a particular set of parameters, a = 3.7 and λ = 2, where the left panel plots p_{λ,1}(θ), the central panel p_{λ,2}(θ), and the right panel p_λ(θ) = p_{λ,1}(θ) − p_{λ,2}(θ).

Figure 1. Decomposition of the SCAD penalty as p_λ(θ) = p_{λ,1}(θ) − p_{λ,2}(θ), with parameters λ = 2 and a = 3.7.
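The decomposition can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the closed forms for the two convex parts are obtained here by integrating their derivatives from 0, with both parts set to 0 at θ = 0. It evaluates both parts on a grid with the Figure 1 parameters and forms their difference.

```python
def p1(theta, lam):
    """Convex part p_{lambda,1}(theta) = lambda * theta, with derivative lambda."""
    return lam * theta


def p2_derivative(theta, lam, a):
    """Derivative of p_{lambda,2}: zero up to lambda, then rising linearly to lambda."""
    if theta <= lam:
        return 0.0
    return lam - max(a * lam - theta, 0.0) / (a - 1.0)


def p2(theta, lam, a):
    """Convex part p_{lambda,2}(theta), the integral of p2_derivative from 0 to theta."""
    if theta <= lam:
        return 0.0
    if theta <= a * lam:
        return (theta - lam) ** 2 / (2.0 * (a - 1.0))
    return lam * theta - (a + 1.0) * lam ** 2 / 2.0


lam, a = 2.0, 3.7  # the parameter values used in Figure 1
grid = [0.1 * i for i in range(121)]  # theta from 0 to 12
scad_values = [p1(t, lam) - p2(t, lam, a) for t in grid]
p2_slopes = [p2_derivative(t, lam, a) for t in grid]
```

The difference equals λθ for θ ≤ λ and the constant (a + 1)λ²/2 for θ ≥ aλ, and the slope of the second part never decreases, which is what makes it convex.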

We next decompose the objective function in (3.2) as Q(B) = Q_vex(B) + Q_cav(B), where

\begin{array}{l} Q_{vex} (B) = \sum_{k = 1}^{h} L_{k} (β_{k}) + \sum_{j = 1}^{p} p_{λ, 1} (max_{1 \leq k \leq h} ∣ β_{j k} ∣), \\ Q_{cav} (B) = - \sum_{j = 1}^{p} p_{λ, 2} (max_{1 \leq k \leq h} ∣ β_{j k} ∣) . \end{array}

We initialize B = B⁽⁰⁾ and then update B iteratively. At the (t+1)-th step, the DC algorithm uses the linear function $- \sum_{j = 1}^{p} p_{λ, 2}^{'} ({max}_{1 \leq k \leq h} ∣ β_{j k}^{(t)} ∣) ({max}_{1 \leq k \leq h} ∣ β_{j k} ∣ - {max}_{1 \leq k \leq h} ∣ β_{j k}^{(t)} ∣)$ of the row maxima to approximate the concave part Q_cav(B), up to an additive constant, where $β_{j k}^{(t)}$ denotes the (j, k)-th element of the solution B⁽ᵗ⁾ from the t-th step. Then minimizing Q(B) amounts to iteratively solving

B^{(t + 1)} = \underset{B}{\arg\min} \{Q_{vex} (B) - \sum_{j = 1}^{p} p_{λ, 2}^{'} (\max_{1 \leq k \leq h} ∣ β_{j k}^{(t)} ∣) (\max_{1 \leq k \leq h} ∣ β_{j k} ∣ - \max_{1 \leq k \leq h} ∣ β_{j k}^{(t)} ∣)\} .

(3.4)

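Optimization in (3.4) can be further formulated as a quadratic programming problem: letting $ξ_{j}^{(t)} = {max}_{1 \leq k \leq h} ∣ β_{j k}^{(t)} ∣$ and introducing auxiliary variables ξ₁, …, ξ_p that bound the row maxima, we minimize

n^{- 1} \sum_{k = 1}^{h} \sum_{i = 1}^{n} {(f_{k} (Y_{i}) - X_{i}^{T} β_{k})}^{2} + \sum_{j = 1}^{p} (λ - p_{λ, 2}^{'} (ξ_{j}^{(t)})) ξ_{j}

over B = (β₁, …, β_h) and ξ = (ξ₁, …, ξ_p)^T subject to ξ_j ≥ β_jk and ξ_j ≥ −β_jk, j = 1, …, p, k = 1, …, h. Because $0 \leq p_{λ, 2}^{'} \leq λ$ , each coefficient $λ - p_{λ, 2}^{'} (ξ_{j}^{(t)})$ is nonnegative, so ξ_j can be taken equal to max_{1≤k≤h} |β_jk| at the optimum. Existing software is available to solve this quadratic programming problem.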

Hunter and Li (2005) studied the convergence property of their minorize-maximize (MM) algorithm for the SCAD penalty. Our DC solution can also be viewed as an instance of their MM algorithm, in its majorize-minimize form since Q(B) is minimized: we replace the concave part Q_cav(B) by its affine majorization at each iteration. As the objective function Q(B) is nonnegative, by the descent property of the MM algorithm, our DC algorithm is bound to converge to an ε-local minimizer in finitely many steps. In practice, we deem the algorithm to have converged if $\sum_{k = 1}^{h} \sum_{j = 1}^{p} | β_{j k}^{(t)} - β_{j k}^{(t + 1)} |$ is sufficiently small, e.g., less than 10⁻⁴.
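The iteration can be sketched in a few lines for the special case h = 1, where the max over k reduces to |β_j| and each DC step becomes a weighted lasso problem with weights λ − p′_{λ,2}(|β_j^{(t)}|). This sketch is an illustration under stated assumptions, not the authors' implementation: it replaces the quadratic programming solver with plain coordinate descent, starts from B⁽⁰⁾ = 0, assumes the standard SCAD form of Fan and Li (2001), and all function names are hypothetical.

```python
import random


def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty p_lambda(theta) for theta >= 0 (assumed standard form)."""
    if theta <= lam:
        return lam * theta
    if theta <= a * lam:
        return (2 * a * lam * theta - theta ** 2 - lam ** 2) / (2 * (a - 1))
    return (a + 1) * lam ** 2 / 2


def scad_derivative(theta, lam, a=3.7):
    """p'_lambda(theta) = lambda - p'_{lambda,2}(theta), used as the DC weight."""
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1)


def soft_threshold(z, t):
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def objective(X, y, beta, lam, a=3.7):
    """Q(beta) = n^{-1} * RSS + sum_j p_lambda(|beta_j|)."""
    rss = sum((yi - sum(x * b for x, b in zip(row, beta))) ** 2
              for row, yi in zip(X, y))
    return rss / len(y) + sum(scad_penalty(abs(b), lam, a) for b in beta)


def weighted_lasso(X, y, weights, start, sweeps=200, tol=1e-10):
    """Coordinate descent for n^{-1} * RSS + sum_j weights[j] * |beta_j|."""
    n, p = len(y), len(start)
    beta = list(start)
    resid = [yi - sum(x * b for x, b in zip(row, beta)) for row, yi in zip(X, y)]
    scale = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(sweeps):
        largest_move = 0.0
        for j in range(p):
            if scale[j] == 0.0:
                continue
            # Inner product of column j with the partial residual, divided by n.
            z = sum(row[j] * r for row, r in zip(X, resid)) / n + scale[j] * beta[j]
            new = soft_threshold(z, weights[j] / 2) / scale[j]
            move = new - beta[j]
            if move != 0.0:
                resid = [r - row[j] * move for row, r in zip(X, resid)]
                beta[j] = new
                largest_move = max(largest_move, abs(move))
        if largest_move < tol:
            break
    return beta


def dc_scad(X, y, lam, a=3.7, tol=1e-4, max_iter=50):
    """DC iterations for h = 1; returns (beta, history of objective values)."""
    beta = [0.0] * len(X[0])  # B^(0) = 0 is an assumed starting value
    history = [objective(X, y, beta, lam, a)]
    for _ in range(max_iter):
        weights = [scad_derivative(abs(b), lam, a) for b in beta]
        new_beta = weighted_lasso(X, y, weights, beta)
        history.append(objective(X, y, new_beta, lam, a))
        change = sum(abs(u - v) for u, v in zip(beta, new_beta))
        beta = new_beta
        if change < tol:  # stopping rule from the text
            break
    return beta, history


def demo(seed=0, n=200, p=5):
    """Toy data: two relevant predictors (coefficients 3 and -2), three irrelevant."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [3 * row[0] - 2 * row[1] + 0.5 * rng.gauss(0, 1) for row in X]
    return dc_scad(X, y, lam=0.5)
```

Because each inner solve is warm-started at the current iterate and coordinate updates never increase the surrogate, the recorded objective values should be nonincreasing, which is the descent property invoked above.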

4. Numerical Studies

In this section, we examine the finite sample performance of the proposed method using both simulations and a data example. We employed a BIC-type criterion to select the tuning parameter λ for the SCAD penalty, $\sum_{k = 1}^{h} \sum_{i = 1}^{n} {(f_{k} (y_{i}) - X_{i}^{T} {\hat{β}}_{k}^{(λ)})}^{2} + n^{(λ)} \log n$, where $n^{(λ)} = \# \{j : \max_{1 \leq k \leq h} | {\hat{β}}_{j k}^{(λ)} | > 0\}$ denotes the number of active predictors at λ. The BIC criterion has been commonly used in regularized variable selection, e.g., Wang, Li and Tsai (2007). For transformation functions, we implemented the slice indicator function that gives the usual SIR estimate, and the B-spline basis function suggested in Fung et al. (2002). For the former, we fixed the number of slices at h = 5 and, for the latter, we used a linear spline with three inner knots, which also yields h = 5.
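A minimal Python sketch of the slice indicator transformation and the BIC-type criterion follows. It assumes slices of (nearly) equal size formed from the ordered responses and centers each indicator so that the transformed response has sample mean zero; both choices, and the function names, are assumptions for illustration rather than details given in the paper.

```python
import math


def slice_indicators(y, h=5):
    """Return an n x h list of centered slice indicators f_k(y_i)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    label = [0] * n
    for rank, i in enumerate(order):
        label[i] = rank * h // n  # slice index in 0, ..., h - 1
    F = [[1.0 if label[i] == k else 0.0 for k in range(h)] for i in range(n)]
    means = [sum(row[k] for row in F) / n for k in range(h)]
    return [[row[k] - means[k] for k in range(h)] for row in F]


def bic_criterion(F, X, B):
    """Residual sum of squares over all h regressions plus n_active * log(n).

    F is n x h (transformed responses), X is n x p, and B is p x h.
    """
    n, h, p = len(F), len(F[0]), len(B)
    rss = 0.0
    for i in range(n):
        for k in range(h):
            fit = sum(X[i][j] * B[j][k] for j in range(p))
            rss += (F[i][k] - fit) ** 2
    # A predictor is active when its largest absolute coefficient is nonzero.
    n_active = sum(1 for j in range(p) if max(abs(b) for b in B[j]) > 0)
    return rss + n_active * math.log(n)
```

In use, one would evaluate `bic_criterion` at the penalized estimate for each λ on a grid and keep the λ with the smallest value.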

4.1. Simulations

For Examples 4.1 and 4.2, we generated independent X_j from the standard normal. We also considered correlated predictors with $Corr (X_{i}, X_{j}) = 0.5^{| i - j |}$, 1 ≤ i, j ≤ p.
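One way to draw predictors with this correlation structure is a first-order autoregressive recursion, sketched below in Python. The recursion and the helper names are assumptions made for illustration; the paper does not say how the correlated predictors were generated. Setting `rho=0` recovers the independent standard normal case.

```python
import math
import random


def ar1_correlation(p, rho=0.5):
    """Target correlation matrix with (i, j) entry rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(p)] for i in range(p)]


def draw_predictors(n, p, rho=0.5, rng=None):
    """Draw n rows with unit variances and Corr(X_i, X_j) = rho ** |i - j|."""
    rng = rng or random.Random(0)
    scale = math.sqrt(1 - rho * rho)  # keeps every coordinate at variance one
    rows = []
    for _ in range(n):
        x = [rng.gauss(0, 1)]
        for _ in range(1, p):
            x.append(rho * x[-1] + scale * rng.gauss(0, 1))
        rows.append(x)
    return rows


def sample_corr(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)
```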

Example 4.1

Here

Y = \frac{X^{T} β_{1}}{0.5 + {(1.5 + X^{T} β_{2})}^{2}} + 1.2 ε,

where ε ~ Normal(0, 1) is independent of X. In this model the structural dimension d = 2. We chose β₁ = (1, 1, 0, …, 0)^T and β₂ = (0, 0, 1, 1, 0, …, 0)^T. We considered n = 400, p = 20 and n = 800, p = 40. We employed the vector correlation coefficient (Hotelling (1936)) to evaluate the accuracy of the dimension reduction basis estimation, and it ranges between 0 and 1 with a larger value indicating a better estimate. Results based on 100 data replications are reported in Table 1 (left half), where the mean and standard deviation (in parentheses) of the vector correlations between the true and the estimated central subspace basis are shown. We compared the usual SDR estimator without penalty and the one with the SCAD max penalty. Due to the sparse nature of the central subspace basis, the penalized SDR estimator achieved a better estimation accuracy. To evaluate the performance in terms of variable selection, we employ the true positive rate and the false positive rate, a pair of criteria that are commonly used in biomedical research. Table 2 (left half) reports the average results of the penalized SDR estimator. It is clearly seen that all truly relevant predictors were selected, while the false positive rate was low. Moreover, two choices of transformation functions had similar empirical performance in this example.
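The evaluation criteria can be sketched in Python as follows. The paper does not spell out the formula for the vector correlation coefficient; this sketch assumes the form common in the SDR literature, namely the square root of the product of squared canonical correlations between the two subspaces, computed here as |det(UᵀV)| for orthonormal bases U and V. The function names are illustrative.

```python
import math


def orthonormalize(A):
    """Gram-Schmidt on the columns of a p x d matrix given as a list of rows."""
    p, d = len(A), len(A[0])
    basis = []
    for k in range(d):
        v = [A[i][k] for i in range(p)]
        for u in basis:
            proj = sum(ui * vi for ui, vi in zip(u, v))
            v = [vi - proj * ui for ui, vi in zip(u, v)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm < 1e-12:
            raise ValueError("columns are linearly dependent")
        basis.append([vi / norm for vi in v])
    return basis  # list of d orthonormal vectors of length p


def determinant(M):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in M]
    d, det = len(M), 1.0
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(M[r][c]))
        if abs(M[pivot][c]) < 1e-15:
            return 0.0
        if pivot != c:
            M[c], M[pivot] = M[pivot], M[c]
            det = -det
        det *= M[c][c]
        for r in range(c + 1, d):
            factor = M[r][c] / M[c][c]
            for k in range(c, d):
                M[r][k] -= factor * M[c][k]
    return det


def vector_correlation(B_true, B_hat):
    """Value in [0, 1]; equals 1 when the two column spaces coincide."""
    U, V = orthonormalize(B_true), orthonormalize(B_hat)
    M = [[sum(a * b for a, b in zip(u, v)) for v in V] for u in U]
    return abs(determinant(M))


def selection_rates(selected, relevant, p):
    """True positive rate and false positive rate of a selected index set."""
    selected, relevant = set(selected), set(relevant)
    tpr = len(selected & relevant) / len(relevant)
    fpr = len(selected - relevant) / (p - len(relevant))
    return tpr, fpr
```

The measure depends only on the spanned subspaces, so any basis of the true central subspace gives the same value.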

Table 1.

Evaluation of dimension reduction basis estimation for Examples 4.1 and 4.2. Reported are the mean and standard deviation (in parentheses) of the vector correlation coefficients.

	Example 4.1 with d = 2		Example 4.2 with d = 3
	Slicing	Spline	Slicing	Spline
	p = 20, n = 400		p = 20, n = 600
w/o penalty	0.92 (0.03)	0.88 (0.04)	0.88 (0.03)	0.85 (0.06)
SCAD	0.98 (0.02)	0.92 (0.11)	0.96 (0.03)	0.96 (0.04)

	p = 40, n = 800		p = 40, n = 1,200
w/o penalty	0.92 (0.02)	0.87 (0.03)	0.87 (0.02)	0.84 (0.04)
SCAD	0.99 (0.01)	0.99 (0.01)	0.98 (0.01)	0.98 (0.01)

Table 2.

Evaluation of variable selection for Examples 4.1 and 4.2. Reported are the mean and standard deviation (in parentheses) of true positive rate (TPR) and false positive rate (FPR).

	Example 4.1 with d = 2		Example 4.2 with d = 3
	Slicing	Spline	Slicing	Spline
	p = 20, n = 400		p = 20, n = 600
TPR	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)
FPR	0.04 (0.05)	0.01 (0.02)	0.02 (0.05)	0.04 (0.04)

	p = 40, n = 800		p = 40, n = 1,200
TPR	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)
FPR	0.00 (0.00)	0.02 (0.02)	0.06 (0.04)	0.00 (0.01)

Example 4.2

Here Y = sign(X^T β₁) log |X^T β₂ + 5| + X^T β₃ + 0.2ε, where ε ~ Normal(0, 1) is independent of X. In this example, the structural dimension d is 3. We chose β₁ = (1, 1, 0, …, 0)^T, β₂ = (0, 0, 1, 1, 0, …, 0)^T, and β₃ = (0, 0, 0, 0, 1, 1, 0, …, 0)^T, where n = 600, p = 20 and n = 1,200, p = 40. Results of reduction basis estimation are reported in Table 1 (right half), and results of variable selection are reported in Table 2 (right half). Again, the proposed SDR estimator with the SCAD max penalty achieved a good performance in terms of both basis estimation and variable selection.
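The two simulation models with independent predictors can be generated as in the following Python sketch. The function names and the use of Python's `random` module are assumptions for illustration; only the model formulas and the choices of β₁, β₂, β₃ come from the text.

```python
import math
import random

RELEVANT_41 = [0, 1, 2, 3]        # zero-based indices of relevant predictors
RELEVANT_42 = [0, 1, 2, 3, 4, 5]


def example_41(n, p, rng):
    """Y = X'b1 / {0.5 + (1.5 + X'b2)^2} + 1.2 * eps, with d = 2."""
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        index1 = x[0] + x[1]  # X' beta_1 with beta_1 = (1, 1, 0, ..., 0)
        index2 = x[2] + x[3]  # X' beta_2 with beta_2 = (0, 0, 1, 1, 0, ..., 0)
        y = index1 / (0.5 + (1.5 + index2) ** 2) + 1.2 * rng.gauss(0, 1)
        X.append(x)
        Y.append(y)
    return X, Y


def example_42(n, p, rng):
    """Y = sign(X'b1) * log|X'b2 + 5| + X'b3 + 0.2 * eps, with d = 3."""
    X, Y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(p)]
        index1 = x[0] + x[1]
        index2 = x[2] + x[3]
        index3 = x[4] + x[5]  # X' beta_3 with beta_3 = (0, 0, 0, 0, 1, 1, 0, ..., 0)
        sign = (index1 > 0) - (index1 < 0)
        y = sign * math.log(abs(index2 + 5)) + index3 + 0.2 * rng.gauss(0, 1)
        X.append(x)
        Y.append(y)
    return X, Y
```

For instance, `example_42(600, 20, random.Random(0))` produces one replication of the smaller setting of Example 4.2.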

We next consider the performance with correlated predictors. Table 3 reports the results of reduction basis estimation and variable selection when p = 40. It is seen from the table that correlation among the predictors had some bearing on the method, but the overall performance resembled the results for the case without correlation: the penalized SDR estimator improved the estimation accuracy in terms of reduction basis estimation, and achieved a high true positive rate and a low false positive rate.

Table 3.

Evaluation of dimension reduction basis estimation and variable selection for Examples 4.1 and 4.2 with correlated predictors.

	Example 4.1 with d = 2		Example 4.2 with d = 3
	Slicing	Spline	Slicing	Spline
	p = 40, n = 800		p = 40, n = 1,200
w/o penalty	0.77 (0.05)	0.73 (0.06)	0.77 (0.03)	0.74 (0.05)
SCAD	0.95 (0.04)	0.86 (0.13)	0.95 (0.03)	0.96 (0.03)

TPR	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)	1.00 (0.00)
FPR	0.00 (0.01)	0.01 (0.01)	0.04 (0.04)	0.00 (0.01)

4.2. A data example

We briefly analyze the motif discovery data of Zhong et al. (2005) to illustrate the proposed method, though our analysis is by no means comprehensive. The goal here is to identify a subset of transcription factor binding motifs that affect the gene expression values. The response variable is the expression value obtained by DNA microarray experiments, the predictors are the motif-matching scores of p̃ = 414 candidate motifs, and the data consist of n = 5,970 genes as the sample observations. To bring the number of candidate predictors to the order of $\sqrt{n}$, we employed univariate regression for an initial screening, following the spirit of Fan and Lv (2008). We set the cutoff p-value at 0.05, and obtained p = 118 motifs for subsequent analysis. Zhong et al. (2005) suggested that the central subspace is two-dimensional and that the predictors affect the response in some nonlinear fashion. We applied our variable selection method to these data. The slicing transformation selected 16 motifs, whereas the spline transformation selected 9 motifs, which form a subset of the 16.
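The screening step can be sketched in Python as below. The text states only that univariate regression with a p-value cutoff of 0.05 was used; this sketch assumes simple linear regression of the response on each predictor and a normal approximation to the slope's t-statistic, which is reasonable for n in the thousands. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def marginal_p_value(x, y):
    """Two-sided p-value for the slope in a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        return 1.0  # a constant predictor carries no marginal signal
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    rss = sum((yi - my - slope * (xi - mx)) ** 2 for xi, yi in zip(x, y))
    if rss == 0:
        return 0.0  # perfect linear fit
    se = math.sqrt(rss / (n - 2) / sxx)
    z = slope / se
    # Normal approximation to the t distribution (an assumption of this sketch).
    return 2 * (1 - NormalDist().cdf(abs(z)))


def screen(X, y, cutoff=0.05):
    """Indices of predictors whose marginal p-value falls below the cutoff."""
    p = len(X[0])
    return [j for j in range(p)
            if marginal_p_value([row[j] for row in X], y) < cutoff]
```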

5. Discussion

There are a number of ways to extend this work. First, in our current development, we have treated the number of transformations h as fixed, since it usually takes a pre-specified small value and this helps simplify the technical derivations. For some particular transformation choices, a fixed h may result in an estimate of a proper subspace of the central subspace. As such, it is of interest to extend our results to a diverging h. We speculate that the results of Corollary 1 would be modified accordingly, with the convergence rate of h⁻¹B̂B̂^T at $O_{p} (\sqrt{\log (h) p / n})$, although a rigorous conclusion needs more careful study. Second, the SDR estimators discussed in Section 2.1 rely on the first inverse moment E(X|Y). When E(X|Y) = 0, the estimated subspace may be a proper subspace of the central subspace. There have been proposals of SDR estimators that take advantage of the second or higher inverse moments, for instance, Cook and Weisberg (1991), Yin and Cook (2003), Li, Zha, and Chiaromonte (2005), and Li and Wang (2007). It is of interest to investigate the asymptotics of those SDR estimators with a diverging p. Finally, in many recent microarray and genetics studies, the number of predictors exceeds the number of observational units. Asymptotic properties of both dimension reduction and variable selection with p > n remain to be explored. Full investigation of these extensions is left for future research.

Acknowledgments

Wu is supported in part by NSF grant DMS-0905561, NIH grant R01-CA149569, and NCSU Faculty Research and Professional Development Award. Li is supported in part by NSF grant DMS-0706919.

Appendix A: Technical Conditions

(i) There are constants a^* > 0 and C > 0 such that, for all β with ||β|| = 1,
$P (\sum_{i \in I} {(X_{i}^{T} β)}^{2} > a^{*} n) \to 1$ as n → ∞,
where $I = I (β, C) = \{i = 1, \dots, n : | X_{i}^{T} β | \leq C\}$.

(ii) For any ε > 0, there exists a constant C > 0 such that, for all β with ||β|| = 1,
$P (\sum_{i \notin I} {(X_{i}^{T} β)}^{2} \leq ε n) \to 1$ as n → ∞.

(iii) There is a constant C such that $P (\max_{i = 1, \dots, n} {\| X_{i} \|}^{2} \leq C n^{2}) \to 1$ as n → ∞.

(iv) $E (X_{j}^{4}) < C$ for some constant C > 0, j = 1, …, p.

(v) Σ = Cov(X) is positive definite with all its eigenvalues bounded between c and c̄, 0 < c < c̄ < ∞, for all p = p_n.

(vi) E{f(Y)} = 0 and Var{f(Y) − X^T β⁰} < ∞.

(vii) The eigenvalues of the pseudo-Fisher information matrix I(β⁰) of β⁰(f) are bounded for all p = p_n:
$0 < \underline{λ} < λ_{min} (I (β^{0})) \leq λ_{max} (I (β^{0})) < \bar{λ} < \infty$ for all p = p_n,
where, up to a constant,
$I (β^{0}) = E ([X \{f (Y) - X^{T} β^{0}\}] {[X \{f (Y) - X^{T} β^{0}\}]}^{T}) .$

Remark 6

The regularity Conditions (i), (ii), and (iii) are simplified versions of Conditions X1, X2, and X3 of Portnoy (1984). Portnoy (1984) showed that these conditions hold in probability if {X₁, X₂, …, X_n} are i.i.d. according to a distribution satisfying his (4.3). As our Conditions (i), (ii), and (iii) are weaker, the same result applies.

Appendix B: Proofs

Proof of Theorem 1

Let $F (α) = \sum_{i = 1}^{n} X_{i} \{f (Y_{i}) - X_{i}^{T} β^{0} - X_{i}^{T} α\}$ with $α \in \mathbb{R}^{p \times 1}$. Due to the convexity of the squared loss and the fact that β̂ = β⁰ + α̂, it suffices to show that there is a root α̂ of F(α) satisfying ||α̂||² = O_p(p/n). According to 6.3.4 of Ortega and Rheinboldt (1970), it in turn suffices to show that α^T F(α) < 0 for ||α||² = Bp/n for some B > 0. Toward that end, write $α^{T} F (α) = \sum_{i = 1}^{n} X_{i}^{T} α \{f (Y_{i}) - X_{i}^{T} β^{0}\} - \sum_{i = 1}^{n} {(X_{i}^{T} α)}^{2} \equiv A_{1} - A_{2}$.

For A₂, we have $A_{2} = \sum_{i = 1}^{n} {(X_{i}^{T} α)}^{2} \geq {| | α | |}^{2} {inf}_{| | β | | = 1} \sum_{i = 1}^{n} {(X_{i}^{T} β)}^{2} \geq a^{*} n {| | α | |}^{2}$ in probability for some constant a^* > 0, due to Lemma 1.

For A₁, we have that $| A_{1} | \leq \| α \| \, \| \sum_{i = 1}^{n} X_{i} \{f (Y_{i}) - X_{i}^{T} β^{0}\} \|$. Then,

\begin{array}{l} E {\| \sum_{i = 1}^{n} X_{i} \{f (Y_{i}) - X_{i}^{T} β^{0}\} \|}^{2} = E (\sum_{j = 1}^{p} {[\sum_{i = 1}^{n} X_{i j} \{f (Y_{i}) - X_{i}^{T} β^{0}\}]}^{2}) \\ = E (\sum_{j = 1}^{p} \sum_{i = 1}^{n} \sum_{i^{'} = 1}^{n} X_{i j} X_{i^{'} j} \{f (Y_{i}) - X_{i}^{T} β^{0}\} \{f (Y_{i^{'}}) - X_{i^{'}}^{T} β^{0}\}) \\ = \sum_{j = 1}^{p} \sum_{i = 1}^{n} E (X_{i j}^{2} {\{f (Y_{i}) - X_{i}^{T} β^{0}\}}^{2}) + \sum_{j = 1}^{p} \sum_{1 \leq i \neq i^{'} \leq n} E_{i j} E_{i^{'} j} \\ \leq B^{'} n p \text{ for some } B^{'} > 0, \end{array}

where $E_{i j} = E [X_{i j} \{f (Y_{i}) - X_{i}^{T} β^{0}\}]$. The last inequality is true because β⁰ = argmin E{f(Y) − X^T β}², which implies that E(X X^T)β⁰ = E{f(Y)X}, and thus for any 1 ≤ j ≤ p, $\sum_{m = 1}^{p} E (X_{j} X_{m}) β_{m}^{0} = E \{f (Y) X_{j}\}$, so E_ij = 0. Then by Chebyshev’s inequality, for any ε > 0, there is a constant B^* such that $P \{A_{1} \leq B^{*} \sqrt{n p} \, \| α \| \text{ for all } α\} \geq 1 - ε$.

Combining the above two results, we have

P \{A_{1} - A_{2} \leq B^{*} \sqrt{n p} \, \| α \| - a^{*} n {\| α \|}^{2} \text{ for all } α\} \geq 1 - 2 ε .

Set B = (2B^*/a^*)² and ||α||² = Bp/n. Then we have

P \{α^{T} F (α) < 0 \text{ for all } α \text{ with } {\| α \|}^{2} = B \frac{p}{n}\} \geq P \{A_{1} - A_{2} \leq - \frac{1}{2} B a^{*} p \text{ for } {\| α \|}^{2} = B \frac{p}{n}\} \geq 1 - 2 ε .

Our desired result then follows from Ortega and Rheinboldt (1970).
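As a check on the choice of B, the arithmetic behind the bound can be written out. Substituting ||α||² = Bp/n and using √B = 2B^*/a^*:

```latex
B^{*}\sqrt{np}\,\|\alpha\| - a^{*} n \|\alpha\|^{2}
  = B^{*}\sqrt{B}\,p - a^{*} B\,p
  = \sqrt{B}\,p\,\bigl(B^{*} - a^{*}\sqrt{B}\bigr)
  = -B^{*}\sqrt{B}\,p
  = -\tfrac{1}{2}\,B\,a^{*}\,p \;<\; 0 .
```

The last equality uses B^* = a^*√B/2, so the quadratic term A₂ dominates A₁ on the sphere ||α||² = Bp/n.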

Proof of Lemma 3

Let $α_{n} = \sqrt{p_{n}} (n^{- 1 / 2} + a_{n})$ . We need to show that for any ε > 0 there exists a constant C > 0 such that

P \{\inf_{\| U \| = C} Q (B^{0} + α_{n} U) > Q (B^{0})\} \geq 1 - ε .

Note that

\begin{array}{l} Q (B^{0} + α_{n} U) - Q (B^{0}) \geq \sum_{k = 1}^{h} \{L_{k} (β_{k}^{0} + α_{n} u_{k}) - L_{k} (β_{k}^{0})\} + \sum_{j = 1}^{q} (p_{λ} (\max_{1 \leq k \leq h} | β_{j k}^{0} + α_{n} u_{j k} |) - p_{λ} (\max_{1 \leq k \leq h} | β_{j k}^{0} |)) \\ \equiv D_{1} + D_{2} . \end{array}

We decompose D₁ and D₂, respectively, as D₁ = D₁₁ + D₁₂ and D₂ = D₂₁ + D₂₂, where

\begin{array}{l} D_{11} = α_{n} \sum_{k = 1}^{h} u_{k}^{T} \frac{\partial}{\partial β} L_{k} (β_{k}^{0}), \\ D_{12} = \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} u_{k}^{T} {\frac{\partial^{2}}{\partial β^{2}} L_{k} (β_{k}^{0})} u_{k}, \\ D_{21} = \sum_{j = 1}^{q} p_{λ}^{'} (max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣) (max_{1 \leq k \leq h} ∣ β_{j k}^{0} + α_{n} u_{j k} ∣ - max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣), \\ D_{22} = \sum_{j = 1}^{q} \frac{1}{2} p_{λ}^{″} (max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣) {(max_{1 \leq k \leq h} ∣ β_{j k}^{0} + α_{n} u_{j k} ∣ - max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣)}^{2} (1 + o (1)) . \end{array}

For D₁₁, by Condition (vii), the eigenvalues of the pseudo-Fisher Information matrix $I (β_{k}^{0})$ are bounded away from both zero and infinity. Therefore we have

| | \frac{\partial}{\partial β} L_{k} (β_{k}^{0}) | | = O_{p} (\sqrt{\frac{p}{n}}) .

(B.1)

Then

\begin{array}{l} | D_{11} | \leq α_{n} \sum_{k = 1}^{h} | u_{k}^{T} \frac{\partial}{\partial β} L_{k} (β_{k}^{0}) | \leq α_{n} \sum_{k = 1}^{h} \| u_{k} \| \cdot \| \frac{\partial}{\partial β} L_{k} (β_{k}^{0}) \| \\ = O_{p} (α_{n} \sqrt{\frac{p}{n}}) \sum_{k = 1}^{h} \| u_{k} \| = O_{p} (α_{n}^{2}) \sum_{k = 1}^{h} \| u_{k} \| . \end{array}

For D₁₂, we note that

\begin{array}{l} D_{12} = \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} u_{k}^{T} (\frac{1}{n} \sum_{i = 1}^{n} X_{i} X_{i}^{T}) u_{k} \\ = \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} u_{k}^{T} \left\{\frac{1}{n} \sum_{i = 1}^{n} X_{i} X_{i}^{T} - \Sigma\right\} u_{k} + \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} u_{k}^{T} \Sigma u_{k} . \end{array}

As in the proof of Lemma 8 of Fan and Peng (2004), by Chebyshev’s inequality, for any ε > 0 we have

P (\| \frac{1}{n} \sum_{i = 1}^{n} X_{i} X_{i}^{T} - \Sigma \| \geq \frac{ε}{p}) \leq \frac{p^{2}}{n^{2} ε^{2}} E \sum_{j, m = 1}^{p} {(X_{i j} X_{i m} - σ_{j m})}^{2} = O (\frac{p^{4}}{n}) = o (1),

where σ_jm is the (j, m)-element of Σ. Thus we can write

D_{12} = o_{p} (1) \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} {\| u_{k} \|}^{2} + \frac{1}{2} α_{n}^{2} \sum_{k = 1}^{h} u_{k}^{T} \Sigma u_{k} .

We have

\begin{array}{l} ∣ D_{21} ∣ \leq \sum_{j = 1}^{q} a_{n} \sum_{k = 1}^{h} α_{n} ∣ u_{j k} ∣ \leq a_{n} α_{n} \sqrt{q} \sum_{k = 1}^{h} | | u_{k} | |, \\ ∣ D_{22} ∣ \leq \sum_{j = 1}^{q} \frac{1}{2} p_{λ}^{″} (max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣) {(\sum_{k = 1}^{h} α_{n} ∣ u_{j k} ∣)}^{2} (1 + o (1)) \\ \leq \frac{1}{2} {max}_{j = 1}^{q} p_{λ}^{″} (max_{1 \leq k \leq h} ∣ β_{j k}^{0} ∣) α_{n}^{2} h \sum_{k = 1}^{h} {| | u_{k} | |}^{2} . \end{array}

Combining the above results, D₁₂ is asymptotically positive and dominates other terms. Setting $C = | | U | | = {(\sum_{k = 1}^{h} {| | u_{k} | |}^{2})}^{1 / 2}$ large enough, the desired result follows.

Proof of Lemma 4

It suffices to show that with probability tending to one as n → ∞, for any given $\{β_{k +}, k = 1, \dots, h\}$ satisfying $\| β_{k +} - β_{k +}^{0} \| = O_{p} (\sqrt{p / n})$ and any constant C, for j = q + 1, …, p,

\begin{array}{l} \frac{\partial}{\partial β_{j k}^{r}} Q (B) < 0 for 0 < β_{j k} < C \sqrt{\frac{p}{n}} and β_{j k} = {max}_{m = 1}^{h} ∣ β_{j m} ∣, \\ \frac{\partial}{\partial β_{j k}^{l}} Q (B) > 0 for - C \sqrt{\frac{p}{n}} < β_{j k} < 0 and β_{j k} = - {max}_{m = 1}^{h} ∣ β_{j m} ∣, \end{array}

where $\partial / \partial β_{j k}^{l}$ and $\partial / \partial β_{j k}^{r}$ denote the left and right hand partial derivative, respectively.

By a Taylor expansion,

\frac{\partial}{\partial β_{j k}} L_{k} (β_{k}) = \frac{\partial}{\partial β_{j k}} L_{k} (β_{k}^{0}) + \sum_{l = 1}^{p} \frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0}) (β_{l k} - β_{l k}^{0}) \equiv E_{1} + E_{2} .

Due to (B.1), we have $E_{1} = O_{p} (\sqrt{p / n})$ . Next we decompose E₂ as

E_{2} = \sum_{l = 1}^{p} \left[\frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0}) - E \left\{\frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0})\right\}\right] (β_{l k} - β_{l k}^{0}) + \sum_{l = 1}^{p} E \left\{\frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0})\right\} (β_{l k} - β_{l k}^{0}) \equiv E_{21} + E_{22} .

For E₂₁, we have

\begin{array}{l} E_{21} \leq \| β_{k} - β_{k}^{0} \| {\left(\sum_{l = 1}^{p} {\left[\frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0}) - E \left\{\frac{\partial^{2}}{\partial β_{j k} \partial β_{l k}} L_{k} (β_{k}^{0})\right\}\right]}^{2}\right)}^{1 / 2} \\ = \| β_{k} - β_{k}^{0} \| {\left(\sum_{l = 1}^{p} {\left[\frac{1}{n} \sum_{i = 1}^{n} X_{i j} X_{i l} - E (X_{j} X_{l})\right]}^{2}\right)}^{1 / 2} \\ = O_{p} (\sqrt{\frac{p}{n}}) O_{p} (\sqrt{\frac{p}{n}}) = O_{p} (\frac{p}{n}), \end{array}

where the second to last equality comes from the moment Condition (iv) and the fact that all the eigenvalues of Σ are bounded away from both 0 and ∞ by Condition (v).

For E₂₂, by Cauchy-Schwarz inequality and $| | β_{k} - β_{k}^{0} | | = O_{p} (\sqrt{p / n})$ ,

E_{22} \leq \left| \sum_{l = 1}^{p} E (X_{i j} X_{i l}) (β_{l k} - β_{l k}^{0}) \right| = O_{p} (\sqrt{\frac{p}{n}}) {(\sum_{l = 1}^{p} σ_{j l}^{2})}^{1 / 2} = O_{p} (\sqrt{\frac{p}{n}}),

where σ_jl is the (j, l)-element of Σ, and the last equality comes from the fact that $\sum_{l = 1}^{p} σ_{j l}^{2} = O (1)$, which is ensured by Conditions (iv) and (v).

Combining the above two results, we have $\frac{\partial}{\partial β_{j k}} L_{k} (β_{k}) = O_{p} (\sqrt{p / n})$ .

Finally, note that $\sqrt{p / n} / λ_{n} \to 0$ and $lim {inf}_{n \to \infty} {inf}_{θ \to 0^{+}} p_{λ_{n}}^{'} (θ) / λ_{n} > 0$ . When $∣ β_{j k} ∣ = {max}_{m = 1}^{h} ∣ β_{j m} ∣$ , we have

\begin{array}{l} \frac{\partial Q (B)}{\partial β_{j k}^{r}} = λ_{n} \left\{\frac{p_{λ_{n}}^{'} (\max_{1 \leq m \leq h} | β_{j m} |)}{λ_{n}} + O_{p} (\frac{\sqrt{p / n}}{λ_{n}})\right\} \text{ if } β_{j k} > 0, \\ \frac{\partial Q (B)}{\partial β_{j k}^{l}} = λ_{n} \left\{- \frac{p_{λ_{n}}^{'} (\max_{1 \leq m \leq h} | β_{j m} |)}{λ_{n}} + O_{p} (\frac{\sqrt{p / n}}{λ_{n}})\right\} \text{ if } β_{j k} < 0. \end{array}

In both cases, the first term dominates the second. Thus the result of Lemma 4 follows.

Contributor Information

Yichao Wu, Email: wu@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

Lexin Li, Email: li@stat.ncsu.edu, Department of Statistics, North Carolina State University, Raleigh, NC 27695, USA.

References

An LTH, Tao PD. Solving a class of linearly constrained indefinite quadratic problems by D.C. algorithms. J Global Optimization. 1997;11:253–285.
Bondell HD, Li L. Shrinkage inverse regression estimation for model free variable selection. J Roy Statist Soc Ser B. 2009;71:287–299.
Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
Cook RD. Graphics for regressions with a binary response. J Amer Statist Assoc. 1996;91:983–992.
Cook RD. Regression Graphics: Ideas for Studying Regressions Through Graphics. Wiley; New York: 1998.
Cook RD. Testing predictor contributions in sufficient dimension reduction. Ann Statist. 2004;32:1062–1092.
Cook RD, Ni L. Using intra-slice covariances for improved estimation of the central subspace in regression. Biometrika. 2006;93:65–74.
Cook RD, Weisberg S. Discussion of “Sliced inverse regression for dimension reduction,” by K.C. Li. J Amer Statist Assoc. 1991;86:328–332.
Cook RD, Yin X. Dimension-reduction and visualization in discriminant analysis. Austral N Z J Statist. 2001;43:147–200.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J Amer Statist Assoc. 2001;96:1348–1360.
Fan J, Li R. Variable selection for Cox’s proportional hazards model and frailty model. Ann Statist. 2002;30:74–99.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J Roy Statist Soc Ser B. 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
Fan J, Peng H. On non-concave penalized likelihood with diverging number of parameters. Ann Statist. 2004;32:928–961.
Fung WK, He X, Liu L, Shi PD. Dimension reduction based on canonical correlation. Statist Sinica. 2002;12:1093–1114.
Golub G, van Loan C. Matrix Computations. 3rd ed. The Johns Hopkins University Press; 1996.
Hall P, Li KC. On almost linearity of low dimensional projection from high dimensional data. Ann Statist. 1993;21:867–889.
Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377.
Hsing T, Carroll RJ. An asymptotic theory for sliced inverse regression. Ann Statist. 1992;20:1040–1061.
Hunter DR, Li R. Variable selection using MM algorithms. Ann Statist. 2005;33:1617–1642. doi: 10.1214/009053605000000200.
Knight K, Fu WJ. Asymptotics for Lasso-type estimators. Ann Statist. 2000;28:1356–1378.
Li B, Zha H, Chiaromonte F. Contour regression: a general approach to dimension reduction. Ann Statist. 2005;33:1580–1616.
Li B, Wang S. On directional regression for dimension reduction. J Amer Statist Assoc. 2007;102:997–1008.
Li KC. Sliced inverse regression for dimension reduction (with discussion). J Amer Statist Assoc. 1991;86:316–327.
Li R, Liang H. Variable selection in semiparametric regression modeling. Ann Statist. 2008;36:261–286. doi: 10.1214/009053607000000604.
Ni L, Cook RD, Tsai CL. A note on shrinkage sliced inverse regression. Biometrika. 2005;92:242–247.
Ni L, Wang H, Tsai CL, Zhou J. Model free variable selection via adaptive lasso. Proceedings of the Joint Statistical Meetings. 2008.
Ortega JM, Rheinboldt WC. Iterative Solution of Nonlinear Equations in Several Variables. Wiley; New York: 1970.
Portnoy S. Asymptotic behavior of M-estimators of p regression parameters when p²/n is large. I. Consistency. Ann Statist. 1984;12:1298–1309.
Tibshirani R. Regression shrinkage and selection via the LASSO. J Roy Statist Soc Ser B. 1996;58:267–288.
Wang H, Li R, Tsai CL. Tuning parameter selectors for the smoothly clipped absolute deviation method. Biometrika. 2007;94:553–568. doi: 10.1093/biomet/asm053.
Wang H, Xia Y. Sliced regression for dimension reduction. J Amer Statist Assoc. 2008;103:811–821.
Wu Y, Liu Y. Variable selection in quantile regression. Statist Sinica. 2009;19:801–817.
Xia Y. A constructive approach to the estimation of dimension reduction directions. Ann Statist. 2007;35:2654–2690.
Yin X, Cook RD. Dimension reduction for the conditional k-th moment in regression. J Roy Statist Soc Ser B. 2002;64:159–176.
Yin X, Cook RD. Estimating central subspaces via inverse third moments. Biometrika. 2003;90:113–125.
Yin X, Li B, Cook RD. Successive direction extraction for estimating the central subspace in a multiple-index regression. J Multivariate Anal. 2008;99:1733–1757.
Zhong W, Zeng P, Ma P, Liu JS, Zhu Y. RSIR: Regularized Sliced Inverse Regression for Motif Discovery. Bioinformatics. 2005;21:4169–4175. doi: 10.1093/bioinformatics/bti680.
Zhou J, He X. Dimension reduction based on constrained canonical correlation and variable filtering. Ann Statist. 2008;36:1649–1668.
Zhu LP, Zhu LX. On distribution-weighted partial least squares with diverging number of highly correlated predictors. J Roy Statist Soc Ser B. 2009a;71:525–548.
Zhu LP, Zhu LX. Nonconcave penalized inverse regression in single-index models with high dimensional predictors. J Multivariate Anal. 2009b;100:862–875.
Zhu LX, Fang KT. Asymptotics for kernel estimate of sliced inverse regression. Ann Statist. 1996;24:1053–1068.
Zhu LX, Miao B, Peng H. On sliced inverse regression with high dimensional covariates. J Amer Statist Assoc. 2006;101:630–643.
Zhu LX, Ng KW. Asymptotics of sliced inverse regression. Statist Sinica. 1995;5:727–736.
Zou H. The adaptive Lasso and its oracle properties. J Amer Statist Assoc. 2006;101:1418–1429.
Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models (with discussion). Ann Statist. 2008;36:1509–1533. doi: 10.1214/009053607000000802.

