Author manuscript; available in PMC 2012 Jun 1. Published in final edited form as: Biometrics. 2010 Sep 28;67(2):513–523. doi: 10.1111/j.1541-0420.2010.01490.x

Sufficient Dimension Reduction for Censored Regressions

Wenbin Lu 1,*, Lexin Li 1,**
PMCID: PMC3018713  NIHMSID: NIHMS234132  PMID: 20880013

SUMMARY

Methodology of sufficient dimension reduction (SDR) has offered an effective means to facilitate regression analysis of high-dimensional data. When the response is censored, however, most existing SDR estimators cannot be applied, or require restrictive conditions. In this article we propose a new class of inverse censoring probability weighted SDR estimators for censored regressions. Moreover, regularization is introduced to achieve simultaneous variable selection and dimension reduction. Asymptotic properties and empirical performance of the proposed methods are examined.

Keywords: Censored data, Central subspace, Inverse censoring probability weighted estimation, Sliced inverse regression, Sufficient dimension reduction

1. Introduction

Regression analysis of high-dimensional data has recently prevailed in many scientific fields, while the high dimensionality often poses challenges to classical regression techniques. In many applications such as biomedical and economic studies, the analysis is further complicated by possible censoring in the response variable. For instance, in survival data analysis, patients’ survival times to death or disease recurrence are often right censored due to the termination of the follow-up study or patients’ drop-out; in employment studies, the times spent in the unemployed state are possibly censored by the cutoff nature of the sampling. Common methods for handling censored responses include Cox’s proportional hazards model (Cox, 1972), the proportional odds model (Bennett, 1983), and the accelerated failure time model (Cox and Oakes, 1984), among many others. Most of those methods require the specification of a model. However, knowledge for selecting an appropriate model is often inadequate prior to the analysis, and model specification and diagnosis become more elusive when the predictor dimension is high. Sufficient dimension reduction requires no model specification, retains full regression information, and provides a usually small set of composite variables upon which subsequent model formulation and prediction can be based. For these reasons SDR offers an appealing avenue to the analysis of censored data with high-dimensional predictors.

Let Y denote a univariate response variable and X a p × 1 predictor vector. SDR seeks a minimal subspace S of ℝp such that Y ⫫ X | PS X, where ⫫ indicates independence and PS denotes the orthogonal projection onto S. Under minor conditions, such a subspace, denoted by SY|X, uniquely exists and is called the central subspace of the regression of Y on X. SY|X is the primary parameter of interest in our dimension reduction inquiry, because it carries all regression information of Y|X. Many SDR methods have been proposed to estimate SY|X without imposing parametric assumptions on the conditional distribution Y|X. Among them, sliced inverse regression (SIR) is perhaps the most widely used. It amounts to a spectral decomposition of Cov{E(X|Y)} with respect to Σ = Cov(X). The eigenvectors corresponding to the largest d eigenvalues span an estimate of SY|X under appropriate conditions. Operationally, SIR partitions Y into a fixed number of non-overlapping intervals, each of which is called a slice, and then takes the average of X within each slice to form an estimate of E(X|Y). A minimal numerical sketch of this recipe is given below.
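To make the slicing recipe concrete, the following Python sketch is our own illustration (not the authors' code) of basic SIR: slice Y, average the centered X within each slice, and take the leading eigenvectors of the slice-mean covariance with respect to the estimated Σ. The function name, the slice count, and the use of equal-sized slices are our own choices.

```python
import numpy as np

def sir_basis(X, Y, n_slices=10, d=1):
    """Basic sliced inverse regression: leading eigenvectors of
    Sigma^{-1} Cov{E(X|Y)}, with E(X|Y) estimated by slice means."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                            # center the predictors
    Sigma = np.cov(Xc, rowvar=False)                   # sample estimate of Cov(X)
    slices = np.array_split(np.argsort(Y), n_slices)   # equal-sized slices of Y
    M = np.zeros((p, p))
    for idx in slices:
        m = Xc[idx].mean(axis=0)                       # slice mean of X, estimates E(X | Y in slice)
        M += (len(idx) / n) * np.outer(m, m)           # weighted estimate of Cov{E(X|Y)}
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma, M))
    order = np.argsort(evals.real)[::-1]
    return evecs.real[:, order[:d]]                    # estimated basis of the central subspace
```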

Our discussion hereinafter will assume the context of survival data analysis, but the methodology applies to other censored regressions as well. Let Y denote the survival time, C the corresponding censoring time, Ỹ = min(Y, C) the observed survival time, and δ = I(Y ≤ C) the censoring indicator. For identifiability, we assume Y ⫫ C | X. When the response Y is censored, SIR and other SDR estimators cannot be applied directly to estimate the central subspace SY|X. It is noted, though, that the bivariate response (Ỹ, δ) is observable and is a function of (Y, C), and as such S(Ỹ,δ)|X ⊆ S(Y,C)|X. Consequently, one may gain useful information about SY|X through S(Y,C)|X, which is in turn estimated through S(Ỹ,δ)|X from the regression of (Ỹ, δ) on X. Operationally this amounts to slicing the bivariate response (Ỹ, δ), i.e., partitioning Ỹ within each subsample where δ = 0 and δ = 1, respectively. The remaining steps are the same as in a usual SIR. This procedure is called double slicing in Li et al. (1999). Mainly due to its simplicity, double slicing SIR has received wide application; see Li and Li (2004) for the analysis of microarray gene expression data with censored phenotypes, and Li et al. (2007) for the analysis of Tobit models in economics data. Although double slicing SIR is simple to use, the parameter of interest is SY|X instead of S(Y,C)|X. Actually, S(Y,C)|X is always a “larger” subspace than SY|X in the sense that SY|X ⊆ S(Y,C)|X ⊆ SY|X ⊕ SC|X, where ⊕ denotes the direct sum of two subspaces. Li et al. (1999) examined a special case where SY|X = S(Y,C)|X by requiring that (Y, X) ⫫ C. However this seems too strong a restriction. Cook (2002) relaxed this condition by requiring that (Y, C) ⫫ X | ΓᵀX, where Γ denotes a basis of SY|X. But given data this condition is very difficult to verify. Alternatively, Li et al. (1999) acknowledged that censoring can introduce substantial bias to the usual SIR estimator of the central subspace SY|X of interest, and proposed a modification that utilizes information on the censoring mechanism to adjust for the censoring bias. The bias correction term, however, relies on a nonparametric estimate of the conditional survival function of Y given X, which itself is computationally complicated. Recently, Xia et al. (2010) developed a dimension reduction method for censored survival data based on kernel estimation of the conditional hazard function, which may suffer from the curse of dimensionality when the number of predictors is large. In short, there seems to be a lack of an effective dimension reduction estimator of SY|X given censored Y.

The main difficulty in applying the usual SDR estimators to a censored response stems from the fact that most SDR methods rely on the inverse regression of X given Y. Take SIR as an example: it involves the first inverse moment E(X|Y), and when Y is censored, it is difficult to slice Y and thus difficult to obtain an unbiased estimate of E(X|Y). To overcome this difficulty, we first investigate a re-formulation of SIR and recognize that one can estimate SY|X through least squares regression of a set of transformation functions of Y on X. We then employ the inverse censoring probability weighted (ICPW) estimation method to handle censored responses. We further develop a variable selection strategy through regularized sparse estimation. Our proposed method contributes in at least three ways. First, it provides a useful addition to sufficient dimension reduction methodology and extends SDR to a large number of applications where the response is subject to censoring. An effective estimator is developed that targets the primary parameter of interest SY|X directly. It is no longer necessary to gain knowledge of SY|X through a larger subspace S(Y,C)|X, and as such there is no need for conditions that force the equivalence between SY|X and S(Y,C)|X. Secondly, thanks to the flexible structure of the central subspace, the proposed method permits a variety of regression forms, for instance, the single-index model and the heteroscedastic model. Thirdly, the proposed solution achieves variable selection with censored responses, but requires no model pre-specification, and thus distinguishes itself from most existing variable selection methods for censored regressions.

2. Sufficient Dimension Reduction with Censored Responses

2.1 Estimation of the central subspace through least squares

Without loss of generality, we assume E(X) = 0 throughout this article. Define ϕ(y) ≡ Σ−1E(X|Y = y) for any y in the support of Y. Li (1991) effectively showed that

ϕ(y) ∈ SY|X, (1)

under the linearity condition that E(X|ΓᵀX = x) is a linear function of x, where Γ denotes a basis of SY|X. Equation (1) connects the inverse mean vector with the central subspace, and is the foundation of SIR and many elaborated variants of SIR. The linearity condition is imposed by SIR and most SDR estimation methods. It is interesting to note that it involves the distribution of X only, rather than that of Y|X. The condition is satisfied when the distribution of X is elliptically symmetric (Eaton, 1986), e.g., multivariate normal. In practice, some caution may be needed if some predictors are discrete, since elliptical symmetry is then not satisfied. But in general, this condition is often viewed as only a mild restriction, since it may be induced by coordinate-wise power transformations, and it holds to a reasonable approximation as p increases (Hall and Li, 1993). See also Cook and Ni (2006) for a further discussion of the linearity condition.

Proceeding from (1), we next discuss a re-formulation of SIR that is equivalent to its original form. SIR starts with constructing a p × h matrix 𝔹 = (β1, …, βh), where βk = Σ−1E{XJk(Y)}, Jk(Y) = 1 if Y is in slice k and 0 otherwise, k = 1, …, h. Here the range of Y is partitioned into h non-overlapping intervals. Given (1), Span(𝔹) ⊆ SY|X. SIR often goes one step further by assuming the coverage condition, which strengthens the containment Span(𝔹) ⊆ SY|X to the equality Span(𝔹) = SY|X and often holds in practice; see also Cook and Ni (2006) for a discussion of the coverage condition. Then the eigenvectors of the matrix 𝔹𝔹ᵀ form an estimate of a basis of SY|X. There are a number of variants of SIR that can all be formulated in a similar way, except that a different βk is employed. For instance, Yin and Cook (2002) suggested using the polynomial transformation βk = Σ−1E{XYk} up to power h; Cook and Ni (2006) considered βk = Σ−1E{XJk(Y)Y}. Collectively, letting bk(Y) denote a generic transformation of Y, SIR and its variants are all based upon components of the form

\beta_k = \Sigma^{-1} E\{X\, b_k(Y)\}, \quad k = 1, \ldots, h. \tag{2}

We make some key observations that will lead to our new estimator of the central subspace with censored responses. First, βk is exactly the population least squares slope of regressing bk(Y) on X. Secondly, βk involves no inverse regression, which makes it possible to adopt existing survival regression techniques, e.g., inverse censoring probability weighted estimation, to estimate βk and subsequently SY|X. We also note that, although a least squares estimator is used here, this does not imply that the linear regression model is true or even provides an adequate fit to the data. A small sketch of this least squares view is given below.
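For illustration, here is a minimal Python sketch (our own, with hypothetical function names) of this least squares view: with centered predictors and E(X) = 0, the ordinary least squares slope of bk(Y) on X is the sample analogue of Σ−1E{X bk(Y)}.

```python
import numpy as np

def beta_k_ols(X, Y, b_k):
    """OLS slope of b_k(Y) on centered X: the sample analogue of Sigma^{-1} E{X b_k(Y)}."""
    Xc = X - X.mean(axis=0)
    beta, *_ = np.linalg.lstsq(Xc, b_k(Y), rcond=None)   # solves min_beta ||b_k(Y) - Xc beta||^2
    return beta

# example: stack the slopes for the polynomial transformations b_k(y) = y^k of Yin and Cook (2002)
def beta_matrix(X, Y, h):
    return np.column_stack([beta_k_ols(X, Y, lambda y, k=k: y ** k) for k in range(1, h + 1)])
```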

Next we develop a new class of SDR estimators for censored regressions. Although the proposed estimation strategy applies to all transformation functions bk(Y), our empirical experiences have suggested that the polynomial function of Yin and Cook (2002) enjoys the best numerical performance, and thus it will be our choice of transformation in the subsequent development.

2.2 ICPW estimation of the central subspace

Define GX(t) ≡ P(C ≥ t|X) as the survival function of the censoring time given the predictors. We have

E\left\{\frac{\delta}{G_X(\tilde Y)}\, b_k(\tilde Y) \,\Big|\, X\right\}
= E\left[E\left\{\frac{\delta}{G_X(\tilde Y)}\, b_k(\tilde Y) \,\Big|\, Y, X\right\} \Big|\, X\right]
= E\left[E\left\{\frac{I(Y \le C)}{G_X(Y)}\, b_k(Y) \,\Big|\, Y, X\right\} \Big|\, X\right]
= E\{b_k(Y) \,|\, X\}.

Then βk = Σ−1E{X bk(Ỹ)δ/GX(Ỹ)}, k = 1, …, h. In light of this observation, one can obtain an unbiased estimate of βk in (2) by introducing the inverse of GX(Ỹ) as a weight for the uncensored observations. This is the basic idea of the ICPW estimation method, which has been widely used in the survival analysis literature (Koul et al., 1981; Fan and Gijbels, 1994; Cheng et al., 1995, 1997; Fine et al., 1998; Kong et al., 2004; among others). It motivates us to consider the following ICPW estimator of the central subspace for censored responses.
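As a quick numerical sanity check of the identity above, a toy simulation sketch (our own assumptions: exponential survival and censoring times, censoring independent of X so that G is known in closed form) shows that the ICPW-weighted average of bk(Ỹ) over the uncensored observations matches the average of bk(Y) computed from the latent, uncensored responses:

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 200000, 0.5
Y = rng.exponential(1.0, size=n)            # toy survival times, Exp(1)
C = rng.exponential(1.0 / lam, size=n)      # toy censoring times, independent of everything
Ytil, delta = np.minimum(Y, C), (Y <= C).astype(float)
G = lambda t: np.exp(-lam * t)              # true censoring survival function P(C >= t)

b = lambda y: y ** 2                        # one polynomial transformation b_k
lhs = np.mean(delta * b(Ytil) / G(Ytil))    # ICPW-weighted average using only observed data
rhs = np.mean(b(Y))                         # average using the (unobservable) true Y
print(lhs, rhs)                             # the two agree up to Monte Carlo error
```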

Let {(Ỹi, δi, Xi), i = 1, ⋯, n} denote n i.i.d. observations of the triplet (Ỹ, δ, X), and let ĜX(t) denote an estimate of GX(t). We propose to estimate βk in (2) by solving the following weighted least squares estimating equation,

U_k(\beta) \equiv \sum_{i=1}^{n} \frac{\delta_i}{\hat G_{X_i}(\tilde Y_i)}\, X_i \{b_k(\tilde Y_i) - \beta^\top X_i\} = 0, \quad k = 1, \ldots, h. \tag{3}

There is a closed-form solution to (3), i.e.,

\hat\beta_k = \left\{\sum_{i=1}^{n} \frac{\delta_i}{\hat G_{X_i}(\tilde Y_i)}\, X_i X_i^\top\right\}^{-1} \sum_{i=1}^{n} \frac{\delta_i}{\hat G_{X_i}(\tilde Y_i)}\, X_i b_k(\tilde Y_i), \quad k = 1, \ldots, h.

We then form the p × h matrix 𝔹̂ = (β̂1, …, β̂h), and perform a spectral decomposition of 𝔹̂𝔹̂ᵀ. Let {γ̂1, …, γ̂p} denote the resulting eigenvectors, ordered by their eigenvalues in descending order. We call Span(γ̂1, …, γ̂d) an ICPW estimator of the central subspace SY|X. Here we first assume the structural dimension d = dim(SY|X) is known, and will develop a consistent estimator of d later.
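Putting the pieces together, the following Python sketch is our own illustration of the ICPW estimator; the function name, the polynomial transformations bk(y) = y^k, and the Ghat interface are assumptions, and Ghat must be supplied by one of the strategies discussed next.

```python
import numpy as np

def icpw_central_subspace(X, Ytil, delta, Ghat, d, h=3):
    """Sketch of the ICPW estimator: weighted least squares slopes for b_k(y) = y^k,
    k = 1, ..., h, followed by a spectral decomposition of B_hat B_hat^T.
    Ghat(t, x) should return an estimate of P(C >= t | X = x)."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    w = delta / np.array([Ghat(t, x) for t, x in zip(Ytil, X)])      # ICPW weights delta_i / G_hat(Ytil_i)
    A = (Xc * w[:, None]).T @ Xc                                     # sum_i w_i x_i x_i^T
    B = np.column_stack([np.linalg.solve(A, Xc.T @ (w * Ytil ** k))  # closed-form beta_hat_k
                         for k in range(1, h + 1)])
    evals, evecs = np.linalg.eigh(B @ B.T)                           # symmetric p x p matrix
    return evecs[:, np.argsort(evals)[::-1][:d]]                     # leading d eigenvectors span the estimate
```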

Next we discuss a number of ways to estimate the survival function GX(·) of the censoring time under different scenarios, and discuss the strengths and limitations of each. First, if the censoring times are i.i.d. and independent of the covariates, a consistent estimator of GX(·) can be obtained by the Kaplan-Meier method. It requires no model assumption, but the condition that C ⫫ X can be restrictive. Secondly, if C does depend on X, one can posit a semiparametric model, for instance, a proportional hazards model, and then estimate GX(·) based on the fitted model. This approach is very simple to use, and as long as the model is correctly specified, the resulting estimator of the survival function is consistent. On the other hand, it loses some of the nonparametric flavor of the SDR estimation since it requires a model specification for the censoring time. In our implementation, we adopt the semiparametric approach by positing a proportional hazards model, due to its simplicity as well as the high-dimensional nature of the problems we target. To study its performance under possible model misspecification, we conduct an intensive sensitivity analysis in Section 3.2, as is often done in the ICPW literature. Our experience suggests that it works reasonably well for a moderate to large sample size even when the model for the censoring distribution is misspecified. We also comment that, operationally, one may estimate GX(·) nonparametrically using, e.g., the kernel-weighted local Kaplan-Meier estimator (Beran, 1981), although it does not satisfy the conditions (C3) and (C4) given below. This nonparametric solution does not require any model and is thus robust to model misspecification. However, it is subject to the curse of dimensionality and works best only when the number of predictors is relatively small. We will briefly examine the numerical performance of this nonparametric estimation strategy in Section 3.
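For the covariate-independent case, the censoring survival function can be estimated by applying the Kaplan-Meier product-limit formula with the roles of event and censoring swapped. Below is a minimal numpy sketch of our own (the function name is hypothetical and ties are handled naively).

```python
import numpy as np

def censoring_km(Ytil, delta):
    """Kaplan-Meier estimate of the censoring survival function G(t) = P(C >= t),
    treating the censoring indicator (delta = 0) as the 'event'.
    Ties are handled naively: observations are processed one at a time."""
    order = np.argsort(Ytil)
    t, e = Ytil[order], 1.0 - delta[order]        # e_i = 1 if observation i is a censoring event
    n = len(t)
    at_risk = n - np.arange(n)                    # number at risk just before each ordered time
    surv = np.cumprod(1.0 - e / at_risk)          # product-limit estimate of P(C > t_(i))
    def G(s):
        # P(C >= s): multiply factors for ordered times strictly less than s
        idx = np.searchsorted(t, s, side="left")
        return 1.0 if idx == 0 else max(surv[idx - 1], 1e-10)   # guard against zero weights
    return G
```

When C ⫫ X is plausible, the resulting step function can be plugged into the ICPW sketch above via Ghat = lambda t, x: G(t).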

To establish the asymptotic properties of our proposed ICPW estimator, we assume the following regularity conditions and the properties of the estimator ĜX(·):

  • (C1)

    The vector of covariates X has a convex support;

  • (C2)

    There exists a constant τ > 0 such that P(C = τ) > 0 and P(C > τ) = 0;

  • (C3)

    supt∈[0,τ] |ĜX(t) − GX(t)| = op(1) for all possible values of X, where τ is a pre-specified positive constant;

  • (C4)
    For 0 ≤ t ≤ τ,
\sqrt{n}\,\{\hat G_X(t) - G_X(t)\} = \frac{1}{\sqrt{n}} \sum_{j=1}^{n} \left\{\psi_1(t, X) \int_0^t \psi_2(s)\, dM_j(s) + \psi_3(t, X)\, \psi_{4,j}\right\} + o_p(1),
    where Mj(t)’s are independent mean-zero martingale processes, ψ4,j’s are i.i.d. random variables with mean zero and finite variance, and the functions ψ1, ψ2 and ψ3 are bounded and satisfy the Lipschitz conditions.

The regularity condition (C2) is the same as condition (C1) in Peng and Fine (2009); it is used to simplify the theoretical arguments and is satisfied in many clinical studies with administrative censoring. In practice, τ can be chosen as the maximum follow-up time. To improve the ICPW estimator, especially in the heavy censoring case, Fine et al. (1998) proposed an effective modification that introduces an artificial censoring time L to truncate the data at the right tail, i.e., C* = min(C, L). Then condition (C2) is automatically satisfied with the truncated censoring time C*. The same strategy can be adopted in our context as well. Conditions (C3) and (C4) are assumed to ensure the uniform consistency and the asymptotic representation of ĜX(·), which in turn lead to the consistency and asymptotic normality of the ICPW estimator of SY|X. If a proportional hazards model is correctly specified for C, the uniform consistency and asymptotic representation of the estimator can be easily derived (Andersen and Gill, 1982). For other correctly specified semiparametric models, similar results can be obtained. We next state the main asymptotic results.

Theorem 1: Under conditions (C1)–(C4), we have

  1. √n{vec(𝔹̂) − vec(𝔹)} converges in distribution to a normal random vector with mean zero and a covariance matrix Λ ∈ ℝph×ph, as n → ∞.

  2. Furthermore, if the linearity and coverage conditions hold, then Span(γ̂1, …, γ̂d) is a √n-consistent estimator of the central subspace SY|X.

Here vec(·) denotes the matrix operator that stacks the columns of a matrix into a vector. The proof of Theorem 1 is given in the Web Appendix. Basically, Theorem 1 first establishes the √n-consistency of the ICPW estimator 𝔹̂ of 𝔹. With the additional linearity and coverage conditions, Span(𝔹) is further connected with the central subspace of interest. As such, Span(γ̂1, …, γ̂d) provides a √n-consistent estimator of SY|X.

To estimate the structural dimension d = dim(SY|X), we employ the criterion proposed by Zhu et al. (2006), which estimates d via the number of nonzero eigenvalues of the matrix 𝔹̂𝔹̂ᵀ. More specifically, let ν̂1, …, ν̂p denote the eigenvalues of 𝔹̂𝔹̂ᵀ + Ip, and let κ denote the number of ν̂i’s that are greater than one. Then the estimate d̂ of d is taken as the maximizer of the criterion,

\frac{n}{2} \sum_{k=\min(\kappa, m)+1}^{p} \left\{\log(\hat\nu_k) + 1 - \hat\nu_k\right\} - \frac{C_n\, m\,(2p - m + 1)}{2}, \tag{4}

for m ∈ {0, 1, …, p−1}. In (4), Cn is a penalty constant, and Zhu et al. (2006) recommended taking Cn of the order na. We choose a = 0.1 in our implementation, and our experience suggests that it estimates d very well. Based on Theorem 2 of Zhu et al. (2006) and the √n-consistency of 𝔹̂ established in Theorem 1, we have the consistency of d̂.
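A minimal Python sketch of this dimension selection step follows; it is our own illustration of criterion (4), and the function name and numerical tolerance are assumptions.

```python
import numpy as np

def estimate_dimension(B_hat, n, a=0.1):
    """Sketch of the dimension criterion (4), applied to B_hat B_hat^T."""
    p = B_hat.shape[0]
    nu = np.sort(np.linalg.eigvalsh(B_hat @ B_hat.T + np.eye(p)))[::-1]   # eigenvalues of BB^T + I, descending
    kappa = int(np.sum(nu > 1.0 + 1e-10))        # number of eigenvalues greater than one
    Cn = n ** a                                  # penalty constant of order n^a (a = 0.1 here)
    def crit(m):
        tail = nu[min(kappa, m):]                # terms k = min(kappa, m) + 1, ..., p
        return 0.5 * n * np.sum(np.log(tail) + 1.0 - tail) - Cn * m * (2 * p - m + 1) / 2.0
    return max(range(p), key=crit)               # d_hat maximizes the criterion over m = 0, ..., p-1
```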

Corollary 1: Assume that limn→∞ Cn/n = 0 and limn→∞ Cn = ∞; then d̂ → d almost surely as n → ∞.

2.3 Sparse estimation of the central subspace

Variable selection is another important aspect of dimension reduction in survival data analysis, since it usually leads to better health risk assessment and easier model interpretation. In this section, we develop a regularized sparse estimation of the central subspace for simultaneous variable selection and estimation of the dimension reduction basis with censored responses. The idea is similar to Bondell and Li (2009) but with a different aim: here we focus on variable selection with censored responses.

Given the ICPW estimator, β̂k, k = 1, …, h, we propose to minimize the following objective function over a p × 1 shrinkage vector α = (α1, …, αp),

\frac{1}{nh} \sum_{k=1}^{h} \sum_{i=1}^{n} \frac{\delta_i}{\hat G_{X_i}(\tilde Y_i)} \left[b_k(\tilde Y_i) - \{\mathrm{diag}(\alpha)\hat\beta_k\}^\top X_i\right]^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\alpha_j| \le \tau, \tag{5}

where τ ≥ 0 is a regularization parameter. Letting α̂ denote the resulting minimizer of (5), we then construct the p × h matrix 𝔹̃ = diag(α̂)𝔹̂, and call Span(𝔹̃) the sparse ICPW estimator of the central subspace SY|X.

We first note that the shrinkage estimation given in (5) is closely related to the nonnegative garrote estimation (Breiman, 1995) and the adaptive-Lasso method (Zou, 2006) in the multivariate response setup. When τ ≥ p, α̂j = 1 for all j = 1, …, p, and one gets back a usual ICPW solution. As τ gradually decreases, however, some α̂j’s are shrunk to exact zero, which indicates exclusion of the corresponding predictors Xj’s. As such variable selection is achieved. Secondly, we note that the structural dimension d of the central subspace is not required if variable selection is the only purpose for (5). This dimension information is needed only when one aims to obtain a simultaneous basis estimation of SY|X.

We next turn to numerical optimization of (5). Rewrite (5) as,

\frac{1}{nh} \|U - V\alpha\|^2, \quad \text{subject to} \quad \sum_{j=1}^{p} |\alpha_j| \le \tau,

where

U = \begin{pmatrix} w_1^{1/2} b_1(\tilde Y_1) \\ \vdots \\ w_n^{1/2} b_1(\tilde Y_n) \\ \vdots \\ w_1^{1/2} b_h(\tilde Y_1) \\ \vdots \\ w_n^{1/2} b_h(\tilde Y_n) \end{pmatrix}_{nh \times 1}, \quad \text{and} \quad V = \begin{pmatrix} w_1^{1/2}\hat\beta_{11}X_{11} & \cdots & w_1^{1/2}\hat\beta_{p1}X_{1p} \\ \vdots & & \vdots \\ w_n^{1/2}\hat\beta_{11}X_{n1} & \cdots & w_n^{1/2}\hat\beta_{p1}X_{np} \\ \vdots & & \vdots \\ w_1^{1/2}\hat\beta_{1h}X_{11} & \cdots & w_1^{1/2}\hat\beta_{ph}X_{1p} \\ \vdots & & \vdots \\ w_n^{1/2}\hat\beta_{1h}X_{n1} & \cdots & w_n^{1/2}\hat\beta_{ph}X_{np} \end{pmatrix}_{nh \times p},

where wi = δi/ĜXi(Ỹi), i = 1, …, n, β̂jk is the j-th element of β̂k, j = 1, …, p, k = 1, …, h, and Xij is the j-th element of Xi. Then the estimation of α becomes a standard lasso problem and can be carried out by any existing lasso algorithm (Tibshirani, 1996; Fu, 1998; Efron et al., 2004). To choose the regularization parameter τ, we adopt a BIC-type criterion (Wang and Leng, 2007) by minimizing

\frac{\|U - V\hat\alpha(\tau)\|^2}{\hat\sigma^2} + \log(n)\, p_e,

where α̂(τ) denotes the minimizer of (5) given τ, σ̂2 is the usual variance estimator obtained from the least squares fit of U on V, and pe is the effective number of parameters, p_e = 2\sum_{j=1}^{p} I\{\hat\alpha_j(\tau) > 0\} + \sum_{j=1}^{p} \hat\alpha_j(\tau)\,(h - 2), as given in Yuan and Lin (2006, equation 6.5).
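To illustrate the computation, here is a Python sketch of our own for the sparse step: build U and V, solve the L1-penalized problem over a grid of penalties, and select the penalty by the BIC-type criterion. The function name, the polynomial transformations, the penalty grid, and the use of scikit-learn's Lasso solver are all assumptions; note also that scikit-learn scales the penalty slightly differently from (6).

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_icpw(X, Ytil, delta, Ghat, B_hat, lambdas, h=3):
    """Sketch of the sparse ICPW step: build U and V, solve the L1-penalized problem
    over a grid of penalties, and pick the penalty by the BIC-type criterion."""
    n, p = X.shape
    w = delta / np.array([Ghat(t, x) for t, x in zip(Ytil, X)])
    sw = np.sqrt(w)
    # U is nh x 1 and V is nh x p, with entries w_i^{1/2} b_k(Ytil_i) and w_i^{1/2} beta_hat_{jk} X_{ij}
    U = np.concatenate([sw * Ytil ** k for k in range(1, h + 1)])
    V = np.vstack([(sw[:, None] * X) * B_hat[:, k - 1] for k in range(1, h + 1)])
    resid = U - V @ np.linalg.lstsq(V, U, rcond=None)[0]        # unpenalized least squares fit of U on V
    sigma2 = np.sum(resid ** 2) / (len(U) - p)                  # usual variance estimator
    best = None
    for lam in lambdas:
        alpha_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=50000).fit(V, U).coef_
        pe = 2 * np.sum(alpha_hat != 0) + np.sum(np.abs(alpha_hat)) * (h - 2)   # effective parameters
        bic = np.sum((U - V @ alpha_hat) ** 2) / sigma2 + np.log(n) * pe
        if best is None or bic < best[0]:
            best = (bic, alpha_hat)
    return np.diag(best[1]) @ B_hat          # sparse estimate B_tilde = diag(alpha_hat) B_hat
```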

We next continue to examine the asymptotic properties of the proposed sparse estimation. Following Bondell and Li (2009), variable selection in our context seeks the smallest subset of the predictors X𝒜, with the partition X = (X𝒜ᵀ, X𝒜cᵀ)ᵀ, such that Y ⫫ X𝒜c | X𝒜. This subset 𝒜 uniquely exists when the central subspace SY|X exists. As implied by Cook (2004, Proposition 1), 𝒜 is directly connected to 𝔹, which spans SY|X. That is, if we partition 𝔹 into two submatrices 𝔹𝒜 and 𝔹𝒜c with the same number of columns as 𝔹, following the partition of X, then 𝔹𝒜c = 0. Consequently, we can define 𝒜 as 𝒜 = {j : βjk ≠ 0 for some k, 1 ≤ j ≤ p, 1 ≤ k ≤ h}, where βjk is the j-th element of βk. Given the sparse estimate from (5), we partition 𝔹̃ in the same way as 𝔹, yielding 𝔹̃𝒜 and 𝔹̃𝒜c, and denote the sparse estimate of 𝒜 by 𝒜̂ = {j : α̂j ≠ 0}. We further consider the equivalent Lagrangian formulation of (5),

\frac{1}{nh} \sum_{k=1}^{h} \sum_{i=1}^{n} \frac{\delta_i}{\hat G_{X_i}(\tilde Y_i)} \left[b_k(\tilde Y_i) - \{\mathrm{diag}(\alpha)\hat\beta_k\}^\top X_i\right]^2 + \lambda \sum_{j=1}^{p} |\alpha_j|, \tag{6}

where λ is a penalty constant. Then we have the following theorem.

Theorem 2: Suppose that the conditions of Theorem 1 hold and that nλ → ∞ and √n λ → 0. Then, as n → ∞,

  1. Consistency in variable selection: P(𝒜̂ = 𝒜) → 1.

  2. Asymptotic normality: √n{vec(𝔹̃𝒜) − vec(𝔹𝒜)} → N(0, Ω) in distribution, for some Ω > 0.

The proof of Theorem 2 is given in the Web Appendix. Theorem 2 shows that our sparse estimate selects all truly relevant predictors in 𝒜 with probability approaching one. In addition, the basis estimate for those relevant predictors remains √n-consistent.

3. Simulations and Real Data Analysis

3.1 Dimension reduction basis estimation

We first examine the finite sample performance of the ICPW estimator of SY|X when the censoring time distribution is correctly specified. We also compare with the double slicing SIR (DSIR) estimator of Li et al. (1999) that is commonly used in practice.

The true survival time is generated as follows:

Y = \exp\{2.5 + \sin(0.1\pi\, \gamma_1^\top X) + 0.1(\gamma_1^\top X + 2)^2 + 0.25\,\varepsilon\}, \tag{7}
Y = \exp\{2.5 + \gamma_1^\top X + 0.5\, \gamma_1^\top X \times \gamma_2^\top X + 0.25\,\varepsilon\}, \tag{8}

where X is a 10-dimensional multivariate normal vector with mean zero and identity covariance matrix. Different choices of the error ε lead to commonly used survival models: the extreme value distribution ε = log{−log(1 − u)} for the proportional hazards (PH) model, and the logistic distribution ε = log{u/(1 − u)} for the proportional odds (PO) model, where u is uniform(0, 1). For model (7), SY|X = Span(γ1), with γ1 = (1, 1, 1, 0, …, 0)ᵀ, and the structural dimension d = 1; for model (8), SY|X = Span(γ1, γ2), with γ1 = (1, 0, …, 0)ᵀ, γ2 = (0, …, 0, 1, −1)ᵀ, and d = 2. We generate the censoring time C as

C = \exp(c + \gamma_c^\top X + \varepsilon_c), \tag{9}

where c is a constant that controls the censoring proportion, γc = (−1, 0, 0, 1, 0, …, 0)ᵀ, and εc first takes the extreme value distribution, yielding a PH model for C. A sketch of this data-generating mechanism is given below.
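The following Python sketch is our own illustration of this simulation design for model (7) with the PH-type censoring of (9); the intercept c controlling the censoring rate and the function name are placeholders.

```python
import numpy as np

def simulate_model7(n, c=2.0, model="PH", rng=None):
    """Generate (Ytil, delta, X) from model (7) with censoring from model (9).
    The intercept c controlling the censoring rate is a placeholder value."""
    rng = rng or np.random.default_rng()
    p = 10
    X = rng.standard_normal((n, p))
    gamma1 = np.zeros(p); gamma1[:3] = 1.0
    u = rng.uniform(size=n)
    eps = np.log(-np.log(1 - u)) if model == "PH" else np.log(u / (1 - u))
    lin = X @ gamma1
    Y = np.exp(2.5 + np.sin(0.1 * np.pi * lin) + 0.1 * (lin + 2) ** 2 + 0.25 * eps)
    gamma_c = np.zeros(p); gamma_c[0], gamma_c[3] = -1.0, 1.0
    eps_c = np.log(-np.log(1 - rng.uniform(size=n)))      # extreme-value error: PH model for C
    C = np.exp(c + X @ gamma_c + eps_c)
    Ytil, delta = np.minimum(Y, C), (Y <= C).astype(float)
    return Ytil, delta, X
```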

The vector correlation coefficient (Hotelling, 1936) is employed to evaluate the discrepancy between the estimated and the true central subspaces. This measure ranges between 0 and 1, with a larger value indicating better estimation accuracy. We examine both the PH model and the PO model for (7) and (8), each with two censoring proportions, 20% and 40%. Since the results show a very similar qualitative pattern, we only report the PH model of (7) and the PO model of (8) with 40% censoring. Figure 1 shows the median vector correlation out of 100 data replications. The solid line denotes the covariance SDR estimator of Yin and Cook (2002) without any censoring, which serves as a benchmark. The dashed line denotes the ICPW estimator where GX(·) is estimated based on a PH model fit of C given all the covariates X. The estimation accuracy of the ICPW estimator quickly approaches the benchmark as the sample size increases. We also examine two variants of the ICPW estimator, one estimating GX(·) using the usual Kaplan-Meier method ignoring the covariate information (dot-dashed line), and the other using a local Kaplan-Meier method based on the estimates from double slicing SIR (dotted line) to partly overcome the curse of dimensionality. For the local Kaplan-Meier method, we use the normal kernel for each of the derived covariates from DSIR and choose the optimal bandwidth parameter h_x = 4^{1/3} σ_x n^{−1/3} (Jones, 1990), where σx is the sample standard deviation of a derived covariate. From the plot, we see that the ICPW estimator with the censoring time distribution estimated by a PH model uniformly dominates the two variants. However, this is not surprising, since in the current setting the censoring distribution is correctly specified by the fitted PH model. In the subsequent analysis, we will concentrate on using the PH model for GX(·). Finally, we examine the double slicing SIR estimator, denoted by the long-dashed line in the plot; in all cases, our ICPW estimator outperforms DSIR uniformly. In addition, we also compare the two estimators when the censoring time is generated independently of X, so that DSIR is supposed to work well. Our results (not shown here) indicate that the ICPW estimator achieves almost the same accuracy as DSIR in that setup.

Figure 1.

The ICPW estimation of SY|X: median vector correlation coefficient versus the sample size. The solid line denotes the benchmark with no censoring; the dashed line denotes the ICPW estimator with the censoring distribution estimated by a PH model; the dot-dashed line denotes the ICPW estimator with the censoring distribution estimated by the Kaplan-Meier method without covariate information; the dotted line denotes the ICPW estimator with the censoring distribution estimated by the local Kaplan-Meier method based on DSIR; the long-dashed line denotes the DSIR estimator.

3.2 Sensitivity analysis

In the ICPW literature, sensitivity analyses have often been employed to evaluate the effect of misspecification of the censoring time distribution. In this section, we conduct an intensive sensitivity analysis by varying the censoring distribution specified in (9).

Specifically, the hazard function of εc in (9) is given by

\lambda_{\varepsilon_c}(t, r) = \frac{\exp(t)}{1 + r\exp(t)}, \tag{10}

where the constant r controls the level of deviation from a PH model. That is, when r = 0, it yields a PH model; as r increases, it leads to a model that deviates from PH; and when r = 1, it corresponds to a PO model for the censoring time distribution. In our sensitivity analysis, we consider r = 0.25, 0.5, 0.75 and 1.0. In addition, we also examine a log-normal model for C by generating εc from a standard normal distribution (denoted by log normal), and a PH model (r = 0) but with an interaction term of the covariates, i.e., C = exp(c − X1 + X4 + 0.5X1X2 + εc) (denoted by mis-spec. PH). Recall that for our ICPW estimator, the censoring time distribution is always obtained by fitting a PH model with all linear terms of the covariates, and as such the model is misspecified in all of the above cases. The true survival time is still generated from models (7) and (8), respectively.

Table 1 summarizes the results in terms of the median vector correlation based on 100 data replications, with 40% censoring. As a comparison, the results for the DSIR estimator are also reported. It is seen that the performance of ICPW degrades as the censoring time distribution deviates away from the PH model; however, it maintains a competitive accuracy throughout. More importantly, it always outperforms DSIR, often by a large margin, even when the censoring distribution is misspecified. Such sensitivity analyses, though by no means comprehensive, clearly suggest that the proposed ICPW estimator provides a useful solution to sufficient dimension reduction estimation, and is a better alternative to DSIR that is widely used in the current literature. For practical application of the ICPW estimator, sensitivity analyses using various censoring time distributions are always recommended.

Table 1.

Sensitivity analysis: median vector correlation coefficient for various censoring time distributions different from the proportional hazards model. For each scenario, we report the results when the censoring time distribution follows model (10) with a varying value of r, a log-normal model, and a PH model with an interaction term.

Model  ε    εc                 n = 400           n = 800           n = 1200
                             ICPW    DSIR      ICPW    DSIR      ICPW    DSIR
(7) PH r = 0.25 0.969 0.835 0.976 0.838 0.980 0.853
r = 0.50 0.968 0.852 0.969 0.858 0.971 0.875
r = 0.75 0.964 0.868 0.969 0.879 0.965 0.891
r = 1.00 0.964 0.888 0.957 0.900 0.958 0.905
log normal 0.962 0.725 0.967 0.736 0.964 0.748
mis-spec. PH 0.969 0.857 0.976 0.857 0.981 0.870

(7) PO r = 0.25 0.968 0.782 0.977 0.780 0.981 0.794
r = 0.50 0.967 0.800 0.972 0.803 0.972 0.819
r = 0.75 0.960 0.824 0.967 0.821 0.964 0.844
r = 1.00 0.959 0.831 0.958 0.841 0.958 0.856
log normal 0.958 0.663 0.966 0.674 0.967 0.678
mis-spec. PH 0.967 0.813 0.975 0.818 0.980 0.829

(8) PH r = 0.25 0.757 0.491 0.829 0.562 0.846 0.593
r = 0.50 0.729 0.484 0.789 0.559 0.833 0.589
r = 0.75 0.695 0.525 0.780 0.587 0.803 0.618
r = 1.00 0.687 0.554 0.771 0.614 0.805 0.650
log normal 0.716 0.393 0.727 0.496 0.776 0.511
mis-spec. PH 0.761 0.449 0.812 0.529 0.850 0.566

(8) PO r = 0.25 0.750 0.440 0.821 0.519 0.850 0.555
r = 0.50 0.710 0.461 0.816 0.546 0.849 0.575
r = 0.75 0.697 0.475 0.775 0.561 0.813 0.593
r = 1.00 0.694 0.507 0.777 0.570 0.811 0.608
log normal 0.715 0.367 0.737 0.475 0.751 0.492
mis-spec. PH 0.768 0.449 0.811 0.522 0.856 0.563

We have also briefly compared the ICPW estimators based on the semiparametric PH model estimation and the nonparametric local Kaplan-Meier estimate considered in the previous section under censoring time misspecification, where the true censoring times are generated from a proportional odds model. The results, which are reported in the supplementary Appendix for brevity, show that the nonparametric estimator performs slightly better than the semiparametric estimator for model (7), while the two estimators have very comparable performance for model (8). One possible explanation is that when the true number of sufficient predictors from both the survival time and the censoring time is relatively small (e.g., in model (7), this number equals 2), the nonparametric solution is expected to have competitive performance, but when the true dimension is relatively large (e.g., in model (8), this number equals 3), the kernel-based nonparametric estimator is expected to suffer. Moreover, in the current simulations, we treat the true number of sufficient predictors as known. In practice, it is not uncommon for this number to be overestimated, which would adversely affect the nonparametric estimator. By contrast, the semiparametric estimator is expected to be less sensitive to overestimation of the true dimension, since it always fits a PH model for the censoring times with all the covariates.

3.3 Model assessment and comparison

In Sections 3.1 and 3.2, we have focused our attention on evaluating the estimation accuracy of the dimension reduction basis, while no model fitting is conducted after the reduction. From a complete modeling point of view, it is also informative to study the performance of dimension reduction aided modeling of the survival time, and to compare it with some common survival modeling practices in the literature. This is the focus of this section.

Toward that end, we employ the mean squared error (MSE) of the estimated covariate effects as the evaluation criterion, i.e., MSE = (1/n) \sum_{i=1}^{n} \{\hat f(X_i) - f_0(X_i)\}^2, where f0(·) is the true covariate effect and f̂(·) is the estimated covariate effect. We continue to employ models (7) and (8) and compare four solutions. The first is SDR-aided Cox modeling. That is, we first use ICPW to obtain the estimated sufficient predictors υ̂1 = γ̂1ᵀX for model (7), and υ̂1 = γ̂1ᵀX, υ̂2 = γ̂2ᵀX for model (8). We then fit a Cox PH model with saturated covariate effects, i.e., λ(t) = λ0(t) exp(θ1υ̂1 + θ2υ̂1²) for (7), and λ(t) = λ0(t) exp(θ1υ̂1 + θ2υ̂2 + θ3υ̂1² + θ4υ̂2² + θ5υ̂1 × υ̂2) for (8). Since SDR effectively reduces the number of covariates from p = 10 to d = 1 or 2, it affords us the flexibility to consider more complex models, such as a quadratic or higher-order polynomial, rather than just a linear model. The second solution is similar to the first one, but the reduced covariates are obtained by DSIR, with dimension d + 1 to incorporate the directions in both the survival time distribution and the censoring time distribution. The third solution is a standard linear Cox model with the original covariates X. The last is a Cox model based on cubic spline bases of X with three inner knots, so as to accommodate nonlinear covariate effects. A sketch of the first solution is given below.
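The SDR-aided Cox fit can be sketched in Python with the lifelines package, assuming it is available; this is our own illustration, and the function and column names are hypothetical.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def cox_on_sdr_variates(X, Ytil, delta, Gamma_hat):
    """Fit a Cox PH model with saturated (quadratic) effects of the SDR variates
    v_j = gamma_j^T X, mirroring the model-assessment comparison described above."""
    V = X @ Gamma_hat                                   # n x d matrix of estimated sufficient predictors
    d = V.shape[1]
    df = pd.DataFrame({"time": Ytil, "event": delta})
    for j in range(d):
        df[f"v{j+1}"] = V[:, j]
        df[f"v{j+1}_sq"] = V[:, j] ** 2                 # quadratic main effects
    for j in range(d):
        for k in range(j + 1, d):
            df[f"v{j+1}v{k+1}"] = V[:, j] * V[:, k]     # pairwise interactions
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time", event_col="event")
    return cph
```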

The median MSEs out of 100 replications for the four methods are reported in Table 2. It is seen that the Cox model based on the ICPW estimator achieves the smallest MSEs in almost all scenarios, suggesting that the proposed dimension reduction method can help build a better model than the standard solutions.

Table 2.

Model assessment and comparison: median mean squared error of the estimated covariate effects. Under comparison are the Cox model fittings based on the ICPW estimate, the DSIR estimate, the original predictors, and the spline basis of the original predictors, respectively.

Model ε Method n = 400 n = 800 n = 1200
(7) PH Cox-ICPW 0.656 0.619 0.603
Cox-DSIR 0.797 0.643 0.591
Cox-Linear 1.200 1.293 1.351
Cox-Spline 0.864 0.926 0.990

(7) PO Cox-ICPW 0.888 0.865 0.858
Cox-DSIR 1.091 0.978 0.936
Cox-Linear 1.321 1.369 1.435
Cox-Spline 1.046 1.124 1.175

(8) PH Cox-ICPW 0.511 0.403 0.353
Cox-DSIR 1.052 1.017 0.977
Cox-Linear 0.864 0.890 0.932
Cox-Spline 0.755 0.796 0.818

(8) PO Cox-ICPW 0.624 0.548 0.523
Cox-DSIR 1.122 1.104 1.131
Cox-Linear 0.911 0.938 0.975
Cox-Spline 0.816 0.881 0.900

3.4 Variable selection

We next evaluate the sparse estimation of the central subspace in terms of variable selection. We employ the simulation setup of Section 3.1, but vary the number of active predictors in the basis of SY|X. More specifically, for model (7), γ1 = (1, …, 1, 0, …, 0)ᵀ, with the number of ones equal to 3, 6, and 10; and for model (8), γ1 takes the same form but the number of ones equals 1, 4, and 8, while γ2 remains the same as before. As a result, the number of zeros equals 7, 4, and 0 for both models, corresponding roughly to very sparse, medium sparse, and non-sparse scenarios. In addition, we introduce correlation among the predictors, with Corr(Xi, Xj) = 0.5|i–j|, 1 ≤ i, j ≤ p.

For (7) and (8), both the PH and PO models are examined, and we only report the results for the PO model of (7) and the PH model of (8) with 40% censoring in Table 3; the results for the other two cases are very similar. We show results for both the non-regularized ICPW estimator and its regularized version (denoted by ICPWs in the table). Evaluation criteria include the number of correctly identified zeros (cor 0) and the number of incorrectly identified zeros (inc 0), both of which measure the accuracy of variable selection, and the vector correlation coefficient (vcr), which measures the estimation accuracy of the sparse central subspace basis. It is seen from the table that for the regularized ICPW estimator (ICPWs), the number of incorrect zeros is 0 or very small, indicating that nearly all truly active predictors have been selected, while the number of correct zeros is only slightly smaller than the true number, indicating that false positive selection is low. Furthermore, the generally large values of the vector correlation coefficient suggest that the sparse basis estimate is also accurate.

Table 3.

Variable selection: estimation accuracy in terms of variable selection (correctly identified zeros and incorrectly identified zeros) and basis estimation (vector correlation coefficient).

Model  ε       # of true 0   Estimator          n = 400                   n = 800
                                            cor 0   inc 0   vcr       cor 0   inc 0   vcr
(7) odds 7 ICPW 0.00 0.00 0.942 0.00 0.00 0.967
ICPWs 5.87 0.00 0.958 6.11 0.00 0.979

4 ICPW 0.00 0.00 0.916 0.00 0.00 0.953
ICPWs 3.00 0.00 0.902 4.00 0.00 0.957

0 ICPW 0.00 0.00 0.908 0.00 0.00 0.941
ICPWs 0.00 1.00 0.858 0.00 0.00 0.918

(8) hazards 7 ICPW 0.00 0.00 0.805 0.00 0.00 0.850
ICPWs 4.63 0.64 0.652 4.02 0.33 0.779

4 ICPW 0.00 0.00 0.729 0.00 0.00 0.845
ICPWs 4.00 0.00 0.718 4.00 0.00 0.881

0 ICPW 0.00 0.00 0.683 0.00 0.00 0.819
ICPWs 0.00 1.00 0.620 0.00 0.00 0.808

3.5 PBC data

To illustrate the proposed method, we first apply it to the primary biliary cirrhosis (PBC) data collected at the Mayo Clinic. These data have been analyzed in the literature. In particular, Zhang and Lu (2007) concentrated on 276 subjects with complete observations and concluded that 8 covariates, age, albumin (alb), serum bilirubin (bil), urine copper (cop), prothrombin time (prot), liver enzyme (sgot), presence of oedema (oed), and stage, are important in modeling the subjects’ survival time. Following Zhang and Lu (2007), we focus on these 8 covariates in our illustration. Among them, the variable oed takes three possible values, 0, 0.5 and 1, while the variable stage takes integer values between 1 and 4. For simplicity, we treat all variables as continuous. Among the 276 subjects, 147 have their responses censored.

We apply the ICPW estimation to these data, and the criterion (4) concludes that d̂ = 2. We next obtain the two SDR variates, υ̂1 = γ̃1ᵀX and υ̂2 = γ̃2ᵀX, where γ̃1 and γ̃2 are the sparse ICPW estimates. Figure 2(a) shows the solution path of the shrinkage vector α as the regularization parameter τ increases. The vertical line indicates the value of τ selected by BIC, and all covariates but age and liver enzyme are selected. It is also seen that when τ equals the number of predictors, all elements of α̂ become 1. To complete the analysis, we next fit a Cox proportional hazards model with full quadratic effects, i.e., λ(t) = λ0(t) exp(θ1υ̂1 + θ2υ̂2 + θ3υ̂1² + θ4υ̂2² + θ5υ̂1 × υ̂2), and find that the main effect term υ̂2 and the interaction term υ̂1 × υ̂2 are significant. We then re-fit the model with these two terms, yielding λ(t) = λ0(t) exp(−1.15 υ̂2 − 0.18 υ̂1 × υ̂2), with the corresponding p-values for both coefficients less than 10−4. Figure 2(b) shows the Kaplan-Meier estimates of the survival curves for the three risk groups determined by the estimated risk scores (divided at the 33% and 66% quantiles). A good separation of the three risk curves is observed, suggesting that the model fits the data reasonably well.

Figure 2.

Analysis of the PBC data: the left panel shows the entire solution path, where the vertical line denotes the selected tuning parameter. The right panel shows the Kaplan-Meier estimate of survival curves for the three risk groups of patients.

3.6 DLBCL gene expression data

We next analyze the diffuse large B-cell lymphoma (DLBCL) microarray gene expression data of Rosenwald et al. (2002). The data consist of the survival times of 240 DLBCL patients, along with the expression levels of 7399 genes for each patient. Following the practice of Li and Luan (2005) and Lu and Li (2008), we focus on the top 50 genes ranked by their univariate Cox scores in our analysis. The data were pre-partitioned by Rosenwald et al. (2002) into a training group of 160 samples and a testing group of 80 samples. Moreover, we remove from further analysis five subjects whose observed survival times equal exactly zero, leaving 156 training samples and 79 testing samples. Among the 235 patients, 102 have censored responses.

The ICPW estimator concludes that d̂ = 3 based on the training samples. Three SDR variates are constructed, υ̂j = γ̃jᵀX, j = 1, 2, 3, where the γ̃j’s are obtained from the sparse ICPW estimation. Ten genes out of 50 are selected. We next fit a Cox proportional hazards model with full quadratic effects based on the υ̂j’s and find that all three main effects and one interaction term are significant. The final model is λ(t) = λ0(t) exp(−1.12 υ̂1 − 0.427 υ̂2 + 1.182 υ̂3 + 0.757 υ̂2 × υ̂3), with the corresponding p-values equal to 0.0003, 0.018, 1.9e-7, and 0.032, respectively. Figure 3(a) shows the Kaplan-Meier estimates of the survival curves for the three risk groups determined by the estimated risk scores for the training data. The three groups are well separated, again suggesting a good model fit. We also briefly evaluate the predictive performance of the fitted model: Figure 3(b) shows the survival curves of the three risk groups for the testing data. The separation is reasonably good, suggesting satisfactory prediction performance of the fitted survival model.

Figure 3.

Analysis of the DLBCL data: the left panel shows the Kaplan-Meier estimates of the survival curves for the three risk groups of patients in the training data, and the right panel shows those for the testing data.

4. Discussion

In this article, we have developed a sufficient dimension reduction estimator for regressions with censored responses. It is based on a least squares formulation of sliced inverse regression, which is then coupled with inverse censoring probability weighted estimation. The main motivation is that, when the dimension of the predictors is high, the failure time can depend on the covariates in a very complicated fashion and a direct modeling can be very challenging. We believe a dimension reduction step that reduces the high dimensional covariates to a few summary predictors would greatly facilitate modeling of such type of data. Both our theoretical and numerical studies have offered some support to this belief.

Our new estimator requires modeling the censoring time distribution, and its asymptotic properties hinge on correct estimation of GX(t) = P(C ≥ t|X). This task itself is by no means easy when the dimension of X is high. However, several reasons lead us to employ the ICPW strategy in our dimension reduction context. First, ICPW has been widely used in the survival analysis literature and has been shown to be an effective solution for both estimation and inference. Second, in many well-designed and well-followed clinical trials, the censoring distribution is often relatively easier to model than the failure time distribution of interest. Third, we always recommend carrying out a sensitivity analysis in practice to examine the performance of the ICPW estimator when the censoring distribution is misspecified. This is indeed a common practice in the ICPW literature. Our intensive sensitivity analysis has suggested that the proposed ICPW estimator maintains a competitive accuracy under censoring time model misspecification, and, more importantly, it always outperforms, often by a large margin, the currently used double slicing inverse regression estimator. As such, we feel our ICPW estimator provides a useful and better alternative for SDR with censored responses.

The least squares formulation used in this article also points to possible future developments of SDR estimators for censored regressions. One option is the rank estimation that has been widely studied in the survival analysis literature (Tsiatis, 1990; Ying, 1993; among others). In the context of standard linear regression with right censored data, the rank estimator makes use of the well-studied counting-process martingale theory, and is known to be generally more efficient than the ICPW estimator. In our context of central subspace estimation, one may consider estimating βk in (2) by solving weighted log-rank estimating equations. However, unlike for the ICPW estimator, the asymptotic properties of the rank estimator of the central subspace are hard to obtain, since they require a linear model for the transformed survival time, which is generally not true in the dimension reduction context. A careful study of the rank estimator, in terms of both asymptotic properties and finite sample performance, will be pursued in our future research.

Supplementary Material

Supplemental data

ACKNOWLEDGEMENTS

The authors thank the associate editor and two referees for their comments that substantially improved the presentation of the article. This work was partially supported by NIH/NCI grant R01 CA-140632 (WL) and NSF grant DMS-0706919 (LL).

Footnotes

Supplementary Materials

The Web Appendix referenced in Sections 2.2 and 2.3 is available under the Paper Information link at the Biometrics website http://www.tibs.org/biometrics.

REFERENCES

  1. Andersen PK, Gill RD. Cox’s regression model for counting processes: a large sample study. Annals of Statistics. 1982;10:1100–1120.
  2. Bennett S. Analysis of survival data by the proportional odds model. Statistics in Medicine. 1983;2:273–277. doi: 10.1002/sim.4780020223.
  3. Beran R. Nonparametric regression with randomly censored survival data. Technical Report. Berkeley: University of California; 1981.
  4. Breiman L. Better subset regression using the nonnegative garrote. Technometrics. 1995;37:373–384.
  5. Bondell HD, Li L. Shrinkage inverse regression estimation for model free variable selection. Journal of the Royal Statistical Society, Series B. 2009;71:287–299.
  6. Cheng SC, Wei LJ, Ying Z. Analysis of transformation models with censored data. Biometrika. 1995;82:835–845.
  7. Cheng SC, Wei LJ, Ying Z. Prediction of survival probabilities with semi-parametric transformation models. Journal of the American Statistical Association. 1997;92:227–235.
  8. Cook RD. Dimension reduction and graphical exploration in regression including survival analysis. Statistics in Medicine. 2002;22:1399–1413. doi: 10.1002/sim.1503.
  9. Cook RD. Testing predictor contributions in sufficient dimension reduction. Annals of Statistics. 2004;32:1061–1092.
  10. Cook RD, Ni L. Using intraslice covariances for improved estimation of the central subspace in regression. Biometrika. 2006;93:65–74.
  11. Cox DR. Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B. 1972;34:187–220.
  12. Cox DR, Oakes D. Analysis of Survival Data. New York: Chapman and Hall; 1984.
  13. Eaton M. A characterization of spherical distributions. Journal of Multivariate Analysis. 1986;20:272–276.
  14. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Annals of Statistics. 2004;32:407–451.
  15. Fan J, Gijbels I. Censored regression: nonparametric techniques and their applications. Journal of the American Statistical Association. 1994;89:560–570.
  16. Fine J, Ying Z, Wei LJ. On the linear transformation model for censored data. Biometrika. 1998;85:980–986.
  17. Fu WJ. Penalized regression: the bridge versus the lasso. Journal of Computational and Graphical Statistics. 1998;7:397–416.
  18. Hall P, Li KC. On almost linearity of low dimensional projections from high dimensional data. Annals of Statistics. 1993;21:867–889.
  19. Hotelling H. Relations between two sets of variates. Biometrika. 1936;28:321–377.
  20. Kong L, Cai J, Sen PK. Weighted estimating equations for semiparametric transformation model with censored data from a case-cohort design. Biometrika. 2004;91:305–319.
  21. Jones MC. The performance of kernel density functions in kernel distribution function estimation. Statistics and Probability Letters. 1990;9:129–132.
  22. Koul H, Susarla V, van Ryzin J. Regression analysis with randomly right-censored data. Annals of Statistics. 1981;9:1276–1288.
  23. Li H, Luan Y. Boosting proportional hazards models using smoothing spline, with application to high-dimensional microarray data. Bioinformatics. 2005;21:2403–2409. doi: 10.1093/bioinformatics/bti324.
  24. Li KC. Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association. 1991;86:316–327.
  25. Li KC, Wang JL, Chen CH. Dimension reduction for censored regression data. Annals of Statistics. 1999;27:1–23.
  26. Li L, Li H. Dimension reduction methods for microarrays with application to censored survival data. Bioinformatics. 2004;20:3406–3412. doi: 10.1093/bioinformatics/bth415.
  27. Li L, Simonoff JS, Tsai CL. Tobit model estimation and sliced inverse regression. Statistical Modelling. 2007;7:107–123.
  28. Lu W, Li L. Boosting method for nonlinear transformation models with censored survival data. Biostatistics. 2008;9:658–667. doi: 10.1093/biostatistics/kxn005.
  29. Peng L, Fine J. Competing risks quantile regression. Journal of the American Statistical Association. 2009;104:1440–1453.
  30. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Staudt LM. The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. The New England Journal of Medicine. 2002;346:1937–1947. doi: 10.1056/NEJMoa012914.
  31. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  32. Tsiatis AA. Estimating regression parameters using linear rank tests for censored data. Annals of Statistics. 1990;18:354–372.
  33. Wang H, Leng C. Unified lasso estimation via least squares approximation. Journal of the American Statistical Association. 2007;102:1039–1048.
  34. Xia Y, Zhang D, Xu J. Dimension reduction and semiparametric estimation of survival models. Journal of the American Statistical Association. 2010;105:278–290.
  35. Yin X, Cook RD. Dimension reduction for the conditional kth moment in regression. Journal of the Royal Statistical Society, Series B. 2002;64:159–175.
  36. Ying Z. A large sample study of rank estimation for censored regression data. Annals of Statistics. 1993;21:76–99.
  37. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B. 2006;68:49–67.
  38. Zhang HH, Lu W. Adaptive-LASSO for Cox’s proportional hazards model. Biometrika. 2007;94:1–13.
  39. Zhu LX, Miao B, Peng H. On sliced inverse regression with large dimensional covariates. Journal of the American Statistical Association. 2006;101:630–643.
  40. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
