Abstract
When there is not enough scientific knowledge to assume a particular regression model, sufficient dimension reduction is a flexible yet parsimonious nonparametric framework to study how covariates are associated with an outcome. We propose a novel estimator of low-dimensional composite scores, which can summarize the contribution of covariates on a right-censored survival outcome. The proposed estimator determines the degree of dimension reduction adaptively from data; it estimates the structural dimension, the central subspace and a rate-optimal smoothing bandwidth parameter simultaneously from a single criterion. The methodology is formulated in a counting process framework. Further, the estimation is free of the inverse probability weighting employed in existing methods, which often leads to instability in small samples. We derive the large sample properties for the estimated central subspace with data-adaptive structural dimension and bandwidth. The estimation can be easily implemented by a forward selection algorithm, and this implementation is justified by asymptotic convexity of the criterion in working dimensions. Numerical simulations and two real examples are given to illustrate the proposed method.
Keywords: central subspace, counting process, data-adaptive bandwidth, higher-order kernel, structural dimension
1. Introduction
In survival analysis, a major interest is to predict or explain the association between survival times and covariates of interest when the survival time is subject to censoring caused by the termination of a follow-up study or patients' drop-out. In the literature, semiparametric models for right-censored survival data include Cox's proportional hazards model (Cox, 1972), the proportional odds model (Bennett, 1983), and the accelerated failure time model (Cox and Oakes, 1984), among many others. Although semiparametric models do not impose full distributional assumptions, certain parametric structures are still specified for the relation between the response and covariates. In practice, there is often not enough scientific knowledge to assume a particular transformation or link function. A possible solution is fully nonparametric regression, such as Beran's estimator (Beran, 1981) for the conditional survival function. When the number of covariates grows, however, nonparametric estimators suffer from the curse of dimensionality. To obtain a more flexible yet parsimonious model formulation between the parametric and nonparametric frameworks, sufficient dimension reduction (Li, 1991) arises as an appealing middle ground, in which the model complexity is controlled by the structural dimension. To truly let the data speak, the key is the ability to estimate the structural dimension jointly with the central subspace, for which we provide a rigorous solution in this paper for censored survival outcomes collected in biomedical studies.
For uncensored data, various methods have been proposed to estimate the central subspace of the sufficient dimension reduction model with a fixed dimension, including, representatively, inverse regression (Li, 1991; Li and Wang, 2007; Zhu et al., 2010), minimum average variance estimation coupled with average derivatives (Zhu and Zeng, 2006; Xia, 2007; Wang and Xia, 2008; Yin and Li, 2011), the semiparametric framework (Ma and Zhu, 2012, 2013), and reproducing kernel approaches (Fukumizu et al., 2009; Fukumizu and Leng, 2014). To determine the structural dimension, commonly used methods are sequential testing (Li, 1991), BIC-type criteria (Zhu et al., 2006; Ma and Zhang, 2015), cross-validation (Wang and Xia, 2008), and the bootstrap (Dong and Li, 2010). Under a right-censoring mechanism, the data structure may not permit direct extensions of these approaches, and only a limited number of methods have been studied. Using an imputation technique, Li et al. (1999) proposed a consistent estimator for the central subspace by calculating the conditional expectation of the unobserved part of the response in sliced inverse regression. Lu and Li (2011) used inverse censoring probability weighting (ICPW) to remove the bias caused by censoring and determined the structural dimension by a BIC-type criterion. Similarly, Nadkarni et al. (2011) proposed a minimum discrepancy approach coupled with inverse censoring weighting to build a more efficient inverse regression estimator, with the structural dimension estimated by bootstrapping. To relax strong assumptions, such as the linearity and constant variance conditions on the design matrix imposed by conventional sliced inverse regression, Xia et al. (2010) utilized inverse survival weighting and double kernel smoothing techniques and proposed minimum average variance estimation based on hazard functions (hMAVE). To obtain the structural dimension, these authors applied a cross-validation criterion for the conditional hazard function.
Among all these methods, an inverse weighting technique is required to adjust for the censored response. In practice, however, the inverse weights often lead to unstable estimators, especially when the weights are close to zero. In this work, we propose a new criterion that focuses directly on the mean function of the counting process for the observed failure event, instead of treating the partially observed failure time as a missing data problem. Hence, no inverse weights are required, and the resulting estimator is more stable than existing ones. In addition, existing methodologies treat basis estimation and dimension determination as separate problems and require different criteria to estimate the parameters of interest. Instead, we use a single criterion for the simultaneous estimation of the structural dimension, the central subspace, and a rate-optimal bandwidth for the estimation of conditional cumulative hazard and survival functions, which eases the computational burden in practice. The data-adaptive bandwidth is another important contribution, since existing nonparametric methods often involve subjectively chosen bandwidths that can compromise practical performance. Moreover, no subjective tuning parameters are required.
The rest of this article is organized as follows. Section 2 introduces the model structure. The proposed estimator is introduced in Section 3 and its asymptotic properties are established. In Section 4, a series of simulation studies are conducted and two empirical examples are given in Section 5 to illustrate the proposed methods. Some concluding remarks are given in Section 6. The technical proofs are given in the Appendix.
2. Sufficient dimension reduction model for censored survival data
Let T denote the failure time of interest and X = (X1, … ,Xp)T be a covariate vector. The sufficient dimension reduction model is of the form:
$$ T \perp\!\!\!\perp X \mid B^{\mathrm{T}}X \tag{1} $$
for some full-rank p × d parameter matrix B with d ≤ p, where $\perp\!\!\!\perp$ denotes statistical independence. The column space of B is called a sufficient dimension reduction subspace and is denoted by span(B). Obviously, (1) holds trivially when d = p and B is the p × p identity matrix, since $T \perp\!\!\!\perp X \mid X$. Moreover, when span(B1) is a sufficient dimension reduction subspace and $\mathrm{span}(B_1) \subseteq \mathrm{span}(B_2)$, it is easy to see that span(B2) is also a sufficient dimension reduction subspace. Thus, the model with fixed d1 is a submodel of that with fixed d2 > d1. Due to this nested structure, the primary parameter of interest is the sufficient dimension reduction subspace with the smallest dimension, which is called the central subspace and is denoted by $\mathcal{S}_{T|X}$. The corresponding basis matrix is denoted by B0 and its dimension d0 is called the structural dimension. Some discussion of the existence and uniqueness of the central subspace can be found in Cook (1998).
Another equivalent form of (1) is
$$ F_T(t \mid x) = F(t, B^{\mathrm{T}}x) \tag{2} $$
for some unknown function F (·, ·), where FT (t | x) is the conditional distribution function of T given X = x. Expression (2) shows that sufficient dimension reduction is indeed a distribution regression problem and that the central subspace can capture all the information between T and X. Let λT (t | x) be the conditional hazard function of T given X = x. By the one-to-one relationship between the distribution and the hazard function, (2) is equivalent to
$$ \lambda_T(t \mid x) = \lambda(t, B^{\mathrm{T}}x) \tag{3} $$
for some unspecified function λ(·, ·) (Xia et al., 2010). Under (2) and (3), FT (t | x) and λT (t | x) remain the same for any basis matrix B with the same column space. In fact, there are infinitely many basis matrices spanning the same space, which are isomorphic up to a linear transformation. The parameter space of B is a subspace of $\mathbb{R}^{p \times d}$ known as the Grassmann manifold (Ma and Zhu, 2013).
In survival analysis, the failure time is often censored by a censoring time C. One can only observe Y = T ∧ C = min(T,C) and the non-censoring indicator δ = 1(T ≤ C), where 1(·) represents the indicator function. For identifiability, conditional independence between T and C is assumed; that is,
$$ T \perp\!\!\!\perp C \mid X. \tag{4} $$
The condition (4) is a common assumption in regression analysis of survival data. Let SY (t | x), ST (t | x), and SC (t | x) be the conditional survival functions of Y, T, and C given X = x. From (4), it is easy to see that SY (t | x) = ST (t | x)SC(t | x), and a similar factorization holds for the conditional law of (Y, δ). These properties further ensure that $\mathcal{S}_{Y|X} \subseteq \mathcal{S}_{T|X} + \mathcal{S}_{C|X}$ and $\mathcal{S}_{(Y,\delta)|X} \subseteq \mathcal{S}_{T|X} + \mathcal{S}_{C|X}$, where the sum L1 + L2 of two linear subspaces L1 and L2 is defined as {v1 + v2 : v1 ∈ L1, v2 ∈ L2}. Since only Y and (Y, δ) are observable, existing methods for uncensored data can be directly applied to obtain $\mathcal{S}_{Y|X}$ and $\mathcal{S}_{(Y,\delta)|X}$. However, these subspaces cannot recover $\mathcal{S}_{T|X}$ directly. Thus, we have to investigate the relationship between $\mathcal{S}_{(Y,\delta)|X}$ and $\mathcal{S}_{T|X}$ to target the primary parameter of interest.
Since the hazard function can only be identified up to the maximal support of the survival function of Y, denoted by τ, one can only estimate the central subspace of T up to τ, in the sense that B0 satisfies (2) and (3) for t ∈ [0, τ]. For example, when $\lambda_T(t \mid x) = \lambda(t, B_0^{\mathrm{T}}x)$ for t ∈ [0, τ] and $\lambda_T(t \mid x) = \lambda^{*}(t, B_1^{\mathrm{T}}x)$ for t > τ with $\mathrm{span}(B_1) \not\subseteq \mathrm{span}(B_0)$, the overall central subspace is $\mathrm{span}(B_0) + \mathrm{span}(B_1)$. In such cases, $\mathrm{span}(B_1)$ can never be identified from the right-censored data observable up to τ, but our proposed method would still be able to estimate B0. Since our method can be applied to finite or infinite τ, for simplicity, we set τ to be +∞ in the following discussion so that the parameter of interest is the same as span(B0).
3. The Proposed Estimator
We propose an estimation criterion based on the counting process Nt = 1(Y ≤ t, δ = 1) for the observed failure event. Let Rt = 1(Y ≥ t) be the at-risk process. From (3) and (4), we have the following:
Proposition 1. $E(N_t \mid X = x) = \int_0^t E(R_u \mid X = x)\,\lambda(u, B_0^{\mathrm{T}}x)\,du$ for all t ≥ 0.
Proposition 1 transforms the original sufficient dimension reduction problem into a mean regression problem with the counting process for the observed failure event as the outcome. Although Proposition 1 seems standard in survival analysis, our objective of estimating d0 and B0 simultaneously poses a unique challenge. This requires us to consider a prediction criterion, which will be shown in (7) later, based on the least squares criterion
$$ E\left[\int_0^\infty \left\{N_t - \int_0^t R_u\,\lambda(u, B^{\mathrm{T}}X)\,du\right\}^2 dF_Y(t)\right] \tag{5} $$
for the estimation of B0, where FY (t) is the marginal distribution function of Y. Note that the expectation is taken with respect to the joint distribution of (Y, δ, X). Instead of dividing the outcome by an estimate of E(Rt | x) as in the existing methods, our approach puts Rt in the conditional mean and, hence, no inverse weight E(Rt | x) is required. A simple calculation shows that this criterion can be decomposed into
$$ E\left[\int_0^\infty \left\{N_t - \int_0^t R_u\,\lambda_T(u \mid X)\,du\right\}^2 dF_Y(t)\right] + E\left[\int_0^\infty \left\{\int_0^t R_u\,\Delta_B(u, X)\,du\right\}^2 dF_Y(t)\right] + 2E\left[\int_0^\infty \left\{N_t - \int_0^t R_u\,\lambda_T(u \mid X)\,du\right\}\left\{\int_0^t R_u\,\Delta_B(u, X)\,du\right\} dF_Y(t)\right], \tag{6} $$
where $\Delta_B(u, x) = \lambda_T(u \mid x) - \lambda(u, B^{\mathrm{T}}x)$ and FY (t | x) = 1 − SY (t | x). Note that the second term in (6) is non-negative. Thus, when both SY (t | x) and λ(t, BTx) are continuous in t ∈ (0, ∞) and SY (t | X) > 0 for t ∈ [0, τ], it can be shown that the last two terms in (6) are equal to zero if and only if $\lambda_T(t \mid x) = \lambda(t, B^{\mathrm{T}}x)$ almost everywhere. Thus, the criterion in (5) attains its minimum if and only if the column space of B is a sufficient dimension reduction subspace. To further distinguish among the overfitted models with d > d0, we follow the idea of Huang and Chiang (2017) and propose a leave-one-out cross-validation criterion in the following.
From Proposition 1, we have
$$ \Lambda(t, B_0^{\mathrm{T}}x) \equiv \int_0^t \lambda(u, B_0^{\mathrm{T}}x)\,du = \int_0^t \frac{dE(N_u \mid B_0^{\mathrm{T}}X = B_0^{\mathrm{T}}x)}{E(R_u \mid B_0^{\mathrm{T}}X = B_0^{\mathrm{T}}x)}. $$
Thus, a nonparametric estimator for $\Lambda(t, B^{\mathrm{T}}x)$ can be
$$ \hat\Lambda(t, B^{\mathrm{T}}x) = \int_0^t \frac{d\hat N(u, B^{\mathrm{T}}x)}{\hat R(u, B^{\mathrm{T}}x)} = \sum_{i=1}^n \frac{\delta_i\,1(Y_i \le t)\,K_h(B^{\mathrm{T}}X_i - B^{\mathrm{T}}x)}{\sum_{j=1}^n 1(Y_j \ge Y_i)\,K_h(B^{\mathrm{T}}X_j - B^{\mathrm{T}}x)}, $$
where $\hat N(u, B^{\mathrm{T}}x) = n^{-1}\sum_{i=1}^n 1(Y_i \le u, \delta_i = 1)\,K_h(B^{\mathrm{T}}X_i - B^{\mathrm{T}}x)$, $\hat R(u, B^{\mathrm{T}}x) = n^{-1}\sum_{i=1}^n 1(Y_i \ge u)\,K_h(B^{\mathrm{T}}X_i - B^{\mathrm{T}}x)$, $K_h(v) = h^{-d}\prod_{k=1}^d K(v_k/h)$ with h a positive bandwidth, and K is a qth order kernel function. Note that $\hat N(u, B^{\mathrm{T}}x)$ and $\hat R(u, B^{\mathrm{T}}x)$ are kernel smoothing estimators for the subdistribution of the observed failures and the at-risk probability given $B^{\mathrm{T}}X = B^{\mathrm{T}}x$, respectively. Here we suggest taking q to be the smallest even integer greater than max{2, (d + 2)/2}, for reasons to be discussed later in Remark 3. Now let (Y0, δ0, X0) be a future observation independent of the current data, and let $N^0_t = 1(Y^0 \le t, \delta^0 = 1)$ and $R^0_t = 1(Y^0 \ge t)$. To perform the cross-validation, we consider the prediction risk
$$ E\left[\int_0^\infty \left\{N^0_t - \int_0^t R^0_u\,d\hat\Lambda(u, B^{\mathrm{T}}X^0)\right\}^2 dF_Y(t)\right], \tag{7} $$
which can be decomposed into the sum of (5), miseB(h), and C(B, h), where
$$ \mathrm{mise}_B(h) = E\left[\int_0^\infty \left\{\int_0^t R^0_u\,d\{\hat\Lambda - \Lambda\}(u, B^{\mathrm{T}}X^0)\right\}^2 dF_Y(t)\right], \tag{8} $$
$$ C(B, h) = 2E\left[\int_0^\infty \left\{N^0_t - \int_0^t R^0_u\,\lambda(u, B^{\mathrm{T}}X^0)\,du\right\}\left\{\int_0^t R^0_u\,d\{\Lambda - \hat\Lambda\}(u, B^{\mathrm{T}}X^0)\right\} dF_Y(t)\right]. \tag{9} $$
Note that the expectation is taken with respect to the joint distribution of the current data and (Y0, δ0, X0). When h → 0 and $nh^d \to \infty$, we can show that both miseB(h) and C(B, h) converge to zero, and thus (7) is dominated by (5). Since model (3) has a nested structure, the minimum of (5) decreases as the working dimension increases, when the working dimension is less than the structural dimension d0. Further, as discussed in (6), (5) attains its minimum if and only if span(B) is a sufficient dimension reduction subspace. Thus, the minimum of the prediction risk occurs only when $\mathrm{span}(B_0) \subseteq \mathrm{span}(B)$. In this case, C(B, h) vanishes, and (7) reduces to the sum of the minimum of (5) and miseB(h). In addition, to minimize miseB(h), the optimal rate of h is $O\{n^{-1/(2q+d)}\}$. Thus, once the working dimension d is equal to or larger than the structural dimension and h is of this optimal rate, the excess prediction risk has an asymptotic order of $O\{n^{-2q/(2q+d)}\}$, which starts to increase in d. In summary, we have the following proposition:
Proposition 2.
Under model (1), the basis matrix B0 of the central subspace and the optimal bandwidth $h_0 = c_0 n^{-1/(2q+d_0)}$ minimize the prediction risk in (7) as h → 0, $nh^d \to \infty$, and n → ∞, where the constant $c_0$ is given in Appendix A.1.
Based on Proposition 2, the proposed estimator for (B0, h0) is the minimizer of the sample analogue
$$ \mathrm{CV}(B, h) = \frac{1}{n}\sum_{i=1}^n \int_0^\infty \left\{N_{it} - \int_0^t R_{iu}\,d\hat\Lambda^{-i}(u, B^{\mathrm{T}}X_i)\right\}^2 d\hat F_Y(t), $$
where $\hat F_Y$ is the empirical distribution function of $Y_1, \ldots, Y_n$ and the superscript −i indicates an estimator based on the sample with the ith subject deleted. Since $N_{it}$ and $\hat\Lambda^{-i}(t, B^{\mathrm{T}}X_i)$ are both step functions in t, the integrals in CV(B, h) have closed forms for computation: the outer integral reduces to an average over the jump points $Y_1, \ldots, Y_n$, and the inner integral equals $\hat\Lambda^{-i}(t \wedge Y_i, B^{\mathrm{T}}X_i)$, as illustrated in the sketch below.
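To make the closed-form computation concrete, the following is a minimal, unoptimized Python sketch of the leave-one-out criterion under the Beran-type estimator reconstructed above. All function names are ours, the triple loop is O(n³) and meant only for illustration, and the kernel is passed in as an argument (for instance, the fourth-order bi-weight kernel of Remark 3).

```python
import numpy as np

def loo_cum_hazard(t, bx0, Y, delta, BX, h, kern):
    """Leave-one-out kernel-smoothed Nelson-Aalen estimate of Lambda(t, B'x0).

    Y, delta : observed times and event indicators for the retained subjects.
    BX       : (n-1) x d array of reduced covariates B'X_j for the retained subjects.
    kern     : univariate kernel; a scaled product kernel with bandwidth h is formed.
    """
    d = BX.shape[1]
    w = np.prod(kern((BX - bx0) / h), axis=1) / h**d      # K_h(B'X_j - B'x0)
    num = delta * (Y <= t) * w                            # weighted jumps of the counting process
    den = np.array([np.sum((Y >= s) * w) for s in Y])     # weighted at-risk size at each jump time
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(den > 0, num / den, 0.0)
    return float(np.sum(ratio))

def cv_criterion(B, h, Y, delta, X, kern):
    """Leave-one-out criterion CV(B, h); the outer integral is a sum over observed Y's."""
    n = len(Y)
    BX = X @ B
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        for t in Y:                                       # empirical F_Y puts mass 1/n at each Y_k
            # int_0^t R_iu dLambda^{-i}(u, B'X_i) = Lambda^{-i}(min(t, Y_i), B'X_i)
            Lam = loo_cum_hazard(min(t, Y[i]), BX[i], Y[keep], delta[keep],
                                 BX[keep], h, kern)
            N_it = float((Y[i] <= t) and (delta[i] == 1))
            total += (N_it - Lam) ** 2
    return total / n**2
```

In practice one would vectorize the double sum over (i, t) and cache the kernel weights, since they do not depend on t.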
Because the prediction risk is asymptotically convex in d, we use the following forward selection procedure to obtain the estimator.
Step a. For d = 0, calculate
$$ \mathrm{CV}_0 = \frac{1}{n}\sum_{i=1}^n \int_0^\infty \left\{N_{it} - \int_0^t R_{iu}\,d\hat\Lambda^{-i}(u)\right\}^2 d\hat F_Y(t), $$
where $\hat\Lambda^{-i}(t) = \sum_{j \ne i} \delta_j 1(Y_j \le t)/\sum_{k \ne i} 1(Y_k \ge Y_j)$ is the leave-one-out Nelson–Aalen estimator, which uses no covariate information.
Step b. For d ≥ 1, define $(\hat B_d, \hat h_d)$ as the minimizer of CV(B, h) over all full-rank p × d matrices B and bandwidths h > 0, then calculate $\mathrm{CV}_d = \mathrm{CV}(\hat B_d, \hat h_d)$. Since B is identifiable only up to its column space, we use an iterative procedure that alternates between B and h to implement the optimization problem.
Step b1. Choose proper initial values $(\hat B_d^{(0)}, \hat h_d^{(0)})$. A possible choice of $\hat h_d^{(0)}$ is n−1/(2q+d). The choice of $\hat B_d^{(0)}$ will be discussed in Remark 1.
Step b2. For k = 1, 2, …, define $\hat h_d^{(k)}$ as the minimizer of $\mathrm{CV}(\hat B_d^{(k-1)}, h)$ in h. This step is a univariate optimization problem, which can be done by common methods such as gradient descent and Newton-type algorithms.
Step b3. Define $\hat B_d^{(k)}$ as the minimizer of $\mathrm{CV}(B, \hat h_d^{(k)})$ in B. The practical implementation will be discussed in Remark 1.
Step b4. Repeat Steps b2–b3 until $|\mathrm{CV}(\hat B_d^{(k)}, \hat h_d^{(k)}) - \mathrm{CV}(\hat B_d^{(k-1)}, \hat h_d^{(k-1)})| < \epsilon$ for some pre-chosen ϵ > 0.
Step c. Repeat Step b with d increased by one until $\mathrm{CV}_d > \mathrm{CV}_{d-1}$, and set $\hat d = d - 1$. The proposed estimator is then defined as $(\hat d, \hat B_{\hat d}, \hat h_{\hat d})$; a schematic implementation is sketched below.
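The full forward search can be summarized by the following Python skeleton. This is a sketch only: `cv_fn` stands for the criterion CV(B, h) (for instance the `cv_criterion` sketched in Section 3, with B = None encoding d = 0 via the leave-one-out Nelson–Aalen estimator), `init_B` stands for an initial basis estimator such as hMAVE, and plain unconstrained BFGS over vec(B) is substituted here for the Grassmann-manifold optimization discussed in Remark 1.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def forward_selection(Y, delta, X, cv_fn, init_B, q=4, eps=1e-4, max_iter=50):
    """Forward search over working dimensions d = 0, 1, 2, ... (Steps a-c).

    cv_fn(B, h) : cross-validation criterion CV(B, h); B = None encodes d = 0.
    init_B(d)   : an initial p x d basis estimate, e.g. from hMAVE.
    """
    n, p = X.shape
    cv_prev = cv_fn(None, None)                        # Step a: d = 0 baseline
    best = (0, None, None, cv_prev)
    for d in range(1, p + 1):
        B, h = init_B(d), n ** (-1.0 / (2 * q + d))    # Step b1: initial values
        cv_val = np.inf
        for _ in range(max_iter):                      # Steps b2-b4: alternate in h and B
            h = minimize_scalar(lambda s: cv_fn(B, s),
                                bounds=(h / 4, 4 * h), method="bounded").x
            res = minimize(lambda b: cv_fn(b.reshape(p, d), h),
                           B.ravel(), method="BFGS")   # unconstrained surrogate for Step b3
            B = res.x.reshape(p, d)
            if abs(cv_val - res.fun) < eps:            # Step b4: stop when CV stabilizes
                break
            cv_val = res.fun
        if cv_val >= cv_prev:                          # Step c: CV starts to increase in d
            break
        best, cv_prev = (d, B, h, cv_val), cv_val
    return best                                        # (d_hat, B_hat, h_hat, CV value)
```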
We show that CV(B, h) converges to the prediction risk in (7) as n → ∞ in Appendix A.4. Thus, its minimizer provides a valid estimator for the central subspace. A distinguishing feature of our estimation procedure is that it estimates the basis matrix and the dimension of the central subspace simultaneously. Thus, it requires less computing time compared to the existing proposals (Xia et al., 2010; Nadkarni et al., 2011). Moreover, the bandwidth used in the estimation criterion is selected at the same time, and it can be used to estimate the conditional survival functions after obtaining the estimated central subspace. Although the cross-validation criterion may not be convex in d for small samples, it is asymptotically convex in d, so the stopping rule of the forward searching procedure ensures the convergence of the proposed estimator to the global optimum in large samples.
Remark 1.
Step b3 can be done in two ways. First, Newton-type optimization algorithms on the Grassmann manifold (Edelman et al., 1999; Adragni et al., 2012) can be applied to solve the minimization problem. We suggest the hMAVE estimator of Xia et al. (2010) as the initial value, which can be computed quickly and does not rely on additional distributional assumptions required by some other existing methods. An alternative way to implement Step b3 is to employ a local coordinate system of the Grassmann manifold (Ma and Zhu, 2013), which transforms the Grassmann manifold optimization into an unconstrained optimization over (p − d) × d free parameters. The transformation is possible through Gaussian elimination given a consistent initial value, and a Newton-type algorithm (Fletcher and Reeves, 1964) can be directly employed in the resulting optimization problem. In limited simulations, we found that both methods have similar performance, but the latter requires slightly less computing time and is recommended.
Remark 2.
Although a cross-validation criterion has also been considered for hMAVE (Xia et al., 2010), its cross-validation values can be unbounded and are sensitive to bandwidth selection. On the other hand, our proposed method fits the observed failure process, which is bounded, by its conditional mean. As a result, the proposed cross-validation function is bounded.
Based on the notations and assumptions in Appendix A.2, the large sample properties of our proposed estimator are established in the following theorem:
Theorem 1.
Suppose that Assumptions A1–A5 are satisfied. Then $P(\hat d = d_0) \to 1$, $\hat h_{\hat d}/h_0 \to 1$ in probability, and
$$ n^{1/2}\,\mathrm{vec}(\hat B_{\hat d} - B_0) \xrightarrow{d} N(0, \Sigma) $$
as n → ∞. The asymptotic variance Σ is defined in the Appendix.
Remark 3.
We can show in the proof of Theorem 1 that the optimal bandwidth satisfies $h_0 \asymp n^{-1/(2q+d)}$ for each fixed d. Coupled with the restriction on the admissible bandwidth range in Assumption A3, the order of the kernel function should satisfy q > max{2, (d + 2)/2}. Since we always use a symmetric kernel function with an even order and require the order to be as small as possible, a practical guidance is to take q as the smallest even integer greater than max{2, (d + 2)/2} for each working dimension d. Since q ≥ 4, in the practical implementation we use the fourth-order bi-weight kernel K(u) = (105/64)(1 − 3u²)(1 − u²)² 1(|u| ≤ 1). More details about higher-order kernel functions can be found in the literature (Hansen, 2005).
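As an illustration, a direct transcription of this kernel, together with the scaled product form $K_h$ assumed in Section 3 (function names are ours):

```python
import numpy as np

def biweight4(u):
    """Fourth-order bi-weight kernel K(u) = (105/64)(1 - 3u^2)(1 - u^2)^2 on |u| <= 1."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0,
                    (105.0 / 64.0) * (1.0 - 3.0 * u**2) * (1.0 - u**2) ** 2,
                    0.0)

def product_kernel(v, h, kern=biweight4):
    """Scaled product kernel K_h(v) = h^{-d} prod_k K(v_k / h) for rows of a d-column array."""
    v = np.atleast_2d(v)
    d = v.shape[1]
    return np.prod(kern(v / h), axis=1) / h**d
```

One can check numerically that `biweight4` integrates to one while its first three moments vanish, confirming that it is of order q = 4.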
4. Simulation Studies
In this section, we investigate the finite sample performance of our proposed estimator and compare it with the hMAVE estimator (Xia et al., 2010) and the ICPW estimator (Lu and Li, 2011). We also performed additional simulations for the IRE estimator (Nadkarni et al., 2011); the results were qualitatively similar to those of the ICPW estimator and are not presented here. We first consider two settings which are slight modifications of existing examples (Xia et al., 2010). The first one is a proportional hazards model
where ε ∼ Exp(1) and X = (X1, … ,X7) ∼ N(0, I7) are independent, B0 = (−0.5, 0, 0.5, 0, −0.5, 0, 0.5)T, and with Φ(·) being the cumulative distribution function of the standard normal distribution. The censoring time follows C = Φ(2X2 + 2X3) + c1, where c1 is a constant used to control the proportions of censoring. The second setting is a nonlinear model
where ε ∼ N (0, 0.22), Xk ∼ Uniform(0, 1) independently, k = 1, … , 7, and B0 = 2−1/2(1, 0, 0, 0, 1, 0, … , 0)T. Further, the censoring time is set to be , where βc = 2−1/2(0, 1, 0, 0, 1, 0, … , 0)T and c2 is used to control the censoring rate. A more complicated model setting is also considered:
where X = (X1, … ,X20), ϕ is the standard normal density function, β1 = (1, 0, 0, 0.1, … , 0.1), β2 = (0, 1, 0, 0.1, … , 0.1), β3 = (0, 0, 1, 0.1, … , 0.1), and Xk ∼ Uniform(0, 10) are independently generated for k = 1, … , 20. The true basis matrix is hence B0 = (β1, β2, β3). The censoring time is generated similarly to M2, where βc = 2−1/2(0, 1, 0, 0, 1, 0, … , 0)T and c3 is used to control the censoring rate. All settings are implemented through 1000 simulations, and the estimation error of any estimator $\hat B$ is measured by the Frobenius norm of the difference between the projection matrices onto $\mathrm{span}(\hat B)$ and span(B0), which is invariant to the choice of basis within each subspace; see the sketch below.
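Under our reading of this error measure as a Frobenius distance between projection matrices, it can be computed as follows (a small illustrative sketch):

```python
import numpy as np

def projection_error(B_hat, B0):
    """Frobenius distance between the projections onto span(B_hat) and span(B0);
    invariant to the choice of basis within each subspace."""
    proj = lambda B: B @ np.linalg.solve(B.T @ B, B.T)
    return float(np.linalg.norm(proj(B_hat) - proj(B0), "fro"))
```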
The simulation results are displayed in Tables 1–2. One can see that our proposal selects the correct structural dimension very often. For all settings, the proportion of simulations that select the true dimension increases with the sample size. Also, our proposed estimator has a smaller average estimation error than the hMAVE and ICPW estimators, while the variabilities of the estimation errors are fairly comparable. In ICPW estimation, the conditional censoring distribution is estimated by a flexible kernel-weighted local Kaplan–Meier estimator, which suffers from the curse of dimensionality and is highly variable when the censoring rate is low; a related conclusion can be found in Lu and Li (2011). Moreover, the poor performance of the ICPW estimator under M2 is probably caused by an additional violation of a linearity condition on the covariate distribution.
Table 1:
The proportions of the estimated structural dimension, and the means and standard deviations (s.d.) of the estimated bandwidths. c.r. denotes censoring rate (%); the columns labeled 0–7 give the proportion of replications selecting each dimension.
| model | c.r. | n | $\hat d$=0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | bandwidth mean | bandwidth s.d. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 20 | 100 | 0.000 | 0.924 | 0.076 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.010 | 0.0351 |
|  |  | 200 | 0.000 | 0.950 | 0.050 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.012 | 0.0385 |
|  |  | 400 | 0.000 | 0.978 | 0.022 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.025 | 0.1082 |
|  | 50 | 100 | 0.000 | 0.934 | 0.066 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.009 | 0.0330 |
|  |  | 200 | 0.000 | 0.941 | 0.059 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.007 | 0.0292 |
|  |  | 400 | 0.000 | 0.962 | 0.038 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.007 | 0.0274 |
| M2 | 20 | 100 | 0.010 | 0.875 | 0.114 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.019 | 0.0720 |
|  |  | 200 | 0.001 | 0.957 | 0.040 | 0.002 | 0.000 | 0.000 | 0.000 | 0.000 | 0.012 | 0.0986 |
|  |  | 400 | 0.000 | 0.981 | 0.019 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.003 | 0.0191 |
|  | 50 | 100 | 0.013 | 0.799 | 0.183 | 0.005 | 0.000 | 0.000 | 0.000 | 0.000 | 0.030 | 0.0773 |
|  |  | 200 | 0.000 | 0.919 | 0.078 | 0.003 | 0.000 | 0.000 | 0.000 | 0.000 | 0.014 | 0.0727 |
|  |  | 400 | 0.000 | 0.976 | 0.023 | 0.001 | 0.000 | 0.000 | 0.000 | 0.000 | 0.006 | 0.0645 |
| M3 | 20 | 100 | 0.000 | 0.000 | 0.248 | 0.571 | 0.172 | 0.008 | 0.001 | 0.000 | 1.650 | 0.5278 |
|  |  | 200 | 0.000 | 0.000 | 0.176 | 0.688 | 0.131 | 0.005 | 0.000 | 0.000 | 1.551 | 0.4705 |
|  |  | 400 | 0.000 | 0.000 | 0.069 | 0.848 | 0.080 | 0.002 | 0.001 | 0.000 | 1.456 | 0.3111 |
|  | 50 | 100 | 0.000 | 0.000 | 0.290 | 0.566 | 0.136 | 0.007 | 0.001 | 0.000 | 1.553 | 0.4978 |
|  |  | 200 | 0.000 | 0.000 | 0.252 | 0.618 | 0.120 | 0.009 | 0.000 | 0.001 | 1.434 | 0.4258 |
|  |  | 400 | 0.000 | 0.000 | 0.200 | 0.699 | 0.099 | 0.002 | 0.000 | 0.000 | 1.389 | 0.3921 |
Table 2:
The means and standard deviations (s.d.) of the basis estimation errors, and the average computing time (in seconds). c.r. denotes censoring rate (%).
| model | c.r. | n | proposed mean | proposed s.d. | proposed time | hMAVE mean | hMAVE s.d. | hMAVE time | ICPW mean | ICPW s.d. | ICPW time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 20 | 100 | 0.087 | 0.0860 | 3.05 | 0.127 | 0.1126 | 3.19 | 0.489 | 0.4957 | 0.03 |
|  |  | 200 | 0.051 | 0.0522 | 13.05 | 0.073 | 0.0638 | 11.81 | 0.217 | 0.2854 | 0.14 |
|  |  | 400 | 0.038 | 0.0461 | 57.66 | 0.049 | 0.0576 | 97.00 | 0.077 | 0.1003 | 0.57 |
|  | 50 | 100 | 0.076 | 0.0892 | 3.11 | 0.109 | 0.1143 | 2.93 | 0.245 | 0.1495 | 0.04 |
|  |  | 200 | 0.039 | 0.0540 | 14.56 | 0.055 | 0.0694 | 11.80 | 0.113 | 0.0675 | 0.16 |
|  |  | 400 | 0.028 | 0.0414 | 54.59 | 0.036 | 0.0509 | 51.96 | 0.057 | 0.0341 | 0.61 |
| M2 | 20 | 100 | 0.140 | 0.3303 | 4.06 | 0.147 | 0.3309 | 19.36 | 0.911 | 0.0934 | 0.04 |
|  |  | 200 | 0.016 | 0.1184 | 15.92 | 0.019 | 0.1193 | 65.64 | 0.912 | 0.0749 | 0.18 |
|  |  | 400 | 0.001 | 0.0004 | 63.28 | 0.002 | 0.0009 | 278.08 | 0.910 | 0.0558 | 0.68 |
|  | 50 | 100 | 0.173 | 0.3413 | 3.69 | 0.198 | 0.3379 | 23.33 | 0.960 | 0.0514 | 0.04 |
|  |  | 200 | 0.040 | 0.1727 | 15.33 | 0.055 | 0.1737 | 72.37 | 0.961 | 0.0415 | 0.18 |
|  |  | 400 | 0.004 | 0.0435 | 62.22 | 0.010 | 0.0443 | 266.22 | 0.962 | 0.0299 | 0.63 |
| M3 | 20 | 100 | 1.714 | 0.6722 | 10.45 | 3.824 | 0.2984 | 23.42 | 3.146 | 0.2642 | 0.06 |
|  |  | 200 | 1.555 | 0.7308 | 45.33 | 3.677 | 0.2791 | 86.61 | 2.937 | 0.2217 | 0.22 |
|  |  | 400 | 1.410 | 0.7672 | 197.58 | 3.670 | 0.3039 | 335.90 | 2.825 | 0.2053 | 0.87 |
|  | 50 | 100 | 1.865 | 0.5943 | 9.26 | 4.432 | 0.4375 | 24.89 | 3.375 | 0.3008 | 0.06 |
|  |  | 200 | 1.677 | 0.6275 | 44.05 | 4.003 | 0.3615 | 94.08 | 3.043 | 0.2497 | 0.23 |
|  |  | 400 | 1.550 | 0.6666 | 178.37 | 3.792 | 0.3633 | 333.52 | 2.880 | 0.2086 | 0.89 |
Since the estimation of the conditional survival function of the observed time is not required in our proposal, the final estimator is more robust to the misspecification of the censoring distribution, the censoring rate, and the dimension of the covariates. From the computing times displayed in Table 2, we also find that our proposal is comparable to hMAVE and is often faster. Even though hMAVE adopts a local linear regression to estimate the gradient of the conditional hazard function and avoids nonlinear minimization in the estimation, it needs an iterative refinement procedure to update the estimator to deal with the curse of dimensionality. In our method, we adopt a forward selection procedure starting from lower dimensions to avoid high dimensional smoothing, and we estimate the cumulative hazard functions directly conditioning on fixed subspaces. Since there is no additional refinement, the proposed estimation procedure is often faster than hMAVE.
5. Applications
5.1. Worcester Heart Attack Study Data
The first example is the Worcester heart attack study, which collected data from 1975 to 2001 on all acute myocardial infarction patients admitted to hospitals in the Worcester, Massachusetts Standard Metropolitan Statistical Area. The main goal of this study is to describe factors associated with trends over time in the incidence and survival rates following hospital admission for acute myocardial infarction. Since the dataset is not fully released, we use a random subsample of 500 patients (Hosmer et al., 2008) and consider all 13 variables, which are displayed in the first two columns of Table 3. All variables are standardized to have mean zero and unit variance. There are 215 observed deaths in the study, and hence the censoring rate is 57%.
Table 3:
The estimated coefficients and corresponding standard errors for the Worcester heart attack study data.
| collected variable | covariate | coefficient (s.e.) |
|---|---|---|
| initial systolic blood pressure | X1 | 1 |
| initial diastolic pressure | X2 | 0.836 (0.0954) |
| congestive heart complications | X3 | −0.486 (0.0845) |
| age (in years) | X4 | 0.125 (0.0792) |
| myocardial infarction order | X5 | −0.173 (0.0917) |
| body mass index | X6 | −0.528 (0.1181) |
| gender | X7 | −0.510 (0.0683) |
| initial heart rate | X8 | 0.553 (0.0888) |
| history of cardiovascular disease | X9 | −0.062 (0.0727) |
| atrial fibrillation | X10 | −0.086 (0.0816) |
| cardiogenic shock | X11 | 0.621 (0.0802) |
| complete heart block | X12 | 0.054 (0.0319) |
| myocardial infarction type | X13 | −0.506 (0.1061) |
The cross-validation values for working dimensions d = 0, 1, 2 are 0.302, 0.247, and 0.264, with corresponding standard errors 0.022, 0.024, and 0.023. The standard errors are obtained from Hájek projections, since CV(B, h) has an asymptotic representation as a U-statistic. Thus, the estimated structural dimension is one, and the estimated coefficients of the linear index with corresponding standard errors are shown in the third column of Table 3. The estimated bandwidth is 2.538. Generally, we detect the same covariates as those detected by hMAVE (Xia et al., 2010), except age and complete heart block. Our estimated structural dimension is much smaller than that obtained by hMAVE, and a central subspace with a smaller dimension is preferred in practice. The sample correlations between our estimated linear index and the four hMAVE indices are (−0.112, 0.416, −0.399, −0.747); thus, the fourth hMAVE direction is closest to the direction selected by our cross-validation criterion. To assess the model fit for the observed failure process, we also calculate the cross-validation value based on the hMAVE estimate, which is 0.335; this is 36% larger than our cross-validation value of 0.247, whose 95% confidence interval is (0.200, 0.295). Thus, our method arrives at a more parsimonious estimate and a better fit to the observed data.
5.2. AIDS Clinical Trials Group Study 175
The second example is a randomized clinical trial to compare the effects of different treatments on adults infected with the human immunodeficiency virus type I (HIV-1) whose CD4 T cell counts were between 200 and 500 per cubic millimeter at baseline. The patients were randomly assigned to four treatment groups: zidovudine, zidovudine plus didanosine, zidovudine plus zalcitabine, and didanosine, where zidovudine alone is considered the baseline comparison group. There are 2467 patients in this dataset. Excluding the subjects who had missing values or unrecorded relevant information, we consider a subset of 2139 subjects of the original data, which is available in the R package speff2trial. A detailed description of the data can be found in the literature (Hammer et al., 1996). The event of interest is diagnosed acquired immune deficiency syndrome (AIDS), defined as the first occurrence of a decline in CD4 T cell count of at least 50, or death. In this work we are interested in assessing the effects of baseline covariates, in addition to the treatments, X = (X1, … ,X17), on the patients' time to event T. The events are observed for 521 subjects (24.4%). All of the covariates considered are listed in the first two columns of Table 4. The covariates X1 and X2 are log(CD4 counts + 1) and log(CD8 counts + 1), respectively, centered and standardized. The covariates X6, X7, and X11 are log transformed, centered, and standardized, and X14 is centered and standardized. In the literature, some studies indicated that log(CD4 counts + 1) and log((CD4 counts + 1)/(CD8 counts + 1)) may be better predictors than log(CD4 counts + 1) and log(CD8 counts + 1). However, under the proposed semiparametric model, the newly designed covariates are just linear combinations of the original ones. Thus, they lead to the same conditional survival model and prediction values for the survival time. For convenience, we simply choose log(CD4 counts + 1) and log(CD8 counts + 1) as our design covariates.
Table 4:
The estimated coefficients and corresponding standard errors for ACTG175 data.
| collected variable | covariate | first index (s.e.) | second index (s.e.) |
|---|---|---|---|
| CD4 T cell count | X1 | 1 | 0 |
| CD8 T cell count | X2 | 0 | 1 |
| treatment arm (vs. zidovudine): |  |  |  |
| zidovudine and didanosine | X3 | 2.964 (0.1327) | −1.727 (0.2006) |
| zidovudine and zalcitabine | X4 | 1.666 (0.0833) | −1.584 (0.1852) |
| didanosine | X5 | 1.284 (0.1188) | −2.204 (0.1188) |
| age (in years) | X6 | −0.172 (0.0408) | 0.281 (0.0536) |
| weight (in kg) | X7 | −0.217 (0.0256) | 0.862 (0.0475) |
| hemophilia | X8 | 1.002 (0.1592) | 2.045 (0.2823) |
| homosexual activity | X9 | 0.007 (0.0955) | 0.232 (0.1490) |
| history of intravenous drug use | X10 | 0.340 (0.0885) | −1.933 (0.1589) |
| Karnofsky score | X11 | 0.489 (0.0205) | −1.086 (0.0729) |
| prior treatment: |  |  |  |
| non-zidovudine antiretroviral | X12 | 0.947 (0.2025) | 3.012 (0.2981) |
| zidovudine use in the prior 30 days | X13 | 1.686 (0.1117) | 7.248 (0.2151) |
| number of days of antiretroviral use | X14 | 0.115 (0.0628) | 1.964 (0.1229) |
| race | X15 | −0.035 (0.0597) | −1.061 (0.1269) |
| gender | X16 | 1.192 (0.1372) | 4.020 (0.1503) |
| symptomatic indicator | X17 | −1.033 (0.0445) | 0.891 (0.0894) |
The cross-validation values for working dimensions d = 0, 1, 2, 3 are 0.193, 0.190, 0.188, and 0.189, each with standard error 0.010. Our proposal thus reveals two linear indices to explain the relationship between T and X. The coefficients of the indices and corresponding standard errors are shown in the third and fourth columns of Table 4. The standard errors are obtained by estimating the asymptotic covariance matrix in Theorem 1. The estimated bandwidth is 4.251. One can see that the 95% confidence intervals of all treatment arms exclude 0 in both central subspace directions, but with opposite signs. To further understand the direction of the treatment effects, we examine the survival probabilities $\hat S_T\{t \mid \hat B^{\mathrm{T}}(\bar x + r e_1)\}$ and $\hat S_T\{t \mid \hat B^{\mathrm{T}}(\bar x + r e_2)\}$, where $\bar x$ is the sample mean of X, e1 = (1, 0, 0, … , 0), e2 = (0, 1, 0, … , 0), and r is a perturbation parameter; we plot the estimates for t = 1 and 2 years in Figure 1. As shown by the solid lines, the survival probabilities increase with an increase in CD4 counts but remain constant or decrease with an increase in CD8 counts, holding other factors constant. This also shows that survival increases with the first linear index but in general decreases with the second. Therefore, the three treatment arms are associated with improved survival compared to the zidovudine only group. Moreover, the relationship between the second linear index and the conditional survival function is nonlinear, which may not be discovered by common regression models.
Figure 1:

The estimated conditional survival probabilities as functions of covariates perturbed along the leading coefficient of the first (solid line) and second (dashed line) linear indices, around the mean covariates.
We also implement the hMAVE method for this dataset; it selects a one-dimensional central subspace. The cross-validation criterion evaluated at the hMAVE linear index gives a value of 0.193, which shows a slightly poorer prediction accuracy than that of our estimated linear indices.
6. Discussion
Sufficient dimension reduction is a flexible alternative to regression models for summarizing the relationship between a response and a covariate vector when there is not enough prior knowledge to assume a particular regression model. It is well studied for completely observed response data but not for survival data. In this work, we consider dimension reduction for survival data and propose a novel method to estimate the central subspace. This method requires no inverse probability weighting and performs better than existing methods in numerical studies. An appealing feature of the sufficient dimension reduction model is that it does not assume any stringent structure for the conditional survival function, and we estimate the structural dimension simultaneously with the basis matrix.
To estimate the central subspace, the existing literature often suggests using separate criteria to estimate the basis matrix and the structural dimension, so extra computation is required to evaluate the different criteria. The main advantage of our proposal is that we can estimate the basis and the dimension through a single criterion, which eases the computational burden in practice. Moreover, the tuning bandwidth is selected at the same time. We have shown that the estimated bandwidth is of order $n^{-1/(2q+d_0)}$, which attains the optimal rate for nonparametric estimation of the conditional survival function (Dabrowska, 1992). Indeed, this bandwidth minimizes the integrated mean squared error asymptotically. The weak convergence of the estimated conditional survival function will be studied in the future.
The investigation of the semiparametric efficiency bound for the central subspace under the survival regression setting still remains an open problem. A profile likelihood approach may reach the semiparametric efficiency bound for a fixed dimension but would be unable to select the structural dimension simultaneously, since the associated bandwidth estimator will be suboptimal for reasons similar to existing literature (Hall, 1987). Thus, it becomes a major challenge to find a simple criterion for simultaneously estimating the structural dimension and the basis matrix efficiently.
Due to its connection with the counting process framework, this paper provides a novel way to extend the idea to different survival data structures, for example, left-truncated responses or recurrent events. Details will be studied in the future.
Acknowledgment
The authors thank an editor, an associate editor and two reviewers for constructive comments. Both authors are partially supported by the United States National Institutes of Health grant R01HL122212, and the second author is partially supported by the United States National Science Foundation grant DMS1711952.
Appendix
A.1. Proof of Proposition 2
Proof.
In Section 3 we have seen that the minimum of the prediction risk in (7) is attained only if $\mathrm{span}(B_0) \subseteq \mathrm{span}(B)$, in which case the prediction risk reduces to the sum of the minimum of (5) and $\mathrm{mise}_B(h)$. By paralleling the proof steps of Du and Akritas (2002), we can derive a uniform strong representation (A.1) of $\hat\Lambda(t, B^{\mathrm{T}}x) - \Lambda(t, B^{\mathrm{T}}x)$, whose leading bias term is of order $h^q$ and whose leading stochastic term is of order $(nh^d)^{-1/2}$, with constants involving $f_B$, the density of $B^{\mathrm{T}}X$. By substituting (A.1) into $\mathrm{mise}_B(h)$ and following the arguments of Härdle and Marron (1985) and Härdle et al. (1988), we can show that $\mathrm{mise}_B(h) = \mathrm{amise}_B(h)\{1 + o(1)\}$, where
$$ \mathrm{amise}_B(h) = b_B h^{2q} + v_B (nh^d)^{-1}, $$
with constants $b_B$ and $v_B$ determined by the kernel $K$, the derivatives of $\Lambda(t, B^{\mathrm{T}}x)$, $f_B$, and the marginal distribution function $F_X(x)$ of $X$. When
$$ h = h_B = \left(\frac{d\,v_B}{2q\,b_B}\right)^{1/(2q+d)} n^{-1/(2q+d)}, $$
$\mathrm{amise}_B(h)$ attains its minimum, which is of order $n^{-2q/(2q+d)}$ and hence increasing in $d$. Thus, the prediction risk in (7) attains its minimum when $B = B_0$ and $h = h_0 = c_0 n^{-1/(2q+d_0)}$ with $c_0 = \{d_0 v_{B_0}/(2q\,b_{B_0})\}^{1/(2q+d_0)}$.
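For completeness, minimizing $\mathrm{amise}_B(h)$ in $h$ is elementary calculus under the bias-variance form above:

```latex
\frac{d}{dh}\,\mathrm{amise}_B(h)
  = 2q\,b_B\,h^{2q-1} - d\,v_B\,n^{-1}h^{-(d+1)} = 0
\quad\Longrightarrow\quad
h^{2q+d} = \frac{d\,v_B}{2q\,b_B\,n}
\quad\Longrightarrow\quad
h_B = \Bigl(\frac{d\,v_B}{2q\,b_B}\Bigr)^{1/(2q+d)} n^{-1/(2q+d)} .
```

Substituting $h_B$ back gives $\mathrm{amise}_B(h_B) \asymp n^{-2q/(2q+d)}$, whose exponent increases toward zero as $d$ grows, so the minimized risk is increasing in $d$.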
A.2. Notations and Assumptions
Let (·)⊗ be the Kronecker power of a vector. Define
for m = 0, 1, 2. The estimator $\hat\Lambda(t, B^{\mathrm{T}}x)$ and its derivatives will be shown to converge uniformly to Λ(t, BTx) and its corresponding derivatives, where
for m = 1, 2. Moreover, to derive the asymptotic normality of our proposed estimator, we also define the corresponding score vectors and information matrices of CV(B, h):
The following regularity conditions are imposed for our theorem:
A1 , , , and are Lipschitz continuous in u with the Lipschitz constants being independent of (t, x, B).
A2 and .
A3 For d ≥ 1, there exist δ ∈ (1/(4q), 1/ max{2d + 2, d + 4}) and positive constants hl,d and hu,d such that both ς and h fall in the interval $[h_{l,d}\,n^{-\delta},\ h_{u,d}\,n^{-\delta}]$.
A4 and if and only if B = B0 when d = d0.
A5 V (Bd,0) is non-singular for d ≥ d0.
Assumptions A1–A2 are smoothness and boundedness conditions on the population functions that ensure the uniform convergence of the kernel estimators used in CV(B, h). Moreover, Assumption A3 is used to remove the remainder term in the approximation of CVY (B, h) and CV(B, h) by their target functions and to establish the $n^{1/2}$-consistency of $\hat B_{\hat d}$. Assumptions A4–A5 ensure the identifiability of B0. One should also note that only $B^{\mathrm{T}}X$ is required to have a continuous density. Thus, discrete covariates are allowed in our proposal as long as there exists at least one continuous covariate. Related conditions can also be found in Assumption A1 of Ma and Zhu (2013).
A.3. Preliminary Lemmas
We first derive the large sample properties of for m = 0, 1, 2. To simplify our presentation, the following notations are further introduced:
where . Moreover, we define the strong representations for , m = 0, 1, as follows:
Since the VC-indices of the three function classes above are 1, d, and 1, respectively, for k = 0, 1, 2, these classes are ensured to be Euclidean by Lemma 2.12 of Pakes and Pollard (1989). Coupled with Lemma 2.14 of Pakes and Pollard (1989) and Theorem II.37 of Pollard (1984), we can show that
almost surely. By assumption A1, one can further derive that
Coupled with the triangle inequality, we obtain the following lemma:
Lemma 1.
Suppose that assumption A1 is satisfied. Then,
almost surely.
By applying the Taylor expansion theorem and the results in Lemma 1, one can further ensure from Assumptions A2–A3 that
Lemma 2.
Suppose that assumptions A1–A3 are satisfied. Then,
A.4. Proof of Theorem 1
Proof.
The proof is very similar to that of Huang and Chiang (2017), so we only outline the steps here. Let ecv(B, h) denote the large-sample limit of CV(B, h). The first step is to show the uniform convergence of CV(B, h) to ecv(B, h). By substituting $\hat\Lambda^{-i}$, $\hat N$, and $\hat R$ into the proof of Theorem 1 in Huang and Chiang (2017), we have
| (A.2) |
| (A.3) |
The second step is to show that underestimated dimensions are asymptotically excluded. Denote dcv(B, h) = CV(B, h) − ecv(B, h). By the definition of the minimizer of CV(B, h) and Boole's inequality, we have the following inequalities:
| (A.4) |
for any ε > 0. Since each term on the right-hand side converges to zero, the left-hand side vanishes for any ε > 0. Now by taking the union over the working dimensions below d0 and using Boole's inequality again, we have $P(\hat d < d_0) \to 0$.
In the third step, we derive the asymptotic properties of $(\hat B_d, \hat h_d)$ for d ≥ d0. Similar to the derivation in the second step, we can also show that
| (A.5) |
We now consider the case when d ≥ d0 and, hence, $\mathrm{span}(B_0) \subseteq \mathrm{span}(B_{d,0})$. By the first-order Taylor expansion of the score vector of CV(B, h) at B = Bd,0 and $h = \hat h_d$, it yields that
| (A.6) |
where $B^{*}$ lies on the line segment between $\mathrm{vec}(\hat B_d)$ and vec(Bd,0). Similar to the approximation in the proof of Theorem 2 in Huang and Chiang (2017), we have
| (A.7) |
and, hence, $\hat B_d$ is $n^{1/2}$-consistent for $B_{d,0}$. Coupled with Assumption A3, this further implies that
| (A.8) |
To show the consistency of $\hat d$ and the asymptotic normality of $\hat B_{\hat d}$, we define the following sets first:
By the definition of the minimizer of CV(B, h) and Boole's inequality, one has
| (A.9) |
From the definition of the first set, we have
| (A.10) |
Moreover, from the definition of the second set, we have
| (A.11) |
Since , for some constant C when as n → ∞. Thus,
| (A.12) |
where the left-hand side converges to zero by (A.2), (A.3), and (A.8) and the right-hand side tends to infinity as n → ∞. Further, we also have
| (A.13) |
since the left-hand side converges to zero by (A.2), (A.3), and (A.8) and the right-hand side tends to infinity as n → ∞. By substituting (A.10)–(A.13) into (A.9), we immediately have
| (A.14) |
Finally, the asymptotic normality in Theorem 1 is ensured by (A.14) and (A.7). □
Contributor Information
Ming-Yueh Huang, Institute of Statistical Science, Academia Sinica, Taiwan.
Kwun Chuen Gary Chan, Department of Biostatistics, University of Washington.
References
- Adragni KP, Cook DR and Wu S (2012) GrassmannOptim: An R package for Grassmann manifold optimization. Journal of Statistical Software, 50, 1–18.
- Bennett S (1983) Analysis of survival data by the proportional odds model. Statistics in Medicine, 2, 273–277.
- Beran R (1981) Nonparametric regression with randomly censored survival data. Tech. rep., Univ. of California, Berkeley.
- Cook DR (1998) Regression Graphics: Ideas for Studying Regressions Through Graphics, vol. 318. John Wiley & Sons.
- Cox DR (1972) Regression models and life-tables. J. R. Stat. Soc. Ser. B. Stat. Methodol., 34, 187–220. With discussion and a reply by the author.
- Cox DR and Oakes D (1984) Analysis of Survival Data, vol. 21. CRC Press.
- Dabrowska DM (1992) Variable bandwidth conditional Kaplan–Meier estimate. Scand. J. Statist., 19, 351–361.
- Dong Y and Li B (2010) Dimension reduction for non-elliptically distributed predictors: second-order methods. Biometrika, 97, 279–294.
- Du Y and Akritas MG (2002) Uniform strong representation of the conditional Kaplan–Meier process. Math. Methods Statist., 11, 152–182.
- Edelman A, Arias TA and Smith ST (1999) The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl., 20, 303–353.
- Fletcher R and Reeves CM (1964) Function minimization by conjugate gradients. Comput. J., 7, 149–154.
- Fukumizu K, Bach FR and Jordan MI (2009) Kernel dimension reduction in regression. Ann. Statist., 37, 1871–1905.
- Fukumizu K and Leng C (2014) Gradient-based kernel dimension reduction for regression. J. Amer. Statist. Assoc., 109, 359–370.
- Hall P (1987) On Kullback–Leibler loss and density estimation. Ann. Statist., 15, 1491–1519.
- Hammer SM, Katzenstein DA, Hughes MD, Gundacker H, Schooley RT, Haubrich RH, Henry WK, Lederman MM, Phair JP, Niu M et al. (1996) A trial comparing nucleoside monotherapy with combination therapy in HIV-infected adults with CD4 cell counts from 200 to 500 per cubic millimeter. New England Journal of Medicine, 335, 1081–1090.
- Hansen BE (2005) Exact mean integrated squared error of higher order kernel estimators. Econometric Theory, 21, 1031–1057.
- Härdle W, Hall P and Marron JS (1988) How far are automatically chosen regression smoothing parameters from their optimum? J. Amer. Statist. Assoc., 83, 86–101. With discussion and a reply by the authors.
- Härdle W and Marron JS (1985) Optimal bandwidth selection in nonparametric regression function estimation. Ann. Statist., 13, 1465–1481.
- Hosmer DW, Lemeshow S and May S (2008) Applied Survival Analysis: Regression Modeling of Time-to-Event Data. Wiley-Interscience.
- Huang M-Y and Chiang C-T (2017) An effective semiparametric estimation approach for the sufficient dimension reduction model. J. Amer. Statist. Assoc., 112, 1296–1310.
- Li B and Wang S (2007) On directional regression for dimension reduction. J. Amer. Statist. Assoc., 102, 997–1008.
- Li K-C (1991) Sliced inverse regression for dimension reduction. J. Amer. Statist. Assoc., 86, 316–342. With discussion and a rejoinder by the author.
- Li K-C, Wang J-L and Chen C-H (1999) Dimension reduction for censored regression data. Ann. Statist., 27, 1–23.
- Lu W and Li L (2011) Sufficient dimension reduction for censored regression. Biometrics, 67, 513–523.
- Ma Y and Zhang X (2015) A validated information criterion to determine the structural dimension in dimension reduction models. Biometrika, 102, 409–420.
- Ma Y and Zhu L (2012) A semiparametric approach to dimension reduction. J. Amer. Statist. Assoc., 107, 168–179.
- Ma Y and Zhu L (2013) Efficient estimation in sufficient dimension reduction. Ann. Statist., 41, 250–268.
- Nadkarni NV, Zhao Y and Kosorok MR (2011) Inverse regression estimation for censored data. J. Amer. Statist. Assoc., 106, 178–190.
- Pakes A and Pollard D (1989) Simulation and the asymptotics of optimization estimators. Econometrica, 57, 1027–1057.
- Pollard D (1984) Convergence of Stochastic Processes. Springer-Verlag, New York.
- Wang H and Xia Y (2008) Sliced regression for dimension reduction. J. Amer. Statist. Assoc., 103, 811–821.
- Xia Y (2007) A constructive approach to the estimation of dimension reduction directions. Ann. Statist., 35, 2654–2690.
- Xia Y, Zhang D and Xu J (2010) Dimension reduction and semiparametric estimation of survival models. J. Amer. Statist. Assoc., 105, 278–290.
- Yin X and Li B (2011) Sufficient dimension reduction based on an ensemble of minimum average variance estimators. Ann. Statist., 39, 3392–3416.
- Zhu L, Miao B and Peng H (2006) On sliced inverse regression with high-dimensional covariates. J. Amer. Statist. Assoc., 101, 630–643.
- Zhu L-P, Zhu L-X and Feng Z-H (2010) Dimension reduction in regressions through cumulative slicing estimation. J. Amer. Statist. Assoc., 105, 1455–1466.
- Zhu Y and Zeng P (2006) Fourier methods for estimating the central subspace and the central mean subspace in regression. J. Amer. Statist. Assoc., 101, 1638–1651.
