Published in final edited form as: Biometrika. 2018 Oct 17;105(4):891–903. doi: 10.1093/biomet/asy050

The Change-Plane Cox Model

Susan Wei 1, Michael R Kosorok 2

Summary

We propose a projection pursuit technique in survival analysis for finding lower-dimensional projections that exhibit differentiated survival outcome. This idea is formally introduced as the change-plane Cox model, a non-regular Cox model with a change-plane in the covariate space dividing the population into two subgroups whose hazards are proportional. The proposed technique offers a potential framework for principled subgroup discovery. Estimation of the change-plane is accomplished via likelihood maximization over a data-driven sieve constructed using sliced inverse regression. Consistency of the sieve procedure for the change-plane parameters is established. In simulations the sieve estimator demonstrates better classification performance for subgroup identification than alternatives.

Keywords: Latent Supervised Learning, Projection Pursuit, Random Projection, Sieve Estimation, Sliced Inverse Regression, Subgroup Discovery

1. Introduction

Projection pursuit, the analysis of high-dimensional data via its lower-dimensional projections, is a common tool in exploratory data analysis. The idea is to search for projections that reveal interesting structure in the data. In this work, we present a projection pursuit technique in survival analysis where a projection is considered interesting if it leads to a separation of survival outcomes. The proposed technique is based on the change-plane Cox model, set forth below.

Let (X, Z, U) be a random vector of covariates, where $X \in \mathbb{R}^p$, $Z \in \mathbb{R}^{q_1}$, and $U \in \mathbb{R}^{q_2}$. Let $\mathbb{S}^p$ be the collection of unit vectors in $\mathbb{R}^p$. The following assumptions constitute what shall be called the change-plane Cox model:

Assumption 1. The hazard function of the true survival time T has the form

\[ \lambda(t \mid X, Z, U) = \exp\{\beta_1^T Z + \beta_2\, 1(\omega^T X \ge \gamma) + \beta_3^T Z\, 1(\omega^T X \ge \gamma) + \beta_4^T U\}\, \lambda(t), \qquad (1) \]

where $\omega$ is an element of $\mathbb{S}^p$, $\gamma$ is in some known interval $[a, b]$, $\beta = (\beta_1, \ldots, \beta_4)$ is the vector of regression parameters, with at least one of $\beta_2$ or $\beta_3$ nonzero for model identifiability, and $\lambda(t)$ is an unknown baseline hazard function;

Assumption 2. The survival time T with hazard function (1) may be subject to right-censoring at a censoring time C which, conditional on (X, Z, U), is independent of T;

Assumption 3. X and (Z, U) are independent.

We observe the covariate vector (X, Z, U), the censored time $\tilde T = \min(T, C)$, and the censoring indicator $\delta$, where $\delta = 1$ if $T \le C$ and $\delta = 0$ otherwise. By seeking the change-plane, given by $\omega^T X = \gamma$, we accomplish our goal of finding a lower-dimensional projection of X that reveals two subgroups with differentiated survival.

To fix ideas, imagine X to be a set of biomarkers potentially predictive of survival, Z a categorical treatment variable, and U a set of baseline covariates such as age or gender. In this case, the regression coefficient $\beta_3$ represents the interaction effect between treatment and the subgroup indicator $1(\omega^T X \ge \gamma)$. A significant $\beta_3$ is of practical interest since it would suggest the presence of treatment heterogeneity.

Rigorous assessment of $\beta$'s significance is likely to be challenging considering results in Pons (2003). There, it is shown for a certain change-point Cox model, which may be viewed as a special case of (1), that the maximum partial likelihood estimator is $n$-consistent for the change-point but only root-$n$ consistent for the regression coefficients. Such non-regularity can be expected in the change-plane Cox model as well. Leaving distributional theory to future work, we propose in the meantime a resampling procedure in the Supplementary Material that serves as a heuristic proxy for assessing the significance of $\beta$.

2. Methodology

2·1. Overview

Our aim in this section is to propose an estimation scheme for the change-plane parameters in (1) based on a sample of n independent and identically distributed replicates of (R, T, δ) where R = (X, Z, U) denotes the full covariate set. The maximum partial likelihood estimator of the change-plane parameters can incur overfitting even when the dimension of X is moderately high, e.g., p = 25. This consideration leads us to employ a regularization technique known as Grenander’s method of sieves (Grenander, 1981), in which maximization takes place over an approximating subset of the parameter space called a sieve. It is desired that the sieve be dense, in a sense that will be later made rigorous in Definition 2. Interestingly, as demonstrated by Geman & Hwang (1982) in the context of nonparametric density estimation, regularization of the likelihood via the method of sieves may produce consistent estimators even when the full maximum likelihood estimator is not.

A sieve maximization scheme for fitting (1) is as follows. Collect the parameters into θ = (β, ω, γ). The sample log partial likelihood under (1) is

\[ L_n(\theta) = n^{-1} \sum_{i=1}^n \biggl( \delta_i\, \eta(R_i, \theta) - \delta_i \log\Bigl[ \sum_{j:\, \tilde T_j \ge \tilde T_i} n^{-1} \exp\{\eta(R_j, \theta)\} \Bigr] \biggr), \qquad (2) \]

where $\eta(R, \theta) = \beta_1^T Z + \beta_2\, 1(\omega^T X \ge \gamma) + \beta_3^T Z\, 1(\omega^T X \ge \gamma) + \beta_4^T U$. The factor $n^{-1}$ is added for consistency with the empirical process notation in Section 3. Now, let

\[ M_n(\omega, \gamma) = L_n\{\hat\beta_n(\omega, \gamma), \omega, \gamma\}, \qquad (3) \]

where the quantity $\hat\beta_n(\omega, \gamma) = \arg\max_\beta L_n(\beta, \omega, \gamma)$ is uniquely defined and can be found via Newton's method. We shall focus on the estimation of $\omega$ since, once it is determined, the other parameters in (1) can be estimated by profiling.

Definition 1. For a sieve $\Omega_n \subseteq \mathbb{S}^p$, the corresponding sieve estimator for $\omega$ in (1) is

\[ \hat\omega(\Omega_n) = \arg\max_{\omega \in \Omega_n} M_n\{\omega, \tilde\gamma(\omega)\}, \]

where

\[ \tilde\gamma(\omega) = \arg\max_{\gamma \in [a, b]} M_n(\omega, \gamma). \qquad (4) \]

The success of the sieve estimator hinges on the specification of the sieve. The remainder of Section 2 describes the construction of a data-driven sieve.
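To make the estimation scheme concrete, the following minimal Python sketch implements Definition 1 for the reduced model of Section 4, in which Z = 1 and U = 0 so that $\eta$ contains only the subgroup indicator. The sketch is ours and not part of the original implementation: it relies on the lifelines package for the Cox fits, and the helper names profile_loglik and sieve_estimator are hypothetical.

```python
# Sketch of the sieve estimator of Definition 1 for the reduced model
# (Z = 1, U = 0): profile the partial likelihood over beta with an ordinary
# Cox fit, maximize over gamma on a grid and over omega in the sieve.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

def profile_loglik(omega, gamma, X, time, event):
    """M_n(omega, gamma): log partial likelihood maximized over beta."""
    df = pd.DataFrame({
        "subgroup": (X @ omega >= gamma).astype(float),  # 1(omega^T X >= gamma)
        "time": time,
        "event": event,
    })
    if df["subgroup"].nunique() < 2:       # degenerate split: no information
        return -np.inf
    cph = CoxPHFitter().fit(df, duration_col="time", event_col="event")
    return cph.log_likelihood_

def sieve_estimator(sieve, X, time, event, gamma_grid):
    """argmax over omega in the sieve of M_n{omega, gamma_tilde(omega)}."""
    best, best_val = None, -np.inf
    for omega in sieve:                    # each omega is a unit vector in R^p
        for gamma in gamma_grid:           # gamma ranges over [a, b]
            val = profile_loglik(omega, gamma, X, time, event)
            if val > best_val:
                best, best_val = (omega, gamma), val
    return best                            # (omega_hat, gamma_tilde(omega_hat))
```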

2·2. Initialization of the sieve

Algorithm 1 details the construction of an initial sieve consisting of vectors that represent possible change-planes in the X covariate space. Consideration of computation time leads to the particular choices in Algorithm 1, such as the number of clusters K, chosen deliberately so that $|\Omega_0|$ is linear in n. Similarly, the discarding of clusters with fewer than four elements and the downsampling of clusters with more than ten elements are adopted merely for computational gain. To get a sense of the size of $\Omega_0$, consider that Algorithm 1 applied to the simulations in Section 4 results in $|\Omega_0| \approx 3000$ for sample size n = 100. If computation time is not a concern, a larger $\Omega_0$ in Algorithm 1 has been observed to improve the empirical performance of the overall sieve procedure (Algorithm 2 in the next section).

Algorithm 1. Initial sieve $\Omega_0$.
Input: $\{X_1, \ldots, X_n\}$.
Step 1. Initialize $\Omega_0$ to the empty set; set K to n/10.
Step 2. Partition the data $\{X_1, \ldots, X_n\}$ into K clusters using K-means clustering.
Step 3. Discard clusters with fewer than four elements; retain ten elements at random for clusters with more than ten elements.
Step 4. For each remaining cluster, and for each non-overlapping partition of the cluster into two parts $P_1$ and $P_2$, add to $\Omega_0$ the unit-length vector that connects the centroids of $P_1$ and $P_2$.
Output: $\Omega_0$.
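A minimal Python sketch of Algorithm 1 is given below, assuming scikit-learn for the K-means step; the function name initial_sieve is ours, and directions arising from complementary partitions (which differ only in sign) are retained for simplicity.

```python
# Sketch of Algorithm 1: cluster the X_i, then add to Omega_0 the unit vector
# joining the centroids of every two-part split of each retained cluster.
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans

def initial_sieve(X, rng=None):
    rng = rng or np.random.default_rng(0)
    n = X.shape[0]
    K = max(n // 10, 1)                               # K = n/10 clusters
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(X)
    sieve = []
    for k in range(K):
        cluster = X[labels == k]
        if len(cluster) < 4:                          # discard small clusters
            continue
        if len(cluster) > 10:                         # downsample large clusters
            cluster = cluster[rng.choice(len(cluster), 10, replace=False)]
        idx = list(range(len(cluster)))
        for r in range(1, len(cluster)):              # splits into parts P1, P2
            for P1 in combinations(idx, r):
                P2 = [i for i in idx if i not in P1]
                v = cluster[list(P1)].mean(axis=0) - cluster[P2].mean(axis=0)
                norm = np.linalg.norm(v)
                if norm > 0:
                    sieve.append(v / norm)            # unit-length direction
    return sieve
```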

2·3. Updating the sieve using sliced inverse regression

We next update $\Omega_0$ by incorporating survival information using sliced inverse regression (Li, 1991). We first briefly review the technique. Sliced inverse regression is based on a model in which a response variable S and a covariate vector X in $\mathbb{R}^p$ satisfy

\[ S = f(\kappa_1^T X, \ldots, \kappa_k^T X, \epsilon) \qquad (5) \]

for unknown constant vectors κj’s of the same dimension as X, unknown function f, and noise term ϵ that is independent of X. Below is the linearity condition, satisfied by X with elliptically symmetric distributions, used to justify sliced inverse regression.

Condition 1. For any $b \in \mathbb{R}^p$, $E(b^T X \mid \omega_0^T X)$ is linear in $\omega_0^T X$.

If X satisfies Condition 1, then for every s the centered inverse regression curve, $E(X \mid S = s) - E(X)$, lies in the span of $\{\Sigma\kappa_1, \ldots, \Sigma\kappa_k\}$, where $\Sigma = \mathrm{cov}(X)$. Thus, the space spanned by the k eigenvectors of the covariance matrix of $E(X \mid S)$ associated with the k largest eigenvalues coincides with the span of $\{\Sigma\kappa_1, \ldots, \Sigma\kappa_k\}$. Then clearly, the span of $\{\kappa_1, \ldots, \kappa_k\}$ itself can be obtained through standardization by $\Sigma^{-1}$. The inverse regression curve is estimated empirically by slicing the range of S into H nonoverlapping intervals $I_h$, $h = 1, \ldots, H$, and computing the sample version of $E(X \mid S \in I_h)$.
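To fix notation, a minimal Python sketch of basic sliced inverse regression for a fully observed response S is given below; the function name sir_directions is ours.

```python
# Sketch of basic sliced inverse regression (Li, 1991): slice the range of S,
# form the weighted covariance of the slice means of X, take its leading
# eigenvectors and standardize by the inverse covariance of X.
import numpy as np

def sir_directions(X, S, H=10, k=1):
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # centered covariates
    Sigma = np.cov(X, rowvar=False)
    edges = np.quantile(S, np.linspace(0, 1, H + 1))
    M = np.zeros((p, p))
    for h in range(H):
        if h < H - 1:
            in_slice = (S >= edges[h]) & (S < edges[h + 1])
        else:
            in_slice = (S >= edges[h])
        if not in_slice.any():
            continue
        p_h = in_slice.mean()                    # slice probability
        m_h = Xc[in_slice].mean(axis=0)          # E(X | S in I_h) - E(X)
        M += p_h * np.outer(m_h, m_h)            # covariance of the inverse regression curve
    vals, vecs = np.linalg.eigh(M)
    top = vecs[:, np.argsort(vals)[::-1][:k]]    # k largest-eigenvalue eigenvectors
    return np.linalg.solve(Sigma, top)           # standardize by Sigma^{-1}
```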

The subscript zero will be used to denote the true parameter value under (1). Since T with hazard function (1) satisfies (5) with k = 1, the recovery of $\omega_0$ in the change-plane Cox model can be accomplished via an eigendecomposition of the covariance matrix of $E(X \mid T)$, followed by standardization using $\Sigma^{-1}$. To avoid issues in estimating $\Sigma$ and $\Sigma^{-1}$ using their sample versions, we assume throughout the paper that n > p. However, rather than slicing on T alone, we slice simultaneously on T and $1\{\omega^T X \ge \tilde\gamma(\omega)\}$, where $\omega \in \Omega_0$. Specifically, let $0 = t_1 < \cdots < t_H < t_{H+1} = \infty$ be a partition of the positive real line into non-overlapping intervals $I_h = [t_h, t_{h+1})$, $h = 1, \ldots, H$. Let $\nu(\omega)$ denote the largest-eigenvalue eigenvector of the weighted covariance matrix

\[ V(\omega) = \sum_{l=0}^{1} \sum_{h=1}^{H} p_{hl}(\omega)\, \{m_{hl}(\omega) - E(X)\}\{m_{hl}(\omega) - E(X)\}^T, \qquad (6) \]

where

\[ m_{hl}(\omega) = E[X \mid T \in I_h,\ 1\{\omega^T X \ge \tilde\gamma(\omega)\} = l], \qquad p_{hl}(\omega) = \mathrm{pr}[T \in I_h,\ 1\{\omega^T X \ge \tilde\gamma(\omega)\} = l]. \]

Assuming Condition 1 holds, the rescaled eigenvector $\Sigma^{-1}\nu(\omega)$ is proportional to the desired $\omega_0$.

We now describe an estimate of V (ω) that accounts for censoring by employing the conditioning argument in Li et al. (1999). First, we have

\[ m_{h1}(\omega) = \frac{E[X\,1\{T \ge t_h,\ \omega^T X \ge \tilde\gamma(\omega)\}] - E[X\,1\{T \ge t_{h+1},\ \omega^T X \ge \tilde\gamma(\omega)\}]}{E[1\{T \ge t_h,\ \omega^T X \ge \tilde\gamma(\omega)\}] - E[1\{T \ge t_{h+1},\ \omega^T X \ge \tilde\gamma(\omega)\}]}, \]

and each expectation of the form $E[X\,1\{T \ge t,\ \omega^T X \ge \tilde\gamma(\omega)\}]$ can be further decomposed as

\[ E[X\,1\{T \ge t,\ \omega^T X \ge \tilde\gamma(\omega)\}] = E[X\,1\{\tilde T \ge t,\ \omega^T X \ge \tilde\gamma(\omega)\}] + E[X\,1\{\tilde T < t,\ \delta = 0,\ \omega^T X \ge \tilde\gamma(\omega)\}\, \alpha(\tilde T, t, X)], \]

where

\[ \alpha(t, t', X) = \frac{\mathrm{pr}(T \ge t' \mid X)}{\mathrm{pr}(T \ge t \mid X)}, \qquad t < t', \qquad (7) \]

can be interpreted as a weight adjusting for the presence of censoring. This decomposition allows us to rewrite the numerator of $m_{h1}(\omega)$ as

\[
\begin{aligned}
E[X\,1\{T \ge t_h,\ \omega^T X \ge \tilde\gamma(\omega)\}] - E[X\,1\{T \ge t_{h+1},\ \omega^T X \ge \tilde\gamma(\omega)\}]
&= E[X\,1\{t_h \le \tilde T < t_{h+1},\ \omega^T X \ge \tilde\gamma(\omega)\}] \\
&\quad + E[X\,1\{\tilde T < t_h,\ \delta = 0,\ \omega^T X \ge \tilde\gamma(\omega)\}\,\alpha(\tilde T, t_h, X)] \\
&\quad - E[X\,1\{\tilde T < t_{h+1},\ \delta = 0,\ \omega^T X \ge \tilde\gamma(\omega)\}\,\alpha(\tilde T, t_{h+1}, X)].
\end{aligned}
\]
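The decomposition follows from the conditioning argument: suppressing the indicator $1\{\omega^T X \ge \tilde\gamma(\omega)\}$, we have $1\{T \ge t\} = 1\{\tilde T \ge t\} + 1\{T \ge t,\ \tilde T < t\}$, and the second event forces $\delta = 0$, so that

\[ E[X\,1\{T \ge t\}] = E[X\,1\{\tilde T \ge t\}] + E\bigl[X\,1\{\tilde T < t,\ \delta = 0\}\, \mathrm{pr}(T \ge t \mid T > \tilde T, X)\bigr], \]

where, by the conditional independence of C and T given the covariates, $\mathrm{pr}(T \ge t \mid T > \tilde T, X) = \mathrm{pr}(T \ge t \mid X)/\mathrm{pr}(T \ge \tilde T \mid X) = \alpha(\tilde T, t, X)$ whenever $\tilde T < t$.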

Thus we can slice on the observed survival time $\tilde T$ rather than T. Let

\[ \hat c_{i,h1}(\omega) = 1\{t_h \le \tilde T_i < t_{h+1},\ \omega^T X_i \ge \tilde\gamma(\omega)\} + 1\{\tilde T_i < t_h,\ \delta_i = 0,\ \omega^T X_i \ge \tilde\gamma(\omega)\}\, \hat\alpha(\tilde T_i, t_h, X_i) - 1\{\tilde T_i < t_{h+1},\ \delta_i = 0,\ \omega^T X_i \ge \tilde\gamma(\omega)\}\, \hat\alpha(\tilde T_i, t_{h+1}, X_i), \]

where $\hat\alpha(\cdot, \cdot, \cdot)$ denotes a nonparametric estimate of (7) to be discussed in Section 2·4. To estimate $m_{h1}$ and $p_{h1}$, we use the sample moments

\[ \hat m_{h1}(\omega) = \sum_{i=1}^n X_i\, \hat c_{i,h1}(\omega) \Big/ \sum_{i=1}^n \hat c_{i,h1}(\omega) \]

and $\hat p_{h1}(\omega) = n^{-1} \sum_{i=1}^n \hat c_{i,h1}(\omega)$, respectively. The estimation of $m_{h0}$ and $p_{h0}$ is analogous. These components are incorporated into the data-driven sieve detailed in Algorithm 2. Let the resulting sieve be denoted $\hat\Omega_n$. The sieve estimator associated with it will be written $\hat\omega(\hat\Omega_n)$, following the notation introduced in Definition 1.

Algorithm 2. Data-driven sieve $\hat\Omega_n$ based on sliced inverse regression.
Input: $(X_i, \tilde T_i, \delta_i)$, $i = 1, \ldots, n$; H, the number of slices; $\Omega_0$, the initial sieve; $\hat\alpha(\cdot, \cdot, \cdot)$, the censoring weight estimate.
Step 1. Initialize $\hat\Omega_n \subseteq \mathbb{S}^p$ to the empty set.
Step 2. Find $\hat\Sigma$, the empirical covariance matrix based on $X_1, \ldots, X_n$.
Step 3. Set $\{t_h\}$ by dividing the observed range of the $\tilde T_i$ into H equal intervals, with $t_1 = 0$ and $t_{H+1} = \infty$.
Step 4. Find $\hat\alpha(\tilde T_i, t_{h+1}, X_i)$, $i = 1, \ldots, n$; $h = 1, \ldots, H$.
Step 5. For each $\omega \in \Omega_0$: find $\hat V_n(\omega) = \sum_{l=0}^{1}\sum_{h=1}^{H} \hat p_{hl}(\omega)\{\hat m_{hl}(\omega) - \bar X\}\{\hat m_{hl}(\omega) - \bar X\}^T$; find the largest-eigenvalue eigenvector of $\hat V_n(\omega)$, denoted $\hat\nu_n(\omega)$; add $\hat\Sigma^{-1}\hat\nu_n(\omega)$, normalized to unit length, to $\hat\Omega_n$.
Output: $\hat\Omega_n$.
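For illustration, the inner loop of Algorithm 2 for a single candidate $\omega$ can be sketched as follows; the sketch is ours, gamma stands for $\tilde\gamma(\omega)$ from (4), edges contains $t_1 = 0, \ldots, t_{H+1} = \infty$, and alpha_hat is any estimate of (7), such as the stand-in sketched in Section 2·4.

```python
# Sketch of one pass of Algorithm 2: censoring-adjusted slice weights c_{i,hl},
# the weighted covariance V_n(omega) of (6), and the rescaled leading eigenvector.
import numpy as np

def update_direction(omega, gamma, X, time, event, edges, alpha_hat):
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)                   # empirical covariance of X
    Xbar = X.mean(axis=0)
    side = (X @ omega >= gamma).astype(int)           # 1{omega^T X >= gamma_tilde(omega)}
    cens = (event == 0)
    V = np.zeros((p, p))
    for l in (0, 1):
        for h in range(len(edges) - 1):
            t_lo, t_hi = edges[h], edges[h + 1]
            # c_{i,hl}: interval indicator plus censoring corrections
            c = ((time >= t_lo) & (time < t_hi)).astype(float)
            a_lo = np.array([alpha_hat(ti, t_lo, xi) for ti, xi in zip(time, X)])
            a_hi = np.array([alpha_hat(ti, t_hi, xi) for ti, xi in zip(time, X)])
            c += np.where(cens & (time < t_lo), a_lo, 0.0)
            c -= np.where(cens & (time < t_hi), a_hi, 0.0)
            c = c * (side == l)
            if c.sum() <= 0:
                continue
            m_hl = (X * c[:, None]).sum(axis=0) / c.sum()   # hat m_{hl}(omega)
            p_hl = c.sum() / n                              # hat p_{hl}(omega)
            V += p_hl * np.outer(m_hl - Xbar, m_hl - Xbar)
    vals, vecs = np.linalg.eigh(V)
    nu = vecs[:, np.argmax(vals)]                     # largest-eigenvalue eigenvector
    new_omega = np.linalg.solve(Sigma, nu)            # standardize by Sigma^{-1}
    return new_omega / np.linalg.norm(new_omega)
```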

Algorithm 2 is rather insensitive to H and we recommend setting it to 10. Far more critical to Algorithm 2 is the estimation of the censoring weight, the focus of the next section.

2·4. Estimation of censoring weights

The estimation of the censoring weight α in (7) reduces to that of pr(TtX), the conditional survival function of T. We shall consider two nonparametric estimates of the latter, and hence of (7) itself. The first is the classic nonparametric kernel estimator in Beran (1981), which is described in equations (3.11) to (3.13) of Li et al. (1999) in notation similar to our setting. The corresponding censoring weight estimate shall also be referred to as Beran’s kernel estimate.

Despite its simplicity, the performance of Beran’s kernel estimate quickly deteriorates as the dimension of X increases. This limitation may be overcome by modern machine learning techniques. We shall employ the recursively imputed survival tree method proposed by Zhu & Kosorok (2012), a powerful, albeit complex, method for estimating the conditional survival function for censored data.

The recursively imputed survival tree combines imputation of censored observations with the idea of extremely randomized trees. Like the random forest, the extremely randomized tree selects a subset of candidate features at random. Unlike the random forest, however, it does not search for the most discriminative cutpoints; instead, a cutpoint is drawn at random for each candidate covariate. The imputation of censored observations enables more terminal nodes, and thus more complex trees, to be constructed. Full details of the recursively imputed survival tree algorithm are given in the Supplementary Material. We have found that the recursively imputed survival tree estimate of $\alpha$ leads to better performance of Algorithm 2 than Beran's kernel estimate as soon as the dimension of X increases beyond a few dimensions, e.g., p > 5.
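As an illustration of how the censoring weight might be computed in practice, the sketch below uses a random survival forest from the scikit-survival package as a stand-in for the recursively imputed survival tree; the substitution and the function name fit_censoring_weight are ours.

```python
# Sketch of the censoring weight (7): estimate the conditional survival function
# of T with a random survival forest (a stand-in for the recursively imputed
# survival tree) and form alpha(t, t', X) = pr(T >= t' | X) / pr(T >= t | X).
import numpy as np
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

def fit_censoring_weight(X, time, event):
    y = Surv.from_arrays(event=event.astype(bool), time=time)
    rsf = RandomSurvivalForest(n_estimators=200, min_samples_leaf=10,
                               random_state=0).fit(X, y)

    def alpha_hat(t, t_prime, x):
        if np.isinf(t_prime):                 # pr(T >= infinity | X) = 0
            return 0.0
        surv = rsf.predict_survival_function(np.asarray(x).reshape(1, -1))[0]
        lo, hi = surv.x[0], surv.x[-1]        # step function is defined on [lo, hi]
        denom = float(surv(np.clip(t, lo, hi)))
        numer = float(surv(np.clip(t_prime, lo, hi)))
        return numer / denom if denom > 0 else 0.0

    return alpha_hat
```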

3. Consistency

Theorem 1 establishes the consistency of the sieve estimator corresponding to a general sieve Ωn under the following conditions:

Condition 2. The parameter $\theta_0 = (\beta_0, \omega_0, \gamma_0)$ lies in a compact subset $\Theta = \Theta_1 \times \Theta_2$ of $\mathbb{R}^{2q_1+q_2+1} \times \mathbb{S}^p \times [a, b]$, where $\Theta_1$ and $\Theta_2$ are compact subsets of $\mathbb{R}^{2q_1+q_2+1}$ and $\mathbb{S}^p \times [a, b]$, respectively.

Condition 3. The covariate X has a continuous distribution and the projection $\omega_0^T X$ has a strictly bounded and positive density f over [a, b].

Condition 4. pr(C = 0) = 0 and, for some $0 < \tau < \infty$, $\mathrm{pr}(C \ge \tau \mid X) = \mathrm{pr}(C = \tau \mid X) > 0$ almost surely.

Condition 5. The variables Z and U lie in bounded sets.

Conditions 2 and 3 are rather technical and simplify the proof. Condition 4 is common in survival analysis, though it is not precisely true in practice, e.g. in a clinical trial with staggered entry. Condition 5 is needed for an application of the dominated convergence theorem. The statement of Theorem 1 requires a definition first.

Definition 2. A sieve $\Omega_n \subseteq \mathbb{S}^p$ is called dense for (1) if there exists a sequence $\omega_n \in \Omega_n$ such that $\{\omega_n, \tilde\gamma(\omega_n)\}$ converges to $(\omega_0, \gamma_0)$ as $n \to \infty$.

Theorem 1 (Consistency of the general sieve estimator). Suppose Conditions 2–5 hold and $\Omega_n \subseteq \mathbb{S}^p$ is a dense sieve for (1). If $\hat\omega_n = \hat\omega(\Omega_n)$ denotes the sieve estimator, then $\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\}$ is consistent for $(\omega_0, \gamma_0)$ as $n \to \infty$.

The proof of Theorem 1 can be found in the Appendix. Next, Corollary 1 establishes the consistency of the sieve estimator corresponding to Algorithm 2 under Condition 1 and the following meta-condition.

Condition 6. The censoring weight estimate $\hat\alpha$ is such that, for every $\omega \in \Omega_0$, $\hat m_{hl}(\omega)$ is consistent for $m_{hl}(\omega)$ as $n \to \infty$ for $h = 1, \ldots, H$ and $l = 0, 1$.

Though we will limit our discussion of Condition 6 to the two estimators considered in Section 2·4, its specification is left deliberately broad so as to allow for other possible censoring weight estimators.

For Beran’s kernel estimate, the arguments in the proof of Lemma 3.1 in Li et al. (1999) can be used to verify Condition 6. The application of Lemma 3.1 requires regularity conditions labeled therein as (B.1), (B.3), (B.5) and (B.8), which mostly pertain to the relationship between the bandwidth rate and the bias and variance terms of the kernel estimate.

As for the recursively imputed survival tree estimate of $\alpha$, Theorem 1 of Cui et al. (2017) addresses the consistency of estimating the underlying hazard function using a similar survival tree-based method. In both cases, a single tree is partitioned enough so that the failure and censoring observations in the terminal nodes are approximately independent while a sufficient number of observations is maintained. In Theorem 1 of Cui et al. (2017), this is used to establish consistency of the resulting local Nelson–Aalen estimators of the conditional hazard. For the recursively imputed survival tree, the Kaplan–Meier estimator (approximately, via the Monte Carlo EM algorithm) is used instead of the Nelson–Aalen estimator.

For both Lemma 3.1 in Li et al. (1999) and Theorem 1 in Cui et al. (2017), suitable smoothness on the conditional survival function is most convenient in ascertaining the key conditions. Under Condition 3, the region where the smoothness is not met by the change-plane Cox model, i.e. the change-plane, can be bounded by a region with arbitrarily small probability.

Corollary 1 (Consistency of the sieve estimator corresponding to Algorithm 2). Let $\hat\Omega_n$ denote the sieve produced by Algorithm 2 for some nonempty initial sieve $\Omega_0$. Suppose Conditions 1–6 hold. If $\hat\omega_n = \hat\omega(\hat\Omega_n)$ denotes the sieve estimator, then $\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\}$ is consistent for $(\omega_0, \gamma_0)$ as $n \to \infty$.

Proof. Let ω ∈ Ω0. Through conditioning, we have the identity

\[ m_{h1}(\omega) = E\{X \mid T \in [t_h, t_{h+1}),\ \omega^T X \ge \tilde\gamma(\omega)\} = E\{E(X \mid T) \mid T \in [t_h, t_{h+1}),\ \omega^T X \ge \tilde\gamma(\omega)\}. \]

A similar identity holds for $m_{h0}$. By Condition 1, $\nu(\omega)$, the largest-eigenvalue eigenvector of (6), is a scalar multiple of $\Sigma\omega_0$. By Condition 6, the individual components of $\hat V_n(\omega)$ are consistent for their theoretical counterparts. Thus $\hat V_n(\omega)$ is consistent for $V(\omega)$ and hence the eigenvector $\hat\nu_n(\omega)$ is consistent for $\nu(\omega)$ as $n \to \infty$. Thus, the sieve $\hat\Omega_n$ is dense and Theorem 1 yields the desired result. □

4. Simulation study

In this section, we use simulation to compare the sieve estimator to two alternatives. To focus on subgroup identification in the change-plane Cox model, we set Z = 1 and U = 0 in (1). This yields the reduced change-plane Cox model, with hazard function

\[ \lambda(t \mid X) = \exp\{\beta\, 1(\omega^T X \ge \gamma)\}\, \lambda(t). \]

Subgroup identification in this model can be viewed as a type of latent supervised learning (Wei & Kosorok, 2013) where the right-censored survival time plays the role of a surrogate training label.

The first alternative we consider is the double-slicing procedure proposed in Li et al. (1999), which simultaneously slices on the censored survival time and the censoring indicator. A critical assumption is that the censoring time also satisfies a sliced inverse regression representation, i.e.

\[ C = g(\kappa_1^T X, \ldots, \kappa_c^T X, \epsilon') \qquad (8) \]

where g and $\epsilon'$ are unspecified, and $\epsilon'$ is independent of X. As Li's double-slicing method does not automatically produce an estimate of $\gamma$, we obtain one by applying $\tilde\gamma$ in (4) to the estimated $\omega$. A complete description of Li's double-slicing method can be found in the Supplementary Material.

The second alternative we consider is the standard survival tree implemented using the rpart package in R (Therneau & Atkinson, 2018). We use the rpart tree to produce a direct estimate of subgroup membership since one cannot be obtained for the change-plane itself. This is done by thresholding the hazard rate at unity to divide the terminal nodes of the rpart tree into two subgroups. The rpart survival tree should not be confused with the recursively imputed survival tree; the latter is used in this paper solely for the estimation of $\alpha$. It must also be said that rpart was implemented using default rather than carefully tuned parameters.

The sieve estimator corresponding to Algorithm 2 is implemented as follows. The initial sieve Ω0 is produced according to Algorithm 1 with K = n/10. The recursively imputed survival tree is used to estimate the conditional survival function of T and, in turn, the censoring weight α.

The simulation setup is as follows. We draw n = 100 independent and identically distributed observations (X, T, δ) from the reduced change-plane Cox model with parameters

\[ \beta = \log 10, \quad \lambda(t) = 1, \quad X \sim N(0, I_p), \quad \omega = \bigl(\underbrace{p^{-1/2}, \ldots, p^{-1/2}}_{\lceil p/2 \rceil}, \underbrace{-p^{-1/2}, \ldots, -p^{-1/2}}_{p - \lceil p/2 \rceil}\bigr), \quad \gamma = 1/4, \]

and one of three censoring mechanisms in Table 1. As this setup results in exponential survival times on either side of the change-plane with all components of ω nonzero, we call it the abundant exponential simulation.

Table 1:

Censoring mechanisms

Name: Distribution
independent: $C \sim \mathrm{uniform}(0, 10)$
linear: $C \sim \min\{\mathrm{uniform}(0, 31.97), 20\}\, 1(\omega^T X \ge \gamma) + \min\{\mathrm{uniform}(0, 3.2), 2\}\, 1(\omega^T X < \gamma)$
nonlinear: $C \sim \mathrm{exponential}\{10^{-1} \exp(X_1 + X_2^2 + \log|X_3|)\}$

We write uniform(a, b) to denote the uniform distribution with parameters a and b and exponential(μ) to denote the exponential distribution with mean μ. The independent setting is so-called because censoring is independent of X. In the linear setting, censoring is dependent on X only through the change-plane while in the nonlinear setting censoring depends nonlinearly on X.
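For reference, the abundant exponential simulation under the independent censoring mechanism can be generated as in the sketch below; the function name simulate_abundant_exponential is ours.

```python
# Sketch of the abundant exponential simulation with independent censoring:
# exponential survival on either side of the change-plane omega^T X = gamma.
import numpy as np

def simulate_abundant_exponential(n=100, p=10, seed=0):
    rng = np.random.default_rng(seed)
    half = int(np.ceil(p / 2))
    omega = np.concatenate([np.ones(half), -np.ones(p - half)]) / np.sqrt(p)
    gamma, beta = 0.25, np.log(10)
    X = rng.standard_normal((n, p))               # X ~ N(0, I_p)
    subgroup = (X @ omega >= gamma).astype(int)   # 1(omega^T X >= gamma)
    hazard = np.exp(beta * subgroup)              # baseline hazard lambda(t) = 1
    T = rng.exponential(1.0 / hazard)             # true survival times
    C = rng.uniform(0.0, 10.0, size=n)            # independent censoring
    time = np.minimum(T, C)                       # observed censored time
    event = (T <= C).astype(int)                  # censoring indicator delta
    return X, time, event, subgroup
```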

The average misclassification rate over 100 Monte Carlo simulations on a large independent test set of the covariate X (sample size 10,000) will serve as the measure of performance. Figure 1 summarizes the classification performance of the three methods as a function of dimension p for each of the three censoring mechanisms in Table 1.

Fig. 1: Results for abundant exponential simulation. Misclassification rate over 100 Monte Carlo simulations for the sieve (solid), Li's double-slicing (dotted), and rpart tree (dashed), as a function of dimension p. Vertical bars indicate Monte Carlo simulation error.

The sieve estimator performs better than Li’s double-slicing procedure under the independent censoring mechanism, since there is no benefit to slicing on the censoring variable. In the linear censoring case, the two methods have similar performance since the sieve estimator is unlikely to provide a substantial improvement when C satisfies (8). In contrast, under the nonlinear censoring mechanism, C cannot be written as a function of a linear combination of the covariates which leads to the violation of (8) in Li’s double-slicing model. The sieve estimator slightly outperforms it in this case.

Figure 1 reveals the rpart tree has difficulty across all censoring mechanisms and dimensions, probably because the geometry of the change-plane is far from that assumed by it. When the geometry is favorable to the rpart survival tree, it can be expected to perform substantially better. An example of this can be found in the sparse exponential simulation presented in the Supplementary Material. The rpart approach is still outperformed by both the sieve estimator and Li’s double-slicing for dimensions p = 5, 10, 25. It is not until p = 50 that it shows its advantages. Nonetheless, survival tree methods for subgroup identification cannot produce subgroups that are contiguous in the covariate space, which may hamper interpretability in certain settings.

The abundant exponential simulation in this section and the sparse exponential simulation in the Supplementary Material both consider an idealized setting where the data are generated according to the reduced change-plane Cox model. The sieve estimator is seen to offer generally better classification performance than both Li's double-slicing and the rpart tree across a range of dimensions p and censoring mechanisms.

5. Future work

We originally envisioned the change-plane Cox model as a tool for performing subgroup discovery, which aims to identify subgroups with heterogeneous treatment responses from a very large pool of candidate subgroups (Lipkovich et al., 2017). Given its post-hoc nature, subgroup discovery, and more generally subgroup analysis, is notoriously controversial (Wang et al., 2007). The change-plane Cox model may provide a principled, data-driven framework for subgroup discovery when the outcome of interest is survival. However, as the data examples in the Supplementary Material highlight, several issues must be addressed before this potential can be realized.

In the Supplementary Material, we apply the full change-plane Cox model to two datasets. The significance of $\beta$ is assessed by repeatedly partitioning the data into training and test sets. Each time, only the training data are used to obtain an estimate of the change-plane parameters $\omega$ and $\gamma$. The significance of the regression coefficient $\beta$ is then assessed in the test set, ignoring the fact that the change-plane was learned from the data. For both datasets, the resampling strategy reveals that significant $\beta$ coefficients in the training data may not remain so in the test set.

Distributional theory for the parameters in the change-plane Cox model, which is currently lacking, could help identify these instances of overoptimism. For now, we recommend any application of the proposed technique always be accompanied by the resampling strategy, which seems adequate for detecting whether the subgroups discovered are real or not. A deeper issue is the challenge that data-driven approaches pose to the standard paradigm of the scientific method. When hypotheses are generated from the data, care is needed to avoid confirmation bias.


Acknowledgement

The authors are grateful to all the reviewers, especially the Associate Editor, for their meticulous readings and invaluable input. The authors would also like to thank Ruoqing Zhu for helpful comments on an early version. The second author was supported in part by grant P01 CA142538 from the US National Cancer Institute and by grant DMS-1407732 from the US National Science Foundation.

Appendix

Proof of Theorem 1. Let P denote the probability measure of $W = (R, T, \delta)$ under (1). Define the empirical measure to be $\mathbb{P}_n = n^{-1} \sum_{i=1}^n \delta_{W_i}$, where $\delta_w$ is the measure that assigns mass 1 at w and zero elsewhere. For a measurable function f, we denote $\mathbb{P}_n f = n^{-1} \sum_{i=1}^n f(W_i)$ and $Pf = \int f\, dP$. Let $\tilde W = (\tilde R, \tilde T, \tilde\delta)$ be a realization from P, independent of W. Let $\tilde P$ and $\tilde{\mathbb{P}}_n$ be defined analogously for $\tilde W$. Next let $Y(t) = 1(T \ge t)$ be the at-risk process. Using empirical process notation we can write (2) and (3) as $L_n(\theta) = \tilde{\mathbb{P}}_n \tilde\delta\{\eta(\tilde R, \theta) - \log F_n(\tilde T, \theta)\}$, where $F_n(t, \theta) = \mathbb{P}_n Y(t) \exp\{\eta(R, \theta)\}$, and $M_n(\omega, \gamma) = \tilde{\mathbb{P}}_n \tilde\delta[\eta\{\tilde R, \hat\beta_n(\omega, \gamma), \omega, \gamma\} - \log F_n\{\tilde T, \hat\beta_n(\omega, \gamma), \omega, \gamma\}]$. In the expressions for $L_n$ and $M_n$, the random variables $(\tilde R, \tilde T, \tilde\delta)$ in the first term on the right-hand side have their expectations taken with respect to $\tilde{\mathbb{P}}_n$. In the second term on the right-hand side, two successive integrations take place: first the expectation of $(R, T, \delta)$ in $F_n$ is taken with respect to $\mathbb{P}_n$, and then the expectation of $\tilde T$ is taken with respect to $\tilde{\mathbb{P}}_n$. Let $F_0(t, \theta) = P Y(t) \exp\{\eta(R, \theta)\}$. The corresponding population versions of $L_n$ and $M_n$ are

\[ L_p(\theta) = \tilde P \tilde\delta\{\eta(\tilde R, \theta) - \log F_0(\tilde T, \theta)\}, \qquad (A1) \]

and

\[ M(\omega, \gamma) = \tilde P \tilde\delta\bigl[\eta\{\tilde R, \beta(\omega, \gamma), \omega, \gamma\} - \log F_0\{\tilde T, \beta(\omega, \gamma), \omega, \gamma\}\bigr], \]

where $\beta(\omega, \gamma) = \arg\max_\beta L_p(\beta, \omega, \gamma)$. The subscript in $L_p$ refers to the fact that this is a partial likelihood. Later we will use L to denote the full likelihood.

Following the argmax theorem in M-estimation theory (Kosorok, 2008, Theorem 14.1), the following conditions are sufficient to obtain consistency: 1) the sequence $\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\}$ is uniformly tight; 2a) the map $(\omega, \gamma) \mapsto M(\omega, \gamma)$ is upper semi-continuous with 2b) a unique maximum at $(\omega_0, \gamma_0)$; 3) $M_n$ converges to M uniformly over every compact set K in $\Theta_2$; and 4) the sieve estimator nearly maximizes the objective function, i.e., $M_n\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\} \ge M_n(\omega_0, \gamma_0) - o_P(1)$. We now check each of these conditions in turn.

The first condition of the argmax theorem holds since $\|\hat\omega_n\| = 1$ and $\tilde\gamma(\hat\omega_n)$ must lie in the interval [a, b]. For condition (2a), we will show that $M(\omega, \gamma)$ is continuous. Let $(\omega_n, \gamma_n)$ be a sequence converging to $(\omega, \gamma)$ and $\beta_n$ be a sequence converging to $\beta$. Then $\theta_n = (\beta_n, \omega_n, \gamma_n)$ is a sequence converging to $\theta = (\beta, \omega, \gamma)$. We first show that $\tilde P \tilde\delta\, \eta(\tilde R, \theta_n) \to \tilde P \tilde\delta\, \eta(\tilde R, \theta)$ if $\theta_n \to \theta$. This can be seen to hold component-wise for $\eta$ in light of Conditions 3 and 5. We show it explicitly for one of the components. Since X is continuous by Condition 3, we have

\[ P\bigl|\delta\, 1(\omega_n^T X \ge \gamma_n) - \delta\, 1(\omega^T X \ge \gamma)\bigr| \le P\, |\delta|\, \bigl|1(\omega_n^T X \ge \gamma_n) - 1(\omega^T X \ge \gamma)\bigr|\, 1\bigl(|\omega_n^T X - \gamma_n - \omega^T X + \gamma| \le \epsilon\bigr) + P\, |\delta|\, \bigl|1(\omega_n^T X \ge \gamma_n) - 1(\omega^T X \ge \gamma)\bigr|\, 1\bigl(|\omega_n^T X - \gamma_n - \omega^T X + \gamma| > \epsilon\bigr) \to 0. \]

If $\beta(\omega_n, \gamma_n) \to \beta(\omega, \gamma)$ then $F_0\{\tilde T, \beta(\omega_n, \gamma_n), \omega_n, \gamma_n\} \to F_0\{\tilde T, \beta(\omega, \gamma), \omega, \gamma\}$ almost surely. Note that $F_0\{\tilde T, \beta(\omega_n, \gamma_n), \omega_n, \gamma_n\}$ is bounded by an integrable function under Conditions 4 and 5. This gives $\tilde P \tilde\delta \log F_0\{\tilde T, \beta(\omega_n, \gamma_n), \omega_n, \gamma_n\} \to \tilde P \tilde\delta \log F_0\{\tilde T, \beta(\omega, \gamma), \omega, \gamma\}$. Thus, to show that $M(\omega, \gamma)$ is continuous, it suffices to establish continuity of $\beta(\omega, \gamma)$. To see this, first note that $L_p(\theta)$ is continuous using the arguments above. Next we establish that $L_p(\theta)$ has a unique maximum in $\beta$ for every pair $(\omega, \gamma)$. Consider

\[ \frac{\partial}{\partial\beta} L_p(\theta) = \tilde P \tilde\delta\biggl[ \frac{\partial}{\partial\beta}\eta(\tilde R, \theta) - \frac{P\, Y(\tilde T) \exp\{\eta(R, \theta)\}\, \frac{\partial}{\partial\beta}\eta(R, \theta)}{P\, Y(\tilde T) \exp\{\eta(R, \theta)\}} \biggr], \]

where

\[ \frac{\partial}{\partial\beta}\eta(R, \theta) = \{Z,\ 1(\omega^T X \ge \gamma),\ Z\, 1(\omega^T X \ge \gamma),\ U\}. \]

A straightforward calculation shows the second partial derivative with respect to β is strictly negative definite. Thus β(ωn, γn) → β(ω, γ).

We now verify condition (2b). Under (1), write the integrated hazard function of T, given X, as $\exp\{\eta(R, \theta)\}\Lambda(t)$, where $\Lambda$ is continuous and monotone increasing with $\Lambda(0) = 0$. The joint likelihood, in $\theta$ and the nuisance parameter $\Lambda$, for a single observation $(R, T, \delta)$ is proportional to $L(\theta, \Lambda) \equiv \{b(R, \theta)\lambda(T)\}^\delta \exp\{-b(R, \theta)\Lambda(T)\}$, where $b(R, \theta) = \exp\{\eta(R, \theta)\}$. Next we check (2b) by showing that the profile of L over $\Lambda$ equals $L_p(\theta)$ in equation (A1) up to a constant, which will then enable us to use the standard Kullback–Leibler argument for identifiability to show that $\theta_0$ is a unique maximizer of (A4) and hence $(\omega_0, \gamma_0)$ is a unique maximizer of $M(\omega, \gamma)$.

In L, replace $\lambda(t)$ with $\lambda_s(t) = \{1 + s f(t)\}\lambda(t)$, where f is for now an unspecified bounded function, and take the Gateaux derivative of L with respect to s at s = 0. Letting $N(t) = 1(T \le t,\ \delta = 1)$ be the counting process and using the fact that $P\, dN(t) = P\, Y(t) b(R, \theta_0)\, d\Lambda_0(t)$, we obtain that the expectation of the resulting derivative is

\[ \int_0^\tau f(t)\, P\{Y(t) b(R, \theta_0)\}\, d\Lambda_0(t) - \int_0^\tau f(t)\, P\{Y(t) b(R, \theta)\}\, d\Lambda(t). \qquad (A2) \]

Now if we replace $\Lambda$ in (A2) with $\Lambda_s(t) = \int_0^t \{1 + s g(u)\}\, d\Lambda(u)$, for some other function g, and differentiate again with respect to s at s = 0, we obtain that the second Gateaux derivative is $-\int_0^\tau f(t) g(t)\, P\{Y(t) b(R, \theta)\}\, d\Lambda(t)$, which is strictly negative when f = g, implying that, for fixed $\theta$, any $\Lambda$ which is a zero of (A2) for a rich enough collection of functions f is a maximizer over all $\Lambda$ for fixed $\theta$. Plug $f(t) = 1(t \le u)$ into (A2), and allow u to range over $[0, \tau]$, and we obtain that the profile maximizer of L over $\Lambda$ satisfies $\int_0^u P\{Y(t) b(R, \theta_0)\}\, d\Lambda_0(t) - \int_0^u P\{Y(t) b(R, \theta)\}\, d\Lambda(t) = 0$ for all $u \in [0, \tau]$. Hence

\[ \frac{d\Lambda(t)}{d\Lambda_0(t)} = \frac{P\{Y(t) b(R, \theta_0)\}}{P\{Y(t) b(R, \theta)\}}. \qquad (A3) \]

Plugging (A3) back into L, and removing additive terms which are constants with respect to θ, we obtain that the profile of L over the parameter Λ is

\[ P\biggl( \int_0^\tau \log b(R, \theta)\, dN(t) - \int_0^\tau \log\bigl[ P\{Y(t) b(R, \theta)\} \bigr]\, dN(t) \biggr), \qquad (A4) \]

which equals $L_p(\theta)$ in equation (A1). Now let $\theta_1$ maximize (A4). Then, by the fact that (A4) is the profile of L over the parameter $\Lambda$, there exists a $\Lambda_1$ such that the joint parameter $(\theta_1, \Lambda_1)$ maximizes L. By the property of the Kullback–Leibler discrepancy and model identifiability, this implies that $\theta_1 = \theta_0$. Hence (A4) has a unique maximizer at $\theta_0$ and we have shown that $M(\omega, \gamma)$ is uniquely maximized at $(\omega_0, \gamma_0)$.

Proceeding on to condition (3) of the argmax theorem, fix a compact $K = K_1 \times K_2 \subset \Theta$, where $K_1$ is compact in $\Theta_1$ and $K_2$ is compact in $\Theta_2$. Let $m_\theta(v, t, \delta) = \delta\{\eta(v, \theta) - \log F_n(t, \theta)\}$ and consider the class of functions $\{m_\theta(v, t, \delta) : \theta \in K\}$. First we consider the component $\{\eta(v, \theta) : \theta \in K\}$. Trivially, the classes $\{\beta_i\}$ for $i = 1, \ldots, 4$ are each Donsker, as are the classes $\{Z\}$ and $\{U\}$. The class $\{1(\omega^T x \ge \gamma) : (\omega, \gamma) \in K_2\}$ is also Donsker by the example in Section 4.1.1 of Kosorok (2008). Since products of bounded Donsker classes are Donsker, $\{\eta(v, \theta) : \theta \in K\}$ is Donsker. Next, we examine the component $\{\log F_n(t, \theta) : t \in [0, \tau], \theta \in K\}$. The class $\{\exp\{\beta_2 1(\omega^T x \ge \gamma)\}\}$ is Donsker, since exponentiation is Lipschitz continuous on compact sets. The at-risk process $Y(t)$ is Donsker by Lemma 4.1 in Kosorok (2008). Thus $\{\log F_n(t, \theta)\}$ is Donsker. Repeating arguments for sums of Donsker classes and products of bounded Donsker classes shows that $\{m_\theta(v, t, \delta) : \theta \in K\}$ is a Donsker class of functions, and therefore also a Glivenko–Cantelli class of functions.

Now, let $m_{\omega,\gamma}(v, t, \delta) = \delta[\eta\{v, \hat\beta_n(\omega, \gamma), \omega, \gamma\} - \log F_n\{t, \hat\beta_n(\omega, \gamma), \omega, \gamma\}]$. Then we can write $M_n(\omega, \gamma) = \tilde{\mathbb{P}}_n m_{\omega,\gamma}(\tilde R, \tilde T, \tilde\delta)$. Since the estimated log hazard ratio $\hat\beta_n(\omega, \gamma)$ lies in a compact set in $\Theta_1$ for all $(\omega, \gamma) \in K_2$, the class $\{m_{\omega,\gamma}(v, t, \delta) : (\omega, \gamma) \in K_2\}$ is contained in a Donsker class, which implies it is a Glivenko–Cantelli class. Thus

\[ \sup_{(\omega, \gamma) \in K_2} \bigl| M_n(\omega, \gamma) - \tilde P m_{\omega,\gamma}(\tilde R, \tilde T, \tilde\delta) \bigr| \to 0 \]

in probability as $n \to \infty$. Next we show that $\tilde P m_{\omega,\gamma}(\tilde R, \tilde T, \tilde\delta)$ converges uniformly to $M(\omega, \gamma)$. The uniform convergence of $\hat\beta_n(\omega, \gamma)$ to $\beta(\omega, \gamma)$ can be shown by adapting the arguments of Theorem 1 in Pons (2003). Next we show that $F_n\{t, \hat\beta_n(\omega, \gamma), \omega, \gamma\}$ converges to $F_0\{t, \beta(\omega, \gamma), \omega, \gamma\}$ uniformly over $(\omega, \gamma) \in K_2$. We may write $F_n\{t, \hat\beta_n(\omega, \gamma), \omega, \gamma\} = \mathbb{P}_n Y(t) \exp[\eta\{R, \hat\beta_n(\omega, \gamma), \omega, \gamma\}]$ and $F_0\{t, \beta(\omega, \gamma), \omega, \gamma\} = P Y(t) \exp[\eta\{R, \beta(\omega, \gamma), \omega, \gamma\}]$. We have already argued the Donsker property of the classes $\{1(t \ge r) : r \in [0, \tau]\}$ and $\{\exp\{\eta(v, \theta)\} : \theta \in K\}$. Thus we conclude that $\{1(t \ge r) \exp\{\eta(v, \theta)\} : r \in [0, \tau], \theta \in K\}$ is Donsker and hence Glivenko–Cantelli. Hence, $M_n(\omega, \gamma)$ converges uniformly to $M(\omega, \gamma)$ over compact $K_2 \subset \Theta_2$.

Finally, we look at condition (4) of the argmax theorem. If the sieve $\Omega_n$ is dense, there is a sequence $\{\omega_n, \tilde\gamma(\omega_n)\} \in \Omega_n \times [a, b]$ that converges to $(\omega_0, \gamma_0)$. By definition $M_n\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\} \ge M_n\{\omega_n, \tilde\gamma(\omega_n)\}$. By the continuity of $M(\omega, \gamma)$, $M_n(\omega_0, \gamma_0) - M_n\{\omega_n, \tilde\gamma(\omega_n)\} = o_P(1)$ and thus $M_n\{\hat\omega_n, \tilde\gamma(\hat\omega_n)\} \ge M_n(\omega_0, \gamma_0) - o_P(1)$. The conditions of the argmax theorem are met, and consistency follows. □

Footnotes

Supplementary material

Supplementary material available at Biometrika online includes 1) descriptions of the recursively imputed survival tree and Li’s double-slicing method, 2) implementation details of all methods used in the simulations and data analyses, 3) results of the sparse exponential simulation, and 4) analysis of two survival datasets.

Contributor Information

Susan Wei, Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota 55455, U.S.A., susanwei@umn.edu.

Michael R. Kosorok, Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599, U.S.A., kosorok@unc.edu

References

  1. Beran R (1981). Nonparametric regression with randomly censored survival data. Technical report, University of California, Berkeley.
  2. Cui Y, Zhu R & Kosorok M (2017). Tree based weighted learning for estimating individualized treatment rules with censored data. Electronic Journal of Statistics 11, 3927–3953.
  3. Geman S & Hwang C-R (1982). Nonparametric maximum likelihood estimation by the method of sieves. The Annals of Statistics 10, 401–414.
  4. Grenander U (1981). Abstract Inference. New York: Wiley.
  5. Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. New York: Springer.
  6. Li K-C (1991). Sliced inverse regression for dimension reduction. Journal of the American Statistical Association 86, 316–327.
  7. Li K-C, Wang J-L & Chen C-H (1999). Dimension reduction for censored regression data. Annals of Statistics 27, 1–23.
  8. Lipkovich I, Dmitrienko A & D'Agostino RB (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine 36, 136–196.
  9. Pons O (2003). Estimation in a Cox regression model with a change-point according to a threshold in a covariate. The Annals of Statistics 31, 442–463.
  10. Therneau TM & Atkinson EJ (2018). rpart: Recursive Partitioning and Regression Trees. R package version 4.1-13.
  11. Wang R, Lagakos SW, Ware JH, Hunter DJ & Drazen JM (2007). Statistics in medicine: reporting of subgroup analyses in clinical trials. New England Journal of Medicine 357, 2189–2194.
  12. Wei S & Kosorok MR (2013). Latent supervised learning. Journal of the American Statistical Association 108, 957–970.
  13. Zhu R & Kosorok MR (2012). Recursively imputed survival trees. Journal of the American Statistical Association 107, 331–340.
