Abstract
In this article we study a semiparametric additive risks model (McKeague and Sasieni (1994)) for two-stage design survival data where accurate information is available only on second stage subjects, a subset of the first stage sample. We derive two-stage estimators by combining data from both stages. Large sample inferences are developed. As a by-product, we also obtain asymptotic properties of the single stage estimators of McKeague and Sasieni (1994) when the semiparametric additive risks model is misspecified. The proposed two-stage estimators are shown to be asymptotically more efficient than the second stage estimators. They also demonstrate smaller bias and variance for finite samples. The developed methods are illustrated using small intestine cancer data from the SEER (Surveillance, Epidemiology, and End Results) Program.
Key words and phrases: Censored data, correlation, efficiency, measurement errors, missing covariates
1. Introduction
Two-stage designs are useful in medical studies and other fields of research. The first stage sample of a two-stage design consists of a set of subjects under study with surrogate, inaccurate, or missing information. The second stage sample is a subset of individuals from the first stage with accurate and complete data. Typical scenarios include measurement error and missing covariates problems. For example, when a complete survey is complicated, expensive, and time consuming, researchers often use a simplified version for all study subjects in the first stage. The complete version is taken only by a small subset of study subjects. Two-stage data also arise in applications where certain information from more recently available technology, such as a genome-wide scan, is collected only for newly-diagnosed patients. A medical device postmarket surveillance example was given by Li and Tseng (2008): St Jude Medical conducted a postmarket surveillance study to evaluate the safety and efficacy of five pacing electrodes by collecting information on adverse events or failures. The database maintained by the medical device company, which contains all the marketed devices, has serious under-reporting problems that might lead to underestimated failure and adverse event rates. To offset the under-reporting bias, St Jude Medical drew an active follow-up sample and collected accurate and complete information on this sample. This typical two-stage survival study consists of the company administrative data (first stage data) and the active follow-up data (second stage data). In general, analysis based on the first stage data alone could be biased. On the other hand, analysis based on the second stage data alone would not be the most efficient since it does not utilize information from the first stage. It would be desirable to combine data from both stages to increase the efficiency of the second stage data analysis.
The two-stage design has been studied extensively for complete data; see, e.g., White (1982), Schill et al. (1993), Breslow and Holubkov (1997), and references therein. However, there are relatively few methods available for analysis of two-stage censored survival data, especially when both the outcome variable and covariates are subject to error in the first stage sample. Among others, Zhou and Pepe (1995) and Wang et al. (1997) studied the surrogate covariates problem for a multiplicative semiparametric hazard model using regression calibration techniques. Kulich and Lin (2000) proposed a corrected pseudo-score estimator for the additive risks model of Lin and Ying (1994) with measurement errors on covariates. Based on the work of Chen and Chen (2000) on regression models in two-stage designs, Chen (2002) and Tseng (2004) studied the Cox model for two-stage survival data, where both survival time and covariates are subject to measurement error. Li and Tseng (2008) studied nonparametric estimation of survival functions for two-stage survival data. Jiang and Zhou (2007) studied a two-stage design problem for Lin and Ying's (1994) model.
In this paper we study the semiparametric additive risks model of McKeague and Sasieni (1994) (referred to as MS hereafter) for analysis of two-stage survival data where both the survival time and covariates are subject to measurement errors. Let h(t|x, z) denote the conditional hazard function of a survival time given x and z. The MS model postulates that

h(t|x, z) = α(t)′x + β′z,

where α(t) is a vector of unspecified time-varying coefficient functions and β is a vector of constant regression coefficients.
The MS model provides a useful alternative to the Cox (1972) model when the proportional hazards assumption is violated. Including Lin and Ying's (1994) model as a special case, the MS model is more parsimonious than Aalen's (1978) additive risks model.
We derive two-stage estimates for the parametric and nonparametric regression coefficients by bridging the first stage and second stage estimates through their asymptotic joint distribution. The estimators introduced in this paper take the form

β̂ = β̂2 + Σ21Σ11^{−1}(β̂1 − β̂1^V),

where β̂2 is the second stage estimator, β̂1 and β̂1^V are first stage estimators based on all N first stage subjects and on the n validation subjects, respectively, Σ21 is the covariance matrix between β̂2 and β̂1^V, and Σ11 is the variance of β̂1^V. The second stage estimator is thus improved by incorporating the information from the first stage through the correction term. The use of information from both stages allows us to fit a model with full information. A major challenge in establishing the asymptotic joint distribution of the first-stage and second-stage estimates in our model is the loss of the martingale property, which is the key to the theoretical development of the MS model for the single stage estimates. Moreover, we need to derive the properties of the MS estimators under a misspecified model. We therefore use a different approach to study the asymptotic joint distribution by deriving i.i.d. representations. The same approach is then used to establish large sample properties of the proposed two-stage estimates and to develop large sample inferences.
Our methods are developed under a very general setting that incorporates measurement errors on both covariates and the survival outcome without requiring specific model specifications for the errors. No assumption is needed for the relationship between surrogate variables and target variables. We allow misspecified models for the first stage data and derive a robust sandwich variance estimate for the MS model.
The paper is organized as follows. In Section 2, we study the properties of the single stage MS estimators under misspecified models, and propose two-stage estimators for the regression coefficients and the conditional survival function. Large sample properties of our proposed estimators are given in Section 3. Point-wise and simultaneous confidence intervals for the conditional survival function are derived. Section 4 presents a simulation study to evaluate the performance of our methods. In Section 5, we illustrate our method using small intestine cancer data from the Surveillance, Epidemiology, and End Results (SEER) Program supported by the National Cancer Institute (NCI). Section 6 provides some concluding remarks. The proofs are provided in the appendix.
2. Two-Stage Estimators
2.1. Notation and assumptions
Suppose there are N subjects in the first stage and only coarse measurements, denoted as (x1i, z1i, T1i, δ1i), i = 1, …, N, are available. Here x1i ∊ Rp1, z1i ∊ Rq1 are the observed surrogate covariates that might depend on time, , is a survival time, C1i is a censoring time conditionally independent of given the covariates, and is the censoring indicator. In the second stage, accurate data (x2i, z2i, T2i, δ2i), i ∊ V (n), are collected for a random validation subsample V (n) of n subjects from the first stage, where x2i ∊ Rp2, z2i ∊ Rq2, , is the true survival time, C2i is a censoring time conditionally independent of given the covariates x2i and z2i, and is the censoring indicator.
Assume the following MS model for the second stage survival time:

h2(t|x2i, z2i) = α2(t)′x2i + β2′z2i

for i ∊ V (n). For the second stage sample, let
be the weighted least squares estimators of β2 and , respectively (McKeague and Sasieni (1994)), where , , is the at-risk process, , N2i(t) = I(T2i ≤ t, δ2i = 1) is the counting process,
with , is a uniformly consistent estimate of the weight function h2i(t) ≡ h2(t|x2i, z2i) for subject i, and τ is the last time point in the study (see a more rigorous definition in the appendix). Similarly, we define the first stage estimators [, and , ] using the first stage data based on all N subjects and the n subjects in the validation sample, respectively.
2.2. Asymptotic properties of the MS estimators under misspecified models
Our theorem gives large sample properties of and without making any model assumption for the first stage data. It is a nontrivial generalization of the result of MS by allowing the model to be misspecified. Scheike (2002) considered a particular misspecification of the MS model, where the form holds only for the rate function and not the intensity, while our results work for general misspecification.
It is shown in the appendix that is equivalent to a sum of independent and identically distributed random variables with mean zero,
where is the unknown working parameter, and
with
The variance of is therefore and can be consistently estimated by with the unknown quantities replaced by their estimates.
Similarly, we prove in the appendix that
where
| (2.1) |
| (2.2) |
The pointwise variance of the asymptotic Gaussian process can be estimated by , with the unknown quantities replaced by the estimates.
The asymptotic results for a misspecified MS model are summarized in the following theorem, whose proof is in the appendix.
Theorem 1 Under the regularity conditions (C1)–(C3) stated in the appendix, in a misspecified model
where with a⊗2 = aa′. The variance Σβ,11 can be consistently estimated by , with dt and defined in (A.2). Moreover,
where is the standard Skorohod space on [0, τ], τ = sup{t : S1(t|x, z) S2(t|x, z)SC(t|x, z) > 0 for all x, z} (see regularity assumption (C1) in the appendix), and is a zero-mean Gaussian process with covariance function . The variance function of is given by , which can be consistently estimated by with the unknown quantities replaced by the estimates.
The estimators and , based on all the N first stage subjects, have the same asymptotic properties as stated in Theorem 1. Notice that no model is assumed for the first stage data in Theorem 1. If the MS model holds for the first stage data, then β1 and A1(t) coincide with the regression parameters in the true MS model.
2.3. Two-stage estimators of β and A(t)
To develop the two-stage estimator for β2, we first give the joint distribution of .
Lemma 1. Assume the regularity conditions (C1)–(C3) given in the appendix, then
| (2.3) |
where , , and g = 1, 2 indicates the stage. The covariance matrix can be estimated by , and . Here β2, Φ2, w2i(t), , and are defined similarly to β1, Φ1, w1i(t), , and .
It follows from (2.3) that . This suggests that β2 be estimated by
| (2.4) |
Next, we consider the joint distribution of and .
Lemma 2. Let A2(t) and v2i(t) be defined similarly to (2.1) and (2.2), respectively, based on the second stage sample. Under the regularity conditions (C1)–(C3), as n, N → ∞ with n/N → ρ for some constant 0 < ρ < 1,
in , where is a zero-mean Gaussian random field, with variance-covariance function
where ΣA,kl(t1, t2) = E [vki(t1)·vli(t2)′] for k, l ∈ {1, 2}, and ΣA,gg(t) = ΣA,gg(t, t). The variance and covariance functions can be consistently estimated by as defined in (A.4) in the appendix.
By Lemma 2 and the argument leading to (2.4), we take
| (2.5) |
where .
Our two-stage estimators possess some appealing properties. In particular, if the first stage data are barely correlated with the second stage data, then the proposed estimate is close to the second stage estimate . This is a desirable property since the first stage data are not expected to contribute much useful information for estimating β2. The same comment applies to . It can also be easily verified that when the first stage sample contains precise and complete information, the proposed estimates and are identical to the estimates and . This means that we should use all the first stage data to estimate the parameters and make statistical inference when no bias is present in the first stage sample.
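As a concrete illustration, the combination step can be sketched numerically. This is a minimal sketch in our own notation: the covariance matrices are hypothetical inputs standing in for the estimates of Lemma 1, and `two_stage_estimate` is not a function from any established package.

```python
import numpy as np

def two_stage_estimate(beta2_hat, beta1_full, beta1_val, Sigma21, Sigma11):
    """Combine the second stage estimate with the discrepancy between the
    first stage estimator computed on all N subjects (beta1_full) and on
    the n validation subjects (beta1_val)."""
    adjustment = Sigma21 @ np.linalg.solve(Sigma11, beta1_full - beta1_val)
    return beta2_hat + adjustment

# When Sigma21 = 0 (first and second stage estimates uncorrelated),
# the two-stage estimate reduces to the second stage estimate.
b2 = np.array([1.0, 2.0])
b1_full = np.array([0.5, 0.5])
b1_val = np.array([0.6, 0.4])
print(two_stage_estimate(b2, b1_full, b1_val, np.zeros((2, 2)), np.eye(2)))
```

The limiting behaviors discussed above follow directly from this form: a zero covariance Σ21 leaves the second stage estimate untouched, while a highly informative first stage makes the correction term do most of the work.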
This method is general enough to allow variables to have different types of coefficients in the two stages, as well as different sets of covariates for the first- and second-stage models. However, we recommend using the same type of coefficients for a variable in both stages whenever possible, based on the intuition that the effects of a variable are expected to be similar for both stages. For example, we use constant coefficients for age and gender for both the first- and second-stage models in our data example in Section 5.
3. Asymptotic Properties and Inferences
3.1. Asymptotic properties of
The following result states the weak convergence property of the joint distribution of the proposed estimators and .
Theorem 2 Under conditions (C1)–(C3), we have
| (3.1) |
where with , and is a zero-mean Gaussian process with covariance function
with ΣA,kl(s, t) defined in Lemma 2. The variance of is . The covariance between Z2 and is
where n/N → ρ for some constant 0 < ρ < 1 as n, N → ∞, and for k, l ∈ {1, 2}.
Furthermore, can be consistently estimated by , and can be consistently estimated by for any t ε [0, τ]. The covariance function can be estimated by
with .
Obviously, (i.e. is nonnegative definite). Hence our proposed two-stage estimators are asymptotically more efficient than the estimators using the second stage data alone. We will compare their finite sample performance in Section 4.
3.2. Estimation of the conditional survival function
We consider the problem of estimating the conditional cumulative hazard function H2(t) = H2(t|x0, z0) and the conditional survival function S2(t) = S2(t|x0, z0), for some given covariates x0, z0. Let and , where .
Theorem 3 Assume that n/N → ρ for some constant 0 < ρ < 1 as n, N → ∞. Under the regularity conditions (C1)–(C3),
where is a zero-mean Gaussian process with covariance function (t1, t2) equal to ; this can be consistently estimated by replacing each term by its estimate.
Thus at any t ∈ [0, τ], 100(1 − α)% pointwise confidence intervals for H2(t) and S2(t) are given by , and , where z1−α/2 is the (1 − α/2)th percentile of the standard normal distribution.
Notice that the proposed estimator is not necessarily monotonically non-increasing in t. As mentioned by Li and Tseng (2008), this problem is local and minor, especially when the sample size is large. In practice, one can improve the estimates by the Pool-Adjacent-Violators algorithm (Barlow et al. (1972)) or some simpler modification (cf. Lin and Ying (1994)).
Theorem 3 cannot be readily applied to construct simultaneous confidence bands for S2(t) over a given interval [τ1, τ2] since the distribution of the supremum statistic is intractable. Using an idea similar to that in Lin and Ying (1994), we develop a Monte Carlo method for constructing simultaneous confidence bands for H2(t) and S2(t). It can be shown that the process En(t) in Theorem 3 is asymptotically equivalent to a sum of i.i.d. random variables. Specifically,
To approximate the distribution of En(t), we define another process as
where Gi are i.i.d. N(0, 1). We prove in the appendix that En(t) and have the same limiting distribution.
Theorem 4 Conditioned on the data (x1i, z1i, T1i, δ1i), i = 1,…, N, and (x2j, z2j, T2j, δ2j), j ε V (n), the random process converges weakly to in .
Theorem 4 suggests that the limiting distribution of En(t) can be approximated by that of . The latter can be obtained by generating a large number of independent Monte Carlo random samples G1,…,GN from the standard normal distribution. Similar to Lin and Ying (1994), the confidence bands for H2(t) and S2(t) can be obtained as and , where qα is the critical value of , g is a weight function, and ϕ is a known transformation function with non-zero and continuous first derivative ϕ′. Specifically, we consider to get an equal-precision band, and set ϕ(t) = log(t) to obtain bands on meaningful ranges and to attain better coverage probabilities.
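The resampling step can be sketched as follows. Here `eta` is a hypothetical (N × grid) array of the i.i.d. summand processes evaluated on a time grid; each Monte Carlo replicate multiplies the summands by fresh N(0, 1) variables Gi and records the supremum, whose empirical quantile gives the critical value qα:

```python
import numpy as np

rng = np.random.default_rng(0)

def simultaneous_band_critical_value(eta, alpha=0.05, n_mc=2000):
    """eta: (N, T) array of the i.i.d. summand processes on a time grid
    (stand-ins for the terms in the i.i.d. representation). Returns the
    (1 - alpha) quantile of the sup of the multiplier process."""
    N = eta.shape[0]
    sups = np.empty(n_mc)
    for b in range(n_mc):
        G = rng.standard_normal(N)          # multipliers G_i ~ N(0, 1)
        process = G @ eta / np.sqrt(N)      # approximated process on the grid
        sups[b] = np.max(np.abs(process))
    return np.quantile(sups, 1 - alpha)
```

Conditioning on the data, only the multipliers Gi are redrawn, so the whole band costs one pass over the stored summand processes per replicate.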
4. Simulation Studies
We present a small simulation study to illustrate and evaluate the finite sample performance of the proposed two-stage estimators. The two-stage estimators are compared with the second stage estimators in terms of bias, variance, mean square error (MSE), and achieved confidence interval coverage probabilities.
The weight matrices Wg(t), g = 1, 2, in Section 2.1 require consistent estimates of the conditional hazard functions hgi(t|xgi, zgi); they are given by , where
K(t) is a kernel function, and b is the bandwidth. In the simulation study, we used the Epanechnikov kernel, K(x) = 3(1 − x2)/4 for |x| < 1, with bandwidth b = 0.4τ around each given time point t. The boundary effects are corrected by the modified asymmetric kernel proposed by Gasser and Müller (1979). In a more comprehensive simulation, Wu (2006) observed that both the second stage and two-stage estimators are not very sensitive to the choice of the smoothing parameter.
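As one concrete version of this smoothing step, an Epanechnikov-kernel smoother of jump increments can be written as below. The function names and inputs are our own illustration (in the paper, the increments of the estimated cumulative coefficient functions are smoothed to estimate the hazards that enter the weights):

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel K(x) = 3(1 - x^2)/4 on |x| < 1, zero elsewhere."""
    return np.where(np.abs(x) < 1, 0.75 * (1.0 - x ** 2), 0.0)

def smoothed_rate(t, event_times, increments, b):
    """Kernel-smooth jump sizes at the event times into a rate estimate
    at time t, with bandwidth b (no boundary correction in this sketch)."""
    w = epanechnikov((t - event_times) / b) / b
    return np.sum(w * increments)
```

Boundary correction near 0 and τ, as in Gasser and Müller (1979), would replace the symmetric kernel by an asymmetric one; we omit that here for brevity.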
The second stage survival time was generated from h2(t|x2i, z2i,1, z2i,2) = α2,0(t) + α2,1(t)x2i + β2,1 z2i,1 + β2,2 z2i,2, where α2,0(t) = 1, α2,1(t) = t, β2,1 = β2,2 = 1 (q2 = 2), and x2i, z2i,1, z2i,2 ~ i.i.d. Unif[0, 1]. The random censoring times C2i, i = 1,…,n, were generated from h(c|x2i, z2i) = 0.1 + 0.1cx2i + 0.5z2i,1 + 0.5z2i,2. About 20 percent of the subjects were censored under this model. In the first stage, we considered a general situation by incorporating both the measurement error problem and the missing covariate problem. The working model for the first stage data was h1(t|x1i, z1i,1) = α1,0(t) + α1,1(t)x1i + β1,1z1i,1, where x1i = x2i + Unif[0, 0.1], z1i,1 = z2i,1 + Unif[0, 0.5], and the covariate z2i,2 is missing for all subjects. We generated 1,000 Monte Carlo samples for each size (n, N) = (100, 1,000), (500, 1,000), and (500, 2,000). With different sample sizes at both the first and second stages, we can evaluate: (1) the performance of the variance estimator under a misspecified model at finite sample sizes; and (2) the improvement of the second stage estimator when incorporating different amounts of information from the first stage.
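Survival times under this additive hazard can be generated by inversion: with h2(t|x, z) = 1 + t·x + z1 + z2, the cumulative hazard is H(t) = (1 + z1 + z2)t + x t²/2, and solving H(T) = −log U for U ~ Unif(0, 1) is a quadratic in T. A sketch of this step (our own helper, mirroring the simulation model above):

```python
import numpy as np

def draw_survival_time(x, z1, z2, u):
    """Invert H(t) = (1 + z1 + z2) t + x t^2 / 2, the cumulative hazard
    implied by h(t|x,z) = 1 + t*x + z1 + z2, at the target -log(u)."""
    c = 1.0 + z1 + z2            # constant part of the hazard
    a = x                        # slope from alpha_{2,1}(t) = t
    target = -np.log(u)
    if a == 0:
        return target / c        # hazard is constant in t
    # positive root of a*t^2/2 + c*t - target = 0
    return (-c + np.sqrt(c * c + 2.0 * a * target)) / a
```

Censoring times can be drawn the same way from their own additive hazard, after which T2i = min of the two and δ2i indicates which came first.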
Table 1 presents the bias, variance, estimated variance, and MSE of , , , and together with the achieved coverage probabilities of the respective confidence intervals. When the second stage sample was small (n = 100), the variance of the second stage estimate (0.69) was underestimated (0.58). As a comparison, we can see that the first stage estimates had large bias, which indicates that the information from the first stage was biased. Even so, there was gain from combining the first stage data of z1i,1. The estimated variance of (0.23) was close to the variance (0.24), which was much smaller than the variance of the second stage estimate. We also observed a better coverage probability for the proposed estimator than for the second stage estimator. On the other hand, the estimator of the coefficient of z2i,2 showed little improvement over its second stage counterpart. This can be explained by the fact that the first stage data contained no information on z2i,2. As n increased to 500, the second stage estimates were improved with better variance estimates and higher coverage probabilities – the variance estimator for a misspecified model worked well when the sample size got bigger. Our proposed estimates had smaller variances, better variance estimates, and better coverage probabilities. As N increased to 2,000 and n remained at 500, the variance of was further reduced since more information from stage one was obtained. For comparison, Table 1 also shows the “ideal” estimates, denoted as and , using the complete and accurate information for all N subjects. Figure 1 depicts the mean and variance of two-stage and second stage estimates of A2,0(t), A2,1(t), and S2(t|z0 = (0.5, 0.5), x0 = 0.5), respectively, for sample size (n, N) = (100, 1,000). The two-stage estimates (thick solid line) show less bias at the right tail than the second stage estimates (dashed line). Moreover, the two-stage estimates have much smaller variances throughout. Table 2 presents the simulated coverage probabilities of the pointwise 95% confidence intervals of A(t) and S(t|z0, x0) for (n, N) = (100, 1,000).
The coverage probabilities are quite satisfactory for most time points. As sample size increases, the performance is improved in the far right tail. The results for (n, N) = (500, 1, 000) and (500, 2, 000) are similar and thus not reported here.
Table 1.
Simulated bias, variance, mean square error (MSE), and 95% coverage probability (CP) of , , and . The true parameter value was .
| (n,N) | Estimate | Bias | Var | Est. Var | MSE | 95% CP |
|---|---|---|---|---|---|---|
| (100, 1,000) | (Stage 1) | −0.290 | 0.380 | 0.310 | NA | NA |
| | (Stage 1) | −0.320 | 0.037 | 0.033 | NA | NA |
| | (Stage 2) | 0.040 | 0.690 | 0.580 | 0.690 | 0.910 |
| | (Proposed) | < 0.001 | 0.240 | 0.230 | 0.240 | 0.940 |
| | (Ideal) | < 0.001 | 0.060 | 0.061 | 0.061 | 0.950 |
| | (Stage 2) | −0.060 | 0.650 | 0.660 | 0.660 | 0.920 |
| | (Proposed) | −0.040 | 0.670 | 0.570 | 0.680 | 0.910 |
| | (Ideal) | −0.020 | 0.058 | 0.061 | 0.058 | 0.950 |
| (500, 1,000) | (Stage 1) | −0.320 | 0.075 | 0.075 | NA | NA |
| | (Stage 1) | −0.310 | 0.038 | 0.039 | NA | NA |
| | (Stage 2) | −0.010 | 0.130 | 0.130 | 0.130 | 0.940 |
| | (Proposed) | −0.010 | 0.078 | 0.077 | 0.078 | 0.950 |
| | (Ideal) | < 0.001 | 0.060 | 0.061 | 0.060 | 0.950 |
| | (Stage 2) | 0.010 | 0.130 | 0.130 | 0.130 | 0.940 |
| | (Proposed) | 0.010 | 0.130 | 0.130 | 0.130 | 0.940 |
| | (Ideal) | −0.010 | 0.058 | 0.060 | 0.058 | 0.950 |
| (500, 2,000) | (Stage 1) | −0.310 | 0.074 | 0.075 | NA | NA |
| | (Stage 1) | −0.320 | 0.019 | 0.020 | NA | NA |
| | (Stage 2) | < 0.001 | 0.120 | 0.120 | 0.120 | 0.950 |
| | (Proposed) | < 0.001 | 0.054 | 0.054 | 0.054 | 0.950 |
| | (Ideal) | < 0.001 | 0.030 | 0.031 | 0.031 | 0.950 |
| | (Stage 2) | < 0.001 | 0.130 | 0.120 | 0.130 | 0.940 |
| | (Proposed) | < 0.001 | 0.130 | 0.120 | 0.130 | 0.940 |
| | (Ideal) | < 0.001 | 0.030 | 0.030 | 0.030 | 0.950 |
Figure 1.

Comparison of the two-stage estimates (thick solid line) and second stage estimates (dashed line) with the true coefficients (solid line) for sample size (n, N) = (100, 1,000). The top panel gives the estimates and variances of and ; the middle panel shows the estimates and variances of and ; the bottom panel gives the estimates and variances of and . The first column plots the point estimates against the true values (solid line), and the second column plots the variance estimates.
Table 2.
Simulated coverage probabilities of the nonparametric two-stage estimators at nominal level 95%.
| Time | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 | 1.2 |
|---|---|---|---|---|---|---|
| A2,0(t) | 0.91 | 0.92 | 0.91 | 0.93 | 0.93 | 0.90 |
| A2,1(t) | 0.95 | 0.96 | 0.96 | 0.95 | 0.92 | 0.90 |
| S2(t \| z0, x0) | 0.95 | 0.95 | 0.95 | 0.94 | 0.93 | 0.92 |
5. An Example
5.1. Data description
We illustrate our method using a data set on small intestine cancer from the SEER program supported by NCI. Surgery and radiation therapy are the most commonly used treatments for small intestine cancer. In this study we wanted to know how these treatments affect both survival time and the development of subsequent tumors. Therefore, we defined the survival time as the time from the diagnosis of the first primary small intestine cancer to the diagnosis of the second primary cancer or death. We considered eleven covariates: surgery status (1 if yes, 0 if no), radiation therapy (1 if yes, 0 if no), age at the first primary cancer diagnosis (1 if age < 60, 0 if age ≥ 60), gender (1 if male, 0 if female), dummy variables for race (black and other races, with white as the reference group), dummy variables for stage (regional and distant stages, with local stage as the reference group), and dummy variables for tumor grade (grades II, III, and IV, with grade I as the reference group).
To illustrate our method, we constructed a two-stage design data set as follows. The second stage sample consists of 300 patients (censoring rate 33.7%) randomly chosen from the 2,669 patients (censoring rate 26.5%) in the data set with all eleven covariates and survival information. The first stage data include all 2,669 patients; however, the variable tumor grade is missing. The 2,669 patients with all variables were used as the reference population.
5.2. Analysis results
With all eleven covariates as time-dependent variables, we plotted the cumulative hazard function for each variable based on the Aalen (1978) nonparametric additive risks model (Wu (2006)). The linear trends for age and gender (Figure 2) suggest that these two variables have time-independent effects and might be used as Z (q2 = 2) in the MS model. The nonlinear trends of the other nine covariates (radiation, race, tumor grade, stage, and surgery) suggest that they be used as X (p2 = 9). For more information on how to assign x and z, refer to Martinussen and Scheike (2006).
Figure 2.
Plots of cumulative hazard function for age and gender show linear trends.
With age and gender as covariate Z (q2 = 2) with time-independent effects, and the other nine covariates (radiation therapy, black, other races, grade II, grade III, grade IV, regional stage, distant stage, and surgery status) as X (p2 = 9) with time-dependent effects, Table 3 compares the proposed and second stage estimates for age and gender. Both methods show significantly higher risks associated with being male and older (≥ 60 years). We note that both confidence intervals cover the reference parameter values obtained from the complete data (“ideal” estimates), but the interval based on the two-stage estimate is much narrower than that based on the second stage estimate.
Table 3.
Comparison of the second stage and the proposed two-stage estimators for age and gender.
| Covariate | Point Est. (Stage 2) | 95% Conf. Interval (Stage 2) | Point Est. (Proposed) | 95% Conf. Interval (Proposed) |
|---|---|---|---|---|
| age | 0.053 | (0.017, 0.089) | 0.062 | (0.046, 0.078) |
| gender | 0.036 | (0.001, 0.072) | 0.031 | (0.016, 0.047) |
Figure 3 displays the estimates and confidence bands of the cumulative regression functions for surgery and radiation therapy, which are of most interest in this study. It is seen from Figure 3(b) that, after adjusting for other factors, surgery significantly reduces the risk of second primary cancer or death during the first 2.5 years; after this period, the efficacy of surgery decreases. However, the effects of surgery are inconclusive based on the second stage estimate (Figure 3(a)), since its confidence band is very wide and contains zero. Radiation therapy, as opposed to surgery, seems to have no significant impact on subsequent cancer development. We note that the proposed two-stage estimators are usually closer to the reference estimate from the complete data, with much narrower confidence bands than the second stage estimator.
Figure 3.

95% confidence bands (dashed line) of the second stage and two-stage estimators (solid line) of Aj(t) for surgery and radiation therapy. The thick solid line is obtained from the complete data.
Figure 4 depicts a 95% simultaneous confidence band for the conditional survival function for a white male patient who is diagnosed as cancer grade IV, at distant stage, younger than 60, and treated by both surgery and radiation therapy. The proposed two-stage estimate is more accurate with a much narrower confidence band. For example, the survival probabilities at year 1 and year 3 are estimated to be 0.613 (with variance 0.0016) and 0.332 (0.0019), respectively, using the two-stage estimator, and 0.625 (0.0065) and 0.340 (0.0088), respectively, using the second stage data only.
Figure 4.
95% simultaneous confidence bands (dashed line) for the conditional survival function (solid line) for a white male patient who is diagnosed as grade IV, at distant stage, younger than 60, and has both surgery and radiation therapy. The thick solid line is obtained from the complete data.
6. Discussion
We propose two-stage estimators for the partial linear semiparametric hazard model introduced by McKeague and Sasieni (1994) in a two-stage design setting. We allow measurement errors for survival time, censoring time, and covariates in the first stage data. We also allow missing covariates in the first stage. The proposed estimators are consistent and asymptotically normal. Confidence bands are developed to assess time-varying covariate effects, and to predict conditional survival probabilities. By utilizing information from the first stage, our estimators are more efficient than the second stage estimators for both large and small samples. Reduction in bias is also observed for small samples.
The estimators introduced in this paper take a form that is similar to that of a trick used for variance reduction in the theory of Monte Carlo methods (sometimes called a “control variate”):

m* = m + ρmt(σm/σt)(τ − t),

where m is an unbiased estimate of a parameter of interest, say μ, t is a control variate with known expectation E(t) = τ, σm and σt are the standard deviations of m and t, respectively, and ρmt = corr(m, t). In the control variate theory, σm, σt, and ρmt can be estimated across the Monte Carlo replicates if they are unknown. The estimator m is improved by incorporating the control variate t. Since E(t) = τ, the improved estimator m* is unbiased and has the same expectation as m. In our method, since β1 and A1(t) (which play the role of τ) are unknown, we use the first stage estimates and to replace their expected values. On the other hand, and are more efficient than and due to larger sample sizes, and therefore can serve as good estimates for β1 and A1(t). Another difference between our method and the control variate method is that we use instead of , since the subjects outside the validation set are independent of the subjects in the validation set.
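The control-variate analogy can be seen in a toy Monte Carlo experiment (entirely illustrative data; m plays the role of the estimate to be improved and t the control variate with known mean τ = 0):

```python
import numpy as np

rng = np.random.default_rng(2)

def control_variate_mean(m, t, tau):
    """Improve the mean of m using control variate t with known mean tau.
    The coefficient cov(m, t)/var(t) equals rho_mt * sigma_m / sigma_t."""
    c = np.cov(m, t)[0, 1] / np.var(t, ddof=1)
    return np.mean(m) - c * (np.mean(t) - tau)

z = rng.standard_normal(10_000)
m = 5.0 + z + 0.3 * rng.standard_normal(10_000)  # noisy estimates of mu = 5
t = z                                            # control variate, E[t] = 0
est = control_variate_mean(m, t, 0.0)
```

Because the correction term has mean zero, the adjusted estimator keeps the same expectation while its variance shrinks by the factor 1 − ρ²mt, which mirrors how the proposed two-stage estimator improves on the second stage estimator.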
It is worth noting that our method also works with a working model other than the MS model. For any working model, one can define an estimator of the same form as our proposed estimator and obtain its asymptotic properties following similar steps, provided that the parameter estimator has a similar i.i.d. representation. An interesting question is then how to select the best possible working model, if one exists, for the first stage data in order to maximize the benefit of combining information from the two stages. We do not have a definite answer to this question so far. The issue can be very complicated since the answer may depend on many factors, such as the form of the working model, the parameters to be estimated, the surrogate variables involved, and the criterion for optimality. Future research is warranted. However, as implied by the expression for the asymptotic variance of the proposed estimator, for a given second stage model the first stage working model should be chosen so that the first and second stage estimators are highly correlated. In practice, it is convenient to use the same type of model in the two stages, a strategy that works well when the first stage data are similar to (or highly correlated with) the second stage data. Similarly, we suggest assigning the same type of coefficients to a variable in the two stages. For instance, age and gender have constant coefficients in both stages in our data example.
We only consider a simple design where the second stage is a simple random sample from the first stage sample. In many studies, the second-stage subjects are chosen with different selection probabilities depending on the outcomes of the first stage, the covariates, or both. The unequal-selection-probability sampling scheme is of special importance in medical and epidemiological studies, especially for rare diseases. In a sequel, we will extend our methods to incorporate biased-sampling problems.
Acknowledgement
Gang Li's research was supported in part by the U.S. National Institute of Health grants CA016042 and P01AT003960. The authors thank an associate editor and the referees for their helpful comments. Tong Tong Wu's research is partly supported by NSF CCF-0926194.
Appendix
A.1. Regularity assumptions
Let g ∈ {1, 2} be the stage, and let
The following assumptions are made for the theoretical development.
-
(C1)
Finite intervals. Let τ = sup {t : S1(t|x, z)S2(t|x, z)SC(t|x, z) > 0 for all x, z} be a finite constant, where Sg(t|x, z) is the survival function of stage g and SC(t|x, z) is the survival function of the censoring time. The covariates xgi and zgi are restricted to a bounded set.
-
(C2)
Limiting bounds. The hazard functions hg(t|x, z), g = 1, 2, are bounded uniformly below and above in t, x, and z by some constants b and B, respectively.
-
(C3)
Asymptotic limits. For the subjects in V(n), , , and converge, in the L∞ norm (defined as ||M||∞ = max_{i,j} |m_{ij}| for a matrix M = (m_{ij}) of size m × n), uniformly in time t ∈ [0, τ], in probability, to some deterministic functions Ug(t), Vg(t), and Rg(t), respectively. The functions , , and have limiting functions U1(t), V1(t), and R1(t), respectively. These functions are uniformly continuous on [0, τ] and bounded in absolute value by the constant matrices KU, KV, and KR, respectively. All the matrices are of full rank.
A.2. Proof of Theorem 1
To simplify the notation, we omit the superscript V for the validation set. Let , , , and . Let , where . We have
. It can be shown that and
and I3 = op(1) and I4 = op(1) in L∞ norm in probability; see Wu (2006) for more details. Thus
| (A.1) |
where
It can be shown that E[w1i] = 0 and ||w1i||∞ ≤ K for some constant K. By (A.1) and the Multivariate Central Limit Theorem, converges in distribution to a zero-mean multivariate normal distribution with variance E[w1i w1i′].
Define
where
| (A.2) |
The consistency of the variance estimator can be established by the consistency of , , , , and .
Similarly, it can be shown that
| (A.3) |
It follows from (A.3) and the Multivariate Central Limit Theorem that the finite dimensional distributions of converge to those of a multivariate normal distribution with covariance κ(t1, t2) = E{v1i(t1) v1i(t2)′}. Tightness can be checked using Theorem 13.5 of Billingsley (1999). Thus converges to a zero-mean Gaussian process.
The variance ΣA,11(t) = κ1(t, t) can be estimated by
where is obtained by replacing U1(t), V1(t), h1i(t), β1, and w1i in (2.2) by their respective sample estimates , , , , and . The consistency of the variance estimator follows immediately from the consistency of , , , , and .
A.3. Proof of Lemma 1
Similar to (A.1), it can be shown that
This, together with the Multivariate Central Limit Theorem, proves (2.3). The uniform convergence of the variance and covariance estimators can be easily verified using Theorem 1 of Rao (1963).
A.4. Proof of Lemma 2
Similar to the proof of Lemma 1, we can show that the joint distributions of converge to those of a zero-mean multivariate normal distribution. It can be verified that tightness holds. Therefore converges to a zero-mean Gaussian random field with the variance-covariance function
The covariance function can be consistently estimated by
| (A.4) |
The uniform consistency of , can be verified using Theorem 1 of Rao (1963).
A.5. Proof of Theorem 2
As in Lemma 1, we can show that
, where ρ = lim n/N as n, N → ∞. The proposed estimator can then be written as
We have
with variance-covariance function among , , being
We can write
The joint distribution of the proposed estimators and is
where , the variance of can be derived easily by Slutsky's Theorem and the delta-method:
and is a zero-mean Gaussian process with covariance function given by
The covariance of the two-stage estimators and converges to
The variance and covariance matrices can be estimated by replacing each term by its respective sample estimate. Consistency can be established by Slutsky's Theorem.
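The combination underlying Theorem 2 can be illustrated numerically. The following Monte Carlo sketch is our own toy construction, not the paper's estimator: it uses simple means of an accurate variable and a surrogate in place of the model-based estimators, combines the second stage estimator with the mean-zero difference between the full-sample and subsample first stage estimators, and checks that the combination reduces variance.

```python
import numpy as np

# Toy two-stage setting (illustrative assumptions throughout):
#   x — accurate measurement, conceptually available only at the second stage;
#   w — surrogate for x, available on the whole first stage sample;
#   the second stage is a simple random subsample of size n from N subjects.
rng = np.random.default_rng(0)
N, n, reps = 2000, 400, 4000   # first stage size, second stage size, replications
theta = 1.0                    # true parameter (a mean, for simplicity)

est2, est1_sub, est1_full = [], [], []
for _ in range(reps):
    x = rng.normal(theta, 1.0, size=N)          # accurate measurements
    w = x + rng.normal(0.0, 0.5, size=N)        # surrogate measurements
    sub = rng.choice(N, size=n, replace=False)  # simple random validation subsample
    est2.append(x[sub].mean())                  # second stage estimator
    est1_sub.append(w[sub].mean())              # working-model estimator, subsample
    est1_full.append(w.mean())                  # working-model estimator, full sample
est2, est1_sub, est1_full = map(np.asarray, (est2, est1_sub, est1_full))

d = est1_sub - est1_full                        # mean-zero correction term
c = np.cov(est2, d, bias=True)[0, 1] / np.var(d)  # empirically optimal coefficient
combined = est2 - c * d                         # two-stage (combined) estimator
```

Because `c` is the variance-minimizing coefficient, the Monte Carlo variance of `combined` is strictly below that of `est2` whenever the two are correlated, mirroring the efficiency gain of the proposed two-stage estimator over the second stage estimator alone.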
A.6. Proof of Theorem 4
To show that En(t) and have the same limiting distribution, we define an intermediate process as
It follows that, conditionally on the data, converges weakly in probability to , which is the limiting Gaussian distribution of En(t), according to Theorem 2.9.6 in van der Vaart and Wellner (1996). To complete the proof, we need to show that . We write
Then . We first check that
which converges to zero in probability uniformly in t. It can be similarly shown that the other terms converge to zero in probability uniformly in t.
References
- Aalen OO. Nonparametric inference for a family of counting processes. Ann. Statist. 1978;6:701–726. [Google Scholar]
- Barlow RE, Bartholomew DJ, Bremner JM, Brunk H. Statistical Inference under Order Restrictions. Wiley; New York: 1972. [Google Scholar]
- Billingsley P. Convergence of Probability Measures. Wiley; New York: 1999. [Google Scholar]
- Breslow NE, Holubkov R. Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Statist. Medicine. 1997;16:103–116. doi: 10.1002/(sici)1097-0258(19970115)16:1<103::aid-sim474>3.0.co;2-p. [DOI] [PubMed] [Google Scholar]
- Chen Y-H. Cox regression in cohort studies with validation sampling. J. Roy. Statist. Soc. Ser. B. 2002;64:51–62. [Google Scholar]
- Chen Y-H, Chen H. A unified approach to regression analysis under double sampling design. J. Roy. Statist. Soc. Ser. B. 2000;64:449–460. [Google Scholar]
- Cox DR. Regression models and life tables (with discussion) J. Roy. Statist. Soc. Ser. B. 1972;34:187–220. [Google Scholar]
- Gasser T, Muller H. Kernel Estimation of Regression Functions, Smoothing Techniques for Curve Estimation. Lecture Notes in Mathematics 757. Springer-Verlag; Berlin: 1979. [Google Scholar]
- Jiang J, Zhou H. Additive hazard regression with auxiliary covariates. Biometrika. 2007;94:359–369. [Google Scholar]
- Kulich M, Lin DY. Additive hazards regression with covariate measurement error. J. Amer. Statist. Assoc. 2000;95:238–248. [Google Scholar]
- Li G, Tseng C. Non-parametric estimation of a survival function with two-stage design studies. Scand. J. Statist. 2008;35:193–211. doi: 10.1111/j.1467-9469.2007.00581.x. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lin DY, Ying ZL. Semiparametric analysis of the additive risk model. Biometrika. 1994;81:61–71. [Google Scholar]
- Martinussen T, Scheike TH. Dynamic Regression Models for Survival Data. Springer; New York: 2006. [Google Scholar]
- McKeague I, Sasieni P. A partly parametric additive risk model. Biometrika. 1994;81:501–514. [Google Scholar]
- Rao R. The law of large numbers for D[0,1]-valued random variables. Theory Probab. Appl. 1963;8:70–74. [Google Scholar]
- Scheike TH. The additive nonparametric and semiparametric Aalen model as the rate function for a counting process. Lifetime Data Analysis. 2002;8:247–262. doi: 10.1023/a:1015849821021. [DOI] [PubMed] [Google Scholar]
- Schill W, Jockel K-H, Drescher K, Timm J. Logistic analysis in case-control studies under validation sampling. Biometrika. 1993;80:339–352. [Google Scholar]
- Tseng CH. Ph. D. thesis. University of California; Los Angeles: 2004. Survival analysis with two stage design studies. [Google Scholar]
- van der Vaart AW, Wellner JA. Weak Convergence and Empirical Processes. Springer Verlag; New York: 1996. [Google Scholar]
- Wang C-Y, Hsu L, Feng ZD, Prentice RL. Regression calibration in failure time regression. Biometrics. 1997;53:131–145. [PubMed] [Google Scholar]
- White JE. A two stage design for the study of the relationship between a rare exposure and a rare disease. Amer. J. Epidemiology. 1982;115:119–128. doi: 10.1093/oxfordjournals.aje.a113266. [DOI] [PubMed] [Google Scholar]
- Wu TT. Ph. D. thesis. University of California; Los Angeles: 2006. A Partial Linear Semiparametric Additive Risks Model for Two-Stage Design Survival Studies. [Google Scholar]
- Zhou H, Pepe M. Auxiliary covariate data in failure time regression. Biometrika. 1995;82:139–149. [Google Scholar]


