Author manuscript; available in PMC: 2021 Jun 22.
Published in final edited form as: Biometrics. 2020 Jul 8;77(2):599–609. doi: 10.1111/biom.13317

Developing and Evaluating Risk Prediction Models with Panel Current Status Data

Stephanie Chan 1, Xuan Wang 2, Ina Jazić 1, Sarah Peskoe 1, Yingye Zheng 3, Tianxi Cai 1,*
PMCID: PMC8168594  NIHMSID: NIHMS1705675  PMID: 32562264

Summary:

Panel current status data arise frequently in biomedical studies when the occurrence of a particular clinical condition is only examined at several prescheduled visit times. Existing methods for analyzing current status data have largely focused on regression modeling based on commonly used survival models such as the proportional hazards model and the accelerated failure time model. However, these procedures are often difficult to implement and can perform sub-optimally in relatively small samples. The performance of these procedures is also unclear under model mis-specification. In addition, no methods currently exist to evaluate the prediction performance of estimated risk models with panel current status data. In this paper, we propose a simple estimator under a general class of non-parametric transformation (NPT) models by fitting a logistic regression working model, and we demonstrate that our proposed estimator is consistent for the NPT model parameter up to a scale multiplier. Furthermore, we propose non-parametric estimators for evaluating the prediction performance of the risk score derived from model fitting, which are valid regardless of the adequacy of the fitted model. Extensive simulation results suggest that our proposed estimators perform well in finite samples and that the regression parameter estimators outperform existing estimators under various scenarios. We illustrate the proposed procedures using data from the Framingham Offspring Study.

Keywords: Current status data, model mis-specification, risk prediction, robustness, single-index model

1. Introduction

Accurate prediction of risk plays a crucial role in successful management of disease. Developing and evaluating risk models typically relies on prospective cohort studies where risk factors are collected at baseline and patients are followed over time to observe a clinical outcome of interest. Due to logistical or resource limitations, many cohort studies have limited and pre-scheduled visits to assess the clinical outcome of interest. With a single visit, the time to the occurrence of the clinical condition, denoted by T, is only observable with respect to the current status, giving rise to current status data. More generally, with a small number of prescheduled visits, observable information on T gives rise to panel current status data, alternatively recorded as interval censored data.

To derive risk prediction models using current status data and more generally, interval-censored data, one may fit regression models to derive a composite risk score for predicting T. A number of regression procedures have been developed in the literature. For example, Diamond and McDonald (1992) used the Cox proportional hazards (PH) model in a demographic setting, and Huang et al. (1996) proposed a nonparametric maximum likelihood estimator (NPMLE) for the PH model under interval censoring. Under the proportional odds (PO) model, NPMLE and sieve likelihood estimators were proposed (Huang, 1995; Rossini and Tsiatis, 1996; Huang and Rossini, 1997; Wang and Dunson, 2011). Rabinowitz et al. (2000) used conditional logistic regression to fit PO models to interval censored data. Alternative regression methods including the additive hazard model and the accelerated failure time (AFT) model have also been studied (Lin et al., 1998; Shiboski, 1998; Shen, 2000; Martinussen and Scheike, 2002; Tian and Cai, 2006; Lin and Wang, 2010). However, most existing procedures require substantial effort to implement. Computationally and statistically efficient regression methods for current status and interval censored data remain underdeveloped (Mongoué-Tchokoté and Kim, 2008; McMahan et al., 2013). The iterated convex minorant (ICM) algorithm was implemented in the intcox package to obtain the NPMLE under the Cox model (Henschel et al., 2009), but it has been found in previous simulation studies to produce biased coefficient estimates and does not provide standard errors (McMahan et al., 2013). The EM algorithm based R package, ICsurv (McMahan and Wang, 2014), for estimation under the Cox model is also computationally intensive.

In this paper, we propose a simple conditional censoring logistic (CCL) regression approach to deriving a risk score for predicting T with panel current status data via a flexible nonparametric transformation (NPT) model. The NPT model includes many existing survival models, including the PH, PO and AFT models, as special cases. We demonstrate that under elliptical symmetric distributional assumptions on the covariates and the NPT model, the CCL approach can consistently recover the NPT model regression coefficient up to a scale multiplier. Another major goal of this paper is to provide a model-free procedure to quantify the prediction accuracy of the estimated risk score with panel current status data. Specifically, via kernel smoothing, we propose nonparametric estimators for various commonly-used accuracy parameters to evaluate the predictiveness of the estimated risk score. Kernel smoothed estimators have been previously considered in Van Der Laan and Robins (1998) for the estimation of the survival function and Cox regression with current status data. Here, we employ a similar strategy for the estimation of prediction performance measures. Perturbation resampling procedures are also used to derive confidence intervals (CIs) for various parameters of interest. The proposed inference procedures for the accuracy measures are valid regardless of the adequacy of the NPT model or the elliptical symmetric distributional assumption on the covariates.

The remainder of the paper is organized as follows. In section 2, we introduce our estimation procedures for both the model parameters and prediction performance measures. In section 3, we present simulation results demonstrating that the proposed procedures perform well in finite samples and compare the regression parameter estimator to existing estimators. In section 4, we illustrate our methods using data from the Framingham Offspring Study where we develop and evaluate a risk model for diabetes mellitus. Some concluding remarks are given in section 5.

2. Methods

Let Z denote the p-variate baseline covariates and T denote the event time of interest. Under the panel current status data setting, in lieu of T, one only observes (Δ, C), where Δ = (Δ1, … , ΔK), C = (C1, … , CK), Δk = I(T ≤ Ck) and C is the vector of K visit or monitoring times. Although the visits are pre-scheduled, C is a random vector as patients may come at random times around the scheduled visits. Additionally, since patients may randomly miss certain visits or drop out of the study, we let ξk be the binary variable indicating whether (Ck, Δk) is observed. Throughout, we assume that C is independent of both T and Z and P(ξk = 1 ∣ Δ, C, Z) = ρk > 0. If a patient drops out of the study after visit k, then ξk+1 = ⋯ = ξK = 0. On the other hand, even when Δk = 1, patients may still return to the study for subsequent visits. The data for analysis consist of n independent and identically distributed (i.i.d.) random vectors 𝒟 = {Di = (Δξiᵀ, Cξiᵀ, Ziᵀ)ᵀ, i = 1, … , n}, where ξi = (ξ1i, … , ξKi), Δξi = Δi ⊙ ξi, Cξi = Ci ⊙ ξi, ⊙ represents an element-wise product, and the subscript i indexes the ith subject.

2.1. Deriving the risk score under the NPT model

We assume that Ti ∣ Zi follows a flexible NPT model:

h_0(T_i) = -\beta_0^T Z_i + \epsilon_i, \qquad (1)

where h0(·) is an unspecified smooth strictly increasing function, ϵi are i.i.d. random errors with unknown distribution function g(·), and β0 is the vector of unknown regression parameters. Since β0 is only identifiable up to a scale multiplier, we consider estimation over a compact parameter space

\Omega = \{\beta = (\beta_1, \ldots, \beta_p)^T : \|\beta\|_2 = 1\}, \quad \text{and hence } \|\beta_0\|_2 = 1,

where ∥ · ∥2 is the L2 norm of β. The NPT model can be equivalently written as

P(T \le t \mid Z) = g\{h_0(t) + \beta_0^T Z\}.

Since C is independent of T and Z, under (1), we have

P(\Delta_k = 1 \mid Z, C_k) = g\{h_0(C_k) + \beta_0^T Z\}, \qquad (2)
P(\Delta_k = 1 \mid Z) = G_k(\beta_0^T Z), \qquad (3)

where G_k(x) = \int g\{h_0(c) + x\}\,dF_k(c) and F_k(c) = P(C_k \le c) are also unknown. Thus, under (1), Δk ∣ Z follows a single-index model. Estimation procedures for β0 under single-index models are largely complex and/or computationally intensive (e.g., Gørgens and Horowitz, 1999; Chen, 2002; Gørgens, 2003; Khan and Tamer, 2007; Wang and Wang, 2018). To derive a simple estimator for β0, we note the following linearity condition on the distribution of Z:

\text{The conditional expectation } E[\beta^T Z \mid \beta_0^T Z] \text{ exists and is linear in } \beta_0^T Z \text{ for all } \beta \in \mathbb{R}^p. \qquad \text{(C.1)}

Under (C.1), the direction of β0 can be recovered by maximizing a simple concave objective; see Li and Duan (1989). Specifically, for an objective L(θ, d) that is concave in θ, β0 is proportional to b0, where (a0, b0) = argmax_{a,b} E{L(a + bᵀZ, Δ)}.

To identify such a convex loss, a simple approach is to consider the logistic loss regressing Δki against Zi for each k. Specifically, we may obtain

(\hat\alpha_k, \hat B_k) = \arg\max_{\alpha, B} \sum_{i=1}^n \left\{\Delta_{ki}(\alpha + B^T Z_i) - \log(1 + e^{\alpha + B^T Z_i})\right\}\xi_{ki}.

From condition (C.1), the concavity of the logistic log-likelihood, and Theorem 2.1 of Li and Duan (1989), it is straightforward to show that β̂[k] = B̂k/‖B̂k‖2 consistently estimates β0. One limitation of β̂[k] is that it completely ignores the available censoring times and hence may suffer from loss of efficiency. We instead propose a conditional censoring logistic (CCL) estimator that incorporates the censoring information. To this end, we make another working assumption that h0(t) = α0 + α1H(t), where H(t) is some pre-specified monotone transformation function such as log(·). Using the data from the kth visit, we obtain an estimator of θ = (αᵀ, Bᵀ)ᵀ = (α0, α1, Bᵀ)ᵀ:

\tilde\theta_k = (\tilde\alpha_k^T, \tilde B_k^T)^T = \arg\max_\theta \tilde L_k(\theta),

where

\tilde L_k(\theta) = n^{-1}\sum_{i=1}^n \left\{\Delta_{ki}\theta^T W_{ki} - \log(1 + e^{\theta^T W_{ki}})\right\}\xi_{ki}, \quad \text{and} \quad W_{ki} = (1, H(C_{ki}), Z_i^T)^T.

Our CCL estimator for β0 is then β̃[k] ≡ (β̃k1, … , β̃kp)ᵀ = B̃k/‖B̃k‖2. We show in Appendix A of the Supplementary Materials that β̃[k] is a consistent estimator of β0 under the NPT model (1) and condition (C.1). In addition, due to the concavity of the likelihood function and a uniform law of large numbers (Pollard, 1990), β̃[k] always converges in probability to a deterministic vector β̄[k] ≡ (β̄k1, … , β̄kp)ᵀ = B̄k/‖B̄k‖2, where B̄k is the maximizer of the limiting likelihood function. Under correct specification of model (1) and condition (C.1), β̄[k] = β0. In Appendix A of the Supplementary Materials, we show that n^{1/2}(β̃[k] − β̄[k]) converges in distribution to a multivariate normal distribution.
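For a single visit k, the CCL fit is simply a logistic regression of the current status indicator on (1, H(Cki), Zi) among the observed visits, followed by normalization of the covariate coefficients. The following sketch illustrates this (the function name and use of a generic numerical optimizer are ours, not the authors' implementation; we assume H = log):

```python
import numpy as np
from scipy.optimize import minimize

def ccl_fit(delta_k, C_k, Z, xi_k, H=np.log):
    """Single-visit CCL estimator: logistic regression of Delta_k on
    W = (1, H(C_k), Z) among observed visits (xi_k == 1), returning the
    normalized direction B / ||B||_2 (illustrative sketch)."""
    obs = xi_k == 1
    W = np.column_stack([np.ones(obs.sum()), H(C_k[obs]), Z[obs]])
    y = delta_k[obs]

    def negloglik(theta):
        lp = W @ theta
        # sum of log(1 + e^lp) - y * lp, computed stably
        return np.sum(np.logaddexp(0.0, lp) - y * lp)

    theta_hat = minimize(negloglik, np.zeros(W.shape[1]), method="BFGS").x
    B = theta_hat[2:]              # coefficients of Z only
    return B / np.linalg.norm(B)   # estimated direction of beta_0
```

Because only the direction of the covariate coefficients is retained, the unknown scale of h0 and the error distribution do not need to be estimated.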

Since under model (1) and condition (C.1) we have β̄[k] = β0 = (β01, … , β0p)ᵀ, it is desirable to combine information from {β̃[k], k = 1, … , K} to achieve optimal efficiency in estimating β0 when K ⩾ 2. For example, it is not difficult to show that the distribution of √n(β̃(j) − β0j 1K×1) can be approximated by a multivariate normal distribution with mean zero and some covariance matrix Σj, where β̃(j) = (β̃1j, … , β̃Kj)ᵀ, j = 1, … , p. As such, the optimal (w.r.t. efficiency) linearly combined estimator for β0j can be constructed as β̃j = ŵjᵀβ̃(j), where ŵj = Σ̂j⁻¹1/(1ᵀΣ̂j⁻¹1) and Σ̂j is a consistent estimator of Σj. Our final CCL estimator for β0 is then β̃ = (β̃1, … , β̃p)ᵀ. In practice, we may obtain Σ̂j by estimating the joint distribution of {β̃[k], k = 1, … , K}, as detailed in Appendix B of the Supplementary Materials. It is not difficult to show that √n(β̃ − β̄) converges in distribution to a multivariate zero-mean normal regardless of model adequacy, where β̄ = (β̄1, … , β̄p)ᵀ with β̄j = (β̄1j, … , β̄Kj)wj and wj is the limit of ŵj.
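The weights ŵj = Σ̂j⁻¹1/(1ᵀΣ̂j⁻¹1) are the classical minimum-variance weights among linear combinations whose weights sum to one. A minimal sketch (helper name is ours; it assumes a consistent estimate of Σj is supplied):

```python
import numpy as np

def combine_estimates(beta_kj, Sigma_j):
    """Combine K per-visit estimates of one coefficient beta_{0j} with
    weights w = Sigma^{-1} 1 / (1' Sigma^{-1} 1), the minimum-variance
    linear combination with weights summing to one (illustrative sketch)."""
    ones = np.ones(len(beta_kj))
    Sinv1 = np.linalg.solve(Sigma_j, ones)   # Sigma^{-1} 1 without explicit inverse
    w = Sinv1 / (ones @ Sinv1)
    return w @ beta_kj, w
```

For instance, with two visits and variances 1 and 4 (no correlation), the weights are 0.8 and 0.2: the less precise visit contributes proportionally less.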

2.2. Evaluating prediction performance

After a composite risk score is obtained, it is crucial to evaluate its potential in predicting the event outcome and to develop prognostic classifiers to assist in clinical decision making. To evaluate the prognostic potential of any composite risk score, S = β̄ᵀZ, a wide range of predictiveness measures have been proposed in the literature. For example, one may summarize the accuracy of S in classifying the t-year event status Dt = I(T ≤ t) by:

\text{TPR}_t(s) = P(S \ge s \mid D_t = 1), \quad \text{and} \quad \text{FPR}_t(s) = P(S \ge s \mid D_t = 0).

One may simultaneously assess TPRt(·) and FPRt(·) based on the time-specific Receiver Operating Characteristic (ROC) curve (Uno et al., 2007; Cai and Cheng, 2008), ROCt(u) = TPRt{FPRt⁻¹(u)}. The curve summarizes the trade-offs between TPRt and FPRt, and is helpful in choosing an appropriate threshold s that also incorporates information on the costs of false positives and false negatives. The area under the ROC curve (AUC), AUCt = ∫₀¹ ROCt(u) du, can be used to summarize the overall performance of the risk score S in classifying Dt. Once a threshold is determined to classify subjects into high or low risk groups, it is also of great interest to assess the absolute risk within the risk groups and hence to estimate the time-specific positive predictive value (PPV) and negative predictive value (NPV) parameters:

\text{PPV}_t(s) = P(D_t = 1 \mid S \ge s), \quad \text{and} \quad \text{NPV}_t(s) = P(D_t = 0 \mid S < s).
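When Dt is fully observed, ROCt and AUCt reduce to familiar empirical computations; the sketch below shows the definitions concretely (illustrative only, since with current status data Dt itself is not directly observed):

```python
import numpy as np

def empirical_roc_auc(score, D):
    """Empirical ROC_t curve and AUC_t for a risk score S against a fully
    observed binary status D_t = I(T <= t) (illustrative sketch)."""
    thresholds = np.sort(np.unique(score))[::-1]    # sweep s from high to low
    tpr = np.array([np.mean(score[D == 1] >= s) for s in thresholds])
    fpr = np.array([np.mean(score[D == 0] >= s) for s in thresholds])
    # prepend the (0, 0) corner and integrate by the trapezoidal rule
    fpr = np.concatenate([[0.0], fpr])
    tpr = np.concatenate([[0.0], tpr])
    auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)
    return fpr, tpr, auc
```

A perfectly separating score gives AUC 1, while an uninformative score gives AUC 0.5, matching the usual interpretation of AUCt as P(Si > Sj ∣ Dti = 1, Dtj = 0).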

To validate the performance of the risk model, it is desirable to ensure that the validity of the accuracy estimators does not rely on the correct specification of the risk model, since many statistical models are merely an approximation of the true underlying data generating process. Robust estimation of the aforementioned accuracy parameters under possible model mis-specification has been proposed for right-censored data (Cai and Cheng, 2008; Uno et al., 2007). Here, we propose nonparametric estimation of these parameters with current status data via kernel smoothing. Specifically, we estimate TPRt(s), FPRt(s), PPVt and NPVt respectively as:

\widetilde{\text{TPR}}_t(s) = \frac{\tilde\eta_{1t}(s, \tilde\beta)}{\tilde\eta_{1t}(-\infty, \tilde\beta)}, \quad \widetilde{\text{FPR}}_t(s) = \frac{\tilde\eta_{0t}(s, \tilde\beta)}{\tilde\eta_{0t}(-\infty, \tilde\beta)}, \quad \widetilde{\text{PPV}}_t(s) = \frac{\tilde\eta_{1t}(s, \tilde\beta)}{\tilde\eta_{1t}(s, \tilde\beta) + \tilde\eta_{0t}(s, \tilde\beta)},
\widetilde{\text{NPV}}_t(s) = \frac{\tilde\eta_{0t}(-\infty, \tilde\beta) - \tilde\eta_{0t}(s, \tilde\beta)}{\{\tilde\eta_{0t}(-\infty, \tilde\beta) - \tilde\eta_{0t}(s, \tilde\beta)\} + \{\tilde\eta_{1t}(-\infty, \tilde\beta) - \tilde\eta_{1t}(s, \tilde\beta)\}},

where β~ is a regular estimator for β¯ which can either be the proposed CCL estimator or other existing estimators such as the NPMLE from the Cox model,

\tilde\eta_{\delta t}(s, \beta) = \frac{\sum_{k=1}^K \sum_{i=1}^n I(\beta^T Z_i \ge s)\, I(\Delta_{ki} = \delta)\, K_h(C_{ki} - t)\, \xi_{ki}}{\sum_{k=1}^K \sum_{i=1}^n K_h(C_{ki} - t)\, \xi_{ki}}, \quad \text{for } \delta = 0, 1,

Kh(x) = h⁻¹K(x/h), K(·) is a symmetric smooth density function, and h is a bandwidth parameter such that h = O(n⁻ᵛ) with ν ∈ (1/5, 1/2). In practice, we find that h = σ̂C n⁻ᵛ works well, where σ̂C is the sample standard deviation of the observed {Cki}. For our numerical studies, we let ν = 0.3 to adopt under-smoothing and thereby ensure the validity of interval estimation via resampling methods (Davison and Hinkley, 1997; Tian et al., 2005). In Appendix C of the Supplementary Materials, we outline the justification for the consistency and asymptotic normality of the proposed accuracy estimators.
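The kernel-smoothed estimators above can be sketched as follows, pooling all (subject, visit) records into 1-D arrays and using a Gaussian kernel (function and argument names are ours; this is an illustration of the estimators in the text, not the authors' code):

```python
import numpy as np

def kernel_accuracy(t, s, score, delta, C, xi, h):
    """Kernel-smoothed TPR_t(s), FPR_t(s), PPV_t(s), NPV_t(s) with current
    status data.  Inputs are 1-D arrays over all pooled (subject, visit)
    records: risk score beta'Z_i, status Delta_ki, visit time C_ki and
    observation indicator xi_ki.  Gaussian kernel K_h(x) = K(x/h)/h."""
    w = np.exp(-0.5 * ((C - t) / h) ** 2) / (h * np.sqrt(2.0 * np.pi)) * xi

    def eta(dval, thresh):
        # eta_{delta t}(s): kernel-weighted fraction of visits near time t
        # with score >= s and Delta = delta
        return np.sum((score >= thresh) * (delta == dval) * w) / np.sum(w)

    e1s, e0s = eta(1, s), eta(0, s)
    e1a, e0a = eta(1, -np.inf), eta(0, -np.inf)   # the s -> -infinity versions
    tpr, fpr = e1s / e1a, e0s / e0a
    ppv = e1s / (e1s + e0s)
    npv = (e0a - e0s) / ((e0a - e0s) + (e1a - e1s))
    return tpr, fpr, ppv, npv
```

The only smoothing is over the visit times: records with Cki far from t receive negligible weight, so each estimator is a local version of its uncensored counterpart.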

We propose to use perturbation resampling to obtain standard error (SE) estimates and construct CIs for the covariate coefficient parameter and the accuracy parameters similar to those adopted in the literature (Tian et al., 2005; Uno et al., 2007; Cai and Cheng, 2008, e.g.). Details on the implementation of the resampling procedures are given in Appendix B of the Supplementary Materials for the regression coefficients and in Appendix C of the Supplementary Materials for the accuracy parameters. Such a resampling method has been studied extensively in the literature and the validity of such resampling procedures has been previously established (Tian et al., 2005, e.g.) using results such as the multiplier central limit theorem (Kosorok, 2008, Ch.10). Similar techniques can be used to justify its validity under the present setting.
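A minimal sketch of perturbation resampling for the single-visit CCL direction: each replicate re-solves the weighted logistic objective with i.i.d. mean-one positive weights (Exp(1) here) attached to the subjects, and the spread of the replicates estimates the sampling variability. This is a generic multiplier-bootstrap sketch under our assumptions; the authors' exact implementation is in their Appendix B:

```python
import numpy as np
from scipy.optimize import minimize

def perturbed_ccl(delta, C, Z, xi, n_pert=200, H=np.log, seed=0):
    """Point estimate and perturbation-resampling SEs for the normalized
    single-visit CCL direction (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(delta)
    W = np.column_stack([np.ones(n), H(C), Z])

    def fit(wts):
        # weighted logistic negative log-likelihood; wts are the perturbations
        def nll(theta):
            lp = W @ theta
            return np.sum(wts * xi * (np.logaddexp(0.0, lp) - delta * lp))
        B = minimize(nll, np.zeros(W.shape[1]), method="BFGS").x[2:]
        return B / np.linalg.norm(B)

    est = fit(np.ones(n))                       # unperturbed point estimate
    reps = np.array([fit(rng.exponential(size=n)) for _ in range(n_pert)])
    return est, reps.std(axis=0)                # estimate and resampling SEs
```

Percentile intervals over the replicates give the CIs; because the data are held fixed and only the weights are redrawn, no refitting of the monitoring-time distribution is needed.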

3. Simulation study

We performed extensive simulation studies to evaluate the performance of the proposed CCL estimator and compared it to several existing estimators, including: (1) the NPMLE obtained via the intcox package in R for the PH model (Cox); (2) the estimator obtained using the ICsurv package in R for the PH model (CoxEM); (3) the logistic regression based estimator of Rabinowitz et al. (2000) for the PO model (PO); and (4) the estimator of Tian and Cai (2006) for the AFT model (AFT). All regression parameter estimators were standardized to have unit L2 norm. Throughout, we let n = 1000 and p = 3, and used 1000 simulated datasets in each configuration to summarize the results. For each dataset, 500 perturbations were performed for variance and interval estimation.

We considered generating Z = (Z1, Z2, Z3) from either: (I) Z1, Z2 and Z3 ~ N(0, 5); or (II) Z1, Z2 ~ N(0, 5), Z3 = Z1Z2/3 + e with e ~ N(0, 3). Thus the covariate distributional assumption (C.1) holds in (I) and not in (II). Given Z, we considered generating T from two sets of models:

\text{(M1)}: \gamma_m \log(T) = 10 - (Z_1 + Z_2 + Z_3)/3 + \epsilon_m,
\text{(M2)}: \gamma_m T = \max\{0,\, 40 - (Z_1 + Z_2 + Z_3)/3 + \epsilon_m\},

where we considered three choices for the error term ϵm: (i) ϵ1 ~ extreme value distribution with γ1 = 2, (ii) ϵ2 ~ logistic distribution with γ2 = 2, and (iii) ϵ3 ~ a mixture of N(−2, 1) and N(3, 1) with mixing rate 0.5 and γ3 = 20. The NPT model is correctly specified under (M1) with h0(·) = log(·) but mis-specified under (M2).
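Data generation under (M1) can be sketched as follows; this reflects our reading of the design (linear predictor entering negatively, consistent with model (1), and N(0, 5) read as variance 5), not the authors' original code:

```python
import numpy as np

def generate_m1(n, error="EV", seed=0):
    """Simulate (T, Z) from (M1): gamma_m log(T) = 10 - (Z1+Z2+Z3)/3 + eps_m,
    with Z1, Z2, Z3 ~ N(0, 5) as in covariate setting (I).  Sketch under our
    reading of the simulation design."""
    rng = np.random.default_rng(seed)
    Z = rng.normal(0.0, np.sqrt(5.0), size=(n, 3))  # variance 5
    if error == "EV":
        # log of a unit exponential has cdf 1 - exp(-e^x), the extreme value law
        gamma, eps = 2.0, np.log(rng.exponential(size=n))
    elif error == "LG":
        gamma, eps = 2.0, rng.logistic(size=n)
    else:  # "MX": equal mixture of N(-2, 1) and N(3, 1)
        comp = rng.random(n) < 0.5
        gamma = 20.0
        eps = np.where(comp, rng.normal(-2.0, 1.0, n), rng.normal(3.0, 1.0, n))
    T = np.exp((10.0 - Z.sum(axis=1) / 3.0 + eps) / gamma)
    return T, Z
```

Visit times are then drawn independently of (T, Z), and Δk = I(T ≤ Ck) is recorded at each visit.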

Under covariate setting (I), we expected many of the estimators to converge to β0 after rescaling, and hence we compared their performance with respect to estimation bias and variability; the mean squared error (MSE) can then be used to quantify the relative efficiency of the different estimators. Under covariate setting (II), mis-specification of the working models, including the Cox, PO and logistic regression models, can lead to estimators that no longer converge to β0 = (0.577, 0.577, 0.577). To enable objective comparisons between different estimators of the direction β0 under mis-specification, we note that under both (M1) and (M2), the risk score β0ᵀZ attains the optimal AUC, as shown in Cai and Cheng (2008). We thus also evaluated the overall performance of the different estimators based on an analogue of the MSE, defined as MSEAUC ≡ AUC(β0) − AUC(β̂) ≈ −(β̂ − β0)ᵀÄUC(β0)(β̂ − β0)/2, where ÄUC(β0) is the Hessian matrix of AUC(·) at β0 and the approximation holds because the gradient of AUC(·) vanishes at its maximizer β0. For CCL, we also examined the performance of the resampling procedure for variance and interval estimation. Results from (M2) exhibit patterns similar to those from (M1), and are shown in Tables S1-S5 in Appendix D of the Supplementary Materials.

3.1. Single Visit Time

We first considered the simple scenario with a single visit and generated the monitoring time C1 from Uniform[cl, cu] such that P(Δ1 = 1) was about 50%, 50%, and 30% for the three error distributions. In Table 1(a), we report the average of the estimates and the efficiencies of the various estimators relative to the Cox estimator (RE), in terms of MSE, for covariate setting (I) under (M1). Despite mis-specification of the link function, all estimators have negligible bias in estimating the direction β0 across all three error distributions, owing to the elliptical symmetry of the covariate distribution. Under the PH model with the extreme value error distribution, the Cox estimator is the most efficient, as expected. The proposed estimator is slightly less efficient than the Cox estimator but more efficient than the other estimators. On the other hand, when the error follows the logistic or mixture distribution, the CCL estimator is the most efficient among all competing estimators.

Table 1:

Estimates (Est) and relative efficiencies (RE) of the CoxEM, PO, AFT and CCL estimators compared to the Cox estimator for (a) β0 and (b) the associated accuracy parameters under covariate settings (I) and (M1) with the extreme value (EV), logistic (LG) and mixture (MX) error distributions when K = 1.

(a) Estimation of β0
True Est RE
Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
EV β1 0.58 0.58 0.58 0.58 0.58 0.58 0.44 0.49 0.49 0.92
β2 0.58 0.58 0.58 0.58 0.58 0.58 0.44 0.49 0.50 0.94
β3 0.58 0.58 0.58 0.58 0.58 0.58 0.47 0.46 0.48 0.90
LG β1 0.58 0.58 0.58 0.58 0.57 0.58 0.74 0.79 0.53 1.24
β2 0.58 0.58 0.58 0.58 0.58 0.58 0.75 0.78 0.55 1.21
β3 0.58 0.58 0.58 0.57 0.58 0.58 0.72 0.78 0.54 1.23
MX β1 0.58 0.57 0.57 0.57 0.55 0.57 0.91 0.85 0.44 1.19
β2 0.58 0.56 0.56 0.56 0.55 0.56 0.94 0.95 0.55 1.23
β3 0.58 0.56 0.56 0.56 0.55 0.56 0.92 0.90 0.53 1.18
(b) Estimation of accuracy parameters
True Est RE
Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
EV auc 0.92 0.93 0.93 0.93 0.93 0.93 1.00 0.99 1.00 1.00
roct(.05) 0.70 0.71 0.70 0.70 0.70 0.70 1.00 1.00 0.98 0.99
roct(.10) 0.76 0.79 0.79 0.78 0.78 0.79 1.00 1.01 1.00 1.01
LG auc 0.88 0.88 0.88 0.88 0.88 0.88 1.00 0.99 1.00 1.00
roct(.05) 0.49 0.51 0.51 0.51 0.51 0.51 1.02 1.00 1.01 1.00
roct(.10) 0.65 0.65 0.65 0.65 0.64 0.65 0.99 1.00 0.96 0.99
MX auc 0.68 0.68 0.68 0.68 0.68 0.68 1.01 1.00 1.01 1.01
roct(.05) 0.11 0.12 0.12 0.12 0.12 0.12 1.00 0.98 0.99 1.01
roct(.10) 0.20 0.22 0.22 0.22 0.22 0.22 0.99 0.98 0.97 0.99

We also evaluated the performance of the point estimation procedures for the prediction accuracy parameters AUCt, ROCt(0.05) and ROCt(0.1), where we let t be the corresponding population median survival time for each setting. We estimated the ROC curves using the procedures described above and the AUC using the Stieltjes integral of the estimated ROC curve. As shown in Table 1(b), the estimators of the accuracy parameters have negligible bias. The AUC estimates based on the estimated β0 from the different methods have comparable efficiency. This is expected since the variability in β̂ contributes to the estimated accuracy parameters only at a higher order.

In Table 2(a), we compare the overall performance of the different methods for estimating β0 based on MSEAUC. The CCL and Cox estimators have comparable MSEAUC when the Cox model holds, while CCL outperforms the Cox estimator when the Cox model fails to hold, under both covariate settings. The CCL estimator is again substantially more efficient than the other competing estimators across all settings, suggesting the robust performance of the CCL procedure under different error and covariate distributions.

Table 2:

The MSEAUC of the estimated β0 obtained from Cox, CoxEM, PO, AFT, and CCL under (M1) with extreme value (EV), logistic (LG), and mixture (MX) errors when (a) K = 1 and (b) K = 5. Shown also are the overall efficiency of CoxEM, PO, AFT, and CCL estimators relative to the Cox (RE) with respect to MSEAUC.

(a) K = 1
Z Error MSEAUC RE
Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
(I) EV 0.04 0.09 0.08 0.08 0.04 0.45 0.48 0.49 0.92
LG 0.08 0.11 0.10 0.14 0.06 0.74 0.78 0.54 1.23
MX 0.14 0.15 0.16 0.30 0.12 0.93 0.89 0.49 1.21
(II) EV 0.05 1.25 0.08 0.10 0.05 0.04 0.56 0.47 0.98
LG 0.08 0.11 0.10 0.14 0.06 0.74 0.78 0.54 1.23
MX 0.14 0.14 0.16 0.33 0.12 1.00 0.88 0.44 1.24
(b) K = 5
Z Error MSEAUC RE
Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
(I) EV 0.02 0.03 0.04 0.06 0.02 0.53 0.50 0.29 0.72
LG 0.06 0.06 0.05 0.10 0.04 0.97 1.12 0.57 1.43
MX 0.04 0.04 0.05 0.09 0.04 1.12 0.75 0.48 1.10
(II) EV 0.02 0.17 0.03 0.17 0.02 0.12 0.62 0.12 0.88
LG 0.06 0.15 0.05 0.13 0.04 0.40 1.11 0.47 1.40
MX 0.04 0.04 0.06 0.09 0.04 0.96 0.69 0.48 1.02

We also examined the performance of the resampling-based interval estimation procedure for the CCL estimator. As shown in Table 3, the resampling procedure performed well in assessing variability, with CIs achieving the desired coverage levels. This also suggests that the CCL procedure is not overly sensitive to departures from the covariate distribution assumption (C.1).

Table 3:

Estimates (Est), empirical standard errors (ESE), average of estimated standard errors (ASE) and empirical 95% coverage probabilities (CP) of the proposed CCL estimator for making inference about β¯ and its associated accuracy parameters under model (M1) with extreme value (EV), logistic (LG) and mixture (MX) errors when K = 1.

(I) Z Elliptical symmetric (II) Z Non-elliptical symmetric
True Est ESE ASE CP True Est ESE ASE CP
EV β1 0.58 0.58 0.02 0.02 0.95 0.58 0.58 0.03 0.03 0.94
β2 0.58 0.58 0.02 0.02 0.96 0.58 0.58 0.02 0.03 0.94
β3 0.58 0.58 0.02 0.02 0.96 0.58 0.57 0.02 0.02 0.95
auc 0.92 0.93 0.02 0.02 0.91 0.92 0.92 0.02 0.02 0.92
roct(.05) 0.70 0.70 0.08 0.09 0.92 0.69 0.69 0.08 0.08 0.94
roct(.10) 0.76 0.79 0.07 0.07 0.90 0.76 0.77 0.07 0.07 0.93
LG β1 0.58 0.58 0.03 0.03 0.95 0.58 0.58 0.03 0.03 0.94
β2 0.58 0.58 0.03 0.03 0.94 0.58 0.58 0.03 0.03 0.94
β3 0.58 0.58 0.03 0.03 0.96 0.58 0.58 0.03 0.03 0.94
auc 0.88 0.88 0.03 0.03 0.93 0.87 0.87 0.03 0.03 0.93
roct(.05) 0.49 0.51 0.11 0.12 0.93 0.52 0.52 0.11 0.11 0.93
roct(.10) 0.65 0.65 0.10 0.10 0.94 0.63 0.64 0.09 0.09 0.94
MX β1 0.58 0.57 0.11 0.11 0.95 0.58 0.57 0.11 0.11 0.94
β2 0.58 0.56 0.12 0.11 0.93 0.58 0.56 0.11 0.11 0.94
β3 0.58 0.56 0.12 0.11 0.93 0.58 0.58 0.11 0.10 0.94
auc 0.68 0.68 0.05 0.05 0.95 0.67 0.68 0.05 0.05 0.94
roct(.05) 0.11 0.12 0.07 0.08 0.96 0.12 0.13 0.07 0.08 0.95
roct(.10) 0.20 0.22 0.09 0.09 0.96 0.21 0.23 0.09 0.09 0.95

3.2. Multiple Visit Times

We also performed simulation studies for the setting with multiple visits. Specifically, we let K = 5 and generated five visit times C1i, … , C5i following uniform distributions with censoring rates of approximately 55%, 45%, 35%, 25%, and 15%, respectively, so that Δki = I(Ti ≤ Cki) for k = 1, … , 5. The observed data include {(Δki, Cki, Zi), k = 1, … , 5, i = 1, … , n}. Results on the estimation of β0 and the accuracy parameters for (M1) with covariate setting (I) are shown in Table 4. Similar to the single visit case, the CCL estimator is less efficient than the Cox estimator when the PH model holds, as expected, but is generally more efficient than the Cox estimator when the PH model does not hold. The CCL estimator is again substantially more efficient than the other competing estimators. From Table 2(b), we see that the proposed CCL and the Cox estimators have comparable MSEAUC when the Cox model holds, while in the other settings the CCL estimator attains the smallest MSEAUC among all the estimators considered.

Table 4:

Estimates (Est) and relative efficiencies (RE) of the CoxEM, PO, AFT and CCL estimators compared to the Cox estimator for (a) β0 and (b) the associated accuracy parameters under covariate settings (I) and (M1) with the extreme value (EV), logistic (LG) and mixture (MX) error distributions when K = 5.

(a) Estimation of β0
Est RE
True Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
EV β1 0.58 0.58 0.58 0.58 0.58 0.58 0.56 0.49 0.27 0.70
β2 0.58 0.58 0.58 0.58 0.58 0.58 0.50 0.50 0.29 0.71
β3 0.58 0.58 0.58 0.58 0.58 0.58 0.53 0.51 0.31 0.75
LG β1 0.58 0.58 0.58 0.58 0.58 0.58 0.98 1.12 0.58 1.46
β2 0.58 0.58 0.58 0.58 0.58 0.58 0.96 1.13 0.56 1.42
β3 0.58 0.58 0.58 0.58 0.58 0.58 0.97 1.12 0.58 1.44
MX β1 0.58 0.58 0.58 0.58 0.57 0.58 1.13 0.76 0.54 1.08
β2 0.58 0.58 0.58 0.57 0.57 0.58 1.12 0.71 0.51 1.09
β3 0.58 0.57 0.57 0.57 0.57 0.57 1.10 0.76 0.52 1.11
(b) Estimation of accuracy parameters
Est RE
True Cox CoxEM PO AFT CCL CoxEM PO AFT CCL
EV auc 0.93 0.92 0.92 0.92 0.92 0.92 0.99 1.00 0.97 1.00
roct(.05) 0.70 0.69 0.69 0.69 0.69 0.69 0.99 0.97 0.97 0.98
roct(.10) 0.78 0.78 0.78 0.78 0.78 0.78 0.99 1.00 1.00 1.00
LG auc 0.88 0.88 0.88 0.88 0.88 0.88 1.00 1.01 0.96 1.02
roct(.05) 0.51 0.50 0.50 0.50 0.50 0.50 1.00 0.99 0.99 0.99
roct(.10) 0.65 0.64 0.64 0.64 0.64 0.64 0.99 0.99 0.96 0.99
MX auc 0.58 0.59 0.59 0.59 0.59 0.59 1.00 0.99 1.01 1.01
roct(.05) 0.11 0.12 0.12 0.12 0.11 0.12 1.01 0.99 1.03 1.01
roct(.10) 0.19 0.19 0.19 0.19 0.19 0.19 1.00 0.99 0.99 1.02

Results in Table 5 show that the performance of the point and interval estimates for the covariate coefficient parameters as well as the accuracy parameters is similar to what we observe in the single visit setting with the proposed estimator presenting negligible bias and CIs attaining desired coverage levels.

Table 5:

Estimates (Est), empirical standard errors (ESE), average of estimated standard errors (ASE) and empirical 95% coverage probabilities (CP) of the proposed CCL estimator for making inference about β¯ and its associated accuracy parameters under model (M1) with extreme value (EV), logistic (LG) and mixture (MX) errors when K = 5.

(I) Z Elliptical symmetric (II) Z Non-elliptical symmetric
True Est ESE ASE CP True Est ESE ASE CP
EV β1 0.58 0.58 0.02 0.02 0.94 0.58 0.58 0.02 0.02 0.94
β2 0.58 0.58 0.02 0.02 0.94 0.58 0.58 0.02 0.02 0.94
β3 0.58 0.58 0.02 0.02 0.94 0.58 0.57 0.02 0.02 0.92
auc 0.93 0.92 0.01 0.01 0.95 0.91 0.91 0.01 0.01 0.93
roct(.05) 0.70 0.69 0.03 0.03 0.93 0.68 0.68 0.03 0.03 0.95
roct(.10) 0.78 0.78 0.03 0.03 0.94 0.76 0.76 0.03 0.03 0.94
LG β1 0.58 0.58 0.03 0.02 0.94 0.58 0.58 0.02 0.02 0.93
β2 0.58 0.58 0.03 0.02 0.94 0.58 0.58 0.02 0.02 0.93
β3 0.58 0.58 0.02 0.02 0.94 0.58 0.58 0.02 0.02 0.95
auc 0.88 0.88 0.01 0.01 0.94 0.87 0.87 0.01 0.01 0.94
roct(.05) 0.51 0.50 0.04 0.04 0.94 0.53 0.52 0.04 0.04 0.94
roct(.10) 0.65 0.64 0.03 0.04 0.95 0.64 0.63 0.03 0.03 0.95
MX β1 0.58 0.58 0.06 0.05 0.90 0.59 0.59 0.06 0.05 0.91
β2 0.58 0.58 0.06 0.05 0.90 0.58 0.59 0.06 0.05 0.92
β3 0.58 0.57 0.06 0.05 0.92 0.55 0.55 0.06 0.05 0.92
auc 0.58 0.59 0.04 0.02 0.94 0.58 0.59 0.04 0.02 0.93
roct(.05) 0.11 0.12 0.03 0.03 0.97 0.13 0.13 0.03 0.03 0.95
roct(.10) 0.19 0.19 0.03 0.04 0.95 0.19 0.20 0.03 0.03 0.95

4. Diabetes risk prediction with the Framingham Offspring Study

The Framingham Offspring Study was established in 1971 with 5,124 participants, who were prospectively monitored for epidemiological and genetic risk factors of cardiovascular disease (CVD) (Kannel et al., 1979). This study has been used extensively in the literature for identifying CVD risk factors and developing CVD risk models. Along with the CVD risk factors, the study also collected information on diabetes status at each exam. We aim to apply our methods to develop a risk prediction model for diabetes using risk factors including age, smoking status, systolic and diastolic blood pressure (SBP and DBP), high density lipoprotein (HDL), total cholesterol levels, and hypertension medication status, as well as measurements of C-Reactive Protein (CRP), an inflammation marker. Since these risk factors change over time, we used exam 2 as the baseline and aimed to predict the risk of developing diabetes in the next 5 and 15 years using diabetes status at exams 3, 4, 5, 6 and 7. Subjects who had already developed diabetes at exam 2 or had missing data were excluded from our analysis, leaving a total of 2808, 2889, 2774, 2567 and 2504 subjects for exams 3, 4, 5, 6 and 7, respectively.

In Table 6, we present the estimated regression coefficients as well as the performance of the resulting risk models in predicting the short term (5-year) and long term (15-year) risk of developing diabetes. Consistent with the existing literature, our analysis demonstrates that being on hypertension medication is significantly associated with a higher risk of diabetes (Gress et al., 2000). High HDL is inversely associated with the risk of diabetes. Higher levels of CRP are also associated with an increased risk of diabetes, as observed previously in other studies (Freeman et al., 2002; Pradhan et al., 2001).

Table 6:

Estimated L2-norm-standardized regression coefficients, along with the accuracy of the resulting models for predicting short-term (t0 = 5 years) and long-term (t0 = 15 years) risk of developing diabetes, based on the Cox model and the CCL estimator, together with their 95% CIs, for the Framingham Offspring data.

Coefficients: Est (SE) [95% CI] | CCL | Cox
Sex | 0.232 (0.194) [−0.149, 0.613] | 0.481 (0.224) [0.041, 0.921]
Age | 0.022 (0.013) [−0.004, 0.049] | 0.048 (0.017) [0.015, 0.082]
Smoking | −0.026 (0.191) [−0.400, 0.349] | 0.071 (0.243) [−0.406, 0.547]
SBP | 0.014 (0.007) [0.000, 0.027] | 0.035 (0.010) [0.015, 0.056]
DBP | 0.032 (0.014) [0.004, 0.060] | 0.035 (0.016) [0.004, 0.065]
HDL | −0.049 (0.014) [−0.078, −0.021] | −0.082 (0.018) [−0.117, −0.047]
Total Cholesterol | 0.006 (0.003) [0.000, 0.012] | 0.009 (0.003) [0.002, 0.016]
Hypertension Medication | 0.786 (0.138) [0.515, 1.056] | 0.489 (0.268) [−0.035, 1.014]
CRP levels | 0.569 (0.112) [0.350, 0.789] | 0.716 (0.130) [0.461, 0.971]
AUC, t0 = 5 | 0.849 (0.043) [0.765, 0.933] | 0.855 (0.042) [0.772, 0.938]
TPR at FPR⁻¹(0.05), t0 = 5 | 0.228 (0.117) [0.000, 0.457] | 0.392 (0.162) [0.144, 0.639]
PPV at FPR⁻¹(0.05), t0 = 5 | 0.059 (0.029) [0.001, 0.116] | 0.097 (0.033) [0.032, 0.162]
NPV at FPR⁻¹(0.05), t0 = 5 | 0.989 (0.003) [0.983, 0.995] | 0.991 (0.003) [0.986, 0.997]
AUC, t0 = 15 | 0.754 (0.034) [0.688, 0.820] | 0.759 (0.033) [0.694, 0.824]
TPR at FPR⁻¹(0.05), t0 = 15 | 0.207 (0.068) [0.073, 0.340] | 0.232 (0.065) [0.103, 0.360]
PPV at FPR⁻¹(0.05), t0 = 15 | 0.276 (0.066) [0.146, 0.405] | 0.299 (0.063) [0.167, 0.422]
NPV at FPR⁻¹(0.05), t0 = 15 | 0.929 (0.010) [0.908, 0.949] | 0.931 (0.010) [0.911, 0.951]

To evaluate the performance of the risk model, we also report the AUC, TPR, PPV and NPV at an FPR of 0.05 for the fitted risk model for both t0 = 5 and t0 = 15 in Table 6. The AUC for t0 = 5 is 0.849 with a 95% CI of [0.765, 0.933], suggesting reasonably high prediction accuracy. For t0 = 15, the AUC is 0.754 with a 95% CI of [0.688, 0.820], indicating moderate prediction accuracy. This level of accuracy is similar to that of some existing risk models and slightly lower than that of others (Noble et al., 2011; Abbasi et al., 2012), although we note that most existing models predict 10-year risk and use different sets of predictors. If we set the threshold for classifying someone as high risk to attain an FPR of 0.05, then for t0 = 5 the corresponding TPR is 0.228 with 95% CI [0.000, 0.457], the PPV is 0.059 with 95% CI [0.001, 0.116], and the NPV is 0.989 with 95% CI [0.983, 0.995]. For t0 = 15, the corresponding TPR is 0.207 with 95% CI [0.073, 0.340], the PPV is 0.276 with 95% CI [0.146, 0.405], and the NPV is 0.929 with 95% CI [0.908, 0.949].
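In the paper these accuracy measures are estimated non-parametrically with kernel smoothing to handle the unobserved event times. With fully observed binary t0-year outcomes, the threshold-based definitions reduce to simple empirical proportions, which the following sketch illustrates (the quantile-based cutoff rule is an illustrative choice, not the paper's estimator):

```python
def accuracy_at_fpr(scores_pos, scores_neg, fpr_target=0.05):
    """TPR, PPV, NPV at the cutoff whose empirical FPR is at most fpr_target.

    scores_pos: risk scores of subjects who develop the event by t0
    scores_neg: risk scores of subjects who remain event-free at t0
    """
    neg_sorted = sorted(scores_neg)
    # cutoff at the (1 - fpr_target) empirical quantile of the controls
    k = min(int((1 - fpr_target) * len(neg_sorted)), len(neg_sorted) - 1)
    cutoff = neg_sorted[k]
    tp = sum(s > cutoff for s in scores_pos)
    fp = sum(s > cutoff for s in scores_neg)
    fn = len(scores_pos) - tp
    tn = len(scores_neg) - fp
    return (tp / len(scores_pos),                 # TPR
            tp / (tp + fp) if tp + fp else 0.0,   # PPV
            tn / (tn + fn))                       # NPV
```

With current status data, subjects are not directly labeled as cases or controls at t0, which is why the paper's estimator replaces these indicator counts with smoothed, censoring-adjusted versions.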

For comparison, we also obtained normalized regression coefficient estimates from the Cox model, with standard errors estimated via the bootstrap, and subsequently assessed the prediction accuracy of the resulting model as described in Section 2.2. The point estimates of β̄ from the CCL and Cox approaches differ slightly from each other, although the directions and relative magnitudes across risk factors are similar. It is worth noting that, since a few of the covariates are binary, condition (C.1) is unlikely to hold, so a difference between the Cox and CCL estimators is not surprising. Nevertheless, the simple CCL approach yielded a risk model with prediction performance similar to that of the Cox model, further demonstrating the robustness and practical utility of the CCL approach.
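Because transformation-model coefficients are identifiable only up to scale, the comparison above is made after rescaling each coefficient vector to unit L2 norm, so that only the direction is compared; a minimal sketch:

```python
import math

def l2_normalize(beta):
    """Rescale a coefficient vector to unit L2 norm, so that only its
    direction (the quantity identified under the NPT model) is compared."""
    norm = math.sqrt(sum(b * b for b in beta))
    return [b / norm for b in beta]
```

Applying this to both the CCL and Cox coefficient vectors puts the two sets of estimates on a common scale before contrasting them.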

5. Discussion

We propose in this paper a computationally simple yet robust CCL estimator for the NPT model with panel current status data. The CCL estimator only requires fitting multiple logistic regression working models to the sequence of binary current status outcomes. The consistency of the CCL estimator under the NPT model relies on the elliptical symmetry assumption (C.1) on the covariates. Existing methods such as those discussed in Zhu and Neuhaus (2003) and Huffer and Park (2007) can be used to verify whether assumption (C.1) holds for a given dataset. However, even when (C.1) and/or the NPT model fails to hold, the linear score derived from CCL, β̄ᵀZ, remains a sensible risk score for predicting T, and our proposed accuracy estimators for β̄ᵀZ remain valid. Our numerical results also demonstrate that the CCL estimator is not sensitive to departures from condition (C.1) and is more efficient in estimating the direction of β than the existing Cox regression method when the proportional hazards model fails to hold. Although logistic regression has previously been used for inference under the PO model with current status data, the CCL approach possesses additional robustness in that it is valid under the broader class of NPT models.
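The working-model step amounts to ordinary logistic regressions of the current status indicator on Z together with a transformation of the monitoring time. The following self-contained sketch fits one such working model by Newton-Raphson (IRLS) on data simulated from a proportional odds model, under which the logistic working model happens to be exactly correct; the simulation design and all names are illustrative, not the paper's implementation:

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Newton-Raphson (IRLS) fit of a logistic regression; X includes an
    intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

# Simulate current status data under a proportional odds model:
# P(T <= c | Z) = expit(log c + b'Z), so the working model is exact here.
rng = np.random.default_rng(0)
n, b = 5000, np.array([1.0, -0.5])
Z = rng.standard_normal((n, 2))
C = np.exp(rng.uniform(-1.0, 1.0, n))            # monitoring times
p_event = 1.0 / (1.0 + np.exp(-(np.log(C) + Z @ b)))
delta = (rng.uniform(size=n) < p_event).astype(float)

X = np.column_stack([np.ones(n), np.log(C), Z])  # H(C) = log C as a covariate
beta_hat = fit_logistic(X, delta)                # roughly recovers (0, 1, b)
```

Under the broader NPT model the fitted slope is consistent for the direction of the true coefficient only, which is why the paper reports L2-norm standardized coefficients.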

The CCL approach can effectively combine information from multiple visits. The optimal linear combination strategy used in CCL is more robust than the standard generalized estimating equation approach since it can weigh information from multiple visits differently to maximally improve efficiency. For simplicity, we presented results on optimally combining {β̃_1j, …, β̃_Kj} for the estimation of β_0j, which is equivalent to combining β̃_1j with (β̃_2j − β̃_1j, …, β̃_Kj − β̃_1j). It is possible to further gain efficiency by considering more extensive combinations and identifying an optimal W to minimize the variance of β̃_1j + (β̃_[2]ᵀ − β̃_[1]ᵀ, …, β̃_[K]ᵀ − β̃_[1]ᵀ)W. However, when p or K is not small, the dimensionality of W is high and shrinkage estimation may be needed to obtain stable estimates of the optimal W, as in Tian et al. (2012) and Zheng and Cai (2017).
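The principle behind the variance-minimizing combination can be seen in the simplest case: independent unbiased estimates of a common scalar parameter are optimally combined with weights proportional to their inverse variances. The paper's optimal W additionally accounts for the covariance of the per-visit estimates; this sketch shows only the independent special case:

```python
def inverse_variance_combine(estimates, variances):
    """Variance-minimizing linear combination of independent, unbiased
    estimates of a common scalar parameter.

    Returns the combined estimate and its variance; the weights are
    proportional to 1/variance and sum to one.
    """
    w = [1.0 / v for v in variances]
    total = sum(w)
    w = [wi / total for wi in w]
    combined = sum(wi * est for wi, est in zip(w, estimates))
    return combined, 1.0 / total
```

Precise estimates dominate the combination, so an exam with many informative subjects contributes more than a sparsely attended one.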

Although not necessary for consistency, properly accounting for the censoring time in the working regression model can improve the efficiency in estimating β_0. As shown in Appendix F of the Supplementary Materials, the simple estimator introduced in Section 2.1, which ignores the censoring time, is substantially less efficient than the CCL estimator that includes H(C_ik) as a covariate. We chose H(t) = log(t) for simplicity in our numerical analyses, but our methods can easily accommodate more flexible parametrizations of h(t), such as α_0 + Σ_{l=1}^{L} H_l(t) with H_l(·) being spline basis functions that approximate h(t).
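One simple choice of such a basis is the truncated-line (linear spline) basis, whose coefficients are then estimated as part of the logistic working model; a minimal sketch with illustrative knot locations:

```python
def linear_spline_basis(t, knots):
    """Evaluate a linear spline basis at time t: the identity function plus
    one hinge term max(t - knot, 0) per knot. The returned values enter the
    working regression as covariates, in place of the single term log(t)."""
    return [t] + [max(t - k, 0.0) for k in knots]
```

Richer bases (e.g., cubic B-splines) follow the same pattern: each basis function simply becomes an additional covariate in the working model.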

A major limitation of the proposed framework is the requirement that the monitoring time be independent of both T and Z. Unfortunately, the proposed CCL framework cannot easily be extended to the setting with covariate-dependent censoring, due to its reliance on a simple single-index model for Δ ∣ Z. With covariate-dependent censoring, the non-parametric estimator for the accuracy parameters would also need to be modified to include estimators of the conditional density of C given Z, which is not feasible when p is not small, due to the curse of dimensionality, as with right censored data (Robins and Ritov, 1997). Additional model assumptions on C given Z would be required to derive valid estimators for the NPT model as well as for accuracy estimation, which warrants further research.

Our current procedure requires Z to be a baseline covariate, with the goal of making predictions at baseline for the risk of developing a future event. In the presence of time-varying covariates, it is plausible to extend the NPT model as P{T ≤ t ∣ Z̄(t)} = g{h_0(t) + β_0ᵀZ(t)}, which is equivalent to h_0(T) = β_0ᵀZ(T) + ε, where Z̄(t) = {Z(s), s ≤ t}. However, the proposed CCL estimation procedure for β_0 is no longer valid in this setting, and it is highly challenging, if not impossible, to modify the CCL-type estimator to incorporate time-varying covariates, because Z(T) is never directly observable with current status data. Developing estimation procedures for the NPT model with time-varying covariates warrants further research.

To the best of our knowledge, our paper is the first to provide a robust and consistent estimator of prediction performance measures for an estimated risk model with panel current status data. The nonparametric estimator is simple to implement and performs well in practice. This accuracy estimator is also applicable to risk scores derived under any regression approach including the NPMLE under the Cox model. Throughout our numerical studies, we employ a simple plug-in estimator for the bandwidth h and find it to perform well in practice. More data-adaptive methods for selecting an optimal bandwidth warrant further research.
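The paper's specific plug-in rule for the bandwidth is given in its Supplementary Materials; as a purely illustrative stand-in, a generic rule of the form h = σ̂ n^(−1/3) (undersmoothing relative to the usual n^(−1/5) density-estimation rate) might look like:

```python
def plugin_bandwidth(scores, rate=-1.0 / 3.0):
    """Generic plug-in bandwidth: sample SD of the risk scores times n^rate.
    Illustrative only; the paper's actual rule is in its supplement."""
    n = len(scores)
    mean = sum(scores) / n
    sd = (sum((s - mean) ** 2 for s in scores) / (n - 1)) ** 0.5
    return sd * n ** rate
```

Scaling by the sample SD makes the bandwidth invariant to linear rescalings of the risk score, which is convenient since the score itself is only identified up to scale.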

Availability of data

The data that support the findings in this paper are available from the corresponding author upon reasonable request.

Supplementary Material


Acknowledgement

The work is partially supported by grants R21-CA242940, U01-CA86368, R01-CA236558, and R01-HL089778 from the National Institutes of Health. The Framingham Heart Study and the Framingham SHARe project are conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University. The Framingham SHARe data used for the analyses described in this manuscript were obtained through dbGaP (access number: phs000007.v3.p2). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or the NHLBI. Stephanie Chan and Xuan Wang contributed equally to this research.

Footnotes

Supporting Information

An R package implementing the proposed procedures, named PanelCurrentStatus, is available at https://celehs.github.io/PanelCurrentStatus/. Web Appendices and Tables referenced in Sections 2, 3, 4 and 5 are available with this paper at the Biometrics website on Wiley Online Library.

Publisher's Disclaimer: This article has been accepted for publication and undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1111/biom.13317

References

  1. Abbasi A, Peelen LM, Corpeleijn E, van der Schouw YT, Stolk RP, Spijkerman AM, Moons KG, Navis G, Bakker SJ, Beulens JW, et al. (2012). Prediction models for risk of developing type 2 diabetes: systematic literature search and independent external validation study.
  2. Cai T and Cheng S (2008). Robust combination of multiple diagnostic tests for classifying censored event times. Biostatistics 9, 216–233.
  3. Chen S (2002). Rank estimation of transformation models. Econometrica 70, 1683–1697.
  4. Davison AC and Hinkley DV (1997). Bootstrap Methods and Their Application, volume 1. Cambridge University Press.
  5. Diamond I and McDonald J (1992). Analysis of current-status data. In Demographic Applications of Event History Analysis, pages 231–252. Oxford University Press.
  6. Freeman DJ, Norrie J, Caslake MJ, Gaw A, Ford I, Lowe GD, O'Reilly DSJ, Packard CJ, and Sattar N (2002). C-reactive protein is an independent predictor of risk for the development of diabetes in the West of Scotland Coronary Prevention Study. Diabetes 51, 1596–1600.
  7. Gørgens T (2003). Semiparametric estimation of censored transformation models. Journal of Nonparametric Statistics 15, 377–393.
  8. Gørgens T and Horowitz JL (1999). Semiparametric estimation of a censored regression model with an unknown transformation of the dependent variable. Journal of Econometrics 90, 155–191.
  9. Gress TW, Nieto FJ, Shahar E, Wofford MR, and Brancati FL (2000). Hypertension and antihypertensive therapy as risk factors for type 2 diabetes mellitus. New England Journal of Medicine 342, 905–912.
  10. Henschel V, Heiss C, and Mansmann U (2009). intcox: Iterated convex minorant algorithm for interval censored event data. R package.
  11. Huang J (1995). Maximum likelihood estimation for proportional odds regression model with current status data. Lecture Notes-Monograph Series, pages 129–145.
  12. Huang J et al. (1996). Efficient estimation for the proportional hazards model with interval censoring. The Annals of Statistics 24, 540–568.
  13. Huang J and Rossini A (1997). Sieve estimation for the proportional-odds failure-time regression model with interval censoring. Journal of the American Statistical Association 92, 960–967.
  14. Huffer FW and Park C (2007). A test for elliptical symmetry. Journal of Multivariate Analysis 98, 256–281.
  15. Kannel WB, Feinleib M, McNamara PM, Garrison RJ, and Castelli WP (1979). An investigation of coronary heart disease in families: the Framingham Offspring Study. American Journal of Epidemiology 110, 281–290.
  16. Khan S and Tamer E (2007). Partial rank estimation of duration models with general forms of censoring. Journal of Econometrics 136, 251–280.
  17. Kosorok MR (2008). Introduction to Empirical Processes and Semiparametric Inference. Springer.
  18. Li K-C and Duan N (1989). Regression analysis under link violation. The Annals of Statistics, pages 1009–1052.
  19. Lin D, Oakes D, and Ying Z (1998). Additive hazards regression with current status data. Biometrika 85, 289–298.
  20. Lin X and Wang L (2010). A semiparametric probit model for case 2 interval-censored failure time data. Statistics in Medicine 29, 972–981.
  21. Martinussen T and Scheike TH (2002). Efficient estimation in additive hazards regression with current status data. Biometrika 89, 649–658.
  22. McMahan C and Wang L (2014). ICsurv: A package for semiparametric regression analysis of interval-censored data. R package.
  23. McMahan CS, Wang L, and Tebbs JM (2013). Regression analysis for current status data using the EM algorithm. Statistics in Medicine 32, 4452–4466.
  24. Mongoué-Tchokoté S and Kim J-S (2008). New statistical software for the proportional hazards model with current status data. Computational Statistics & Data Analysis 52, 4272–4286.
  25. Noble D, Mathur R, Dent T, Meads C, and Greenhalgh T (2011). Risk models and scores for type 2 diabetes: systematic review. BMJ 343, d7163.
  26. Pollard D (1990). Empirical Processes: Theory and Applications. NSF-CBMS Regional Conference Series in Probability and Statistics, pages i–86.
  27. Pradhan AD, Manson JE, Rifai N, Buring JE, and Ridker PM (2001). C-reactive protein, interleukin 6, and risk of developing type 2 diabetes mellitus. JAMA 286, 327–334.
  28. Rabinowitz D, Betensky RA, and Tsiatis AA (2000). Using conditional logistic regression to fit proportional odds models to interval censored data. Biometrics 56, 511–518.
  29. Robins JM and Ritov Y (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semi-parametric models. Statistics in Medicine 16, 285–319.
  30. Rossini A and Tsiatis A (1996). A semiparametric proportional odds regression model for the analysis of current status data. Journal of the American Statistical Association 91, 713–721.
  31. Shen X (2000). Linear regression with current status data. Journal of the American Statistical Association 95, 842–852.
  32. Shiboski SC (1998). Generalized additive models for current status data. Lifetime Data Analysis 4, 29–50.
  33. Tian L and Cai T (2006). On the accelerated failure time model for current status and interval censored data. Biometrika 93, 329–342.
  34. Tian L, Cai T, Zhao L, and Wei L-J (2012). On the covariate-adjusted estimation for an overall treatment difference with data from a randomized comparative clinical trial. Biostatistics 13, 256–273.
  35. Tian L, Zucker D, and Wei L (2005). On the Cox model with time-varying regression coefficients. Journal of the American Statistical Association 100, 172–183.
  36. Uno H, Cai T, Tian L, and Wei L (2007). Evaluating prediction rules for t-year survivors with censored regression models. Journal of the American Statistical Association, pages 527–537.
  37. Van der Laan MJ and Robins JM (1998). Locally efficient estimation with current status data and time-dependent covariates. Journal of the American Statistical Association 93, 693–701.
  38. Wang L and Dunson DB (2011). Semiparametric Bayes' proportional odds models for current status data with underreporting. Biometrics 67, 1111–1118.
  39. Wang Q and Wang X (2018). Analysis of censored data under heteroscedastic transformation regression models with unknown transformation function. Canadian Journal of Statistics 46, 233–245.
  40. Zheng Y and Cai T (2017). Augmented estimation for t-year survival with censored regression models. Biometrics 73, 1169–1178.
  41. Zhu L-X and Neuhaus G (2003). Conditional tests for elliptical symmetry. Journal of Multivariate Analysis 84, 284–298.
