Author manuscript; available in PMC: 2019 Jun 20.
Published in final edited form as: Ann Appl Stat. 2017 Jul 20;11(2):1063–1084. doi: 10.1214/17-AOAS1036

FLEXIBLE RISK PREDICTION MODELS FOR LEFT OR INTERVAL-CENSORED DATA FROM ELECTRONIC HEALTH RECORDS

Noorie Hyun *, Li C Cheung *, Qing Pan , Mark Schiffman *, Hormuzd A Katki *
PMCID: PMC6586434  NIHMSID: NIHMS995818  PMID: 31223347

Abstract

Electronic health records are a large and cost-effective data source for developing risk-prediction models. However, for screen-detected diseases, standard risk models (such as Kaplan–Meier or Cox models) do not account for key issues encountered with electronic health record data: left-censoring of pre-existing (prevalent) disease, interval-censoring of incident disease, and ambiguity about whether disease is prevalent or incident when definitive disease ascertainment is not conducted at baseline. Furthermore, researchers might conduct novel screening tests only on a complex two-phase subsample. We propose a family of weighted mixture models that account for left/interval-censoring and complex sampling via inverse-probability weighting in order to estimate current and future absolute risk: a weakly-parametric model for general use and a semiparametric model for checking the goodness of fit of the weakly-parametric model. We demonstrate asymptotic properties analytically and by simulation. We used electronic health records to assemble a cohort of 33,295 human papillomavirus (HPV) positive women undergoing cervical cancer screening at Kaiser Permanente Northern California (KPNC) that underlie current screening guidelines. The next guidelines would focus on HPV typing tests, but reporting 14 HPV types is too complex for clinical use. The National Cancer Institute, along with KPNC, conducted an HPV typing test on a complex subsample of 9258 women in the cohort. We used our model to estimate the risk due to each type and grouped the 14 types (3-year risks ranging from 21.9 to 1.5) into 4 risk-bands to simplify reporting to clinicians and guidelines. These risk-bands could be adopted by future HPV typing tests and future screening guidelines.

Keywords: Mixture model, interval censoring, two-phase sampling, B-splines, weighted likelihood, HPV

1. Introduction.

Many large-scale epidemiologic cohort studies are being organized within health-care providers who have large populations of patients to recruit, preexisting infrastructure for longitudinal visits, and electronic health records to facilitate data collection. For example, we collaborated with Kaiser Permanente Northern California (KPNC) to assemble a cohort of women in cervical cancer screening by linking electronic records of patient information, test results and disease outcomes [Castle et al. (2009)]. Nearly all women underwent testing for human papillomavirus (HPV), the cause of nearly all cervical cancer. We previously used this cohort to develop the cancer risk calculations underlying current HPV-based screening guidelines [Katki et al. (2013), Massad et al. (2013)], which are available in the official guidelines App (http://www.asccp.org/store-detail2/asccp-mobile-app).

In light of our experience, we have developed new risk modeling methodology for electronic health record data for screen-detected diseases. We address three key issues that make it inappropriate to calculate risk using standard methods, such as Kaplan–Meier [Kaplan and Meier (1958)] or Cox models [Cox (1972)].

First, prevalent disease could exist at enrollment, and separating out risk of prevalent disease is important because clinicians are primarily concerned with the risk that disease is present. Furthermore, doctors have little interest in when a cancer currently detected might have arisen in the past. Thus, it suffices to consider prevalent disease as a left-censored point-mass at time zero, taken as the earliest time at which there exist health records for the outcomes and covariates. The idea of modeling prevalent disease as a point mass at time zero is the obverse of the cure model for two heterogeneous sub-populations, where there is a point mass at time infinity [Li, Taylor and Sy (2001), Ma (2010), Shao et al. (2014)].

However, prevalent disease is not always diagnosed at baseline. People with missing or negative screening test results generally do not undergo definitive disease ascertainment, such as biopsies. Consequently, disease diagnosed at future visits is a mixture of truly incident disease and undiagnosed prevalent disease. A mixture of prevalent and incident disease is a key feature of health record data. It is also commonly found, but ignored, in epidemiologic cohorts: for example, in case-cohort studies that estimate the incidence rate of an asymptomatic disease, cases diagnosed after baseline are treated as having occurred after baseline, under the assumption that diagnosis dates equal disease onset dates.

The second key issue is that incident disease events are often interval-censored between irregular visits. Researchers working with data from health providers typically cannot influence the timing of visits, and patients return at intervals that are quite irregular. Ignoring interval-censoring leads to invalid inferences [Dorey, Little and Schenker (1993), Odell, Anderson and D’Agostino (1992), Rücker and Messerer (1988)], especially when intervals are irregular. Furthermore, standard interval-censoring methods [cf. Huang and Rossini (1997), Huang and Wellner (1997), Ma (2010), Tian and Cai (2006), Wang et al. (2016), Zhang, Hua and Huang (2010)] do not account for diagnosed or undiagnosed prevalent disease.

The final issue we address is estimating absolute risk from two-phase stratified samples nested within the cohort. Electronic health record information is available on everyone (phase 1), and the new screening tests are available only on a sample of the cohort (phase 2). Conducting biomarker measurements only on a judicious sample can be cost-efficient in using cohort resources [Woodward (1999)]. Estimating absolute risk for the full cohort requires accounting for the sample design, for example, the sampling fractions. We focus on the Horvitz–Thompson (design-based) estimation [Horvitz and Thompson (1952)] in this manuscript.

We propose a family of mixture models, called “prevalence-incidence” models, for estimating cumulative risk and assessing covariate effects. All details are presented for the useful special case where prevalent disease is modeled with logistic regression and incident disease is modeled with a Cox model (“logistic-Cox”). The semiparametric model is computationally intensive, and its estimates converge asymptotically at slow rates, although modern computational resources combined with big data can mitigate these problems. However, low event rates despite a large sample, and the need to run multiple data analyses, can be barriers to using the semiparametric model with a bootstrap-based inference procedure. We therefore propose a weakly-parametric model that uses a monotone spline for the baseline cumulative hazard. As a practical diagnostic, we propose using the semiparametric estimates, computed with an iterative algorithm, to graphically assess the fit of the weakly-parametric model. We extend our models to account for two-phase stratified sampling via inverse-probability weighting (IPW) by sample inclusion probabilities [cf. Breslow and Wellner (2007), Cai and Zheng (2013), Kovalchik and Pfeiffer (2014), Saegusa (2015)].

We used electronic health records to link data at KPNC for 33,295 HPV-positive women from 2007–2011 to assemble the HPV Persistence and Progression (PaP) Cohort (see Figure 1 for details). We plan to use this data and our prevalence-incidence models to inform the next screening guidelines that will incorporate new screening tests, in particular, HPV typing [Castle et al. (2011)]. Each of the 14 carcinogenic HPV types has different precancer/cancer risk [Schiffman et al. (2011)], but providing information on each of 14 types is too complex for clinicians or developing guidelines. We conducted HPV typing tests in PaP using a residual exfoliated cervical specimen that was stored for study use [Schiffman et al. (2015)]. However, typing tests are too expensive to be used to test all specimens. Instead, we conducted typing tests on a stratified random sample of 9258 women that over-samples women diagnosed with precancer/cancer or are otherwise at high risk (the design will be elaborated in Section 4). Using the IPW logistic-Cox model to calculate risk in PaP, we grouped the 14 types by risk into 4 risk-bands to report to clinicians and for basing guidelines. These risk-bands may be adopted by future screening guidelines, which would inform the design of future commercial HPV typing tests.

Fig. 1.

Human papillomavirus (HPV) Persistence and Progression (PaP) cohort.

2. Proposed methods.

We first propose the prevalence-incidence family of models for full cohorts and then extend it to two-phase samples. Throughout, we assume that all outcomes and covariates have negligible measurement error.

2.1. Complete data of a full cohort.

For full cohort data (no subsampling), denote subjects by i = 1,…, N. The failure time Ti has cumulative distribution function F and survival function S(t) = 1 – F(t). The time scale is time-on-study, with baseline time 0 at enrollment into the study. If subject i has disease at baseline (i.e., Ti ≤ 0), then Yi = 1; otherwise Yi = 0. The prevalence indicator Yi is not necessarily observed for all subjects: the missingness indicator Mi equals 1 if yi is observed and 0 if yi is missing. Failure times are interval-censored between Li and Ri, the latest visit time at which subject i is observed to be disease-free and the earliest visit time at which subject i is observed to be diseased, respectively. Intervals are defined as follows: for 0 < Li < Ri, right-censoring is (Li, Ri = ∞); interval-censoring where disease is definitively known to be absent at baseline is (0, Ri) or (Li, Ri); and disease that is diagnosed during follow-up but might have been present, unobserved, at baseline (i.e., Mi = 0) is [0, Ri) for Ri < ∞. We assume that, given covariates, the censoring and observation times are independent of the failure time, because visit times are predetermined by guidelines and precancers and early-phase cancers are most likely asymptomatic. If case status (diseased versus disease-free at baseline or during the enrollment period, that is, prevalence at baseline or incidence observed during the enrollment period) is used to determine strata, the vector of auxiliary variables Vi (not the risk factors of interest) includes (Yi, Li, Ri) in addition to other characteristics, for example, strata factors and demographics.

We assume that the prevalent disease probability at baseline, $P_d(x_i, \beta)$, depends on a parameter $\beta$ for a given covariate vector $x_i$, and that $\beta$ shares no components with $\gamma$, the parameter of the incident disease probability given an incidence-related covariate vector $z_i$. The covariate vectors $X_i$ and $Z_i$ may partially overlap or coincide; for example, both can contain potential risk factors for cancer measured at baseline. The likelihood for the complete data, $D_c = \{(y_i, L_i, R_i, v_i^T, x_i^T, z_i^T);\ i = 1, \ldots, N\}$, is

$$
L_N(\beta,\gamma,S;D_c)=\prod_{i=1}^{N} P_d(x_i,\beta)^{y_i}\Bigl[\{1-P_d(x_i,\beta)\}\times\{S(L_i;z_i,\gamma)-S(R_i;z_i,\gamma)\}\Bigr]^{1-y_i}. \tag{2.1}
$$

The above likelihood defines a general class of “prevalence-incidence” mixture models. In particular, we focus on the logistic-Cox prevalence-incidence model, which models prevalent disease with a logistic regression and incident disease with a Cox model [Cox (1972)], that is, $P_d(x, \beta) = \exp(x^T\beta)/\{1 + \exp(x^T\beta)\}$ and $S(t; z, \gamma) = \exp\{-\Lambda(t)\exp(z^T\gamma)\}$, where $\Lambda(t)$ is an unknown baseline cumulative hazard function that is nondecreasing over time with $\Lambda(0) = 0$. The cumulative risk from the logistic-Cox model given $x$ and $z$ is

$$
CR(t\mid x,z,\beta,\gamma,\Lambda)=\frac{\exp(x^T\beta)}{1+\exp(x^T\beta)}+\frac{1}{1+\exp(x^T\beta)}\Bigl[1-\exp\{-\Lambda(t)\exp(z^T\gamma)\}\Bigr]. \tag{2.2}
$$
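As an illustration of (2.2), the following minimal Python sketch evaluates the logistic-Cox cumulative risk for a given baseline cumulative hazard. The function names, the linear baseline hazard and the numerical inputs are assumptions made for this example only; they are not estimates from the PaP cohort.

```python
import math

def prevalence(x, beta):
    """Logistic prevalence P_d(x, beta) = exp(x'beta) / {1 + exp(x'beta)}."""
    eta = sum(xk * bk for xk, bk in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-eta))

def cumulative_risk(t, x, z, beta, gamma, cum_hazard):
    """Cumulative risk (2.2): prevalent risk plus incident risk among the disease-free."""
    p = prevalence(x, beta)
    lin = sum(zk * gk for zk, gk in zip(z, gamma))
    surv = math.exp(-cum_hazard(t) * math.exp(lin))  # S(t; z, gamma)
    return p + (1.0 - p) * (1.0 - surv)

# Hypothetical inputs: x carries a leading 1 for the intercept, z has no intercept.
beta = [-3.5, 1.0, 1.0]
gamma = [0.3, 0.3]
x = [1.0, 1.0, 0.5]
z = [1.0, 0.5]

def baseline(t):
    """Assumed baseline cumulative hazard, linear in t and zero at t = 0."""
    return 0.135 * t

risks = [cumulative_risk(t, x, z, beta, gamma, baseline) for t in (0.0, 1.0, 3.0)]
```

At t = 0 the incident term vanishes because Λ(0) = 0, so the cumulative risk equals the prevalence; the curve then increases with t and stays below one.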

2.2. Two-phase stratified sample.

For the two-phase stratified sample design, we follow the general inverse-probability weighting (IPW) approach [Breslow and Wellner (2007)]. The first phase is the full cohort of $N$ subjects, which is a simple random sample from an infinite population (called the superpopulation). For subjects $i = 1, \ldots, N$ at phase 1, we observe only a vector of auxiliary variables $V_i$, which correlates with the time-to-precancer/cancer $T_i$ and determines stratification. In the HPV-PaP cohort, the auxiliary information includes the currently used cotesting for cervical cancer screening (cytology and HC2) and demographics. We suppose the cohort is divided into $J$ mutually exclusive and exhaustive strata. Let $N_j$ denote the number of subjects in the $j$th stratum for $j = 1, \ldots, J$, so $N = \sum_{j=1}^{J} N_j$. At phase 2, simple random samples without replacement of size $n_j$ are drawn from each of the $J$ finite phase 1 strata, and $n = \sum_{j=1}^{J} n_j$. We denote by $\xi_{j,i}$ the indicator variable equal to one if the $i$th subject in stratum $j$ is sampled at phase 2 and zero otherwise. Under this two-phase stratified sample design, $(\xi_{j,1}, \ldots, \xi_{j,N_j})$ are exchangeable with $\Pr(\xi_{j,i} = 1) = n_j/N_j$, and the $J$ random vectors $(\xi_{j,1}, \ldots, \xi_{j,N_j})$ are independent. With two-phase sampling, $X$ and $Z$ are not observed for all $N$ subjects but are fully observed for subjects sampled at phase 2; for example, expensive bioassay tests are conducted only on the subjects sampled at phase 2. For the general setting, let $\pi_{j,i} = \Pr(\xi_{j,i} = 1)$ be the probability that the $i$th subject from stratum $j$ is sampled at phase 2. The observed data at phase 1 are $D^{(1)} = \{D^{(1)}_{j,i};\ i = 1, \ldots, N_j,\ j = 1, \ldots, J\}$, where $D^{(1)}_{j,i} = \{y_{j,i} = 1, v_{j,i}, \xi_{j,i}\}$ or $\{y_{j,i} = 0, L_{j,i}, R_{j,i}, v_{j,i}, \xi_{j,i}\}$ when $M_{j,i} = 1$, whereas $D^{(1)}_{j,i} = \{L_{j,i} = 0, R_{j,i}, v_{j,i}, \xi_{j,i}\}$ when $M_{j,i} = 0$. At phase 2, the observed data are $D^{(2)} = \{D^{(2)}_{j,i};\ i = 1, \ldots, n_j,\ j = 1, \ldots, J\}$, where $D^{(2)}_{j,i} = \{\xi_{j,i}x_{j,i}, \xi_{j,i}z_{j,i}\}$. Hence, the observed data from two-phase stratified sampling are $D = D^{(1)} \cup D^{(2)}$.

We assume that the missingness mechanism for $Y$ at phase 1 and the sample selection at phase 2 are both missing at random (MAR). We also assume that all $J$ strata are sampled with positive probability.
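The design weights implied by this sampling scheme are simply the inverse sampling fractions N_j/n_j. The short Python sketch below computes them for two hypothetical strata; the stratum labels and counts are invented for illustration.

```python
def design_weights(stratum_sizes, sample_sizes):
    """Return the inverse-probability weight N_j / n_j for each stratum."""
    weights = {}
    for j, big_n in stratum_sizes.items():
        small_n = sample_sizes[j]
        # Every stratum must be sampled with positive probability.
        if not 0 < small_n <= big_n:
            raise ValueError(f"stratum {j!r}: need 0 < n_j <= N_j")
        weights[j] = big_n / small_n
    return weights

# Hypothetical phase 1 stratum sizes and phase 2 sample sizes.
stratum_sizes = {"high": 1000, "low": 9000}
sample_sizes = {"high": 800, "low": 900}

w = design_weights(stratum_sizes, sample_sizes)
# Weighted count of sampled subjects recovers the phase 1 cohort size.
estimated_cohort_size = sum(w[j] * sample_sizes[j] for j in w)
```

Each sampled subject in stratum j stands in for N_j/n_j cohort members, which is why the weighted count of sampled subjects returns the full cohort size.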

Then the weighted likelihood for the observed data D and the missing indicator M is

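$$
\begin{aligned}
L_n^{\pi}(\beta^T,\gamma^T,S,\Psi;D,M)={}&\prod_{j=1}^{J}\prod_{\{i;\,m_{j,i}=1\}}^{N_j}\Bigl(P_d(\beta)^{y_{j,i}}\bigl[\{1-P_d(\beta)\}\{S(L_{j,i};\gamma)-S(R_{j,i};\gamma)\}\bigr]^{1-y_{j,i}}\,P(M_{j,i}=m_{j,i}\mid D_{j,i},\Psi)\Bigr)^{\frac{N_j}{n_j}\xi_{j,i}}\\
&\times\prod_{j=1}^{J}\prod_{\{i;\,m_{j,i}=0\}}^{N_j}\Bigl(\bigl[P_d(\beta)+\{1-P_d(\beta)\}\{1-S(R_{j,i};\gamma)\}\bigr]\times P(M_{j,i}=m_{j,i}\mid D_{j,i},\Psi)\Bigr)^{\frac{N_j}{n_j}\xi_{j,i}},
\end{aligned}
\tag{2.3}
$$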

where Ψ denotes parameters in the missing data mechanism. The likelihood in (2.3) is orthogonal in (β, γ) and Ψ. Thus, the missing data mechanism P(Mj,i | Dj,i, Ψ) is ignorable in maximum likelihood estimation. The MAR assumption for πj,i is crucial to construct an unbiased estimating equation.

Then the weighted log-likelihood for the observed data, D is

$$
l_n^{\pi}(\beta^T,\gamma^T,\Lambda;D)=\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\,l(\beta,\gamma,\Lambda;D_{j,i}), \tag{2.4}
$$

where

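$$
\begin{aligned}
l(\beta^T,\gamma^T,\Lambda;D_{j,i})={}&I(m_{j,i}=1)\Bigl(y_{j,i}x_{j,i}^T\beta-\log\{1+\exp(x_{j,i}^T\beta)\}\\
&\quad+(1-y_{j,i})\log\bigl[\exp\{-\Lambda(L_{j,i})\exp(z_{j,i}^T\gamma)\}-\exp\{-\Lambda(R_{j,i})\exp(z_{j,i}^T\gamma)\}\bigr]\Bigr)\\
&+I(m_{j,i}=0)\log\bigl[1-\{1+\exp(x_{j,i}^T\beta)\}^{-1}\exp\{-\Lambda(R_{j,i})\exp(z_{j,i}^T\gamma)\}\bigr].
\end{aligned}
\tag{2.5}
$$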
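To make the three kinds of contribution in (2.5) concrete, here is a minimal Python sketch that evaluates the contribution of one subject and the weighted sum in (2.4). The record layout (a dict with keys `m`, `y`, `L`, `R`, `x`, `z`, `stratum`), the toy data and the linear cumulative hazard are assumptions for this example, not the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def loglik_contribution(rec, beta, gamma, cum_hazard):
    """Log-likelihood contribution (2.5) for one subject."""
    xb = dot(rec["x"], beta)
    risk = math.exp(dot(rec["z"], gamma))

    def g(t):
        # Survival among subjects disease-free at baseline; g(inf) = 0.
        return 0.0 if math.isinf(t) else math.exp(-cum_hazard(t) * risk)

    log_one_plus = math.log1p(math.exp(xb))
    if rec["m"] == 1 and rec["y"] == 1:
        # Prevalent disease diagnosed at baseline: log P_d.
        return xb - log_one_plus
    if rec["m"] == 1:
        # Disease-free at baseline, event in (L, R]: log{(1 - P_d)(S(L) - S(R))}.
        return -log_one_plus + math.log(g(rec["L"]) - g(rec["R"]))
    # Baseline status unknown: prevalent, or incident before R.
    return math.log(1.0 - g(rec["R"]) / (1.0 + math.exp(xb)))

def weighted_loglik(sample, weights, beta, gamma, cum_hazard):
    """Weighted log-likelihood (2.4) over phase 2 subjects, weights N_j / n_j."""
    return sum(weights[rec["stratum"]] * loglik_contribution(rec, beta, gamma, cum_hazard)
               for rec in sample)

# Toy phase 2 sample (hypothetical values).
sample = [
    {"stratum": "A", "m": 1, "y": 1, "L": 0.0, "R": 0.0, "x": [1.0], "z": [0.0]},
    {"stratum": "A", "m": 1, "y": 0, "L": 1.0, "R": 2.0, "x": [1.0], "z": [0.0]},
    {"stratum": "B", "m": 1, "y": 0, "L": 3.0, "R": math.inf, "x": [1.0], "z": [0.0]},
    {"stratum": "B", "m": 0, "y": None, "L": 0.0, "R": 2.0, "x": [1.0], "z": [0.0]},
]
weights = {"A": 1.0, "B": 4.0}
value = weighted_loglik(sample, weights, beta=[0.0], gamma=[0.5],
                        cum_hazard=lambda t: 0.1 * t)
```

Right-censored subjects are handled by the same interval formula with R = ∞, for which the survival term at R is zero.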

The estimates $(\hat\beta^T, \hat\gamma^T, \hat\Lambda)$ are the arguments that maximize the objective function in (2.4). Because the indicator variables $\{\xi_{j,i}\}_{i=1}^{N_j}$ within stratum $j$ are exchangeable but not independent when sampling is without replacement, we need techniques for nonindependent data to prove consistency and weak convergence of the estimates.

2.3. Weakly parametric mixture models.

It is well known that estimators of cumulative hazard functions from interval-censored data have a non-Gaussian asymptotic distribution and converge at rates slower than root-N [Groeneboom and Wellner (1992), Sen and Banerjee (2007)]. In general, semiparametric estimation procedures for interval-censored data are computationally intensive, especially as the number of unique visit times increases, so a bootstrap method for inference is often impractical. To sidestep these challenges, we propose a weakly-parametric model that approximates the baseline cumulative hazard with an integrated B-spline. For smoothing, cubic splines are commonly used in practice [Wang et al. (2016)]. Knots can be placed at the quantiles of the finite visit time points. We also present a semiparametric estimator in Section 2.5 as a benchmark to assess how well the approximation of the baseline hazard function fits. To ensure a root-N convergence rate, we assume that the number of knots for the integrated B-spline is fixed. In our experience, this assumption is plausible for data analyses with rare events because the number of finite intervals in which events occur is controlled by screening guidelines and thus does not increase in proportion to the sample size [Zhang, Hua and Huang (2010)].

We approximate the baseline cumulative hazard as $\Lambda(t) = \sum_{k=1}^{K}\exp(b_k)B_k(t)$, where the $B_k(\cdot)$ are integrated B-spline basis functions, which are nondecreasing from 0 to 1, and the $b_k$ are unknown parameters for the basis functions [using $\exp(b_k)$ ensures nonnegative $\Lambda(t)$]. We omit the subscripts $j, i$ for simplicity. The weighted log-likelihood in the model is

$$
l_n^{\pi}\{\theta=(\beta^T,\gamma^T,b_1,\ldots,b_K);D\}=\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\,l(\theta;D_{j,i}), \tag{2.6}
$$

where

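$$
\begin{aligned}
l(\theta;D)={}&I(m=1)\Bigl((1-y)\log\Bigl[\exp\Bigl\{-\exp(z^T\gamma)\sum_{k=1}^{K}e^{b_k}B_k(L)\Bigr\}-\exp\Bigl\{-\exp(z^T\gamma)\sum_{k=1}^{K}e^{b_k}B_k(R)\Bigr\}\Bigr]\\
&\quad+y\,x^T\beta-\log\{1+\exp(x^T\beta)\}\Bigr)\\
&+I(m=0)\log\Bigl[1-\{1+\exp(x^T\beta)\}^{-1}\times\exp\Bigl\{-\exp(z^T\gamma)\sum_{k=1}^{K}e^{b_k}B_k(R)\Bigr\}\Bigr],
\end{aligned}
$$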

where K is the number of knots.
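The role of the exp(b_k) coefficients can be seen in a small Python sketch. For brevity it uses the simplest integrated B-splines (degree-zero B-splines integrated and scaled, which are linear ramps from 0 to 1 between adjacent knots) as a stand-in for the cubic basis discussed above; the knots and coefficients are hypothetical.

```python
import math

def ramp_basis(t, knots):
    """Integrated degree-0 B-splines: B_k rises linearly from 0 to 1 on [knots[k], knots[k+1]]."""
    values = []
    for lo, hi in zip(knots[:-1], knots[1:]):
        values.append(min(1.0, max(0.0, (t - lo) / (hi - lo))))
    return values

def cum_hazard(t, b, knots):
    """Lambda(t) = sum_k exp(b_k) B_k(t); exp(b_k) > 0 keeps Lambda nondecreasing."""
    return sum(math.exp(bk) * basis for bk, basis in zip(b, ramp_basis(t, knots)))

knots = [0.0, 1.0, 3.0, 7.0]   # hypothetical knot locations
b = [-2.0, -1.0, 0.5]          # unconstrained parameters, one per basis function

grid = [0.25 * step for step in range(0, 33)]  # 0, 0.25, ..., 8
curve = [cum_hazard(t, b, knots) for t in grid]
```

Because each basis function is nondecreasing and each coefficient is positive, any real-valued vector b yields a valid cumulative hazard with Λ(0) = 0, so the optimizer can work on an unconstrained parameter space.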

The root of the weighted score function, $\sum_{j=1}^{J}(N_j/n_j)\sum_{i=1}^{N_j}\xi_{j,i}\,\dot l(\theta;D_{j,i})$, where $\dot l(\theta;D_{j,i})=\partial l(\theta;D_{j,i})/\partial\theta$ (presented in Section 2 of the supplementary materials [Hyun et al. (2017)]), can be found by the Newton–Raphson iterative algorithm. Model identifiability and asymptotic consistency of the estimators obtained from the weakly-parametric procedure are proved in Lemma 3.1 and Theorem 3.1 in Section 3 of the supplementary materials [Hyun et al. (2017)], respectively. The Fisher information matrix, $I_0=E\{\dot l(\theta_0)\dot l(\theta_0)^T\}$, is invertible under condition A2 in Section 1 of the supplementary materials [Hyun et al. (2017)], as shown in Lemma 3.2 in Section 3 of the supplementary materials [Hyun et al. (2017)].

2.4. Asymptotic variance for the weakly-parametric models.

Standard parametric maximum-likelihood theory is inapplicable because the sampling is without replacement, so the sampling indicator variables ξj,i are correlated within a stratum. We follow the weighted likelihood approach from Breslow and Wellner (2007) to demonstrate weak convergence of the estimates for finite population stratified sample. We assume the number and placement of knots are known a priori and independent of sample size.

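By a Taylor expansion of $l_n^{\pi}(\theta;D)$ in (2.6), we linearize the estimated parameters:

\begin{align}
\sqrt{N}(\hat\theta-\theta_0)&=N^{-1/2}\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\,I_0^{-1}\dot l(\theta_0;D_{j,i})+o_p(1)\notag\\
&=N^{-1/2}\sum_{i=1}^{N}I_0^{-1}\dot l(\theta_0;D_i)+N^{-1/2}I_0^{-1}\sum_{j=1}^{J}\sum_{i=1}^{N_j}\Bigl(\frac{N_j\xi_{j,i}}{n_j}-1\Bigr)\dot l(\theta_0;D_{j,i})+o_p(1) \tag{2.7}\\
&\to_d N\bigl(0,\;I_0^{-1}+I_0^{-1}\Sigma_0I_0^{-1}\bigr), \tag{2.8}
\end{align}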

where $\Sigma_0=\sum_{j=1}^{J}v_j\{(1-p_j)/p_j\}\bigl[E\{\dot l(\theta_0)^{\otimes 2}\mid V_j\}-E\{\dot l(\theta_0)\mid V_j\}^{\otimes 2}\bigr]$, $V_j$ denotes the $j$th stratum, and $x^{\otimes 2}=xx^T$ for a vector $x$. For each stratum $j=1,\ldots,J$, as $N\to\infty$ the sampling fraction converges to $p_j=\lim n_j/N_j$, and each stratum size increases at the same rate as $N$, that is, $v_j=\lim N_j/N$ with $0<v_j<\infty$. The asymptotic normal limit distribution of the estimators is derived in Theorem 3.2 of the supplementary materials [Hyun et al. (2017)]. The asymptotic variance estimator for $\hat\theta$ consists of two components, one from the phase 1 and one from the phase 2 finite sample design. Letting $\ddot l(\theta)=\partial\dot l(\theta)/\partial\theta$, the variance estimators are

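\begin{align}
\widehat{\mathrm{var}}_{ph1}(\hat\theta)&=N^{-1}\hat I(\theta_0)^{-1}=\Bigl\{-\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}\,\ddot l(\hat\theta;D_{j,i})\Bigr\}^{-1}, \tag{2.9}\\
\widehat{\mathrm{var}}_{ph2}(\hat\theta)&=\frac{1}{N^2}\hat I(\theta_0)^{-1}\Bigl[\sum_{j=1}^{J}N_j\Bigl(\frac{1-p_j}{p_j}\Bigr)\widehat{\mathrm{var}}_{0|j}\{\dot l(\theta_0)\}\Bigr]\hat I(\theta_0)^{-1}, \tag{2.10}
\end{align}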

where $\widehat{\mathrm{var}}_{0|j}=n_j^{-1}\sum_{i=1}^{N_j}\{\xi_{j,i}\dot l(\hat\theta;D_{j,i})\}^{\otimes 2}-\{n_j^{-1}\sum_{i=1}^{N_j}\xi_{j,i}\dot l(\hat\theta;D_{j,i})\}^{\otimes 2}$. As a result, the variance estimator of $\hat\theta$ is the sum of the variances in (2.9) and (2.10). Given $x$ and $z$, the asymptotic variance estimate for $CR(t\mid x,z,\hat\theta)$ is derived from $\widehat{\mathrm{var}}(\hat\theta)$ and the delta method. The explicit variance form is presented in Section 4 in the supplementary materials [Hyun et al. (2017)].
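For intuition about how the two variance components combine, the Python sketch below works through the scalar-parameter case. It assumes that the estimated information is the negative weighted average of the second derivatives and that the sampling fraction is estimated by n_j/N_j; the per-subject scores and second derivatives are made-up numbers, and the matrix case would replace the scalar divisions with matrix inverses.

```python
def two_phase_variance(strata):
    """Scalar-parameter sketch of the phase 1 and phase 2 variance components.

    Each stratum is a dict with N (phase 1 stratum size) and, for the n_j
    sampled subjects, their scores and second derivatives ("hessians").
    """
    cohort_size = sum(s["N"] for s in strata)
    weighted_hessian = 0.0
    middle = 0.0
    for s in strata:
        n_j = len(s["scores"])
        p_j = n_j / s["N"]                      # estimated sampling fraction
        weighted_hessian += (s["N"] / n_j) * sum(s["hessians"])
        mean = sum(s["scores"]) / n_j
        var_j = sum(u * u for u in s["scores"]) / n_j - mean * mean
        middle += s["N"] * (1.0 - p_j) / p_j * var_j
    info = -weighted_hessian / cohort_size      # assumed form of the estimated information
    var_ph1 = 1.0 / (cohort_size * info)
    var_ph2 = middle / (cohort_size * info) ** 2
    return var_ph1, var_ph2

# Hypothetical strata: A is half sampled, B is fully sampled.
strata = [
    {"N": 4, "scores": [1.0, -1.0], "hessians": [-2.0, -2.0]},
    {"N": 2, "scores": [0.5, -0.5], "hessians": [-1.0, -1.0]},
]
var_ph1, var_ph2 = two_phase_variance(strata)
total_variance = var_ph1 + var_ph2
```

A fully sampled stratum (p_j = 1) contributes nothing to the phase 2 component, which matches the interpretation of that term as the extra uncertainty due to subsampling.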

The sampling weights can be estimated to improve efficiency by using a parametric model π(α; v) = Pr(ξj,i = 1 | vj,i) when the auxiliary variables are closely correlated with the target variables [Breslow et al. (2009)]. When we use estimated weights, the asymptotic distribution of the estimates differs from the distribution in (2.7)–(2.8), particularly in the variance due to sampling, Σ0 in (2.8). The asymptotic distribution can be derived from the result of Breslow et al. (2009), and it is presented in Section 3 in the supplementary materials [Hyun et al. (2017)].

2.5. Semiparametric estimation procedure.

A semiparametric risk estimate is useful for checking the fit of parametric models. We propose a semiparametric estimator that maximizes the objective function in (2.4) by iterating between estimating the finite dimensional regression parameters and the infinite dimensional cumulative-hazard Ʌ(t), estimating each with standard fitting algorithms:

  1. Initialize $\hat\beta^{(0)}=\beta^{*}$ and $\hat\gamma^{(0)}=\gamma^{*}$.

  2. With the current estimate $(\hat\beta^{(l)},\hat\gamma^{(l)})$, compute $\hat\Lambda^{(l)}$ by maximizing $l_n^{\pi}(\hat\beta^{(l)},\hat\gamma^{(l)},\Lambda;D)$ as a function of $\Lambda$. This optimization can be carried out by the Iterative Convex Minorant (ICM) algorithm [Robertson, Wright and Dykstra (1988)] (details follow below).

  3. With the updated $\hat\Lambda^{(l)}$, we maximize $l_n^{\pi}(\beta,\gamma,\hat\Lambda^{(l)};D)$ with respect to $(\beta^T,\gamma^T)$ using the classic iteratively reweighted least squares algorithm for generalized linear models [Nelder and Wedderburn (1972)].

  4. Repeat steps 2 and 3 until convergence.

For steps 2 and 3, we define the following IPW processes:

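$$
\begin{aligned}
A_{j,i}(t)={}&I(m_{j,i}=1\text{ and }0<R_{j,i}\le t)\,(1-y_{j,i})\,g(R_{j,i})\exp(z_{j,i}^T\gamma)/\{g(L_{j,i})-g(R_{j,i})\}\\
&-I(m_{j,i}=1\text{ and }0<L_{j,i}\le t)\,(1-y_{j,i})\,g(L_{j,i})\exp(z_{j,i}^T\gamma)/\{g(L_{j,i})-g(R_{j,i})\}\\
&+I(m_{j,i}=0\text{ and }0<R_{j,i}\le t)\,g(R_{j,i})\exp(z_{j,i}^T\gamma)/\{1+\exp(x_{j,i}^T\beta)-g(R_{j,i})\},
\end{aligned}
$$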

where $g(t)=\exp\{-\Lambda(t)\exp(z^T\gamma)\}$ for $t>0$, $g(0)=1$ and $\lim_{t\to\infty}g(t)=0$. This process $A_{j,i}(t)$ is the time derivative of the log-likelihood in (2.5) and can only have a jump at $t_k$, which is at either $L_{j,i}$ or $R_{j,i}$:

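$$
\begin{aligned}
A_{\Lambda,n}(t)&=\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}A_{j,i}(t),\\
G_{\Lambda,n}(t)&=\sum_{j=1}^{J}\frac{N_j}{n_j}\sum_{i=1}^{N_j}\xi_{j,i}A_{j,i}^{2}(t),\\
Q_{\Lambda,n}(t)&=A_{\Lambda,n}(t)+\int_{0}^{t}\Lambda(s)\,dG_{\Lambda,n}(s),
\end{aligned}
\tag{2.11}
$$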

where $G_{\Lambda,n}(t)$ in (2.11) is based on a second-order expansion of the log-likelihood in (2.5). To ensure identifiability of $\Lambda(t)$, we assume that $\hat\Lambda$ is right continuous and piecewise constant, with discontinuities only at $\{t_{(k)};\ k=1,\ldots,K\}$, the ordered unique values of the observed times $\{L_i,R_i\mid L_i\neq 0\text{ and }R_i<\infty,\ i=1,\ldots,n\}$.

For fixed $(\beta,\gamma)$, let $\hat\Lambda$ be the left derivative of the greatest convex minorant of the self-induced cumulative sum diagram formed by the points $(0,0)$ and $\{G_{\hat\Lambda,n}(t_{(k)}),Q_{\hat\Lambda,n}(t_{(k)})\}$. Then $\hat\Lambda$ maximizes $\sum_{j=1}^{J}(N_j/n_j)\sum_{i=1}^{N_j}\xi_{j,i}\,l(\beta,\gamma,\Lambda;D_{j,i})$ [Groeneboom and Wellner (1992)]. The consistency of the estimators obtained from the semiparametric procedure is proved in Theorem 3.1 in Section 3 of the supplementary materials [Hyun et al. (2017)].
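The convex-minorant step can be sketched with the pool-adjacent-violators algorithm, which returns the left derivatives of the greatest convex minorant of a cumulative sum diagram. The Python code below covers only this inner step, for a diagram whose points are already given; it is a generic illustration rather than the authors' ICM implementation, and the example points are hypothetical.

```python
def gcm_left_slopes(G, Q):
    """Left derivatives of the greatest convex minorant of (0, 0), (G[0], Q[0]), ...

    Assumes G is strictly increasing and positive. Implemented with pool adjacent
    violators: neighbouring segments are merged while their slopes decrease.
    """
    blocks = []                     # each block: [sum of dG, sum of dQ, number of points]
    prev_g, prev_q = 0.0, 0.0
    for g, q in zip(G, Q):
        blocks.append([g - prev_g, q - prev_q, 1])
        prev_g, prev_q = g, q
        # Merge while the previous block's slope exceeds the last block's slope.
        while len(blocks) > 1 and blocks[-2][1] * blocks[-1][0] > blocks[-1][1] * blocks[-2][0]:
            d_g, d_q, count = blocks.pop()
            blocks[-1][0] += d_g
            blocks[-1][1] += d_q
            blocks[-1][2] += count
    slopes = []
    for d_g, d_q, count in blocks:
        slopes.extend([d_q / d_g] * count)
    return slopes

# Hypothetical cumulative sum diagram at four ordered time points.
G = [1.0, 2.0, 3.0, 4.0]
Q = [1.0, 1.0, 3.0, 4.0]
slopes = gcm_left_slopes(G, Q)
```

The returned slopes are nondecreasing by construction, which is what makes the resulting step function a valid (monotone) candidate for the cumulative hazard at the ordered time points. In the full procedure the diagram depends on the current estimate, so this step would be repeated until the estimate stabilizes.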

3. Simulation studies.

We conducted a series of simulations to assess the numerical performance of the weakly-parametric IPW logistic-Cox model and to compare its estimates with those from the semiparametric IPW logistic-Cox model. We simulate two scenarios, SC1 and SC2, where SC1 reflects an ideal situation with a high event rate and narrow visit-intervals, whereas SC2 reflects a realistic scenario with a moderate event rate and wide visit-intervals. The two covariates in models (3.1) and (3.2), X1 and X2, are independently generated from a Bernoulli distribution with probability 0.5 and from a standard normal distribution, respectively, and the covariate vectors for incidence and prevalence are identical:

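\begin{align}
\text{Logistic model:}\quad&\operatorname{logit}\{P_d(X_1,X_2,\beta)\}=\beta_0+\beta_1X_1+\beta_2X_2, \tag{3.1}\\
\text{Cox model:}\quad&\Lambda(t;X_1,X_2,\gamma)=\gamma_0t^{\tau}\exp(\gamma_1X_1+\gamma_2X_2). \tag{3.2}
\end{align}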

The Cox submodel baseline hazard parameters are (γ0, τ) = (0.135, 1) for SC1 and (0.05, 0.5) for SC2; the covariate-related parameters are (β1, β2, γ1, γ2) = (1, 1, 0.3, 0.3) for both SC1 and SC2. Visit times are independent and generated from a normal distribution with mean 3 and variance 0.5. The number of visits varies across subjects because we set a fixed end time for follow-up (t = 20 for SC1 and t = 10 for SC2). Follow-up occurs if there is no prevalent disease at baseline. Whether a subject takes a diagnostic test at each screening visit follows a Bernoulli distribution with probability 0.5 for SC1 and 0.07 for SC2. Consequently, the interval containing an incident event is likely to be narrower in SC1 than in SC2.
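The data-generating models (3.1) and (3.2) can be sketched in Python by inverting the survival function. The sketch below draws covariates, prevalence status and the latent event time only; it does not simulate the visit process or the two-phase sampling. The defaults use the SC1 parameter values with the intercept β0 = −3.5 reported in Table 2, and the function names and seed are assumptions for this example.

```python
import math
import random

def event_time(u, x1, x2, gamma0, tau, gamma1, gamma2):
    """Solve exp{-gamma0 * t**tau * exp(gamma1*x1 + gamma2*x2)} = u for t."""
    rate = gamma0 * math.exp(gamma1 * x1 + gamma2 * x2)
    return (-math.log(u) / rate) ** (1.0 / tau)

def simulate_subject(rng, beta=(-3.5, 1.0, 1.0), gamma0=0.135, tau=1.0, gamma=(0.3, 0.3)):
    """Draw covariates, prevalence status and latent event time for one subject."""
    x1 = 1.0 if rng.random() < 0.5 else 0.0          # Bernoulli(0.5)
    x2 = rng.gauss(0.0, 1.0)                         # standard normal
    p = 1.0 / (1.0 + math.exp(-(beta[0] + beta[1] * x1 + beta[2] * x2)))
    prevalent = rng.random() < p
    u = 1.0 - rng.random()                           # uniform on (0, 1], avoids log(0)
    t = 0.0 if prevalent else event_time(u, x1, x2, gamma0, tau, *gamma)
    return {"x1": x1, "x2": x2, "prevalent": prevalent, "T": t}

rng = random.Random(2017)                            # arbitrary seed for reproducibility
cohort = [simulate_subject(rng) for _ in range(1000)]
```

Prevalent subjects are assigned a latent time of zero, mirroring the point mass at baseline; for the remaining subjects the event time follows the Weibull-type cumulative hazard in (3.2).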

The time interval (Li, Ri) in which disease occurs is determined by the closest disease ascertainment dates before and after the true event time. We set the cohort size to 10,000 and consider a two-phase stratified sample. For stratification, we use two factors: case-control status within a certain enrollment period, and a binary variable V that depends on X1 + X2. For the high-risk group, that is, X1 + X2 ≥ Q(92.8%), where Q(92.8%) = 2.135 is the 92.8% quantile of the distribution of X1 + X2, we set P(V = 1 | X1 + X2 ≥ 2.135) = 0.9; for the low-risk group, we set P(V = 0 | X1 + X2 < 2.135) = 0.9. This implies that the stratum variable V is strongly associated with the survival time T. In SC1, cases are defined by diagnosis time up to t = 2, that is, a prevalent case or Ti < 2, whereas in SC2, cases are prevalent cases only. We take all cases and randomly sample controls from (V = 1, controls) and (V = 0, controls) with sampling fractions of 80% and 11% for SC1 and 80% and 20% for SC2, respectively. The sampling weight is one for cases and the inverse of the sampling fraction for controls: (1.25 and 9.09) for SC1 and (1.25 and 5.0) for SC2. SC2 is meant to mimic the data in our application, while SC1 increases the number of intervals containing incident events.

In SC1, the average sample size is 2611. The baseline diagnosis test rate and left-/right-/interval-censoring rates are 95.5%, 30.6%, 3.9%, and 41.1%, respectively. In SC2, the average sample size is 3354. The baseline diagnosis test rate and left-/right-/interval-censoring rates are 95.5%, 12.0%, 59.0%, and 0.9%, respectively. We carried out 1000 replications for each scenario.

We first applied a naive approach, a survey-weighted Cox model for right-censored data, to the simulated data by using the function “svycoxph” in the R package “survey” [Lumley (2016)]. For prevalent cases, we imputed the event time as the minimum of {Lj,i; i = 1,…,nj, j = 1,…,J} − ε, where ε is an arbitrary positive constant chosen so that the event time is positive; for [Lj,i = 0, Rj,i) or (Lj,i, Rj,i) with Rj,i < ∞, we imputed the event time as (Rj,i − Lj,i)/2; and for (0 < Lj,i, Rj,i = ∞), we imputed the censoring time as Lj,i. In SC1, the cumulative risk estimates are substantially biased at early times, and the bias decreases toward 0 over time, whereas the cumulative risk estimates in SC2 are substantially biased at all times because of the wide finite visit-intervals and the low event rate (Table 1).

Table 1.

Simulation results of a naive approach: the cumulative risk estimates are for the subgroup with (x1 = 1, x2 = 0.5)

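| Cumulative risk (CR) | Scenario 1 (n = 2611): True value | Scenario 1 (n = 2611): Bias | Scenario 2 (n = 3354): True value | Scenario 2 (n = 3354): Bias |
| --- | --- | --- | --- | --- |
| CR(t = 0.1) | 0.138 | 0.135 | 0.141 | 0.185 |
| CR(t = 1) | 0.288 | 0.058 | 0.185 | 0.146 |
| CR(t = 3) | 0.534 | −0.016 | 0.231 | 0.115 |
| CR(t = 5) | 0.695 | −0.013 | 0.260 | 0.223 |
| CR(t = 7) | 0.801 | −0.004 | 0.284 | 0.208 |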

Table 2 presents the simulation results. In both scenarios, the regression parameter and cumulative risk estimates have negligible bias. For the regression parameter estimates, the efficiency of the two models is comparable, whereas for the cumulative risk estimates, the empirical standard errors of the weakly-parametric model are smaller (relative efficiency is 1.297–1.664) than those of the semiparametric model. The asymptotic standard error estimates are close to the empirical standard errors except for the intercept coefficient parameter in SC2. Most coverage probabilities from the weakly-parametric model are near the nominal level of 95%. In SC2, the relatively low coverage probability for the cumulative risk is due to the lack of events and, consequently, a few simulation estimates with relatively large bias. Results for the cumulative risk curves estimated in both scenarios are shown in Figure 2; the bias of the cumulative risk curve estimated in SC1 is much smaller than that of the curves estimated in SC2. The black solid lines are the average of the 1000 estimates, and the average estimates agree well with the true curve (grey dash-dot line). The dashed step-curve and dashed smooth-curve in Figure 2 each show a single representative estimate from the 1000 estimates, and those are also close to the true curve.

Table 2.

Simulation results. (a) parameter; (b) true value; (c) empirical standard error; (d) relative efficiency = SE1/SE2; (e) asymptotic standard error; (f) 95% coverage probability; (g) cumulative risk. The cumulative risk estimates are for the subgroup with (x1 = 1, x2 = 0.5)

| Para. (a) | True (b) | Semiparametric: Bias | Semiparametric: SE1 (c) | RE (d) | Weakly-parametric: Bias | Weakly-parametric: SE2 (c) | Weakly-parametric: ASE (e) | Weakly-parametric: CP (f) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Scenario 1, n = 2611 | | | | | | | | |
| β0 | −3.5 | −0.007 | 0.100 | 1.004 | −0.004 | 0.099 | 0.105 | 0.952 |
| β1 | 1.0 | 0.003 | 0.119 | 1.002 | 0.001 | 0.119 | 0.120 | 0.950 |
| β2 | 1.0 | 0.004 | 0.058 | 1.005 | 0.003 | 0.058 | 0.057 | 0.947 |
| γ1 | 0.3 | −0.011 | 0.074 | 1.085 | −0.004 | 0.068 | 0.065 | 0.938 |
| γ2 | 0.3 | 0.002 | 0.037 | 1.121 | 0.000 | 0.033 | 0.033 | 0.952 |
| Prevalence | 0.119 | 0.000 | 0.007 | 1.002 | 0.000 | 0.007 | 0.007 | 0.960 |
| CR(t = 1) (g) | 0.288 | −0.002 | 0.034 | 1.652 | 0.000 | 0.021 | 0.023 | 0.954 |
| CR(t = 3) | 0.534 | −0.011 | 0.044 | 1.664 | −0.001 | 0.026 | 0.026 | 0.939 |
| CR(t = 5) | 0.695 | 0.004 | 0.032 | 1.390 | −0.001 | 0.023 | 0.022 | 0.931 |
| CR(t = 7) | 0.801 | 0.000 | 0.023 | 1.287 | −0.001 | 0.018 | 0.018 | 0.947 |
| CR(t = 10) | 0.895 | −0.001 | 0.017 | 1.284 | −0.001 | 0.014 | 0.013 | 0.946 |
| Scenario 2, n = 3354 | | | | | | | | |
| β0 | −3.5 | −0.002 | 0.088 | 1.001 | −0.002 | 0.088 | 0.092 | 0.957 |
| β1 | 1.0 | 0.001 | 0.102 | 1.000 | 0.000 | 0.102 | 0.103 | 0.949 |
| β2 | 1.0 | 0.002 | 0.051 | 0.999 | 0.001 | 0.051 | 0.051 | 0.943 |
| γ1 | 0.3 | −0.020 | 0.286 | 2.063 | −0.001 | 0.139 | 0.138 | 0.950 |
| γ2 | 0.3 | 0.000 | 0.067 | 1.014 | 0.002 | 0.066 | 0.067 | 0.950 |
| Prevalence | 0.119 | 0.000 | 0.006 | 1.000 | 0.000 | 0.006 | 0.006 | 0.964 |
| CR(t = 1) | 0.185 | −0.009 | 0.029 | 1.297 | 0.000 | 0.022 | 0.022 | 0.919 |
| CR(t = 3) | 0.231 | −0.007 | 0.029 | 1.284 | −0.002 | 0.022 | 0.023 | 0.925 |
| CR(t = 5) | 0.260 | 0.003 | 0.027 | 1.266 | −0.001 | 0.022 | 0.021 | 0.931 |
| CR(t = 7) | 0.284 | 0.004 | 0.026 | 1.301 | 0.001 | 0.020 | 0.019 | 0.931 |

Fig. 2.

Results of a simulation study with two scenarios, SC1 and SC2: the black solid lines are the average of the 1000 estimates; the black dashed lines are a single representative estimate from the 1000 estimates; the grey dashed-dot lines are the true cumulative risk.

We numerically evaluated the robustness of the cumulative risk estimates from the weakly-parametric and semiparametric logistic-Cox model when the true prevalence and incidence models are a probit and an additive hazard model. The cumulative risk estimates from the semiparametric and weakly-parametric logistic-Cox regression models are robust to model misspecification (Table 1 and Figure 1 in the supplementary materials [Hyun et al. (2017)]). We also evaluated the robustness of the cumulative risk estimates from the weakly-parametric logistic-Cox model when the assumptions about the cubic B-spline approximation are violated. As a violation, we considered a cumulative hazard function including abrupt change points. In the scenario with a high event rate, the cumulative risk estimate from the semiparametric model is less biased than the weakly-parametric model. However, in the scenario with a moderate event rate, the cumulative risk estimate from the weakly-parametric model is less biased than the semiparametric model (Table 2 and Figure 2 in the supplementary materials [Hyun et al. (2017)]).

4. Application: Developing risk-bands based on HPV typing tests.

It is expected that the next cervical cancer screening guidelines will include recommendations for the use of HPV typing tests. There are thirteen oncogenic HPV types and one possibly oncogenic type commonly included in tests (HPV66), and each type has a different risk of precancer/cancer [Schiffman et al. (2011)]. However, little is known about the performance of HPV typing in clinical practice, and identifying the best grouping of the 14 types for triage would help increase the benefit of screening. Our typing assay currently groups the 14 types into 9 categories: HPV16, HPV18, HPV31, HPV45, HPV51, HPV52, HPV33/58, HPV39/68/35, and HPV59/56/66.

Within the cohort of women undergoing screening at KPNC, 5% tested positive on HC2. From this HC2-positive subgroup, we assembled a two-phase stratified sample of 9258 women (Figure 1). For this sample, we have residual discarded HPV test specimens usable for HPV-type testing since 2007. Sampling was stratified by baseline cytology severity (normal/low/high grade), FocalPoint computer-assisted quantitative cytology (0, 1–9, 10–100%), and baseline histology result (grade 1/2/3 or cancer). Table 3 shows the sample design. The analysis dataset includes 8333 subjects with complete HPV typing results. Median and maximum follow-up times are 1.69 and 7.18 years, respectively. The outcome of interest is precancer (histology grade 3) or cancer. There are 744 (8.9%) prevalent cases at baseline. The baseline biopsy count and the numbers of left-, right-, and interval-censored cases are 7331 (88.0%), 361 (4.7%), 7132 (94.0%), and 96 (1.1%), respectively. The 1888 (24.9%) women who never had a biopsy are mostly those who had less than 1 year of follow-up or whose HPV infection had cleared by their second visit, obviating a biopsy.

Table 3.

Sample design in HPV-positive women: for FocalPoint, “0” means result absent, “1–9” means not within the most abnormal decile, “≥10” means within the most abnormal decile

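Severity of cytology | Histology | FocalPoint category (%) | Number in stratum | Number sampled | Sampling fraction | Sampling weight
Normal or low grade | <Grade 2 | 0 | 13,615 | 1651 | 0.1213 | 8.2
Normal or low grade | <Grade 2 | 1–9 | 13,826 | 2441 | 0.1766 | 5.7
Normal or low grade | <Grade 2 | ≥10 | 2845 | 1412 | 0.4963 | 2.0
Normal or low grade | Grade 2 | 0 | 918 | 321 | 0.3497 | 2.9
Normal or low grade | Grade 2 | 1–9 | 808 | 286 | 0.354 | 2.8
Normal or low grade | Grade 2 | ≥10 | 360 | 184 | 0.5111 | 2.0
Normal or low grade | Grade 3 | 0 | 541 | 249 | 0.4603 | 2.2
Normal or low grade | Grade 3 | 1–9 | 427 | 185 | 0.4333 | 2.3
Normal or low grade | Grade 3 | ≥10 | 184 | 100 | 0.5435 | 1.8
Normal or low grade | Cancer/AIS | 0 | 82 | 54 | 0.6585 | 1.5
Normal or low grade | Cancer/AIS | 1–9 | 71 | 57 | 0.8028 | 1.2
Normal or low grade | Cancer/AIS | ≥10 | 17 | 12 | 0.7059 | 1.4
High grade | <Grade 2 | 0 | 497 | 332 | 0.668 | 1.5
High grade | <Grade 2 | 1–9 | 268 | 215 | 0.8022 | 1.2
High grade | <Grade 2 | ≥10 | 251 | 170 | 0.6773 | 1.5
High grade | Grade 2 | 0 | 214 | 169 | 0.7897 | 1.3
High grade | Grade 2 | 1–9 | 107 | 69 | 0.6449 | 1.6
High grade | Grade 2 | ≥10 | 175 | 116 | 0.6629 | 1.5
High grade | Grade 3 | 0 | 251 | 222 | 0.8845 | 1.1
High grade | Grade 3 | 1–9 | 131 | 80 | 0.6107 | 1.6
High grade | Grade 3 | ≥10 | 299 | 189 | 0.6321 | 1.6
High grade | Cancer/AIS | 0 | 88 | 62 | 0.7045 | 1.4
High grade | Cancer/AIS | 1–9 | 61 | 42 | 0.6885 | 1.5
High grade | Cancer/AIS | ≥10 | 69 | 46 | 0.6667 | 1.5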
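
The sampling weights in Table 3 are the inverse of the stratum sampling fractions. The following minimal Python sketch reproduces that arithmetic for three strata taken from the table; the function name `sampling_weight` and the stratum labels are illustrative only and are not part of the authors' software.

```python
def sampling_weight(stratum_size, sample_size):
    """Return (sampling fraction, inverse-probability weight) for one stratum."""
    if sample_size <= 0 or sample_size > stratum_size:
        raise ValueError("sample_size must be in (0, stratum_size]")
    fraction = sample_size / stratum_size
    return fraction, 1.0 / fraction


# (number in stratum, number sampled) pairs copied from Table 3
strata = {
    "normal/low, <grade 2, FocalPoint 0": (13615, 1651),
    "normal/low, <grade 2, FocalPoint 1-9": (13826, 2441),
    "high, grade 3, FocalPoint 0": (251, 222),
}

for label, (n_stratum, n_sample) in strata.items():
    fraction, weight = sampling_weight(n_stratum, n_sample)
    print(f"{label}: fraction={fraction:.4f}, weight={weight:.1f}")
```

Heavily sampled strata (for example, high-grade cytology with grade 3 histology) receive weights close to 1, whereas the large low-risk strata receive the largest weights.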

We used the IPW logistic-Cox model to calculate the 3-year risk of precancer or cancer for each HPV type, with the very lowest-risk types grouped (Table 4). Because multiple HPV types can co-infect the cervix, the analysis is conducted hierarchically. We calculate the marginal risk for each type; then, at the next level, we exclude everyone who has any higher-risk HPV type and recalculate marginal risks for the remaining types [Schiffman et al. (2015)], and so on. This determines the best order of introducing additional type categories for risk stratification. This strategy is sensible, in that precancer/cancer risk is dominated by the riskiest type, that is, multiple types do not “interact” [Chaturvedi et al. (2011)]. For example, a woman with both HPV16 and HPV56 will have her outcomes attributed to the higher-risk type (i.e., HPV16). When estimating risk for subsets of data, a standard weighted analysis using only the subset of interest can underestimate standard errors if there are no sampled observations from the domain in some strata [Graubard and Korn (1996)], but in the hierarchical subgroups by HPV type, each domain is sampled from nearly all strata. We did not employ a multiple comparisons correction because the hierarchical analyses were done for exploratory purposes.
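
The attribution step of the hierarchical analysis can be sketched in Python as follows. This is an illustration only: it assumes the final category order listed in Table 4 is already known (in the analysis itself, that order is determined level by level from the recalculated marginal risks), and the names `HIERARCHY` and `attribute_category` are hypothetical.

```python
# Assay categories in the order listed in Table 4, highest risk first.
HIERARCHY = [
    ("HPV16", {"16"}),
    ("HPV18", {"18"}),
    ("HPV33/58", {"33", "58"}),
    ("HPV31", {"31"}),
    ("HPV45", {"45"}),
    ("HPV52", {"52"}),
    ("HPV51", {"51"}),
    ("HPV39/68/35", {"39", "68", "35"}),
    ("HPV59/56/66", {"59", "56", "66"}),
]


def attribute_category(types_detected):
    """Return the first category in HIERARCHY that overlaps the detected types."""
    detected = set(types_detected)
    for category, members in HIERARCHY:
        if detected & members:
            return category
    return None  # none of the 14 types was detected


# A woman co-infected with HPV16 and HPV56 is attributed to HPV16.
print(attribute_category(["56", "16"]))
```

Each woman contributes to exactly one row of the hierarchical table, which is why the rows are labelled "Else ... positive".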

Table 4.

Hierarchical analysis of CIN3+ risk by the nine HPV categories

HPV | No. of observations | 3-year cumulative risk (%) | 95% lower limit | 95% upper limit
HPV16 positive | 1564 | 21.9 | 20.1 | 23.7
Else HPV18 positive | 494 | 11.5 | 9.2 | 13.8
Else HPV33 or 58 positive | 631 | 8.6 | 6.9 | 10.3
Else HPV31 positive | 766 | 8.1 | 6.6 | 9.5
Else HPV45 positive | 324 | 5.4 | 3.8 | 7.0
Else HPV52 positive | 823 | 5.6 | 4.4 | 6.7
Else HPV51 positive | 536 | 2.9 | 1.9 | 3.9
Else HPV39, 68 or 35 positive | 1201 | 2.0 | 1.5 | 2.5
Else HPV59, 56 or 66 positive | 1047 | 1.5 | 1.0 | 1.9

The estimates are obtained by applying the weakly-parametric logistic-Cox model with a covariate for HPV type in each submodel for prevalence and incidence. We chose the cubic B-spline with 7 knots placed at quantiles of visit times by examining the semiparametric risk estimate. For each of the nine categories, the cumulative risk curve from the weakly-parametric approach is a good fit to the semiparametric estimate (Figure 3).

Fig. 3.

CIN3+ cumulative risk estimates by HPV type. The step curves are semiparametric estimates; the smooth curves are weakly-parametric estimates.

The types can be grouped into 4 bands. As expected, HPV16 had by far the greatest risk (21.9%), nearly 15 times the 1.5% risk associated with the lowest-risk types (HPV59/56/66). HPV18 has the second-highest risk at 11.5%. Although HPV45 has half the risk of HPV18, both cause a particularly worrisome subtype of cervical cancer (adenocarcinoma), so we group 18/45 together. Because types 33/58/31/52 have moderate risks between 5.6% and 8.6%, we group them together. The types 51/39/68/35/59/56/66 are grouped together because all have risks of 2.9% or below.
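
As an illustration of the banding, the Python sketch below stores the 3-year risks from Table 4 and reports the range of risks within each of the four bands. The dictionary names and the helper `band_risk_range` are hypothetical; note that the 18/45 band is defined by the shared adenocarcinoma concern rather than by similar risk, so its range is wide.

```python
# 3-year CIN3+ risks (%) by hierarchical category, copied from Table 4.
RISK_3YR = {
    "HPV16": 21.9,
    "HPV18": 11.5,
    "HPV33/58": 8.6,
    "HPV31": 8.1,
    "HPV45": 5.4,
    "HPV52": 5.6,
    "HPV51": 2.9,
    "HPV39/68/35": 2.0,
    "HPV59/56/66": 1.5,
}

# The four risk bands proposed in the text.
BANDS = {
    1: ["HPV16"],
    2: ["HPV18", "HPV45"],
    3: ["HPV33/58", "HPV31", "HPV52"],
    4: ["HPV51", "HPV39/68/35", "HPV59/56/66"],
}


def band_risk_range(band):
    """Return (lowest, highest) category-level 3-year risk within a band."""
    risks = [RISK_3YR[category] for category in BANDS[band]]
    return min(risks), max(risks)


for band in sorted(BANDS):
    low, high = band_risk_range(band)
    print(f"band {band}: {low}% to {high}%")
```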

To form cervical cancer risk strata combining HPV with cytology, we calculated the 3-year risk of histologic grade 3 or cancer/AIS (called CIN3+) across cytology subgroups within each band (Table 5). By comparison with established risk benchmarks and management recommendations from current U.S. guidelines [Katki et al. (2011)], we are able to propose risk management for each stratum. Risk varies from 60.6% for HPV16 with high-risk cytology down to 1.2% for the 4th HPV band with normal cytology, which represents considerable risk stratification. These risk bands could form the basis of future guidelines; for example, the highest risks might indicate immediate treatment, medium-high risk might indicate a biopsy, medium-low risk might indicate a 1-year return, and low risk might indicate a 2-year return.

Table 5.

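CIN3/cancer risk strata combining the four HPV bands with cytology

HPV positive | Cytology | No. of observations | 3-year cumulative risk (%) | 95% lower limit | 95% upper limit
HPV16 | Overall | 1564 | 21.9 | 20.1 | 23.7
HPV16 | High | 506 | 60.6 | 59.5 | 61.6
HPV16 | Medium/low | 574 | 17.9 | 16.2 | 19.5
HPV16 | Normal | 484 | 13.8 | 11.9 | 15.6
Else HPV18/45 | Overall | 843 | 9.0 | 7.5 | 10.5
Else HPV18/45 | High | 235 | 40.9 | 39.7 | 42.1
Else HPV18/45 | Medium/low | 250 | 7.1 | 5.6 | 8.6
Else HPV18/45 | Normal | 358 | 4.4 | 3.2 | 5.6
Else HPV31/52/33/58 | Overall | 2195 | 7.3 | 6.5 | 8.1
Else HPV31/52/33/58 | High | 467 | 35.0 | 34.2 | 35.7
Else HPV31/52/33/58 | Medium/low | 850 | 5.7 | 5.0 | 6.4
Else HPV31/52/33/58 | Normal | 878 | 4.0 | 3.2 | 4.7
Else HPV51/39/68/35/59/56/66 | Overall | 2784 | 2.0 | 1.7 | 2.3
Else HPV51/39/68/35/59/56/66 | High | 321 | 13.6 | 11.8 | 15.3
Else HPV51/39/68/35/59/56/66 | Medium/low | 1123 | 2.0 | 1.7 | 2.4
Else HPV51/39/68/35/59/56/66 | Normal | 1340 | 1.2 | 0.0 | 8.7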

Cumulative risk was used to inform the screening guidelines process because it was simpler to use than separate risks of prevalent and incident disease [Katki et al. (2013)]. However, the risks of prevalent and incident disease are separated by the model and could be used separately if so desired.
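
To make the separation concrete, the Python sketch below combines a prevalent risk and an incident risk into a cumulative risk. It assumes a logistic-Cox form with a single covariate, in which prevalence follows a logistic model and incidence among women without prevalent disease follows a proportional hazards model. The function names, parameter names, and all numerical inputs are hypothetical and are not estimates from this paper.

```python
import math


def prevalence(alpha0, alpha, x):
    """Logistic prevalence probability for covariate value x."""
    return 1.0 / (1.0 + math.exp(-(alpha0 + alpha * x)))


def cumulative_risk(prev, cum_hazard_t, beta, x):
    """Prevalent risk plus incident risk by time t among the non-prevalent."""
    # Incident risk by time t under proportional hazards, given the
    # baseline cumulative hazard at t.
    incident = 1.0 - math.exp(-cum_hazard_t * math.exp(beta * x))
    return prev + (1.0 - prev) * incident


# Hypothetical inputs, chosen only to show the calculation.
p0 = prevalence(alpha0=-2.0, alpha=1.0, x=1.0)
risk = cumulative_risk(p0, cum_hazard_t=0.05, beta=0.5, x=1.0)
print(f"prevalent risk = {p0:.3f}, cumulative risk = {risk:.3f}")
```

At time zero the cumulative hazard is zero, so the cumulative risk reduces to the prevalent risk; the incident component is what grows with follow-up.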

5. Discussion.

Although potentially cost-effective and efficient, cohorts assembled from electronic health records at health providers pose analytic challenges. We addressed three challenges. The first two are prevalent left-censored outcomes and incident irregularly interval-censored outcomes, where observed incident disease is a mixture of truly incident disease and missed prevalent disease when disease ascertainment is not always conducted at the baseline visit. The third challenge is complex sampling within the cohort, such as two-phase stratified case-control sampling, to ensure efficient use of biospecimen resources.

Estimates from a weighted Cox model that relies on ad hoc schemes to impute event onset times within intervals are biased (Section 3). We proposed a general family of mixture models called prevalence-incidence models and focused on the logistic-Cox model in order to estimate cumulative risk. We proposed a weighted likelihood approach, using IPW to account for the different sampling rates of complex two-phase sampling. We presented a weakly-parametric model using monotone splines, whose goodness of fit can be checked against a semiparametric risk curve estimated by an iterative algorithm that includes a weighted iterative convex minorant algorithm. Our approach is the obverse of the cure model for two heterogeneous subpopulations; cure models have a point mass at infinity, but prevalence-incidence models have a point mass at the origin. Cure models have identifiability problems because cure can never be observed. In contrast, prevalent disease is observable for some patients, which should mitigate identifiability issues with prevalence-incidence models. We applied the IPW logistic-Cox model to estimate risk and to group the 14 HPV types into 4 risk-bands. These risk-bands may be adopted by commercial entities proposing new HPV typing tests for regulatory approval and for adoption into future cervical cancer screening guidelines.

In our example, we focused on total cervical precancer/cancer risk for HPV-positive women, which combines risks of both prevalent and incident disease. However, for other aims, one may focus on only prevalent disease risk or incident disease risk. For example, only incident disease risk is relevant for women who undergo definitive disease ascertainment and are known to be disease-free. In contrast, ideally only prevalent disease risk is relevant for making decisions about whether to undergo definitive disease ascertainment, such as biopsies. Our models yield proper estimates of incident disease risk using all the data, which improves power and reduces selection bias.

Although the weakly-parametric model is flexible, it still requires assumptions. From simulation studies, we found that the assumptions for the weakly-parametric model are plausible in practice, and that the weakly-parametric model can sometimes have less finite-sample bias than the semiparametric model for low/moderate event rates. However, bias can dominate in large data sets with many events, and when the assumptions are violated, weakly-parametric models are more likely than semiparametric models to have larger bias and smaller variance. To identify such situations, it is important to check whether the confidence interval from the weakly-parametric model includes the point estimate of the semiparametric model.
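
The check described above can be sketched in Python as follows. The helper `flag_disagreement` is hypothetical, and the numbers in the usage example are made up for illustration; in practice the inputs would be the weakly-parametric confidence limits and the semiparametric point estimates evaluated on a common grid of time points.

```python
def flag_disagreement(times, ci_lower, ci_upper, semiparametric):
    """Return the time points at which the semiparametric point estimate
    falls outside the weakly-parametric confidence interval."""
    flagged = []
    for t, lower, upper, estimate in zip(times, ci_lower, ci_upper, semiparametric):
        if not (lower <= estimate <= upper):
            flagged.append(t)
    return flagged


# Hypothetical cumulative risks (%) at years 1, 2, and 3.
times = [1, 2, 3]
ci_lower = [10.0, 15.0, 20.0]
ci_upper = [14.0, 19.0, 24.0]
semiparametric = [12.5, 19.8, 22.0]

print(flag_disagreement(times, ci_lower, ci_upper, semiparametric))
```

A non-empty result would suggest that the spline assumptions deserve a closer look at the flagged time points.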

We linked electronic records to assemble a high-risk 5% sub-cohort of women undergoing cervical cancer screening at KPNC, and conducted HPV typing tests on a stratified two-phase sample of 8644 women. The risk curves from the weakly-parametric IPW logistic-Cox model fit the semiparametric curves well. Because having separate guidelines for each of 14 types is too complex for clinicians, we grouped the types into 4 bands by risk: HPV16 has a uniquely high risk of precancer/cancer; HPV18/45 and HPV31/52/33/58 have intermediate risk; and HPV51/39/68/35/59/56/66 have low risk. The most common abnormality in screening is an HC2-positive result with normal cytology, for which guidelines currently recommend that patients return after 1 year. For an HPV-positive woman with normal cytology, if she has HPV16, her risk might be high enough to justify immediate biopsies, but if her HPV type is in the 4th (low-risk) band, her risk might be low enough to justify a 2- or 3-year return. Our findings suggest that HPV typing, in conjunction with cytology, might more precisely define management based on risk. These risk bands could form the basis of future guidelines: high, medium, and low risk might indicate a biopsy, a 1-year return, and a 2-year return, respectively.

Our prevalence-incidence models are an incremental step on the way to developing more sophisticated models. Our models presume only progressive disease, but it is believed that some cervical precancers can spontaneously regress to normalcy without intervention. Regressive outcomes present serious identifiability problems for interval-censoring methods. Also, our model presumes perfect outcome ascertainment, but biopsies are considered insensitive for finding cervical precancers [Schiffman et al. (2011)]. The combination of outcome measurement error and regressive outcomes presents serious identifiability problems for any stochastic model, but must be addressed to develop more realistic and useful models. Cervical precancer is not deadly, so survival bias in sampling is negligible; however, if the interest were to study the natural history of cervical precancer and cancer (rather than simply to develop risk estimates valid for clinical use), we would need to account for left-truncation. Finally, we calculated risks valid only for baseline time-independent covariates, such as a baseline HPV test result. Extending the models to account for internal time-dependent covariates, such as HPV status changing over time, is an area of future work.

The semiparametric IPW logistic-Cox model is computationally intensive. Reducing the computational burden will be critical for epidemiologists, who generally use only their desktop computers and are used to seeing results in a short period of time. An R package (PIMixture) is under development to fit the IPW logistic-Cox model.

Supplementary Material

AOAS1036_Supplementary

Acknowledgments.

The first author would like to thank Dr. Barry I. Graubard for his constructive suggestions during the writing of this paper. The authors are grateful to two referees, an associate editor, and the editor for helpful comments.

Footnotes

SUPPLEMENTARY MATERIAL

Supplement to “Flexible risk prediction models for left or interval-censored data from electronic health records” (DOI: 10.1214/17-AOAS1036SUPP; .pdf). The supplementary materials in the attached file include the proofs of model identifiability and of useful asymptotic results for the estimates, such as consistency and weak convergence to a normal distribution under certain regularity conditions. The simulation studies and results are also summarized in the supplementary materials.

REFERENCES

  1. Breslow NE and Wellner JA (2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scand. J. Stat. 34 86–102.
  2. Breslow NE, Lumley T, Ballantyne CM, Chambless LE and Kulich M (2009). Improved Horvitz-Thompson estimation of model parameter from two-phase stratified samples: Applications in epidemiology. Stat. Biosci. 1 32–49.
  3. Cai T and Zheng Y (2013). Resampling procedures for making inference under nested case-control studies. J. Amer. Statist. Assoc. 108 1532–1544.
  4. Castle PE, Fetterman B, SCT (ASCP), Poitras N, Lorey T, Shaber R and Kinney W (2009). Five-year experience of human papillomavirus DNA and Papanicolaou test cotesting. Obstetrics & Gynecology 113 595–600.
  5. Castle PE, Stoler MH, Wright TC Jr., Sharma A, Wright TL and Behrens CM (2011). Performance of carcinogenic human papillomavirus (HPV) testing and HPV16 or HPV18 genotyping for cervical cancer screening of women aged 25 years and older: A subanalysis of the ATHENA study. Lancet Oncol. 12 880–890.
  6. Chaturvedi AK, Katki HA, Hildesheim A, Rodríguez AC, Quint W, Schiffman M, Van Doorn LJ, Porras C, Wacholder S, Gonzalez P and Sherman ME (2011). Human papillomavirus infection with multiple types: Pattern of coinfection and risk of cervical disease. J. Infect. Dis. 203 910–920.
  7. Cox DR (1972). Regression models and life-tables. J. R. Stat. Soc. Ser. B. Stat. Methodol. 34 187–220.
  8. Dorey FJ, Little RJA and Schenker N (1993). Multiple imputation for threshold-crossing data with interval censoring. Stat. Med. 12 1589–1603.
  9. Graubard BI and Korn EL (1996). Survey inference for subpopulations. Am. J. Epidemiol. 144 102–106.
  10. Groeneboom P and Wellner JA (1992). Information Bounds and Nonparametric Maximum Likelihood Estimation. DMV Seminar 19. Birkhäuser, Basel.
  11. Horvitz DG and Thompson DJ (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663–685.
  12. Huang J and Rossini AJ (1997). Sieve estimation for the proportional-odds failure-time regression model with interval censoring. J. Amer. Statist. Assoc. 92 960–967.
  13. Huang J and Wellner JA (1997). Interval censored survival data: A review of recent progress. In Proceedings of the First Seattle Symposium in Biostatistics (Lin DY and Fleming TR, eds.) 123–169. Springer, New York.
  14. Hyun N, Cheung LC, Pan Q, Schiffman M and Katki HA (2017). Supplement to “Flexible risk prediction models for left or interval-censored data from electronic health records”. DOI: 10.1214/17-AOAS1036SUPP.
  15. Kaplan EL and Meier P (1958). Nonparametric estimation from incomplete observations. J. Amer. Statist. Assoc. 53 457–481.
  16. Katki HA, Kinney WK, Fetterman B, Lorey T, Poitras NE, Cheung L, Demuth F, Schiffman M, Wacholder S and Castle PE (2011). Cervical cancer risk for 330,000 women undergoing concurrent HPV testing and cervical cytology in routine clinical practice at a large managed care organization. Lancet Oncol. 12 663–672.
  17. Katki HA, Schiffman M, Castle PE, Fetterman B, Poitras NE, Lorey T, Cheung LC, Raine-Bennett TR, Gage JC and Kinney WK (2013). Benchmarking CIN3+ risk as the basis for incorporating HPV and Pap cotesting into cervical screening and management guidelines. J. Low. Genit. Tract Dis. 17 S28–S35.
  18. Kovalchik SA and Pfeiffer RM (2014). Population-based absolute risk estimation with survey data. Lifetime Data Anal. 20 252–275.
  19. Li C-S, Taylor JMG and Sy JP (2001). Identifiability of cure models. Statist. Probab. Lett. 54 389–395.
  20. Lumley T (2016). Analyses of complex survey samples. Available at https://cran.r-project.org/web/packages/survey/survey.pdf.
  21. Ma S (2010). Mixed case interval censored data with a cured subgroup. Statist. Sinica 20 1165–1181.
  22. Massad LS, Einstein MH, Huh WK, Katki HA, Kinney WK, Schiffman M, Solomon D, Wentzensen N and Lawson HW (2013). 2012 updated consensus guidelines for the management of abnormal cervical cancer screening tests and cancer precursors. J. Low. Genit. Tract Dis. 17 S1–S27.
  23. Nelder JA and Wedderburn RWM (1972). Generalized linear models. J. Roy. Statist. Soc. Ser. A 135 370–384.
  24. Odell PM, Anderson KM and D’Agostino RB (1992). Maximum likelihood estimation for interval-censored data using a Weibull-based accelerated failure time model. Biometrics 951–959.
  25. Robertson T, Wright FT and Dykstra RL (1988). Order Restricted Statistical Inference. Wiley, Chichester.
  26. Rücker R and Messerer D (1988). Remission duration: An example of interval-censored observations. Stat. Med. 7 1139–1145.
  27. Saegusa T (2015). Variance estimation under two-phase sampling. Scand. J. Stat. 42 1078–1091.
  28. Schiffman M, Wentzensen N, Wacholder S, Walter K, Gage JC and Castle PE (2011). Human papillomavirus testing in the prevention of cervical cancer. J. Natl. Cancer Inst. 103 368–383.
  29. Schiffman M, Vaughan LM, Raine-Bennett TR, Castle PE, Katki HA, Gage JC, Fetterman B, Befano B and Wentzensen N (2015). A study of HPV typing for the management of HPV-positive ASC-US cervical cytologic results. Gynecol. Oncol. 138 573–578.
  30. Sen B and Banerjee M (2007). A pseudolikelihood method for analyzing interval censored data. Biometrika 94 71–86.
  31. Shao F, Li J, Ma S and Lee M-LT (2014). Semiparametric varying-coefficient model for interval censored data with a cured proportion. Stat. Med. 33 1700–1712.
  32. Tian L and Cai T (2006). On the accelerated failure time model for current status and interval censored data. Biometrika 93 329–342.
  33. Wang L, McMahan CS, Hudgens MG and Qureshi ZP (2016). A flexible, computationally efficient method for fitting the proportional hazards model to interval-censored data. Biometrics 72 222–231.
  34. Woodward M (1999). Epidemiology: Study Design and Data Analysis. Chapman & Hall/CRC, Boca Raton, FL.
  35. Zhang Y, Hua L and Huang J (2010). A spline-based semiparametric maximum likelihood estimation method for the Cox model with interval-censored data. Scand. J. Stat. 37 338–354.
