Summary
Disease incidence data in a national-based cohort study would ideally be obtained through a national disease registry. Unfortunately, no such registry currently exists in the United States. Instead, the results from individual state registries need to be combined to ascertain certain disease diagnoses in the United States. The National Cancer Institute has initiated a program to assemble all state registries to provide a complete assessment of all cancers in the United States. Unfortunately, not all registries have agreed to participate. In this article, we develop an imputation-based approach that uses self-reported cancer diagnosis from longitudinally collected questionnaires to impute cancer incidence not covered by the combined registry. We propose a two-step procedure, where in the first step a mover–stayer model is used to impute a participant’s registry coverage status when it is only reported at the time of the questionnaires given at 10-year intervals and the time of the last-alive vital status and death. In the second step, we propose a semiparametric working model, fit using an imputed coverage area sample identified from the mover–stayer model, to impute registry-based survival outcomes for participants in areas not covered by the registry. The simulation studies show the approach performs well as compared with alternative ad hoc approaches for dealing with this problem. We illustrate the methodology with an analysis that links the United States Radiologic Technologists study cohort with the combined registry that includes 32 of the 50 states.
Keywords: Disease registry, Missing outcome, Mover–stayer model, Self-reported failure time, Working imputation model
1. Introduction
In longitudinal cohort studies of disease incidence such as cancer, HIV, and kidney failure, the ideal scenario is that registries cover all residential areas of the study population, so that disease incidence captured by routine registry linkage can be treated as the gold standard. Outcomes of disease incidence so identified are regressed against a vector of covariates $Z$ via the Cox proportional hazards model, specified as

$$\lambda(t \mid Z) = \lambda_0(t)\exp(\beta^\top Z), \tag{1.1}$$

where $\beta$ is a vector of log hazard ratios associated with $Z$ and $\lambda_0(t)$ is an unspecified baseline hazard function. The interest is in making inference on $\beta$. In reality, however, registries may not provide perfect coverage of every study participant over the course of follow-up and, as a result, disease cases may be underreported. While self-reported disease incidence can be collected as a supplement, it is subject to reporting errors.
Our motivating example comes from the US Radiologic Technologists (USRT) study, a large US-based cohort study that examined the effects of low-dose radiation on cancer incidence in medical technologists. Cancer incidence was originally assessed only through repeated questionnaires where participants were asked about the occurrence and timing of cancer incidence. It was recognized that self-reported outcome ascertainment is prone to reporting errors. However, this was the best that could be done since state-specific cancer registry programs that obtain high-quality medical cancer diagnoses had not yet been established at the time the study was initiated in the mid-1980s. Since that time, state-specific cancer registries have collected high-quality cancer diagnosis data. Recently, there has been an attempt to coordinate the state registries into a national resource that would facilitate better outcome assessment for US-based cancer cohorts. Unfortunately, not all states have agreed to participate in this initiative, resulting in cohort participants potentially being in or moving into states that are not covered by this resource. For USRT, the entire cohort finder file of 146 022 subjects was sent to each participating state registry for linkage using first and last name, middle name or initial, date of birth and birth sex, and Social Security Number as the linking variables (Liu and others, 2022). The linkage process identified matches between individuals in the USRT cohort and individuals in the registry that had been diagnosed with cancer. The observed data consist of USRT self-reported cancer diagnosis, variables collected in the questionnaires, and registry-identified cancer diagnosis for study participants who had a matched record in the participating cancer registries. In addition, at the start of the study, all participants were members of the American Association of Radiologic Technologists (AART) with known addresses.
In follow-up, questionnaires were sent to the last updated address for participants who were active members of AART, or the address where the previous questionnaire was sent for those no longer members of AART. A fraction of these questionnaires were forwarded to a new address. If the new address was reported as unknown (dead letter), then a company was hired to track down the address if at all possible.
Figure 1 illustrates examples of the eight types of coverage status/event patterns that are possible in the registry follow-up in USRT. Subjects 1 and 2 reside in the coverage areas during the entire follow-up with subject 1 having no registry-based cancer identified (censored at the end of follow-up) and subject 2 being diagnosed with cancer in follow-up. Due to registry noncoverage subject 3 does not have registry-identified cancer, whereas subject 4 resides outside of coverage area but the cancer diagnosis is captured in the registry. One possible explanation for the few cases like subject 4 in the USRT cohort is that these subjects may get cancer diagnosis or treatment in one state and get reported to another state’s registry. Subject 5 does not have a cancer diagnosis while in the covered area, but his/her cancer diagnosis is unknown after moving out of the covered area. Subject 6 spends a period of time residing outside the coverage area and is diagnosed with cancer after this subject returns to a covered state. Subject 7 might be either cancer-free at the end of follow-up or have cancer in years 1999–2001 when this subject resides outside the coverage area. Subject 8, similar to subject 6, has cancer diagnosed after moving back to a state in the coverage area. New statistical methodology is needed to account for these complex observation schemes for incomplete registry-based cancer diagnosis in the analysis.
Fig. 1.
Examples of coverage status and event patterns in the registry follow-up in USRT.
Throughout the article, we treat the registry-based outcome data as the gold standard and self-reported assessment as the diagnosis measured with reporting error. The conceptual idea of our proposed methodology is to use self-reported data that is potentially available on the entire cohort, to impute registry-based cancer diagnoses for individuals who live in or move into states that are not participating in the national registry resource. Various ad hoc approaches might be attempted to replace missing registry linkage outcomes due to noncoverage such as censoring individuals if registry-identified disease is not found such as subjects 3, 5, and 7 in Figure 1, or using self-reported outcomes for either the entire study population or subjects without registry-identified outcomes. Using registry-identified outcomes by treating missing outcomes as cancer-free at the end of registry follow-up might shift the distribution of registry failure time to the right and potentially lead to bias in parameter estimation. Self-reported outcomes on the other hand are prone to reporting errors.
It has been shown that measurement errors in outcomes may cause substantial bias in parameter estimation. For example, Edwards and others (2013) showed parameter estimators in the binary regression model of misclassified outcomes were biased. Oh and others (2018) showed the naive log hazard ratio estimator in the Cox proportional hazard model of error-prone failure times introduces significant bias, and some of the bias remains after correction with the SIMEX method.
One common strategy used to alleviate the bias caused by measurement errors is to select a validation sample in which both the gold standard and self-reported outcomes are available; parameters for an imputation model can be estimated consistently using the gold standard outcomes from the validation sample. Additionally, the relationship between the true and self-reported outcomes may be exploited to recover information contained in the remaining self-reported outcomes to increase estimation efficiency. Edwards and others (2013) and Shepherd and others (2012) considered multiple imputation by fitting models using the validated sample and imputing missing true binary and continuous responses for observations not in the validated sample, respectively. More recently Oh and others (2021) used regression calibration and generalized raking to correct the bias due to measurement errors in both survival outcome and covariates. Giganti and others (2020) proposed fitting discrete failure time models to the validated subsample and then imputing values outside of the validation sample multiple times for time-to-event outcomes and covariates. However, in the context of registry-based disease diagnosis as the gold standard, a priori validation set does not exist, and it requires complete knowledge of coverage status throughout follow-up to determine whose gold standard survival outcomes are known and thus can be treated as a validation sample. In reality, coverage statuses of participants in most cohorts are only reported at wide intervals (every 10 years and at the time of the last-alive vital status and death in USRT). In the examples in Figure 1 where the coverage status is completely known, subjects 1, 2, 5, and 6 are qualified to be included in the validation sample because they reside in the coverage area at the beginning of follow-up. 
If the coverage status were observed only intermittently, the missing coverage status would need to be imputed before a subject's membership in the validation sample could be determined.
In this article, we propose a two-step approach to deal with the problems of missing registry-based disease diagnosis and presence of measurement errors of self-reported time-to-event outcomes. In the first step, we use a mover–stayer model to impute each subject’s yearly coverage status. We then identify a set of individuals, termed an imputed coverage area sample, in which each subject’s imputed coverage status is inside the registry coverage area in the beginning of follow-up until they move out of the registry coverage areas or their disease incidences are captured by the registry, whichever occurs first. This sample can be treated as a subcohort with prospective follow-up so that their possibly right-censored registry-identified failure times can be attained. In the second step, we use a semiparametric working imputation model estimated from the imputed coverage area sample to impute unobserved registry-identified outcomes based on self-reported survival outcomes and covariates. Regression parameters in the Cox model are then estimated via the usual partial likelihood using observed together with imputed registry-based survival outcomes.
Our application focuses on the analysis of cancer registry data, but the proposed methodology applies more generally to registries for other chronic diseases with time-to-event outcome. The outline of the rest of the article is as follows. Section 2 presents the mover–stayer model, working model for imputation, proposed estimation method, and its extension to incorporating competing risks. Section 3 presents the simulation study, and Section 4 the analysis of USRT study data. The article concludes with discussion in Section 5.
2. Methods
2.1. A mover–stayer model
We begin by presenting a mover–stayer model (Goodman, 1961) for intermittently observed residence status with respect to being inside or outside of the registry coverage area. We will use the estimated mover–stayer model to impute each subject’s time-varying residence status and then identify subjects who reside in the registry coverage area in the beginning of follow-up so that their registry-based survival outcome can be obtained in a prospective manner.
In the mover–stayer model formulation, a population consists of three groups of individuals: (i) stayers residing inside of the registry coverage area for the entire study period; (ii) stayers residing outside of the registry coverage area for the entire study period; and (iii) movers moving between inside and outside of the coverage area during the study. Let
indicates a stayer outside of the coverage area with probability
,
indicates a stayer inside of the coverage area with probability
and
indicates a mover with probability
. Let
denote the vector of complete residence status for the
th subject at calendar year
, where
if subject
resides in the registry coverage areas in calendar year
and 0 otherwise. For simplicity we assume the residence status remains constant within the year, that is, for
with
. Then, the mover–stayer model for
given
is given by
![]() |
(2.2) |
where
is the probability of a mover residing in the coverage area in the initial time period, and
is the probability of moving in vs remaining in the coverage area in year
for
, respectively.
The observed residence status is
, where
denotes the calendar year in which we observed the x
residence status, and
is the total number of occasions that we collected the residence information for individual
. We assume that the vector of indicators for missing data status corresponding to the elements of
is missing completely at random, that is, the missingness does not depend on residence status, covariates in the disease outcome model, or disease outcomes. This assumption is reasonable for the USRT study, where residence state was observed approximately every 10 years at the time of each questionnaire.
The EM algorithm is used to find the maximum likelihood estimator of
in which the missing data are
and missing
. As the number of missing
is potentially large, we use a modification of the forward–backward algorithm, originally developed in the context of fitting hidden Markov models to simplify the E-step calculation (Baum and others, 1970; Albert, 1999, 2000). After the EM algorithm reaches convergence, for the ease of computation, we impute each missing
by its posterior mode such that an imputed coverage area sample can be identified. Let
denote the vector of imputed residence status for subject
. Suppose follow-up for subject
starts at calendar year
. We choose to create an imputed coverage area sample
for individuals residing in the coverage area at time
, that is,
, because their at-risk processes for the error-free event of interest are known at the beginning of follow-up. The imputed coverage area sample obtained in this way can be used to estimate a predictive distribution for imputing registry-based failure time data for individuals without registry-identified cancers.
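The E-step and posterior-mode imputation described above can be illustrated for a single mover with a small forward–backward pass. This is a minimal sketch under simplifying assumptions that are not taken from the article: the transition probabilities are constant over years and are parameterized as a move-in and a move-out probability, the stayer classes are ignored, and missing years are filled with the marginal (year-by-year) posterior mode. The function and argument names are hypothetical.

```python
def mover_posterior(obs, p_init, p_move_in, p_move_out):
    """Return P(inside coverage in year t | all observed statuses) for a mover.

    obs holds 1 (inside), 0 (outside), or None (not observed) for each year.
    Assumed parameterization: p_init = P(inside in first year),
    p_move_in = P(inside | outside last year), p_move_out = P(outside | inside last year).
    """
    n_years = len(obs)
    start = [1.0 - p_init, p_init]
    # trans[a][b] = P(status b this year | status a last year)
    trans = [[1.0 - p_move_in, p_move_in],
             [p_move_out, 1.0 - p_move_out]]

    def emit(t, s):
        # An observed year is compatible with one state only; a missing year with both.
        return 1.0 if obs[t] is None or obs[t] == s else 0.0

    def normalise(pair):
        total = pair[0] + pair[1]
        return [pair[0] / total, pair[1] / total]

    # Forward pass: alpha[t][s] is proportional to P(obs[0..t], status_t = s).
    alpha = [normalise([start[s] * emit(0, s) for s in (0, 1)])]
    for t in range(1, n_years):
        alpha.append(normalise([
            emit(t, s) * sum(alpha[t - 1][r] * trans[r][s] for r in (0, 1))
            for s in (0, 1)]))

    # Backward pass: beta[t][r] is proportional to P(obs[t+1..] | status_t = r).
    beta = [[1.0, 1.0] for _ in range(n_years)]
    for t in range(n_years - 2, -1, -1):
        beta[t] = normalise([
            sum(trans[r][s] * emit(t + 1, s) * beta[t + 1][s] for s in (0, 1))
            for r in (0, 1)])

    posterior = []
    for t in range(n_years):
        weights = normalise([alpha[t][s] * beta[t][s] for s in (0, 1)])
        posterior.append(weights[1])
    return posterior


def impute_status(obs, posterior):
    """Keep observed years; fill missing years with the marginal posterior mode."""
    return [o if o is not None else int(p >= 0.5) for o, p in zip(obs, posterior)]
```

For a subject observed inside the coverage area in the first year and outside in the fourth, `impute_status` fills the two intermediate years from the posterior probabilities; a subject would enter the imputed coverage area sample when the imputed status in the first year of follow-up equals 1.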
2.2. A working model for the predictive distribution of registry-identified event time data
The correct specification of the predictive distribution of registry-based failure time given self-reported failure time in the presence of censoring is complicated. It requires assumptions on the measurement error model and distribution of censoring time, and it is well known that the mis-specification of the measurement error model may result in severe bias in the estimation of
(Pepe, 1992). Computationally, estimation of the predictive distribution is expected to be unstable, because it is unlikely that predictive distribution has a closed-form expression and the baseline hazard in (1.1) for error-free failure time is unspecified. To alleviate this complexity, we consider a flexible working imputation model specified separately for each self-reported failure type with covariates and self-reported failure time as the predictors. We first consider a working model with cancer of any type as the failure and then extend the working model to the competing risk setting to incorporate cause-specific cancer incidence.
Let
and
represent registry-identified time to cancer and censoring time due to death or end of registry follow-up since
, respectively, and
associated covariate vector.
and
are assumed to be independent conditional on
. For some subjects,
may not be identified due to noncoverage of registry. Assume
follows the Cox model with the hazard function specified in (1.1). For each individual
in the imputed coverage area sample
, let
denote the time from the beginning of follow-up till the first time moving out of the registry coverage areas. Write
for
,
and
. If an individual in
and the registry-based cancer is captured after moving back to a registry coverage area, this subject’s registry-identified failure time used for estimating the working model will be censored at
, because otherwise this person’s at-risk status during the period residing outside the coverage area would depend on whether cancer is captured by the registry or not later. Subject 6 in Figure 1 is an example where
years,
years and
years. Correspondingly,
is used to estimate the imputation model, but this subject’s final outcome
and
is used to estimate the parameters of interest
in (1.1). In the USRT data, less than 1% of registry-identified cancers were captured after the subjects moved back to the coverage areas. Let
represent error-prone self-reported failure time and
censoring time due to death or end of survey follow-up. Write
and
.
The working model for the predictive distribution of
conditional on
is specified for each failure type
separately given by
![]() |
(2.3) |
with the associated survival function denoted by
, where
is for
and
for
. To further allow for flexibility, we use the natural cubic splines
to model the effects of
on
. Although model (2.3) is not directly compatible with (1.1), it is a flexible working model that facilitates tractable analysis while providing robust inferential properties for
, as evidenced in the simulation studies in Section 3. Define the counting process
, at-risk process
and let
be the corresponding covariates and
denote the end of registry follow-up time. Let
and
denote the set of individuals in
with
and size of
,
, respectively. The estimators
of
are the solutions to
with
![]() |
(2.4) |
where
![]() |
(2.5) |
Registry-identified failure time is imputed from
conditional on that if it occurs, it is in the window that the individual is outside of the coverage area, where
and
is the Nelson–Aalen estimator of the cumulative baseline hazard function. Write
for the imputed value and
,
.
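The window-restricted draw described above can be sketched with inverse-transform sampling from a step survival curve. This is an illustrative sketch rather than the authors' implementation: it assumes the subject-specific survival function from the working model has already been evaluated at its jump times, that the subject is event-free at the start of the out-of-coverage window, and that a draw falling beyond the window is recorded as no event within the window. All names are hypothetical.

```python
import bisect
import random


def impute_failure_time(jump_times, surv, window_start, window_end, rng=random):
    """Draw a failure time restricted to the window (window_start, window_end].

    jump_times: sorted times at which the estimated survival curve drops.
    surv: estimated survival probability just after each jump time.
    Returns (time, event), with event = 1 for an imputed failure inside the
    window and event = 0 when no failure is imputed within the window.
    """
    def survival_at(t):
        k = bisect.bisect_right(jump_times, t) - 1
        return 1.0 if k < 0 else surv[k]

    s_start = survival_at(window_start)
    s_end = survival_at(window_end)
    # Conditional on being event-free at window_start, S(T) / S(window_start)
    # is uniform, so the target survival level is u * S(window_start).
    target = rng.random() * s_start
    if target < s_end:
        return window_end, 0
    for t, s in zip(jump_times, surv):
        if window_start < t <= window_end and s <= target:
            return t, 1
    return window_end, 0
```

Under these assumptions the probability of an imputed event in the window is 1 − S(window_end)/S(window_start), so subjects who spend only a short period outside the coverage area are rarely assigned an imputed failure.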
2.3. Proposed estimation method
The survival outcome
is observed if subject
resides in the registry coverage areas throughout registry follow-up (i.e.
) or if the cancer incidence is captured by the registry linkage. Conversely,
needs to be imputed if this subject resides outside the registry coverage areas at some time during follow-up (i.e.,
for some
) and no cancer incidence is captured by the registry linkage. The imputed residence status is incorporated when imputing failure time from the working model. Specifically, uncensored imputed failure times are constrained in time periods where
, because otherwise cancer incidence would have been captured by the registry linkage. In addition, imputed failure times for subjects in
are generated conditional on being cancer-free at the time the subjects move out of the coverage areas. After imputed survival outcomes are obtained, the parameter of interest
is estimated via the usual partial likelihood using observed together with imputed registry-identified survival outcomes. Let
if cancer incidence is captured by the registry linkage and 0 otherwise and the imputation indicator
The proposed estimator
of
is the solution to the following estimating equation
![]() |
(2.6) |
where
![]() |
,
, and
for
. The imputation for failure time is repeated
times, and the final estimate is the average of
estimated values of
. The imputation approach considered here is improper, as a single estimate of the parameters is used throughout the imputation step. It is not feasible to implement proper imputation, because the parameters involve the infinite-dimensional baseline hazard function, and it is for this reason that improper imputation is common in the survival setting with missing data. The standard errors of the averaged
are obtained by the bootstrap resampling method. In each bootstrap sample, both the mover–stayer model and the working model are re-estimated and the multiple imputations are carried out. The bootstrap accounts for the facts that the coverage status is calculated from the posterior mode of the mover–stayer model and that the working model for imputation may be misspecified.
2.4. Incorporating competing risks
The approaches presented in 2.2–2.3 are readily extended to incorporate competing risks where failure type 1 denotes the cancer type of interest and failure type 2 the competing cancer type. In the USRT study, we are interested in thyroid cancer and nonthyroid cancer is the competing event. In the imputations, the competing cancer type should not be treated as censored, because imputation of registry failure time may only be applied to subjects with no registry-identified cancer diagnosis. Assume the cause-specific hazard function of failure type 1 follows the Cox model given by
![]() |
(2.7) |
and the interest is in estimating
. For simplicity of notation, covariates
are used in both the all-cause hazard function (1.1) and cause-specific hazard function here. A subset of covariates can be incorporated in each model by assigning 0 to the associated regression coefficients. Let
if failure type
is observed and 0 otherwise. Similar to all-cause cancer, the working model for the hazard function of registry-identified failure time of failure type
is specified separately for each failure type of self-report failure time (
, cancer type of interest, 2, competing cancer type, and 0, censored due to noncancer death or end of follow-up) given by
![]() |
(2.8) |
Parameters
are estimated by the Cox partial likelihood with the estimating function similar to (2.4) by treating the other failure type as censored. The survival function
of the working models is estimated by
, where
are the cause-specific cumulative hazard function estimator. Imputed value of all-cause failure time
is sampled from
and failure type
is assigned according to the Bernoulli distribution with probability
. Once imputed values of failure times and failure types are obtained for subjects with
, these imputed values together with the rest of the observed error-free survival outcomes are used to obtain the estimator of
by solving the estimation equation similar to (2.6) by treating the competing cancer type as censored.
The algorithm of estimating the cause-specific log hazard ratios
is summarized in the following steps:
(1) Estimate parameters
in the mover–stayer model (2.2) via the EM algorithm. Estimate
by the posterior mode.
(2) (a) Create an imputed coverage area sample
for subjects with
. Determine observed cancer ascertainment status
and imputation indicator
. (b) Estimate working models (2.8) for the cause-specific hazard of registry-based failure time using
from
by solving the estimation equation similar to (2.4) by treating the other failure type as censored.
For
repeat steps 3–4:
(3) Randomly impute
and failure type if
from the estimated working models satisfying
and
.
(4) Use observed and imputed registry-identified failure times to find
of
maximizing the usual Cox partial likelihood treating the competing failure type as censored.
(5) Calculate the final estimate
of
,
.
The entire algorithm was run
times, first with the original data to obtain the point estimate for
and second with
bootstrap samples on the USRT data linked with cancer registries to calculate the bootstrap standard error of
. Let
denote the estimate in
imputation of
bootstrap sample and
. The standard error of
is estimated by the standard deviation of
.
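The pooling and variance steps reduce to simple averaging, as the following sketch shows. It assumes the per-imputation estimates of a single regression coefficient are already available as plain numbers; the function names are hypothetical, and the use of the sample standard deviation (divisor one less than the number of bootstrap samples) is an assumption, since the text does not state the divisor.

```python
from statistics import mean, stdev


def pooled_estimate(imputation_estimates):
    """Average the estimates obtained from the repeated imputations."""
    return mean(imputation_estimates)


def bootstrap_se(bootstrap_estimates):
    """Bootstrap standard error of the pooled estimate.

    bootstrap_estimates[b][m] is the estimate from imputation m of bootstrap
    sample b. Each bootstrap sample is first pooled over its imputations, and
    the standard deviation is then taken across bootstrap samples.
    """
    per_sample = [mean(row) for row in bootstrap_estimates]
    return stdev(per_sample)
```

Pooling within each bootstrap sample before taking the standard deviation mirrors how the point estimate itself is formed, so the resulting standard error reflects the variability of the averaged estimator rather than of a single imputation.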
3. Simulation study
We conducted a simulation study to evaluate and compare the performance of the parameter estimators of
in the cause-specific Cox proportional hazards model for registry-identified failure time data obtained from the proposed method and a number of alternative approaches listed in Section 3.2. In addition, we evaluate the properties of the estimators of the parameters in the mover–stayer model and the effect of mis-specification of the mover–stayer model on the estimation of
.
3.1. Setup
Yearly residence status (inside vs outside registry coverage area) was simulated from a mover–stayer model with four scenarios. In scenarios I and II, the mover–stayer model follows (2.2) with parameter values
and
respectively with 25 years of follow-up. The transition probabilities
in the second scenario were higher than those in the first scenario, resulting in more cases moving in and out of the coverage areas. Parameter values in scenario III were kept the same as in scenario I, and those in scenario IV the same as in scenario II, except that the transition probabilities in scenarios III and IV depend on a random intercept such that logit(
,
,
for mover–stayer models III and IV, respectively, and
is normally distributed with mean 0 and standard deviation 0.5. Scenarios III and IV are used to assess the effect of mis-specification of transition probability in the mover–stayer model (2.2) due to omitting a covariate on the estimation of
. Missing values were introduced to the status of residence by assuming (i) there was an equal (20%) probability of observing residence status in each of the first 5 years and (ii) there was a 90% chance of observing residence status every 5 years. Study entry
for all subjects.
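Generating yearly residence paths from a mover–stayer model of this kind can be sketched as follows. The sketch assumes time-constant transition probabilities parameterized as a move-in and a move-out probability, and it does not include the random intercept used in scenarios III and IV or the missingness mechanism. The function name, argument names, and the numerical values in the usage note are illustrative assumptions.

```python
import random


def simulate_residence(n_years, pi_out, pi_in, p_init, p_move_in, p_move_out, rng):
    """Simulate one subject's yearly coverage status (1 = inside, 0 = outside).

    pi_out, pi_in: probabilities of being a stayer outside / inside coverage.
    p_init: probability that a mover starts inside the coverage area.
    p_move_in, p_move_out: yearly probabilities that a mover enters / leaves.
    """
    u = rng.random()
    if u < pi_out:
        return [0] * n_years            # stayer outside the coverage area
    if u < pi_out + pi_in:
        return [1] * n_years            # stayer inside the coverage area
    status = [int(rng.random() < p_init)]
    for _ in range(1, n_years):
        if status[-1] == 1:
            status.append(0 if rng.random() < p_move_out else 1)
        else:
            status.append(1 if rng.random() < p_move_in else 0)
    return status
```

For instance, `simulate_residence(25, 0.2, 0.4, 0.6, 0.03, 0.02, random.Random(2024))` draws one 25-year path using illustrative values chosen to be close to the scenario I Monte Carlo means reported in Table 1; which of the two stayer probabilities and which of the two transition probabilities each value corresponds to is an assumption of this sketch.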
The impact of identifying the imputed coverage area sample correctly is evaluated by two indices. The first one, denoted by
, is the estimated proportion of agreement between estimated and true residence status at the study entry, that is,
. If
, the validation cohort identified from the estimated mover–stayer model agrees with the true underlying coverage area cohort completely. The second one, denoted by
, measures the degree of agreement between true and estimated residence status from the study entry till the end of follow-up defined as the estimated proportion of subjects with no more than two discrepancies between
and
for
.
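The two agreement indices can be computed directly from the true and estimated residence paths. The sketch below assumes each path is a list of yearly 0/1 statuses beginning at study entry and that both inputs list the same subjects in the same order; the function name is hypothetical.

```python
def agreement_indices(true_status, est_status, max_discrepancies=2):
    """Return (agreement at study entry, agreement over follow-up).

    true_status, est_status: one list of yearly 0/1 statuses per subject.
    The first index is the proportion of subjects whose estimated status at
    study entry equals the truth; the second is the proportion of subjects
    with at most max_discrepancies mismatching years over follow-up.
    """
    n = len(true_status)
    entry_agree = sum(t[0] == e[0] for t, e in zip(true_status, est_status))
    path_agree = sum(
        sum(a != b for a, b in zip(t, e)) <= max_discrepancies
        for t, e in zip(true_status, est_status))
    return entry_agree / n, path_agree / n
```

A first index equal to 1 corresponds to the case described above in which the cohort identified from the estimated mover–stayer model coincides with the true coverage area cohort.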
Registry-identified failure time was calculated as time from study entry to the onset of event, where failure type 1 is the event of interest and failure type 2 the competing event. Three sets of constant cause-specific baseline hazards for registry-identified failure times were considered, and the associated censoring time was generated from a 1:9 mixture of a normal distribution and a point mass at 18 years, yielding 90% (2% failure type 1 and 8% failure type 2), 69% (15% failure type 1 and 16% failure type 2), and 55% censoring (26% failure type 1 and 19% failure type 2). The true values of
were
corresponding to covariate
generated from Bernoulli with equal probability and
from the standard normal. To mimic the USRT data, if a failure of any type occurs outside the coverage areas (i.e., status of residence at the year of occurrence = 0), that failure was masked and only censoring time was observed.
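The masking rule can be written as a small helper. The sketch assumes follow-up time is measured in years from study entry, so that the year of occurrence is the integer part of the failure time, and that the residence path is indexed from 0 at study entry; these conventions and the function name are assumptions made for illustration.

```python
import math


def mask_outside_coverage(event_time, event_type, censor_time, residence):
    """Return the observed (time, type) after masking out-of-coverage failures.

    residence[y] is 1 if the subject is inside the coverage area in follow-up
    year y and 0 otherwise. Type 0 denotes censoring.
    """
    if event_time > censor_time:
        return censor_time, 0           # censored before any failure occurs
    year = min(int(math.floor(event_time)), len(residence) - 1)
    if residence[year] == 0:
        return censor_time, 0           # failure masked: only censoring time seen
    return event_time, event_type
```

A masked failure is therefore indistinguishable in the observed data from a subject who remained event-free until the censoring time, which is the feature of the USRT data that the imputation step is designed to address.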
Error-prone failure time was generated from the regression model
, where
is zero-mean normal with standard deviation
with
. Censoring time associated with error-prone failure time was generated from the Weibull distribution and truncated at the last time of observed residence status. As few discordances between self-reported and registry-identified cancer types (thyroid vs nonthyroid) in the USRT data were observed, mis-classification rates of failure types (
and
) were set at 1%
and for
was set to 0 for
. The number of imputations
for imputing missing registry-identified failure time was set at 20. The sample size was 10 000 and simulations were repeated 1000 times. In each simulation replicate, 500 bootstrap samples were generated for variance estimation.
3.2. Estimation approaches for
Parameters of interest
in the cause-specific Cox proportional hazards model of failure type 1 were estimated by seven approaches briefly summarized below. These approaches differ in the data used in Cox regression to estimate
.
(1) Gold standard (GS): Using error-free registry-based failure time data. This approach cannot be implemented in real examples, but serves as a benchmark in the simulation studies;
(2) Imperfect registry (IR): Using imperfect registry failure time data in which all observations without registry-identified cancer are treated as cancer-free at the end of registry follow-up;
(3) Self-reported (SR): Using error-prone self-reported failure time data;
(4) Hybrid: Using imperfect registry failure time data with failure times of self-reported cancer cases substituted for observations without registry-identified cancer;
(5) Step 1: Using the sample of observations
derived from the first step of the proposed approach;
(6) Proposed: Using registry-identified cancer cases combined with imputed failure times for imperfect registry-censored observations obtained from the proposed approach;
(7) Nonparametric imputation (NP): Using registry-identified cancer cases combined with randomly imputed failure times for imperfect registry-censored observations obtained from a nonparametric kernel smoothed product limit estimator conditional on self-reported failure time and failure type (Beran, 1981).
3.3. Results
3.3.1. Mover–stayer model
The simulation results listed in Table 1 show
is high in both scenarios I and II, and there is little bias in the estimates of both sets of parameters. Because fewer transitions occur for observations generated from the scenario I than scenario II, there is less uncertainty in the estimate of residence status and as a result Monte Carlo means of both
and
are higher in scenario I. On the other hand, as the transition probabilities in scenario I are small compared to scenario II, it is harder to distinguish stayers from movers, which is reflected in the larger Monte Carlo standard deviations of the estimators of
and
in scenario I. When the transition probabilities depend on a subject-level covariate that is omitted from (2.2), as in scenarios III and IV,
in these two scenarios have larger bias than scenarios I and II. Interestingly, even though these parameter estimates are biased,
and
in these two scenarios are similar to those in scenarios I and II, respectively, suggesting the selection of the validation sample is robust against the mis-specification of the mover–stayer model.
Table 1.
Monte Carlo means and standard deviations of the parameter estimates in the mover–stayer (MS) model used for Step 1, proposed, and NP approaches
| MS | | | | | | | |
|---|---|---|---|---|---|---|---|
| Monte Carlo mean | | | | | | | |
| I | 0.199 | 0.397 | 0.610 | 0.030 | 0.020 | 0.986 | 0.974 |
| II | 0.200 | 0.400 | 0.608 | 0.100 | 0.100 | 0.948 | 0.861 |
| III | 0.221 | 0.434 | 0.607 | 0.039 | 0.026 | 0.985 | 0.969 |
| IV | 0.207 | 0.411 | 0.607 | 0.106 | 0.106 | 0.946 | 0.855 |
| Monte Carlo SD | | | | | | | |
| I | 0.011 | 0.023 | 0.028 | 0.002 | 0.002 | 0.001 | 0.002 |
| II | 0.005 | 0.006 | 0.013 | 0.003 | 0.003 | 0.002 | 0.005 |
| III | 0.009 | 0.016 | 0.024 | 0.003 | 0.002 | 0.001 | 0.002 |
| IV | 0.005 | 0.006 | 0.014 | 0.004 | 0.004 | 0.002 | 0.005 |
3.3.2. Estimators of
The benchmark GS approach is essentially unbiased in all the parameter settings considered (Table 2). Approach IR has little bias under 90% and 69% censoring, and the bias becomes apparent when the percentage of censoring decreases to 55%. Both SR and hybrid approaches exhibit sizable bias. The bias increases with the magnitude of measurement errors and decreases with the level of censoring, because measurement error is introduced from uncensored cases only. The Step 1 and the proposed approach show little bias. The bias of the NP approach demonstrates that it is necessary to include in the working model at least the covariates which are specified in the model for error-free registry-identified failure time. The performance under mis-specifications of the mover–stayer model under scenarios III and IV is investigated in Supplementary material available at Biostatistics online.
Table 2.
Monte Carlo biases of the estimators of
| Censoring (%) | MS | | | GS | IR | SR | Hybrid | Step 1 | Proposed | NP |
|---|---|---|---|---|---|---|---|---|---|---|
| 90 | I | | 1 | 0.004 | 0.000 | 0.007 | 0.005 | 0.000 | 0.001 | 0.029 |
| | | | 1.5 | 0.004 | 0.000 | 0.014 | 0.014 | 0.000 | 0.001 | 0.038 |
| | | | 2 | 0.004 | 0.000 | 0.018 | 0.024 | 0.000 | 0.001 | 0.044 |
| | | | 1 | 0.002 | 0.002 | 0.012 | 0.006 | 0.003 | 0.001 | 0.028 |
| | | | 1.5 | 0.002 | 0.003 | 0.015 | 0.014 | 0.003 | 0.002 | 0.036 |
| | | | 2 | 0.002 | 0.002 | 0.020 | 0.023 | 0.003 | 0.003 | 0.043 |
| | II | | 1 | 0.004 | 0.004 | 0.007 | 0.004 | 0.004 | 0.003 | 0.030 |
| | | | 1.5 | 0.004 | 0.004 | 0.013 | 0.012 | 0.004 | 0.004 | 0.040 |
| | | | 2 | 0.004 | 0.004 | 0.018 | 0.022 | 0.004 | 0.005 | 0.047 |
| | | | 1 | 0.002 | 0.001 | 0.012 | 0.007 | 0.000 | 0.002 | 0.032 |
| | | | 1.5 | 0.002 | 0.001 | 0.015 | 0.014 | 0.001 | 0.002 | 0.041 |
| | | | 2 | 0.002 | 0.001 | 0.020 | 0.023 | 0.000 | 0.002 | 0.048 |
| 69 | I | | 1 | 0.002 | 0.006 | 0.015 | 0.016 | 0.001 | 0.000 | 0.031 |
| | | | 1.5 | 0.002 | 0.006 | 0.027 | 0.038 | 0.001 | 0.000 | 0.042 |
| | | | 2 | 0.002 | 0.006 | 0.037 | 0.055 | 0.001 | 0.000 | 0.049 |
| | | | 1 | 0.000 | 0.007 | 0.016 | 0.017 | 0.001 | 0.001 | 0.032 |
| | | | 1.5 | 0.000 | 0.007 | 0.027 | 0.038 | 0.001 | 0.001 | 0.042 |
| | | | 2 | 0.000 | 0.007 | 0.037 | 0.056 | 0.001 | 0.000 | 0.049 |
| | II | | 1 | 0.002 | 0.005 | 0.015 | 0.015 | 0.001 | 0.000 | 0.035 |
| | | | 1.5 | 0.002 | 0.006 | 0.027 | 0.037 | 0.001 | 0.000 | 0.046 |
| | | | 2 | 0.002 | 0.006 | 0.037 | 0.054 | 0.001 | 0.001 | 0.054 |
| | | | 1 | 0.000 | 0.008 | 0.016 | 0.017 | 0.001 | 0.001 | 0.035 |
| | | | 1.5 | 0.000 | 0.008 | 0.027 | 0.038 | 0.000 | 0.001 | 0.047 |
| | | | 2 | 0.000 | 0.008 | 0.037 | 0.055 | 0.000 | 0.001 | 0.054 |
| 55 | I | | 1 | 0.001 | 0.014 | 0.022 | 0.024 | 0.000 | 0.001 | 0.033 |
| | | | 1.5 | 0.001 | 0.014 | 0.036 | 0.048 | 0.000 | 0.001 | 0.044 |
| | | | 2 | 0.001 | 0.014 | 0.048 | 0.068 | 0.000 | 0.000 | 0.051 |
| | | | 1 | 0.000 | 0.015 | 0.023 | 0.025 | 0.001 | 0.001 | 0.034 |
| | | | 1.5 | 0.000 | 0.015 | 0.038 | 0.050 | 0.001 | 0.001 | 0.045 |
| | | | 2 | 0.000 | 0.015 | 0.048 | 0.069 | 0.001 | 0.000 | 0.051 |
| | II | | 1 | 0.001 | 0.014 | 0.022 | 0.023 | 0.000 | 0.001 | 0.038 |
| | | | 1.5 | 0.001 | 0.014 | 0.036 | 0.047 | 0.000 | 0.001 | 0.050 |
| | | | 2 | 0.001 | 0.014 | 0.048 | 0.066 | 0.000 | 0.001 | 0.057 |
| | | | 1 | 0.000 | 0.016 | 0.023 | 0.025 | 0.000 | 0.002 | 0.039 |
| | | | 1.5 | 0.000 | 0.016 | 0.038 | 0.049 | 0.000 | 0.002 | 0.050 |
| | | | 2 | 0.000 | 0.016 | 0.048 | 0.068 | 0.000 | 0.001 | 0.058 |
The coverage probabilities of the GS, Step 1, and proposed approaches are close to the nominal level, whereas there is substantial under-coverage for the SR, hybrid, and NP approaches (Table S2 of the Supplementary material available at Biostatistics online). Approach IR shows coverage close to the nominal level under 90% and 69% censoring and under-coverage under 55% censoring.
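For readers who want to reproduce this kind of summary, the following is a minimal sketch of how an empirical coverage probability could be computed from Monte Carlo output. It assumes Wald-type 95% intervals (estimate ± 1.96 × standard error); the function name `coverage_probability` and its arguments are hypothetical and are not taken from the authors' R code.

```python
import numpy as np


def coverage_probability(estimates, std_errors, true_value, z=1.96):
    """Fraction of Wald intervals (estimate +/- z * SE) that contain true_value.

    Assumption: one estimate and one standard error per Monte Carlo replicate.
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    covered = (lower <= true_value) & (true_value <= upper)
    return float(np.mean(covered))


# Toy illustration with three replicates (made-up numbers, not study results)
toy_coverage = coverage_probability([0.0, 1.0, 3.0], [1.0, 1.0, 1.0], true_value=1.0)
print(f"Empirical coverage in the toy example: {toy_coverage:.3f}")
```

Under-coverage in this sense means the returned fraction falls noticeably below the nominal 0.95 across replicates.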
Due to the large bias produced by approaches SR, hybrid, and NP, they are excluded from the comparison with respect to mean squared errors (MSE). When the percentage of censoring is high, the IR and Step 1 approaches have comparable MSEs under mover–stayer model I, and approach IR has a smaller MSE under mover–stayer model II. The proposed approach has smaller MSEs than approaches IR and Step 1 throughout, but its improvement over approach IR decreases when the measurement error gets large (Figure S1 of the Supplementary material available at Biostatistics online). Efficiencies of the Step 1 and proposed approaches relative to GS are presented in Table 3. The efficiency gain of the proposed approach over the Step 1 approach ranges from 6% to 20%, with higher gain achieved under smaller magnitudes of measurement error. A reviewer suggested comparing the efficiency gain of Step 1 and the proposed approach relative to a complete case analysis, in which Cox regression is fit to observations whose intermittently observed residence status was inside the coverage area during the entire follow-up. The simulation results show that the efficiencies of Step 1 and the proposed approach relative to the complete case analysis range from 1.02 to 1.28 and from 1.16 to 1.76, respectively (Tables S6 and S7 of the Supplementary material available at Biostatistics online).
Table 3.
Efficiency comparison between Step 1 and proposed approach
| Censoring (%) | MS | | Step 1 | Proposed | Step 1 | Proposed |
|---|---|---|---|---|---|---|
| 90 | I | 1 | 0.583 | 0.762 | 0.579 | 0.743 |
| | | 1.5 | 0.582 | 0.683 | 0.582 | 0.699 |
| | | 2 | 0.583 | 0.641 | 0.579 | 0.661 |
| | II | 1 | 0.486 | 0.684 | 0.505 | 0.707 |
| | | 1.5 | 0.486 | 0.631 | 0.506 | 0.646 |
| | | 2 | 0.485 | 0.580 | 0.506 | 0.610 |
| 69 | I | 1 | 0.660 | 0.823 | 0.623 | 0.775 |
| | | 1.5 | 0.660 | 0.768 | 0.623 | 0.729 |
| | | 2 | 0.660 | 0.736 | 0.623 | 0.692 |
| | II | 1 | 0.538 | 0.738 | 0.525 | 0.721 |
| | | 1.5 | 0.537 | 0.673 | 0.525 | 0.668 |
| | | 2 | 0.537 | 0.632 | 0.525 | 0.635 |
| 55 | I | 1 | 0.631 | 0.776 | 0.660 | 0.793 |
| | | 1.5 | 0.631 | 0.737 | 0.660 | 0.763 |
| | | 2 | 0.631 | 0.706 | 0.660 | 0.741 |
| | II | 1 | 0.549 | 0.720 | 0.587 | 0.759 |
| | | 1.5 | 0.549 | 0.671 | 0.587 | 0.723 |
| | | 2 | 0.549 | 0.632 | 0.587 | 0.689 |
Note: Relative efficiency defined as the mean squared error of the gold standard approach divided by the mean squared error of the used estimation approach.
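The relative efficiency in the note above can be sketched in a few lines. This is an illustrative computation only, assuming a vector of Monte Carlo estimates per approach and a known true parameter value; the helper names `mse` and `relative_efficiency` are hypothetical and the numbers in the toy call are made up.

```python
import numpy as np


def mse(estimates, true_value):
    """Monte Carlo mean squared error of a set of estimates around true_value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean((estimates - true_value) ** 2))


def relative_efficiency(gs_estimates, method_estimates, true_value):
    """MSE of the gold standard (GS) divided by the MSE of another approach.

    Values below 1 indicate that the approach is less efficient than GS.
    """
    return mse(gs_estimates, true_value) / mse(method_estimates, true_value)


# Toy illustration: GS estimates are tighter around the truth than the method's
toy_re = relative_efficiency([1.0, 3.0], [0.0, 4.0], true_value=2.0)
print(f"Relative efficiency in the toy example: {toy_re:.2f}")
```

Because the MSE combines variance and squared bias, an approach with noticeable bias can show low relative efficiency even when its variance is small.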
4. Analysis of USRT cohort data
4.1. USRT cohort and registry linkage
The USRT cohort contained 146 022 radiologic technologists who were certified for at least 2 years between 1926 and 1982 through the American Registry of Radiologic Technology. Four survey questionnaires were each administered about every 10 years: 1983–1989 (Q1), 1994–1998 (Q2), 2003–2005 (Q3), and 2012–2014 (Q4). The participants responded to questions on work history, demographic and lifestyle factors, and past diagnoses of cancer. Vital status of the cohort members was obtained from linkages to the Social Security Death Master File, and for those determined to be deceased, the cause of death was identified from the National Death Index. The USRT cohort was linked to a total of 43 state/regional cancer registries in the United States. For illustrating the application of our proposed method, we used the data freeze in August 2018, in which data from the registries of 32 states accounting for approximately 70% of the US population were received and cleaned. More details of the methods used in the USRT study are described in Boice Jr and others (1992) and Sigurdson and others (2003), and the methods of cancer registry linkage are described in Liu and others (2022).
We study the incidence of first primary thyroid cancer between January 1, 1999 and December 31, 2012 as the outcome of interest, with failure time quantified in days, while treating first primary cancers of all other sites combined as a competing event. Thyroid cancer risk in the USRT cohort has been studied in an epidemiological study using the self-reported data only (Meinhold and others, 2010). We considered a similar cause-specific hazards model with the following covariates: body mass index (BMI) (kg/m²), age in 1999, and gender (male/female). All the covariates were reported at baseline, in Q1 or Q2. We chose the starting point to be January 1, 1999, because many US cancer registries were established in the 1980s to early 1990s, and their cancer incidence data may not be complete in the early years. Our analysis sample excluded (i) 35 648 cohort members who did not respond to Q1 or Q2 and hence did not have any individual variables collected in the USRT study, (ii) 13 984 subjects who died before 1999 or died with a missing death date, and (iii) 12 523 subjects who had a registry- or USRT-identified cancer (other than nonmelanoma skin cancer) before 1999. These three exclusion criteria were also used previously for a descriptive comparison of registry-identified and USRT-identified cancers (Liu and others, 2022). After further excluding 681 subjects with missing values in the covariates used in the analysis, 98 263 subjects were included in the final analysis sample. The concordance between registry-identified and self-reported cancer types and failure times in Q3 or Q4 is listed in Section S2.1 of the Supplementary material available at Biostatistics online.
Registry-identified cancers were censored at either death or the end of 2012, and self-reported cancers were censored at the latest response to Q3 or Q4, death, or the end of 2012. Self-reported outcomes were restricted to subjects who responded to either Q3 or Q4. For the proposed and NP approaches, two options were considered for the 6692 subjects in the imputed outside coverage area sample who were nonresponders to Q3 and Q4 and did not have registry-identified cancers: using the imperfect registry censoring time, or excluding them from the analysis. The state of residence was intermittently observed at the time of each questionnaire, last-alive vital status, and death, so each individual had up to 6 time points of observed residence.
4.2. Analysis results
The estimated mover–stayer model parameters are shown in the top panel of Table 4. Briefly, 22.1% of the population were estimated to be stayers outside of the registry coverage area, and 51.8% were stayers inside the coverage areas. Hence, the percentage of movers was approximately 26% (100% − 22.1% − 51.8%). Among the movers, 59.2% lived in the coverage areas in the year 1983; the yearly transition probabilities of moving into and outside of the coverage areas were 3.2% and 2.4%, respectively.
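To give a sense of what these estimates imply over a questionnaire interval, the sketch below plugs the Table 4 point estimates into a two-state Markov chain for movers. It assumes, for illustration only, that movers follow a time-homogeneous first-order chain with yearly transitions between "outside" and "inside" the coverage area; the variable and function names are hypothetical, and this is not the authors' R implementation.

```python
import numpy as np

# Point estimates taken from Table 4 (top panel)
PI_STAYER_OUT = 0.2207  # proportion of stayers outside the coverage area
PI_STAYER_IN = 0.5175   # proportion of stayers inside the coverage area
P_INIT_IN = 0.5914      # probability that a mover lived inside coverage in 1983
P_MOVE_IN = 0.0316      # yearly probability of moving outside -> inside
P_MOVE_OUT = 0.0235     # yearly probability of moving inside -> outside


def mover_fraction(pi_out, pi_in):
    """Proportion of movers: whoever is not a stayer of either type."""
    return 1.0 - pi_out - pi_in


def n_year_transition(p_in, p_out, years):
    """n-year transition matrix for movers (state 0 = outside, 1 = inside).

    Assumption: the same one-year transition matrix applies every year.
    """
    one_year = np.array([[1.0 - p_in, p_in],
                         [p_out, 1.0 - p_out]])
    return np.linalg.matrix_power(one_year, years)


pi_mover = mover_fraction(PI_STAYER_OUT, PI_STAYER_IN)
ten_year = n_year_transition(P_MOVE_IN, P_MOVE_OUT, years=10)
# Marginal probability of living inside the coverage area in 1983
p_inside_1983 = PI_STAYER_IN + pi_mover * P_INIT_IN

print(f"Estimated mover fraction: {pi_mover:.4f}")
print("10-year transition matrix for movers:")
print(ten_year)
print(f"Marginal probability inside coverage in 1983: {p_inside_1983:.4f}")
```

The 10-year matrix is the relevant scale because residence is observed only at roughly 10-year questionnaire intervals, so unobserved moves between questionnaires are what the mover–stayer model has to impute.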
Table 4.
USRT data analysis
| Mover–stayer model | | | | | | |
|---|---|---|---|---|---|---|
| Estimate | 0.2207 | 0.5175 | 0.5914 | 0.0316 | 0.0235 | |
| (SE) | (0.0025) | (0.0041) | (0.0081) | (0.0008) | (0.0007) | |
| Approach | IR | SR | Hybrid | Step 1 | Proposed | NP |
| Sample size | 98 263 | 75 618 | 98 263 | 68 116 | 98 263 | 98 263 |
| No. of thyroid cancers | 262 | 309 | 382 | 253 | 365 | 359 |
| Event rate (per 10 000 person years) | 2.03 | 3.62 | 3.40 | 3.00 | 2.87 | 2.81 |
| Cox model (estimated log hazard ratio (SE)) | | | | | | |
| Gender | | | | | | |
| Female | 0.5394 | 0.5152 | 0.6047 | 0.5173 | 0.6343 | 0.5914 |
| | (0.1745) | (0.1673) | (0.149) | (0.1829) | (0.1618) | (0.1488) |
| BMI | 0.039 | 0.0227 | 0.0282 | 0.0396 | 0.0317 | 0.0281 |
| | (0.0091) | (0.0101) | (0.0086) | (0.0075) | (0.0077) | (0.0076) |
| Age | −0.0328 | −0.0342 | −0.0329 | −0.0355 | −0.0348 | −0.0318 |
| | (0.0087) | (0.0083) | (0.0073) | (0.0085) | (0.008) | (0.0073) |
Number of thyroid cancer cases and event rate were based on the median of 20 imputations.
Sample size, number of thyroid cancers, and event rate for each estimation approach are listed in the middle panel of Table 4. The estimated log hazard ratios are listed in the bottom panel of Table 4. Approach SR results in lower estimated values than the other approaches. The IR and Step 1 approaches result in similar estimates with slightly lower standard errors in the Step 1 approach. Using the proposed approach, the estimated gender and age effects are stronger than those from other approaches and have lower standard errors than the Step 1 approach. The slight difference in the point estimates between the Step 1 and proposed approaches is likely due to an imbalance in BMI and gender of self-reported thyroid cancer between the imputed coverage area and outside coverage area samples (Section S2.2 of the Supplementary material available at Biostatistics online). Nevertheless, consistent with the findings in Meinhold and others (2010), higher BMI is associated with increased risk of thyroid cancer, age is inversely associated with thyroid cancer risk, and females are at higher risk of thyroid cancer. Compared to the proposed approach, the estimated log hazard ratios of approach SR are attenuated. When the 6692 nonresponders in the imputed outside coverage area sample were excluded from the analysis for the proposed and NP approaches, the results were similar to those presented in Table 4. There were 63 348 subjects whose observed residence was inside the coverage area during 1999–2012. Using this subset, the point estimates of log hazard ratios were similar to those obtained from the Step 1 approach.
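Because Table 4 reports log hazard ratios, it can help to convert them to the hazard ratio scale. The sketch below does this for the proposed approach using the estimates and standard errors from Table 4, assuming a normal-approximation (Wald) 95% interval; that interval construction is an assumption made here for illustration, and the function name `hazard_ratio_ci` is hypothetical.

```python
import math


def hazard_ratio_ci(log_hr, se, z=1.96):
    """Return (HR, lower, upper) by exponentiating log_hr and log_hr +/- z * se."""
    return (math.exp(log_hr),
            math.exp(log_hr - z * se),
            math.exp(log_hr + z * se))


# Proposed-approach estimates and standard errors from Table 4
proposed = {
    "Female": (0.6343, 0.1618),
    "BMI": (0.0317, 0.0077),
    "Age": (-0.0348, 0.0080),
}

for name, (log_hr, se) in proposed.items():
    hr, lower, upper = hazard_ratio_ci(log_hr, se)
    print(f"{name}: HR = {hr:.3f}, approximate 95% CI = ({lower:.3f}, {upper:.3f})")
```

A hazard ratio above 1 (female, BMI) corresponds to higher thyroid cancer risk and a ratio below 1 (age) to lower risk, matching the direction of the associations described above.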
5. Discussion
We proposed a two-step approach to estimate a Cox regression model when the outcome ascertainment comes from two sources: the gold standard is incomplete due to noncoverage, while the self-report is subject to measurement error. This work was motivated by cancer registry linkage, and has important implications for future cohort studies of other disease incidence that will rely on region-specific registries with incomplete coverage. When the participating registries do not provide complete coverage of the study cohort, we demonstrated that the mover–stayer model can effectively identify an imputed coverage area sample, and that the self-report data can then provide useful information to impute the missing cancers due to noncoverage. The proposed method performed well in simulation studies and in the application to the USRT data.
In this article, we used the mover–stayer model to impute each subject’s missing coverage status by the posterior mode. We also considered multiple imputation for both the missing coverage statuses and the missing failure times, and used Rubin’s rule to calculate the standard errors. Because the variability of the parameter estimates in the working imputation model, including the nonparametric estimate of the hazard function, is not accounted for, the proposed approach shows under-coverage for the 95% confidence interval (down to 91.7% in the simulation results). One possible remedy for the under-coverage is to use a flexible parametric imputation model, such as a spline approximation to the baseline hazard, and implement proper imputation.
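As a reminder of how Rubin's rule combines results across imputed data sets, here is a minimal sketch of the standard pooling formulas: the pooled estimate is the mean of the per-imputation estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `rubin_pool` is hypothetical and the toy inputs are made up; this is not the authors' R code.

```python
import numpy as np


def rubin_pool(estimates, variances):
    """Pool m per-imputation estimates and their variances with Rubin's rule.

    Returns (pooled_estimate, pooled_standard_error). Requires m >= 2.
    """
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    m = estimates.size
    pooled = estimates.mean()
    within = variances.mean()        # average within-imputation variance
    between = estimates.var(ddof=1)  # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return float(pooled), float(np.sqrt(total))


# Toy illustration with three imputations (made-up numbers)
toy_estimate, toy_se = rubin_pool([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(f"Pooled estimate: {toy_estimate:.3f}, pooled SE: {toy_se:.3f}")
```

The between-imputation term only reflects variability that the imputation procedure actually propagates, which is why fixing the working model parameters at their estimates can lead to intervals that are too narrow.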
In the mover–stayer model, we assume that the transition probabilities do not depend on the disease outcome. Though this assumption is difficult to verify, it can be relaxed by including observable subject-level covariates in the mover–stayer model, at the cost of increased computational burden. Nevertheless, the simulation results show that the proposed estimator performs well even when a covariate is omitted from the mover–stayer model. The low transition probabilities relative to the measurement interval may contribute to the robust performance of the proposed estimator. In the USRT application, participants moved infrequently, resulting in small transition probabilities relative to the questionnaire interval. Therefore, the mover–stayer model approach is likely able to identify a coverage area sample with high accuracy. In applications where more frequent moving is expected, a more frequent assessment of residence would be needed to facilitate the imputation; otherwise, the imputed values for residence status may be erroneous, which in turn may cause large variability in the log hazard ratio estimator obtained from the second step. Notwithstanding, our simulation results for the second scenario of the mover–stayer model suggest that even in this situation, there is improvement over other approaches in the reduction of mean squared errors.
An implicit assumption needed to fit the working model is that the imputed coverage area sample is a representative subsample of the study cohort and that its selection does not depend on subjects’ self-reported disease outcomes. Under this assumption, the proposed estimator is expected to be robust against model mis-specification. The intuition for this observation comes from Chen (2000), who formally proved this robustness property under a generalized linear model with a surrogate outcome. While our proposed model is in the context of a time-to-event outcome with a much more complicated setting, our simulation studies suggest that this robustness property holds in our setting. This assumption is violated when the mover–stayer model is mis-specified, leading to errors in identifying the imputed inside coverage sample and subsequently to biased estimation of the survival model of interest. This source of bias can be reduced by introducing covariates that better characterize the movement of individuals in and out of covered states. It is most important to include those covariates that also appear in the survival model. In our example, this assumption is reasonable. However, even when the assumption does not hold, under a correctly specified working model, multiple imputation yields valid inferences, as reviewed by McIsaac and Cook (2017). While the working imputation model is inherently mis-specified relative to a full model-based imputation, we propose a flexible working model for the functional relationships between cause-specific self-reported and gold standard failure times that should mitigate this bias in practice; specifying and fitting a full model-based imputation would be very difficult analytically and impractical computationally.
Results of the simulation studies presented in Section 3.3 show that our proposed approach yields little bias in the parameter estimates and that its confidence intervals achieve coverage close to the nominal level under realistic scenarios mimicking our cancer registry analysis.
Software
Software in the form of R code is available at https://github.com/JHSHIH/analysis_missing-outcomes_registry_non-coverage.git.
Supplementary material
Supplementary material is available online at http://biostatistics.oxfordjournals.org.
Acknowledgments
The authors would like to thank Jordan Aaron and Bill Wheeler for computational assistance and acknowledge the contribution to this study from central cancer registries supported through the Centers for Disease Control and Prevention’s National Program of Cancer Registries (NPCR) and/or the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program. Central registries may also be supported by state agencies, universities, and cancer centers. Participating central cancer registries include the following: AK, AR, AZ, CA, CO, CT, DE, FL, GA, HI, IA, ID, IN, KY, LA, MA, MD, MI, MN, MO, MS, MT, NC, NE, NH, NJ, NM, NV, NY, OH, OK, OR, PA, RI, SC, TN, TX, UT, VA, VT, WA, WI, and WY.
Conflict of Interest: None declared.
Contributor Information
Joanna H Shih, Biometric Research Program, Division of Cancer Treatment and Diagnosis, National Cancer Institute, 9609 Medical Center Drive, Bethesda, MD 20892, USA.
Paul S Albert, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Bethesda, MD 20892, USA.
Jason Fine, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Bethesda, MD 20892, USA.
Danping Liu, Biostatistics Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, 9609 Medical Center Drive, Bethesda, MD 20892, USA.
Funding
The Intramural Research Program of National Cancer Institute, National Institutes of Health to P.S.A. and D.L. This work utilized the computational resources of the NIH HPC Biowulf cluster (http://hpc.nih.gov).
References
- Albert, P. S. (1999). A mover-stayer model for longitudinal marker data. Biometrics 55, 1252–1257.
- Albert, P. S. (2000). A transitional model for longitudinal binary data subject to non-ignorable missing data. Biometrics 56, 602–608.
- Baum, L. E., Petrie, T., Soules, G. and Weiss, N. A. (1970). A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Annals of Mathematical Statistics 41, 164–171.
- Beran, R. (1981). Nonparametric regression with randomly censored survival data. Technical Report. University of California, Berkeley.
- Boice, J. D. Jr, Mandel, J. S., Doody, M. M., Yoder, R. C. and McGowan, R. (1992). A health survey of radiologic technologists. Cancer 69, 586–598.
- Chen, Y. (2000). A robust imputation method for surrogate outcome data. Biometrika 87, 711–716.
- Edwards, J. K., Cole, S. R., Troester, M. A. and Richardson, D. B. (2013). Accounting for misclassified outcomes in binary regression models using multiple imputation with internal validation data. American Journal of Epidemiology 177, 904–912.
- Giganti, M. I., Shaw, P. A., Bebawy, S. S., Turner, M. M., Sterling, T. R. and Shepherd, B. E. (2020). Accounting for dependent errors in predictors and time-to-event outcomes using electronic health records, validation samples, and multiple imputation. Annals of Applied Statistics 14, 1045–1061.
- Goodman, L. A. (1961). Statistical methods for the mover-stayer model. Journal of the American Statistical Association 56, 841–868.
- Liu, D., Linet, M. S., Albert, P. S., Landgren, A. M., Kitahara, C. M., Iwan, A., Clerkin, C., Kohler, B., Alexander, B. H. and Penberthy, L. (2022). Cancer incidence ascertainment by U.S. population-based cancer registries versus self-report and death certificates in the nationwide U.S. radiologic technologists cohort. American Journal of Epidemiology 191, 2075–2083.
- McIsaac, M. and Cook, R. J. (2017). Statistical methods for incomplete data: some results on model misspecification. Statistical Methods in Medical Research 26, 248–267.
- Meinhold, C. L., Ron, E., Schonfeld, S. J., Alexander, B. H., Freedman, D. M., Linet, M. S. and de Gonzalez, A. B. (2010). Nonradiation risk factors for thyroid cancer in the US radiologic technologists study. American Journal of Epidemiology 171, 242–252.
- Oh, E. J., Shepherd, B. E., Lumley, T. and Shaw, P. A. (2021). Raking and regression calibration: methods to address bias from correlated covariate and time-to-event error. Statistics in Medicine 40, 631–649.
- Oh, E. J., Shepherd, B. E., Lumley, T. and Shaw, P. A. (2018). Considerations for analysis of time-to-event outcomes measured with error: bias and correction with SIMEX. Statistics in Medicine 37, 1276–1289.
- Pepe, M. S. (1992). Inference using surrogate outcome data and validation sample. Biometrika 79, 355–365.
- Shepherd, B., Shaw, P. A. and Dodd, L. E. (2012). Using audit information to adjust parameter estimates for data errors in clinical trials. Clinical Trials 9, 721–729.
- Sigurdson, A. J., Doody, M. M., Rao, R. S. and others. (2003). Cancer incidence in the U.S. radiologic technologists health study, 1983-1998. Cancer 97, 3080–3089.