Abstract
In a prospective cohort study, examining all participants for incidence of the condition of interest may be prohibitively expensive. For example, the “gold standard” for diagnosing temporomandibular disorder (TMD) is a physical examination by a trained clinician. In large studies, examining all participants in this manner is infeasible. Instead, it is common to use questionnaires to screen for incidence of TMD and perform the “gold standard” examination only on participants who screen positively. Unfortunately, some participants may leave the study before receiving the “gold standard” examination. Within the framework of survival analysis, this results in missing failure indicators. Motivated by the Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA) study, a large cohort study of TMD, we propose a method for parameter estimation in survival models with missing failure indicators. We estimate the probability of being an incident case for those lacking a “gold standard” examination using logistic regression. These estimated probabilities are used to generate multiple imputations of case status for each missing examination, which are combined with the observed data in appropriate regression models. The variance introduced by the procedure is estimated using multiple imputation. The method can be used to estimate both regression coefficients in Cox proportional hazards models and incidence rates using Poisson regression. We simulate data with missing failure indicators and show that our method performs as well as or better than competing methods. Finally, we apply the proposed method to data from the OPPERA study.
Keywords: Cox regression, missing data, multiple imputation, Poisson regression, survival analysis
1. Introduction
Time-to-event analyses are frequently conducted in medicine, actuarial science, and numerous other fields of applied science. There is a well-developed set of survival analysis methods implemented in standard software. Semi-parametric methods, such as the Cox proportional hazards model, allow robust estimation of the effects of covariates on the hazard function. However, these methods require the analyst to know the failure status of each participant, which may not always be available.
In some cases the outcome of interest may be difficult to ascertain. For example, in oncology studies, researchers may want to differentiate between deaths due to cancer and deaths due to car accidents or other unrelated causes. Investigators may easily record the mortality of all subjects, but it may be extremely difficult or costly to find out exactly why each subject died. One possible solution to this problem is delayed event adjudication [1]. This means that possible cases are not identified immediately but screened using simple methods that may have poor sensitivity or specificity. Later, the screened candidate cases are re-examined using a more precise, but also more costly and time-consuming, method to determine the true event status.
The study that motivates our work is Orofacial Pain: Prospective Evaluation and Risk Assessment (OPPERA), a prospective cohort study to identify risk factors for the onset of temporomandibular disorders (TMD). Each (initially TMD-free) OPPERA study participant was followed for a median of 2.8 years to identify cases of first-onset TMD. However, it was impractical to perform a physical examination on every participant. It would also have been inefficient given that most study participants did not develop the condition. Instead, this “gold standard” examination was performed only on participants with positive screens on a quarterly screening questionnaire that was designed to assess recent orofacial pain [2]. However, some participants with positive screens were lost to follow-up before receiving the “gold standard” examination. Thus a time-to-event analysis would have some participants with missing failure indicators.
Previous research indicates that when a subset of the failure indicators is missing, more accurate estimates of the parameters of interest can be obtained by using appropriate tools to estimate the missing values [1, 3, 4]. Cook and Kosorok [1] estimate parameters in Cox proportional hazards models with missing failure indicators by weighting observations according to their probability of being a true case. They show that the estimators are consistent and asymptotically normally distributed. However, the standard error of their estimator cannot easily be obtained with existing software without bootstrapping. For the OPPERA data, a separate Cox model was fitted for each putative risk factor of interest, including approximately three thousand genetic markers. Consequently, applying this method, with bootstrap standard errors, to the OPPERA genetic data would be computationally intractable.
In the OPPERA study, whether a participant with a positive screen was examined was weakly associated with demographic variables such as gender, race, and socioeconomic status [2]. This indicates that the failure indicators in the OPPERA study were not missing completely at random (MCAR). Applying models that assume MCAR failure indicators may therefore produce biased estimates of hazard ratios for covariates of interest. More importantly, a participant’s responses to the screening questions are predictive of whether or not the participant is an incident case of TMD. This setting presents statistical challenges that require care to avoid bias and maintain efficiency. Additionally, incidence rate estimates are desired, and none of the currently available methods allows estimation of incidence rates. There is a clear need for new methodology to answer the research questions of the OPPERA study.
In this paper, we propose a method for parameter and variance estimation in Cox regression models with missing failure indicators. The motivating data set is introduced in Section 2. We describe our method in Section 3. In Section 4, we report the results of simulations. Finally, in Section 5 we apply our method to the OPPERA study. We conclude with a discussion in Section 6.
2. Motivating Data Set: The OPPERA Study
OPPERA is a prospective cohort study designed to identify risk factors for first-onset TMD. A total of 3,263 initially TMD-free subjects were recruited at four study sites between 2006 and 2008. TMD status was confirmed by physical examination of the jaw joints and muscles using the Research Diagnostic Criteria for TMD [5], which is the gold standard for diagnosing TMD.
Upon enrollment in the study, each OPPERA participant was evaluated for a wide variety of possible risk factors for TMD, including psychological distress, previous history of painful conditions, and sensitivity to experimental pain. For a brief overview of the risk factors of interest in the OPPERA study, see Section S1 in the Supporting Information. See Ohrbach et al. [6], Fillingim et al. [7], Greenspan et al. [8], Maixner et al. [9], and Smith et al. [10] for a complete description of the baseline measures that were collected in OPPERA.
After enrollment, each participant was asked to complete questionnaires to evaluate recent orofacial pain once every three months. These questionnaires (hereafter referred to as “screeners”) evaluated the frequency and severity of pain in the orofacial region during the previous three months. The purpose of the screener was to identify participants who were likely to have recently developed TMD. For a complete description of the screener, see Slade et al. [11]. Participants with a positive screen were asked to undergo a follow-up physical examination by a clinical expert to diagnose presence or absence of TMD.
Of the 3,263 subjects, 2,737 filled out at least 1 screener, and the remaining 521 did not fill out any screeners. The total number of screeners was 26,666. There were 717 positive screeners, 486 (about 68%) of which were followed by a clinical examination. As reported in Bair et al. [2], case classifications made by one examiner (hereafter, “Examiner #4”) were deemed unreliable because the examiner diagnosed a much higher percentage of individuals with TMD compared to other examiners. We therefore set all of Examiner #4’s physical examination findings to be missing and imputed them using the methods in this paper. This left 404 positive screeners (56%) resulting in valid clinical exams.
3. Model
3.1. Notation and Assumptions
Assume there are n independent participants. For each participant i (i = 1, …, n), let Ci and Ti denote the potential times until censoring and failure, respectively, and let Vi = min(Ti, Ci) and Δi = I(Ti ≤ Ci), where I(·) denotes the indicator function. Let Zi be a p × 1 vector of covariates measured at baseline, and let Xi be a q × 1 vector of covariates measured at the time of the putative event. We assume the hazard for participant i follows a Cox proportional hazards model
$$\lambda_i(t \mid Z_i) = \lambda_0(t)\exp(\beta' Z_i), \tag{1}$$
where λ0(t) is an unspecified baseline hazard function. Let ξi denote the indicator that Δi is observed. We observe (Vi, ξi) for i = 1, …, n and Δi when ξi = 1.
In the OPPERA study, Vi is the length of time for participant i between enrollment in the study and either of the following two events:
a screener which resulted in a diagnosis of incident TMD
the last-completed screener before loss-to-follow-up.
Note that participants with a positive screen do not fill out additional screeners until they are examined, so Vi will be the time until the positive screen for a participant who has a positive screen but is never examined. If participant i had a positive screen and was subsequently diagnosed with TMD, then Δi = 1. If participant i either had a negative screen on the last quarterly screener before loss to follow-up or had a positive screen and was diagnosed to be free of TMD, then Δi = 0. If participant i had a positive screen on the last screener but was not examined, then Δi is missing and ξi = 0. The putative risk factors for TMD that were assessed at enrollment are denoted by the vector Zi. Responses to the screener for participant i at time Vi are denoted by the vector Xi. For OPPERA, we also define Qi = 1 if participant i has a positive screen on their final screener and Qi = 0 otherwise.
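The coding rules above can be summarized in a short Python sketch that maps one participant's screener history to (Vi, Δi, ξi, Qi). The input format (a chronologically ordered list of (time, positive) pairs plus the result of any examination after the final screener) and the name derive_indicators are hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Tuple


def derive_indicators(
    screeners: List[Tuple[float, bool]],
    final_exam: Optional[bool],
) -> Tuple[float, Optional[int], int, int]:
    """Return (V, Delta, xi, Q) for one participant (hypothetical helper).

    screeners:  chronologically ordered (time since enrollment, positive?) pairs,
                ending with the screener that led to a TMD diagnosis or the
                last screener completed before loss to follow-up.
    final_exam: result of the examination after the final screener
                (True = TMD, False = no TMD, None = not examined).
    Delta is None when the failure indicator is missing (xi = 0).
    """
    if not screeners:
        raise ValueError("participant completed no screeners")
    v, final_positive = screeners[-1]
    q = 1 if final_positive else 0
    if not final_positive:
        # Negative final screen: no examination, treated as censored (Delta = 0).
        return v, 0, 1, q
    if final_exam is None:
        # Positive final screen but never examined: failure indicator missing.
        return v, None, 0, q
    # Positive final screen followed by an examination.
    return v, int(final_exam), 1, q
```

For example, a participant with a negative screen at 0.25 years and an unexamined positive screen at 0.5 years yields (0.5, None, 0, 1).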
We assume the failure indicators are missing at random (MAR) as follows:
$$P(\xi_i = 1 \mid \Delta_i, V_i, X_i, Z_i) = P(\xi_i = 1 \mid V_i, X_i, Z_i). \tag{2}$$
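In other words, the probability of having a missing failure indicator may depend on measured factors, but it does not depend on whether or not an event occurred. Equivalently, P(Δi = 1 | ξi = 0, Vi, Xi, Zi) = P(Δi = 1 | ξi = 1, Vi, Xi, Zi), so the event probability for participants with missing failure indicators can be estimated from participants whose failure indicators are observed. We describe how to estimate this probability in Section 3.2 and then show how to use the estimate to impute the missing failure indicators in Section 3.3.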
3.2. Estimating Event Probabilities
We model the probability that participant i with a missing failure indicator is a case by a logistic regression model based on Xi and Vi:
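$$\operatorname{logit} P(\Delta_i = 1 \mid Q_i = 1, X_i, V_i) = \alpha + \gamma' X_i + \eta V_i. \tag{3}$$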
That is, we estimate the probability of examiner-diagnosed TMD in a participant who had a positive screen but was not examined as intended. The model uses the time between enrollment and the participant's last positive screener, together with the participant's answers on that screener. The parameters are estimated from individuals who had a positive screen and were examined. Then, for individuals who had a positive screen on the last screener (i.e., those with Qi = 1) but were not examined, the probability of being a case is obtained by substituting these estimates into (3).
Note that this model also assumes that there is one observation per participant, which may not be the case in practice. For example, if some participants had a positive screen on more than one screener and were examined at least once, then we have multiple observations per participant. In that case, fitting a mixed-effects logistic regression model (a generalized linear mixed model) rather than a standard logistic regression model could account for correlation between responses from the same participant. However, only a small number of OPPERA participants were examined multiple times after positive screeners, so when analyzing the OPPERA data we simply discarded all but the most recent screener, thereby avoiding the problem of repeated observations.
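As a minimal sketch of this step, the Python code below fits a logistic model of the form (3) by ordinary maximum likelihood (Newton–Raphson) and then predicts case probabilities. This is a simplification. The analyses in this paper use a Bayesian logistic model with Cauchy priors (Section 3.3), which the sketch does not reproduce, and all function names are illustrative.

```python
import math
from typing import List, Sequence


def expit(u: float) -> float:
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_logistic(rows: Sequence[Sequence[float]], y: Sequence[int],
                 iters: int = 25, tol: float = 1e-10) -> List[float]:
    """Maximum-likelihood logistic regression by Newton-Raphson (illustrative).

    Each row is a design vector such as (1, X_i..., V_i), so the returned
    coefficients play the roles of (alpha, gamma..., eta) in (3).
    """
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p                       # score X'(y - mu)
        info = [[0.0] * p for _ in range(p)]   # information X'WX
        for x, yi in zip(rows, y):
            mu = expit(sum(b * xk for b, xk in zip(beta, x)))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * x[j]
                for k in range(p):
                    info[j][k] += w * x[j] * x[k]
        step = solve(info, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta


def predict(beta: Sequence[float], x: Sequence[float]) -> float:
    """Predicted case probability for one design vector."""
    return expit(sum(b * xk for b, xk in zip(beta, x)))
```

In practice the coefficients would be fitted on the examined participants with Qi = 1, and predict would then be applied to the unexamined ones.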
3.3. Multiple Imputation
One popular method for handling missing data is multiple imputation. For a comprehensive review of multiple imputation, see Rubin [12]. Our imputation procedure is as follows:
1. Estimate the coefficients α, γ, and η in (3). We used a Bayesian logistic regression model in which α, γ, and η each had a Cauchy prior distribution with center 0 and scale 2.5.
2. Sample α, γ, and η from their posterior distribution, and use the draw to compute the probability that an event occurred for each observation with a missing failure indicator.
3. For each such observation, generate a Bernoulli random variable with success probability equal to the predicted probability found in step (2).
4. Combine the observed data and the imputed failure indicators from step (3) to form a completed data set.
5. Fit the Cox proportional hazards model to the completed data set.
6. Record the parameter estimate β̂j and its estimated covariance matrix Ûj, where j indexes the imputation.
7. Repeat steps (2)–(6) a total of m times, where m is the desired number of imputations.
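The imputation steps above can be sketched in Python as follows. The callback draw_coefficients is a hypothetical stand-in for sampling (α, γ, η) from the posterior of the logistic model. The sketch draws a fresh coefficient vector for every imputation, which is how proper multiple imputation carries uncertainty in the estimated probabilities into the between-imputation variance. Fitting the Cox model to each completed data set is left to standard software, such as the coxph function in the R survival package.

```python
import math
import random
from typing import Callable, List, Optional, Sequence


def expit(u: float) -> float:
    """Numerically stable logistic function."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


def impute_failure_indicators(
    delta: Sequence[Optional[int]],
    design: Sequence[Sequence[float]],
    draw_coefficients: Callable[[random.Random], Sequence[float]],
    m: int,
    rng: random.Random,
) -> List[List[int]]:
    """Return m completed copies of the failure indicators (illustrative).

    delta[i] is None when the indicator is missing; design[i] is the row
    (1, X_i..., V_i) used in the logistic model. draw_coefficients is a
    hypothetical callback returning one posterior draw of (alpha, gamma..., eta).
    """
    completed = []
    for _ in range(m):
        coef = draw_coefficients(rng)  # new parameter draw for this imputation
        copy = []
        for d, x in zip(delta, design):
            if d is not None:
                copy.append(d)  # observed indicators are kept unchanged
            else:
                p = expit(sum(c * xk for c, xk in zip(coef, x)))
                copy.append(1 if rng.random() < p else 0)  # Bernoulli(p) draw
        completed.append(copy)
    return completed
```

Each of the m returned lists would be paired with the observed times and covariates and passed to a Cox fit.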
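Next, we combine the m sets of estimates using Rubin's rules [12]. The average parameter estimate is
$$\bar\beta = \frac{1}{m}\sum_{j=1}^{m}\hat\beta_j, \tag{4}$$
the within-imputation variance estimate is
$$\bar U = \frac{1}{m}\sum_{j=1}^{m}\hat U_j, \tag{5}$$
and the between-imputation variance is
$$B = \frac{1}{m-1}\sum_{j=1}^{m}(\hat\beta_j - \bar\beta)(\hat\beta_j - \bar\beta)'. \tag{6}$$
Finally, the estimated covariance matrix is
$$T = \bar U + \left(1 + \frac{1}{m}\right)B. \tag{7}$$
For the kth component of β, it can be shown that $(\bar\beta_k - \beta_k)/\sqrt{T_{kk}}$ is approximately t distributed with degrees of freedom
$$\nu_k = (m - 1)\left[1 + \frac{\bar U_{kk}}{(1 + m^{-1})B_{kk}}\right]^2. \tag{8}$$
Equations (7) and (8) can be used to compute confidence intervals for the multiply imputed parameter estimate β̄.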
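For a scalar parameter, Rubin's combining rules reduce to a few lines of arithmetic. The Python sketch below implements them. It also returns Rubin's standard estimate of the fraction of missing information, (r + 2/(ν + 3))/(r + 1) with r = (1 + m⁻¹)B/Ū. Whether this matches the exact definition behind the rates of missing information reported with Table 1 is an assumption. A t critical value for the confidence interval would come from a statistics library (for example, scipy.stats.t.ppf), which is not used here.

```python
import math
from statistics import mean
from typing import Sequence, Tuple


def rubin_combine(estimates: Sequence[float], variances: Sequence[float]
                  ) -> Tuple[float, float, float, float]:
    """Combine m scalar estimates with Rubin's rules (illustrative helper).

    Returns (beta_bar, total variance T, degrees of freedom nu,
    fraction of missing information).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need m >= 2 estimates with matching variances")
    beta_bar = mean(estimates)                                   # average estimate
    u_bar = mean(variances)                                      # within-imputation
    b = sum((e - beta_bar) ** 2 for e in estimates) / (m - 1)    # between-imputation
    t = u_bar + (1 + 1 / m) * b                                  # total variance
    if b == 0:
        return beta_bar, t, math.inf, 0.0
    r = (1 + 1 / m) * b / u_bar  # relative increase in variance due to missingness
    nu = (m - 1) * (1 + 1 / r) ** 2                              # degrees of freedom
    fmi = (r + 2 / (nu + 3)) / (r + 1)                           # Rubin's FMI estimate
    return beta_bar, t, nu, fmi
```

For instance, estimates (1.0, 1.2, 0.8) with variances (0.04, 0.05, 0.03) give β̄ = 1.0, T ≈ 0.0933, and ν = 6.125.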
3.4. Estimation of Incidence
Previous sections of this paper described how to estimate hazard ratios in the presence of missing failure indicators. It may also be of interest to estimate incidence rates for the same event using Poisson regression instead of Cox regression. For example, one of the aims of the OPPERA study is to estimate the incidence rate of first-onset TMD.
In order to estimate incidence rates, we estimate the case probabilities as described previously based on participants who had a positive screen and were examined. Then we impute case status as described in Section 3.3 for those who had a positive screen but were not examined. However, in this case we fit Poisson regression models, rather than Cox models, to the completed data sets. Finally, we calculate the incidence rate based on the estimates of the regression coefficients in the Poisson model. Specifically, we use the data from imputation j to fit the model
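$$\log E(\Delta_{ij} \mid X_i, Z_i, V_i) = \log V_i + \mu_j + \tau_j' X_i + \lambda_j' Z_i, \tag{9}$$
where Δij denotes the jth imputation for observation i, j = 1, …, m, and log Vi is an offset for the follow-up time. We combine the m imputations as in equation (4):
$$\bar\mu = \frac{1}{m}\sum_{j=1}^{m}\hat\mu_j, \qquad \bar\tau = \frac{1}{m}\sum_{j=1}^{m}\hat\tau_j, \qquad \bar\lambda = \frac{1}{m}\sum_{j=1}^{m}\hat\lambda_j. \tag{10}$$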
The estimated incidence rate for an individual with covariates X* and Z* is given by exp(μ̄ + τ̄X* + λ̄Z*). The variability of μ̄, τ̄, and λ̄ may be estimated using (7), and confidence intervals may be computed based on the t distribution using (8), as described previously.
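In the simplest case of an intercept-only model with offset log Vi, the Poisson maximum likelihood estimate has a closed form, μ̂j = log(Σi Δij / Σi Vi). The pooled incidence rate exp(μ̄) is then the geometric mean of the per-imputation crude rates. The Python sketch below illustrates only this special case and assumes that log Vi enters as an offset. Models with covariates need a Poisson GLM fitter.

```python
import math
from statistics import mean
from typing import Sequence


def pooled_incidence_rate(person_time: Sequence[float],
                          imputed_deltas: Sequence[Sequence[int]]) -> float:
    """Intercept-only Poisson incidence rate pooled over imputations (illustrative).

    With log E(Delta_ij) = log(V_i) + mu, the MLE in imputation j is
    mu_hat_j = log(sum_i Delta_ij / sum_i V_i); the pooled rate is exp(mu_bar).
    """
    total_time = sum(person_time)
    mu_hats = []
    for deltas in imputed_deltas:
        events = sum(deltas)
        if events == 0:
            raise ValueError("an imputation has no events; log-rate undefined")
        mu_hats.append(math.log(events / total_time))  # per-imputation log-rate
    return math.exp(mean(mu_hats))                      # average on the log scale
```

In this special case the estimated variance of μ̂j is 1/Σi Δij (the inverse Fisher information at the MLE), which could be combined across imputations using (5)–(8).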
4. Simulations
Data with missing failure indicators were simulated, and several possible methods were compared with respect to bias, coverage, and confidence interval width. In each simulated data set, survival times for 1,000 individuals were generated with exponentially distributed failure times under a proportional hazards model with covariates, as proposed by Bender et al. [13]. That is, the survival time for each individual was distributed according to (1) with baseline hazard λ0(t) = 1. For our simulations, Zi was a single baseline covariate following a normal distribution with mean 2 and unit variance. In other words, conditional on Zi, the failure times Ti followed an exponential distribution with hazard exp(βZi), where β ∈ {−0.5, −1.5, −3}. The censoring times Ci followed an exponential distribution with mean 5 (corresponding to a hazard of 1/5 = exp(−log 5) ≈ exp(−1.61)). This yielded about 35%, 75%, and 90% censoring for β = −0.5, β = −1.5, and β = −3, respectively. We also defined Δi = I(Ti ≤ Ci). If Δi = 0, the follow-up period ended before the participant developed TMD, so the observation was censored at time Ci.
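A minimal Python sketch of this data-generating mechanism follows. It uses inverse-transform sampling as in Bender et al. [13]: with baseline hazard 1, T = −log U / exp(βZ) for U uniform on (0, 1]. The function name is illustrative, and only the standard-library random module is used.

```python
import math
import random
from typing import List, Tuple


def simulate_survival(n: int, beta: float, rng: random.Random
                      ) -> List[Tuple[float, float, int]]:
    """Return (Z_i, V_i, Delta_i) for n participants (illustrative helper).

    Z_i ~ N(2, 1); T_i | Z_i ~ Exponential(rate exp(beta * Z_i)), i.e. baseline
    hazard 1; C_i ~ Exponential(mean 5); V_i = min(T_i, C_i), Delta_i = I(T_i <= C_i).
    """
    out = []
    for _ in range(n):
        z = rng.gauss(2.0, 1.0)
        # Inverse transform: T = -log(U) / exp(beta * z), with U = 1 - random() in (0, 1]
        t = -math.log(1.0 - rng.random()) / math.exp(beta * z)
        c = rng.expovariate(1 / 5)  # expovariate takes a rate; mean 5 -> rate 0.2
        out.append((z, min(t, c), int(t <= c)))
    return out
```

With a large n and β = −0.5, the censored fraction should be close to the 35% quoted above.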
Covariates are represented by Zi, a risk factor for TMD measured at enrollment, and Xi, a measurement collected on the last screener. For each observation, a normally distributed covariate Xi1 was generated with mean Δi and standard deviation 0.3, and Xi = Xi1. In OPPERA, Xi represents a question on the screener evaluating some symptom of first-onset TMD, such as the frequency of jaw pain. This covariate was used to generate Qi = I(Xi1 > 0.5), an indicator of whether participant i screened positive on their last screener. Note that Xi depends on Δi, since participants who developed first-onset TMD are more likely to report symptoms on their screener, and Qi depends on Xi, since the screener is positive if enough symptoms are reported. Also, ξi = I(Δi is observed) corresponds to the indicator of whether participant i came in for their clinical exam if Qi = 1. In all simulations, δi was used as the failure indicator rather than Δi, where δi is defined as
$$\delta_i = \begin{cases} \Delta_i, & Q_i = 1, \\ 0, & Q_i = 0. \end{cases}$$
In other words, we set the failure indicator δi = 0 if the final screener was negative. This decision was made to reflect the fact that OPPERA participants who had a negative screen were not examined. Hence it is possible that some participants developed first-onset TMD but were never examined due to their final screener being negative. Thus, the simulations (incorrectly) treat these observations as censored.
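The screener covariate and the recorded failure indicator can be generated from the true indicators in a few lines. In this Python sketch, Xi1 ~ N(Δi, 0.3²), Qi = I(Xi1 > 0.5), and δi = QiΔi. The function name is illustrative.

```python
import random
from typing import List, Tuple


def add_screener(deltas: List[int], rng: random.Random
                 ) -> List[Tuple[float, int, int]]:
    """Return (X_i1, Q_i, delta_i) for each true indicator Delta_i (illustrative).

    X_i1 ~ N(Delta_i, 0.3^2), Q_i = I(X_i1 > 0.5), delta_i = Q_i * Delta_i.
    """
    out = []
    for d in deltas:
        x1 = rng.gauss(float(d), 0.3)  # gauss takes the standard deviation
        q = int(x1 > 0.5)
        out.append((x1, q, q * d))     # negative final screen -> recorded as censored
    return out
```

Under this mechanism the screener flags a true case with probability Φ(0.5/0.3) ≈ 0.95, so roughly 5% of true events are recorded as censored.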
We created missing failure indicators under the following classical missing data mechanisms of Rubin [14]:
The probability of having a missing failure indicator is independent of the data. This is known as missing completely at random (MCAR).
The probability of having a missing failure indicator depends on an observed covariate. This is known as missing at random (MAR).
The probability of having a missing failure indicator depends on the (potentially unobserved) failure indicator. This is known as missing not at random (MNAR).
Our method assumes that the data are MAR, which includes MCAR as a special case. Our simulations under MAR and MNAR parallel the study protocol in that failure indicators can only be missing for those with positive screeners. In other words, observations were potentially missing if and only if Qi = 1. (Individuals with negative screeners have Qi = 0 and are assumed to be censored. Those with positive screeners have Qi = 1 and may have missing clinical examinations.) Details and results for MCAR and MNAR data are shown in Sections S2.2 and S2.4 in the Supporting Information. We also considered several simulation scenarios where the logistic regression model for predicting the failure indicator was misspecified; see Section S2.3 in the Supporting Information. For MAR data, we set failure indicators to be missing with probability
(11) |
This resulted in approximately 50% of failure indicators being set to missing, which is consistent with the rate of missing failure indicators in the OPPERA study.
In each simulated data set, all observations that had a positive screen and an observed failure indicator were used to fit a logistic regression model for case status with covariates Zi, Xi, and Vi. That is, using the complete data (i.e., observations with Qi = 1 and ξi = 1), we fit the logistic regression model for the event probability conditional on Zi, Xi, and Vi, namely
$$\operatorname{logit} P(\Delta_i = 1 \mid Q_i = 1, Z_i, X_i, V_i) = \alpha + \gamma_1 Z_i + \gamma_2 X_i + \eta V_i, \tag{12}$$
where γ = (γ1, γ2)′. The estimated probabilities were then calculated for individuals with Qi = 1 and ξi = 0, with α, γ, and η drawn from their posterior distribution.
To evaluate the performance of our method, multiple imputation was used to compute 10 imputed estimates of β for each simulated data set, as described in Section 3.3. For each observation i with Qi = 1 and ξi = 0, we imputed the failure indicator Δ̂ij independently for each imputation j.
A Cox proportional hazards model was fit for each imputed data set, and the imputed estimates of the regression coefficient and their variances were recorded. These were aggregated using equations (4) and (7) to create confidence intervals for the multiple imputation estimates.
The performance of our method was compared with that of the method of Cook and Kosorok [1]. To obtain the Cook and Kosorok [1] estimates for each simulated data set, we estimated the probability p̂i that the (potentially unobserved) event for participant i is a true event, as described previously. We then fit a weighted Cox proportional hazards model with weights constructed as follows. Each observation with a missing failure indicator was deleted and replaced with two new observations that had the same failure time and covariates but different failure indicators and weights: the first had Δ̂i = 1 and weight p̂i, and the second had Δ̂i = 0 and weight 1 − p̂i. Participants with fully observed data retained a single observation with unit weight. The estimated regression coefficient β̂ was recorded.
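The data expansion used for the Cook and Kosorok [1] comparison can be sketched in Python as follows. Covariates are omitted for brevity (they would simply be copied into both rows), and the function name is hypothetical. The weighted Cox fit itself would be done with standard software, for example coxph with its weights argument in the R survival package.

```python
from typing import List, Optional, Sequence, Tuple

Row = Tuple[float, int, float]  # (V_i, failure indicator, weight)


def cook_kosorok_expand(v: Sequence[float], delta: Sequence[Optional[int]],
                        p_hat: Sequence[Optional[float]]) -> List[Row]:
    """Build the weighted data set for a weighted Cox fit (illustrative helper).

    Observed indicators keep one row with weight 1. A missing indicator
    (delta[i] is None) becomes two rows sharing V_i: one with indicator 1 and
    weight p_hat[i], and one with indicator 0 and weight 1 - p_hat[i].
    """
    rows: List[Row] = []
    for vi, di, pi in zip(v, delta, p_hat):
        if di is not None:
            rows.append((vi, di, 1.0))
        else:
            if pi is None:
                raise ValueError("a missing indicator needs an estimated probability")
            rows.append((vi, 1, pi))        # "event" copy
            rows.append((vi, 0, 1.0 - pi))  # "censored" copy
    return rows
```

The two copies of each incomplete observation have weights that sum to 1, so every participant contributes a total weight of one.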
The variance of this estimate was estimated by generating 1,000 bootstrap replicates of each simulated data set and refitting the model for each bootstrap replicate. Each bootstrap replicate consisted of 1,000 subjects sampled from the data with replacement. For each bootstrap replicate, the probability p̂i* that participant i is a true failure was re-estimated, and these p̂i* were used to compute a bootstrap estimate β̂* of β using a weighted Cox model as described in the previous paragraph. The average of the bootstrap estimates and the percentile confidence interval (β̂*0.025, β̂*0.975) were recorded, where β̂*θ is the θth quantile of the 1,000 bootstrap estimates.
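A generic percentile bootstrap of this kind can be sketched in Python. The statistic callback stands in for the full re-estimation of p̂i* and the weighted Cox refit. The quantile rule here is a simple order-statistic choice and may differ slightly from the interpolation used by boot.ci in the R boot package. Names are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def percentile_bootstrap(data: Sequence[T],
                         statistic: Callable[[List[T]], float],
                         n_boot: int, rng: random.Random,
                         level: float = 0.95) -> Tuple[float, float, float]:
    """Return (mean of bootstrap estimates, lower, upper) percentile limits.

    Each replicate resamples len(data) subjects with replacement and recomputes
    the statistic (illustrative; the statistic is a user-supplied callback).
    """
    reps = sorted(statistic(rng.choices(data, k=len(data))) for _ in range(n_boot))
    alpha = (1 - level) / 2
    lo = reps[int(alpha * (n_boot - 1))]        # lower order statistic
    hi = reps[int((1 - alpha) * (n_boot - 1))]  # upper order statistic
    return sum(reps) / n_boot, lo, hi
```

With n_boot = 1,000 and level 0.95, this picks the 25th and 975th smallest bootstrap estimates.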
We also compared our method with the ideal situation in which the true values of Δi were observed for all observations (Δi was used instead of δi in this case), with complete case analysis (excluding all observations with missing failure indicators from the data set), and with two ad hoc methods that treat the missing indicators either all as censored or all as failures. Results under the assumption of MAR are shown in Table 1. We estimated the bias of each method by calculating the mean difference between the estimated Cox regression coefficient and the true coefficient over the 1,000 simulations. We also calculated the mean width of the confidence intervals produced by each method over the 1,000 simulations. Similarly, we calculated the empirical coverage probability for the confidence intervals produced by each method by dividing the number of times that the confidence intervals contained the true value of the parameter by 1,000. We also report the Monte Carlo error for the coverage rate, which is the error in the empirical coverage probability due to conducting only a finite number of simulations (approximately √(0.95 × 0.05/n) for n simulations). Finally, the rate of missing information and the average running time of each method were computed.
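The coverage summaries reduce to simple arithmetic. In the Python sketch below, the Monte Carlo error is √(0.95 × 0.05/n). For n = 1,000 this gives about 0.0069, consistent with the value 0.007 reported with Table 1. The function name is illustrative.

```python
import math
from typing import Sequence, Tuple


def coverage_and_mc_error(intervals: Sequence[Tuple[float, float]],
                          truth: float, nominal: float = 0.95
                          ) -> Tuple[float, float]:
    """Empirical coverage and its Monte Carlo error (illustrative helper).

    Coverage is the fraction of intervals containing the true value; the Monte
    Carlo error is sqrt(nominal * (1 - nominal) / n) for n simulated intervals.
    """
    n = len(intervals)
    covered = sum(1 for lo, hi in intervals if lo <= truth <= hi)
    return covered / n, math.sqrt(nominal * (1 - nominal) / n)
```

For example, 950 of 1,000 intervals covering the truth gives coverage 0.95 with Monte Carlo error of about 0.0069.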
Table 1.
Simulation Results for MAR
| β* | Method | Bias | SE (Bias) | Width | SE (Width) | Coverage† | Running Time (s) |
|---|---|---|---|---|---|---|---|
| −0.5 | Full Data | −0.0008 | 0.0005 | 0.1666 | 0.0004 | 0.962 | 0.008 |
| | Complete Case | 0.0033 | 0.0007 | 0.2152 | 0.0004 | 0.955 | 0.007 |
| | Treat all as Censored | 0.1058 | 0.0007 | 0.2127 | 0.0004 | 0.514 | 0.007 |
| | Treat all as Failures | 0.0018 | 0.0005 | 0.1699 | 0.0004 | 0.964 | 0.008 |
| | Cook & Kosorok | −0.0009 | 0.0005 | 0.1728 | 0.0004 | 0.959 | 22.0 |
| | Multiple Imputation | −0.0003 | 0.0005 | 0.1721 | 0.0004 | 0.961 | 0.49 |
| −1.5 | Full Data | 0.0047 | 0.0011 | 0.3176 | 0.0002 | 0.938 | 0.008 |
| | Complete Case | −0.0558 | 0.0015 | 0.4317 | 0.0003 | 0.927 | 0.007 |
| | Treat all as Censored | 0.1241 | 0.0014 | 0.421 | 0.0003 | 0.767 | 0.007 |
| | Treat all as Failures | 0.0716 | 0.0011 | 0.3154 | 0.0002 | 0.841 | 0.007 |
| | Cook & Kosorok | 0.0052 | 0.0011 | 0.3399 | 0.0003 | 0.942 | 17.50 |
| | Multiple Imputation | 0.0082 | 0.0011 | 0.3353 | 0.0002 | 0.942 | 0.40 |
| −3 | Full Data | −0.0294 | 0.0025 | 0.7606 | 0.0009 | 0.945 | 0.007 |
| | Complete Case | −0.2044 | 0.0036 | 1.0855 | 0.0017 | 0.918 | 0.008 |
| | Treat all as Censored | 0.0988 | 0.0034 | 1.0413 | 0.0015 | 0.92 | 0.008 |
| | Treat all as Failures | 0.5914 | 0.0025 | 0.6293 | 0.0006 | 0.085 | 0.008 |
| | Cook & Kosorok | −0.0302 | 0.0029 | 0.9078 | 0.0017 | 0.94 | 17.33 |
| | Multiple Imputation | −0.0042 | 0.0028 | 0.8556 | 0.0014 | 0.947 | 0.43 |
The rate of missing information is 0.017 when β = −0.5, 0.061 when β = −1.5, and 0.178 when β = −3.
The Monte Carlo error is 0.007.
All calculations were performed using R version 3.0.2 running on a single core of a Dell C6100 server with a 2.93 GHz Intel processor. The function “mi.binary” in the “mi” R package was used to generate the imputed values of the missing failure indicators. The functions “boot” and “boot.ci” in the “boot” R package were used to calculate the bootstrap estimates of the standard error of the Cook and Kosorok [1] method. The Cox proportional hazards models were fit using the “coxph” function in the “survival” R package. The code used to perform the simulations (and analyze the OPPERA data) is available in the Supporting Information.
The empirical coverage probability of the confidence intervals produced by multiple imputation is close to the nominal level (0.95) in all simulations. Our multiple imputation method and the method of Cook and Kosorok [1] produced approximately unbiased estimates and valid confidence intervals in all the scenarios we considered. The complete case and ad hoc methods generally showed more bias and did not always achieve the desired coverage level. In each scenario, our multiple imputation method also yielded narrower confidence intervals than the method of Cook and Kosorok [1]. Although the differences in width were small, this suggests that our proposed method may have slightly greater power to detect true associations, particularly when the absolute value of β is large. Our proposed method also tended to have lower bias than the method of Cook and Kosorok [1] when the absolute value of β is large. The running time of our proposed method was also substantially less than that of the Cook and Kosorok [1] method (on average 0.40–0.49 seconds versus 17.33–22.0 seconds). Moreover, for most parameter values, the coverage probabilities for the complete case and ad hoc methods were significantly different (p < 0.01) from the nominal rate.
In addition, we examined the performance of our proposed methods when we changed the logistic regression model for Δi. We investigated two additional types of models: one in which the model contained a variable unrelated to case status and another in which the model omitted one variable related to case status. As in the previous simulations, the failure times were generated by (1), censoring was exponential with mean 5, failure indicators were set to be missing completely at random or missing at random with probability given in equation (11), Zi ~ N(2, 1), Xi1 ~ N(Δi, 0.3²), and Qi = I(Xi1 > 0.5) for i = 1, …, n. We also generated Xi2 ~ N(0, 1), where Zi, Xi1, and Xi2 were mutually independent and Xi2 was independent of Δi and Qi.
In the previous simulations, we fit model (12) with covariates Zi and Xi = Xi1. The additional simulations instead used the following covariates and parameters:
X̃i = {1, Xi1, Xi2}
X̃i = 0.
That is, rather than fitting model (12) to the data, we modeled the case probability with
(13) |
The results, which are shown in Section S2.3 in the Supporting Information, remained similar under both alternative models. This indicates that the proposed methods are robust to misspecification of the logistic regression model in some situations. Most notably, leaving out one covariate that was weakly related to case status did not markedly decrease the performance of the method.
We also performed some simulations where a random subset of the observations with Qi = 0 were set to have missing failure indicators. The model to predict Δi was fitted using only the observations for which Qi = 1, but the model was applied to all observations with missing failure indicators (including observations where Qi = 0). The results are shown in Section S2.3 in the Supporting Information. In this case our method (as well as the Cook and Kosorok [1] method) produced reasonable results when the logistic regression model was specified correctly or when an extra covariate was included in the model. However, both methods performed poorly when an important covariate was missing from the logistic regression model.
Finally, we conducted simulations to evaluate the method’s ability to estimate incidence rates. A similar multiple imputation strategy was applied to Poisson regression. Our method produced estimates much closer to the true incidence rates than the complete case estimate. In fact, the complete case method underestimated incidence rates by as much as a factor of 3. See Section S2.5 in the Supporting Information for details.
5. Analysis of the OPPERA Study
In this section, we apply our method to estimate hazard ratios and incidence rates in the OPPERA study using m = 10 imputations.
5.1. Hazard Ratios
We applied our method to the OPPERA cohort to account for participants with missing clinical examinations. (Note that examinations for participants evaluated by Examiner #4 were also treated as missing.) First, we estimated the probability that a participant would be diagnosed as an incident case of TMD given a positive screener. Because each screener collected a large amount of information, we carefully selected a small number of predictor variables. Specifically, we fit a logistic regression model to predict the result of the clinical exam from each item in the screener. As described previously, the regression coefficients were assumed to have a Cauchy prior distribution with center 0 and scale 2.5. All models were adjusted for study site.
The majority of the variables measured on the screener were not associated with the result of the clinical examination. The strongest predictor of a TMD diagnosis was a count of non-specific orofacial symptoms (e.g., stiffness, fatigue) in the previous three months. The time elapsed since enrollment and OPPERA study site were also important covariates, as shown in Bair et al. [2]. Several other possible predictors of a TMD diagnosis were identified, but they were not included because adding them did not improve the predictive accuracy of the model. (In general, omitting a relevant predictor when performing multiple imputation produces greater error than including an irrelevant one, as evidenced by our simulations, so it is usually better to err on the side of including too many predictors rather than too few. In this case, however, our testing indicated that including additional variables did not improve the predictive accuracy of the model and might actually decrease it, so we favored the more parsimonious model.)
Thus, we estimated the probability of being diagnosed with TMD based on the count of non-specific orofacial symptoms, time since enrollment, and OPPERA study site. This model was used to perform multiple imputation for those with no clinical examination. These imputed data sets were used to fit a series of Cox proportional hazards models to estimate the hazard ratio (and associated confidence interval and p-value) for each predictor using the methods described in Section 3.3. Examples of predictors include perceived stress, history of comorbid chronic pain conditions, and smoking status.
In addition, Bair et al. [2] examined univariate relationships between examination attendance and numerous possible predictor variables. Differences between examined and non-examined participants were small and most were not statistically significant. However, a few of the differences were statistically significant, indicating that the data were not MCAR, since MCAR requires that the probability of a missing observation does not depend on the data.
Table 2 shows the results of applying our method to a subset of the putative risk factors for TMD measured in OPPERA. Because OPPERA measured a large number of putative risk factors, we report results only for a selected subset of the variables. All continuous variables were standardized to have mean 0 and standard deviation 1 prior to fitting the Cox models, so the hazard ratio for each continuous variable corresponds to a one-standard-deviation increase in that predictor. In Table 2, all the quantitative sensory testing and psychosocial variables were continuous, while all of the clinical variables were dichotomous (and hence were not standardized). The small number of missing values in these predictor variables was (singly) imputed using the EM algorithm; see Greenspan et al. [8] or Fillingim et al. [7] for details. For a more detailed description of the OPPERA domains, see Section S1 in the Supporting Information, Maixner et al. [15], and Slade et al. [16].
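Because the Cox partial likelihood is unchanged by shifting a covariate, and rescaling it by 1/SD multiplies its coefficient by the SD, the hazard ratio per standard deviation equals exp(β × SD), where β is the coefficient on the raw scale. The sketch below is a minimal pure-Python illustration on hypothetical toy data (not OPPERA data): it fits a one-covariate Cox model by Newton-Raphson, using the Breslow approximation for tied event times, and checks this relationship. The helper names `standardize` and `fit_cox_1d` are hypothetical; real analyses would use a standard package such as SAS PROC PHREG.

```python
import math


def standardize(values):
    """Return z-scores (mean 0, SD 1) and the sample standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sd = math.sqrt(sum((v - mu) ** 2 for v in values) / (n - 1))
    return [(v - mu) / sd for v in values], sd


def fit_cox_1d(time, event, x, max_iter=50, tol=1e-12):
    """One-covariate Cox model via Newton-Raphson on the partial likelihood.

    Ties use the Breslow approximation: each event contributes a term whose
    risk set is everyone whose time is >= that event's time.
    """
    n = len(time)
    beta = 0.0
    for _ in range(max_iter):
        score = info = 0.0
        for i in range(n):
            if not event[i]:
                continue
            risk = [j for j in range(n) if time[j] >= time[i]]
            w = [math.exp(beta * x[j]) for j in risk]
            s0 = sum(w)
            s1 = sum(wj * x[j] for wj, j in zip(w, risk)) / s0
            s2 = sum(wj * x[j] ** 2 for wj, j in zip(w, risk)) / s0
            score += x[i] - s1        # observed minus risk-set-weighted mean
            info += s2 - s1 ** 2      # risk-set-weighted variance
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Hypothetical toy data (not OPPERA data).
time = [2, 3, 5, 5, 7, 8, 10, 12]
event = [1, 1, 0, 1, 1, 0, 1, 0]
raw = [3.1, 2.0, 4.5, 1.2, 2.8, 0.5, 1.9, 0.7]

z, sd = standardize(raw)
beta_raw = fit_cox_1d(time, event, raw)
beta_std = fit_cox_1d(time, event, z)
hr_per_sd = math.exp(beta_std)   # equals exp(beta_raw * sd)
```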
Table 2.
Results from the OPPERA Study
Columns labeled Censored treat all MCIs as censored; columns labeled MI use multiple imputation.
Variable | Censored HR | Censored LCL | Censored UCL | Censored P | MI HR | MI LCL | MI UCL | MI P |
---|---|---|---|---|---|---|---|---|
Clinical Variable | ||||||||
In the last month could not open mouth wide | 3.26 | 1.83 | 5.84 | <0.0001 | 2.35 | 1.39 | 3.96 | 0.0015 |
Has two or more comorbid chronic pain disorders | 3.08 | 2.26 | 4.21 | <0.0001 | 2.36 | 1.79 | 3.11 | <0.0001 |
History of 5 respiratory conditions | 1.38 | 1.01 | 1.87 | 0.0408 | 1.44 | 1.13 | 1.85 | 0.0040 |
Smoking: current | 1.26 | 0.86 | 1.84 | 0.2403 | 1.48 | 1.07 | 2.04 | 0.0166 |
Smoking: former | 1.87 | 1.22 | 2.87 | 0.0041 | 1.70 | 1.18 | 2.46 | 0.0045 |
One or more palpation tender points: right temporalis | 1.83 | 1.32 | 2.52 | 0.0002 | 1.54 | 1.18 | 2.02 | 0.0018 |
One or more palpation tender points: left temporalis | 1.60 | 1.14 | 2.25 | 0.0064 | 1.50 | 1.13 | 1.98 | 0.0045 |
One or more palpation tender points: right masseter | 1.85 | 1.35 | 2.53 | 0.0001 | 1.69 | 1.31 | 2.17 | <0.0001 |
One or more palpation tender points: left masseter | 1.70 | 1.23 | 2.35 | 0.0013 | 1.50 | 1.15 | 1.97 | 0.0031 |
Quantitative Sensory Testing Variable | ||||||||
Pressure pain threshold: temporalis | 1.26 | 1.07 | 1.49 | 0.0065 | 1.14 | 1.00 | 1.31 | 0.0466 |
Pressure pain threshold: masseter | 1.23 | 1.04 | 1.45 | 0.0170 | 1.14 | 0.99 | 1.31 | 0.0674 |
Pressure pain threshold: TM joint | 1.25 | 1.05 | 1.48 | 0.0106 | 1.15 | 1.01 | 1.32 | 0.0416 |
Mechanical pain aftersensation: 512mN probe, 15 s | 1.23 | 1.09 | 1.38 | 0.0006 | 1.15 | 1.04 | 1.28 | 0.0071 |
Mechanical pain aftersensation: 512mN probe, 30 s | 1.20 | 1.07 | 1.34 | 0.0020 | 1.12 | 1.02 | 1.24 | 0.0241 |
Psychosocial Variable | ||||||||
PILL Global Score | 1.52 | 1.35 | 1.71 | <0.0001 | 1.42 | 1.29 | 1.58 | <0.0001 |
EPQ-R Neuroticism | 1.39 | 1.21 | 1.60 | <0.0001 | 1.25 | 1.11 | 1.42 | 0.0003 |
Trait Anxiety Inventory | 1.43 | 1.25 | 1.64 | <0.0001 | 1.34 | 1.19 | 1.52 | <0.0001 |
Perceived Stress Scale | 1.35 | 1.17 | 1.55 | <0.0001 | 1.29 | 1.15 | 1.44 | <0.0001 |
SCL-90R Somatization | 1.44 | 1.31 | 1.58 | <0.0001 | 1.40 | 1.29 | 1.51 | <0.0001 |
HR, hazard ratio; LCL, lower confidence limit; UCL, upper confidence limit; TM, temporomandibular; PILL, Pennebaker Inventory of Limbic Languidness; EPQ-R, Eysenck Personality Questionnaire, Revised; SCL-90R, Symptom Checklist-90, Revised.
The rate of missing information varied slightly across putative risk factors, with an average of approximately 0.097. Compared with the unimputed results, which treated missing failure indicators as censored observations, imputation slightly reduced the hazard ratios for most of the psychosocial variables measured in OPPERA. For instance, Table 2 shows the (standardized) hazard ratios for the Pennebaker Inventory of Limbic Languidness (PILL) score, the neuroticism subscale of the Eysenck Personality Questionnaire (EPQ), the Spielberger Trait Anxiety Inventory score, the Perceived Stress Scale, and the somatization subscale of the Symptom Checklist-90, Revised (SCL-90R). In each case, the hazard ratio was reduced after imputation.
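The completed-data estimates are combined across imputations with Rubin's rules [12, 23], which also yield the rate of missing information. The sketch below assumes the standard formulas: within-imputation variance W, between-imputation variance B, total variance T = W + (1 + 1/m)B, relative increase in variance r = (1 + 1/m)B/W, degrees of freedom ν = (m − 1)(1 + 1/r)², and rate of missing information γ = (r + 2/(ν + 3))/(r + 1). We assume, but the text does not state, that γ is the quantity whose average was reported above. The confidence interval uses a normal approximation rather than a t reference distribution, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def rubin_pool(estimates, variances):
    """Combine m completed-data estimates with Rubin's rules.

    Returns the pooled estimate, the total variance, the degrees of freedom,
    and the rate of missing information (gamma).
    """
    m = len(estimates)
    q_bar = mean(estimates)
    w = mean(variances)              # within-imputation variance
    b = variance(estimates)          # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b          # total variance
    if b == 0:
        return q_bar, t, math.inf, 0.0
    r = (1 + 1 / m) * b / w          # relative increase in variance
    df = (m - 1) * (1 + 1 / r) ** 2
    gamma = (r + 2 / (df + 3)) / (r + 1)
    return q_bar, t, df, gamma


def pooled_hazard_ratio(log_hrs, variances, level=0.95):
    """Pooled HR from per-imputation log-HRs, with a normal-approximation CI."""
    q_bar, t, _, gamma = rubin_pool(log_hrs, variances)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half = z * math.sqrt(t)
    return math.exp(q_bar), math.exp(q_bar - half), math.exp(q_bar + half), gamma
```

With many imputations or small B, ν is large and the normal approximation is close to the t-based interval; with few imputations and a large γ, a t quantile with ν degrees of freedom would be more appropriate.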
A similar pattern was observed after applying our imputation method to the measures of experimental pain sensitivity. The mechanical pain aftersensation ratings were strongly associated with first-onset TMD before imputation, but they were only weakly associated with first-onset TMD after imputation. The pressure pain algometer ratings were also more weakly associated with TMD after imputation (and one of three ratings in Table 2 was no longer significantly associated with first-onset TMD at the p < 0.05 level).
Interestingly, the hazard ratios for the presence of one or more palpation tender points at the temporalis and masseter muscles were also attenuated after imputation. These tender points were evaluated as part of the clinical examination using a different protocol from the quantitative sensory testing algometer pain ratings, although both pain measures (algometer and palpation) were taken at the same facial locations. The palpation ratings were more strongly associated with first-onset TMD than the algometer ratings both before and after imputation, yet these different pain sensitivity measures, obtained with different protocols at the same anatomical locations, were both attenuated by imputation.
The effects of other clinical variables were also attenuated after imputation. For example, the hazard ratios associated with being unable to open one’s mouth wide in the past month and with having two or more comorbid pain conditions were both noticeably attenuated after imputation. However, other clinical variables were more strongly associated with first-onset TMD after imputation. For example, a history of respiratory illness was only weakly associated with first-onset TMD before imputation (HR=1.38, p=0.04), but the association was slightly stronger and considerably more significant after imputation (HR=1.44, p=0.004). Also, being a current smoker was not significantly associated with first-onset TMD before imputation (HR=1.26, p=0.24) but was significantly associated after imputation (HR=1.48, p=0.02).
5.2. Incidence Rates
In Table 3, the incidence rate of first-onset TMD was estimated using two different approaches. First, all missing failure indicators were treated as censored. Second, the multiple imputation method in this paper was used to estimate the incidence rate. The estimated TMD incidence rate using multiple imputation was 70% greater than the unimputed estimate. The estimated incidence rate increased by 70% for females and 87% for males. Estimated incidence rates for whites and Hispanics were 118% and 202% higher, respectively, with imputation. Thus, the incidence rate is likely to be underestimated without imputation.
Table 3.
Estimated TMD Incidence Rates With and Without Imputation
Group | No MI | MI | Percent Change |
---|---|---|---|
Overall | 2.23 | 3.78 | 70% |
Males | 1.87 | 3.49 | 87% |
Females | 2.46 | 4.19 | 70% |
White | 1.70 | 3.70 | 118% |
Black | 4.20 | 5.70 | 36% |
Hispanic | 1.17 | 3.53 | 202% |
Other | 1.10 | 1.86 | 69% |
Incidence rates are given in cases per 100 person-years.
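The incidence-rate calculation can be sketched as follows, assuming (as a simplification) an intercept-only Poisson model with log person-years as the offset. For that model the estimated rate in each imputed data set is simply cases divided by person-years, and the variance of the log rate is approximately 1/cases. Person-years are assumed identical across imputed data sets, since only case status is imputed. The sketch pools the log rates with Rubin's rules and recomputes the percent changes in Table 3 from its rounded rates; the function names are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_incidence_rate(case_counts, person_years, per=100.0):
    """Pool intercept-only Poisson fits (offset = log person-years).

    case_counts: total cases in each imputed data set (observed + imputed)
    person_years: total follow-up, assumed identical in every imputed data set
    Returns the pooled rate per `per` person-years and the total variance
    of the pooled log rate.
    """
    m = len(case_counts)
    # Poisson MLE of the rate is cases / person-years; Var(log rate) ~ 1 / cases.
    log_rates = [math.log(d / person_years) for d in case_counts]
    within = mean([1.0 / d for d in case_counts])
    between = variance(log_rates)
    total = within + (1 + 1 / m) * between
    return per * math.exp(mean(log_rates)), total


def percent_change(before, after):
    """Relative change, in percent, from `before` to `after`."""
    return 100.0 * (after / before - 1.0)


# Table 3 rates, cases per 100 person-years: (No MI, MI).
table3 = {
    "Overall": (2.23, 3.78), "Males": (1.87, 3.49), "Females": (2.46, 4.19),
    "White": (1.70, 3.70), "Black": (4.20, 5.70), "Hispanic": (1.17, 3.53),
    "Other": (1.10, 1.86),
}
for group, (no_mi, mi) in table3.items():
    print(f"{group:9s} {percent_change(no_mi, mi):6.1f}%")
```

Rounded to whole percentages, the recomputed changes match the Percent Change column of Table 3.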
6. Discussion
We have developed a computationally efficient method to adjust for missing failure indicators in time-to-event data using logistic regression and multiple imputation. Logistic regression is used to estimate the failure probability for participants with missing failure indicators. The missing values are imputed, and the standard errors are estimated using our multiple imputation method. This framework is important in studies where failure status may be measured in stages, which may lead to missing failure status indicators. This is a common occurrence in studies of diseases that are difficult or expensive to diagnose, such as TMD.
The present method is similar to the method of Magder and Hughes [17], who use an iterative procedure for parameter estimation based on the EM algorithm. Our assumption of MAR data renders their iterative method unnecessary. Other methods [18, 19, 20] depend on the MCAR assumption, which does not hold for the OPPERA study. Chen et al. [21] estimate Cox regression parameters using the EM algorithm and establish their consistency under basic regularity conditions, including missing at random (MAR) failure indicators. However, their approach depends on the assumptions of piecewise constant proportional hazard functions for the censoring time as well as for the failure time.
In each simulation scenario, our multiple imputation method produced the narrowest valid confidence intervals and no significant bias. In particular, the method of Cook and Kosorok [1] produced slightly wider confidence intervals in all but one of the simulations we considered. The differences were small, so the performance of the two methods appears to be comparable for most practical purposes. However, we believe that our method has several possible advantages over the method of Cook and Kosorok [1]. First, bootstrapping is computationally much more intensive than our multiple imputation approach. Calculating bootstrap confidence intervals generally requires at least 1000 bootstrap replicates [22], whereas as few as 10 imputed data sets may be sufficient for multiple imputation [23]. Although the difference in computing time between the two methods is small for a single fitted model, many such models will be required in the course of the OPPERA study. OPPERA has already collected data on approximately three thousand genetic markers and plans to collect data on approximately a million genetic markers in a genome-wide association study. Thus, at least three thousand (and potentially as many as a million) Cox models will need to be fit, and our proposed method may substantially reduce computing time. Moreover, our method can be easily implemented in popular statistical software packages (such as SAS) without additional programming.
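The computational argument above reduces to simple arithmetic. The sketch below counts Cox fits under the simplifying assumptions that each bootstrap replicate or imputed data set requires one refit per model and that all fits cost the same; it ignores the single logistic fit needed for imputation. The replicate and imputation counts are the figures cited in the text, and the function name is hypothetical.

```python
def total_model_fits(n_models, fits_per_model):
    """Number of Cox fits when each model needs `fits_per_model` refits."""
    return n_models * fits_per_model


BOOTSTRAP_REPLICATES = 1000   # typical minimum for bootstrap CIs [22]
IMPUTATIONS = 10              # may suffice for multiple imputation [23]

for n_markers in (3_000, 1_000_000):
    boot = total_model_fits(n_markers, BOOTSTRAP_REPLICATES)
    mi = total_model_fits(n_markers, IMPUTATIONS)
    print(f"{n_markers:,} markers: {boot:,} bootstrap fits vs {mi:,} MI fits "
          f"({boot // mi}x fewer with MI)")
```

Under these assumptions the ratio is 100 regardless of the number of markers, so the absolute savings grow with the size of the genome-wide analysis.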
Additionally, our methodology may easily be extended to other models, such as Poisson regression. Simulations (Table S9 in the Supporting Information) showed that our proposed method can be used to estimate incidence rates using Poisson regression, which is one of the research aims of the OPPERA study. In particular, estimates of the failure rates were biased when missing failure indicators were treated as censored or when the complete case method was used, but they were unbiased when we employed the methodology in this paper.
Our method may yield increased bias and decreased coverage if the logistic regression model for predicting case status is inaccurate, as observed in the simulations in Section S2.3 in the Supporting Information. However, this would also be true for competing methods, including the method of Cook and Kosorok [1].
Our proposed method also requires that the missing data be MAR. Although this assumption cannot be tested directly, Bair et al. [2] showed that differences between those who did and did not attend their clinical examination were small across a wide range of demographic variables and putative risk factors for TMD. Thus, the MAR assumption is reasonable for OPPERA. Furthermore, the results of the simulations described in Section S2.4 in the Supporting Information show that our proposed method can produce valid results in some situations even if the MAR assumption is violated.
Also, our proposed method can only impute missing failure indicators among participants who have positive screeners. If a participant develops first-onset TMD but still has a negative screener, that participant will be treated as censored, and our method cannot correct for this misclassification. The OPPERA screener was designed to have high sensitivity and modest specificity, so the number of false negative screens is expected to be low. (Indeed, OPPERA performed clinical examinations on a subset of the participants with negative screeners. Although analysis of these data is ongoing, preliminary results suggest that the false negative rate is less than 5%.) Thus, we expect that the small number of false negative screens will not meaningfully affect the results of our analysis. Note also that our simulation scenarios assumed that some failures were not observed because of a negative screener. Since our proposed method gave satisfactory results in these scenarios, failing to observe some events due to negative screeners should not substantially bias the results.
In the OPPERA study, the hazard ratios associated with some variables were noticeably different after imputation. Although other results remained qualitatively unchanged, even small changes in hazard ratios are important. In addition, estimated incidence rates increased substantially after imputation. Since the results of OPPERA may become normative in the orofacial pain literature, precise estimation of the incidence rate of TMD and of the hazard ratios associated with putative risk factors is important. Thus, imputation is recommended.
Supplementary Material
Acknowledgements
The authors would like to acknowledge and thank the principal investigators of the OPPERA study, namely William Maixner, Luda Diatchenko, Bruce Weir, Richard Ohrbach, Roger Fillingim, Joel Greenspan, and Ronald Dubner. The OPPERA study was supported by NIH/NIDCR grant U01DE017018. Naomi Brownstein was supported by NIH/NIEHS grant T32ES007018 and NSF Graduate Research Fellowship Program grant 0646083. Jianwen Cai was supported by NIH/NCI grant P01CA142538 and NIH/NIEHS grant R01ES021900. Eric Bair was supported by NIH/NIDCR grant R03DE023592, NIH/NCATS grant UL1TR001111, and NIH/NIEHS grant P30ES010126.
Contract/grant sponsor: The OPPERA study was supported by NIH/NIDCR grant U01DE017018. Naomi Brownstein was supported by NIH/NIEHS grant T32ES007018 and NSF Graduate Research Fellowship Program grant 0646083. Jianwen Cai was supported by NIH/NCI grant P01CA142538 and NIH/NIEHS grant R01ES021900. Eric Bair was supported by NIH/NIDCR grant R03DE023592, NIH/NIEHS grant P30ES010126, and NIH/NCATS grant UL1TR001111.
References
1. Cook TD, Kosorok MR. Analysis of time-to-event data with incomplete event adjudication. Journal of the American Statistical Association. 2004;99(468):1140–1152. URL http://www.jstor.org/stable/27590492.
2. Bair E, Brownstein NC, Ohrbach R, Greenspan JD, Dubner R, Fillingim RB, Maixner W, Smith SB, Diatchenko L, Gonzalez Y, et al. Study protocol, sample characteristics, and loss to follow-up: The OPPERA prospective cohort study. The Journal of Pain. 2013;14(12):T2–T19. doi: 10.1016/j.jpain.2013.06.006.
3. Magaret AS. Incorporating validation subsets into discrete proportional hazards models for mismeasured outcomes. Statistics in Medicine. 2008;27(26):5456–5470. doi: 10.1002/sim.3365. URL http://dx.doi.org/10.1002/sim.3365.
4. Dodd LE, Korn EL, Freidlin B, Gray R, Bhattacharya S. An audit strategy for progression-free survival. Biometrics. 2011;67(3):1092–1099. doi: 10.1111/j.1541-0420.2010.01539.x. URL http://dx.doi.org/10.1111/j.1541-0420.2010.01539.x.
5. Dworkin S, LeResche L. Research diagnostic criteria for temporomandibular disorders: review, criteria, examinations and specifications, critique. Journal of Craniomandibular Disorders. 1992;6(4):301–355.
6. Ohrbach R, Fillingim RB, Mulkey F, Gonzalez Y, Gordon S, Gremillion H, Lim PF, Ribeiro-Dasilva M, Greenspan JD, Knott C, et al. Clinical findings and pain symptoms as potential risk factors for chronic TMD: Descriptive data and empirically identified domains from the OPPERA case-control study. The Journal of Pain. 2011;12(11) Supplement:T27–T45. doi: 10.1016/j.jpain.2011.09.001. URL http://www.sciencedirect.com/science/article/pii/S1526590011007437.
7. Fillingim RB, Ohrbach R, Greenspan JD, Knott C, Dubner R, Bair E, Baraian C, Slade GD, Maixner W. Potential psychosocial risk factors for chronic TMD: Descriptive data and empirically identified domains from the OPPERA case-control study. The Journal of Pain. 2011;12(11) Supplement:T46–T60. doi: 10.1016/j.jpain.2011.08.007. URL http://www.sciencedirect.com/science/article/pii/S1526590011007401.
8. Greenspan JD, Slade GD, Bair E, Dubner R, Fillingim RB, Ohrbach R, Knott C, Mulkey F, Rothwell R, Maixner W. Pain sensitivity risk factors for chronic TMD: Descriptive data and empirically identified domains from the OPPERA case control study. The Journal of Pain. 2011;12(11) Supplement:T61–T74. doi: 10.1016/j.jpain.2011.08.006. URL http://www.sciencedirect.com/science/article/pii/S1526590011007395.
9. Maixner W, Greenspan JD, Dubner R, Bair E, Mulkey F, Miller V, Knott C, Slade GD, Ohrbach R, Diatchenko L, et al. Potential autonomic risk factors for chronic TMD: Descriptive data and empirically identified domains from the OPPERA case-control study. The Journal of Pain. 2011;12(11) Supplement:T75–T91. doi: 10.1016/j.jpain.2011.09.002. URL http://www.sciencedirect.com/science/article/pii/S1526590011007449.
10. Smith SB, Maixner DW, Greenspan JD, Dubner R, Fillingim RB, Ohrbach R, Knott C, Slade GD, Bair E, Gibson DG, et al. Potential genetic risk factors for chronic TMD: Genetic associations from the OPPERA case control study. The Journal of Pain. 2011;12(11) Supplement:T92–T101. doi: 10.1016/j.jpain.2011.08.005. URL http://www.sciencedirect.com/science/article/pii/S1526590011007383.
11. Slade GD, Sanders A, Bair E, Brownstein NC, Fillingim RB, Maixner W, Greenspan JD, Ohrbach R. Pre-clinical episodes of orofacial pain symptoms and their association with healthcare behaviors in the OPPERA prospective cohort study. Pain. 2013 May;154:750–760. doi: 10.1016/j.pain.2013.01.014.
12. Rubin DB. Multiple imputation after 18+ years. Journal of the American Statistical Association. 1996;91(434):473–489. URL http://www.jstor.org/stable/2291635.
13. Bender R, Augustin T, Blettner M. Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine. 2005 Feb;24:1713–1723. doi: 10.1002/sim.2059.
14. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592. URL http://www.jstor.org/stable/2335739.
15. Maixner W, Diatchenko L, Dubner R, Fillingim RB, Greenspan JD, Knott C, Ohrbach R, Weir B, Slade GD. Orofacial pain prospective evaluation and risk assessment study - the OPPERA study. The Journal of Pain. 2011 Nov;12(11):T4–T11. doi: 10.1016/j.jpain.2011.08.002.
16. Slade GD, Bair E, By K, Mulkey F, Baraian C, Rothwell R, Reynolds M, Miller V, Gonzalez Y, Gordon S, et al. Study methods, recruitment, sociodemographic findings, and demographic representativeness in the OPPERA study. The Journal of Pain. 2011 Nov;12(11):T12–T26. doi: 10.1016/j.jpain.2011.08.001.
17. Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty. American Journal of Epidemiology. 1997;146(2):195–203. doi: 10.1093/oxfordjournals.aje.a009251. URL http://www.ncbi.nlm.nih.gov/pubmed/9230782.
18. McKeague IW, Subramanian S. Product-limit estimators and Cox regression with missing censoring information. Scandinavian Journal of Statistics. 1998;25(4):589–601. URL http://www.jstor.org/stable/4616526.
19. Gijbels I, Lin D, Ying Z. Non- and semi- parametric analysis of failure-time data with missing failure indicators. Lecture Notes-Monograph Series. 2007 Mar;54(1):203–223.
20. Subramanian S. Efficient estimation of regression coefficients and baseline hazard under proportionality of conditional hazards. Journal of Statistical Planning and Inference. 2000;84(1–2):81–94. URL http://www.sciencedirect.com/science/article/pii/S0378375899001536.
21. Chen P, He R, Shen Js, Sun Jg. Regression analysis of right-censored failure time data with missing censoring indicators. Acta Mathematicae Applicatae Sinica (English Series). 2009;25:415–426. doi: 10.1007/s10255-008-8807-1. URL http://dx.doi.org/10.1007/s10255-008-8807-1.
22. Efron B, Tibshirani RJ. An introduction to the bootstrap. Boca Raton, FL: Chapman and Hall/CRC; 1993.
23. Little RJA, Rubin DB. Statistical Analysis with Missing Data. Wiley Series in Probability and Mathematical Statistics. 2002.