Abstract
The problem of misclassification in covariates is ubiquitous in survival data and often leads to biased estimates. The misclassification simulation extrapolation method (MC-SIMEX) is a popular method to correct this bias. However, its impact on Weibull accelerated failure time (AFT) models has not been studied. In this paper, we study the bias caused by misclassification in one or more binary covariates in Weibull AFT models and explore the use of the MC-SIMEX in correcting for this bias, along with its asymptotic properties. Simulation studies are carried out to investigate the numerical properties of the resulting estimator for finite samples. The proposed method is then applied to colon cancer data obtained from the cancer registry at Memorial Sloan Kettering Cancer Center (MSKCC).
Keywords: Accelerated failure time model, Measurement error, Misclassification, Survival analysis, Weibull distribution
1: Introduction
Misclassification of binary covariates occurs frequently in survival analyses of epidemiological data. For example, dietary intake may be misclassified when it is assessed with food frequency questionnaires [1,2]; diagnosis of dental caries may be misclassified due to geographical differences in clinical practice [3]; socioeconomic status may be misclassified when it is based on recall data rather than historical records [4]; and prenatal smoking status may be misclassified in gestational studies [5]. It is well known that an analysis based on misclassified covariates leads to biased estimation, especially when the misclassification rate is high.
Several methods have been investigated for handling misclassification in survival data analysis, such as probabilistic bias analysis [6], the weighted least squares method, the corrected score function method, the pseudo-partial likelihood method [7-9], the pooled estimation method [10,11], the regression calibration approach [11,12] and the misclassification simulation extrapolation (MC-SIMEX) method. Purely numerical studies have also been conducted [13]. Among these methods, MC-SIMEX has recently gained popularity due to its flexibility, ease of implementation, robustness and minimal assumptions [14]. It extends the simulation extrapolation (SIMEX) method [15] for measurement error in continuous covariates and was first proposed by Kuchenhoff et al. [3] for regression models with complete data. It has since been studied for several models in survival data analysis: Bang et al. [11] studied the performance of the MC-SIMEX method in a Cox model, Slate and Bandyopadhyay [14] in log-normal AFT models, and Sevilimedu et al. [16] in log-logistic AFT models. However, it has not been explored for the Weibull AFT model, although the Weibull distribution is widely used to model monotonic hazard functions in survival data analysis. In addition, none of the existing works [11,14,16] provided details about the asymptotic properties of the MC-SIMEX estimator in the corresponding survival models, nor did they investigate the form of the extrapolation function used in the MC-SIMEX method.
Our work is motivated by colon cancer data from the Memorial Sloan Kettering Cancer Center (MSKCC) registry, where the survival times of patients may follow a Weibull distribution. We are interested in the effect of misclassification caused by faulty categorization of pathological nodal status, which plays a pivotal role in determining survival in colon cancer patients. In this paper, we explore the performance of the MC-SIMEX method [3] for misclassification correction in Weibull survival data, and we investigate its asymptotic properties and the form of the extrapolation function. These asymptotic properties and the form of the extrapolation function are applicable to general AFT models, in which the data can follow any specified distribution. In the simulation studies, we also investigate the robustness of the proposed method to scenarios such as misspecification of the survival time distribution, dependent censoring, covariates with multiple categories and multiple covariates subject to misclassification.
2: Methodology
2.1. Weibull AFT model and misclassification matrix:
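Let $T_i$ and $C_i$ denote the failure and censoring times, respectively, for subject $i = 1, \dots, n$. The observations are $(Y_i, \delta_i)$ together with the covariates, in which $Y_i = \min(T_i, C_i)$ is the observed survival time and $\delta_i = I(T_i \le C_i)$ is the censoring indicator. We assume that $\log T_i$ is linearly associated with covariates $X_i$ and $Z_i$. The Weibull AFT model can then be written as
$$\log T_i = \beta_0 + \beta_1 X_i + \beta_2^\top Z_i + \sigma \varepsilon_i, \qquad (1)$$
where $\sigma$ is the scale parameter and $\varepsilon_i$ represents random error which follows the standard extreme value distribution. We denote the parameter vector in model (1) by $\theta = (\beta_0, \beta_1, \beta_2^\top, \sigma)^\top$. The binary covariate $X$ is the variable that is subject to non-differential misclassification and $Z$ is the vector of correctly measured confounding variables. Here we include only one misclassified covariate for clarity of notation. More misclassified covariates can be added to model (1), and the inference procedure for them is the same as the one described below for $X$. The observed version of covariate $X$, which is denoted by $X^*$, is related to $X$ through the misclassification matrix $\Pi$, which is defined as:
$$\Pi = \begin{pmatrix} \pi_{00} & \pi_{01} \\ \pi_{10} & \pi_{11} \end{pmatrix}, \qquad \pi_{ij} = P(X^* = i \mid X = j),$$
where $\pi_{00}$ and $\pi_{11}$ are the specificity and sensitivity of classification, respectively.
When the true covariate $X$ is observed, the likelihood function of the $i$th observation can be written, on the log-time scale, as [17]:
$$L_i(\theta) = \left[\frac{1}{\sigma} f(e_i)\right]^{\delta_i} \left[1 - F(e_i)\right]^{1 - \delta_i}, \qquad e_i = \frac{\log Y_i - \beta_0 - \beta_1 X_i - \beta_2^\top Z_i}{\sigma},$$
where
$$f(e) = \exp\left(e - e^{e}\right)$$
and
$$F(e) = 1 - \exp\left(-e^{e}\right)$$
are the density function and distribution function for $\varepsilon$, respectively. Then the log likelihood function for the data is
$$\ell(\theta) = \sum_{i=1}^{n} \left\{ \delta_i \left[\log f(e_i) - \log \sigma\right] + (1 - \delta_i) \log\left[1 - F(e_i)\right] \right\}.$$
The maximum likelihood estimator of $\theta$ is then derived by solving
$$\frac{\partial \ell(\theta)}{\partial \theta} = 0.$$
The use of the observed covariate $X^*$ instead of $X$ in the Weibull AFT model can result in a biased estimate of $\beta_1$, which is most likely attenuated [18]. That is, the estimated (attenuated) effect size, $\exp(\hat{\beta}_1)$, is more likely to be closer to 1.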
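As a hedged illustration of the censored likelihood described above, the following Python sketch evaluates the Weibull AFT log-likelihood on the log-time scale for one binary covariate and one correctly measured confounder. The function name `weibull_aft_loglik`, the argument layout and the log-parameterisation of the scale are assumptions made for illustration; they are not taken from the authors' R code.

```python
import math

def weibull_aft_loglik(theta, log_y, delta, x, z):
    """Censored log-likelihood for log T = b0 + b1*x + b2*z + sigma*eps.

    eps is assumed to follow the standard (minimum) extreme value
    distribution, with density exp(e - exp(e)) and survival exp(-exp(e)).
    theta = (b0, b1, b2, log_sigma); log_sigma keeps sigma positive.
    """
    b0, b1, b2, log_sigma = theta
    sigma = math.exp(log_sigma)
    total = 0.0
    for ly, d, xi, zi in zip(log_y, delta, x, z):
        e = (ly - b0 - b1 * xi - b2 * zi) / sigma  # standardised residual
        # Event:    log[(1/sigma) f(e)] = e - exp(e) - log(sigma)
        # Censored: log[1 - F(e)]       = -exp(e)
        total += d * (e - math.log(sigma)) - math.exp(e)
    return total
```

In practice this function would be maximised over `theta` with a general-purpose optimiser; replacing `x` by the observed, misclassified covariate gives the naïve fit.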
2.2. MC-SIMEX:
Here, we provide an overview of the MC-SIMEX procedure for the estimation of the parameter vector $\theta$. The general idea of the MC-SIMEX algorithm is to increase the amount of noise in a misclassified covariate through the use of an error inflation factor $\lambda$, estimate the coefficient at each level of added noise, and then extrapolate to a point where there is presumably no misclassification in the observed covariate $X^*$. The parametric form of this extrapolation function can be written as:
$$\theta(\lambda) = G(\lambda, \Gamma),$$
where $G$ is the extrapolation function and $\Gamma$ is the corresponding vector of parameters [3]. We now briefly describe the individual components of MC-SIMEX, namely simulation and extrapolation.
Simulation component:
For a fixed set of values of $\lambda$, namely $\lambda_1, \dots, \lambda_m$, $B$ pseudo datasets are generated at each level of $\lambda$ by using the misclassification equation,
$$X^*_{b,i}(\lambda_k) = MC\left[\Pi^{\lambda_k}\right](X^*_i),$$
where $i = 1, \dots, n$, $b = 1, \dots, B$ and $k = 1, \dots, m$. In other words, for a fixed value $\lambda_k$, the pseudo covariates $X^*_{b,i}(\lambda_k)$ are drawn from a Bernoulli distribution whose success probability depends on $X^*_i$, on the determinant $|\Pi|$ of the misclassification matrix and on the error inflation factor $\lambda_k$. The misclassification equation represents the simulation of a given variable with misclassification matrix $\Pi^{\lambda}$, that is, the misclassification matrix $\Pi$ raised to the power of $\lambda$, where $\lambda > 0$ is the error inflation factor. Let $\hat{\theta}_b(\lambda_k)$ denote the maximum likelihood estimator of $\theta$ for the $b$th pseudo dataset and let the corresponding variance estimate be denoted by $\hat{\tau}^2_b(\lambda_k)$. Then the naïve estimator at misclassification level $\lambda_k$ is obtained as
$$\hat{\theta}(\lambda_k) = \frac{1}{B} \sum_{b=1}^{B} \hat{\theta}_b(\lambda_k).$$
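The simulation step can be sketched as follows for a binary covariate. For a 2x2 misclassification matrix with specificity `pi00` and sensitivity `pi11`, the eigenvalues are 1 and the determinant `pi00 + pi11 - 1`, so the matrix power has a closed form. The function names `powered_spec_sens` and `mc_simulate`, and the restriction to a determinant strictly between 0 and 1, are assumptions of this sketch rather than part of the published method.

```python
import random

def powered_spec_sens(pi00, pi11, lam):
    """Return (specificity, sensitivity) of Pi**lam for a 2x2 matrix Pi.

    Pi has eigenvalues 1 and d = pi00 + pi11 - 1, so Pi**lam is a mixture of
    its stationary limit and the identity, with weight d**lam on the identity.
    """
    d = pi00 + pi11 - 1.0  # determinant of Pi
    if not 0.0 < d < 1.0:
        raise ValueError("this sketch assumes 0 < pi00 + pi11 - 1 < 1")
    v0 = (1.0 - pi11) / (1.0 - d)  # stationary probability of category 0
    v1 = 1.0 - v0                  # stationary probability of category 1
    w = d ** lam
    return v0 + w * (1.0 - v0), v1 + w * (1.0 - v1)

def mc_simulate(x_star, pi00, pi11, lam, rng):
    """Add extra misclassification to observed 0/1 values using Pi**lam."""
    spec, sens = powered_spec_sens(pi00, pi11, lam)
    pseudo = []
    for xs in x_star:
        keep = sens if xs == 1 else spec  # probability of leaving xs unchanged
        pseudo.append(xs if rng.random() < keep else 1 - xs)
    return pseudo

# Illustrative call with made-up inputs
pseudo_x = mc_simulate([0, 1, 1, 0, 1], 0.9, 0.8, 1.5, random.Random(2024))
```

Repeating `mc_simulate` B times at each value of the error inflation factor and refitting the model on each pseudo dataset yields the replicate estimates that are averaged into the naïve estimator at that level.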
Extrapolation component:
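Let $\hat{\theta}_j(\lambda_k)$ denote the $j$th component of $\hat{\theta}(\lambda_k)$ for $k = 1, \dots, m$, let $\hat{\tau}^2_j(\lambda_k)$ denote the corresponding variance component, and let $\hat{\theta}_j(0)$ denote the $j$th component of the naïve estimator of $\theta$. A parametric regression model $G(\lambda, \Gamma)$ is fitted by regressing the dependent variable $\hat{\theta}_j(\lambda_k)$ on the independent variable $\lambda_k$. The least squares method is used to obtain the estimator $\hat{\Gamma}$ of the coefficient vector $\Gamma$. The MC-SIMEX estimator $\hat{\theta}_{\mathrm{MC\text{-}SIMEX}}$ is then obtained by extrapolating the naïve estimates to a point where misclassification in the binary covariates does not exist, i.e., where $\lambda = -1$. That is,
$$\hat{\theta}_{\mathrm{MC\text{-}SIMEX}} = G(-1, \hat{\Gamma}).$$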
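The extrapolation step can be illustrated with a short Python sketch that fits a quadratic in the error inflation factor by ordinary least squares and evaluates it at -1. The grid of inflation values and the "naïve estimates" below are synthetic numbers chosen for illustration only, and the helper names `fit_quadratic` and `extrapolate` are hypothetical.

```python
def fit_quadratic(lams, ests):
    """Least squares fit of est = g0 + g1*lam + g2*lam**2 via normal equations."""
    rows = [[1.0, lam, lam * lam] for lam in lams]
    a = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    b = [sum(r[i] * y for r, y in zip(rows, ests)) for i in range(3)]
    m = [a[i] + [b[i]] for i in range(3)]  # augmented 3x4 system
    for c in range(3):  # Gaussian elimination with partial pivoting
        p = max(range(c, 3), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, 3):
            f = m[r][c] / m[c][c]
            for k in range(c, 4):
                m[r][k] -= f * m[c][k]
    g = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):  # back substitution
        g[i] = (m[i][3] - sum(m[i][k] * g[k] for k in range(i + 1, 3))) / m[i][i]
    return g

def extrapolate(gamma, lam=-1.0):
    """Evaluate the fitted quadratic at lam (lam = -1: no misclassification)."""
    return gamma[0] + gamma[1] * lam + gamma[2] * lam * lam

# Synthetic example: estimates that lie exactly on a quadratic in lam
lams = [0.0, 0.5, 1.0, 1.5, 2.0]
naive_estimates = [0.5 - 0.1 * lam + 0.02 * lam * lam for lam in lams]
corrected = extrapolate(fit_quadratic(lams, naive_estimates))
```

Linear or cubic extrapolation follows the same pattern with a different set of basis columns; the number of grid points must be at least the number of fitted coefficients.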
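To calculate the variance of $\hat{\theta}_{\mathrm{MC\text{-}SIMEX}}$, we first calculate the sample variance of each of the naïve estimates as
$$s^2_{\Delta}(\lambda_k) = \frac{1}{B - 1} \sum_{b=1}^{B} \left[\hat{\theta}_b(\lambda_k) - \hat{\theta}(\lambda_k)\right] \left[\hat{\theta}_b(\lambda_k) - \hat{\theta}(\lambda_k)\right]^\top,$$
where $\hat{\theta}(\lambda_k)$ is the sample mean of $\hat{\theta}_b(\lambda_k)$, with $b = 1, \dots, B$. Next, the variance of each naïve estimator is obtained as
$$\hat{\tau}^2(\lambda_k) = \frac{1}{B} \sum_{b=1}^{B} \hat{\tau}^2_b(\lambda_k),$$
where $\hat{\tau}^2_b(\lambda_k)$ represents the variance of the naïve estimator in the $b$th replication and is obtained using the information matrix, so that $\hat{\tau}^2(\lambda_k)$ is the average of the variances for the naïve estimators across replications. We then calculate the variance of $\hat{\theta}_{\mathrm{MC\text{-}SIMEX}}$ by utilizing the method proposed by Stefanski and Cook [19], which extrapolates the difference between $\hat{\tau}^2(\lambda)$ and $s^2_{\Delta}(\lambda)$ to $\lambda = -1$ [20].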
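For a single coefficient, the quantity that is extrapolated in the variance step can be sketched as below: the average model-based variance across the B replicates minus the sample variance of the replicate estimates. The function name `variance_difference` and the input values are hypothetical.

```python
from statistics import mean, variance

def variance_difference(replicate_estimates, replicate_variances):
    """Average model-based variance minus sample variance of the estimates.

    replicate_estimates: B estimates of one coefficient at a fixed inflation level
    replicate_variances: the B information-based variance estimates
    """
    s2 = variance(replicate_estimates)  # sample variance, divisor B - 1
    tau2 = mean(replicate_variances)    # average model-based variance
    return tau2 - s2

# Made-up values for three replicates at one inflation level
diff = variance_difference([1.0, 1.2, 0.8], [0.10, 0.10, 0.10])
```

Computing this difference at each grid value and extrapolating it to -1, with the same type of curve used for the point estimate, gives the variance estimate; the extrapolated value is not guaranteed to be positive in small samples.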
3: Properties of the MC-SIMEX estimator
In this section, we provide properties of the MC-SIMEX estimator. They are applicable to general AFT models, in which the survival time can follow any specified distribution. The Weibull AFT model, in which the survival time follows a Weibull distribution, is one such model.
3.1. Asymptotic properties of the MC-SIMEX estimator
Under regularity conditions for the maximum likelihood estimator, the MC-SIMEX estimator, $\hat{\theta}_{\mathrm{MC\text{-}SIMEX}}$, can be shown to be consistent and asymptotically normally distributed. For each fixed $\lambda$, the naïve estimator $\hat{\theta}(\lambda)$ is a consistent estimator of its limit $\theta(\lambda)$ by the theory of maximum likelihood estimation and Slutsky's theorem. In addition, $\hat{\Gamma}$ is a consistent estimator of $\Gamma$ by least squares estimation theory. Suppose that $G$ is the true extrapolation function; then $\hat{\theta}_{\mathrm{MC\text{-}SIMEX}} = G(-1, \hat{\Gamma})$ is consistent for $\theta$ by the continuous mapping theorem. Asymptotic normality can be proven along the lines of the proof provided by He et al. [21], who establish asymptotic normality of the SIMEX estimator, which corrects for measurement error in continuous variables in an AFT model. The same theory can be applied to MC-SIMEX, which handles categorical variables with misclassification, because both SIMEX and MC-SIMEX use maximum likelihood theory for estimation in the simulation step and the least squares method in the extrapolation step. Note that in the simulation step, the proof depends only on the maximum likelihood estimator for the simulated data and not on how the simulated data are generated. Therefore, it can be stated that
where ; , in which , is a submatrix of formed by first m row and first m column and . is the diagonal matrix of , , where and is the covariance matrix of the vector , .
3.2. The form of extrapolation function in MC-SIMEX method
To investigate the form of the extrapolation function in the MC-SIMEX method, we prove the relationship between the naïve parameter and true parameter for general AFT models, which can be shown to be (proof provided in appendix):
| (2) |
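where $|\Pi|$ is the determinant of the misclassification matrix $\Pi$, given by $|\Pi| = \pi_{00} + \pi_{11} - 1$, with $\pi_{00}$ and $\pi_{11}$ the specificity and sensitivity of classification. The form of the extrapolation function can be visually evaluated by fitting a regression line for the naïve parameter over various values of $\lambda$ with the aid of spectral decomposition. Spectral decomposition allows us to estimate the misclassification matrix at a particular level of error inflation as follows:
$$\Pi^{\lambda} = E \Lambda^{\lambda} E^{-1},$$
where $\Lambda$ is the diagonal matrix of eigenvalues of $\Pi$ and $E$ is the matrix of corresponding eigenvectors.
The components of $\Pi^{\lambda}$ can then be inserted into (2) to obtain the relationship between the naïve parameter and the true parameter at the specified value of $\lambda$. Figure 1 shows this relationship as a function of $\lambda$, calculated based on equation (2), for several parameter settings. It shows that the quadratic extrapolation function is appropriate for the MC-SIMEX method under general AFT models.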
Figure 1:
Figure demonstrating the behavior of with varying levels of . The and . Top row shows the behavior of when the values of are 1,2 and 3 (from left to right) and when the value of is 0.7. Bottom row shows the behavior of when the values of are 1,2 and 3 (from left to right) and when the value of is 0.9. The solid line represents the fitted quadratic model while the dotted line represents a plot of the original estimates of .
4: Simulation studies
4.1. Settings:
In this section, we compare the proposed estimator, referred to as the Weibull-Q MC-SIMEX estimator (Q denoting the quadratic form of extrapolation), with five other estimators in finite samples: the Cox MC-SIMEX estimator proposed by Bang et al. [11], the Weibull-L MC-SIMEX estimator (L denoting the linear form of extrapolation), the Weibull-C MC-SIMEX estimator (C denoting the cubic form of extrapolation), the true estimator, which uses the correct value of the covariate, and the naïve estimator, which does not correct for misclassification. We consider a wide range of scenarios. The objective of the first simulation study is to assess the performance of the Weibull-Q MC-SIMEX estimator with different values of the scale parameter ($\sigma$) and varying sample sizes. In the subsequent simulation studies, we investigate the robustness of the proposed estimator to misspecification of the distribution of survival times, dependent censoring (the distribution of censoring time depends on a covariate in the model), multiple categories in the misclassified covariate, and two categorical covariates measured with error.
For the four MC-SIMEX estimators (Weibull-Q, Cox, Weibull-L and Weibull-C), a total of $B = 50$ pseudo-datasets are generated at each level of the error inflation factor. The covariate Z is generated from a standard normal distribution. For each setting, a total of 1000 simulated datasets are analyzed. The rate of censoring is adjusted to 30% and two sample sizes, n = 500 and n = 1000, are considered. We present the results for n = 500 in the main article and for n = 1000 in the supplementary materials. We report the bias, empirical variance, estimated variance and coverage probability of the 95% confidence interval for each method. The R code for the Weibull MC-SIMEX method is available from the first author upon request.
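The four reported performance measures can be computed from the replicated simulation output roughly as follows. This is a sketch under the assumption that bias is the mean estimate minus the true value, that the estimated variance is averaged over replicates, and that coverage is based on Wald intervals with the normal quantile 1.96; the function name `summarize` is hypothetical.

```python
from statistics import mean, variance

def summarize(estimates, estimated_variances, truth, z=1.96):
    """Bias, empirical variance, mean estimated variance and Wald coverage."""
    covered = [
        abs(est - truth) <= z * var ** 0.5
        for est, var in zip(estimates, estimated_variances)
    ]
    return {
        "bias": mean(estimates) - truth,
        "empirical_variance": variance(estimates),
        "estimated_variance": mean(estimated_variances),
        "coverage": sum(covered) / len(covered),
    }

# Made-up values for two replicates, for illustration only
summary = summarize([0.6, 0.8], [0.01, 0.01], truth=0.7)
```

An estimated variance well below the empirical variance is what typically drives coverage below the nominal 95% level.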
4.2. One misclassified covariate
In the first set of simulation studies, we investigate the performance of the Weibull-Q MC-SIMEX estimator in the following AFT model:
$$\log T = \beta_0 + \beta_1 X + \beta_2 Z + \sigma \varepsilon, \qquad (3)$$
where is the scale parameter and represents the error term which follows standard extreme value distribution and survival time follows Weibull distribution. The true values of the binary covariate are generated from a random binomial distribution with a mean of 0.5. It is subject to non-differential misclassification with the misclassification matrix and is the correctly measured confounding variable. The values of and are both set to , while the value of is set to 0. is the log of survival time. The scale parameters are set as 0.5, 1.5 or 2. The censoring time is generated from uniform (0, upper), where the upper bound is adjusted to yield 30% censoring. In this section, it is assumed that censoring is independent of covariates. The results for n=500 are shown in table 1.
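A data-generating step of this kind can be sketched in Python as follows. The log of a unit exponential variable follows the standard (minimum) extreme value distribution, which yields Weibull survival times. The function name `simulate_dataset` and every numeric value in the example call (coefficients, sensitivity, specificity, censoring bound) are illustrative assumptions and not the settings used in the paper.

```python
import math
import random

def simulate_dataset(n, beta0, beta1, beta2, sigma, pi00, pi11, upper, rng):
    """Simulate (y, delta, x, x_star, z) under a Weibull AFT model.

    x is Bernoulli(0.5), z is standard normal, censoring is Uniform(0, upper)
    and x_star is a non-differentially misclassified copy of x.
    """
    data = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        z = rng.gauss(0.0, 1.0)
        eps = math.log(rng.expovariate(1.0))  # standard extreme value error
        t = math.exp(beta0 + beta1 * x + beta2 * z + sigma * eps)
        c = rng.uniform(0.0, upper)
        keep = pi11 if x == 1 else pi00       # sensitivity or specificity
        x_star = x if rng.random() < keep else 1 - x
        data.append((min(t, c), int(t <= c), x, x_star, z))
    return data

# Illustrative call; all numeric settings are assumptions
sample = simulate_dataset(500, 0.0, math.log(2), math.log(2), 1.0,
                          0.9, 0.8, 10.0, random.Random(7))
censoring_rate = 1.0 - sum(row[1] for row in sample) / len(sample)
```

In a simulation study the bound `upper` would be tuned, for example by trial runs, until `censoring_rate` is close to the target of 30%.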
Table 1:
Performance of the true, naïve, Cox MC-SIMEX, Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX and Weibull-C MC-SIMEX estimators with misclassified covariate $X$. Sample size is set to 500. Scale parameter is denoted by $\sigma$.
| | True | Naïve | Cox MC-SIMEX | Weibull-Q MC-SIMEX | Weibull-L MC-SIMEX | Weibull-C MC-SIMEX |
| $\sigma = 0.5$ | | | | | | |
| Estimated variance | 0.048 | 0.052 | 0.053 | 0.064 | 0.064 | 0.064 |
| Empirical variance | 0.050 | 0.114 | 0.068 | 0.068 | 0.037 | 2.793 |
| Bias | 0.002 | −0.248 | −0.075 | −0.071 | −0.183 | 0.026 |
| Coverage | 0.949 | 0.808 | 0.878 | 0.880 | 0.967 | 0.211 |
| $\sigma = 1.5$ | | | | | | |
| Estimated variance | 0.005 | 0.007 | 0.010 | 0.008 | 0.008 | 0.008 |
| Empirical variance | 0.005 | 0.083 | 0.026 | 0.018 | 0.041 | 0.380 |
| Bias | −0.002 | −0.275 | −0.135 | −0.095 | −0.200 | −0.042 |
| Coverage | 0.945 | 0.088 | 0.662 | 0.720 | 0.249 | 0.204 |
| $\sigma = 2$ | | | | | | |
| Estimated variance | 0.003 | 0.004 | 0.011 | 0.005 | 0.005 | 0.005 |
| Empirical variance | 0.003 | 0.084 | 0.036 | 0.018 | 0.046 | 0.264 |
| Bias | 0.002 | −0.283 | −0.177 | −0.109 | −0.214 | −0.094 |
| Coverage | 0.951 | 0.008 | 0.581 | 0.597 | 0.088 | 0.208 |
In terms of bias and empirical variance, the true estimator generally has the smallest values, as it uses the true covariate values, while the naïve estimator has the largest bias and a much larger empirical variance than the true estimator, as it uses the misclassified covariate values. Both the Weibull-Q MC-SIMEX estimator and the Cox MC-SIMEX estimator have smaller bias and empirical variance than the naïve estimator, which shows that the MC-SIMEX method substantially improves performance. The Weibull-Q MC-SIMEX estimator has smaller bias than the Cox MC-SIMEX estimator and an empirical variance that is no larger, which indicates that the Weibull-Q MC-SIMEX method corrects better for the misclassification. The estimated variance is usually smaller than the empirical variance for the Weibull-Q MC-SIMEX method and the Cox MC-SIMEX method, especially when the scale parameter is large. This may be due to the unknown extrapolation function for the variance. In terms of coverage probability, the true estimator has coverage close to the nominal level, as expected. The Weibull-Q MC-SIMEX method always has better coverage probability than the Cox MC-SIMEX method.
Overall, the Weibull-Q MC-SIMEX method performs better than the Cox MC-SIMEX method in terms of bias, empirical variance and coverage probability. Comparing the Weibull MC-SIMEX estimators with different extrapolation functions, the Weibull-C MC-SIMEX estimator always has the smallest bias and the Weibull-L MC-SIMEX estimator always has the largest bias. The Weibull-C MC-SIMEX estimator always has a much larger empirical variance than the other two estimators. The Weibull-Q MC-SIMEX estimator is better than the Weibull-L MC-SIMEX estimator except for the case with a small scale parameter. As for coverage probability, the Weibull-Q MC-SIMEX estimator performs more robustly than the other two estimators, except that it has smaller coverage than the Weibull-L MC-SIMEX estimator for the case with a small scale parameter. This indicates that the quadratic extrapolation function provides a good compromise in terms of the bias-variance trade-off, which is supported by the findings in Section 3.2.
4.3: Misspecified distribution of survival times
In this setting, we explore the performance of the proposed estimator when the distribution of survival time is specified as a Weibull distribution while its true distribution is exponential, log-normal or log-logistic. The survival times for these distributions are generated using model (3) with scale parameter $\sigma = 1$. We simulate the log-normally distributed survival data by generating the error term from the normal (0,1) or normal (0,2) distribution. The log-logistic survival data are simulated by generating the error term from the logistic (0,1) or logistic (0,2) distribution. The censoring times are generated as described in Section 4.2. The true values of the binary covariate are generated from a binomial distribution with a mean of 0.5. The results are given in Table 2.
Table 2:
Performance of the true, naïve, Cox MC-SIMEX, Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX and Weibull-C MC-SIMEX estimators with misclassified covariate $X$, when the true underlying distribution is misspecified as a Weibull distribution. Sample size is set to 500. Row groups give the true distribution: exponential, log-normal with normal (0,1) or normal (0,2) errors, and log-logistic with logistic (0,1) or logistic (0,2) errors.
| | True | Naïve | Cox MC-SIMEX | Weibull-Q MC-SIMEX | Weibull-L MC-SIMEX | Weibull-C MC-SIMEX |
| Exponential | | | | | | |
| Estimated variance | 0.012 | 0.013 | 0.015 | 0.016 | 0.016 | 0.016 |
| Empirical variance | 0.013 | 0.082 | 0.025 | 0.023 | 0.037 | 0.788 |
| Bias | 0.004 | −0.261 | −0.091 | −0.073 | −0.190 | −0.069 |
| Coverage | 0.943 | 0.389 | 0.798 | 0.829 | 0.763 | 0.231 |
| Log-normal, normal (0,1) errors | | | | | | |
| Estimated variance | 0.011 | 0.013 | 0.019 | 0.016 | 0.016 | 0.016 |
| Empirical variance | 0.015 | 0.100 | 0.023 | 0.032 | 0.051 | 0.755 |
| Bias | −0.050 | −0.295 | −0.035 | −0.125 | −0.223 | −0.129 |
| Coverage | 0.916 | 0.270 | 0.880 | 0.744 | 0.490 | 0.223 |
| Log-normal, normal (0,2) errors | | | | | | |
| Estimated variance | 0.036 | 0.039 | 0.019 | 0.047 | 0.047 | 0.047 |
| Empirical variance | 0.044 | 0.136 | 0.136 | 0.073 | 0.063 | 2.134 |
| Bias | −0.080 | −0.311 | −0.339 | −0.152 | −0.245 | −0.131 |
| Coverage | 0.923 | 0.637 | 0.306 | 0.816 | 0.901 | 0.233 |
| Log-logistic, logistic (0,1) errors | | | | | | |
| Estimated variance | 0.028 | 0.029 | 0.019 | 0.036 | 0.036 | 0.036 |
| Empirical variance | 0.031 | 0.119 | 0.091 | 0.051 | 0.055 | 1.636 |
| Bias | −0.068 | −0.304 | −0.267 | −0.123 | −0.229 | −0.077 |
| Coverage | 0.935 | 0.566 | 0.467 | 0.829 | 0.882 | 0.214 |
| Log-logistic, logistic (0,2) errors | | | | | | |
| Estimated variance | 0.096 | 0.101 | 0.019 | 0.120 | 0.120 | 0.120 |
| Empirical variance | 0.106 | 0.186 | 0.228 | 0.148 | 0.065 | 5.250 |
| Bias | −0.077 | −0.297 | −0.454 | −0.101 | −0.238 | −0.040 |
| Coverage | 0.922 | 0.851 | 0.135 | 0.863 | 0.971 | 0.211 |
For exponential survival data, the Weibull-Q MC-SIMEX method generally outperforms the Cox MC-SIMEX method, the Weibull-L MC-SIMEX method and the Weibull-C MC-SIMEX method in terms of bias, empirical variance, and coverage probability. However, it has a slightly larger bias than the Weibull-C MC-SIMEX method. The estimated variance is usually smaller than the empirical variance for the MC-SIMEX related methods.
For log-normal survival data, the performance of the Cox MC-SIMEX estimator is not stable. When the data variance is small, i.e., normal (0,1), it performs very well: it has the smallest bias (even smaller than the true estimator), and a smaller empirical variance and better coverage probability than the Weibull-Q MC-SIMEX estimator. However, when the data variance is large, i.e., normal (0,2), it performs very poorly: it has the largest bias (even larger than the naïve estimator), and a much larger empirical variance and worse coverage probability than the Weibull-Q MC-SIMEX estimator. In contrast, the performance of the Weibull-Q MC-SIMEX method is reasonably good and robust. Similar to the results in Section 4.2, the estimated variance is usually smaller than the empirical variance for the Cox MC-SIMEX estimator and the Weibull-Q MC-SIMEX estimator due to the unknown extrapolation function of the variance. The Weibull-Q MC-SIMEX method always has smaller bias than the Weibull-L MC-SIMEX method and comparable bias to the Weibull-C MC-SIMEX method. It has smaller empirical variance and better coverage probability than the Weibull-C MC-SIMEX method, and than the Weibull-L MC-SIMEX method for large samples (n=1000) and for small samples (n=500) with small variance. This again supports the finding in Section 3.2.
For log-logistic survival data, the Cox MC-SIMEX estimator performs worse than for log normal survival data in terms of bias, empirical variance and coverage probability, while the Weibull-Q MC-SIMEX method still performs reasonably well. The Weibull-L MC-SIMEX method has larger bias than the Weibull-Q MC-SIMEX method, although their performances are similar in terms of empirical variance and coverage probability. Although the Weibull-C MC-SIMEX method has smaller bias than the Weibull-Q MC-SIMEX method, it has much worse empirical variance and coverage probability.
Overall, for the distributions considered in this section, the Weibull-Q MC-SIMEX method performs reasonably well and in a robust manner. Other methods, such as the Cox MC-SIMEX method, the Weibull-L MC-SIMEX method and the Weibull-C MC-SIMEX method have poor performance in some settings and perform generally worse than the Weibull-Q MC-SIMEX method.
4.4: Dependent censoring
In this simulation, we explore the performance of the proposed estimator when the censoring mechanism depends on a covariate. The true values of the binary covariate are generated from a binomial distribution with a mean of 0.5. If the true binary covariate takes the value 1, the censoring time is generated from a uniform (0, 4) distribution; if it takes the value 0, the censoring time is generated from a uniform (0, 9) distribution. This results in a censoring rate of approximately 30%. We again consider model (3) from Section 4.2. The results are shown in Table 3.
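The covariate-dependent censoring mechanism described above can be written as a small Python helper. The bounds 4 and 9 come from the text; the function name `dependent_censoring_time` is hypothetical.

```python
import random

def dependent_censoring_time(x, rng):
    """Draw a censoring time whose upper bound depends on the true covariate.

    Uniform(0, 4) when x == 1 and Uniform(0, 9) when x == 0.
    """
    upper = 4.0 if x == 1 else 9.0
    return rng.uniform(0.0, upper)

rng = random.Random(123)
draws_x1 = [dependent_censoring_time(1, rng) for _ in range(1000)]
draws_x0 = [dependent_censoring_time(0, rng) for _ in range(1000)]
```

Because the censoring bound is driven by the true covariate rather than the observed one, censoring remains informative about the covariate even though the analyst only sees the misclassified version.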
Table 3:
Performance of the true, naïve, Cox MC-SIMEX, Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX and Weibull-C MC-SIMEX estimators with misclassified covariate $X$, when the censoring mechanism depends on $X$. Sample size is set to 500. Scale parameter is denoted by $\sigma$.
| | True | Naïve | Cox MC-SIMEX | Weibull-Q MC-SIMEX | Weibull-L MC-SIMEX | Weibull-C MC-SIMEX |
| $\sigma = 0.5$ | | | | | | |
| Estimated variance | 0.049 | 0.051 | 0.052 | 0.059 | 0.059 | 0.059 |
| Empirical variance | 0.049 | 0.118 | 0.071 | 0.068 | 0.032 | 2.638 |
| Bias | 0.005 | −0.253 | −0.073 | −0.067 | −0.169 | −0.094 |
| Coverage | 0.947 | 0.778 | 0.856 | 0.863 | 0.952 | 0.214 |
| $\sigma = 1.5$ | | | | | | |
| Estimated variance | 0.006 | 0.007 | 0.010 | 0.008 | 0.008 | 0.008 |
| Empirical variance | 0.006 | 0.072 | 0.018 | 0.013 | 0.034 | 0.376 |
| Bias | 0.005 | −0.257 | −0.100 | −0.067 | −0.183 | −0.049 |
| Coverage | 0.957 | 0.108 | 0.763 | 0.800 | 0.298 | 0.214 |
| $\sigma = 2$ | | | | | | |
| Estimated variance | 0.003 | 0.004 | 0.011 | 0.006 | 0.006 | 0.006 |
| Empirical variance | 0.003 | 0.074 | 0.025 | 0.012 | 0.037 | 0.243 |
| Bias | 0.001 | −0.264 | −0.141 | −0.080 | −0.191 | −0.059 |
| Coverage | 0.947 | 0.021 | 0.710 | 0.744 | 0.158 | 0.217 |
The patterns of the results are similar to those in Section 4.2. This suggests that the dependent censoring mechanism considered here does not materially affect the performance of these methods.
4.5: One misclassified covariate with 3 categories
In real data analysis, it is often the case that categorical covariates have more than 2 levels, in contrast to what has been described so far. In order to mimic this scenario and evaluate the proposed estimator in such a scenario, we assumed that the covariate came from a multinomial distribution with three categories, which are 0,1 and 2, each with a probability of 0.33. The misclassification matrix was set to .
The resulting AFT model is therefore:
$$\log T = \beta_0 + \beta_1 I(X = 1) + \beta_2 I(X = 2) + \beta_3 Z + \sigma \varepsilon,$$
where $I(\cdot)$ is the indicator function with the corresponding coefficients $\beta_1$ and $\beta_2$, both of which are set to log(2). Censoring times are generated as described in Section 4.2. The results of this simulation analysis evaluating the numerical properties of the proposed estimator are shown in Table 4.
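For a covariate with three categories, misclassification amounts to drawing the observed category from the column of the misclassification matrix that corresponds to the true category, after which the indicator (dummy) variables are built from the observed value. The sketch below uses a hypothetical 3x3 matrix chosen only for illustration; it is not the matrix used in the paper, and the function names are assumptions.

```python
import random

def misclassify(x, pi, rng):
    """Draw an observed category given true category x.

    pi[i][j] is the probability of observing category i when the truth is j,
    so each column of pi sums to 1.
    """
    weights = [pi[i][x] for i in range(len(pi))]
    return rng.choices(range(len(pi)), weights=weights, k=1)[0]

def dummies(x):
    """Indicator coding with category 0 as the reference level."""
    return (int(x == 1), int(x == 2))

# Hypothetical misclassification matrix (columns sum to 1)
PI_EXAMPLE = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]
rng = random.Random(11)
observed = [misclassify(x, PI_EXAMPLE, rng) for x in (0, 1, 2, 2, 1)]
coded = [dummies(x) for x in observed]
```

The same column-wise draw, applied with a powered matrix, would give the simulation step of MC-SIMEX for a multi-category covariate.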
Table 4:
Performance of the true, naïve, Cox MC-SIMEX, Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX and Weibull-C MC-SIMEX estimators when the misclassified covariate $X$ has three categories. Sample size is set to 500. Scale parameter is denoted by $\sigma$. Estimated variance 1 denotes the estimated variance for the first category of $X$ and estimated variance 2 denotes the estimated variance for the second category of $X$. Notation is defined similarly for the other measures.
| | True | Naïve | Cox MC-SIMEX | Weibull-Q MC-SIMEX | Weibull-L MC-SIMEX | Weibull-C MC-SIMEX |
| $\sigma = 0.5$ | | | | | | |
| Estimated variance 1 | 0.068 | 0.071 | 0.069 | 0.082 | 0.082 | 0.082 |
| Estimated variance 2 | 0.068 | 0.066 | 0.061 | 0.075 | 0.075 | 0.075 |
| Empirical variance 1 | 0.070 | 0.138 | 0.098 | 0.098 | 0.043 | 3.827 |
| Empirical variance 2 | 0.069 | 0.136 | 0.067 | 0.067 | 0.040 | 2.687 |
| Bias 1 | 0.001 | −0.263 | −0.069 | −0.067 | −0.194 | 0.043 |
| Bias 2 | −0.005 | −0.263 | −0.070 | −0.068 | −0.186 | −0.061 |
| Coverage 1 | 0.946 | 0.827 | 0.840 | 0.863 | 0.958 | 0.212 |
| Coverage 2 | 0.940 | 0.825 | 0.896 | 0.920 | 0.976 | 0.231 |
| $\sigma = 1.5$ | | | | | | |
| Estimated variance 1 | 0.007 | 0.009 | 0.014 | 0.011 | 0.011 | 0.011 |
| Estimated variance 2 | 0.007 | 0.009 | 0.013 | 0.010 | 0.010 | 0.010 |
| Empirical variance 1 | 0.007 | 0.096 | 0.028 | 0.022 | 0.050 | 0.532 |
| Empirical variance 2 | 0.007 | 0.088 | 0.026 | 0.019 | 0.048 | 0.405 |
| Bias 1 | −0.006 | −0.294 | −0.130 | −0.098 | −0.222 | −0.035 |
| Bias 2 | −0.001 | −0.281 | −0.131 | −0.099 | −0.217 | −0.066 |
| Coverage 1 | 0.944 | 0.140 | 0.731 | 0.753 | 0.268 | 0.197 |
| Coverage 2 | 0.956 | 0.141 | 0.745 | 0.752 | 0.242 | 0.218 |
| $\sigma = 2$ | | | | | | |
| Estimated variance 1 | 0.004 | 0.006 | 0.016 | 0.007 | 0.007 | 0.007 |
| Estimated variance 2 | 0.004 | 0.006 | 0.016 | 0.006 | 0.006 | 0.006 |
| Empirical variance 1 | 0.004 | 0.098 | 0.035 | 0.021 | 0.058 | 0.325 |
| Empirical variance 2 | 0.004 | 0.093 | 0.035 | 0.020 | 0.056 | 0.263 |
| Bias 1 | −0.001 | −0.302 | −0.170 | −0.116 | −0.240 | −0.037 |
| Bias 2 | 0.000 | −0.295 | −0.175 | −0.121 | −0.236 | −0.053 |
| Coverage 1 | 0.957 | 0.017 | 0.740 | 0.635 | 0.093 | 0.200 |
| Coverage 2 | 0.950 | 0.028 | 0.739 | 0.590 | 0.069 | 0.237 |
The comparisons of the Weibull-Q MC-SIMEX estimator with the other estimators are similar to those in Section 4.2 in terms of bias and empirical variance. In terms of coverage probability, when the sample size is small, i.e., n=500, the Weibull-Q MC-SIMEX estimator is better than the Cox MC-SIMEX estimator for $\sigma = 0.5$; they perform similarly for $\sigma = 1.5$, and the Cox MC-SIMEX estimator is better for $\sigma = 2$. When n=1000 (Table S4 in the supplementary material), the coverage probability of the Weibull-Q MC-SIMEX estimator is similar to or better than that of the Cox MC-SIMEX estimator for the two smaller values of $\sigma$. Although the Cox MC-SIMEX estimator is still better for $\sigma = 2$, the difference in coverage between the two methods is smaller than when n=500. This indicates that when the misclassified covariate has more categories, a larger sample size is required for the Weibull-Q MC-SIMEX estimator to have better coverage probability than the Cox MC-SIMEX estimator.
4.6: Two mismeasured covariates
In this section, we evaluate a setting in which two binary covariates are subject to misclassification. The resulting model is:
$$\log T = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 Z + \sigma \varepsilon,$$
where both $X_1$ and $X_2$ follow a binomial distribution with a mean of 0.5. The values of both $\beta_1$ and $\beta_2$ are set to log(2). The same misclassification matrix is used for both $X_1$ and $X_2$, and the two misclassification processes are independent. The censoring rate is set to 30% by adjusting the upper bound of a uniformly distributed censoring time, which is independent of the covariates in the model above. The results of this analysis are shown in Table 5. Once again, the patterns of the comparisons among these estimators are similar to those in Section 4.2.
Table 5:
Performance of the true, naïve, Cox MC-SIMEX, Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX and Weibull-C MC-SIMEX estimators when there are two misclassified covariates $X_1$ and $X_2$. Sample size is set to 500. Scale parameter is denoted by $\sigma$. Estimated variance 1 denotes the estimated variance for $X_1$ and estimated variance 2 denotes the estimated variance for $X_2$. Notation is defined similarly for the other measures.
| | True | Naïve | Cox MC-SIMEX | Weibull-Q MC-SIMEX | Weibull-L MC-SIMEX | Weibull-C MC-SIMEX |
| $\sigma = 0.5$ | | | | | | |
| Estimated variance 1 | 0.046 | 0.050 | 0.053 | 0.065 | 0.065 | 0.065 |
| Estimated variance 2 | 0.046 | 0.050 | 0.052 | 0.062 | 0.062 | 0.062 |
| Empirical variance 1 | 0.044 | 0.116 | 0.076 | 0.077 | 0.038 | 2.663 |
| Empirical variance 2 | 0.047 | 0.116 | 0.065 | 0.066 | 0.037 | 2.776 |
| Bias 1 | −0.001 | −0.261 | −0.080 | −0.073 | −0.184 | −0.096 |
| Bias 2 | 0.003 | −0.257 | −0.073 | −0.066 | −0.183 | −0.001 |
| Coverage 1 | 0.959 | 0.791 | 0.852 | 0.876 | 0.972 | 0.231 |
| Coverage 2 | 0.947 | 0.793 | 0.874 | 0.874 | 0.970 | 0.221 |
| $\sigma = 1.5$ | | | | | | |
| Estimated variance 1 | 0.005 | 0.008 | 0.010 | 0.010 | 0.010 | 0.010 |
| Estimated variance 2 | 0.005 | 0.008 | 0.010 | 0.010 | 0.010 | 0.010 |
| Empirical variance 1 | 0.005 | 0.083 | 0.037 | 0.020 | 0.042 | 0.488 |
| Empirical variance 2 | 0.005 | 0.083 | 0.036 | 0.020 | 0.041 | 0.475 |
| Bias 1 | −0.001 | −0.273 | −0.171 | −0.098 | −0.203 | −0.080 |
| Bias 2 | 0.003 | −0.275 | −0.166 | −0.092 | −0.202 | −0.049 |
| Coverage 1 | 0.951 | 0.138 | 0.531 | 0.746 | 0.298 | 0.212 |
| Coverage 2 | 0.944 | 0.130 | 0.568 | 0.781 | 0.336 | 0.211 |
| $\sigma = 2$ | | | | | | |
| Estimated variance 1 | 0.003 | 0.005 | 0.011 | 0.008 | 0.008 | 0.008 |
| Estimated variance 2 | 0.003 | 0.006 | 0.011 | 0.008 | 0.008 | 0.008 |
| Empirical variance 1 | 0.003 | 0.087 | 0.054 | 0.019 | 0.045 | 0.339 |
| Empirical variance 2 | 0.003 | 0.084 | 0.054 | 0.019 | 0.045 | 0.351 |
| Bias 1 | −0.001 | −0.285 | −0.221 | −0.103 | −0.211 | −0.079 |
| Bias 2 | −0.002 | −0.279 | −0.222 | −0.106 | −0.212 | −0.086 |
| Coverage 1 | 0.946 | 0.032 | 0.412 | 0.686 | 0.185 | 0.219 |
| Coverage 2 | 0.939 | 0.040 | 0.392 | 0.721 | 0.162 | 0.215 |
Overall, the Weibull-Q MC-SIMEX estimator generally performs better than the Cox MC-SIMEX estimator for Weibull or other survival data with one or more misclassified covariates. It is robust to all scenarios considered in the simulation studies. Moreover, the Weibull-Q MC-SIMEX estimator performs better than the Weibull-L MC-SIMEX estimator and the Weibull-C MC-SIMEX estimator in terms of the bias-variance trade-off, which supports the theoretical findings in Section 3.2.
5: Real data analysis
5.1: Real data analysis with one misclassified covariate
We illustrate an application of the Weibull-Q MC-SIMEX method to a colon cancer dataset containing survival information on 1327 patients with Stage I to Stage III colon cancer who were surgically treated at Memorial Sloan Kettering Cancer Center (MSKCC) from January 1990 to December 2000. The dataset was obtained from the MSKCC registry. It provides information on variables such as age and on prognostic indicators such as T-stage, N-stage, lymphovascular invasion (LVI) and peri-neural invasion (PNI) 22. In colon cancer patients, it has been established that the pathological nodal (pathn) status of a patient is often misclassified and that the true pathological nodal status (a binary covariate) is a function of the number of nodes examined. Therefore, we estimated the true pathological nodal status of each patient using a statistical algorithm proposed by Gonen et al. 23 We then used this estimated true status to estimate the misclassification matrix, in which the rows denote the assigned pathological nodal status and the columns denote the true pathological nodal status. According to the estimated matrix, the assigned pathological nodal status (naïve covariate) is 100% specific and 70% sensitive.
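Under the row and column convention described above, a 2×2 misclassification matrix is fully determined by the sensitivity and specificity of the assigned status. The sketch below builds such a matrix from the rounded values of 70% and 100%; the exact estimated entries may differ, and the helper name `misclassification_matrix` is hypothetical.

```python
def misclassification_matrix(sensitivity, specificity):
    """Return P(assigned = row | true = column) for a binary covariate.

    Rows index the assigned status (0 = negative, 1 = positive);
    columns index the true status in the same order.
    """
    return [
        [specificity, 1.0 - sensitivity],   # assigned negative
        [1.0 - specificity, sensitivity],   # assigned positive
    ]

# Rounded values reported in the text (assumed exact for illustration).
pi = misclassification_matrix(sensitivity=0.70, specificity=1.00)
for row in pi:
    print(row)

# Each column is a conditional distribution, so it must sum to one.
column_sums = [pi[0][j] + pi[1][j] for j in range(2)]
```

With these values, truly negative patients are never assigned a positive status, while 30% of truly positive patients are assigned a negative status.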
Of the 1327 patients who underwent surgery, 12 were excluded from the dataset because of ambiguous values for the total number of nodes examined. The median follow-up time in the cohort was 55 months, with the date of surgery treated as the starting time point. Of the remaining 1315 individuals, 163 died of colon cancer and the rest were lost to follow-up or died of other causes. Disease-specific survival from colon cancer was the primary outcome of interest. The distribution of survival times was checked using the fitdistrplus 24 package in R 4.2 25. The Weibull distribution appeared to fit the data well based on preliminary visual examination of the Q-Q plot, P-P plot, cumulative distribution function plot and probability density plot of survival times (Figure 2). Median survival time was not reached (Figure 3).
Figure 2:

Figure showing probability density function, q-q plot, cumulative distribution function (CDF) and P-P plot assuming that the distribution of survival times follows Weibull distribution.
Figure 3:

Kaplan-Meier (KM) plot of survival times in the colon cancer dataset, with the x axis representing time in months and the y axis representing survival probability.
We regressed the survival time of individuals on the following covariates: age, T stage, LVI, PNI and assigned pathological nodal status, using an accelerated failure time (AFT) model. The assigned pathological nodal status is treated as the covariate subject to misclassification. The AFT model is log-linear: the log of the survival time is modeled as a linear combination of the covariates, each with its own coefficient, plus an error term multiplied by a scale parameter. We refer to the coefficient of the naively measured pathological nodal status as the naïve coefficient. We evaluated interaction terms in the model and found that none were significant at a type I error rate of 0.05. Therefore, we eliminated interaction terms from the model.
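For illustration, a log-linear AFT model with these covariates can be written as below. The symbols are assumed here for readability and are not necessarily the authors' notation: $T_i$ is the survival time of patient $i$, $\mathrm{pathn}_i^{*}$ is the assigned (possibly misclassified) pathological nodal status with naïve coefficient $\beta_5^{*}$, $\sigma$ is the scale parameter and $\varepsilon_i$ is the error term. Each covariate is shown as a single term, although a categorical covariate such as T stage would in practice contribute one coefficient per indicator variable.

```latex
\log(T_i) = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{Tstage}_i
          + \beta_3\,\mathrm{LVI}_i + \beta_4\,\mathrm{PNI}_i
          + \beta_5^{*}\,\mathrm{pathn}_i^{*} + \sigma\,\varepsilon_i
```

In a Weibull AFT model, $\varepsilon_i$ follows the standard (minimum) extreme value distribution, so that $T_i$ is Weibull distributed given the covariates.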
Due to the lack of proper validation data (the true pathological nodal status could only be estimated), we do not provide results for the true estimator. For the Weibull-Q MC-SIMEX, Weibull-L MC-SIMEX, Weibull-C MC-SIMEX and Cox MC-SIMEX procedures, we ran 50 replications of pseudo-dataset generation for each level of $\lambda$ (i.e., $B = 50$). The grid of $\lambda$ values used in the MC-SIMEX procedure was (0.5, 1, 1.5, 2). The resulting estimates are shown in Table 6.
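The extrapolation step of an MC-SIMEX procedure with a quadratic extrapolant can be sketched as follows. The sketch assumes that the coefficient estimates have already been averaged over the pseudo-datasets at each error-inflation level λ, that the naïve fit at λ = 0 is included alongside the grid (0.5, 1, 1.5, 2), and that the corrected estimate is read off at λ = −1. The per-level averages below are hypothetical, as are the function names; the simulation and model-fitting steps are omitted.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2; returns (a, b, c)."""
    # Normal equations (X'X) coef = X'y for the basis 1, x, x^2.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the 3x4 system.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * w for v, w in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def mc_simex_extrapolate(lambdas, estimates, target=-1.0):
    """Extrapolate the fitted quadratic back to the no-error level."""
    a, b, c = fit_quadratic(lambdas, estimates)
    return a + b * target + c * target ** 2

# lambda = 0 is the naive fit; the remaining levels follow the grid in the text.
lambdas = [0.0, 0.5, 1.0, 1.5, 2.0]
# Hypothetical averages of the coefficient over the B pseudo-datasets per level.
averaged_estimates = [-0.50, -0.44, -0.39, -0.35, -0.32]

corrected = mc_simex_extrapolate(lambdas, averaged_estimates)
print(f"naive: {averaged_estimates[0]:.3f}, extrapolated: {corrected:.3f}")
```

Replacing `fit_quadratic` with a linear or cubic fit would give the analogous linear and cubic extrapolants, which is why the choice of extrapolation function can change the corrected estimate noticeably.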
Table 6:
Table showing the estimates of the coefficient associated with pathological nodal status obtained by fitting the AFT model to colon cancer data.
| Type of estimator | Estimate of the coefficient associated with pathological nodal status | 95% CI |
|---|---|---|
| Naïve estimator | −0.542 | (−0.839, −0.244) |
| Cox MC-SIMEX | 0.851 | (0.433, 1.269) |
| Weibull-Q MC-SIMEX | −0.705 | (−1.035, −0.375) |
| Weibull-L MC-SIMEX | −0.573 | (−0.903, −0.243) |
| Weibull-C MC-SIMEX | −0.847 | (−1.212, −0.482) |
All the estimators shown in Table 6 indicate a detrimental effect of a positive pathological nodal status on survival. Based on the Weibull-Q MC-SIMEX estimate, patients with a negative pathological nodal status have a survival time that is exp(0.705) ≈ 2.02 times that of patients with a positive pathological nodal status. Based on the Cox MC-SIMEX estimate, the hazard ratio for patients with a positive pathological nodal status, relative to those with a negative status, is exp(0.851) ≈ 2.34.
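The interpretation of the two coefficients follows from exponentiating them. The sketch below does this arithmetic for the Table 6 values, assuming the covariate is coded 1 for a positive pathological nodal status; the variable names are hypothetical.

```python
import math

# Coefficients and 95% CIs for pathological nodal status, from Table 6.
beta_aft, aft_ci = -0.705, (-1.035, -0.375)   # Weibull-Q MC-SIMEX, log-time scale
beta_cox, cox_ci = 0.851, (0.433, 1.269)      # Cox MC-SIMEX, log-hazard scale

# AFT: a positive status multiplies survival time by exp(beta_aft), so the
# ratio for negative versus positive status is the reciprocal, exp(-beta_aft).
time_ratio = math.exp(-beta_aft)
time_ratio_ci = (math.exp(-aft_ci[1]), math.exp(-aft_ci[0]))

# Cox: exp(beta_cox) is the hazard ratio for positive versus negative status.
hazard_ratio = math.exp(beta_cox)
hazard_ratio_ci = (math.exp(cox_ci[0]), math.exp(cox_ci[1]))

print(f"time ratio (negative vs positive): {time_ratio:.2f}")
print(f"hazard ratio (positive vs negative): {hazard_ratio:.2f}")
```

The opposite signs of the AFT and Cox coefficients are therefore consistent: a negative coefficient on the log-time scale and a positive coefficient on the log-hazard scale both indicate shorter survival for patients with a positive status.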
Although the Weibull distribution fits the data well, as shown in Figure 2, the distributional assumption may still be misspecified and the survival times may follow another distribution. We therefore used common statistical criteria (AIC, BIC and log-likelihood) to compare all the distributions considered in Section 4.3. The results are shown in Table 7.
Table 7:
Table showing the fit statistics of the AFT regression model under different distributional assumptions using AIC, BIC and log-likelihood.
| Statistic | Value |
|---|---|
| Weibull AIC (smaller is better) | 2215.389 |
| Weibull BIC (smaller is better) | 2262.017 |
| Weibull logLikelihood | −1098.695 |
| Exponential AIC (smaller is better) | 2221.309 |
| Exponential BIC (smaller is better) | 2262.756 |
| Exponential logLikelihood | −1102.655 |
| Log Normal AIC (smaller is better) | 2205.09 |
| Log Normal BIC (smaller is better) | 2251.717 |
| Log Normal logLikelihood | −1093.545 |
| Log logistic AIC (smaller is better) | 2206.59 |
| Log logistic BIC (smaller is better) | 2253.218 |
| Log logistic logLikelihood | −1094.295 |
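The criteria in Table 7 are linked by the standard definitions AIC = 2k − 2 log L and BIC = k log(n) − 2 log L. The sketch below recomputes them from the reported log-likelihoods. The parameter counts `k` and the sample size `n_obs` are not stated in the text; they are back-calculated assumptions that reproduce the reported values to rounding.

```python
import math

def aic(loglik, k):
    """Akaike information criterion for k estimated parameters."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """Bayesian information criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * loglik

# (log-likelihood from Table 7, assumed number of parameters k)
fits = {
    "Weibull": (-1098.695, 9),
    "Exponential": (-1102.655, 8),   # no scale parameter, so one fewer
    "Log normal": (-1093.545, 9),
    "Log logistic": (-1094.295, 9),
}
n_obs = 1314  # assumption: inferred from the reported BIC values

results = {
    name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in fits.items()
}
best_aic = min(a for a, _ in results.values())
for name, (a, b) in results.items():
    print(f"{name:12s} AIC={a:.3f} BIC={b:.3f} delta AIC={a - best_aic:.3f}")
```

Printing the difference from the smallest AIC makes the comparison explicit: the gaps between the candidate distributions are small relative to the AIC values themselves.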
As there are only minor differences in AIC, BIC and log-likelihood among the distributions considered in Section 4.3, any of these distributions is justifiable for the data. Under such conditions, the simulations in Section 4.3 show that the Weibull-Q MC-SIMEX estimator provides more reliable parameter estimates than the Cox MC-SIMEX estimator, except for log-normally distributed data with lower variance, for which the Cox MC-SIMEX estimator has less bias. We estimated the variance of the log of survival times in the current dataset to be 0.80. Because this variance is close to that of a standard log-normal distribution, it is fair to say that if these data follow a log-normal distribution, the Cox MC-SIMEX estimator may have smaller bias than the Weibull-Q MC-SIMEX estimator, as shown in the simulation study in Section 4.3. However, since the Weibull-Q MC-SIMEX estimator generally performs well for all distributions, we recommend using the results from the Weibull-Q MC-SIMEX method to draw conclusions for these data. Because the log-normal distribution fits the data best according to the statistics in Table 7, a log-normal MC-SIMEX method may provide more accurate results. However, simulations are needed to compare the performance of the log-normal MC-SIMEX method with that of the other methods. Future research could investigate the performance of the MC-SIMEX method under different distributions and thereby yield more guidance for real data analysis.
5.2: Real data analysis with two misclassified covariates
In this section, we illustrate the application of the Weibull-Q MC-SIMEX method to the colon cancer data considered in Section 5.1 with two misclassified covariates. To examine the performance of the proposed estimator in this situation, we simulated a misclassified version of the LVI variable (mLVI) using a misclassification matrix and used it as a covariate in the following model.
| (4) |
Here, the assigned pathological nodal status and mLVI, the misclassified version of the LVI variable, each have an associated coefficient. All other notation is the same as in Section 5.1. Table 8 shows the coefficient estimates obtained from model (4).
Table 8:
Table showing the estimates of the coefficient associated with pathological nodal status and mLVI.
| Type of estimator | Covariate | Coefficient estimate | 95% CI |
|---|---|---|---|
| Naïve estimator | Pathological nodal status | −0.601 | (−0.891, −0.31) |
| Naïve estimator | mLVI | −0.133 | (−0.441, 0.176) |
| Cox MC-SIMEX | Pathological nodal status | 0.907 | (0.498, 1.316) |
| Cox MC-SIMEX | mLVI | 0.132 | (−0.294, 0.557) |
| Weibull-Q MC-SIMEX | Pathological nodal status | −0.752 | (−1.108, −0.396) |
| Weibull-Q MC-SIMEX | mLVI | −0.104 | (−0.459, 0.25) |
| Weibull-L MC-SIMEX | Pathological nodal status | −0.631 | (−0.987, −0.275) |
| Weibull-L MC-SIMEX | mLVI | −0.136 | (−0.49, 0.219) |
| Weibull-C MC-SIMEX | Pathological nodal status | −0.752 | (−1.107, −0.396) |
| Weibull-C MC-SIMEX | mLVI | −0.123 | (−0.477, 0.232) |
The results show that the coefficient estimates for pathological nodal status are very similar to those in Section 5.1. With regard to the coefficient estimates for mLVI, the simulation studies indicate that the Weibull-Q MC-SIMEX estimator is the most reliable method. Its 95% confidence interval includes zero, so we conclude that LVI is not significantly associated with the survival time of colon cancer patients. Although the other methods give estimates that differ from that of the Weibull-Q MC-SIMEX estimator, they reach the same conclusion.
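The significance statements for Table 8 can be read directly from the confidence intervals: at the 5% level, a coefficient is significant when its 95% CI excludes zero. The short check below applies this rule to the tabulated intervals; the short label `pathn` and the helper name `excludes_zero` are used here only for illustration.

```python
# (estimator, covariate, estimate, (lower, upper)) transcribed from Table 8.
table8 = [
    ("Naive", "pathn", -0.601, (-0.891, -0.310)),
    ("Naive", "mLVI", -0.133, (-0.441, 0.176)),
    ("Cox MC-SIMEX", "pathn", 0.907, (0.498, 1.316)),
    ("Cox MC-SIMEX", "mLVI", 0.132, (-0.294, 0.557)),
    ("Weibull-Q MC-SIMEX", "pathn", -0.752, (-1.108, -0.396)),
    ("Weibull-Q MC-SIMEX", "mLVI", -0.104, (-0.459, 0.250)),
    ("Weibull-L MC-SIMEX", "pathn", -0.631, (-0.987, -0.275)),
    ("Weibull-L MC-SIMEX", "mLVI", -0.136, (-0.490, 0.219)),
    ("Weibull-C MC-SIMEX", "pathn", -0.752, (-1.107, -0.396)),
    ("Weibull-C MC-SIMEX", "mLVI", -0.123, (-0.477, 0.232)),
]

def excludes_zero(ci):
    """True when both interval limits lie on the same side of zero."""
    lower, upper = ci
    return lower > 0 or upper < 0

significant = {(est, cov): excludes_zero(ci) for est, cov, _, ci in table8}
for (est, cov), flag in significant.items():
    print(f"{est:20s} {cov:6s} significant at 5%: {flag}")
```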
6: Discussion
In this article, we investigate the performance of the Weibull-Q MC-SIMEX estimator in an AFT model in which the survival times follow a Weibull distribution and are subject to right censoring. We demonstrate that, under varying scenarios of censoring and misclassification, the Weibull-Q MC-SIMEX estimator is robust and performs better than the naïve estimator and the existing Cox MC-SIMEX estimator.
Additionally, we provide a proof of the relationship between the parameter estimates obtained at each level of error inflation in the presence of a correctly measured confounder, and demonstrate graphically and through simulation studies that the quadratic extrapolation function offers the best fit in a Weibull AFT model. This finding highlights the importance of the choice of extrapolation function when using the MC-SIMEX method. Colon cancer is one of the leading causes of death in the US. It is therefore important to correct for potential misclassification in the covariates when estimating hazards and time to survival among patients with risk factors. Through this study, we show that the Weibull-Q MC-SIMEX estimator can be used to obtain accurate predictions of time to survival by correcting for misclassification.
Potential future directions include investigating the performance of nonparametric extrapolation techniques such as splines in the MC-SIMEX method, application of the procedure to situations where the correctly measured confounder and the misclassified covariate interact with each other, application of the procedure to a competing risks model and extension of the method to other fields such as mediation analysis.
Acknowledgements
We are grateful to Dr. Mithat Gonen, Attending Biostatistician and Chief of Biostatistics Service, MSKCC, for providing us with the colon cancer data.
Appendix:
Derivation of equation 2:
Derivation of :
We define:
Then,
When ,
Derivation of :
Along similar lines, as previously stated in the derivation section for ,
Substituting for the value of , we have
References:
- 1. Dalen I, Buonaccorsi JP, Laake P, Hjartåker A, Thoresen M. Regression analysis with categorized regression calibrated exposure: some interesting findings. Emerging Themes in Epidemiology. Jul 4 2006;3(1):6. doi: 10.1186/1742-7622-3-6
- 2. Yi GY, Ma Y, Spiegelman D, Carroll RJ. Functional and Structural Methods with Mixed Measurement Error and Misclassification in Covariates. J Am Stat Assoc. Jun 1 2015;110(510):681–696. doi: 10.1080/01621459.2014.922777
- 3. Küchenhoff H, Mwalili SM, Lesaffre E. A general method for dealing with misclassification in regression: The misclassification SIMEX. Biometrics. 2006;62(1):85–96.
- 4. Kauhanen L, Lakka HM, Lynch JW, Kauhanen J. Social disadvantages in childhood and risk of all-cause death and cardiovascular disease in later life: a comparison of historical and retrospective childhood information. Int J Epidemiol. Aug 2006;35(4):962–8. doi: 10.1093/ije/dyl046
- 5. Buonaccorsi JP, Laake P, Veierød MB. On the effect of misclassification on bias of perfectly measured covariates in regression. Biometrics. Sep 2005;61(3):831–6. doi: 10.1111/j.1541-0420.2005.00336.x
- 6. Ahrens K, Lash TL, Louik C, Mitchell AA, Werler MM. Correcting for exposure misclassification using survival analysis with a time-varying exposure. Ann Epidemiol. Nov 2012;22(11):799–806. doi: 10.1016/j.annepidem.2012.09.003
- 7. Zucker DM, Spiegelman D. Inference for the proportional hazards model with misclassified discrete-valued covariates. Biometrics. Jun 2004;60(2):324–34. doi: 10.1111/j.0006-341X.2004.00176.x
- 8. Zucker DM. A Pseudo–Partial Likelihood Method for Semiparametric Survival Regression With Covariate Errors. Journal of the American Statistical Association. Dec 1 2005;100(472):1264–1277. doi: 10.1198/016214505000000538
- 9. Zucker DM, Spiegelman D. Corrected score estimation in the proportional hazards model with misclassified discrete covariates. Statistics in Medicine. 2008;27(11):1911–1933. doi: 10.1002/sim.3159
- 10. Spiegelman D, Carroll RJ, Kipnis V. Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument. Stat Med. Jan 15 2001;20(1):139–160. doi: 10.1002/1097-0258(20010115)20:1<139::aid-sim644>3.0.co;2-k
- 11. Bang H, Chiu Y-L, Kaufman JS, Patel MD, Heiss G, Rose KM. Bias Correction Methods for Misclassified Covariates in the Cox Model: comparison of five correction methods by simulation and data analysis. Journal of Statistical Theory and Practice. 2013;7(2):381–400.
- 12. Cole SR, Chu H, Greenland S. Multiple-imputation for measurement-error correction. International Journal of Epidemiology. 2006;35(4):1074–1081. doi: 10.1093/ije/dyl097
- 13. Zhou H, Pepe MS. Auxiliary Covariate Data in Failure Time Regression. Biometrika. 1995;82(1):139–149. doi: 10.2307/2337634
- 14. Slate EH, Bandyopadhyay D. An investigation of the MC-SIMEX method with application to measurement error in periodontal outcomes. Stat Med. Dec 10 2009;28(28):3523–38. doi: 10.1002/sim.3656
- 15. Cook JR, Stefanski LA. Simulation-extrapolation estimation in parametric measurement error models. Journal of the American Statistical Association. 1994;89(428):1314–1328.
- 16. Sevilimedu V, Yu L, Samawi H, Rochani H. Application of the Misclassification Simulation Extrapolation Procedure to Log-Logistic Accelerated Failure Time Models in Survival Analysis. Journal of Statistical Theory and Practice. Nov 30 2018;13(1):24. doi: 10.1007/s42519-018-0024-5
- 17. Liu E, Lim K. Using the Weibull accelerated failure time regression model to predict time to health events. bioRxiv. 2018:362186. doi: 10.1101/362186
- 18. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition. CRC Press; 2006.
- 19. Stefanski LA, Cook JR. Simulation-extrapolation: the measurement error jackknife. Journal of the American Statistical Association. 1995;90(432):1247–1256.
- 20. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press; 2006.
- 21. He W, Yi GY, Xiong J. Accelerated failure time models with covariates subject to measurement error. Statistics in Medicine. 2007;26(26):4817–4832.
- 22. Weiser MR, Landmann RG, Kattan MW, et al. Individualized prediction of colon cancer recurrence using a nomogram. J Clin Oncol. Jan 20 2008;26(3):380–5. doi: 10.1200/jco.2007.14.1291
- 23. Gönen M, Schrag D, Weiser MR. Nodal Staging Score: A Tool to Assess Adequate Staging of Node-Negative Colon Cancer. Journal of Clinical Oncology. Dec 20 2009;27(36):6166–6171. doi: 10.1200/JCO.2009.23.7958
- 24. Delignette-Muller ML, Dutang C. fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software. Mar 20 2015;64(4):1–34. doi: 10.18637/jss.v064.i04
- 25. R Foundation for Statistical Computing; 2022.