Abstract
We consider estimation of Cox regression when some covariates are subject to missingness and additional information (including the observed event time, the censoring indicator and fully observed covariates) may be predictive of the missing covariates. We propose to use two working regression models: one for predicting the missing covariates and the other for predicting the missing probabilities. For each missing covariate observation, these two working models are used to define a nearest neighbor imputing set. This set is then used to non-parametrically impute covariate values for the missing observation. Upon completion of imputation, Cox regression is performed on the multiply imputed datasets to estimate the regression coefficients. In a simulation study, we compare the nonparametric multiple imputation approach with the augmented inverse probability weighted (AIPW) method, which directly incorporates the two working models into estimation of Cox regression, and the predictive mean matching (PMM) imputation method. We show that all approaches can reduce bias due to a non-ignorable missing mechanism. The proposed nonparametric imputation method is robust to misspecification of either one of the two working models and robust to misspecification of the link function of the two working models. In contrast, the PMM method is sensitive to misspecification of the covariates included in imputation. The AIPW method is sensitive to the selection probability. We apply the approaches to a breast cancer dataset from the Surveillance, Epidemiology and End Results (SEER) Program.
Keywords: Augmented inverse probability weighted method, Cox regression, missing covariates, multiple imputation, predictive mean matching
1. Introduction
For survival time data with covariates, Cox regression is often used to specify the relationship between survival time and covariates.1 For time-independent covariates, Cox regression has the proportional hazards property. It estimates the regression coefficients of the model using the partial likelihood function without specifying the baseline hazard function.2 The estimators of the regression coefficients have been shown to be consistent, asymptotically normally distributed and semi-parametrically efficient.3 However, in many situations, some of the covariates are not fully observed. Missing covariates could compromise the asymptotic properties of the estimators if missing data are not accounted for in estimation. Specifically, it has been shown that the estimators of the regression coefficients derived from the subjects with all of the covariates observed (i.e. complete-case analysis) not only lose efficiency, but may also be biased when missingness depends on the survival outcome (i.e. survival time and censoring indicator).4 When missingness depends on the survival outcome and some fully observed covariates, the missing mechanism is considered missing at random (MAR).5 For survival outcome data, MAR can be further classified into two scenarios: failure-ignorable MAR (i.e. missingness does not depend on failure time) and censoring-ignorable MAR (i.e. missingness does not depend on censoring time but may depend on failure time).6 When missingness is failure-ignorable MAR, complete-case analysis can still produce valid regression coefficient estimates. However, when missingness is censoring-ignorable MAR, complete-case analysis may produce biased regression coefficient estimates.
Several approaches have been proposed to deal with missing covariates in Cox regression. Of the existing approaches, the augmented inverse probability weighted (AIPW) method7,8 has been shown to have a double robustness property. In this method, the weight is derived from a fully specified model for the missing status conditional on the observed data, and an augmentation term, derived from a fully specified model for the missing covariate conditional on the observed data, is added to the estimation to correct the potential bias. Specifically, the AIPW method uses two fully specified parametric models (one for the missing covariate and the other for the missing probability) to account for missing covariates while estimating the regression coefficients of the Cox regression model. Double robustness means that at least one of the two models has to be correctly specified, including the distribution and link function for the missing covariate and the missing status, respectively. Of the two models, the model for the missing covariate is more important since in a sense it is directly associated with estimation in Cox regression. However, it is more challenging to correctly specify the model for the missing covariate than the model for the missing probability based on the observed data. Because of the double robustness property, the AIPW method is popular among researchers who do not want to rely solely on the model for the conditional distribution of the missing covariate given the observed data in Cox regression with missing covariates. To weaken the reliance on the parametric assumptions behind the two models, nonparametric regression has been used to estimate the two models without fully specifying the relationship between the missing covariates and the observed data.9 As the dimensionality of the observed data increases, however, it becomes extremely difficult to use nonparametric regression to estimate the two models.
In addition, the AIPW method is also sensitive to misspecification of the missing probability model, because even mild lack of fit in outlying regions of the covariate space where the missing probability is extreme (i.e. very close to 1) translates into large errors in the weights.5,10,11
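The sensitivity of inverse probability weights to extreme missing probabilities can be illustrated numerically. The following is a minimal sketch, not part of the original analysis: the function names and the probabilities used are illustrative assumptions. It compares the relative error in the weight 1/π caused by the same absolute error (0.01) in the estimated selection probability π, once at a moderate missing probability and once at an extreme one.

```python
def ipw_weight(selection_prob):
    """Inverse probability weight for a subject whose covariate is observed."""
    if not 0.0 < selection_prob <= 1.0:
        raise ValueError("selection probability must lie in (0, 1]")
    return 1.0 / selection_prob


def relative_weight_error(true_prob, estimated_prob):
    """Relative error in the weight caused by a misestimated selection probability."""
    true_w = ipw_weight(true_prob)
    est_w = ipw_weight(estimated_prob)
    return abs(est_w - true_w) / true_w


# Hypothetical values: the same absolute error of 0.01 in the selection probability.
moderate = relative_weight_error(0.50, 0.51)  # missing probability 0.50
extreme = relative_weight_error(0.02, 0.03)   # missing probability 0.98
print(f"moderate: {moderate:.3f}, extreme: {extreme:.3f}")
```

Under these assumed values the relative error in the weight is about 2% in the moderate case and about 33% in the extreme case, which is the instability the text describes.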
We previously developed a nonparametric multiple imputation (MI) approach to deal with missing data in a situation without censored data.11 The approach indirectly uses two working models to recover information for missing data observations. Specifically, we use two working regression models, one for predicting the missing covariate values and one for predicting the missing probabilities. The parameter estimates from these two working models are then used to give two predictive scores for each subject, defined as the linear combination of the covariates in the corresponding model. The method then selects an imputing set of observations for each missing data observation, which consists of subjects who have their data fully observed and have similar predictive scores as the subject with missing data. Then the missing data value is randomly drawn from this imputing set. The idea is similar to predictive mean matching12 and propensity score matching13 in the missing data literature. In a situation with missing outcome data, we have shown that this nonparametric multiple imputation approach can generate a consistent mean estimator. In this paper, we generalize the nonparametric multiple imputation approach,11 in which no statistical model is directly used to perform multiple imputation, to handle estimation of Cox regression with missing covariates to weaken the reliance on the two models and produce stable regression coefficient estimates even if the missing probability is extreme. Specifically, we propose to use two working regression models, one for predicting the missing covariates and one for predicting the missing probabilities, to derive two predictive scores to select an imputing set for each missing covariate observation. 
It has been shown that the survival outcome data (specifically the cumulative baseline hazard and the censoring indicator) need to be included in predicting the missing covariates.14 In addition, the survival outcome data can also be included as covariates in the regression model for the missing probabilities to account for potentially censoring-ignorable MAR. The two working regression models are only used to derive two predictive scores to select an imputing set. Hence, the approach can easily handle the multi-dimensional structure of the observed data and is expected to be less affected by mis-specification of the two working models (especially mis-specification of the missing probability model) than the AIPW method. Due to its simplicity in estimation and its availability in statistical software, the MI method based solely on the predictive model of the missing covariates is widely used. Qi15 compared the AIPW method with the MI method using predictive mean matching (PMM) based on multiple imputation by chained equations (MICE) in the estimation of Cox regression with missing covariates and concluded that the PMM method is sensitive to misspecification of the predictive model of the missing covariates. In this paper, we not only study the performance of the proposed multiple imputation approach but also compare its performance with the AIPW and PMM methods.
This paper is organized as follows. In Section 2, we review the complete-case analysis and the AIPW method. In Section 3, we describe the proposed multiple imputation method and the associated properties. In Section 4, we apply the techniques to data from a breast cancer study. In Section 5, we give results from a simulation study. A discussion follows in Section 6.
2. Review of methods
In this section, we begin by describing the setting: estimation of Cox regression with time-independent covariates, one of which is subject to missingness. Let T denote the failure time, C denote the censoring time, Y = min(T, C) denote the observed time, δt = I(T ≤ C) denote the censoring indicator and N(t) = I(Y ≤ t, δt = 1) denote the counting process. Assume T has a hazard function of λ(t | X, Z) = λ0(t) exp(βx X + βz Z), where λ0(t) is an unspecified baseline hazard function, X is subject to missingness and Z is fully observed. Let δx denote the missing indicator for X (i.e. δx = 1 if X is observed; otherwise, 0) and π = Pr(δx = 1 | Y, δt, Z) denote the selection probability. We assume that T and C are independent conditional on X and Z, that X is missing at random (i.e. Pr(δx = 1 | Y, δt, X, Z) = Pr(δx = 1 | Y, δt, Z)), and that there is a random sample of n subjects.
2.1. Complete-case analysis
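The complete-case (CC) analysis of β = (βx, βz) is based on the partial likelihood estimator using observations that have X observed. Let $W_i = (X_i, Z_i)^T$ and let $I(Y_i \ge t)$ denote the at-risk indicator of subject i at time t. The CC analysis involves solving the following estimating equations

$$U_{CC}(\beta) = \sum_{i=1}^{n} \int_0^{\infty} \delta_{xi} \left\{ W_i - \frac{S^{(1)}(\beta, t)}{S^{(0)}(\beta, t)} \right\} dN_i(t) = 0,$$

where $S^{(m)}(\beta, t) = n^{-1} \sum_{j=1}^{n} \delta_{xj}\, I(Y_j \ge t)\, W_j^{\otimes m} \exp(\beta^T W_j)$ for m = 0, 1, with $W^{\otimes 0} = 1$ and $W^{\otimes 1} = W$. The CC analysis is easy to implement and is consistent when the missingness depends only on Z. However, it loses efficiency due to discarding data from incomplete observations, especially when the missing rate is greater than 25%,16 and is inconsistent when missingness depends on T or δt.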
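To make the complete-case estimating function concrete, the sketch below evaluates the standard Cox partial-likelihood score using only subjects with X observed. It is an illustration under stated assumptions rather than the authors' code: the record layout `(y, delta_t, x, z, delta_x)` and the function name `cc_score` are hypothetical, ties are ignored, and no root-finding is performed.

```python
import math


def cc_score(beta, records):
    """Partial-likelihood score for (beta_x, beta_z) using complete cases only.

    Each record is (y, delta_t, x, z, delta_x); x may be None when delta_x == 0.
    """
    # Keep only subjects whose X is observed (delta_x == 1).
    complete = [(y, d, (x, z)) for y, d, x, z, dx in records if dx == 1]
    score = [0.0, 0.0]
    for y_i, d_i, w_i in complete:
        if d_i != 1:
            continue  # censored subjects contribute only through risk sets
        # Risk set at the event time: complete cases still under observation.
        risk_set = [w for y, _, w in complete if y >= y_i]
        weights = [math.exp(beta[0] * w[0] + beta[1] * w[1]) for w in risk_set]
        s0 = sum(weights)
        for k in range(2):
            s1_k = sum(wt * w[k] for wt, w in zip(weights, risk_set))
            score[k] += w_i[k] - s1_k / s0
    return score


# Hypothetical toy data: the second subject has X missing and is dropped.
toy = [
    (1.0, 1, 1.0, 0.0, 1),
    (2.0, 1, None, 1.0, 0),
    (3.0, 0, 0.0, 1.0, 1),
    (4.0, 1, 0.0, 0.0, 1),
]
score_at_zero = cc_score((0.0, 0.0), toy)
```

Solving `cc_score(beta, records) == 0` with any numerical root finder would give the CC estimate; the subject with missing X never enters a risk set, which is the source of both the efficiency loss and the potential bias noted above.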
2.2. AIPW method
The AIPW method was first proposed by Robins et al.7 to modify the CC analysis to produce consistent estimators of β and furthermore improve efficiency of the CC analysis. The AIPW method has been studied and further developed by a few groups for various scenarios. For Cox regression with a missing covariate, it involves solving the estimating equations8,9
Where for m = 0,1 and
Based on the above expression, it can be seen that the conditional expectation in Ai(β, πi) depends on the baseline cumulative hazard and the conditional distribution of X given the observed data (Y, δt, Z). The EM algorithm can be used to derive the AIPW estimates.8 To perform the EM algorithm, the conditional distribution of X given the observed data and the selection probability π need to be estimated. It has been shown that if one of them is estimated correctly, the AIPW estimator is consistent (the so-called double robustness property). Often two parametric working models are used to estimate the conditional distribution and the selection probability π, respectively, and are then directly incorporated into the AIPW estimator. To relax the reliance on the distributional assumptions, nonparametric techniques have been proposed to estimate the conditional distribution and the selection probability. However, as the number of fully observed covariates (i.e. the dimension of Z) increases, it becomes difficult to estimate the conditional distribution and the selection probability nonparametrically. In this paper, we mainly focus on the performance of the AIPW estimator where two parametric working models are used to estimate the conditional distribution and the selection probability, respectively, and one of the two models is mis-specified. The estimate of the standard error for AIPW is derived from 500 bootstrap samples.
3. Nonparametric multiple imputation
Instead of directly incorporating the working models into estimation, we propose to use two working regression models, one for predicting the missing covariates and one for predicting the missing probabilities, to derive two predictive scores to select an imputing set for each missing covariate observation. The two working regression models are only used to derive two predictive scores to select an imputing set. Hence, the approach is expected to be less affected by the mis-specification of the two working models. To conduct nonparametric multiple imputation, for each missing covariate observation we seek an imputing set consisting of subjects who have similar predictive scores as the subject with missing covariate observation. We describe the imputation procedures in detail below.
3.1. Imputation procedures for missing covariate X
3.1.1. Step 1: Estimate the two predictive scores on a Bootstrap sample
To define each imputing set, we first reduce the observed survival data and Z to two scalar indices (predictive scores), which provide an indicator of an individual’s value of X and chance of having missing X. White14 showed that in Cox regression with missing covariates the conditional distribution of X given the observed data can be exactly specified using the cumulative baseline hazard H0(t), δt and Z only in certain cases. Specifically, when both X and Z are binary variables, the conditional distribution of X exactly follows a Bernoulli distribution whose logit is linear in δt, H0(t) evaluated at the observed time, Z and the interaction between H0(t) and Z. When there is no Z, the conditional distribution reduces to a logistic model in δt and H0(t) only. In other cases, only an approximate conditional distribution of X can be obtained. The approximate conditional distribution depends on the cumulative baseline hazard H0(t), the censoring indicator δt and the fully observed covariate Z.14 Hence, all of them will be included in the working regression model for predicting X. To account for potential censoring-ignorable MAR and misspecification of the conditional distribution of X, we will include the survival outcome data (i.e. Y and δt), as well as Z, in the working regression model for predicting the missing probabilities. This strategy summarizes the multi-dimensional structure of the observed survival data and Z into a two-dimensional summary. The hope is that this two-dimensional summary contains most, if not all, of the information about the value of missing X and missingness.
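The exact result for a binary X can be sketched in a few lines. The derivation below is an illustration, not a reproduction of the cited work: it assumes the Cox model with hazard λ0(t) exp(βx X + βz Z), that censoring does not depend on X given Z (so the censoring terms cancel), and it introduces the symbols ζ0 and ζ1 purely for illustration.

```latex
% Sketch only: binary X, Cox hazard \lambda_0(t)\exp(\beta_x X + \beta_z Z).
% Assumption: censoring does not depend on X given Z, so its terms cancel.
\begin{align*}
\Pr(X = x \mid Y, \delta_t, Z)
  &\propto \Pr(X = x \mid Z)\,
     \{\lambda_0(Y)\, e^{\beta_x x + \beta_z Z}\}^{\delta_t}
     \exp\{-H_0(Y)\, e^{\beta_x x + \beta_z Z}\}, \\
\operatorname{logit} \Pr(X = 1 \mid Y, \delta_t, Z)
  &= \operatorname{logit} \Pr(X = 1 \mid Z) + \beta_x \delta_t
     - (e^{\beta_x} - 1)\, e^{\beta_z Z}\, H_0(Y).
\end{align*}
% For binary Z: e^{\beta_z Z} = 1 + (e^{\beta_z} - 1) Z and
% \operatorname{logit}\Pr(X = 1 \mid Z) = \zeta_0 + \zeta_1 Z  (illustrative symbols), so
\begin{align*}
\operatorname{logit} \Pr(X = 1 \mid Y, \delta_t, Z)
  &= \zeta_0 + \zeta_1 Z + \beta_x \delta_t
     - (e^{\beta_x} - 1) H_0(Y)
     - (e^{\beta_x} - 1)(e^{\beta_z} - 1) H_0(Y) Z.
\end{align*}
```

The logit is therefore exactly linear in Z, δt, H0(Y) and the product H0(Y) × Z; dropping Z leaves a model in δt and H0(Y) alone. For continuous Z the factor exp(βz Z) is no longer linear in Z, which is why only an approximation is available in general.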
Specifically, a linear/generalized linear model with H0(t), δt and Z as the covariates can be fitted to the complete cases to derive a predictive score for X. This score summarizes the relationship between X and H0(t), δt and Z. A logistic regression model with the observed Y, δt and Z as the covariates will be fitted to the missing indicator data (i.e. δx) to derive a predictive score for missingness. This score summarizes the relationship between missingness and Y, δt and Z. The two models will be fitted on a nonparametric bootstrap sample17 of the original dataset to incorporate the uncertainty of the parameter estimates from the working models. This step results in proper multiple imputation (Nielsen18 and references therein). More specifically, let B denote the bootstrap sample. The two working models are fitted to the bootstrap sample to calculate two predictive scores, Sx(B) and Sδx(B), for each individual in the bootstrap sample. We further standardize these scores by subtracting their sample mean and dividing by their standard deviation, and denote the standardized scores by Sxc(B) and Sδxc(B), respectively. Combinations of these two predictive scores will be studied to see to what extent a double robustness property19 for model mis-specification can be established and whether a robustness property for link function mis-specification can be established for the nonparametric multiple imputation method.
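The mechanics of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names, the toy covariate rows and the coefficient values are all hypothetical, and the fitting of the two working models (which any regression routine could perform on the bootstrap sample) is replaced by supplied coefficients.

```python
import random
import statistics


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (nonparametric bootstrap)."""
    return [rng.randrange(n) for _ in range(n)]


def linear_score(coefs, covariates):
    """Linear predictor: intercept plus the sum of coefficient * covariate."""
    intercept, *slopes = coefs
    return intercept + sum(b * v for b, v in zip(slopes, covariates))


def standardize(scores, reference_scores):
    """Center and scale scores by the mean and SD of the reference (bootstrap) scores."""
    mean = statistics.fmean(reference_scores)
    sd = statistics.stdev(reference_scores)
    return [(s - mean) / sd for s in scores]


# Hypothetical rows of (H0(t), delta_t, Z) for six subjects.
rows = [(0.2, 1, 0.1), (0.5, 0, 0.4), (0.9, 1, 0.7),
        (1.3, 1, 0.2), (0.4, 0, 0.9), (0.7, 1, 0.5)]
rng = random.Random(2024)
boot = [rows[i] for i in bootstrap_indices(len(rows), rng)]

# Assumed coefficients standing in for a working model fitted on `boot`.
x_model_coefs = (-0.5, 0.8, 0.3, 1.2)
boot_scores = [linear_score(x_model_coefs, r) for r in boot]
orig_scores = [linear_score(x_model_coefs, r) for r in rows]

# Original-data scores are standardized with the bootstrap mean and SD.
orig_standardized = standardize(orig_scores, boot_scores)
```

The same three functions would be applied a second time with the coefficients of the missingness model and the covariates (Y, δt, Z) to obtain the second standardized score.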
3.1.2. Step 2: Define the imputing set
For subject j with missing X in the original dataset, two predictive scores are derived using the regression coefficient estimates obtained from the bootstrap sample (i.e. Sx(j) and Sδx(j)) and then standardized by subtracting the sample mean of the corresponding bootstrap sample predictive scores and dividing by the standard deviation of the corresponding bootstrap sample predictive scores, respectively (denoted as Sxc(j) and Sδxc(j)). The distance between subject j in the original dataset and subject k in the bootstrap sample is then defined as d(j, k) = √{w1 [Sxc(j) − Sxc(k)]² + w2 [Sδxc(j) − Sδxc(k)]²}, where Sxc(k) and Sδxc(k) are the standardized scores of subject k and w1 and w2 are non-negative weights that sum to one. A non-zero weight w2 may be useful in reducing the bias resulting from model mis-specification. Specifically, even a small weight w2 (e.g. 0.2) incorporates the predictive scores from the missing probability model into defining a set of nearest neighbors for subjects with missing X. There are alternative ways to calculate the distance between subjects, such as the Mahalanobis distance, which accounts for the correlation between the two predictive scores. Once the distance is derived for subject j, it is employed to define a set of nearest neighbors. This neighborhood consists of NN subjects who have their X observed and have a small distance from subject j in terms of the two predictive scores.
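The selection of the imputing set can be sketched in a few lines. The code below is an illustration only: it assumes a weighted Euclidean distance on the two standardized scores (the Mahalanobis distance mentioned above would be an alternative), and the function names and donor values are hypothetical.

```python
import math


def score_distance(target, donor, w1, w2):
    """Weighted Euclidean distance between two (Sxc, Sdxc) score pairs (assumed form)."""
    return math.sqrt(w1 * (target[0] - donor[0]) ** 2
                     + w2 * (target[1] - donor[1]) ** 2)


def imputing_set(target, donors, nn, w1=0.8, w2=0.2):
    """Return the X values of the nn donors closest to the target.

    donors: list of ((Sxc, Sdxc), x_value) for bootstrap subjects with X observed.
    """
    ranked = sorted(donors, key=lambda d: score_distance(target, d[0], w1, w2))
    return [x for _, x in ranked[:nn]]


# Hypothetical donors: "a" is close on the missingness score, "b" on the X score.
donors = [((1.0, 0.0), "a"), ((0.0, 0.6), "b")]
nearest_x_weighted = imputing_set((0.0, 0.0), donors, nn=1, w1=0.8, w2=0.2)
nearest_miss_weighted = imputing_set((0.0, 0.0), donors, nn=1, w1=0.2, w2=0.8)
```

With (w1, w2) = (0.8, 0.2) the donor that matches on the X score is chosen, while (0.2, 0.8) picks the donor that matches on the missingness score, which shows how the weights trade off the two working models.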
3.1.3. Step 3: Impute a value from the imputing set
After the imputing set is defined, a value of X is randomly drawn from the imputing set. Thus, the procedure imputes X only from the subjects with X observed. The non-parametric multiple imputation method based on a nearest neighborhood is denoted as NNMI(NN, w1, w2).
3.1.4. Step 4: Repeat Steps 1 to 3 independently M times
Each of the M imputed datasets is based on a different bootstrap sample. Once the M multiply imputed datasets are obtained, we carry out the MI analysis procedure established in Rubin.5 Specifically for our purposes, Cox regression analysis with X and Z as the covariates is performed on the M imputed datasets to estimate βx and βz. For both βx and βz, the final estimate is the average of the M corresponding regression coefficient estimates (i.e. $\hat\beta = M^{-1} \sum_{m=1}^{M} \hat\beta_m$) and the final variance (denoted $T_\beta$) is the sum of the sample variance (denoted as Bβ) of the M regression coefficient estimates, inflated by the factor $(1 + M^{-1})$ as in Rubin's rule, and the average (denoted as Uβ) of the M variance estimates of $\hat\beta_m$, that is, $T_\beta = U_\beta + (1 + M^{-1}) B_\beta$. As shown in Rubin,5 for both βx and βz, the quantity $(\hat\beta - \beta)/\sqrt{T_\beta}$ approximately follows a t distribution with $\nu = (M - 1)\,[1 + U_\beta / \{(1 + M^{-1}) B_\beta\}]^2$ degrees of freedom. We use a value of 10 or higher for M.
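The pooling step can be written compactly. The sketch below applies Rubin's standard combining rules, including the finite-M factor (1 + 1/M) on the between-imputation variance; the function name and the example numbers are hypothetical and are not taken from the paper's results.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Combine M point estimates and their variance estimates with Rubin's rules.

    Returns (pooled estimate, total variance, degrees of freedom).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # U: average within-imputation variance
    between = statistics.variance(estimates)  # B: sample variance of the M estimates
    total = within + (1.0 + 1.0 / m) * between
    if between == 0.0:
        df = math.inf  # no between-imputation variability
    else:
        df = (m - 1) * (1.0 + within / ((1.0 + 1.0 / m) * between)) ** 2
    return pooled, total, df


# Hypothetical coefficient estimates and variances from M = 3 imputed datasets.
pooled, total, df = pool_rubin([0.6, 0.7, 0.8], [0.04, 0.04, 0.04])
```

A 95% confidence interval would then be the pooled estimate plus or minus the appropriate t quantile with `df` degrees of freedom times the square root of `total`.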
4. Illustration of the method on a breast cancer dataset
We demonstrate the nonparametric multiple imputation approach on a dataset which consists of 7050 women diagnosed with stage IV breast cancer between 2005 and 2011 in California. This dataset was extracted from the breast cancer registries under the Surveillance, Epidemiology and End Results (SEER) Program. For each of the 7050 patients, besides survival data after diagnosis with breast cancer (i.e. survival status and survival time), several variables were collected at diagnosis: Age, Race (Black, White, Other), HER2, Radiation and Surgery. Those variables are summarized in Table 1. HER2 is a member of the human epidermal growth factor receptor family and has been shown to be strongly associated with increased disease recurrence and a poor prognosis for breast cancer patients.20 According to Table 1, of the 7050 patients, 1293 (18.34%) had a missing HER2 value. Table 2 identifies variables predictive of HER2 value and missing probability. Specifically, based on univariate logistic regression analysis for the HER2 positive indicator using patients with their HER2 value available (i.e. complete-case analysis), Age, Race, Surgery and baseline cumulative hazard, respectively, are predictive of HER2 value and are used for performing the PMM method. The results indicate younger patients who did not have surgery and had a higher hazard rate are more likely to have a positive HER2 value. Based on univariate logistic regression for the missing indicator, Age, Surgery, Radiation, survival status (Dead indicator) and baseline cumulative hazard, respectively, are predictive of missing probability. The results indicate older patients who did not have surgery and radiation and had a lower hazard rate are more likely to have a missing HER2 value.
Those predictive covariates are then used to derive the conditional distribution of HER2 given the observed data and the selection probability for performing the AIPW estimation and derive two predictive scores for conducting the proposed multiple imputation method. Specifically, a working logistic regression model for HER2 positive indicator with Age, Race and Surgery, as well as survival status and baseline cumulative hazard, as covariates is fitted to derive the conditional distribution of HER2 given the observed data and a HER2 predictive score for each patient. A working logistic regression model for HER2 missing indicator with Age, Radiation and Surgery, as well as survival status and baseline cumulative hazard, as covariates is fitted to derive the selection probability (i.e. π = 1-missing probability) and a predictive score of HER2 missing probability for each patient. To perform the AIPW estimation, the derived conditional distribution of HER2 is then used to derive the conditional expectations and the selection probability is incorporated into the estimation as the weight. To conduct the proposed multiple imputation approach (i.e. NNMI), the two predictive scores are then used to calculate the distance between patients and then select an imputing set for each patient with missing HER2. The number of imputes M is set at 50. Upon the completion of multiple imputation, Cox regression analysis with Age, Black and Others (White as the reference group), HER2, Radiation and Surgery as the covariates is performed on each of the imputed datasets and Rubin’s rule5 is applied to derive the final estimate for each regression coefficient.
Table 1.
Variable | Mean/Frequency | Standard Deviation/ Percentage |
---|---|---|
Age | 60.91 | 14.41 |
Race | ||
White | 5585 | 79.22 |
Black | 721 | 10.23 |
Others | 744 | 10.55 |
HER2 | ||
Negative | 4180 | 59.29 |
Positive | 1577 | 22.37 |
Missing | 1293 | 18.34 |
Surgery | ||
No | 3916 | 55.55 |
Yes | 3134 | 44.45 |
Radiation | ||
No | 4484 | 63.60 |
Yes | 2566 | 36.40 |
Table 2.
Variable | HER2 value: OR (a) | HER2 value: 95% CI (b) | HER2 value: p (c) | Missing HER2 probability: OR | Missing HER2 probability: 95% CI | Missing HER2 probability: p |
---|---|---|---|---|---|---|
Age | 0.987 | (0.983, 0.991) | <0.0001 | 1.031 | (1.026, 1.035) | <0.0001 |
Black | 1.187 | (0.983, 1.433) | 0.08 | 1.007 | (0.825, 1.229) | 0.94 |
Others | 1.360 | (1.135, 1.629) | <0.001 | 0.908 | (0.742, 1.112) | 0.35 |
No Radiation | 0.913 | (0.811, 1.028) | 0.13 | 1.580 | (1.385, 1.804) | <0.0001 |
No Surgery | 0.884 | (0.787, 0.993) | 0.04 | 2.146 | (1.885, 2.443) | <0.0001 |
Dead | 0.997 | (0.888, 1.120) | 0.96 | 2.205 | (1.937, 2.510) | <0.0001 |
H0(t) (d) | 1.416 | (1.235, 1.624) | <0.0001 | 0.641 | (0.549, 0.747) | <0.0001 |
(a) Odds ratio (HER2+ vs. HER2−).
(b) 95% confidence interval.
(c) p-value.
(d) Baseline cumulative hazard.
The results of the Cox regression estimation for the CC, PMM, AIPW and NNMI methods are provided in Table 3. Table 3 displays the hazard ratio estimate of each covariate along with the associated 95% confidence interval (CI) and p-value. The CC and AIPW methods produce similar results. The results indicate that Age, Black and Surgery are significantly associated with survival after diagnosis with stage IV breast cancer. Specifically, older patients tend to have a higher hazard rate than younger patients, Black patients tend to have a higher hazard rate than White patients, and patients without surgery tend to have a higher hazard rate than patients with surgery. Patients in the Others group have a slightly lower hazard rate than White patients, but the difference is not significant at the 5% level. Radiation and HER2 are not significantly associated with survival after diagnosis with stage IV breast cancer. The PMM and NNMI methods produce results similar to those of the CC and AIPW methods, except for Others. The results of the PMM and NNMI methods indicate that patients in the Others group have a significantly lower hazard rate than White patients. In addition, the PMM and NNMI methods produce tighter 95% CIs than the CC and AIPW methods except for HER2.
Table 3.
CC:
Variable | HR (a) | 95% CI (b) | p (c) |
---|---|---|---|
Age | 1.015 | (1.012, 1.017) | <0.01 |
Black | 1.437 | (1.286, 1.695) | <0.01 |
Others | 0.887 | (0.781, 1.007) | 0.06 |
No Radiation | 1.056 | (0.978, 1.140) | 0.16 |
No Surgery | 1.893 | (1.755, 2.042) | <0.01 |
HER2 | 0.940 | (0.867, 1.020) | 0.14 |
PMM, AIPW and NNMI(5, 0.8, 0.2):
Variable | PMM: HR | PMM: 95% CI | PMM: p | AIPW: HR | AIPW: 95% CI | AIPW: p | NNMI: HR | NNMI: 95% CI | NNMI: p |
---|---|---|---|---|---|---|---|---|---|
Age | 1.018 | (1.015, 1.020) | <0.01 | 1.015 | (1.012, 1.017) | <0.01 | 1.018 | (1.015, 1.020) | <0.01 |
Black | 1.443 | (1.308, 1.591) | <0.01 | 1.436 | (1.276, 1.615) | <0.01 | 1.442 | (1.307, 1.591) | <0.01 |
Others | 0.879 | (0.786, 0.984) | 0.03 | 0.886 | (0.776, 1.011) | 0.07 | 0.879 | (0.785, 0.983) | 0.02 |
No Radiation | 1.044 | (0.976, 1.118) | 0.21 | 1.056 | (0.981, 1.137) | 0.15 | 1.044 | (0.976, 1.118) | 0.21 |
No Surgery | 1.885 | (1.762, 2.016) | <0.01 | 1.894 | (1.756, 2.042) | <0.01 | 1.896 | (1.773, 2.028) | <0.01 |
HER2 | 0.932 | (0.860, 1.010) | 0.09 | 0.958 | (0.874, 1.049) | 0.35 | 0.939 | (0.864, 1.021) | 0.14 |
(a) Hazard ratio.
(b) 95% confidence interval.
(c) p-value.
5. Simulation study
We perform several simulation studies to investigate the properties of the AIPW, NNMI and PMM methods when Cox regression has a covariate subject to missing and an additional fully observed covariate that is predictive of the missing covariate, and the quantities of interest are the regression coefficients of the Cox regression model. We investigate the effects of sample size, mis-specification of one of the two working models and mis-specification of the two link functions under a situation with dependent censoring. The simulation program is written in R and is available upon request.
For each of 1000 independent simulated datasets, the predictive covariate Z is generated from a U(0,1) distribution. The covariate X subject to missingness is generated from either a Bernoulli[p(Z)] distribution, where p(Z) is based on either a logit link or a complementary log–log link, or a normal distribution with mean a0 + a1Z and variance σ². The failure time T is generated from either an exponential distribution or a Weibull distribution, with a hazard rate that depends on X and Z as in the Cox model of Section 2. The censoring time C is also generated from either an exponential distribution or a Weibull distribution. Let Y = min(T, C) and δt = I(T ≤ C). The missing indicator δx (δx = 1 if X is observed) is generated from a Bernoulli[p(Z,Y)] distribution, where p(Z,Y) (i.e. the selection probability) is based on either a logit link or a complementary log–log link. The regression coefficients and hazard rates are selected to give a desired censoring rate and missing rate.
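One configuration of this data-generating process can be sketched as follows. This is a hedged illustration, not the authors' R program: the function names, the baseline hazards and every coefficient in the X, censoring and missingness models are assumed values chosen only to make the sketch runnable; only βx = ln(2) and βz = −ln(2) follow the values reported in Table 4. The sketch covers the binary-X, logit-link, exponential case.

```python
import math
import random


def expit(v):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-v))


def simulate_dataset(n, rng, beta_x=math.log(2), beta_z=-math.log(2)):
    """Simulate one dataset with a binary covariate X that is missing at random."""
    rows = []
    for _ in range(n):
        z = rng.random()                                  # Z ~ U(0, 1)
        x = 1 if rng.random() < expit(-0.5 + z) else 0    # assumed logit model for X
        # Exponential failure time with Cox-type hazard (assumed baseline 0.5).
        t = rng.expovariate(0.5 * math.exp(beta_x * x + beta_z * z))
        # Exponential censoring time whose hazard depends on Z (assumed form).
        c = rng.expovariate(0.3 * math.exp(0.5 * z))
        y = min(t, c)
        delta_t = 1 if t <= c else 0
        # Selection probability depends on Z and Y only, so X is MAR (assumed coefficients).
        delta_x = 1 if rng.random() < expit(0.5 - z - 0.5 * y) else 0
        rows.append({"y": y, "delta_t": delta_t, "z": z,
                     "x": x if delta_x == 1 else None, "delta_x": delta_x})
    return rows


data = simulate_dataset(200, random.Random(7))
censoring_rate = 1.0 - sum(r["delta_t"] for r in data) / len(data)
missing_rate = 1.0 - sum(r["delta_x"] for r in data) / len(data)
```

In practice the assumed coefficients would be tuned until `censoring_rate` and `missing_rate` match the targets of a given scenario; swapping `expit` for a complementary log–log inverse link or `expovariate` for a Weibull draw would give the other configurations described above.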
For the “Fully-Observed” (FO) analysis, treated as the gold standard, we derive Cox regression coefficient estimates for each simulated dataset before any missingness is applied. For the “Complete-Case” (CC) analysis, we derive Cox regression coefficient estimates from the data with X observed. For the AIPW and NNMI methods, a working logistic regression model (denoted by M1) is fitted to the data with X observed to derive the conditional distribution of X given the observed data and the predictive score of X. A working logistic regression model (denoted by M2) is fitted to the missing indicator to derive the missing probability and the predictive score of missingness. When both working models include all of the correct covariates, the methods are denoted by AIPW11 and NNMI11, respectively. When the working model for predicting X includes all of the correct covariates but the working model for predicting the missing probability does not, they are denoted by AIPW12 and NNMI12, respectively. When the working model for predicting X does not include all of the correct covariates but the working model for predicting the missing probability does, they are denoted by AIPW21 and NNMI21, respectively. When X and δx are generated from a complementary log–log model, both the AIPW and NNMI methods are considered mis-specified even if both working models include all of the correct covariates (i.e. AIPW11 and NNMI11), since the true models are not logit models. The PMM method that includes all of the correct covariates for predicting X is denoted by PMM1. Based on our prior experience in dealing with missing data (for both missing outcome and missing covariate values) using multiple imputation,11,21,22 for the NNMI method we set M = 10, NN = 5 and (w1, w2) = (0.8, 0.2) or (0.2, 0.8).
The results are provided in Tables 4 to 7. The FO analysis, which is the gold standard method, in all situations targets the true values, has the lowest root mean square error (RMSE) and produces coverage rates comparable to the nominal level, 95%. The CC analysis as expected produces biased regression coefficient estimates, especially for the estimate of regression coefficient for Z (i.e. βz), which results in a much larger RMSE than AIPW and NNMI and a slightly lower coverage rate than the nominal level in some situations due to the bias. When X is binary (Table 4), the PMM1 method (i.e. all of the correct covariates are included into imputation method) produces reasonable regression coefficient estimates and coverage rates and tends to have a smaller RMSE than AIPW and NNMI in which the missing probability working model is correctly specified. However, when X is continuous (Table 5), the bias of PMM1 is much larger than AIPW and NNMI in which the missing probability working model is correctly specified.
Table 4.
βx = ln(2) = 0.693 | βz = −ln(2) = −0.693 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Method | Est^a | SD^b | SE^c | RMSE^d | CR^e | Est | SD | SE | RMSE | CR | Div^f |
N = 200 | |||||||||||
FO | 0.691 | 0.191 | 0.194 | 0.191 | 95.3 | −0.696 | 0.333 | 0.316 | 0.333 | 93.7 | |
CC | 0.652 | 0.358 | 0.334 | 0.360 | 94.1 | −0.953 | 0.563 | 0.543 | 0.620 | 92.4 | |
PMM1 | 0.719 | 0.337 | 0.322 | 0.338 | 95.4 | −0.685 | 0.342 | 0.330 | 0.342 | 93.9 | |
AIPW11 | 0.715 | 0.362 | 0.319 | 0.363 | 91.3 | −0.698 | 0.355 | 0.408 | 0.355 | 97.7 | 0 |
NNMI11(0.8,0.2) | 0.726 | 0.360 | 0.343 | 0.361 | 94.1 | −0.689 | 0.347 | 0.333 | 0.347 | 94.9 | |
NNMI11(0.2,0.8) | 0.705 | 0.368 | 0.352 | 0.368 | 93.7 | −0.674 | 0.343 | 0.330 | 0.344 | 94.5 | |
AIPW12 | 0.735 | 0.335 | 0.199 | 0.338 | 74.6 | −0.765 | 0.440 | 0.480 | 0.446 | 95.7 | 187 |
NNMI12(0.8,0.2) | 0.721 | 0.317 | 0.313 | 0.318 | 96.1 | −0.695 | 0.347 | 0.332 | 0.347 | 94.2 | |
NNMI12(0.2,0.8) | 0.716 | 0.298 | 0.301 | 0.299 | 96.4 | −0.697 | 0.345 | 0.332 | 0.345 | 93.9 | |
AIPW21 | 0.714 | 0.363 | 0.319 | 0.364 | 91.5 | −0.698 | 0.355 | 0.409 | 0.355 | 97.7 | 0 |
NNMI21(0.8,0.2) | 0.736 | 0.348 | 0.339 | 0.351 | 94.4 | −0.690 | 0.346 | 0.332 | 0.346 | 94.9 | |
NNMI21(0.2,0.8) | 0.713 | 0.364 | 0.351 | 0.365 | 93.7 | −0.677 | 0.342 | 0.330 | 0.342 | 94.5 | |
N = 400 | |||||||||||
FO | 0.691 | 0.135 | 0.136 | 0.135 | 95.6 | −0.696 | 0.226 | 0.220 | 0.226 | 94.6 | |
CC | 0.661 | 0.235 | 0.23 | 0.237 | 94.1 | −0.947 | 0.385 | 0.370 | 0.461 | 89.1 | |
PMM1 | 0.710 | 0.219 | 0.225 | 0.220 | 95.9 | −0.691 | 0.228 | 0.230 | 0.228 | 95.0 | |
AIPW11 | 0.701 | 0.234 | 0.222 | 0.234 | 94.3 | −0.698 | 0.230 | 0.263 | 0.230 | 97.4 | 0 |
NNMI11(0.8,0.2) | 0.703 | 0.238 | 0.233 | 0.238 | 94.9 | −0.692 | 0.227 | 0.230 | 0.227 | 95.2 | |
NNMI11(0.2,0.8) | 0.698 | 0.241 | 0.237 | 0.241 | 94.4 | −0.682 | 0.225 | 0.229 | 0.225 | 94.9 | |
AIPW12 | 0.723 | 0.224 | 0.138 | 0.226 | 80.1 | −0.761 | 0.281 | 0.305 | 0.289 | 96.1 | 115 |
NNMI12(0.8,0.2) | 0.709 | 0.214 | 0.216 | 0.215 | 95.4 | −0.699 | 0.226 | 0.230 | 0.226 | 95.2 | |
NNMI12(0.2,0.8) | 0.719 | 0.201 | 0.204 | 0.203 | 95.3 | −0.704 | 0.228 | 0.229 | 0.228 | 95.2 | |
AIPW21 | 0.701 | 0.236 | 0.222 | 0.236 | 93.4 | −0.698 | 0.230 | 0.263 | 0.230 | 97.3 | 0 |
NNMI21(0.8,0.2) | 0.706 | 0.235 | 0.233 | 0.235 | 95.3 | −0.694 | 0.227 | 0.230 | 0.227 | 94.9 | |
NNMI21(0.2,0.8) | 0.699 | 0.241 | 0.238 | 0.241 | 94.7 | −0.684 | 0.226 | 0.229 | 0.226 | 94.9 | |
Note: Censoring rate: 0.35; Missing rate: 0.63.
^a Average of 1000 point estimates.
^b Empirical standard deviation.
^c Average estimated standard error.
^d Root mean square error: square root of (bias² + SD²).
^e Coverage rate of 1000 95% confidence intervals.
^f Number of non-convergences for AIPW.
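The RMSE footnote can be checked against individual table rows. The following is a minimal sketch in Python, assuming only the footnote's definition (square root of bias² + SD²) and the Est and SD values reported for βz at N = 200 in Table 4; the function name `rmse` is introduced here for illustration.

```python
import math

def rmse(estimate, truth, sd):
    """Root mean square error as defined in the table footnote."""
    bias = estimate - truth
    return math.sqrt(bias ** 2 + sd ** 2)

# True value of beta_z used in the simulation: -ln(2)
true_beta_z = -math.log(2)

# Est and SD for beta_z at N = 200, copied from Table 4
cc_rmse = rmse(-0.953, true_beta_z, 0.563)  # complete-case analysis
fo_rmse = rmse(-0.696, true_beta_z, 0.333)  # fully observed analysis

print(f"CC: {cc_rmse:.3f}, FO: {fo_rmse:.3f}")  # prints "CC: 0.620, FO: 0.333"
```

Both values agree with the RMSE column of Table 4. For CC the bias term contributes noticeably to the RMSE, whereas for FO the RMSE is essentially the SD.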
Table 7.
βx = ln(2) = 0.693 | βz = −ln(2) = −0.693 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Method | Est^a | SD^b | SE^c | RMSE^d | CR^e | Est | SD | SE | RMSE | CR | Div^f |
N = 200 | |||||||||||
FO | 0.688 | 0.201 | 0.195 | 0.201 | 94.3 | −0.714 | 0.308 | 0.316 | 0.309 | 95.5 | |
CC | 0.671 | 0.329 | 0.320 | 0.330 | 94.2 | −0.940 | 0.531 | 0.518 | 0.586 | 92.6 | |
PMM1 | 0.721 | 0.310 | 0.303 | 0.311 | 96.1 | −0.703 | 0.319 | 0.330 | 0.319 | 95.2 | |
AIPW11 | 0.705 | 0.331 | 0.296 | 0.331 | 91.8 | −0.716 | 0.327 | 0.373 | 0.328 | 96.1 | 0 |
NNMI11(0.8,0.2) | 0.717 | 0.338 | 0.315 | 0.339 | 93.0 | −0.705 | 0.322 | 0.332 | 0.322 | 94.3 | |
NNMI11(0.2,0.8) | 0.692 | 0.339 | 0.320 | 0.339 | 94.1 | −0.690 | 0.317 | 0.330 | 0.317 | 94.3 | |
AIPW12 | 0.739 | 0.321 | 0.208 | 0.324 | 81.5 | −0.779 | 0.378 | 0.450 | 0.388 | 96.1 | 191 |
NNMI12(0.8,0.2) | 0.721 | 0.304 | 0.296 | 0.305 | 95.5 | −0.713 | 0.325 | 0.331 | 0.326 | 94.1 | |
NNMI12(0.2,0.8) | 0.720 | 0.286 | 0.286 | 0.287 | 96.6 | −0.714 | 0.322 | 0.331 | 0.323 | 94.7 | |
AIPW21 | 0.706 | 0.331 | 0.296 | 0.331 | 92.1 | −0.715 | 0.327 | 0.373 | 0.328 | 96.2 | 0 |
NNMI21(0.8,0.2) | 0.724 | 0.331 | 0.313 | 0.332 | 94.0 | −0.707 | 0.322 | 0.332 | 0.322 | 94.1 | |
NNMI21(0.2,0.8) | 0.696 | 0.335 | 0.319 | 0.335 | 94.4 | −0.690 | 0.317 | 0.329 | 0.317 | 94.5 | |
N = 400 | |||||||||||
FO | 0.683 | 0.143 | 0.136 | 0.143 | 93.6 | −0.701 | 0.228 | 0.221 | 0.228 | 95.0 | |
CC | 0.660 | 0.235 | 0.220 | 0.237 | 92.2 | −0.927 | 0.379 | 0.358 | 0.445 | 88.8 | |
PMM1 | 0.699 | 0.219 | 0.217 | 0.219 | 95.1 | −0.698 | 0.238 | 0.231 | 0.238 | 94.9 | |
AIPW11 | 0.691 | 0.230 | 0.206 | 0.230 | 92.2 | −0.706 | 0.241 | 0.247 | 0.241 | 96.4 | 0 |
NNMI11(0.8,0.2) | 0.697 | 0.233 | 0.217 | 0.233 | 94.0 | −0.700 | 0.240 | 0.231 | 0.240 | 94.7 | |
NNMI11(0.2,0.8) | 0.688 | 0.234 | 0.224 | 0.234 | 94.2 | −0.687 | 0.236 | 0.230 | 0.236 | 95.2 | |
AIPW12 | 0.713 | 0.217 | 0.144 | 0.218 | 81.0 | −0.760 | 0.279 | 0.270 | 0.287 | 94.9 | 150 |
NNMI12(0.8,0.2) | 0.700 | 0.213 | 0.205 | 0.213 | 94.5 | −0.705 | 0.239 | 0.230 | 0.239 | 95.3 | |
NNMI12(0.2,0.8) | 0.708 | 0.199 | 0.196 | 0.200 | 95.8 | −0.708 | 0.238 | 0.230 | 0.238 | 94.8 | |
AIPW21 | 0.690 | 0.230 | 0.206 | 0.230 | 92.3 | −0.706 | 0.240 | 0.247 | 0.240 | 96.3 | 0 |
NNMI21(0.8,0.2) | 0.700 | 0.232 | 0.216 | 0.232 | 94.1 | −0.701 | 0.240 | 0.231 | 0.240 | 94.6 | |
NNMI21(0.2,0.8) | 0.688 | 0.234 | 0.221 | 0.234 | 94.4 | −0.687 | 0.236 | 0.229 | 0.236 | 95.1 |
Note: Censoring rate: 0.35; Missing rate: 0.60.
^a Average of 1000 point estimates.
^b Empirical standard deviation.
^c Average estimated standard error.
^d Root mean square error: square root of (bias² + SD²).
^e Coverage rate of 1000 95% confidence intervals.
^f Number of non-convergences for AIPW.
Table 5.
βx = ln(2) = 0.693 | βz = −ln(2) = −0.693 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Method | Est^a | SD^b | SE^c | RMSE^d | CR^e | Est | SD | SE | RMSE | CR | Div^f |
N = 200 | |||||||||||
FO | 0.668 | 0.113 | 0.114 | 0.116 | 94.4 | −0.690 | 0.318 | 0.324 | 0.318 | 95.8 | |
CC | 0.700 | 0.168 | 0.170 | 0.168 | 96.1 | −0.869 | 0.446 | 0.447 | 0.479 | 94.4 | |
PMM1 | 0.592 | 0.123 | 0.143 | 0.159 | 92.5 | −0.621 | 0.329 | 0.346 | 0.337 | 95.2 | |
AIPW11 | 0.666 | 0.160 | 0.151 | 0.162 | 93.4 | −0.687 | 0.346 | 0.373 | 0.346 | 96.5 | 1 |
NNMI11(0.8,0.2) | 0.656 | 0.150 | 0.150 | 0.155 | 94.3 | −0.670 | 0.337 | 0.350 | 0.338 | 95.6 | |
NNMI11(0.2,0.8) | 0.655 | 0.154 | 0.153 | 0.159 | 94.5 | −0.643 | 0.334 | 0.349 | 0.338 | 95.4 | |
AIPW12 | 0.902 | 0.322 | 0.097 | 0.384 | 60.3 | −0.814 | 0.575 | 0.517 | 0.588 | 89.8 | 1765 |
NNMI12(0.8,0.2) | 0.616 | 0.127 | 0.142 | 0.149 | 92.9 | −0.645 | 0.329 | 0.347 | 0.333 | 95.9 | |
NNMI12(0.2,0.8) | 0.594 | 0.116 | 0.138 | 0.153 | 91.8 | −0.628 | 0.324 | 0.344 | 0.330 | 95.4 | |
AIPW21 | 0.667 | 0.162 | 0.151 | 0.164 | 93.4 | −0.688 | 0.347 | 0.373 | 0.347 | 96.5 | 1 |
NNMI21(0.8,0.2) | 0.656 | 0.148 | 0.149 | 0.153 | 94.2 | −0.667 | 0.337 | 0.349 | 0.338 | 95.7 | |
NNMI21(0.2,0.8) | 0.655 | 0.154 | 0.154 | 0.159 | 94.1 | −0.641 | 0.334 | 0.348 | 0.338 | 95.4 | |
N = 400 | |||||||||||
FO | 0.677 | 0.080 | 0.080 | 0.082 | 94.1 | −0.695 | 0.224 | 0.227 | 0.224 | 94.9 | |
CC | 0.716 | 0.120 | 0.118 | 0.122 | 95.6 | −0.870 | 0.307 | 0.309 | 0.354 | 91.6 | |
PMM1 | 0.602 | 0.088 | 0.099 | 0.127 | 85.5 | −0.639 | 0.234 | 0.243 | 0.240 | 95.4 | |
AIPW11 | 0.676 | 0.106 | 0.106 | 0.107 | 94.6 | −0.695 | 0.241 | 0.254 | 0.241 | 95.6 | 0 |
NNMI11(0.8,0.2) | 0.673 | 0.105 | 0.106 | 0.107 | 94.7 | −0.686 | 0.240 | 0.244 | 0.240 | 95.4 | |
NNMI11(0.2,0.8) | 0.673 | 0.108 | 0.107 | 0.110 | 95.1 | −0.667 | 0.240 | 0.243 | 0.241 | 95.6 | |
AIPW12 | 0.926 | 0.224 | 0.069 | 0.323 | 44.0 | −0.829 | 0.371 | 0.424 | 0.395 | 88.7 | 1596 |
NNMI12(0.8,0.2) | 0.631 | 0.093 | 0.101 | 0.112 | 90.8 | −0.663 | 0.233 | 0.243 | 0.235 | 96.3 | |
NNMI12(0.2,0.8) | 0.609 | 0.084 | 0.098 | 0.119 | 87.2 | −0.643 | 0.228 | 0.242 | 0.233 | 95.6 | |
AIPW21 | 0.676 | 0.107 | 0.106 | 0.108 | 94.6 | −0.695 | 0.241 | 0.254 | 0.241 | 95.6 | 0 |
NNMI21(0.8,0.2) | 0.672 | 0.105 | 0.105 | 0.107 | 94.9 | −0.685 | 0.240 | 0.244 | 0.240 | 95.7 | |
NNMI21(0.2,0.8) | 0.672 | 0.107 | 0.107 | 0.109 | 94.9 | −0.667 | 0.239 | 0.243 | 0.240 | 95.3 | |
Note: Censoring rate: 0.33; Missing rate: 0.47.
^a Average of 1000 point estimates.
^b Empirical standard deviation.
^c Average estimated standard error.
^d Root mean square error: square root of (bias² + SD²).
^e Coverage rate of 1000 95% confidence intervals.
^f Number of non-convergences for AIPW.
When both working models include all of the correct covariates (i.e. AIPW11 and NNMI11), both methods produce reasonable regression coefficient estimates and coverage rates in all situations. The bias of the NNMI11 method is comparable to that of the AIPW11 method. For both AIPW11 and NNMI11, the bias for βx decreases with sample size, and the decrease is larger than for PMM1. For NNMI11, when X is binary (Table 4), a larger weight on the missing probability predictive score, i.e. (w1, w2) = (0.2, 0.8), can reduce the bias of the estimate of βx but not that of the estimate of βz. However, when X is continuous (Table 5), a larger weight on the missing probability predictive score increases the bias of both regression coefficient estimates.
When the working logistic regression model for the missing indicator δx (i.e. M2) is mis-specified (i.e. AIPW12 and NNMI12), NNMI12 has a smaller bias than AIPW12 for both binary (Table 4) and continuous (Table 5) X, especially when X is continuous. For NNMI12 with a larger weight on the predictive score for X, i.e. (w1, w2) = (0.8, 0.2), the bias decreases with sample size in all situations. When N = 400 and X is binary (Table 4), the bias of NNMI12 is comparable to that of PMM1. When X is continuous (Table 5), the bias of NNMI12 is smaller than that of PMM1 in all situations. The bias of AIPW12 does not decrease with sample size as much as that of NNMI12, especially when X is continuous (Table 5). This is because the performance of the AIPW method depends heavily on whether a correct model is used to derive the selection probability. Also, in all situations, the standard errors of AIPW12 tend to underestimate the variability of the regression coefficient estimates, and the underestimation is substantial when X is continuous (Table 5). As a result, the coverage rates of AIPW12 are lower than the nominal level. NNMI12 has a smaller RMSE than NNMI11. This is because mis-specification of the missing probability working model induces a much smaller SD for NNMI12, even though its bias is larger than that of NNMI11. For NNMI12, when X is binary (Table 4), a larger weight on the missing probability predictive score, i.e. (w1, w2) = (0.2, 0.8), yields a smaller bias than NNMI11 when N = 200. However, when the sample size is larger or X is continuous (Table 5), a larger weight on the missing probability predictive score increases the bias of both regression coefficient estimates.
When the working logistic regression model for X (i.e. M1) is mis-specified (i.e. AIPW21 and NNMI21), for a binary X (Table 4), the NNMI21 method has a larger bias than NNMI11 and NNMI12, especially when the sample size is 200. The bias decreases with sample size in all situations. However, for a continuous X (Table 5), in some situations the NNMI21 method has a smaller bias than NNMI11 and NNMI12. This is because the working model for predicting a missing continuous covariate is itself only an approximation. Similar to NNMI11, when X is binary (Table 4), a larger weight on the missing probability predictive score for NNMI21, i.e. (w1, w2) = (0.2, 0.8), can reduce the bias of the estimate of βx but not that of the estimate of βz. However, when X is continuous (Table 5), a larger weight on the missing probability predictive score increases the bias of both regression coefficient estimates.
When the link function for both X and δx is mis-specified (Table 6), the NNMI methods can still produce reasonable estimates of the regression coefficients. This indicates that the NNMI method is also robust to mis-specification of the link functions of the two working models. When both T and C are generated from a Weibull distribution (Table 7), the PMM, NNMI and AIPW methods all produce reasonable estimates. This is because none of them needs to specify the underlying distributions of the failure and censoring times when performing estimation.
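The comparisons above rest on the Est, SD, SE, RMSE and CR columns, which summarize the replicate-level estimates and their estimated standard errors. The sketch below shows one way such a summary could be computed, assuming Wald-type 95% intervals (estimate ± 1.96 × SE). The function `summarize` and the synthetic draws are hypothetical and are not the simulation output of this study; they only illustrate how coverage falls below the nominal level when the average SE understates the empirical SD.

```python
import numpy as np

def summarize(estimates, std_errors, truth, z=1.96):
    """Summarize simulation replicates for one coefficient.

    Assumes Wald-type intervals estimate +/- z * SE (an assumption of this sketch).
    """
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    est = estimates.mean()               # average point estimate
    sd = estimates.std(ddof=1)           # empirical standard deviation
    se = std_errors.mean()               # average estimated standard error
    rmse = np.sqrt((est - truth) ** 2 + sd ** 2)
    lower = estimates - z * std_errors
    upper = estimates + z * std_errors
    cr = 100.0 * np.mean((lower <= truth) & (truth <= upper))
    return {"Est": est, "SD": sd, "SE": se, "RMSE": rmse, "CR": cr}

# Synthetic replicates (hypothetical, for illustration only)
rng = np.random.default_rng(0)
truth = np.log(2)
draws = rng.normal(loc=truth, scale=0.3, size=1000)

honest = summarize(draws, np.full(1000, 0.3), truth)     # SE matches the true SD
too_small = summarize(draws, np.full(1000, 0.1), truth)  # SE understates the SD

print(round(honest["CR"], 1), round(too_small["CR"], 1))
```

With the same point estimates, the only difference between the two summaries is the reported SE, so any drop in coverage in the second summary is attributable to standard errors that are too small.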
Table 6.
βx = ln(2) = 0.693 | βz = −ln(2) = −0.693 | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Method | Est^a | SD^b | SE^c | RMSE^d | CR^e | Est | SD | SE | RMSE | CR | Div^f |
N = 200 | |||||||||||
FO | 0.682 | 0.194 | 0.198 | 0.194 | 95.7 | −0.710 | 0.341 | 0.337 | 0.341 | 94.6 | |
CC | 0.662 | 0.262 | 0.267 | 0.264 | 95.2 | −0.846 | 0.455 | 0.456 | 0.480 | 94.0 | |
PMM1 | 0.686 | 0.251 | 0.258 | 0.251 | 95.7 | −0.721 | 0.352 | 0.348 | 0.353 | 93.8 | |
AIPW11 | 0.685 | 0.245 | 0.246 | 0.245 | 95.0 | −0.713 | 0.354 | 0.366 | 0.355 | 95.6 | 0 |
NNMI11(0.8,0.2) | 0.695 | 0.250 | 0.255 | 0.250 | 95.7 | −0.718 | 0.352 | 0.348 | 0.353 | 94.2 | |
NNMI11(0.2,0.8) | 0.692 | 0.255 | 0.256 | 0.255 | 94.5 | −0.730 | 0.349 | 0.346 | 0.351 | 94.2 | |
AIPW12 | 0.700 | 0.246 | 0.209 | 0.246 | 90.3 | −0.726 | 0.364 | 0.373 | 0.365 | 95.4 | 74 |
NNMI12(0.8,0.2) | 0.694 | 0.241 | 0.25 | 0.241 | 96.2 | −0.712 | 0.353 | 0.348 | 0.354 | 94.5 | |
NNMI12(0.2,0.8) | 0.696 | 0.232 | 0.247 | 0.232 | 96.8 | −0.709 | 0.351 | 0.347 | 0.351 | 94.0 | |
AIPW21 | 0.686 | 0.243 | 0.246 | 0.243 | 94.9 | −0.713 | 0.354 | 0.366 | 0.355 | 95.6 | 0 |
NNMI21(0.8,0.2) | 0.702 | 0.247 | 0.252 | 0.247 | 95.6 | −0.715 | 0.352 | 0.347 | 0.353 | 94.0 | |
NNMI21(0.2,0.8) | 0.696 | 0.252 | 0.255 | 0.252 | 94.4 | −0.730 | 0.349 | 0.346 | 0.351 | 94.4 | |
N = 400 | |||||||||||
FO | 0.688 | 0.138 | 0.139 | 0.138 | 95.2 | −0.701 | 0.247 | 0.236 | 0.247 | 93.6 | |
CC | 0.670 | 0.181 | 0.187 | 0.182 | 95.2 | −0.832 | 0.319 | 0.318 | 0.348 | 93.9 | |
PMM1 | 0.688 | 0.171 | 0.178 | 0.171 | 95.8 | −0.707 | 0.250 | 0.243 | 0.250 | 94.4 | |
AIPW11 | 0.690 | 0.171 | 0.171 | 0.171 | 94.9 | −0.702 | 0.251 | 0.250 | 0.251 | 95.2 | 0 |
NNMI11(0.8,0.2) | 0.695 | 0.178 | 0.177 | 0.178 | 94.4 | −0.706 | 0.252 | 0.243 | 0.252 | 94.2 | |
NNMI11(0.2,0.8) | 0.695 | 0.178 | 0.178 | 0.178 | 94.2 | −0.714 | 0.250 | 0.242 | 0.251 | 94.8 | |
AIPW12 | 0.701 | 0.169 | 0.145 | 0.169 | 89.7 | −0.717 | 0.254 | 0.254 | 0.255 | 95.2 | 65 |
NNMI12(0.8,0.2) | 0.696 | 0.170 | 0.174 | 0.170 | 95.4 | −0.699 | 0.251 | 0.243 | 0.251 | 94.7 | |
NNMI12(0.2,0.8) | 0.703 | 0.164 | 0.171 | 0.164 | 95.8 | −0.695 | 0.250 | 0.242 | 0.250 | 94.8 | |
AIPW21 | 0.692 | 0.170 | 0.171 | 0.170 | 95.0 | −0.701 | 0.251 | 0.250 | 0.251 | 95.2 | 0 |
NNMI21(0.8,0.2) | 0.697 | 0.177 | 0.176 | 0.177 | 94.6 | −0.704 | 0.252 | 0.243 | 0.252 | 94.5 | |
NNMI21(0.2,0.8) | 0.697 | 0.178 | 0.177 | 0.178 | 94.4 | −0.688 | 0.249 | 0.242 | 0.249 | 94.7 |
Note: Censoring rate: 0.37; Missing rate: 0.44.
^a Average of 1000 point estimates.
^b Empirical standard deviation.
^c Average estimated standard error.
^d Root mean square error: square root of (bias² + SD²).
^e Coverage rate of 1000 95% confidence intervals.
^f Number of non-convergences for AIPW.
In summary, all methods reduce the bias of the standard CC analysis, but the amount of remaining bias, the efficiency and the validity of the estimated standard errors vary between methods. The performance of the AIPW method depends on whether a correct model is used to derive the selection probability. In contrast, the NNMI method, in which two predictive scores are derived from two working regression models, can provide reasonable regression coefficient estimates whether X is independent of or dependent on Z, and is robust to mis-specification (of the covariates included and of the link function) of either one of the two working regression models.
6. Discussion
In this paper we propose a nonparametric multiple imputation approach to handle a missing covariate in Cox regression analysis and compare it with the popular existing AIPW approach. Based on the simulation results, the performance of the AIPW method depends on whether the selection/missing probability model is correctly specified. This indicates that, when applying the AIPW method, one has to be sure that the corresponding model is correct in all aspects, including the link function and the choice of covariates. In contrast, for the nonparametric multiple imputation approach, the two working regression models are only used to derive two predictive scores to select imputing sets for missing covariate observations. Once the imputing sets are selected, nonparametric multiple imputation procedures are conducted on the sets. Therefore, this approach is expected to rely less heavily on the two working regression models than the AIPW method does.
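To make the role of the two predictive scores concrete, the following is a minimal sketch of how an imputing set might be selected and used. It assumes that both scores are standardized, that the distance between subjects i and j is sqrt(w1·(S1_i − S1_j)² + w2·(S2_i − S2_j)²), and that the imputed value is drawn at random from the nearest donors with X observed. The function names, the distance form, the set size `k` and the toy data are all assumptions made for illustration, not a reproduction of the authors' implementation.

```python
import numpy as np

def imputing_set(score_x, score_miss, observed, target, w1=0.8, w2=0.2, k=5):
    """Return indices of the k donors nearest to `target` on the two scores.

    score_x:    predictive score from the working model for X
    score_miss: predictive score from the working model for missingness
    observed:   boolean array, True where X is observed (eligible donors)
    """
    s1 = np.asarray(score_x, dtype=float)
    s2 = np.asarray(score_miss, dtype=float)
    # Standardize so that the weights act on comparable scales (assumption)
    s1 = (s1 - s1.mean()) / s1.std()
    s2 = (s2 - s2.mean()) / s2.std()
    dist = np.sqrt(w1 * (s1 - s1[target]) ** 2 + w2 * (s2 - s2[target]) ** 2)
    donors = np.flatnonzero(np.asarray(observed, dtype=bool))
    nearest = donors[np.argsort(dist[donors], kind="stable")]
    return nearest[:k]

def impute_once(x, donors, rng):
    """Draw one imputed value at random from the donors' observed X values."""
    return x[rng.choice(donors)]

# Toy data (hypothetical): subject 5 has X missing
score_x = np.array([0.0, 0.1, 0.2, 1.0, 2.0, 0.15])
score_miss = np.array([0.0, 0.1, 0.3, 1.0, 2.0, 0.1])
x = np.array([1.0, 0.0, 1.0, 0.0, 1.0, np.nan])
observed = ~np.isnan(x)

donors = imputing_set(score_x, score_miss, observed, target=5, k=3)
imputed = impute_once(x, donors, np.random.default_rng(1))
print(donors.tolist(), imputed)
```

In this sketch the working models influence the result only through the ordering of donors: the imputed value is always an observed X from a nearby subject, which is the sense in which the imputation itself is nonparametric. Repeating the draw several times would give the multiple imputed datasets.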
The performance of the proposed nonparametric multiple imputation method will depend on the missing rate. Specifically, the missing rate affects the number of similar “donors” available for each missing covariate observation. With a high missing rate, say 0.90, a much larger sample size is required for the proposed method to perform well than with a low missing rate.
As pointed out in the literature,23–26 when the imputation model is incompatible with the analysis model, multiple imputation may impute covariate values that are incompatible with the analysis model and thus lead to biased estimates of the parameters and the associated variances. To avoid this incompatibility, one can specify a joint model for the outcome and covariates for which the conditional distribution of the outcome given the covariates matches the analysis model, and then use the imputation model implied by this joint model.26 However, it can be challenging to specify the joint model in a situation with missing covariates. The proposed nonparametric multiple imputation does not directly use a statistical model to perform imputation and is, therefore, expected to be less susceptible to this incompatibility as long as the right covariates are included in one of the two working models. In the numerical results, we do not observe any under-estimation of the variation of the parameter estimates, and the bias is mainly a finite-sample bias, even when the link function is mis-specified.
In this paper, we assume that missingness depends only on the observed data (i.e. MAR mechanism). This assumption is untestable. It is possible that missingness also depends on some unobserved data (i.e. missing not at random mechanism), in which case a non-ignorable missing mechanism may remain even after conditioning on all of the observed data. Sensitivity analysis27 would be a possible way to evaluate the impact of unobserved data on the proposed multiple imputation approach. The proposed nonparametric multiple imputation might be less affected by violation of the MAR assumption since it does not directly use statistical models to perform imputation.
Acknowledgments
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Dr. Chiu-Hsieh Hsu’s research was partially supported by the National Cancer Institute grant P30 CA 023074.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
References
- 1. Cox DR. Regression models and life-tables. J Royal Stat Soc Ser B (Methodological) 1972; 34: 187–220.
- 2. Cox DR. Partial likelihood. Biometrika 1975; 62: 269–276.
- 3. Andersen PK and Gill RD. Cox’s regression model for counting processes: a large sample study. Ann Stat 1982; 10: 1100–1120.
- 4. Little RJA and Rubin DB. Statistical analysis with missing data, 2nd ed. New York, NY: Wiley, 2002.
- 5. Rubin DB. Multiple imputation for nonresponse in surveys. New York, NY: Wiley, 1987.
- 6. Rathouz PJ. Identifiability assumptions for missing covariate data in failure time regression models. Biostatistics 2007; 8: 345–356.
- 7. Robins JM, Rotnitzky A and Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994; 89: 846–866.
- 8. Wang CY and Chen HY. Augmented inverse probability weighted estimator for Cox missing covariate regression. Biometrics 2001; 57: 414–419.
- 9. Qi L, Wang CY and Prentice RL. Weighted estimators for proportional hazards regression with missing covariates. J Am Stat Assoc 2005; 100: 1250–1263.
- 10. Kang JDY and Schafer JL. Demystifying double robustness: a comparison of alternative strategies for estimating a population mean from incomplete data. Stat Sci 2007; 22: 523–539.
- 11. Long Q, Hsu C-H and Li Y. Doubly robust nonparametric multiple imputation for ignorable missing data. Stat Sinica 2012; 22: 149–172.
- 12. Rubin DB. Statistical matching using file concatenation with adjusted weights and multiple imputation. J Business Econ Stat 1986; 4: 87–94.
- 13. Rosenbaum PR and Rubin DB. Constructing a control group using multivariate matched sampling methods that incorporate the propensity score. Am Stat 1985; 39: 33–38.
- 14. White IR and Royston P. Imputing missing covariate values for the Cox model. Stat Med 2009; 28: 1982–1998.
- 15. Qi L, Wang YF and He Y. A comparison of multiple imputation and fully augmented weighted estimators for Cox regression with missing covariates. Stat Med 2010; 29: 2592–2604.
- 16. Marshall A, Altman DG, Royston P, et al. Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 2010; 10: 7.
- 17. Efron B. Bootstrap methods: another look at the jackknife. Ann Stat 1979; 7: 1–26.
- 18. Nielsen SF. Proper and improper multiple imputation. Int Stat Rev 2003; 71: 593–607.
- 19. Robins JM, Rotnitzky A and van der Laan M. Comment on profile likelihood. J Am Stat Assoc 2000; 95: 477–482.
- 20. Tan M and Yu D. Molecular mechanisms of erbB2-mediated breast cancer chemo-resistance. Adv Experiment Med Biol 2007; 608: 119–129.
- 21. Hsu C-H, Long Q, Li Y, et al. A nonparametric multiple imputation approach for data with missing covariate values with application to colorectal adenoma data. J Biopharmaceut Stat 2014; 24: 634–648.
- 22. Hsu C-H, He Y, Li Y, et al. Doubly robust multiple imputation using kernel-based techniques. Biometric J 2016; 58: 588–606.
- 23. Fay R. Proceedings of the section on survey research methods. Washington, DC: American Statistical Association, 1992, pp.227–232.
- 24. Meng XL. Multiple imputation inferences with uncongenial sources of input (with discussion). Stat Sci 1994; 9: 538–573.
- 25. Rubin DB. Multiple imputation after 18 years. J Am Stat Assoc 1996; 91: 473–490.
- 26. Bartlett JW, Seaman SR, White IR, et al. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Meth Med Res 2015; 24: 462–487.
- 27. Carpenter JR, Kenward MG and White IR. Sensitivity analysis after multiple imputation under missing at random: a weighting approach. Stat Meth Med Res 2007; 16: 259–275.