Abstract
Modeling symptom progression to identify ideal subjects for a Huntington’s disease clinical trial is problematic since time to diagnosis, a key covariate, can be heavily censored. Imputation is an appealing strategy that replaces the censored covariate with its conditional mean, but existing methods exhibited over 200% bias under heavy censoring. Calculating conditional means well requires estimating and then integrating over the survival function of the censored covariate from the censored value to infinity. To estimate the survival function flexibly, existing methods use the semiparametric Cox model with Breslow’s estimator, leaving the integrand for the conditional means (the survival function) undefined beyond the observed data. The integral is then estimated up to the largest observed covariate value, and this approximation can cut off the tail of the survival function and lead to severe bias. We combine the semiparametric survival estimator with a parametric extension to approximate the integral up to infinity. In simulations, our proposed extrapolation-before-imputation approach substantially reduces the bias seen with existing imputation methods, sometimes even when the parametric extension was misspecified. We further demonstrate how imputing with corrected conditional means can prioritize subjects for clinical trials. The R code to reproduce results is available in the Supplementary Material.
Keywords: Adaptive quadrature, Breslow’s estimator, Conditional mean imputation, Huntington’s disease, Time to diagnosis, Trapezoidal rule
1. Introduction
1.1. Modeling the Progression of Huntington’s Disease
Prospective studies are common for genetically-inherited diseases because, with genetic testing, researchers can identify at-risk subjects and follow their symptom development over time. Such studies are especially powerful for Huntington’s disease, a neurodegenerative disease caused by unstable cytosine-adenine-guanine (CAG) repeats in the HTT gene (The Huntington’s Disease Collaborative Research Group, 1993). Huntington’s disease is fully penetrant, so anyone with ≥ 36 CAG repeats is guaranteed to develop the disease.
Modeling the progression of Huntington’s disease using data from prospective studies like the Neurobiological Predictors of Huntington’s Disease (PREDICT-HD) (Paulsen et al., 2008) is appealing, for example, as scientists are now investigating experimental treatments designed to slow or delay symptoms. Models of how symptoms progress can help identify subjects to recruit into clinical trials. These symptoms are most detectable in the few years immediately before and after diagnosis, so subjects in this window of time would be ideal to test a new therapy in a clinical trial.
However, Huntington’s disease progresses slowly, with functional, motor, and cognitive decline spanning decades. As a result, prospective studies often end before all at-risk subjects have met the diagnosis criteria, defined as when motor abnormalities are unequivocal signs of Huntington’s disease (Huntington Study Group, 1996). Therefore, the slow-moving nature of the disease leaves the key variable “time to diagnosis” right-censored among subjects who have yet to be diagnosed (i.e., their motor abnormalities will merit a diagnosis sometime after their last study visit, but exactly when is unknown). We thus face the challenge of how to model the association between a fully observed outcome (symptom progression) and a randomly right-censored covariate (time to diagnosis).
1.2. Imputing a Censored Covariate
Inspired by missing data techniques, one appealing strategy is conditional mean imputation (CMI), where we replace all right-censored times to diagnosis with their conditional means (Atem et al., 2019, 2017, 2019). The conditional mean for a right-censored value is the expected time to diagnosis given that it must happen after the censored value (the last study visit) and adjusting for additional covariates (e.g., CAG repeat length).
In theory, this expected time to diagnosis can be anywhere from the last study visit to infinity, so computing it involves an integral over this range. Typically, the estimated survival function is a step function (in this case, Breslow’s estimator), which is well-defined up to the largest uncensored covariate value but not beyond. If there are covariate values beyond the largest uncensored one, Breslow’s estimator will carry forward the last estimated survival, but this is unrealistic for a fully penetrant disease like Huntington’s. Importantly, this step function leaves the integrand not well defined beyond the observed covariate values, so more accurate integration alone will not improve the estimation of the conditional means.
Existing CMI approaches use the trapezoidal rule to compute the integral over Breslow’s estimator from the censored value to the largest uncensored value in the data (Atem et al., 2019, 2017, 2019). For this approximation to hold, the largest observed covariate value in the data must represent the variable’s true maximum (which, in theory, could be infinity) such that the estimated survival function at that value is approximately zero; otherwise, data beyond that value will be cut off. Since the survival function is nonnegative and decreases monotonically, this cut-off can lead the existing approach, which we call “non-extrapolated CMI,” to underestimate the integral.
For example, if the last time to diagnosis was 10 years from study entry, non-extrapolated CMI assumes that all unobserved times to diagnosis should be observed within 10 years of study entry. Yet, in reality, diagnosis could occur at any time between the last study visit and death, both of which are unique to each subject. With non-extrapolated CMI, we are thus likely to incorrectly impute the censored covariate, rendering the downstream analysis (e.g., a model fit to the imputed data) invalid.
1.3. Need for Extrapolation with Imputation
Many methods may come to mind that handle integrals with infinite bounds, such as Gauss-Hermite and adaptive quadrature, both of which are widely implemented. However, we can only integrate over values of the covariate where the integrand is defined, and we need the integrand to be defined up to the infinite bound in the conditional mean formula.
Specifically, we need to extrapolate from Breslow’s estimator beyond the largest uncensored value so that we can integrate over it. While extrapolation methods are well established, our needs are unique: We are not just interested in extending the survival curve, which could be done using standard methods like those in Klein and Moeschberger (2003), but in further integrating over it.
In search of the best extrapolation method to use with integration, we explored various methods to extend the survival estimator and identified the best one for our proposed “extrapolated CMI” approach. To our knowledge, only one paper has investigated this same problem (Datta, 2005). They considered fewer methods and, in fact, we found that their recommended method could lead to bias even when integrating up to infinity. We note that inference about the tail of the survival function has been studied extensively (Reid and Cox, 1984), but we are interested in inference and prediction from the regression model after imputing based on the survival function instead. These two problems are fundamentally different. Inference about the tail requires extrapolating the survival function beyond the largest uncensored value, whereas the inference we are interested in requires extrapolating the survival function and integrating over that extrapolated function.
Importantly, extending the survival curve for integration is not a challenge unique to CMI. Any nonparametric or semiparametric full-likelihood approach with a censored covariate would also need to integrate up to infinity over an integrand that is not defined over that range.
1.4. Overview
We propose a hybrid approach to CMI that extrapolates from the semiparametric survival estimator before integration, making it possible to completely approximate the integral up to infinity and offering reduced bias (Section 2). We explore various extrapolation methods and identify the “Weibull extension” as the best one. We quantify the sizable bias introduced by calculating conditional means from the non-extrapolated survival function and show in extensive simulation studies that extrapolating with the Weibull extension before imputation reduces bias when imputing censored covariates, sometimes even when the data were not truly Weibull (Section 3). We show how imputing with biased conditional means can impact clinical trial recruitment (Section 4) and discuss our findings (Section 5).
2. Methods
2.1. Model and Data
Consider an outcome $Y$ and covariates $(X, Z)$, which are assumed to be related through a regression model parameterized by $\boldsymbol{\beta}$. For example, if $Y$ given $(X, Z)$ follows a linear regression model, $Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$. Estimating $\boldsymbol{\beta} = (\beta_0, \beta_1, \beta_2)$ is our primary interest and, unfortunately, difficult to do because the covariate $X$ is right-censored. Rather than observe $X$, we observe $W = \min(X, C)$ and $\Delta = I(X \leq C)$, where $C$ is a random censoring value. (Having random $C$ means that it changes for every subject and is unknown at study start. For Huntington’s disease studies, $C$ is the subject-specific length of follow-up from first to last study visit.) Thus, an observation for subject $i$ is captured as $(Y_i, W_i, \Delta_i, Z_i)$.
2.2. Conditional Mean Imputation
In missing data settings, imputation is a popular approach to obtain valid statistical inference without sacrificing the power of the full sample. Imputation is also a promising method to handle a censored covariate, with one simple change. When $X$ is right-censored, rather than impute just any value for it, we impute a value that is larger than $W$ because, by the definition of right censoring, $X$ must be larger than $W$. This partial information is captured through a CMI approach (Little, 1992; Richardson and Ciampi, 2003).
In CMI, we replace right-censored covariates with their conditional means:
$E(X \mid X > W_i, Z_i) = W_i + \dfrac{\int_{W_i}^{\infty} S(x \mid Z_i)\,dx}{S(W_i \mid Z_i)}$   (1)
where $S(x \mid z) = P(X > x \mid Z = z)$ is the conditional survival function for $X$ given $Z$. To our knowledge, this form was first introduced for randomly right-censored covariates by Atem et al. (2017), and, previously, a parallel formula was given in Little and Rubin (2002) for covariates censored by a lower limit of detection. Note that we use the subscript $i$ for $W_i$ and $Z_i$ because these are observed values of the random variables $W$ and $Z$, respectively, whereas $X$ is still random. Importantly, deriving (1) relies on the assumption of conditionally noninformative censoring, such that $X$ and $C$ are conditionally independent given $Z$.
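Formula (1) can be sanity-checked against a survival function whose conditional mean is known in closed form. The sketch below (illustrative Python, though the paper's implementation is in R) approximates the integral in (1) for an exponential survival function, where memorylessness gives the exact answer $w + 1/\lambda$:

```python
import math

def cond_mean_exponential(w, lam, n_steps=200_000, upper=60.0):
    """Approximate E(X | X > w) = w + [int_w^inf S(x) dx] / S(w)
    for S(x) = exp(-lam * x) with a fine trapezoidal grid.
    Memorylessness gives the exact answer w + 1/lam."""
    S = lambda x: math.exp(-lam * x)
    h = (upper - w) / n_steps
    integral = sum((S(w + k * h) + S(w + (k + 1) * h)) / 2 * h
                   for k in range(n_steps))
    return w + integral / S(w)
```

With `lam = 0.5` and a censored value `w = 2`, the exact conditional mean is 2 + 1/0.5 = 4, and the numeric approximation agrees to three decimal places.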
Now, CMI proceeds in two stages. First, we calculate the conditional means for the censored covariate, which requires estimating $S(x \mid z)$ (Section 2.3) and approximating the integral over it (Sections 2.4 and 2.5). Then, we replace the censored covariate with its conditional mean and fit the outcome model for $Y$ given imputed $X$ and $Z$ using the “usual” methods (e.g., ordinary least squares) to obtain the estimators $\hat{\boldsymbol{\beta}}$. Under proper specification (e.g., a well-estimated survival function and a well-approximated integral), Bernhardt et al. (2015) prove that CMI leads to consistent estimators in linear regression.
2.3. Estimating the Survival Function
To robustly estimate $S(x \mid z)$ in (1) without assuming a distribution for $X$ given $Z$, existing approaches use the semiparametric Cox proportional hazards model (Atem et al., 2019, 2017, 2019). With that model, the survival function is calculated as $S(x \mid z) = S_0(x)^{\exp(\boldsymbol{\theta}^{\top} z)}$, where $\boldsymbol{\theta}$ is the vector of log hazard ratios and $S_0(x)$ is the baseline survival function of $X$ (i.e., the survival function when $Z = 0$), both of which must be estimated. The log hazard ratios are easily estimated from existing software, like the coxph function in the survival package (Therneau and Grambsch, 2000), and a common way to estimate $S_0(x)$ is with Breslow’s estimator (Breslow, 1972). After estimating both, $\widehat{S}(x \mid z) = \widehat{S}_0(x)^{\exp(\widehat{\boldsymbol{\theta}}^{\top} z)}$ is constructed and used to compute $\widehat{E}(X \mid X > W_i, Z_i)$ from (1).
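To make the construction concrete, here is a minimal pure-Python sketch of Breslow's estimator and the implied conditional survival function. It is a toy version only: the log hazard ratio `theta` is taken as given rather than estimated by Cox partial likelihood (as coxph in R would do), and ties are handled naively.

```python
import math

def breslow_survival(w, delta, z, theta):
    """Breslow's estimator of the baseline survival S_0(x), returned as
    step points (time, S_0(time)) at each uncensored value. theta is an
    assumed log hazard ratio (in practice estimated by partial
    likelihood, e.g., via coxph in R)."""
    order = sorted(range(len(w)), key=lambda i: w[i])
    H0, steps = 0.0, []  # cumulative baseline hazard and step points
    for i in order:
        if delta[i] == 1:
            # risk set: subjects still under observation at w[i]
            risk = sum(math.exp(theta * z[j]) for j in range(len(w))
                       if w[j] >= w[i])
            H0 += 1.0 / risk
            steps.append((w[i], math.exp(-H0)))
    return steps

def cond_survival(x, z_i, theta, steps):
    """S(x | z) = S_0(x)^exp(theta * z), carrying S_0 forward between
    (and beyond) the uncensored step points."""
    S0 = 1.0
    for t, s in steps:
        if t <= x:
            S0 = s
    return S0 ** math.exp(theta * z_i)
```

Carrying the last step forward beyond the largest uncensored value is exactly the behavior that makes the integrand undefined (and the integral divergent) in Section 2.4.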
2.4. The Problem with Non-Extrapolated Conditional Means
Computing $E(X \mid X > W_i, Z_i)$ requires a method to approximate the integral over $\widehat{S}(x \mid z)$ from $W_i$ to infinity. Existing approaches use the trapezoidal rule (Atem et al., 2019, 2017, 2019; Lotspeich et al., 2022). That is, they estimate the integral in (1) with
$\int_{W_i}^{\infty} \widehat{S}(x \mid Z_i)\,dx \approx \sum_{j=1}^{m-1} I(w_{(j)} \geq W_i) \left\{\dfrac{\widehat{S}(w_{(j)} \mid Z_i) + \widehat{S}(w_{(j+1)} \mid Z_i)}{2}\right\} (w_{(j+1)} - w_{(j)})$   (2)
where $w_{(1)} < w_{(2)} < \cdots < w_{(m)}$ denote the distinct, ordered values of $W$ from a sample of $n$ observations. Going forward, let $\widehat{E}(X \mid X > W_i, Z_i)$ denote the conditional mean using (2).
Some $W_i$ will be censored, so computing $\widehat{E}(X \mid X > W_i, Z_i)$ requires evaluating $\widehat{S}(x \mid z)$ between and beyond the uncensored data on which it is defined. Between uncensored values, $\widehat{S}(x \mid z)$ is carried forward (interpolated). Beyond the largest uncensored value, $\widehat{S}(x \mid z)$ is also defined to carry forward, but this definition is unrealistic for a study of a fully penetrant disease like Huntington’s and leads to a divergent integral.
Remark 2.1. Existing approaches (e.g., Atem et al. (2019)) interpolate with the mean of $\widehat{S}(x \mid z)$ from the uncensored values immediately below and above a censored $W_i$. We adopt carry forward interpolation because it is computationally simple and follows from the original formula in Breslow (1972), but either method worked well (Supplemental Figure S1).
Critically, we recognize that (2) estimates the wrong integral: $\int_{W_i}^{w_{(m)}} \widehat{S}(x \mid Z_i)\,dx$ rather than $\int_{W_i}^{\infty} \widehat{S}(x \mid Z_i)\,dx$. The validity of this estimate, and with it the quality of the conditional means, hinges on how well the maximum $w_{(m)}$ of the observed $W$ represents the true maximum of the covariate $X$; this sentiment is shared in Atem et al. (2017). If $w_{(m)}$ is far below the true upper bound of $X$, then approximating with (2) will underestimate the integral by cutting off the tail of the survival function. We conclude that imputing with these “non-extrapolated” conditional means is only appropriate when $S(w_{(m)} \mid z) \approx 0$, i.e., when the survival function is entirely captured by the observed $W$.
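The size of this truncation bias is easy to see for a known survival function. In the hypothetical sketch below (illustrative values only), the trapezoidal rule is applied to $S(x) = e^{-x}$ up to an assumed largest observed value, so the tail $e^{-w_{(m)}}$ is cut off entirely:

```python
import math

def trapz_step_integral(w_start, grid, S):
    """Trapezoidal rule for S over [w_start, max(grid)], mimicking the
    finite upper bound used by non-extrapolated CMI."""
    xs = [x for x in grid if x >= w_start]
    return sum((S(xs[k]) + S(xs[k + 1])) / 2 * (xs[k + 1] - xs[k])
               for k in range(len(xs) - 1))

# Hypothetical true survival S(x) = exp(-x): the full integral from
# w_i to infinity is exp(-w_i), and the cut-off tail is exp(-w_max).
S = lambda x: math.exp(-x)
w_i, w_max = 0.5, 1.5          # heavy censoring: small largest value
grid = [w_i + k * (w_max - w_i) / 1000 for k in range(1001)]

truncated = trapz_step_integral(w_i, grid, S)   # stops at w_max
full = math.exp(-w_i)                           # integral up to infinity
```

Here roughly 37% of the integral (the tail $e^{-1.5} \approx 0.223$ out of $e^{-0.5} \approx 0.607$) is lost to truncation, and the loss grows as censoring pushes $w_{(m)}$ lower.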
2.5. Defining Extrapolated Conditional Means
We sought an improved calculation to capture the entirety of the improper integral in the conditional means by extending $\widehat{S}(x \mid z)$ beyond $w_{(m)}$ to better approximate the infinite upper bound. Conveniently, the integrate function in R implements “adaptive quadrature of functions… over a finite or infinite interval” (R Core Team, 2019). This function is included in the basic R functions and does not require installing any additional packages, making it an accessible and sustainable choice. Using integrate with an infinite upper bound is simple enough; in fact, as a user, it is no different than with a finite one. Still, adopting software that can integrate up to infinity does not help if the integrand, i.e., the survival function, is not defined as such; this is a problem for all quadrature software.
Remark 2.2. For the two parametric extrapolation methods that follow, the exponential and Weibull extensions, it is also possible to break the integral in (1) into two parts (one proper and one improper integral). Numeric integration is still needed for the former and an analytic solution can be derived for the latter.
If $\Delta_{(m)} = 1$, i.e., the largest observed value $w_{(m)}$ is uncensored, the estimated survival function at $w_{(m)}$ is approximately zero and extrapolation is not needed. Otherwise, we have to “extend” (i.e., extrapolate from) Breslow’s estimator beyond $w_{(m)}$. This way, the integrate function will have something to integrate over on its way up to infinity and help us better calculate the conditional means. Extrapolating from step functions is a common challenge with censored outcomes, since popular estimators, like Kaplan-Meier, are not well defined for values of $x > w_{(m)}$, either. We discuss four potential methods (illustrated in Supplemental Figure S2) to extend Breslow’s estimator $\widehat{S}_0(x)$.
Carry forward: Assume that $\widehat{S}(x \mid z) = \widehat{S}(w_{(m)} \mid z)$ for all $x > w_{(m)}$, which is equivalent to assuming that all censored covariates would have had $X = \infty$.
Immediate drop-off: Assume that $\widehat{S}(x \mid z) = 0$ at all $x > w_{(m)}$, which is equivalent to assuming that all censored covariates would have had $X$ just beyond their $W$.
Exponential extension: “Tie in” an exponential survival function where Breslow’s estimator leaves off and assume that $\widehat{S}(x \mid z) = \exp(-\widehat{\lambda} x)$ for $x > w_{(m)}$.
Weibull extension: For added flexibility, tie in a Weibull survival function and assume that $\widehat{S}(x \mid z) = \exp(-\widehat{\alpha} x^{\widehat{\gamma}})$ for $x > w_{(m)}$, where $\widehat{\alpha}$ and $\widehat{\gamma}$ are found using constrained maximum likelihood estimation (Moeschberger and Klein, 1985).
While these methods are well established for censored outcomes, to our knowledge we are the first to consider them for censored covariates. Also, our needs are unique, since we are extrapolating from the survival curve to then integrate over it.
Carry forward or immediate drop-off could be valid if we were just modeling the survival function, since they can converge to the true survival function in large samples (Ying, 1989; Klein and Moeschberger, 2003). However, neither is a good choice when we are subsequently integrating over the survival function. Carry forward makes the integral up to infinity diverge, since $\widehat{S}(x \mid z)$ never drops down to zero as $x$ goes to infinity. Immediate drop-off forces the integral to cut off at $w_{(m)}$; therefore, we expect it to offer little improvement, even with adaptive quadrature. (This method is recommended by Datta (2005) for integration under the Kaplan-Meier estimator.) Plus, assuming that patients never develop Huntington’s disease (carry forward) or develop it immediately after the study ends (immediate drop-off) is unrealistic. Theoretical justification exists for both parametric extensions, but the Weibull extension is preferable for its flexibility. Derivations for the parametric extensions can be found in Web Appendix A.
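The four extensions above can be sketched as tail survival functions beyond $w_{(m)}$. The tie-in value and the exponential and Weibull parameters below are illustrative stand-ins (the paper estimates the Weibull parameters by constrained maximum likelihood); the sketch shows why only the parametric tails yield a finite, nonzero contribution to the integral:

```python
import math

# Assumed tie-in point: largest uncensored value and the estimate there
w_max, S_at_max = 2.0, 0.3

def carry_forward(x):
    return S_at_max             # never reaches zero: tail integral diverges

def immediate_dropoff(x):
    return 0.0                  # tail contributes nothing to the integral

def exponential_ext(x, lam=0.6):
    # exponential tail tied in so the extension equals S_at_max at w_max
    return S_at_max * math.exp(-lam * (x - w_max))

def weibull_ext(x, alpha=0.3, gamma=1.2):
    # illustrative (alpha, gamma); the paper fits them by constrained MLE
    return S_at_max * math.exp(alpha * w_max ** gamma - alpha * x ** gamma)

# The exponential tail integral from w_max has the closed form
# S_at_max / lam; a fine left Riemann sum should agree closely.
h = 1e-4
n = int((40.0 - w_max) / h)
tail_exp = sum(exponential_ext(w_max + k * h) * h for k in range(n))
```

Both parametric tails tie in continuously at $w_{(m)}$; the closed-form tail integral also illustrates Remark 2.2's point that the improper piece of (1) can be handled analytically for these extensions.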
Remark 2.3. Non-extrapolated CMI can still involve evaluating the survival function beyond the largest uncensored value, but we found in simulations that the choice of extrapolation method did not have much impact there (Supplemental Figure S3). Without additional covariates $Z$, the existing approaches (e.g., Atem et al. (2019)) treat the largest value $w_{(m)}$ as uncensored regardless of $\Delta_{(m)}$, a recommendation from Datta (2005), which is equivalent to immediate drop-off. To our knowledge, the existing approaches do not define an extrapolation method when covariates are available.
3. Simulation Studies
In the simulations that follow, we investigate how well extrapolation before imputation can reduce bias when imputing a censored covariate under correct (Section 3.2) and incorrect (Section 3.3) specification of the Weibull extension extrapolation method. The R scripts to reproduce all simulations, tables, and figures are available in the Supplementary Material. The imputeCensRd R package implementing the imputation methods is available on GitHub at www.github.com/sarahlotspeich/imputeCensRd.
3.1. Data Generation and Metrics for Comparison
Data for samples of $n$ = 100, 500, or 2000 subjects were simulated in the following way. First, a binary covariate $Z$ was generated from a Bernoulli distribution. Next, $X$ was generated from a Weibull distribution with shape = 0.75 and a scale depending on $Z$, leading to proportional hazards in $X$ given $Z$. Then, an outcome was generated as $Y = \beta_0 + \beta_1 X + \beta_2 Z + \epsilon$, where $\epsilon$ was a standard normal random variable. We explored light (∼ 17%), heavy (∼ 49%), and extra heavy (∼ 78%) censoring in $X$ by generating $C$ from an exponential distribution with rates = 0.23, 2, and 10, respectively (Supplemental Figure S4). Notice that $C$ was generated independently of all other variables, more than satisfying the assumption of conditionally noninformative censoring. Finally, $W = \min(X, C)$ and $\Delta = I(X \leq C)$ were constructed.
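A sketch of this data generating mechanism (in Python; the paper's simulation scripts are in R). The Bernoulli probability and the Weibull scale values below are illustrative stand-ins, not the paper's exact settings:

```python
import math
import random

def generate_data(n, censor_rate, seed=42):
    """Simulate (W, Delta, Z) triples following the Section 3.1 recipe.
    The Bernoulli probability and Weibull scales are illustrative
    stand-ins for the paper's unstated values."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = int(rng.random() < 0.5)                 # binary covariate Z
        # Weibull via inverse CDF: X = scale * (-log U)^(1/shape);
        # a common shape with Z-specific scale gives proportional hazards
        shape, scale = 0.75, (1.0 if z else 2.0)
        x = scale * (-math.log(1.0 - rng.random())) ** (1.0 / shape)
        c = rng.expovariate(censor_rate)            # random censoring C
        w, delta = min(x, c), int(x <= c)           # W = min(X, C)
        data.append((w, delta, z))
    return data
```

Because a shared Weibull shape with $Z$-specific scales keeps the hazard ratio constant over time, this construction satisfies the proportional hazards assumption of the Cox working model.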
Given a continuous outcome $Y$, the analysis model was a linear regression, and both extrapolated CMI with the Weibull extension (hereafter, “extrapolated CMI”) and non-extrapolated CMI were considered to estimate the model parameters $\boldsymbol{\beta}$. Both CMI approaches were applied following the conditional single imputation with bootstrapping approach proposed by Atem et al. (2019). This framework (i) resamples the observations with replacement $B$ times from the imputed data, (ii) fits the analysis model to each resampled dataset, and (iii) pools the fitted models using Rubin’s Rules (Rubin, 2004). This approach is distinct from multiple imputation with bootstrapping, as we impute then resample rather than resample then impute, and its goal is to improve the standard error estimator for single CMI.
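The impute-once, resample-$B$-times scheme can be sketched as follows. This is a minimal single-predictor version with an ordinary least squares fit and Rubin's Rules pooling; the paper's analysis model includes both covariates, and the imputed values themselves would come from the CMI step:

```python
import random

def ols_slope(x, y):
    """Closed-form simple linear regression slope and its variance."""
    n = len(x)
    xb, yb = sum(x) / n, sum(y) / n
    sxx = sum((xi - xb) ** 2 for xi in x)
    slope = sum((xi - xb) * (yi - yb) for xi, yi in zip(x, y)) / sxx
    resid = [yi - yb - slope * (xi - xb) for xi, yi in zip(x, y)]
    var = (sum(r * r for r in resid) / (n - 2)) / sxx
    return slope, var

def bootstrap_after_imputation(x_imputed, y, B=200, seed=1):
    """(i) resample the imputed data with replacement B times,
    (ii) refit the analysis model, and (iii) pool with Rubin's Rules."""
    rng = random.Random(seed)
    n = len(y)
    ests, wvars = [], []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        b, v = ols_slope([x_imputed[i] for i in idx],
                         [y[i] for i in idx])
        ests.append(b)
        wvars.append(v)
    est = sum(ests) / B                        # pooled point estimate
    wbar = sum(wvars) / B                      # average within variance
    bvar = sum((e - est) ** 2 for e in ests) / (B - 1)  # between variance
    total_var = wbar + (1 + 1 / B) * bvar      # Rubin's total variance
    return est, total_var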
To assess validity, we report the empirical bias and standard errors for $\hat{\boldsymbol{\beta}}$, in addition to the average standard error estimators and empirical coverage probabilities for 95% confidence intervals. To gauge statistical precision, we report the relative efficiency, calculated as the empirical variance of the full cohort analysis (i.e., where all observations had uncensored $X$) divided by the empirical variance of the CMI approaches. Relative efficiency closer to one indicates that more efficiency was recovered through imputation. Unless otherwise stated, all summary metrics are based on 1000 replications.
3.2. Quantifying the Improvement with Extrapolation
Extrapolated CMI using the Weibull extension (correctly specified based on the data generating mechanism in Section 3.1) offered reduced bias compared to non-extrapolated CMI. Under light, heavy, and extra heavy censoring, non-extrapolated CMI led to as much as 5%, 40%, and 218% bias, respectively, in $\hat{\beta}_1$ (Table 1). Meanwhile, extrapolated CMI offered no more than 6%, 14%, and 40% bias, respectively, under these same settings. The residual bias for extrapolated CMI could be explained by the natural breakdown of the Cox model and/or the constrained MLE for the Weibull extension under strenuous levels of censoring.
Table 1:
Simulation results for Weibull $X$ dependent on $Z$ from the full cohort analysis (i.e., where all observations had uncensored $X$) and conditional mean imputation (CMI) approaches.
| Full Cohort | Extrapolated CMI | Non-Extrapolated CMI | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Censoring | $n$ | Bias | (%) | ESE | Bias | (%) | ESE | ASE | CP | RE | Bias | (%) | ESE | ASE | CP | RE |
| $\beta_0$: Intercept |
| Light | 100 | 0.005 | (0.50) | 0.154 | 0.008 | (0.78) | 0.160 | 0.219 | 0.989 | 0.929 | −0.007 | (−0.65) | 0.161 | 0.222 | 0.990 | 0.911 |
| 500 | 0.000 | (−0.03) | 0.068 | 0.005 | (0.48) | 0.071 | 0.097 | 0.992 | 0.923 | −0.006 | (−0.64) | 0.071 | 0.098 | 0.992 | 0.931 | |
| 2000 | −0.002 | (−0.16) | 0.035 | 0.003 | (0.29) | 0.036 | 0.048 | 0.992 | 0.932 | −0.005 | (−0.46) | 0.036 | 0.049 | 0.989 | 0.941 | |
| Heavy | 100 | 0.005 | (0.50) | 0.154 | −0.011 | (−1.05) | 0.177 | 0.243 | 0.990 | 0.753 | −0.032 | (−3.20) | 0.184 | 0.254 | 0.988 | 0.699 |
| 500 | 0.000 | (−0.03) | 0.068 | −0.013 | (−1.28) | 0.078 | 0.107 | 0.992 | 0.761 | −0.030 | (−3.00) | 0.079 | 0.110 | 0.988 | 0.740 | |
| 2000 | −0.002 | (−0.16) | 0.035 | −0.006 | (−0.63) | 0.041 | 0.053 | 0.989 | 0.734 | −0.022 | (−2.19) | 0.040 | 0.054 | 0.984 | 0.753 | |
| Extra Heavy | 100 | 0.005 | (0.50) | 0.154 | 0.037 | (3.68) | 0.231 | 0.289 | 0.977 | 0.442 | −0.057 | (−5.72) | 0.287 | 0.400 | 0.988 | 0.287 |
| 500 | 0.000 | (−0.03) | 0.068 | 0.010 | (0.95) | 0.103 | 0.130 | 0.986 | 0.440 | −0.064 | (−6.38) | 0.121 | 0.168 | 0.988 | 0.320 | |
| 2000 | −0.002 | (−0.16) | 0.035 | −0.005 | (−0.54) | 0.052 | 0.068 | 0.992 | 0.453 | −0.055 | (−5.48) | 0.057 | 0.082 | 0.969 | 0.368 | |
| $\beta_1$: Coefficient on Censored $X$ |
| Light | 100 | −0.016 | (−3.21) | 0.186 | −0.031 | (−6.20) | 0.237 | 0.288 | 0.975 | 0.616 | 0.026 | (5.29) | 0.254 | 0.318 | 0.989 | 0.539 |
| 500 | −0.002 | (−0.34) | 0.074 | −0.023 | (−4.50) | 0.098 | 0.117 | 0.965 | 0.571 | 0.018 | (3.59) | 0.097 | 0.126 | 0.989 | 0.586 | |
| 2000 | 0.001 | (0.13) | 0.035 | −0.016 | (−3.13) | 0.052 | 0.059 | 0.952 | 0.470 | 0.011 | (2.12) | 0.047 | 0.061 | 0.984 | 0.559 | |
| Heavy | 100 | −0.016 | (−3.21) | 0.186 | 0.070 | (13.97) | 0.380 | 0.476 | 0.983 | 0.240 | 0.197 | (39.50) | 0.442 | 0.591 | 0.986 | 0.177 |
| 500 | −0.002 | (−0.34) | 0.074 | 0.053 | (10.53) | 0.168 | 0.192 | 0.970 | 0.195 | 0.133 | (26.67) | 0.176 | 0.221 | 0.971 | 0.177 | |
| 2000 | 0.001 | (0.13) | 0.035 | 0.021 | (4.20) | 0.093 | 0.090 | 0.931 | 0.144 | 0.088 | (17.64) | 0.086 | 0.101 | 0.912 | 0.167 | |
| Extra Heavy | 100 | −0.016 | (−3.21) | 0.186 | −0.200 | (−40.03) | 0.616 | 0.456 | 0.520 | 0.091 | 1.091 | (218.29) | 1.983 | 2.661 | 0.981 | 0.009 |
| 500 | −0.002 | (−0.34) | 0.074 | −0.022 | (−4.50) | 0.405 | 0.310 | 0.754 | 0.034 | 0.853 | (170.53) | 0.665 | 0.904 | 0.924 | 0.012 | |
| 2000 | 0.001 | (0.13) | 0.035 | 0.110 | (21.99) | 0.258 | 0.199 | 0.880 | 0.019 | 0.667 | (133.45) | 0.307 | 0.391 | 0.659 | 0.013 | |
| $\beta_2$: Coefficient on $Z$ |
| Light | 100 | 0.002 | (0.94) | 0.210 | 0.005 | (2.13) | 0.214 | 0.298 | 0.994 | 0.960 | 0.008 | (3.24) | 0.215 | 0.298 | 0.994 | 0.954 |
| 500 | 0.002 | (0.87) | 0.095 | 0.004 | (1.46) | 0.097 | 0.133 | 0.997 | 0.970 | 0.004 | (1.71) | 0.097 | 0.133 | 0.997 | 0.972 | |
| 2000 | 0.001 | (0.42) | 0.047 | 0.002 | (0.62) | 0.048 | 0.066 | 0.993 | 0.963 | 0.001 | (0.48) | 0.047 | 0.066 | 0.994 | 0.967 | |
| Heavy | 100 | 0.002 | (0.94) | 0.210 | 0.017 | (6.60) | 0.229 | 0.313 | 0.995 | 0.836 | 0.036 | (14.39) | 0.224 | 0.308 | 0.993 | 0.878 |
| 500 | 0.002 | (0.87) | 0.095 | 0.014 | (5.77) | 0.100 | 0.138 | 0.994 | 0.905 | 0.030 | (11.96) | 0.099 | 0.137 | 0.993 | 0.925 | |
| 2000 | 0.001 | (0.42) | 0.047 | 0.006 | (2.37) | 0.050 | 0.069 | 0.991 | 0.859 | 0.020 | (8.07) | 0.050 | 0.069 | 0.985 | 0.889 | |
| Extra Heavy | 100 | 0.002 | (0.94) | 0.210 | −0.017 | (−6.65) | 0.387 | 0.502 | 0.987 | 0.293 | 0.102 | (40.66) | 0.231 | 0.312 | 0.984 | 0.826 |
| 500 | 0.002 | (0.87) | 0.095 | −0.006 | (−2.42) | 0.129 | 0.172 | 0.993 | 0.549 | 0.097 | (38.71) | 0.098 | 0.138 | 0.961 | 0.954 | |
| 2000 | 0.001 | (0.42) | 0.047 | 0.010 | (4.17) | 0.063 | 0.081 | 0.986 | 0.554 | 0.089 | (35.65) | 0.050 | 0.069 | 0.822 | 0.867 | |
Note: Bias (%): empirical bias (empirical percent bias); ESE: empirical standard error; ASE: average standard error estimator (with bootstrapping); CP: empirical coverage probability of 95% confidence intervals; RE: empirical relative efficiency to the full cohort analysis. True parameter values were . Extrapolated and non-extrapolated CMI were successful in all but 44 and 16 replications out of 9000; these few replications encountered errors with numerical integration (extrapolated only) or non-convergence with the Cox model (both).
Non-extrapolated CMI was also biased by as much as 41% for $\hat{\beta}_2$, while both CMI approaches were reasonably unbiased for the intercept $\hat{\beta}_0$. With minor exceptions (e.g., in the largest samples), extrapolated CMI could also offer some efficiency gains over non-extrapolated CMI. However, comparing efficiency is difficult when one or more methods is biased. In additional simulations where $X$ was generated independently of $Z$, non-extrapolated CMI led to unbiased estimates for $\hat{\beta}_0$ and $\hat{\beta}_2$ but continued to see bias (up to 154% versus 38% for extrapolated CMI) in estimating $\hat{\beta}_1$ (Supplemental Table S1).
The bootstrap standard error estimator captured the empirical variability with extrapolated CMI fairly well, especially in large samples and under lower censoring rates (Table 1). The corresponding confidence interval coverage probabilities were near the nominal 95% level for settings with light to heavy censoring and larger sample sizes. In smaller samples, the confidence intervals were conservative. However, under extra heavy censoring the average standard error estimators were smaller than the empirical standard errors for extrapolated CMI, leading to low coverage even when the bias was small (e.g., $\hat{\beta}_1$ with $n$ = 500). If inference is the primary objective, multiple imputation may be advisable in these settings. With non-extrapolated CMI, the bootstrap standard error estimator was generally inflated, leading to coverage probabilities of 95% or higher in some cases where the estimates were extremely biased.
These simulations were run in parallel on 64 gigabyte nodes on a high-performance computing cluster. One replication of extrapolated CMI took 24.6 seconds, on average, compared to 0.12 seconds for non-extrapolated CMI. Extrapolated CMI seems to incur the most computational strain from using adaptive quadrature for numerical integration, not estimating the parameters for the Weibull extension.
3.3. Misspecifying the Weibull Extension
To investigate performance when the Weibull extension is misspecified, we also generated $X$ from (i) a log-normal distribution with standard deviation = 0.5 (on the log scale) and (ii) a gamma distribution with shape = 1.5. We focused these simulations on the heavy censoring settings (∼ 50%), generating $C$ from an exponential distribution with rates = 0.65 and 0.27 for the log-normal and gamma setups, respectively. See Supplemental Figure S5 for how these three distributions for $X$ differ. All other variables were generated following Section 3.1, and samples of $n$ = 100, 500, or 2000 subjects were considered. The same linear regression outcome model was used.
As expected, the advantages of extrapolated over non-extrapolated CMI were most evident when $X$ given $Z$ truly followed a Weibull distribution. Still, extrapolated CMI could perform reasonably well in estimating $\beta_1$ when the Weibull extension was misspecified (Table 2). For instance, both CMI approaches led to low bias (< 5%) when $X$ was generated from a log-normal distribution. However, extrapolated CMI was up to 18% biased when $X$ was generated from a gamma distribution, while non-extrapolated CMI was < 10% biased in this case. The higher bias when $X$ was non-Weibull was not surprising for extrapolated CMI, but seeing lower bias for non-extrapolated CMI was.
Table 2:
Simulation results for $\hat{\beta}_1$ under various distributions of $X$ dependent on $Z$ from the full cohort analysis (i.e., where all observations had uncensored $X$) and conditional mean imputation (CMI) approaches.
| Full Cohort | Extrapolated CMI | Non-Extrapolated CMI | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Distribution | $n$ | Bias | (%) | ESE | Bias | (%) | ESE | ASE | CP | RE | Bias | (%) | ESE | ASE | CP | RE |
| Gamma | 100 | 0.001 | (0.29) | 0.038 | −0.092 | (−18.45) | 0.114 | 0.085 | 0.687 | 0.113 | 0.048 | (9.51) | 0.089 | 0.109 | 0.980 | 0.185 |
| 500 | 0.000 | (−0.08) | 0.016 | −0.062 | (−12.44) | 0.066 | 0.040 | 0.627 | 0.060 | 0.016 | (3.24) | 0.042 | 0.045 | 0.951 | 0.150 | |
| 2000 | 0.000 | (−0.08) | 0.008 | −0.033 | (−6.53) | 0.036 | 0.021 | 0.606 | 0.051 | 0.006 | (1.28) | 0.021 | 0.022 | 0.941 | 0.143 | |
| Log-Normal | 100 | −0.006 | (−1.19) | 0.173 | −0.024 | (−4.73) | 0.240 | 0.332 | 0.990 | 0.517 | 0.017 | (3.35) | 0.255 | 0.358 | 0.992 | 0.456 |
| 500 | −0.001 | (−0.23) | 0.074 | −0.011 | (−2.23) | 0.101 | 0.143 | 0.994 | 0.546 | 0.009 | (1.78) | 0.103 | 0.148 | 0.996 | 0.519 | |
| 2000 | 0.000 | (−0.05) | 0.035 | −0.007 | (−1.46) | 0.052 | 0.071 | 0.989 | 0.454 | 0.002 | (0.45) | 0.052 | 0.073 | 0.995 | 0.458 | |
| Weibull | 100 | −0.016 | (−3.21) | 0.186 | 0.070 | (13.97) | 0.380 | 0.476 | 0.983 | 0.240 | 0.197 | (39.50) | 0.442 | 0.591 | 0.986 | 0.177 |
| 500 | −0.002 | (−0.34) | 0.074 | 0.053 | (10.53) | 0.168 | 0.192 | 0.970 | 0.195 | 0.133 | (26.67) | 0.176 | 0.221 | 0.971 | 0.177 | |
| 2000 | 0.001 | (0.13) | 0.035 | 0.021 | (4.20) | 0.093 | 0.090 | 0.931 | 0.144 | 0.088 | (17.64) | 0.086 | 0.101 | 0.912 | 0.167 | |
Note: Bias (%): empirical bias (empirical percent bias); ESE: empirical standard error; ASE: average standard error estimator (with bootstrapping); CP: empirical coverage probability of 95% confidence intervals; RE: empirical relative efficiency to the full cohort analysis. True parameter value was . Extrapolated and non-extrapolated CMI were successful in all but 3 replications out of 9000; these few replications encountered non-convergence with the Cox model.
We discovered that this improvement for non-extrapolated CMI was due to the data generating mechanisms for the censoring variable $C$. Higher censoring rates, driven by larger rate parameters for $C$, led to smaller values of $w_{(m)}$ with the Weibull $X$ (Supplemental Figure S6). Smaller $w_{(m)}$ then led to worse performance (i.e., higher bias) for non-extrapolated CMI, which calculated the conditional means with the trapezoidal rule only up to this value. Meanwhile, the log-normal and gamma distributions led to larger values of $w_{(m)}$, even under heavy or extra heavy censoring, which could explain the improvements for non-extrapolated CMI, since less of the survival function’s tail is cut off.
We encountered minor difficulties with the default options for tolerance and the number of subdivisions to use with R’s integrate function. There were a small number of these errors in the Weibull samples, but the issues became more pronounced in the smallest log-normal and gamma samples. Most errors were due to “roundoff,” which could be resolved by manually increasing the tolerance (within reason). Others indicated “potential divergence,” due to a slowly decreasing survival function, which could be resolved by allowing more subdivisions.
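These tuning options have direct analogues outside R. As an illustrative sketch (the paper uses R’s integrate; SciPy’s quad is the closest Python analogue, and the slowly decaying survival function below is an invented stand-in), loosening the tolerance and raising the subdivision limit resolves the same classes of errors:

```python
import numpy as np
from scipy import integrate

def surv(t, scale=1.0, shape=0.5):
    # Stand-in survival function: a Weibull with shape < 1 decays slowly,
    # the situation that produced the "potential divergence" errors above
    return np.exp(-(t / scale) ** shape)

c = 2.0  # lower limit of integration (e.g., a censored covariate value)

# Default settings can struggle on [c, inf) when S(t) decreases slowly ...
val_default, _ = integrate.quad(surv, c, np.inf)

# ... so, mirroring the fixes described above, loosen the tolerance
# (epsabs/epsrel, like R's rel.tol) and allow more subdivisions (limit,
# like R's subdivisions)
val_tuned, _ = integrate.quad(surv, c, np.inf, epsabs=1e-6, limit=200)

# For this particular survival function the tail integral has a closed form
# to check against: int_c^inf exp(-sqrt(t)) dt = 2 (sqrt(c) + 1) exp(-sqrt(c))
exact = 2 * (np.sqrt(c) + 1) * np.exp(-np.sqrt(c))
```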
4. Application to Huntington’s Disease Data
4.1. Designing Clinical Trials to Test Experimental Treatments
Huntington’s disease causes irreversible damage, so experimental treatments aim to slow symptom progression. Clinical trials are critical to the success of potential treatments but also expensive, leading to constraints in their design and implementation, like the number of subjects recruited and length of follow-up. Thus, clinical trials seek to recruit subjects for whom the treatment could have the greatest potential impact (Paulsen et al., 2019).
Recruiting from an existing Huntington’s disease study can be a powerful first step, since more information is available on subjects’ disease history. For example, we could measure symptom change leading up to potential recruitment. Information about symptom change is important, since the impact of the treatment in slowing symptom progression would be more measurable for subjects with steeply progressing symptoms. Still, an existing study only tells us how a subject’s symptoms have been changing thus far, while what we really want to know is how their symptoms would change during the trial. We can model between-visit symptom change using data from PREDICT-HD, and then use this model to estimate subjects’ post-recruitment symptom progression and identify high-priority subjects for a new clinical trial.
Time to diagnosis has been shown to be highly predictive of symptom severity, with the steepest change in symptoms seen in the years immediately before and after diagnosis (e.g., Long et al. (2014)). Thus, time to diagnosis is an important covariate in our symptom progression model, but in a prospective study like PREDICT-HD, where not everyone has been diagnosed, it is a randomly right-censored covariate.
4.2. Modeling the Progression of Huntington’s Disease Symptoms
One way to gauge symptom severity is the composite Unified Huntington Disease Rating Scale (cUHDRS), which collectively measures functional, motor, and cognitive impairments. As Huntington’s disease progresses toward diagnosis, impairment worsens and the cUHDRS decreases. Following Schobel et al. (2017), cUHDRS = (TFC − 10.4)/1.9 − (TMS − 29.7)/14.9 + (SDMT − 28.4)/11.3 + (SWR − 66.1)/20.1 + 10, where TFC is total functional capacity, TMS is total motor score, SDMT is the Symbol Digit Modality Test, and SWR is the Stroop Word Reading Test. These components measure symptom severity in capacity for “everyday tasks” (TFC), motor impairment (TMS), and cognitive impairment (SDMT and SWR).
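The cUHDRS formula can be coded directly. A minimal Python sketch (the paper’s own code is in R; the component abbreviations follow the text):

```python
def cuhdrs(tfc, tms, sdmt, swr):
    """Composite UHDRS following Schobel et al. (2017): each component is
    standardized by reference values, with TMS negated so that lower
    cUHDRS always means worse symptoms."""
    return ((tfc - 10.4) / 1.9      # total functional capacity
            - (tms - 29.7) / 14.9   # total motor score (higher = worse)
            + (sdmt - 28.4) / 11.3  # Symbol Digit Modality Test
            + (swr - 66.1) / 20.1   # Stroop Word Reading Test
            + 10)                   # offset keeps typical scores positive

# Plugging in the reference values themselves returns the offset of 10,
# and worsening any single component lowers the composite score.
baseline = cuhdrs(10.4, 29.7, 28.4, 66.1)
```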
We modeled the adjusted association between a subject’s cUHDRS at two time points (denoted by cUHDRS_start and cUHDRS_end), controlling for other known covariates. Subjects’ cUHDRS scores at their first and last PREDICT-HD study visits were taken as cUHDRS_start and cUHDRS_end, respectively, to fit the model. The additional covariates were (i) proximity to diagnosis, defined as the time TIME_end from the end time point to diagnosis, and (ii) baseline information about age, CAG repeat length, and their interaction (denoted by AGE, CAG, and AGE×CAG, respectively) at the first study visit. In addition, we included an interaction between cUHDRS_start and TIME_end to allow the cUHDRS of a subject who is farther from diagnosis to change little, while the cUHDRS of a subject who is closer to diagnosis can change noticeably. Thus, the symptom progression model was captured with linear regression as Eθ(cUHDRS_end | TIME_end, cUHDRS_start, AGE, CAG) = α + β TIME_end + γ0 cUHDRS_start + γ1 TIME_end × cUHDRS_start + γ2 AGE + γ3 CAG + γ4 AGE×CAG. Covariates AGE, CAG, and cUHDRS_start were centered at 18, 36, and 23.8, respectively.
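To make the regression concrete, here is a hedged sketch that simulates data from a model of this form and recovers the parameters by ordinary least squares. All coefficient values and covariate distributions below are invented for illustration; the real model was fit to PREDICT-HD data:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Illustrative covariates, pre-centered as in the text (AGE at 18, CAG at 36,
# cUHDRS_start at 23.8); none of these values come from PREDICT-HD
time_end = rng.uniform(-10, 20, n)            # years from last visit to diagnosis
cuhdrs_start = rng.normal(0, 3, n)            # centered baseline cUHDRS
age = rng.uniform(0, 40, n)                   # AGE - 18
cag = rng.integers(0, 15, n).astype(float)    # CAG - 36

# Assumed coefficients (alpha, beta, gamma0, ..., gamma4) for the sketch
theta = np.array([23.0, 0.3, 0.95, 0.015, -0.02, -0.1, -0.002])

# Design matrix matching the symptom progression model, including the
# TIME_end x cUHDRS_start and AGE x CAG interactions
X = np.column_stack([np.ones(n), time_end, cuhdrs_start,
                     time_end * cuhdrs_start, age, cag, age * cag])
y = X @ theta + rng.normal(0, 0.5, n)

# Ordinary least squares recovers the model parameters
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```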
To be included in our analysis, subjects needed to have (i) a CAG repeat length ≥ 36, (ii) not yet been diagnosed with Huntington’s disease at study entry, (iii) undergone all testing to calculate the cUHDRS at the first and last visits (see Supplemental Figure S7 for missing data in cUHDRS components), and (iv) returned for at least one follow-up visit. These criteria left a sample of 970 subjects, 238 (25%) of whom were diagnosed before their last visit, leaving the remaining 732 (75%) with a censored time to diagnosis.
4.3. Imputing Censored Times to Diagnosis
Time to diagnosis was calculated in the following way. First, the DATE of diagnosis was taken as the first visit where a subject met the criteria for diagnosis, i.e., a clinician assigned them the highest rating of 4 on the Unified Huntington’s Disease Rating Scale diagnostic confidence level (Long et al., 2014). From DATE, time to diagnosis was calculated from the start or end of the time period, denoted as TIME_start and TIME_end, respectively (Figure 1). The former was used for imputation, because it was most natural to think of symptom progression from the start of the period. The latter was used for analysis, because it aligned better with our outcome (cUHDRS at that time).
Figure 1:

From a subject’s observed or imputed date of diagnosis, we calculated their time to diagnosis from either the first (A) or last visit (B).
Since subjects who had not yet been diagnosed had no such DATE but would have one someday, TIME_start from the start of the period to diagnosis was randomly right-censored. This variable was imputed for undiagnosed subjects with E(TIME_start | TIME_start > FOLLOW_UP, AGE, CAG), where FOLLOW_UP was their disease-free follow-up time from the start to the end of the period. Imputation began by fitting the Cox proportional hazards model for TIME_start given AGE and CAG from study entry and calculating Breslow’s estimator (details in Web Appendix B.1). We used the Weibull extension to extrapolate the survival estimator beyond the largest uncensored value. Also, the context of TIME_start could be used to refine the upper bound of the integral. Specifically, TIME_start from the start of the time period to Huntington’s disease diagnosis could not be infinite, simply because humans are not immortal. Instead, we assumed TIME_start to be within 60 years of the start of the time period (details in Web Appendix B.2).
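Concretely, imputation replaces a censored value c with the conditional mean E(X | X > c) = c + ∫ S(t) dt / S(c), integrating from c up to the refined upper bound. A minimal Python sketch (the paper’s implementation is the R package imputeCensRd; here a step-function grid of survival values stands in for Breslow’s estimator, and the Weibull parameters and censored value are assumed for illustration) combines the trapezoidal rule over the observed range with the extrapolated tail, truncated at 60 years:

```python
import numpy as np
from scipy import integrate

# Illustrative stand-in for a semiparametric survival estimate: survival
# values on a grid, known only up to the largest uncensored value w_max
grid = np.linspace(0.0, 20.0, 201)            # years from first visit
shape, scale = 1.5, 12.0                      # assumed Weibull parameters
surv_grid = np.exp(-(grid / scale) ** shape)  # "estimated" S(t) on the grid
w_max = grid[-1]

def weibull_tail(t):
    # Parametric extension used to extrapolate S(t) beyond w_max
    return np.exp(-(t / scale) ** shape)

def cond_mean(c, upper=60.0):
    """E(X | X > c) = c + int_c^upper S(t) dt / S(c), with the improper
    integral truncated at `upper` (60 years, as in the text)."""
    s_c = np.interp(c, grid, surv_grid)
    # trapezoidal rule over the observed portion [c, w_max] ...
    obs = grid[grid >= c]
    inner = integrate.trapezoid(np.interp(obs, grid, surv_grid), obs)
    # ... plus the extrapolated tail (w_max, upper]; non-extrapolated CMI
    # drops this term, cutting off the tail and biasing the mean downward
    tail, _ = integrate.quad(weibull_tail, w_max, upper)
    return c + (inner + tail) / s_c

imputed = cond_mean(c=8.0)  # imputed time to diagnosis for a subject censored at 8 years
```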
Now, we prepared to fit the disease progression model from Section 4.2, which involves the right-censored time to diagnosis (in years). For uncensored subjects, TIME_end was computed by subtracting their last visit date from their DATE of diagnosis. For censored subjects, TIME_end was computed by subtracting their last visit date from the imputed DATE of diagnosis instead, where the imputed DATE was found by adding E(TIME_start | TIME_start > FOLLOW_UP, AGE, CAG) to their first visit date.
4.4. Strategic Recruitment for a Clinical Trial
Along with the differing densities of time to diagnosis (Supplemental Figures S8 and S9), the two CMI approaches led to different disease progression models (Supplemental Table S2). We then used the models to guide recruitment for a new clinical trial in the following way. Suppose we were recruiting 200 at-risk subjects from their last regular study visit and that the clinical trial was expected to last for 2 years. Our recruitment strategy was to (i) predict the subject-specific changes in cUHDRS over the course of the clinical trial period using the disease progression models and (ii) prioritize subjects with the steepest predicted drops in cUHDRS during that time.
4.4.1. How to Estimate One Subject’s Symptom Progression
Consider a randomly selected subject whose cUHDRS declined from cUHDRS_start = 15.9 to cUHDRS_end = 13.3 between their first and last visits, a pre-trial change of 13.3 − 15.9 = −2.6. We can predict subjects’ cUHDRS 2 years from recruitment using the fitted symptom progression models, plugging in their cUHDRS at recruitment for cUHDRS_start to obtain their expected cUHDRS at trial end. Then, the expected symptom change during the clinical trial can be calculated by subtracting the cUHDRS at recruitment from this prediction. If this change is negative, the subject’s symptoms are expected to worsen, and values farther from 0 indicate more severe expected worsening.
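With a fitted model in hand, this prediction is one line of arithmetic. In the sketch below, the coefficients, the imputed time to diagnosis, and the subject’s age and CAG length are all hypothetical values invented for illustration (the actual estimates are in Supplemental Table S2):

```python
# Hypothetical coefficients for the Section 4.2 model; NOT the PREDICT-HD fits
theta = {"alpha": 23.0, "beta": 0.30, "gamma0": 0.95, "gamma1": 0.015,
         "gamma2": -0.02, "gamma3": -0.10, "gamma4": -0.002}

def predict_cuhdrs_end(time_end, cuhdrs_start, age, cag, th=theta):
    """Expected cUHDRS at trial end; AGE, CAG, and cUHDRS_start are centered
    at 18, 36, and 23.8, respectively, as in the fitted model."""
    a, c, s = age - 18, cag - 36, cuhdrs_start - 23.8
    return (th["alpha"] + th["beta"] * time_end + th["gamma0"] * s
            + th["gamma1"] * time_end * s + th["gamma2"] * a
            + th["gamma3"] * c + th["gamma4"] * a * c)

# Example subject: cUHDRS = 13.3 at recruitment (their last study visit);
# time from trial end to diagnosis, age, and CAG are assumed values
pred = predict_cuhdrs_end(time_end=3.0, cuhdrs_start=13.3, age=45, cag=42)
change = pred - 13.3  # negative => symptoms expected to worsen during the trial
```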
As a bonus, a trajectory of the subject’s symptom severity before and during the trial can be constructed (Supplemental Figure S10). A subject’s changes in symptoms before and after trial recruitment are summarized by the pre-trial and predicted in-trial changes, respectively, along this trajectory. For the same example subject, the extrapolated CMI model predicted their cUHDRS at the end of the trial, yielding an estimated change during the trial. Based on this value, the subject had the 43rd largest predicted decrease in cUHDRS among censored subjects, making them a high priority for recruitment. In contrast, the non-extrapolated CMI model predicted a smaller estimated change at trial end, ranking this subject 201st and making them instead a low priority for recruitment into a trial of 200 subjects.
Because we saw in the simulation studies (Section 3.2) that the non-extrapolated CMI model estimates can be biased, particularly under extra heavy censoring rates like the 75% in PREDICT-HD, we have more trust in the extrapolated CMI model and believe that its predicted symptom change for this subject would be closer to the truth. In general, incorrectly prioritizing trial candidates (e.g., by mistakenly ranking someone 201st due to a biased model when they should really have been 43rd) means that non-ideal subjects may take spots away from others with potentially more to gain.
4.4.2. How to Prioritize the Entire Study for Recruitment
We used the process outlined in Section 4.4.1 for all subjects and ordered them by their estimated symptom changes, starting from the largest decrease. Then, we recruited subjects ranked 1 through 200, prioritizing those with the worst expected symptom progression and potentially the most to gain. We call this rank-based recruitment.
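The rank-based rule is straightforward to code. In this hedged sketch, the predicted 2-year cUHDRS changes are simulated rather than taken from the fitted models, and the sample size matches the 732 censored PREDICT-HD subjects:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical predicted 2-year cUHDRS changes for 732 censored subjects
# (negative = expected decline); illustrative values, not model output
pred_change = rng.normal(loc=-1.0, scale=1.5, size=732)

def rank_based_recruitment(changes, n_recruit=200):
    """Rank subjects by predicted cUHDRS change (most negative first) and
    recruit those ranked 1 through n_recruit."""
    order = np.argsort(changes)            # ascending: steepest drops first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(changes) + 1)
    recruited = ranks <= n_recruit
    return ranks, recruited

ranks, recruited = rank_based_recruitment(pred_change)
```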
Although the PREDICT-HD study is over, we demonstrated our recruitment strategy with its data. Figure 2 summarizes the recruitment statuses based on both disease progression models for the 732 censored subjects. To introduce variability, we also created 1000 new datasets of 732 subjects each by resampling with replacement from the 732 censored subjects. In each resampled dataset, we applied our rank-based recruitment strategy based on both models. On average, the extrapolated and non-extrapolated CMI models agreed on 158 and 490 subjects to recruit and not recruit, respectively (Supplemental Figure S11).
Figure 2:

Shaded regions capture subjects recruited based on each model, with the overlapping area in the lower left capturing subjects who would have been recruited based on either model. Points represent the censored subjects from PREDICT-HD.
Our proposed recruitment strategy takes a granular approach to targeting high priority subjects. Other strategies randomly sample from strata defined by a proxy for time to diagnosis. For example, Paulsen et al. (2019) create “low” and “high” risk groups from the CAP score (Zhang et al., 2011), where the high risk group is made up of subjects believed to be nearest diagnosis (CAP > 390.4). One potential drawback of stratified strategies like this is that clinicians cannot prioritize subjects within a risk group. In ranking rather than categorizing, clinicians are empowered to directly recruit the highest priority subjects.
5. Discussion
We demonstrate through simulations that, under heavy and extra heavy censoring, existing non-extrapolated CMI approaches lead to biased estimates from statistical models. We propose extrapolating the survival function with a parametric Weibull extension before imputing, which can substantially reduce this bias even when the data are not truly Weibull. Our extrapolated CMI was designed for single imputation, but, because CMI can be adopted in either a single or multiple imputation framework, our extrapolation-before-imputation strategy will also work with multiple imputation.
In our simulations and real-data analysis, we focused on linear regression modeling, but the methods apply for any outcome model. This flexibility is one of the strengths of imputation: Once the censored covariate is imputed we can apply any of the usual modeling approaches. However, consistency for the CMI estimators cannot be guaranteed in non-linear outcome models, like logistic regression (Bernhardt et al., 2015).
Under light censoring, the distribution-free robustness of the non-extrapolated semiparametric CMI approach is appealing. Otherwise, extrapolation may be needed to reduce bias. However, even with our improvements, some bias remained with extrapolated CMI. Further investigation is needed to determine which survival function estimator to use for imputation, particularly under heavy and extra heavy censoring, when a more structured parametric estimator might be preferred. Also, approaches that rely on the Cox model are sensitive to non-proportional hazards, but we could test for and accommodate violations.
There are natural connections between our work and other methods that require improper integration over a step function survival estimator, for example, estimating mean residual life or maximum likelihood estimation. Also, other distributions could certainly be used to extend the survival curve, and model selection procedures (e.g., minimizing Akaike’s information criterion) could be used to select the best-fitting one. To ensure a smooth transition from Breslow’s estimator to these parametric extensions, constrained maximum likelihood estimation procedures like Web Appendix A.2 would be needed.
Acknowledgments
This research was supported by the National Institute of Environmental Health Sciences grants T32ES007018 and P30ES010126 and the National Institute of Neurological Disorders and Stroke (NINDS) grants K01NS099343 and R01NS131225. The authors thank PREDICT-HD for permission to present their data. Computations were performed using the Wake Forest University (WFU) High Performance Computing Facility, a centrally managed computational resource available to WFU researchers including faculty, staff, students, and collaborators. This work also used resources of the Longleaf Cluster at the University of North Carolina at Chapel Hill.
Footnotes
SUPPLEMENTARY MATERIAL
Additional appendices, tables, and figures: Web Appendices and Supplemental Figures and Tables referenced in Sections 2–4 (.pdf file)
Data: The PREDICT-HD dataset is available at www.ncbi.nlm.nih.gov/gap/
R package for imputation: The R package imputeCensRd containing code to perform the imputation methods described in the article is available on GitHub at http://www.github.com/sarahlotspeich/imputeCensRd
R code for simulation studies: The R scripts to replicate the simulation studies (.zip file)
CONFLICT OF INTEREST
The authors report there are no competing interests to declare.
References
- Atem FD, Matsouaka RA, and Zimmern VE (2019). Cox regression model with randomly censored covariates. Biometrical Journal 61, 1020–1032.
- Atem FD, Qian J, Maye JE, Johnson KA, and Betensky RA (2017). Linear regression with a randomly censored covariate: Application to an Alzheimer’s study. Journal of the Royal Statistical Society: Series C (Applied Statistics) 66(2), 313–328.
- Atem FD, Sampene E, and Greene TJ (2019). Improved conditional imputation for linear regression with a randomly censored predictor. Statistical Methods in Medical Research 28(2), 432–444.
- Bernhardt PW, Wang HJ, and Zhang D (2015). Statistical methods for generalized linear model with covariates subject to detection limits. Statistics in Biosciences 7, 68–79.
- Breslow NE (1972). Discussion of Professor Cox’s paper. Journal of the Royal Statistical Society: Series B (Methodological) 34(2), 216–217.
- Datta S (2005). Estimating the mean life time using right censored data. Statistical Methodology 2(1), 65–69.
- Huntington Study Group (1996). Unified Huntington’s disease rating scale: Reliability and consistency. Movement Disorders 11(2), 136–142.
- Klein J and Moeschberger M (2003). Survival Analysis: Techniques for Censored and Truncated Data. 2nd Edition. New York: Springer.
- Little RJA (1992). Regression with missing X’s: A review. Journal of the American Statistical Association 87(420), 1227–1237.
- Little RJA and Rubin DB (2002). Statistical Analysis with Missing Data. Hoboken: John Wiley & Sons.
- Long JD, Paulsen JS, Marder K, Zhang Y, Kim J, Mills JA, and Researchers of the PREDICT-HD Huntington’s Study Group (2014). Tracking motor impairments in the progression of Huntington’s disease. Movement Disorders 29(3), 311–319.
- Lotspeich SC, Grosser KF, and Garcia TP (2022). Correcting conditional mean imputation for censored covariates and improving usability. Biometrical Journal 64, 858–862.
- Moeschberger M and Klein J (1985). A comparison of several methods of estimating the survival function when there is extreme right censoring. Biometrics 41(1), 253–259.
- Paulsen JS, Langbehn DR, Stout JC, Aylward E, Ross CA, Nance M, Guttman M, Johnson S, MacDonald M, Beglinger LJ, Duff K, Kayson E, Biglan K, Shoulson I, Oakes D, Hayden M, and Predict-HD Investigators and Coordinators of the Huntington Study Group (2008). Detection of Huntington’s disease decades before diagnosis: the Predict-HD study. Journal of Neurology, Neurosurgery & Psychiatry 79(8), 874–880.
- Paulsen JS, Lourens S, Kieburtz K, and Zhang Y (2019). Sample enrichment for clinical trials to show delay of onset in Huntington disease. Movement Disorders 34(2), 274–280.
- R Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
- Reid N and Cox DR (1984). Analysis of Survival Data. New York: Chapman and Hall.
- Richardson DB and Ciampi A (2003). Effects of exposure measurement error when an exposure variable is constrained by a lower limit. American Journal of Epidemiology 157, 355–363.
- Rubin DB (2004). Multiple Imputation for Nonresponse in Surveys, Volume 81. John Wiley & Sons.
- Schobel S, Palermo G, Auinger P, Long J, Ma S, Khwaja O, Trundell D, Cudkowicz M, Hersch S, Sampaio C, Dorsey E, Leavitt B, Kieburtz K, Sevigny J, Langbehn D, Tabrizi S, and TRACK-HD, COHORT, CARE-HD, and 2CARE Huntington Study Group Investigators (2017). Motor, cognitive, and functional declines contribute to a single progressive factor in early HD. Neurology 89(24), 2495–2502.
- The Huntington’s Disease Collaborative Research Group (1993). A novel gene containing a trinucleotide repeat that is expanded and unstable on Huntington’s disease chromosomes. Cell 72(6), 971–983.
- Therneau TM and Grambsch PM (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer.
- Ying Z (1989). A note on the asymptotic properties of the product-limit estimator on the whole line. Statistics & Probability Letters 7(4), 311–314.
- Zhang Y, Long JD, Mills JA, Warner JH, Lu W, Paulsen JS, the PREDICT-HD Investigators, and Coordinators of the Huntington Study Group (2011). Indexing disease progression at study entry with individuals at-risk for Huntington disease. American Journal of Medical Genetics Part B: Neuropsychiatric Genetics 156B(7), 751–763.