Abstract
Survival models that incorporate frailties are commonly used to analyze time-to-event data collected over distinct spatial regions. While incomplete data are unavoidable and a common complication in the statistical analysis of spatial survival data, most researchers still ignore the missing data problem. In this paper, we propose a geostatistical modeling approach for incomplete spatially correlated survival data. We achieve this by exploring missingness in outcome, covariates, and spatial locations. In the process, we analyze incomplete spatially-referenced survival data using a Weibull model for the baseline hazard function and correlated log-Gaussian frailties to model spatial correlation. We illustrate the proposed method with simulated data and an application to geo-referenced COVID-19 data from Ghana. There are several disagreements between the parameter estimates and credible interval widths obtained using our proposed approach and those obtained using complete case analysis. Based on these findings, we argue that our approach provides more reliable parameter estimates and has higher predictive accuracy.
Keywords: Bayesian modeling, COVID-19, Frailties, Incomplete data, Multiple imputation, Spatial survival
1. Introduction
Survival analysis is used to model and predict time-to-event data, and has a long history in biostatistics, epidemiology, public health, engineering, economics, marketing, social sciences, behavioral sciences, political science, and medical research (Cox and Oakes, 1984, Banerjee et al., 2003a, Thamrin et al., 2017). In health studies, often the event is infection or death, and usually survival time is defined in days (or years). If survival times may vary between locations, including spatial information in the survival model will be beneficial (Motarjem et al., 2020, Aswi et al., 2020, Banerjee et al., 2003a, Zhou et al., 2017). This is due to the important role that geographical information plays in predicting survival and identifying high risk areas so that necessary action can be taken by policy makers (Lawson et al., 2016, Aswi et al., 2020, Banerjee et al., 2003a). Unlike ordinary regression models, survival methods adequately make use of information from both censored and uncensored observations in estimating significant model parameters (Cox and Oakes, 1984, Aswi et al., 2020).
Time-to-event data collected over distinct spatial regions are often grouped into strata (or clusters) such as clinical sites or geographical regions (Aswi et al., 2020, Banerjee et al., 2003a). The spatial arrangement of such strata is usually modeled in one of two main settings: geostatistical approaches, where we use the exact geographic locations (for example, latitude and longitude) of the strata, and lattice approaches, where we only use the positions of the strata relative to each other (for example, which counties or regions neighbor which others) (Banerjee et al., 2003a, Vaupel et al., 1979). Hierarchical models have been widely applied to analyze such spatial arrangements (Vaupel et al., 1979, Motarjem et al., 2020, Li and Ryan, 2002, Banerjee et al., 2003a, Hanson et al., 2012).
Spatial models that take geographic location into consideration are becoming more popular in survival analysis. The reasons are numerous and include: (1) the spatial variation in survival times is of scientific interest (Taylor and Rowlingson, 2017); (2) it is easier to apply Bayesian approaches to frailty models that can account for spatial clustering because of the advances in computing such as Markov chain Monte Carlo (MCMC) and Geographic Information System (Banerjee et al., 2003a, Banerjee et al., 2003b, Song, 2004); (3) to help avoid bias in statistical results and explain some significant information that could have been easily disregarded without considering geographic location (Munoz et al., 2010, Thamrin et al., 2021); (4) spatial location serves as a proxy for unmeasured regional characteristics such as socioeconomic status, access to health care, and pollution among others (Zhou et al., 2017, Banerjee et al., 2003b); and (5) it plays an important role in health research, since it allows policy makers to better understand the health needs of communities, to identify barriers to health care, and to allocate appropriate health resources in a cost-effective manner (Lawson et al., 2016, Zhou et al., 2017).
Spatial survival frailty models have recently emerged in the literature and have received tremendous attention from both theoretical and practical standpoints, especially considering the important role of geographical information in predicting survival (Zhou et al., 2017, Banerjee et al., 2003a, Aswi et al., 2020, Thamrin et al., 2021). Some of the most cited examples and methodological developments in the analysis of spatial survival data include: Modeling spatial variation in leukemia survival data (Henderson et al., 2002); Bayesian hierarchical model for considering the spatial–temporal correlation of data (Hanson et al., 2012); COVID-19 mortality risk factors using survival analysis (Nasution et al., 2022); Modeling spatially correlated survival data for individuals with multiple cancers (Lawson et al., 2016); Parametric, semiparametric and nonparametric analysis for spatially correlated survival data (Zhou, 2015, Banerjee et al., 2003a); Log-Gaussian frailties in a semi-parametric framework (Li and Ryan, 2002); Proportional hazards and proportional odds models (Banerjee et al., 2003a); Spatially correlated frailties modeled by a log-Gaussian stochastic process (Taylor and Rowlingson, 2017); Bayesian Weibull and Cox semiparametric spatial models (Aswi et al., 2020); Estimate of recovery times of COVID-19 patients in India (Mahanta et al., 2021); Bayesian spatial survival modeling for dengue fever in Indonesia (Thamrin et al., 2021); and Modeling the time to detection of urban tuberculosis in Portugal (Nunes and Taylor, 2016).
In each of the papers above, no consideration was given to the incomplete structure of the data. Most of these authors considered only complete data, even though most datasets are incomplete. For example, in Nasution et al., 2022, Motarjem et al., 2020, Hesam et al., 2018, Banerjee et al., 2003b and Aswi et al. (2020), the authors did not address how the missing covariates or missing spatial information were handled. The authors simply dropped the missing values or overlooked the complication of the missing values and used a dataset with only observed cases. In Henderson et al. (2002), the authors adopted MCMC procedures with frailties considered as missing data. In Nunes and Taylor, 2016, Su et al., 2020 and Banerjee et al. (2003b), the authors assumed the lack of data was unlikely to significantly affect their model parameter estimates of interest.
In other papers, for example (Ghazali et al., 2021, LeSage and Pace, 2004), the authors grouped the various missing values together as “missing” and recorded this as another category of the variable. By doing this, however, the authors are moving away from an ignorable to a nonignorable data structure. Estimating parameters with nonignorable missing data is more complex, and this complexity is unfortunately ignored in these papers. In reality, when dealing with a nonignorable data structure, one must specify a model for the missingness and incorporate it into the complete data log-likelihood. However, when the right missingness model cannot be specified, this may lead to biased parameter estimates (Rubin, 1976, Little and Wang, 1996). Nonignorability occurs when nonresponse is related to the values of the missing variables (Rubin, 1976, Little and Wang, 1996).
Despite the long history of the significant effect of missing data on statistical analyses, the spatial modeling literature has not provided a systematic treatment of the missing data problem (Panzera et al., 2016). Missing data are common in the spatial realm, and handling missing data is a challenging problem. Past literature has shown that even in the Bayesian paradigm, complete case analysis (CCA) often leads to posterior distributions with properties that are quite different from those of the posterior distribution based on all observed cases (Ibrahim et al., 2014, Dhara et al., 2020). Hence, CCA should not be considered the standard in spatial research, because appropriate inference on the association of the covariates with the event-specific survival times relies on careful consideration of underlying spatial correlations and complete data (Li and Ryan, 2002, Ibrahim et al., 2012, Ibrahim et al., 2014, Shand et al., 2018).
Incomplete data are common and unavoidable across almost all research areas, in particular time-to-event analyses. With CCA every incomplete observation is deleted, and only complete cases are kept in the dataset. CCA is the simplest and most used approach for handling incomplete data, because CCA works for all data types, and is the default for most statistical software. In many cases, CCA may result in biased parameter estimates, loss of precision, and reduced statistical power, leading sometimes to invalid and inaccurate conclusions (Little and Rubin, 2014, Allotey and Harel, 2019, Ibrahim et al., 2012). However, in some situations, the use of CCA may result in accurate parameter estimates (Harel, 2007, Harel and Zhou, 2007, Allotey and Harel, 2019, Dhara et al., 2020).
Several papers have highlighted the shortcomings of CCA and other ad-hoc missing data procedures (such as mean substitution, regression imputation, and last observation carried forward, among others) and suggested the use of more theoretically based procedures that have been shown to have better properties (Stuart et al., 2009, Ibrahim et al., 2012, Harel and Zhou, 2006). Some of the suggested techniques include multiple imputation (MI) (Harel et al., 2017, Harel and Zhou, 2007, Little and Rubin, 2014, Ibrahim et al., 2012, Harel and Zhou, 2006), Bayesian methods (Oba et al., 2003), inverse probability weighting (Sun and Tchetgen Tchetgen, 2018, Perkins et al., 2017), maximum likelihood (Enders, 2001, Newman, 2003), and sensitivity analysis (to assess the robustness of the results) (Daniels and Hogan, 2008, Ibrahim et al., 2012). All these approaches are based on assumptions related to the reasons for the incomplete data. Usually, methods for handling missing data depend on the pattern of missingness and the mechanism that generates the missing values. These theoretically based techniques are implemented in many statistical software packages, but a general lack of understanding and their computational complexity seem to have limited their use in research (Donders et al., 2006).
In this paper, we focus on MI for handling incomplete data for spatially correlated survival data. We employ MI because it is a commonly used method in health research for dealing with incomplete data and has very broad applications. MI has several other advantages including: (1) one can impute both continuous and categorical variables, and MI is anticipated to become the standard approach for handling incomplete data in health research (Rubin, 1996, Collins et al., 2001); (2) unlike other approaches, the imputation models may differ from the analysis models under MI (Rubin, 2004); (3) some papers have shown that a fully Bayesian approach should only be used after MI (Zhou and Reiter, 2010); (4) results from several simulations and real data examples have shown that MI can produce unbiased results and improve efficiency significantly with a high number of imputations (Erler et al., 2016, Chen and Ibrahim, 2014); and (5) under ignorability, MI provides unbiased and accurate parameter estimates, and improves the validity of medical research results (Harel and Zhou, 2007, Rubin, 1976, Rubin, 1996, Rubin, 2004).
The main objective of this paper is to propose a geostatistical modeling approach for incomplete spatially correlated survival data, applying and expanding on concepts from Banerjee et al., 2003a, Li and Ryan, 2002, Motarjem et al., 2020 and Thamrin et al. (2021), to mention but a few. Even though spatial survival models have already been discussed in the literature, none of the studies have focused on appropriately handling missing covariates and missing spatial information in the data at the same time. To the best of our knowledge, in the geostatistical setting, there are currently no studies that accommodate both missing covariates and missing spatial information. We apply our approach to the analysis of geo-referenced COVID-19 mortality data from Ghana, making use of important covariates while accounting for possible spatially correlated differences in hazards among districts and regions. The data are described in detail below. Like many other health studies, the data are challenging, with many missing covariates and much missing spatial information. In this paper, we combine MI and the universal kriging procedure widely used in geostatistics to examine cases with missing observations. This approach proved to be beneficial in terms of producing improved parameter estimates and accurate predictions compared to CCA, which ignores the missing covariates and missing spatial information in the data.
The rest of the paper is organized as follows. In Section 2, we introduce geostatistical spatial survival parametric models. In Section 3, we discuss the use of MI for handling missing data with a geostatistical model and how it will be assessed in simulations. In Section 4, we perform extensive simulations to illustrate the usefulness of spatial survival models, conduct MCMC diagnostics to check whether the quality of a sample generated with an MCMC algorithm is sufficient to provide an accurate approximation of the target distribution, and discuss models for fitting and predicting spatial survival data. In Section 5, we describe the data, discuss the models for analyzing the survival times of such datasets, and analyze the results from both methodological and practical standpoints. Finally, in Section 6, we summarize our findings and suggest some recommendations for future work.
2. Geostatistical spatial survival parametric models
In this paper, we employ a traditional approach to modeling spatial association among observations at a fixed set of spatial locations, referred to by Cressie (2015) and Lawson et al. (2016) as geostatistical modeling. Geostatistical models are not commonly used in the health sciences compared to area-level spatial models, mainly due to identifiability issues associated with spatial models and stationarity assumption in geostatistical models (Brown, 2016, Lawson et al., 2016). However, advances in mobile technologies have made georeferencing to point locations more common and recent advances in computing have made it possible for researchers to better fit geostatistical models (Brown, 2016, Lawson et al., 2016).
We propose an MI approach to analyze incomplete spatially-referenced survival data using a parametric model for the baseline hazard function and correlated log-Gaussian frailties to model spatial correlation. The inclusion of frailties (or random effects) helps analyze spatially clustered survival data, as well as capture and describe the dependence of observations within a cluster and/or the heterogeneity between clusters. Sometimes correlated survival data can be achieved through considering the location of each item resulting in the formation of spatial survival data (Motarjem et al., 2020, Banerjee et al., 2003a). We assume that random effects corresponding to observations close to each other are similar in magnitude and more correlated than those further away (Banerjee et al., 2003a).
For the parametric distribution, we choose the Weibull because it is commonly used in survival analysis and it seems to represent a good trade-off between simplicity and flexibility compared to the exponential, gamma, and lognormal distributions, among others (Banerjee et al., 2003a, Mudholkar et al., 1996). In addition, the Weibull distribution is analytically tractable and computationally friendly compared to other distributions (Mudholkar et al., 1996). Therefore, we employ the Weibull model for the baseline hazard function and correlated log-Gaussian frailties to model spatial correlation. To incorporate spatial information, the proportional hazards model is written in the form:
$$h(t_i; \psi, W_i) = h_0(t_i; \omega)\,\exp\left(x_i^{\top}\beta + W_i\right) \qquad (1)$$
where $h_0(\cdot)$ is the baseline hazard function, $t_i$ is the observed time for the $i$th individual, $x_i$ is a vector of covariate values for the $i$th observation and the parameters of this model, $\psi = (\beta, \omega, \eta)$, are covariate effects, the parameter of the baseline hazard and the parameter of the covariance function of a spatially latent Gaussian field W, respectively. Here, $W_i$ is the value of the field at the location of individual i. We define W as:
$$W = -\frac{\sigma^2}{2} + \Sigma^{1/2}\,\Gamma \qquad (2)$$
where $\Sigma^{1/2}$ is the Cholesky decomposition of the matrix $\Sigma$ of the covariance function evaluated at each of the coordinates of the observations, the constant $-\sigma^2/2$ is added to every component, and we assume a priori $\Gamma \sim N(0, I)$, where $I$ is the identity matrix (Taylor, 2015, Taylor and Rowlingson, 2017). We do this because we work with $\Gamma$ directly instead of $W$.
From Eq. (1), the term $\exp(x_i^{\top}\beta + W_i)$ describes the hazard ratio (HR) for the $i$th observation. The HR is defined in two parts: the first, $\exp(x_i^{\top}\beta)$, is the part of the risk that can be explained by the available covariates; and the second, $\exp(W_i)$, is the unexplained risk (Taylor and Rowlingson, 2017). With the latter, we parameterize the latent field W in such a way that $E[\exp(W_i)] = 1$ (Taylor and Rowlingson, 2017). We do this partly to avoid identifiability issues, but also to give a direct and useful interpretation for $\exp(W_i)$, as a multiplicative scaling on the hazard function (Taylor and Rowlingson, 2017, Su et al., 2020). The geostatistical spatial survival model in Eq. (1) must be identifiable for inference on it to yield reliable results. In the case of a Gaussian field whose marginal variance is $\sigma^2$, this is done by setting the mean of W to be $-\sigma^2/2$. That is, $E(W) = -\sigma^2/2$, so that $E[\exp(W)] = 1$. Proof of identifiability of this model can be found elsewhere (Li and Ryan, 2002, Motarjem et al., 2020).
For geospatial data, it is reasonable to assume that nearby observations will have similar response values, so we seek to model this relationship via an autocorrelation function (Banerjee et al., 2003a, Motarjem et al., 2020, Su et al., 2020). We use an exponential covariance function for the spatial random effects, that is:
$$\operatorname{cov}(W_i, W_j) = \sigma^2 \exp\left(-\frac{d_{ij}}{\phi}\right) \qquad (3)$$
where $d_{ij}$ is the Euclidean distance between the centroid of the cell representing $W_i$ and that representing $W_j$, $\sigma^2 > 0$ is the marginal variance of the field and $\phi > 0$ is the spatial decay parameter (the larger $\phi$, the longer the range of spatial dependence there is in the field) (Taylor and Rowlingson, 2017, Taylor, 2015). We choose the exponential covariance function because it is one of the spatial covariance functions well suited to epidemiological applications and it belongs to the Matérn family, which is a rich family of autocorrelation functions (Cressie, 2015, Motarjem et al., 2020).
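The construction of the latent field can be illustrated with a minimal sketch. It assumes the field is built as $W = -\sigma^2/2 + L\Gamma$, where $L$ is the lower Cholesky factor of an exponential covariance matrix with entries $\sigma^2 \exp(-d_{ij}/\phi)$ and $\Gamma \sim N(0, I)$. The site coordinates, the values of `sigma2` and `phi`, and all function names are hypothetical and chosen only for illustration; this is not the code used in the paper.

```python
import math
import random


def exp_cov_matrix(coords, sigma2, phi):
    """Covariance matrix with entries sigma2 * exp(-d_ij / phi)."""
    return [[sigma2 * math.exp(-math.dist(a, b) / phi) for b in coords]
            for a in coords]


def cholesky(a):
    """Lower-triangular factor low such that low @ low.T equals a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_field(low, sigma2, rng):
    """One draw of W = -sigma2/2 + low @ gamma, with gamma ~ N(0, I)."""
    gamma = [rng.gauss(0.0, 1.0) for _ in low]
    return [-sigma2 / 2.0 + sum(row[k] * gamma[k] for k in range(i + 1))
            for i, row in enumerate(low)]


rng = random.Random(1)
# Hypothetical site coordinates (km) and illustrative parameter values.
coords = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(5)]
sigma2, phi = 0.5, 2.0

cov = exp_cov_matrix(coords, sigma2, phi)
low = cholesky(cov)
draws = [draw_field(low, sigma2, rng) for _ in range(20000)]

# Monte Carlo check of the identifiability constraint E[exp(W_i)] = 1.
mean_exp_w = sum(math.exp(w[0]) for w in draws) / len(draws)
print(f"Monte Carlo estimate of E[exp(W_1)]: {mean_exp_w:.3f}")
```

The shift by $-\sigma^2/2$ is what makes the Monte Carlo average of $\exp(W_i)$ settle near one; without it the average would drift towards $\exp(\sigma^2/2)$.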
The baseline hazard function and the baseline cumulative hazard function derived from the Weibull survival model have the form:
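$$h_0(t) = \alpha\lambda t^{\alpha-1} \qquad (4)$$
$$H_0(t) = \lambda t^{\alpha} \qquad (5)$$
where $t > 0$, and both $\alpha$ (shape parameter) and $\lambda$ (scale parameter) are $> 0$. With $\alpha$ reflecting the shape of the monotonic hazard in the Weibull model, $\alpha > 1$ reflects a monotonically rising hazard rate, $\alpha < 1$ reflects a monotonically declining hazard, and $\alpha = 1$ reflects a flat hazard. The density function $f(t_i)$, cumulative function $F(t_i)$, and survival function $S(t_i)$ for the $i$th individual for the parametric proportional hazards spatial survival model are of the forms:
$$f(t_i) = \alpha\lambda t_i^{\alpha-1}\exp\left(x_i^{\top}\beta + W_i\right)\exp\left\{-\lambda t_i^{\alpha}\exp\left(x_i^{\top}\beta + W_i\right)\right\} \qquad (6)$$
$$F(t_i) = 1 - \exp\left\{-\lambda t_i^{\alpha}\exp\left(x_i^{\top}\beta + W_i\right)\right\} \qquad (7)$$
$$S(t_i) = \exp\left\{-\lambda t_i^{\alpha}\exp\left(x_i^{\top}\beta + W_i\right)\right\} \qquad (8)$$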
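As a quick numerical illustration of how these quantities relate, the sketch below assumes the Weibull parameterization $h_0(t) = \alpha\lambda t^{\alpha-1}$ and $H_0(t) = \lambda t^{\alpha}$ and collapses the covariate effects and frailty into a single linear predictor `lin_pred` (standing in for $x_i^{\top}\beta + W_i$). The function names and parameter values are hypothetical.

```python
import math


def baseline_hazard(t, alpha, lam):
    """Weibull baseline hazard h0(t) = alpha * lam * t**(alpha - 1)."""
    return alpha * lam * t ** (alpha - 1)


def baseline_cum_hazard(t, alpha, lam):
    """Weibull baseline cumulative hazard H0(t) = lam * t**alpha."""
    return lam * t ** alpha


def survival(t, alpha, lam, lin_pred):
    """S(t) = exp(-H0(t) * exp(lin_pred)) under proportional hazards."""
    return math.exp(-baseline_cum_hazard(t, alpha, lam) * math.exp(lin_pred))


def cdf(t, alpha, lam, lin_pred):
    """F(t) = 1 - S(t)."""
    return 1.0 - survival(t, alpha, lam, lin_pred)


def density(t, alpha, lam, lin_pred):
    """f(t) = h0(t) * exp(lin_pred) * S(t)."""
    hazard = baseline_hazard(t, alpha, lam) * math.exp(lin_pred)
    return hazard * survival(t, alpha, lam, lin_pred)


# Illustrative values only: rising hazard (alpha > 1), time in days.
alpha, lam, lin_pred, t = 1.5, 0.01, 0.4, 12.0

# The density should match the numerical derivative of F at t.
step = 1e-5
numeric_density = (cdf(t + step, alpha, lam, lin_pred)
                   - cdf(t - step, alpha, lam, lin_pred)) / (2 * step)
print(density(t, alpha, lam, lin_pred), numeric_density)
```

Writing the density as hazard times survival makes the role of the frailty explicit: a positive $W_i$ scales the hazard up at every time point and therefore lowers the survival curve.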
The above probability distributions rely on unknown parameters $\beta$, $\alpha$, $\lambda$, $\sigma$, and $\phi$. In the Bayesian paradigm, we define a prior density for each parameter of interest and the data modify the prior by using the likelihood to arrive at the posterior (Taylor and Rowlingson, 2017). With an appropriate prior choice, we use an MCMC algorithm to draw samples from the posterior density to perform Bayesian inference for each class of models (Taylor and Rowlingson, 2017, Li and Ryan, 2002). The idea is to use MCMC to draw samples from the posterior to estimate model parameters. We use different priors in the process to investigate the influence of the priors on the posterior estimates. The simulation results indicate that the posterior estimates are relatively insensitive to different prior choices. The differences in parameter estimates between different prior choices were mostly close to zero and, most importantly, the substantive inferences resulting from the analyses remain unchanged.
We employ MCMC because it provides flexibility and an intuitive basis on which to perform inference, allowing us to easily generate plots showing quantiles of quantities of interest (such as the hazard, baseline hazard, survival function) (Taylor and Rowlingson, 2017). We look for evidence of satisfactory convergence and mixing in the MCMC chain by considering trace plots of the model parameters and of W. Once these conditions are satisfied, we proceed to estimate each parameter via the Bayesian paradigm. Several advantages of MCMC inferential algorithms for the parametric proportional hazards model can be found elsewhere (Ibrahim et al., 2001b, Banerjee et al., 2003a).
Throughout the process, we develop several competing models, then evaluate and select the best model for statistical inference. Usually, model comparisons in the Bayesian paradigm make use of the Bayes factor (Banerjee et al., 2003a). However, Bayes factors are difficult to compute using MCMC approaches, and in any case are not well defined for flat (or noninformative) priors like ours (Banerjee et al., 2003a, Lawson et al., 2016). Therefore, we compare models using mean absolute percentage bias (MAPB) (Motarjem et al., 2020), and the model with the smaller MAPB is considered satisfactory. In addition, we perform other statistical model checking by considering and evaluating coverage probability (CP), credible interval (CI) widths, and standardized bias (SB) in deciding the best competing model. Using the approach employed by Collins et al. (2001), a result is considered unbiased if its SB is between −0.40 and 0.40 inclusive. We consider any SB with an absolute value greater than 40% problematic, because once the absolute SB exceeds 40%, it begins to have a significant effect on the coverage and efficiency. Furthermore, we compare plots of prior and posterior to check that the data are sufficient to allow identifiability of the model parameters.
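The evaluation criteria can be sketched for a single parameter across simulation replicates. The formulas below are assumptions, since the text does not spell them out: SB is taken as 100 times the mean bias divided by the standard deviation of the estimates, MAPB as the mean of the absolute relative errors in percent, and CP as the share of intervals that contain the true value. The replicate values and function names are made up for illustration.

```python
import statistics


def standardized_bias(estimates, truth):
    """Assumed form: 100 * (mean estimate - truth) / SD of the estimates."""
    return 100.0 * (statistics.fmean(estimates) - truth) / statistics.stdev(estimates)


def mean_abs_percentage_bias(estimates, truth):
    """Assumed form: mean of |estimate - truth| / |truth|, in percent."""
    return 100.0 * statistics.fmean(abs(e - truth) / abs(truth) for e in estimates)


def coverage_probability(intervals, truth):
    """Share of (lower, upper) intervals that contain the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)


def mean_interval_width(intervals):
    """Average width of the (lower, upper) intervals."""
    return statistics.fmean(hi - lo for lo, hi in intervals)


# Made-up replicate results for one parameter whose true value is 1.0.
truth = 1.0
estimates = [0.9, 1.1, 1.2, 1.0]
intervals = [(e - 0.15, e + 0.15) for e in estimates]

sb = standardized_bias(estimates, truth)
mapb = mean_abs_percentage_bias(estimates, truth)
cp = coverage_probability(intervals, truth)
width = mean_interval_width(intervals)
print(sb, mapb, cp, width)
```

Under these assumed definitions, an SB just below 40 would pass the 40% rule stated above, while the coverage of 0.75 in this toy example would fall short of a nominal 95% level.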
3. Missing data
Incomplete data are common and unavoidable in all types of health studies. Like numerous other health studies, our data have many incomplete observations. In this paper, we focus on obtaining adjusted estimates based on incomplete data in the outcome, covariates, and spatial information. We discuss the use of MI for handling missing data with a geostatistical model, since our main substantive goal is to allow for incomplete data in spatially correlated survival models. In addition, we discuss different missing data mechanisms and most importantly how to use MI to explain the pattern of COVID-19 mortality using important missing covariates and missing spatial information. Hence, we propose and explore the use of geostatistical models with MI to allow for incomplete data in spatial survival models in epidemiological studies.
3.1. Missing data mechanisms
Missing data mechanisms generally fall into one of three main categories: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Rubin, 1976, Little and Rubin, 2014, Harel, 2007, Allotey and Harel, 2019). The missing mechanism can be specified as MCAR when the reasons for missing values have nothing to do with observed or missing data (Harel et al., 2017, Schafer and Olsen, 1998). This usually occurs when the missingness process is totally random, for instance, when health records are missing for a patient whose questionnaire got misplaced in the mail. While CCA of MCAR data usually yields unbiased estimates, most often efficiency is lost, because CCA disregards information on incomplete cases (Perkins et al., 2017). If data are MCAR, application of principled methods will also yield unbiased estimates, as well as improving the efficiency by recovering information from incomplete cases (Perkins et al., 2017, Harel et al., 2017, Allotey and Harel, 2019).
MAR can be considered when the reasons for the missing values only depend on observed quantities. It is important to note that while the name specifies “missing at random”, it is best to think of it as conditionally missing (Harel et al., 2017, Schafer and Graham, 2002). For instance, a registry examining depression may encounter data that are MAR if men are less likely to complete a survey about depression severity than women. That is, the probability of survey completion is related to gender (which is fully observed) but is not related to the severity of their depression (Mack et al., 2018). When missing data are MAR, CCA may result in biased parameter estimates, which can lead to invalid conclusions (Perkins et al., 2017, Donders et al., 2006). However, theoretically based techniques such as MI usually yield unbiased results when missing data are MAR (Perkins et al., 2017, Rubin, 1978, Rubin, 1976).
MNAR occurs when missingness is no longer “random”, which means that the reasons for missingness depend on incomplete data (He, 2010). A good example of MNAR is when data are missing on intelligence quotient (IQ), where people with low IQ are more likely to have missing observations. In addition, MNAR does allow for the possibility that “missingness” on Y is related to “missingness” on some other variable X. For instance, an individual who refuses to provide their age when answering a survey may also refuse to provide their income (Allison, 2001). Biases caused by data that are MNAR can be evaluated through sensitivity analyses by examining the effect of different assumptions about the missing data mechanism. More insight about implication of these methods under MNAR can be found elsewhere (Bartlett et al., 2014b, Bartlett et al., 2014a).
To better understand the three different types of missing data mechanisms, we illustrate them with an example in a clinical trial setting. In this setting, patients who receive a COVID-19 vaccine are followed up for a year to determine if the treatment was effective. If the patients missed their visits totally at random, the data would be considered MCAR. However, if the probability of missing a visit is somehow related to previous responses, the data would be MAR. If the patients with COVID-19 are more likely to miss their visits than patients without COVID-19, the data would be MNAR. With MNAR, the probability of being missing varies for reasons that are unknown to the researcher, and it is the most complex scenario among the three missing data mechanisms.
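The three mechanisms can also be contrasted in a small simulation. The sketch below is a toy setup, not the paper's design: `age` is always observed, `temp` (body temperature) is subject to missingness, and the missingness probabilities are hypothetical logistic functions. It compares the complete-case mean of `temp` with the mean over the full data under each mechanism.

```python
import math
import random
import statistics


def expit(z):
    """Logistic function mapping a real number to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def p_missing(mechanism, a, t):
    """Hypothetical probability that temp is missing for one patient."""
    if mechanism == "MCAR":
        return 0.3                      # unrelated to any data
    if mechanism == "MAR":
        return expit((a - 50) / 5)      # depends only on observed age
    if mechanism == "MNAR":
        return expit((t - 37.0) / 0.3)  # depends on the missing value itself
    raise ValueError(mechanism)


rng = random.Random(2024)
n = 20000
age = [rng.gauss(50, 10) for _ in range(n)]
# Temperature rises slightly with age, so age is informative about temp.
temp = [37.0 + 0.03 * (a - 50) + rng.gauss(0, 0.5) for a in age]

full_mean = statistics.fmean(temp)
cc_mean = {}
for mechanism in ("MCAR", "MAR", "MNAR"):
    kept = [t for a, t in zip(age, temp)
            if rng.random() >= p_missing(mechanism, a, t)]
    cc_mean[mechanism] = statistics.fmean(kept)

print(f"full data: {full_mean:.3f}")
for mechanism, value in cc_mean.items():
    print(f"complete cases under {mechanism}: {value:.3f}")
```

In this toy setup the complete-case mean stays close to the full-data mean under MCAR but is pulled downwards under MAR and MNAR. Under MAR the distortion could in principle be corrected by conditioning on `age`, whereas under MNAR the observed data alone do not carry enough information to do so.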
When data are incomplete, instead of modeling only the data (in form of likelihood or posterior distribution), one needs to model the joint distribution of the data and the missingness process. Statistical methods that do not model the distribution of the missing data (often called the missing data mechanism) are subject to bias (Ibrahim et al., 2001a). The modeling depends on some untestable assumptions (MAR and MNAR). Using observed data, one can test the MCAR assumption by practically disproving it (Little, 1988a, Little, 1988b). However, MAR and MNAR mechanisms are untestable with observed data (Harel et al., 2017, Perkins et al., 2017), which means that one cannot distinguish between these two assumptions and the choice is based on assumptions but not the data (Harel et al., 2017, Perkins et al., 2017).
In general, MCAR is a very strong assumption, and it is uncommon in typical health studies. MAR is a less confining (or weaker) assumption and is more likely than MCAR to hold in health studies, especially when we have comprehensive covariate information (Allotey and Harel, 2019). Most implementations assume the missing data are MAR, that is, given the observed data, the reason for the missing data does not depend on the unobserved data (Carpenter et al., 2007, Schafer, 1997). MNAR occurs when nonresponse is related to the values of the missing variables, also referred to as nonignorability (Rubin, 1976, Lipsitz et al., 1997). The most commonly used approaches to handle nonignorable missing data in research are broadly classified into three main types of models: selection models (Ibrahim et al., 2001a), shared parameter models (Tsonaka et al., 2009, Vonesh et al., 2006), and pattern-mixture models (Glynn et al., 1986, Ibrahim et al., 2001a).
Rubin (1976) introduced the minimum conditions under which the missingness process does not need to be modeled. In other words, it is not necessary to incorporate the missingness mechanism together with the observed-data process. In order for this to occur, two assumptions must be satisfied. First, either the MAR or MCAR assumptions must hold (Rubin, 1978, Harel et al., 2017, Allotey and Harel, 2019, Rubin, 1976). Second, the joint prior distribution of the data parameters and the missingness parameters can be split into the product of these two priors. In this paper, we focus mainly on the implications for MI, where the equivalent assumption is that the parameter estimates used for the imputation model and the parameter estimates used for the analysis model are independent (Rubin, 1976, Rubin, 1978). Both assumptions together imply ignorability, which means that the missingness model necessary under MNAR can be ignored and the observed data will be sufficient (Harel et al., 2017, Rubin, 1976, Rubin, 1978).
3.2. Multiple imputation
MI is a three-stage simulation-based technique which replaces missing data with plausible values and allows for additional uncertainty due to the missing information caused by the incomplete data (Rubin, 1976, Rubin, 1978, Rubin, 2004).
MI involves three main stages. The first stage is the imputation or fill-in stage, in which missing data are filled in with some plausible values (usually from a parametric or a nonparametric model), and a complete dataset is created. This fill-in process is repeated m times. The second stage is the analysis stage, in which each of the m complete datasets is analyzed using a complete-data statistical method of interest (for example, multiple linear regression or Cox regression). The last stage is the pooling or combining stage, in which parameter estimates of interest (for example, regression coefficients) obtained from each “completed dataset” analyzed are combined for a final comprehensive inference (Rubin, 1976, Rubin, 2004).
In this paper, we explore MI using geostatistical models for handling incomplete data in epidemiological research. To help account for the spatial variation in epidemiological research, a geostatistical model is considered as the imputation model. While MI can result in an improvement of parameter estimates by itself (LeSage and Pace, 2004), in cases of spatially dependent data, it is also important to consider the characteristics of nearby observations. The characteristics of observations near a missing value may contain useful information that can influence the parameter estimates by improving spatial predictions (LeSage and Pace, 2004, Cressie, 2015).
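The pooling stage is commonly carried out with Rubin's combining rules. The sketch below assumes those rules are applied to a single scalar parameter: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The input numbers and the function name `pool_rubin` are made up for illustration.

```python
import statistics


def pool_rubin(estimates, variances):
    """Combine m completed-data estimates of one parameter with Rubin's rules.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)      # average within-imputation variance
    between = statistics.variance(estimates)  # between-imputation variance (m - 1 divisor)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total


# Made-up results from m = 5 imputed datasets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
variances = [0.010, 0.010, 0.010, 0.010, 0.010]

pooled, total_var = pool_rubin(estimates, variances)
print(pooled, total_var)
```

The between-imputation term is what carries the extra uncertainty due to the missing data: if all m analyses agreed exactly, the total variance would reduce to the average complete-data variance.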
Kriging is a well-known technique for spatial interpolation that generates predictions for the unobserved values of the spatial random process at the unvisited sites (Munoz et al., 2010, Diggle et al., 1998). The kriging estimator is a minimum error weighted linear predictor that assumes a Gaussian distribution for the random process and a model for the variance–covariance matrix; see Diggle et al. (1998) and Cressie (2015) for more details. Diggle et al. (1998) later extended the concept of geostatistical models to non-Gaussian situations within the framework of generalized linear models (Munoz et al., 2010). Unlike other procedures, kriging is based solely on observed sample data points and gives more weight to sample points near a location than to those further away. This helps reduce bias in the predictions, especially if there is spatial correlation in the data. To implement kriging, we used the krige() function in the gstat R package (Pebesma et al., 2015) to impute (or predict) the values that were not observed. Thus, we use MI for imputing missing covariate information, and the spatial prediction method kriging for imputing missing spatial information.
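To make the weighting idea concrete, the sketch below implements ordinary kriging with a known exponential covariance function. This is a simplification and an assumption on our part: it is not universal kriging, it does not call gstat, and the coordinates, values, and parameters are hypothetical. The weights solve the usual kriging system with a Lagrange multiplier that forces them to sum to one.

```python
import math


def exp_cov(distance, sigma2, phi):
    """Exponential covariance sigma2 * exp(-distance / phi)."""
    return sigma2 * math.exp(-distance / phi)


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def ordinary_krige(coords, values, target, sigma2, phi):
    """Predict the value at target as a weighted sum of observed values."""
    n = len(coords)
    # Kriging system: covariances between sites, bordered by the
    # sum-to-one constraint on the weights.
    system = [[exp_cov(math.dist(coords[i], coords[j]), sigma2, phi)
               for j in range(n)] + [1.0] for i in range(n)]
    system.append([1.0] * n + [0.0])
    rhs = [exp_cov(math.dist(c, target), sigma2, phi) for c in coords] + [1.0]
    weights = solve(system, rhs)[:n]
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights


# Hypothetical observed sites on a unit square and one unobserved site.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
values = [1.0, 2.0, 3.0, 4.0]
prediction, weights = ordinary_krige(coords, values, (0.5, 0.5), 1.0, 2.0)
print(prediction, weights)
```

Because the target in this example is equidistant from all four sites, each site receives the same weight; moving the target towards one corner would shift weight to that corner, which is the behavior described above.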
4. Simulation study
We perform extensive simulations to illustrate the usefulness of spatial survival models for geostatistical survival data. We start with complete data and use the results to choose reasonable parameter values and priors. This forms the basis for incorporating incomplete data into spatial survival models, and allows us to assess the use of MI with a geostatistical model. To do this, we first generate a spatial survival dataset using the simulation methods for survival data proposed by Taylor and Rowlingson (2017) and Thamrin et al. (2017). We generalize these methods to generate spatial survival data by using a Gaussian random field.
In the simulation study, spatial survival data are generated from a Weibull (parametric) baseline hazard model with a sample size of n observations. We simulate the spatial random effects using an exponential covariance function and assume this function is known when analyzing the data. We further assume that survival time depends on five individual-level covariates: age, temperature, sex, smoker, and symptomatic. Age and temperature are continuous variables, and the remaining variables are binary.
We run the MCMC sampler for 2,000,000 iterations with a burn-in of 200,000 iterations, retaining every 500th sample. We use the R package spatsurv (Taylor and Rowlingson, 2017) to generate the survival times (in this setting, time to COVID-19 mortality in days), and we generate right-censoring times for individuals (or patients). We assume that the censoring process is independent of the survival times. If the generated censoring time is less than the generated survival time, the observation is considered censored. We use spatsurv because it is computationally efficient, can handle large spatial datasets, and can cope with right-, left-, or interval-censored survival data.
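The data-generating step can be sketched by inverse-transform sampling. The example below is an assumption-laden illustration rather than the spatsurv implementation: it assumes the hazard h(t) = αλt^(α−1) exp(η), where η collects the covariate effects and the spatial frailty, and it uses a hypothetical exponential censoring distribution.

```python
import math
import random

def weibull_time(u, eta, alpha, lam):
    """Invert S(t) = exp(-lam * t**alpha * exp(eta)) at survival probability u."""
    return (-math.log(u) / (lam * math.exp(eta))) ** (1.0 / alpha)

def simulate(etas, alpha, lam, cens_rate, rng):
    """Return (observed time, event indicator) pairs under independent right censoring."""
    out = []
    for eta in etas:
        u = 1.0 - rng.random()             # uniform on (0, 1], avoids log(0)
        t = weibull_time(u, eta, alpha, lam)
        c = rng.expovariate(cens_rate)     # hypothetical censoring time
        out.append((min(t, c), int(t <= c)))   # event = 1 if death observed, 0 if censored
    return out

rng = random.Random(1)
# Hypothetical linear predictors (covariate effects plus frailty) for 200 patients
etas = [rng.gauss(0.0, 0.7) for _ in range(200)]
data = simulate(etas, alpha=0.5, lam=2.0, cens_rate=0.5, rng=rng)
print(sum(e for _, e in data) / len(data))   # observed event proportion
```

Raising or lowering `cens_rate` changes the censoring percentage, which is how scenarios such as 25% and 50% censoring could be approximated in a sketch like this.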
We obtain and specify the coordinates and priors (from the normal distribution) for each parameter. We assume N(0, ) priors for and log ; N(0, 0.8) priors for log ; and N(−3, 0.5) for log . The spatial parameters are expressed in kilometres (km). Other results from the simulation study, not reported in this paper, showed consistent and very good parameter estimates even with different prior choices. In addition, we perform several simulations with different values of using the same values of the other parameters and the results show that the model parameter estimates are more accurate for mid-to-high values of . Finally, we compare plots of the prior and posterior distributions to make sure that the data are sufficient to identify the model parameters.
Table 1 presents the true parameter values, the posterior parameter estimates, and the 95% CI for the complete simulated data using the Weibull distribution for different sample sizes. All the CIs contain the true parameter value. Results are very good and consistent even for smaller sample sizes. The MAPB indicates that more precise parameter estimates are achieved as the sample size increases. We use the 95% Bayesian CI to declare statistical significance. All the variables (age, temperature, sex, smoker, and symptomatic) are significant risk factors for COVID-19 survival. The spatial survival model with the Weibull baseline hazard fitted with n = 1000 is as follows:
| (9) |
Table 1.
Comparison of simulation results for complete data with different sample sizes.
| Parameters | True value | n = 1000: Mean (95% CI) | n = 500: Mean (95% CI) | n = 250: Mean (95% CI) |
|---|---|---|---|---|
| Fixed effects | | | | |
| Age | 1.57 | 1.58 (1.55, 1.59) | 1.58 (1.55, 1.61) | 1.55 (1.49, 1.58) |
| Temperature | 1.75 | 1.75 (1.68, 1.80) | 1.78 (1.70, 1.86) | 1.79 (1.66, 1.93) |
| Sex | 2.22 | 2.21 (1.85, 2.40) | 2.13 (1.89, 2.73) | 2.04 (1.68, 2.62) |
| Smoker | 2.00 | 2.07 (1.88, 2.28) | 1.82 (1.40, 2.23) | 1.76 (1.17, 2.24) |
| Symptomatic | 2.59 | 2.66 (2.64, 3.14) | 2.68 (2.19, 2.86) | 2.41 (1.95, 3.09) |
| Baseline hazard | | | | |
| Alpha | 0.50 | 0.51 (0.46, 0.54) | 0.51 (0.45, 0.54) | 0.48 (0.44, 0.57) |
| Lambda | 2.00 | 2.05 (1.83, 2.24) | 2.08 (2.14, 2.88) | 2.16 (2.09, 3.06) |
| Spatial covariance | | | | |
| Sigma | 0.70 | 0.71 (0.57, 0.88) | 0.72 (0.58, 1.01) | 0.73 (0.65, 1.16) |
| Phi | 0.10 | 0.10 (0.04, 0.10) | 0.07 (0.04, 0.11) | 0.06 (0.03, 0.10) |
| MAPB | | 13.22 | 57.74 | 86.91 |
Table 2 reports the standardized bias (SB), credible interval length (length), and coverage probability (CP) of the 95% CI for each parameter. The 95% CI becomes narrower as the sample size increases, indicating more precise estimates with reliable coverage. Since the reported SB values are close to zero and within the −0.40 to 0.40 bounds for all sample sizes, we conclude that our parameter estimates are unbiased. Notably, for the fixed effects, the interval lengths roughly double when the sample size is decreased fourfold. Overall, the results show that the estimates of the fixed effects, baseline hazard parameters, and spatial covariance parameters are very close to the true parameter values.
Table 2.
Comparison of different criteria under the complete data.
| Parameters | n = 1000: SB, Length, CP | n = 500: SB, Length, CP | n = 250: SB, Length, CP |
|---|---|---|---|
| Age | 0.031, 0.04, 98% | 0.029, 0.06, 98% | −0.055, 0.09, 96% |
| Temperature | 0.000, 0.12, 99% | 0.033, 0.16, 96% | 0.037, 0.27, 95% |
| Sex | −0.002, 0.55, 98% | −0.019, 0.84, 97% | −0.047, 0.94, 97% |
| Smoker | 0.022, 0.40, 99% | −0.038, 0.83, 97% | −0.056, 1.07, 96% |
| Symptomatic | 0.017, 0.50, 99% | 0.024, 0.67, 97% | −0.039, 1.14, 95% |
| Alpha | 0.015, 0.08, 96% | 0.019, 0.09, 93% | −0.038, 0.13, 92% |
| Lambda | 0.015, 0.41, 98% | 0.019, 0.74, 98% | 0.041, 0.97, 96% |
| Sigma | 0.004, 0.31, 94% | 0.008, 0.43, 94% | 0.014, 0.51, 92% |
| Phi | 0.000, 0.06, 96% | −0.075, 0.07, 95% | −0.142, 0.07, 93% |
CP, coverage probability; Length, credible interval length; SB, standardized bias.
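The three criteria can be computed from replicated simulation results as sketched below. This is an illustration under stated assumptions: SB is taken here as the mean estimate minus the true value, divided by the standard deviation of the estimates across replicates, which may differ from the exact definition used in this paper, and the replicate values are hypothetical.

```python
import math
from statistics import mean, stdev

def standardized_bias(estimates, truth):
    """(mean estimate - truth) / SD of the estimates across replicates (assumed definition)."""
    return (mean(estimates) - truth) / stdev(estimates)

def mean_length(intervals):
    """Average width of the credible intervals."""
    return mean(hi - lo for lo, hi in intervals)

def coverage(intervals, truth):
    """Proportion of intervals that contain the true value."""
    return sum(lo <= truth <= hi for lo, hi in intervals) / len(intervals)

# Hypothetical results from four simulation replicates, true value 2.0
estimates = [1.9, 2.0, 2.1, 2.2]
intervals = [(1.75, 2.05), (1.85, 2.15), (1.95, 2.25), (2.05, 2.35)]
sb = standardized_bias(estimates, 2.0)
length = mean_length(intervals)
cp = coverage(intervals, 2.0)
print(sb, length, cp)
```

Under the rule used in the text, an SB outside the −0.40 to 0.40 bounds or a CP below 90% would be flagged as problematic.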
Table 3 shows that as the censoring percentage increases, the precision of the parameter estimates decreases considerably and the CI widens. With 25% censoring, the parameter estimates in the spatial survival model are unbiased and accurate for mid-values of σ, since the reported SB values are close to zero and fall within the recommended range for almost all the parameters in the model. On the other hand, for very low or very high values of σ, the spatial covariance parameters are affected and not accurately captured. With 50% censoring, the spatial survival model is more accurate for the fixed effects with mid-values of σ, but the spatial covariance parameters are biased and not accurately captured. Overall, results were inconsistent and biased with very low or high values of σ. It is worth noting that we do not necessarily expect to recover the true parameters when we analyze these datasets because of the spatial random effects in the model. The bold values of SB and CP are considered problematic.
Table 3.
Comparison of different censoring for complete data with different variances.
| Censoring percentage | Parameters | σ = 0.9: SB, Length, CP | σ = 0.6: SB, Length, CP | σ = 0.3: SB, Length, CP |
|---|---|---|---|---|
| 25% | | | | |
| | Age | −0.011, 0.019, 91% | 0.000, 0.015, 96% | 0.017, 0.013, 92% |
| | Temperature | 0.500, 0.104, 93% | 0.108, 0.098, 97% | 0.117, 0.090, 94% |
| | Sex | 0.084, 0.487, 93% | −0.018, 0.473, 98% | 0.030, 0.436, 94% |
| | Smoker | −0.116, 0.494, 93% | −0.026, 0.473, 98% | −0.162, 0.439, 93% |
| | Symptomatic | 0.082, 0.488, 92% | 0.017, 0.462, 97% | 0.118, 0.437, 94% |
| | Sigma | 0.897, 0.548, 92% | −0.328, 0.403, 98% | −0.933, 0.400, 93% |
| | Phi | −0.595, 0.061, 89% | −0.212, 0.031, 99% | −0.716, 0.069, 93% |
| 50% | | | | |
| | Age | −0.050, 0.028, 87% | −0.033, 0.020, 93% | −0.089, 0.029, 88% |
| | Temperature | −0.475, 0.156, 85% | 0.035, 0.143, 89% | −0.108, 0.168, 87% |
| | Sex | 0.412, 1.099, 88% | −0.136, 0.550, 94% | 0.302, 0.674, 92% |
| | Smoker | −0.215, 0.904, 88% | 0.074, 0.658, 92% | 0.487, 0.730, 88% |
| | Symptomatic | −0.301, 1.038, 85% | −0.117, 0.903, 90% | 0.365, 1.158, 89% |
| | Sigma | 0.560, 0.701, 85% | −0.511, 0.602, 93% | −0.619, 0.769, 87% |
| | Phi | 0.853, 0.070, 87% | 0.500, 0.070, 93% | 0.931, 0.073, 88% |
CP, coverage probability; Length, credible interval length; SB, standardized bias; σ, standard deviation. The bold values of SB and CP are considered problematic.
4.1. MCMC diagnostics
Before interpreting the results, we looked for evidence of satisfactory convergence and mixing in the MCMC chain for the complete simulated data (n = 1000 from Table 1). We did the same for n = 500 and n = 250 from Table 1, but do not report those results for brevity. We used traceplots, lag-1 autocorrelation plots for each of the W’s, and the log-posterior plot to check for mixing, convergence, and stationarity. MCMC diagnostic tools help check whether the quality of a sample generated by an MCMC algorithm is sufficient to provide an accurate approximation of the target distribution. It is worth noting that MCMC diagnostics cannot guarantee that a chain has converged; they can only indicate that it has not.
Fig. 1 presents the lag-1 autocorrelation plot for each of the W’s. The lag-1 autocorrelation in the W’s is very low (most points are around 0), indicating that the sample from the posterior is reasonable. The log-posterior plot provides evidence that the chain has left the transient phase and reached stationarity (Taylor and Rowlingson, 2017). In addition, it shows that the MCMC algorithm has found a mode and is exploring it. In this case, the initial value of the log-target appears to be close to the mode of the posterior; more commonly, there would be a jump from the initial value, with the chain settling at some other value (Taylor and Rowlingson, 2017). What is important in this diagnostic plot is that there is no evident long-term trend, which appears to be the case here. With the convergence conditions satisfied, we proceed to calculate the posterior summaries of each parameter of interest.
Fig. 1.
Lag-1 Autocorrelations for each of the W’s.
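For readers who want to reproduce this kind of check, the lag-1 autocorrelation of a retained chain can be computed as in the sketch below. The function name and the two toy chains are hypothetical; in practice the input would be the retained samples of one W.

```python
def lag1_autocorr(chain):
    """Sample lag-1 autocorrelation of a sequence of MCMC draws."""
    n = len(chain)
    m = sum(chain) / n
    num = sum((chain[i] - m) * (chain[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in chain)
    return num / den

# Two toy chains: one that drifts steadily and one that alternates
trending = [float(i) for i in range(100)]
alternating = [(-1.0) ** i for i in range(100)]
print(lag1_autocorr(trending), lag1_autocorr(alternating))
```

Values near 0 suggest that successive retained samples are close to independent, whereas values near 1, as for the drifting chain, suggest that more thinning or a longer run may be needed.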
For inferential purposes, it is of interest to assess the influence of the prior choice on the resulting parameter estimates. One way of achieving this is to overlay plots of the prior and posterior densities. In Fig. 2, the prior density function is shown on its original scale (in this case, on the log-scale for and ) as a red line, with a histogram of the posterior samples overlaid. The priors for and appear as flat lines because they are uninformative. We expect to see a big difference between the prior and posterior for each parameter, which would indicate that the data are informative for each parameter of interest (Taylor and Rowlingson, 2017). All the parameters but φ are well identified by the data. As expected in geostatistical inference, the spatial decay parameter, φ, is less well identified.
Fig. 2.
Plot showing the Prior (red line) and Posterior Distribution (histogram) for each Parameter.
Fig. 3 shows the posterior median of the spatial covariance function and its corresponding 95% CI. The plot provides evidence of moderate to strong, statistically significant correlation between any two selected locations. Thus, spatial location plays an important role in predicting survival.
Fig. 3.
Posterior median of the spatial covariance function and its 95% CI.
4.2. Creating missingness
In this section, we explore how to create missingness under different missing data mechanisms, and investigate several scenarios where patients have incomplete records. We start with the complete data and simulate the missing data process by repeatedly setting a proportion of the data to be missing. We create missingness in the outcome, covariates, and spatial locations. In reality, since the true values of the missing data are unknown, we conduct several sensitivity analyses by changing the missing data mechanism to help: (1) assess the impact of untestable and unavoidable assumptions about any unobserved data; and (2) assess the robustness of findings to plausible alternative assumptions about the missing data (Ibrahim et al., 2001a).
For MCAR, we randomly select 10%, 25%, and 50% of patients from the complete data, set their responses to missing (NA), and compare the results with those from the complete data. To induce MAR, we find the median age and impose missing values at rates of 10%, 25%, and 50% for patients below the median age. We do this because younger patients have a lower chance of COVID-19 mortality.
Table 4 reports the SB, the length of the 95% CI, and the CP of the 95% CI for each missing rate under CCA and MI. Again, a result is considered unbiased if the SB lies within the −0.40 to 0.40 bounds. SB values outside this range and CP values below 90%, shown in bold, are considered problematic. Several observations can be made from Table 4. First, as the percentage of missing data increases, the coverage rate decreases. Second, as the missing rate increases, the MI estimates remain unbiased and their coverage stays much closer to 95% than that of CCA. Third, the 95% CIs are shorter for MI than for CCA, which implies more precise results under MI. Finally, we conclude that MI does very well in all scenarios, while CCA only works well with a very low percentage of missing data.
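The two mechanisms can be sketched as follows. This is an illustration only: the record layout, the field names, and the choice to apply the MAR rate within the group below the median age are assumptions, not a description of the code used in this study.

```python
import random
from statistics import median

def make_mcar(records, field, rate, rng):
    """Set `field` to None for a simple random sample of the records."""
    out = [dict(r) for r in records]
    for i in rng.sample(range(len(out)), round(rate * len(out))):
        out[i][field] = None
    return out

def make_mar(records, field, rate, rng, driver="age"):
    """Set `field` to None only among records whose `driver` is below its median."""
    out = [dict(r) for r in records]
    cut = median(r[driver] for r in out)
    eligible = [i for i, r in enumerate(out) if r[driver] < cut]
    for i in rng.sample(eligible, round(rate * len(eligible))):
        out[i][field] = None
    return out

rng = random.Random(7)
# Hypothetical complete data: 100 patients with an age and a survival time
complete = [{"age": 20 + i, "time": rng.expovariate(0.1)} for i in range(100)]
mcar = make_mcar(complete, "time", 0.25, rng)
mar = make_mar(complete, "time", 0.50, rng)
print(sum(r["time"] is None for r in mcar), sum(r["time"] is None for r in mar))
```

Under MCAR the missingness is unrelated to any variable, while under MAR it depends only on the fully observed age, which is what allows an imputation model that includes age to recover the missing information.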
Table 4.
Comparison of simulation results for 10%, 25%, and 50% missing rates under CCA and MI.
| Parameters | MCAR, CCA: SB, Length, CP | MCAR, MI: SB, Length, CP | MAR, CCA: SB, Length, CP | MAR, MI: SB, Length, CP |
|---|---|---|---|---|
| 10% missing | | | | |
| Sex (Male) | −0.328, 0.461, 92% | −0.017, 0.444, 96% | −0.044, 0.453, 92% | −0.011, 0.428, 93% |
| Age | −0.117, 0.700, 94% | −0.050, 0.651, 99% | −0.150, 0.956, 95% | −0.075, 0.511, 95% |
| Accra (Yes) | 0.060, 0.220, 94% | 0.043, 0.206, 99% | −0.020, 0.227, 93% | 0.005, 0.207, 95% |
| Underlying (Yes) | −0.034, 0.351, 94% | 0.002, 0.237, 98% | −0.106, 0.346, 96% | 0.013, 0.304, 94% |
| Symptoms (Yes) | 0.015, 0.093, 95% | 0.009, 0.087, 98% | 0.040, 0.095, 95% | 0.022, 0.087, 96% |
| | 0.761, 0.473, 94% | 0.015, 0.252, 98% | 0.358, 0.296, 95% | 0.298, 0.270, 96% |
| 25% missing | | | | |
| Sex (Male) | −0.284, 0.570, 89% | −0.156, 0.450, 94% | −0.406, 0.622, 90% | −0.167, 0.439, 93% |
| Age | −0.483, 0.832, 93% | 0.258, 0.688, 95% | 0.492, 2.111, 89% | −0.167, 1.504, 94% |
| Accra (Yes) | 0.172, 0.383, 94% | 0.046, 0.213, 96% | −0.011, 0.297, 92% | −0.002, 0.218, 95% |
| Underlying (Yes) | −0.147, 0.419, 90% | 0.013, 0.343, 94% | −0.304, 0.419, 91% | −0.002, 0.341, 96% |
| Symptoms (Yes) | 0.435, 0.118, 90% | 0.075, 0.090, 95% | 0.467, 0.128, 88% | 0.125, 0.091, 94% |
| | −0.836, 0.741, 93% | 0.259, 0.264, 96% | −0.412, 0.379, 86% | 0.381, 0.288, 93% |
| 50% missing | | | | |
| Sex (Male) | −0.383, 0.829, 91% | −0.339, 0.461, 92% | −0.933, 0.833, 86% | −0.301, 0.448, 94% |
| Age | −0.767, 1.442, 92% | 0.442, 0.704, 95% | 0.905, 3.048, 92% | 0.311, 1.676, 93% |
| Accra (Yes) | 0.250, 0.488, 89% | 0.061, 0.214, 95% | 0.034, 0.494, 90% | 0.015, 0.237, 95% |
| Underlying (Yes) | −0.272, 0.746, 90% | 0.209, 0.351, 93% | −0.235, 0.625, 89% | −0.176, 0.352, 94% |
| Symptoms (Yes) | −0.618, 0.190, 89% | 0.211, 0.094, 94% | 0.492, 0.205, 89% | 0.237, 0.108, 95% |
| | −0.867, 0.565, 90% | 0.413, 0.268, 93% | −0.672, 0.721, 90% | 0.397, 0.324, 95% |
CP, coverage probability; Length, credible interval length; SB, standardized bias. The bold values of SB and CP are considered problematic.
5. Application
We apply the geostatistical spatial survival models to data from the Surveillance Outbreak Response Management and Analysis System (SORMAS) study, on which all subsequent analyses are carried out. SORMAS was a prospective cohort study of 32,747 patients with COVID-19 pneumonia hospitalized at the Korle-Bu Teaching Hospital in Accra, Ghana, between May 2020 and August 2021. The data include factors influencing COVID-19 outcomes.
We illustrate our method with this motivating dataset because, almost three years after coronavirus disease 2019 (COVID-19) emerged as a global pandemic, it remains a serious public health challenge worldwide despite widespread efforts to control the disease. The COVID-19 outbreak began in Wuhan, China, in the latter part of 2019 and has since spread rapidly across the globe. The pandemic is still ongoing, with over 500 million people having recovered and over 6 million reported deaths, according to the Johns Hopkins Coronavirus Resource Center database (https://coronavirus.jhu.edu/map.html).
Even though randomized clinical trials are considered the gold standard for health research, most COVID-19 data are collected via observational studies, and most studies have focused on clinical risk factors associated with serious illness and mortality (Albitar et al., 2020, Williamson et al., 2020). In this paper, we focus on patients’ location and on socio-demographic and environmental variables associated with COVID-19 infection and mortality, because in health research geographic location and environmental factors may affect the outcome. We aim to identify significant risk factors associated with COVID-19 mortality in Ghana.
The variables available for statistical analysis include the time to death, which is treated as the response. All other patients, including those who survived the study period, dropped out, or died of other causes, were considered censored. Independent variables include sex (male or female), age (in years), and three binary (yes or no) variables: residence in Accra, history of any underlying disease, and symptomatic status. The exact residential locations of patients and their administrative districts are available, enabling us to fit both geostatistical and lattice models.
We used a subset of the data. Exclusion criteria were non-Ghanaian nationality, age below 15 years, unrealistically low or high recorded values, and more than one underlying disease. The final analysis includes COVID-19 data from 12,491 adults hospitalized at the Korle-Bu Teaching Hospital between May 2020 and August 2021, as identified through SORMAS. We also include socioeconomic factors, such as a poverty index, that might help explain some of the spatial variability in the outcome. Statistical analyses are performed in R version 4.2.1, and we use the spatsurv R package to build the Bayesian spatial survival models.
We begin with descriptive statistics and graphical displays to better understand the data. Approximately 30% of the patients’ survival times were censored, 42% of patients were female, 21% experienced symptoms, and most cases were diagnosed when patients were 30–49 years old. Almost all patients had missing information in at least one covariate, spatial location, or response, with most missingness occurring in the covariates. Prior to implementing MI, about 82% of cases in the final dataset were incomplete, leaving only approximately 18% complete cases, see Fig. 4. The left panel of Fig. 4 shows the proportion of missing values for each variable, while the right panel shows the missing data patterns (red represents missing data and blue represents observed data). This figure identifies the most common patterns of missingness, which may help explain why certain patterns occur more frequently.
Fig. 4.
Missing data patterns.
We used MI as described above to create m = 100 imputed complete datasets and analyzed each using standard statistical procedures to obtain point and variance estimates. Because the sample sizes differ, we compared the models using the mean square error (MSE). It is of interest to investigate possible spatial variation in survival after accounting for a combination of individual-level covariates and district-level spatially correlated effects. Our parametric spatial survival model is flexible enough to accommodate missing information. We compare the results from the different models and explore common issues that can arise when these techniques are used. Table 5 reports the estimated posterior median hazard ratios (HR) with their corresponding 95% CI, along with the MSE for each model.
Table 5.
Median Hazard Ratio (HR) and 95% CI from the posterior summaries.
| Parameters | Nonspatial Frailty Model, CCA | Nonspatial Frailty Model, MI | Geostatistical Frailty Model, CCA | Geostatistical Frailty Model, MI |
|---|---|---|---|---|
| Sex (Male) | 1.09 (0.67, 1.82) | 1.86 (1.23, 3.36) | 1.04 (0.97, 1.22) | 1.57 (1.52, 1.64) |
| Age | 0.75 (0.67, 0.88) | 1.16 (0.78, 1.70) | 1.43 (1.18, 2.04) | 2.43 (2.19, 2.70) |
| Accra (Yes) | 1.35 (0.98, 1.56) | 0.83 (0.64, 0.94) | 0.76 (0.38, 1.43) | 0.74 (0.44, 1.08) |
| Underlying (Yes) | 0.84 (0.78, 1.43) | 1.09 (1.05, 1.14) | 1.14 (0.94, 1.42) | 5.62 (4.73, 6.69) |
| Symptoms (Yes) | 3.88 (0.86, 6.43) | 0.84 (0.78, 1.43) | 1.13 (1.02, 1.27) | 2.81 (1.61, 4.66) |
| Poverty Index | 1.17 (1.14, 1.48) | 1.14 (1.16, 1.20) | 1.16 (1.14, 1.18) | 1.18 (1.15, 1.20) |
| φ | | | 0.04 (0.01, 0.16) | 0.09 (0.03, 0.12) |
| MSE | 0.089 | 0.071 | 0.064 | 0.055 |
In Table 5, we compare the performance of the spatial model (which includes spatial correlation) and the nonspatial frailty model (by setting = 0, from Eq. (1)). Judging by the MSE values, the geostatistical frailty models provide a substantial improvement over the nonspatial frailty models. This is because common survival models cannot explain the spatial correlation in the data. The geostatistical frailty models not only estimated the parameters of the spatial survival models more accurately but also produced smaller standard errors than the nonspatial frailty models. This helps account for the unexplained heterogeneity caused by the spatial effect. These results are consistent with the literature (Banerjee et al., 2003a, Aswi et al., 2020, Motarjem et al., 2020).
From the geostatistical frailty model under MI, all the variables in the model except Accra are statistically significant risk factors for COVID-19 mortality. Males, symptomatic patients, and patients with an underlying disease have a higher risk of death from COVID-19 than females, asymptomatic patients, and patients without an underlying disease, respectively. For instance, the hazard of death from COVID-19 is 57% higher for males than for females in this population (HR = 1.57, 95% CI = (1.52, 1.64)), and the remaining variables can be interpreted in a similar manner. As expected, Accra was not statistically significant in the spatial frailty model, since location is already accounted for by the spatial model. Age and φ are also statistically significant. For the exponential covariance function, the quantity 3/φ can be seen as a measure of the effective isotropic range, see Banerjee et al. (2003b) for more details. The estimated effective range implies that there is significant spatial dependence after controlling for the covariates in the model (effective range = 43 km, 95% CI = (27, 60) km). In Fig. 5, we observe that cases less than 43 km apart had a high correlation of hazard in space. The correlation of hazard starts to decrease when cases are more than 43 km apart; this result is supported by the value shown in Table 5.
Fig. 5.
Posterior spatial correlation function and its 95% CI.
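The two quantities interpreted in this section, the hazard ratio and the effective range, involve only simple arithmetic, sketched below. The sketch assumes an exponential correlation function of the form exp(−φd), which is the form implied by the 3/φ rule; the function names are hypothetical.

```python
import math

def effective_range(phi):
    """Distance 3 / phi at which the exponential correlation falls to about 0.05."""
    return 3.0 / phi

def exp_corr(d, phi):
    """Exponential correlation exp(-phi * d) between two sites d km apart (assumed form)."""
    return math.exp(-phi * d)

def pct_change_in_hazard(hr):
    """Percentage change in the hazard implied by a hazard ratio."""
    return (hr - 1.0) * 100.0

phi_cca = 0.04                            # posterior median of phi under CCA (Table 5)
r = effective_range(phi_cca)
print(r, exp_corr(r, phi_cca))            # range in km and the correlation at that distance
print(pct_change_in_hazard(1.57))         # HR for sex (male) under MI
```

At the effective range the correlation equals exp(−3), roughly 0.05, so a smaller φ means that the hazards of patients living further apart remain correlated.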
In addition, the analyses under the geostatistical frailty model revealed large discrepancies in inference between CCA and MI. Under CCA, sex and having an underlying disease are not significant predictors of COVID-19 mortality, which contradicts the results from MI. Overall, the 95% CIs were also shorter under MI than under CCA. The median effective spatial range is 34 km under MI, whereas the distance of high correlation of hazard in space is more than double under CCA (3/0.04 = 75 km). Fig. 6 shows the calibration curves for the nonspatial frailty model and the geostatistical frailty model. The validation is based on the test dataset, with the area under the receiver operating characteristic curve (AUC) and Brier scores expressed as point estimates and 95% confidence intervals (Zhang et al., 2018). The geostatistical model is the better model since it has the largest AUC and the smallest Brier score (Zhang et al., 2018). This indicates that the chosen model has good ability to predict COVID-19 mortality.
Fig. 6.
Calibration curves for Nonspatial Frailty Model and Geostatistical Frailty Model.
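For reference, both validation measures can be computed from predicted probabilities and observed binary outcomes as in the sketch below. This is a simplification: it treats mortality by a fixed time point as a binary outcome and ignores censoring, so it is not the exact procedure of Zhang et al. (2018); the predictions shown are hypothetical.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def auc(probs, outcomes):
    """Probability that a random event case is ranked above a random non-event case."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)   # ties count one half
    return wins / (len(pos) * len(neg))

# Hypothetical predicted death probabilities and observed outcomes for four patients
probs = [0.9, 0.8, 0.3, 0.2]
died = [1, 0, 1, 0]
print(auc(probs, died), brier_score(probs, died))
```

A larger AUC indicates better discrimination and a smaller Brier score indicates better overall accuracy, which is the basis for preferring one model over another in the comparison above.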
We now compare the results of CCA and MI from the nonspatial frailty models. There is a vast difference between the two in terms of the parameter estimates obtained. The results from the nonspatial frailty model under MI show that sex and having an underlying disease are significant predictors of COVID-19 mortality, which is consistent with the results from the geostatistical frailty model under MI. In contrast, the results under CCA indicate that patients who had symptoms have a similar risk of COVID-19 mortality to those who did not, contrary to MI. The difference is possibly due to the missing information not being properly handled, which leads to a result that contradicts the well-established finding that patients with symptoms are more likely to die from COVID-19 than those without symptoms (Sousa et al., 2020, Albitar et al., 2020).
Furthermore, the parameter estimates under CCA and MI in the nonspatial frailty models trended in opposite directions for all the variables except age, which is not even statistically significant under CCA. Most notably, the CI for the symptom variable is considerably shorter under the geostatistical frailty model (HR = 2.81, 95% CI = (1.61, 4.66)) than under the nonspatial frailty model (HR = 3.88, 95% CI = (0.86, 6.43)), which indicates greater precision. Overall, even though the nonspatial frailty model with MI produces wider CIs than the geostatistical frailty model, MI gave a substantial improvement, producing unbiased and consistent parameter estimates and shorter CIs (due to smaller standard errors) than CCA. The spatial model is more suitable for these data, and CCA should not be considered the standard in research.
6. Discussions
We have introduced geostatistical frailty modeling, supported by extensive simulations, for incomplete spatially correlated survival data, with a focus on understanding COVID-19 mortality in Ghana. By doing so, we have demonstrated the potential of incorporating incomplete data into spatial survival models for modeling time-to-event data. We believe our approach provides precise parameter estimates and supports the validity of study results in epidemiological research.
The results from the geostatistical spatial survival models illustrate that nonspatial frailty models are not appropriate for spatially correlated survival data, which is consistent with the existing literature (Motarjem et al., 2020, Banerjee et al., 2003a, Su et al., 2020, Aswi et al., 2020). This is because common survival models cannot explain the spatial correlation in the data and should not be considered the standard in research. Spatial survival models produced more accurate parameter estimates and lower standard errors than nonspatial frailty models, which is consistent with the literature (Banerjee et al., 2003a). In addition, the simulation studies indicate that the precision of the estimator decreases as the censoring percentage increases, which is consistent with the existing literature (Motarjem et al., 2020). Simulation results also indicate that our approach is efficient when data are missing under MCAR and MAR, regardless of the sample size. Finally, results show that MI gives a substantial improvement over CCA by producing consistent parameter estimates and shorter CIs, which is consistent with the literature (Allotey and Harel, 2019, Harel and Zhou, 2007, Harel, 2007, Munoz et al., 2010, Ibrahim et al., 2012, Banerjee et al., 2003a).
Furthermore, results show inconsistencies between MI and CCA in parameter estimates and CI widths. For instance, under CCA, age and sex do not appear to be significant risk factors for COVID-19 mortality. We believe this indicates bias, as it contradicts many confirmed results in the existing literature (Nasution et al., 2022, Ko et al., 2021). Using MI, however, we find that both age and sex are significantly associated with COVID-19 mortality, as expected. These results indicate that our approach produces improved parameter estimates and more accurate predictions than nonspatial models and CCA, which ignore both the missing covariate and the missing spatial information in the data.
There are some limitations that may affect the generalizability of our study. First, we use data from only one major hospital in Ghana, which may lead to bias via clinical test selection and implementation. In addition, some districts had few cases, and there were concerns about under-reporting and incorrect diagnosis of COVID-19 cases. We acknowledge that using information from other hospitals could influence our results. Second, we lack other secondary predictors of COVID-19 infection and mortality, such as employment status, education level, income, housing conditions, and clinical predictors, which could influence the ability to seek care, adhere to treatment, and practice physical distancing measures, and which should be considered (Khalatbari-Soltani et al., 2020, Nasution et al., 2022). Third, as with any simulation study, results cannot be generalized beyond the set of conditions that were explored. For instance, we did not investigate the effect of missingness under MNAR. Despite these limitations, our results could have useful implications for the government and health care practitioners in Ghana and beyond.
We recommend performing several sensitivity analyses to assess the impact of untestable and unavoidable assumptions before proceeding with the analysis, especially when the amount of missing information is high. The more missing information there is, the more sensitive the results are to the modeling assumptions, and sensitivity analyses can help us gain confidence in our results, see Ibrahim et al. (2012) and Daniels and Hogan (2008) for more details. Based on these findings, we encourage researchers to consider using MI or the other theoretically based methods mentioned above for handling incomplete data, as such procedures will often perform better than CCA.
In future work, we plan to use MI to examine the impact of incomplete data using semiparametric and nonparametric models for spatially correlated survival data, by exploring missingness in outcome, covariates, and spatial locations. We will explore the outcome patterns using important covariates, while accounting for spatially correlated differences in the hazards among districts and for possible space–time interactions. In addition, we plan to discuss the impact of missing data under different models, different spatial priors and compare the results under different structures with respect to goodness of fit and generalizability.
References
- Albitar O., Ballouze R., Ooi J.P., Ghadzi S.M.S. Risk factors for mortality among COVID-19 patients. Diabetes Res. Clin. Pract. 2020;166. doi: 10.1016/j.diabres.2020.108293.
- Allison P.D. Sage Publications; 2001. Missing Data, Vol. 136.
- Allotey P.A., Harel O. Multiple imputation for incomplete data in environmental epidemiology research. Current Environ. Health Rep. 2019;6(2):62–71. doi: 10.1007/s40572-019-00230-y.
- Aswi A., Cramb S., Duncan E., Hu W., White G., Mengersen K. Bayesian spatial survival models for hospitalisation of dengue: A case study of Wahidin hospital in Makassar, Indonesia. Int. J. Environ. Res. Public Health. 2020;17(3):878. doi: 10.3390/ijerph17030878.
- Banerjee S., Carlin B.P., Gelfand A.E. Chapman and Hall/CRC; 2003. Hierarchical Modeling and Analysis for Spatial Data.
- Banerjee S., Wall M.M., Carlin B.P. Frailty modeling for spatially correlated survival data, with application to infant mortality in Minnesota. Biostatistics. 2003;4(1):123–142. doi: 10.1093/biostatistics/4.1.123.
- Bartlett J.W., Carpenter J.R., Tilling K., Vansteelandt S. Corrigendum: Improving upon the efficiency of complete case analysis when covariates are MNAR (10.1093/biostatistics/kxu023). Biostatistics. 2014;16(1):205. doi: 10.1093/biostatistics/kxu051.
- Bartlett J.W., Carpenter J.R., Tilling K., Vansteelandt S. Improving upon the efficiency of complete case analysis when covariates are MNAR. Biostatistics. 2014;15(4):719–730. doi: 10.1093/biostatistics/kxu023.
- Brown P.E. Handbook of Spatial Epidemiology. Chapman and Hall/CRC; 2016. Geostatistics in small-area health applications; pp. 229–242.
- Carpenter J.R., Kenward M.G., White I.R. Sensitivity analysis after multiple imputation under missing at random: A weighting approach. Stat. Methods Med. Res. 2007;16(3):259–275. doi: 10.1177/0962280206075303.
- Chen Q., Ibrahim J.G. A note on the relationships between multiple imputation, maximum likelihood and fully Bayesian methods for missing responses in linear regression models. Stat. Interface. 2014;6(3):315. doi: 10.4310/SII.2013.v6.n3.a2.
- Collins L.M., Schafer J.L., Kam C.-M. A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychol. Methods. 2001;6(4):330.
- Cox D.R., Oakes D. CRC Press; 1984. Analysis of Survival Data, Vol. 21.
- Cressie N. John Wiley & Sons; 2015. Statistics for Spatial Data.
- Daniels M.J., Hogan J.W. Chapman and Hall/CRC; 2008. Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis.
- Dhara K., Lipsitz S., Pati D., Sinha D. A new Bayesian single index model with or without covariates missing at random. Bayesian Anal. 2020;15(3):759. doi: 10.1214/19-ba1170.
- Diggle P.J., Tawn J.A., Moyeed R.A. Model-based geostatistics. J. R. Stat. Soc. Ser. C. Appl. Stat. 1998;47(3):299–350.
- Donders A.R.T., Van Der Heijden G.J., Stijnen T., Moons K.G. A gentle introduction to imputation of missing values. J. Clin. Epidemiol. 2006;59(10):1087–1091. doi: 10.1016/j.jclinepi.2006.01.014.
- Enders C.K. A primer on maximum likelihood algorithms available for use with missing data. Struct. Equ. Model. 2001;8(1):128–141.
- Erler N.S., Rizopoulos D., Rosmalen J.v., Jaddoe V.W., Franco O.H., Lesaffre E.M.E.H. Dealing with missing covariates in epidemiologic studies: A comparison between multiple imputation and a full Bayesian approach. Stat. Med. 2016;35(17):2955–2974. doi: 10.1002/sim.6944.
- Ghazali A.K., Keegan T., Taylor B.M. Spatial variation of survival for colorectal cancer in Malaysia. Int. J. Environ. Res. Public Health. 2021;18(3):1052. doi: 10.3390/ijerph18031052.
- Glynn R.J., Laird N.M., Rubin D.B. Drawing Inferences from Self-Selected Samples. Springer; 1986. Selection modeling versus mixture modeling with nonignorable nonresponse; pp. 115–142.
- Hanson T.E., Jara A., Zhao L. A Bayesian semiparametric temporally-stratified proportional hazards model with spatial frailties. Bayesian Anal. 2012;7(1):147–188.
- Harel O. Inferences on missing information under multiple imputation and two-stage multiple imputation. Stat. Methodol. 2007;4(1):75–89.
- Harel O., Mitchell E.M., Perkins N.J., Cole S.R., Tchetgen Tchetgen E.J., Sun B., Schisterman E.F. Multiple imputation for incomplete data in epidemiologic studies. Am. J. Epidemiol. 2017;187(3):576–584. doi: 10.1093/aje/kwx349.
- Harel O., Zhou X.-H. Multiple imputation for correcting verification bias. Stat. Med. 2006;25(22):3769–3786. doi: 10.1002/sim.2494.
- Harel O., Zhou X.-H. Multiple imputation: Review of theory, implementation and software. Stat. Med. 2007;26(16):3057–3077. doi: 10.1002/sim.2787.
- He Y. Missing data analysis using multiple imputation: Getting to the heart of the matter. Circ. Cardiovasc. Qual. Outcomes. 2010;3(1):98–105. doi: 10.1161/CIRCOUTCOMES.109.875658.
- Henderson R., Shimakura S., Gorst D. Modeling spatial variation in leukemia survival data. J. Amer. Statist. Assoc. 2002;97(460):965–972.
- Hesam S., Mahmoudi M., Foroushani A.R., Yaseri M., Mansournia M.A. A spatial survival model in presence of competing risks for Iranian gastrointestinal cancer patients. Asian Pac. J. Cancer Prev. 2018;19(10):2947. doi: 10.22034/APJCP.2018.19.10.2947.
- Ibrahim J.G., Chen M.-H., Lipsitz S.R. Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable. Biometrika. 2001;88(2):551–564.
- Ibrahim J.G., Chen M.-H., Sinha D. Wiley StatsRef: Statistics Reference Online. Wiley Online Library; 2014. Bayesian survival analysis.
- Ibrahim J.G., Chen M.-H., Sinha D. Springer; 2001. Bayesian Survival Analysis, Vol. 2.
- Ibrahim J.G., Chu H., Chen M.-H. Missing data in clinical studies: Issues and methods. J. Clin. Oncol. 2012;30(26):3297. doi: 10.1200/JCO.2011.38.7589.
- Khalatbari-Soltani S., Cumming R.C., Delpierre C., Kelly-Irving M. Importance of collecting data on socioeconomic determinants from the early stage of the COVID-19 outbreak onwards. J. Epidemiol. Community Health. 2020;74(8):620–623. doi: 10.1136/jech-2020-214297.
- Ko J.Y., Danielson M.L., Town M., Derado G., Greenlund K.J., Kirley P.D., Alden N.B., Yousey-Hindes K., Anderson E.J., Ryan P.A., et al. Risk factors for coronavirus disease 2019 (COVID-19)–associated hospitalization: COVID-19–associated hospitalization surveillance network and behavioral risk factor surveillance system. Clin. Infect. Dis. 2021;72(11):e695–e703. doi: 10.1093/cid/ciaa1419.
- Lawson A.B., Banerjee S., Haining R.P., Ugarte M.D. CRC Press; 2016. Handbook of Spatial Epidemiology.
- LeSage J.P., Pace R.K. Models for spatially dependent missing data. J. Real Estate Finance Econ. 2004;29(2):233–254.
- Li Y., Ryan L. Modeling spatial survival data using semiparametric frailty models. Biometrics. 2002;58(2):287–297. doi: 10.1111/j.0006-341x.2002.00287.x.
- Lipsitz S.R., Fitzmaurice G.M., Molenberghs G., Zhao L.P. Quantile regression methods for longitudinal data with drop-outs: Application to CD4 cell counts of patients infected with the human immunodeficiency virus. J. R. Stat. Soc. Ser. C. Appl. Stat. 1997;46(4):463–476.
- Little R.J.A. Missing-data adjustments in large surveys. J. Bus. Econom. Statist. 1988;6(3):287–296.
- Little R.J.A. A test of missing completely at random for multivariate data with missing values. J. Amer. Statist. Assoc. 1988;83(404):1198–1202.
- Little R.J.A., Rubin D.B. John Wiley & Sons; 2014. Statistical Analysis with Missing Data, Vol. 333.
- Little R.J.A., Wang Y. Pattern-mixture models for multivariate incomplete data with covariates. Biometrics. 1996:98–111.
- Mack C., Su Z., Westreich D. Agency for Healthcare Research and Quality (US); Rockville (MD): 2018. Managing Missing Data in Patient Registries: Addendum to Registries for Evaluating Patient Outcomes: A User’s Guide.
- Mahanta K.K., Hazarika J., Barman M.P., Rahman T. An application of spatial frailty models to recovery times of COVID-19 patients in India under Bayesian approach. J. Sci. Res. 2021;65(1).
- Motarjem K., Mohammadzadeh M., Abyar A. Geostatistical survival model with Gaussian random effect. Statist. Papers. 2020;61(1):85–107.
- Mudholkar G.S., Srivastava D.K., Kollia G.D. A generalization of the Weibull distribution with application to the analysis of survival data. J. Amer. Statist. Assoc. 1996;91(436):1575–1583.
- Munoz B., Lesser V.M., Smith R.A. Applying multiple imputation with geostatistical models to account for item nonresponse in environmental data. J. Modern Appl. Statist. Methods. 2010;9(1):27.
- Nasution B.I., Nugraha Y., Prasetya N.L., Aminanto M.E., Sulasikin A., Kanggrawan J.I., Suherman A.L. COVID-19 mortality risk factors using survival analysis: A case study of Jakarta, Indonesia. IEEE Trans. Comput. Soc. Syst. 2022.
- Newman D.A. Longitudinal modeling with randomly and systematically missing data: A simulation of ad hoc, maximum likelihood, and multiple imputation techniques. Organ. Res. Methods. 2003;6(3):328–362.
- Nunes C., Taylor B.M. Modelling the time to detection of urban tuberculosis in two big cities in Portugal: A spatial survival analysis. Int. J. Tuberc. Lung. Dis. 2016;20(9):1219–1225. doi: 10.5588/ijtld.15.0822.
- Oba S., Sato M.-a., Takemasa I., Monden M., Matsubara K.-i., Ishii S. A Bayesian missing value estimation method for gene expression profile data. Bioinformatics. 2003;19(16):2088–2096. doi: 10.1093/bioinformatics/btg287.
- Panzera D., Benedetti R., Postiglione P. A Bayesian approach to parameter estimation in the presence of spatial missing data. Spat. Econ. Anal. 2016;11(2):201–218.
- Pebesma E., Graeler B., Pebesma M.E. 2015. Package ‘gstat’. Comprehensive R Archive Network (CRAN), p. 1.
- Perkins N.J., Cole S.R., Harel O., Tchetgen Tchetgen E.J., Sun B., Mitchell E.M., Schisterman E.F. Principled approaches to missing data in epidemiologic studies. Am. J. Epidemiol. 2017;187(3):568–575. doi: 10.1093/aje/kwx348.
- Rubin D.B. Inference and missing data. Biometrika. 1976;63(3):581–592.
- Rubin D.B. Vol. 1. American Statistical Association; 1978. Multiple imputations in sample surveys-a phenomenological Bayesian approach to nonresponse; pp. 20–34. (Proceedings of the Survey Research Methods Section of the American Statistical Association).
- Rubin D.B. Multiple imputation after 18+ years. J. Amer. Statist. Assoc. 1996;91(434):473–489.
- Rubin D.B. John Wiley & Sons; 2004. Multiple Imputation for Nonresponse in Surveys, Vol. 81.
- Schafer J.L. London; 1997. Analysis of Incomplete Multivariate Data.
- Schafer J.L., Graham J.W. Missing data: Our view of the state of the art. Psychol. Methods. 2002;7(2):147.
- Schafer J.L., Olsen M.K. Multiple imputation for multivariate missing-data problems: A data analyst’s perspective. Multivar. Behav. Res. 1998;33(4):545–571. doi: 10.1207/s15327906mbr3304_5.
- Shand L., Li B., Park T., Albarracín D. Spatially varying auto-regressive models for prediction of new human immunodeficiency virus diagnoses. J. R. Stat. Soc. Ser. C. Appl. Stat. 2018;67(4):1003–1022. doi: 10.1111/rssc.12269.
- Song J.J. Texas A&M University; 2004. Bayesian Multivariate Spatial Models and their Applications.
- Sousa G.J.B., Garces T.S., Cestari V.R.F., Florêncio R.S., Moreira T.M.M., Pereira M.L.D. Mortality and survival of COVID-19. Epidemiol. Infect. 2020;148. doi: 10.1017/S0950268820001405.
- Stuart E.A., Azur M., Frangakis C., Leaf P. Multiple imputation with large data sets: A case study of the children’s mental health initiative. Am. J. Epidemiol. 2009;169(9):1133–1139. doi: 10.1093/aje/kwp026.
- Su P.-F., Sie F.-C., Yang C.-T., Mau Y.-L., Kuo S., Ou H.-T. Association of ambient air pollution with cardiovascular disease risks in people with type 2 diabetes: A Bayesian spatial survival analysis. Environ. Health. 2020;19(1):1–12. doi: 10.1186/s12940-020-00664-0.
- Sun B., Tchetgen Tchetgen E.J. On inverse probability weighting for nonmonotone missing at random data. J. Amer. Statist. Assoc. 2018;113(521):369–379. doi: 10.1080/01621459.2016.1256814.
- Taylor B.M. 2015. Auxiliary variable Markov chain Monte Carlo for spatial survival and geostatistical models. arXiv preprint arXiv:1501.01665.
- Taylor B.M., Rowlingson B.S. Spatsurv: An R package for Bayesian inference with spatial survival models. J. Stat. Softw. 2017;77:1–32.
- Thamrin S.A., Amran B.S., Jaya A.K., Rahmi S., Ansariadi S. Vol. 1827. AIP Publishing LLC; 2017. Bayesian inference for spatial parametric proportional hazards model using Spatsurv R. (AIP Conference Proceedings).
- Thamrin S.A., Jaya A.K., Mengersen K., et al. Bayesian spatial survival modelling for dengue fever in Makassar, Indonesia. Gac. Sanit. 2021;35:S59–S63. doi: 10.1016/j.gaceta.2020.12.017.
- Tsonaka R., Verbeke G., Lesaffre E. A semi-parametric shared parameter model to handle nonmonotone nonignorable missingness. Biometrics. 2009;65(1):81–87. doi: 10.1111/j.1541-0420.2008.01021.x.
- Vaupel J.W., Manton K.G., Stallard E. The impact of heterogeneity in individual frailty on the dynamics of mortality. Demography. 1979;16(3):439–454.
- Vonesh E.F., Greene T., Schluchter M.D. Shared parameter models for the joint analysis of longitudinal data and event times. Stat. Med. 2006;25(1):143–163. doi: 10.1002/sim.2249.
- Williamson E.J., Walker A.J., Bhaskaran K., Bacon S., Bates C., Morton C.E., Curtis H.J., Mehrkar A., Evans D., Inglesby P., et al. OpenSAFELY: Factors associated with COVID-19 death in 17 million patients. Nature. 2020;584(7821):430. doi: 10.1038/s41586-020-2521-4.
- Zhang Z., Cortese G., Combescure C., Marshall R., Lee M., Lim H.J., Haller B., et al. Overview of model validation for survival regression model with competing risks using melanoma study data. Ann. Transl. Med. 2018;6(16). doi: 10.21037/atm.2018.07.38.
- Zhou H. University of South Carolina; 2015. Bayesian Semi-and Non-Parametric Analysis for Spatially Correlated Survival Data. (Ph.D. thesis).
- Zhou H., Hanson T., Zhang J. 2017. SpBayesSurv: Fitting Bayesian spatial survival models using R. arXiv preprint arXiv:1705.04584.
- Zhou X., Reiter J.P. A note on Bayesian inference after multiple imputation. Amer. Statist. 2010;64(2):159–163.






