Author manuscript; available in PMC: 2023 Jul 27.
Published in final edited form as: Biom J. 2020 Jan 9;62(2):444–466. doi: 10.1002/bimj.201900051

Multiple imputation methods for handling incomplete longitudinal and clustered data where the target analysis is a linear mixed effects model

Md Hamidul Huque 1,2,3,*, Margarita Moreno-Betancur 1, Matteo Quartagno 4, Julie A Simpson 5, John B Carlin 1,2,5, Katherine J Lee 1,2
PMCID: PMC7614826  EMSID: EMS181448  PMID: 31919921

Abstract

Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using univariate conditional distributions for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches result in consistent estimates of regression coefficient and variance component parameters when the analysis model of interest is a linear mixed effects model (LMM) that includes both random intercepts and slopes. In the current paper, we compared the performance of seven different MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between specific imputation models fitted under each of these approaches and the LMM, and also conduct simulation studies in both the longitudinal and clustered data settings. Simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings showed that the relative performance of MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. We showed via simulation that compatible imputation and analysis models resulted in consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of LSAC data.

Keywords: Fully conditional specification, Joint modelling, Missing data, Multiple imputation, Repeated measurement, clustered data

1. Introduction

Longitudinal and cluster-correlated data arise in many public health settings where data are collected (i) from individual participants repeatedly over time and (ii) from groups of individuals that are clustered within natural units, e.g., medical practices or geographical locations. Both of these settings have the common characteristic of correlated measurements, either within an individual or within a cluster of individuals. Mixed-effects models are frequently used in the analysis of correlated data. However, the validity of the results obtained from such analyses may be compromised if some covariate values are missing (Laird, 1988).

Multiple imputation (MI) has become a popular tool for dealing with missing data in recent years (Rezvan et al., 2015). MI involves the generation of multiple copies of imputed datasets where missing values are replaced by imputed values sampled from their posterior predictive distribution (or an approximation to this) given the observed data. Each completed dataset is analyzed using the statistical model for the epidemiological question of interest, and the resulting estimates and standard errors are combined using Rubin’s rules (Rubin, 1987). The theoretical basis of MI methods has been developed under the assumption that data are missing at random (MAR), which requires that the probability of data being missing does not depend on the unobserved data, conditional on the observed data (Sterne et al., 2009). If the data are MAR, a correctly implemented MI method can produce unbiased and asymptotically efficient estimates of regression parameters and their standard errors. Correct implementation requires compatibility between the imputation and analysis models. Formally, a set of conditional models are called compatible if there exists a joint density function that generates them (Meng, 1994).
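The pooling step described above can be sketched in a few lines. This is a minimal illustration of Rubin's rules for a single scalar parameter, assuming NumPy; the function name `pool_rubin` and the five estimates are hypothetical and are not taken from the paper.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors with Rubin's rules."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()                    # pooled point estimate
    within = u.mean()                   # average within-imputation variance
    between = q.var(ddof=1)             # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return q_bar, total

# Hypothetical estimates of one regression coefficient from m = 5 imputed datasets
est = [-0.21, -0.19, -0.20, -0.22, -0.18]
se = [0.05, 0.05, 0.05, 0.05, 0.05]
pooled, total_var = pool_rubin(est, np.square(se))
print(pooled, np.sqrt(total_var))
```

The total variance adds the between-imputation component, inflated by (1 + 1/m), to the average within-imputation variance, so the pooled standard error reflects the uncertainty due to the missing values.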

Two general approaches for implementing MI in the presence of multiple incomplete variables are available in the literature: MI based on the joint posterior distribution of incomplete variables, often referred to as joint modeling (JM) (Schafer, 1997), and fully conditional specification (FCS; also known as sequential regression and MI using chained equations (MICE)) (Raghunathan et al., 2001; Van Buuren et al., 2006). The JM approach assumes that the incomplete variables follow a multivariate distribution, usually a multivariate normal distribution, in which case the method is referred to as multivariate normal imputation (MVNI) (Schafer, 1997). FCS, on the other hand, imputes missing values using univariate conditional distributions for each incomplete variable given all the other variables in the imputation model, cycling iteratively through the univariate imputation models (Raghunathan et al., 2001; Van Buuren et al., 2006). Both the JM and FCS approaches were originally proposed for imputing missing values in cross-sectional settings with independent observations, and subsequently various extensions have been proposed in the literature to accommodate longitudinal and correlated data.
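To make the cycling in FCS concrete, the following is a deliberately simplified sketch for continuous variables in a cross-sectional setting, assuming NumPy. It is not the algorithm of any particular package: the function `fcs_impute` is hypothetical, and it draws imputations from a fitted linear regression plus residual noise without drawing the regression parameters from their posterior, so it only illustrates the iteration and is not a proper MI procedure.

```python
import numpy as np

def fcs_impute(data, n_iter=10, seed=0):
    """Simplified chained-equations pass: one linear regression per incomplete column."""
    rng = np.random.default_rng(seed)
    X = np.array(data, dtype=float)          # copy, so the input is left untouched
    miss = np.isnan(X)
    # Initialise missing cells with the observed column means
    col_means = np.nanmean(X, axis=0)
    X[miss] = np.take(col_means, np.where(miss)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            # Regress column j on all other (currently completed) columns
            A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            resid = X[obs, j] - A[obs] @ coef
            dof = max(int(obs.sum()) - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Replace missing cells with predictions plus residual noise
            n_miss = int(miss[:, j].sum())
            X[miss[:, j], j] = A[miss[:, j]] @ coef + rng.normal(0.0, sigma, n_miss)
    return X

# Hypothetical two-variable example with about 20% of x set to missing
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 0.5 * x + rng.normal(scale=0.5, size=200)
demo = np.column_stack([x, y])
demo[rng.random(200) < 0.2, 0] = np.nan
completed = fcs_impute(demo, n_iter=5, seed=2)
```

Repeating the call with different seeds would give several completed datasets, each of which would then be analysed and pooled.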

MI methods developed to impute missing values in both the cluster-correlated and longitudinal data settings include a joint multivariate linear mixed effects model (LMM) approach (JM-MLMM) (Schafer and Yucel, 2002), implemented in the pan software in R. There is also an FCS adaptation of Schafer and Yucel’s approach (FCS-LMM) implemented in the mice.impute.2lpan function of the mice package in R (Van Buuren and Groothuis-Oudshoorn, 2011). Both the JM-MLMM and FCS-LMM approaches assume a constant residual variance across all clusters. Subsequently, Yucel and Van Buuren et al. extended the JM-MLMM and FCS-LMM approaches to allow for heteroscedastic (cluster-specific) random covariance matrices and residual error variances, respectively (Yucel, 2011; Van Buuren et al., 2011), hereby denoted as JM-MLMM-het and FCS-LMM-het. Both JM-MLMM and JM-MLMM-het, and their FCS adaptations (FCS-LMM and FCS-LMM-het), assume normal distributions for the incomplete variables. In practice, incomplete variables may be a mixture of continuous and categorical variables, so the assumption of normality may not be realistic. Goldstein et al. proposed an extension of the JM-MLMM approach which uses latent normal (LN) variables to impute a mixture of discrete, normal and non-normal continuous variables, referred to herein as JM-MLMM-LN (Goldstein et al., 2009). Asparouhov and Muthén suggested a method similar to JM-MLMM-LN where all variables in the imputation models are treated as outcomes, regardless of missing data pattern, hereby denoted as full joint (JM-FJ) model (Asparouhov and Muthen, 2010). More recently, Goldstein et al. proposed a further extension of JM-MLMM-LN where the imputation model is defined as the product of the substantive model and the joint distribution of the covariates, to ensure congeniality (substantive model compatible, denoted JM-SMC) (Goldstein et al., 2014). 
The JM-MLMM-LN and JM-SMC approaches have been implemented in the REALCOM and Stat-JR software packages, respectively (see http://www.bristol.ac.uk/cmm/software/) and both were later adopted in the R software package jomo (Quartagno and Carpenter, 2016). The jomo implementations for JM-MLMM-LN and JM-SMC allow a random covariance matrix and hence are denoted as JM-MLMM-LN-het and JM-SMC-het. Similar efforts have been made to extend both the FCS-LMM and FCS-LMM-het methods to impute categorical data using either generalized LMM (GLMM)-based MI methods (Resche-Rigon and White, 2016; Zhao and Yucel, 2009) or LN variables (FCS-LMM-LN and FCS-LMM-LN-het) (Enders et al., 2017).

In the special case of longitudinal data collected at equal intervals, standard cross-sectional implementations of MVNI and FCS can be employed to impute missing values by treating the time-dependent longitudinal measurements as distinct variables (Schafer, 1997; Van Buuren et al., 2006); we denote these as JM-MVN and FCS-Standard, respectively. These single-level MI methods can also be used for cluster-correlated data by including cluster-specific indicator variables to capture the within-cluster correlation – known as ‘fixed cluster imputation’ (Reiter et al., 2006).

Although similar MI methods can be used to impute missing values in both longitudinal and clustered data settings, the performance of these methods may differ according to the intra-subject/intra-cluster association between the outcome and incomplete variables in the analysis model, particularly when both the outcome and the covariates associated with random effects contain missing values. In the longitudinal setting, random slopes (i.e., random coefficients for covariates) are usually associated with the time variable only, which is generally fully observed, but covariates with random slopes in the context of clustered data may be incomplete. Furthermore, it is unclear how important these differences are in practice, as currently available comparisons of the various MI methods in the literature are limited to either clustered or longitudinal data settings, with little theoretical consideration.

In the context of cluster-correlated data, Grund et al. compared two different modeling strategies with JM-MLMM: (i) a multivariate LMM with a so-called reverse random coefficients model assuming that the outcome is fully observed (this model regresses covariates on the outcome with the outcome having random effects if the covariate has them in the analysis model) for imputing missing data in covariates and (ii) a multivariate LMM with random intercepts only (thus ignoring random slopes in the outcome model) for imputing missing data in both covariates and outcome (Grund et al., 2016). They noted that the reverse random coefficients model provided unbiased estimates of the regression and variance components, but the second model performed poorly for the estimation of the random slope variance. Similar findings were also observed by Enders and colleagues (Enders et al., 2016) who compared JM-MLMM, FCS-LMM-het and fixed cluster imputation when both the outcome and covariates contain missing data. They reported that the FCS-LMM-het approach exhibited better performance than the other methods especially when both the outcome and covariate were incomplete in a random intercept and slope analysis model. Audigier and colleagues (Audigier et al., 2018) recently compared a number of methods including fixed cluster imputation, JM-MLMM, FCS-LMM-het, and JM-MLMM-LN-het in the context of cluster-correlated data and reported that all of these methods provided reliable estimation of the regression parameters but JM-MLMM and fixed cluster imputation approaches severely under-estimated the variance components. No such comparison in the context of longitudinal data, where the analysis model of interest is a random intercept and time-slope model, is available in the literature. Recently, we compared 12 different MI approaches for imputation of incomplete longitudinal data where the analysis model of interest is a LMM with subject-specific random intercept only (Huque et al., 2018). 
We showed that both standard MI methods (JM-MVN and FCS-Standard) and LMM-based approaches (JM-MLMM, JM-MLMM-LN, FCS-LMM and FCS-LMM-LN), provided consistent estimates of the regression and variance component parameters. However, these results may not be generalizable to a random intercept and slope analysis model. Moreover, all the above comparisons are empirical and no theoretical justification for the observed sub-optimal results is available.

The motivation for this study was an analysis of the Longitudinal Study of Australian Children (LSAC) that explored (a) the association between body mass index (BMI) and health related quality of life (QoL) for children over time and (b) whether the association between early BMI and QoL in later life varied across geographical location. Attrition and non-response make these data a natural candidate for analysis using MI, but no clear guideline was available on the selection of the appropriate MI method. In the current paper, we study the properties of available MI methods, both theoretically and via simulations based on these examples, and we also perform an analysis of the LSAC data. As both BMI and QoL are continuous measures, we restrict our comparisons to the approaches where all variables in the MI model are continuous. This simplifies the study of theoretical compatibility between specific imputation models fitted under each of these MI approaches and the analysis model and reduces the number of competitive MI methods, as under this restriction the MI methods with latent normal variables (JM-MLMM-LN, FCS-LMM-LN and FCS-LMM-LN-het) are identical with those that treat all the variables as continuous (JM-MLMM, FCS-LMM, and FCS-LMM-het, respectively). Our study of compatibility confirms that MI approaches result in consistent estimates of regression parameters when the imputation model is compatible with the analysis model. The results from the LSAC data analysis are also in agreement with those seen in the simulation study.

The structure of the article is as follows: Section 2 describes LSAC and the analysis models of interest. Sections 3 and 4 present a theoretical exploration of the compatibility of different MI methods and a linear mixed model with random intercept and slopes as analysis model in the context of longitudinal and cluster-correlated data, respectively. Section 5 describes and presents the results of our simulation study. The application to the LSAC data is presented in Section 6. We conclude with a general discussion in Section 7. The Web Appendices give detailed proofs, as needed.

2. Methods

2.1. Analysis models of interest

Let yi = (yi1, yi2, …, yini)T be the ni-repeated measures of a continuous outcome for individual i ∈ (1,2, …, n), and xi = (xi1, xi2, …, xini)T and ti = (ti1, ti2, …, tini)T represent repeated measures of a continuous covariate and the measurement times, respectively. Suppose the association between the repeated measured outcome and covariates can be expressed using the following LMM

$$y_i \mid x_i, t_i = \beta_0 + \beta_1 x_i + \beta_2 t_i + b_{0i} + b_{1i} t_i + \varepsilon_i, \quad i = 1, 2, \ldots, n \tag{1}$$

where $\beta = (\beta_0, \beta_1, \beta_2)$ is the vector of fixed effects, $b_i = (b_{0i}, b_{1i}) \sim N(0, G)$ denotes the random effects vector and $\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{in_i}) \sim N(0, \Phi_i = \sigma_{\varepsilon_i}^2 I)$, where $I$ is the $n_i \times n_i$ identity matrix. The LMM in (1) typically assumes that the residual errors $\varepsilon_i$ and the random effects $b_i$ are independent of each other. Thus the marginal distribution of $y_i$ is $MVN(\mu_{y_i} = \beta_0 + \beta_1 x_i + \beta_2 t_i, \; \Sigma_{y_i} = Z_i G Z_i^T + \Phi_i)$, where $Z_i = (1, t_i)^T$ is an $n_i \times 2$ matrix with the first column having all elements equal to 1. This LMM models the longitudinal trajectory for each subject over time.
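As a small numerical illustration of the marginal covariance Z_i G Z_i^T + Φ_i, the sketch below builds it for one subject with NumPy. The function name `marginal_cov`, the measurement times and the values of G and the residual variance are illustrative assumptions, and a common residual variance is assumed.

```python
import numpy as np

def marginal_cov(times, G, sigma2):
    """Covariance of y_i implied by a random intercept and time-slope LMM."""
    t = np.asarray(times, dtype=float)
    Z = np.column_stack([np.ones_like(t), t])   # n_i x 2: column of ones, then times
    return Z @ G @ Z.T + sigma2 * np.eye(t.size)

# Illustrative values (assumptions for this sketch)
G = np.array([[0.36, 0.012],
              [0.012, 0.004]])
Sigma_y = marginal_cov([0, 1, 2, 3], G, sigma2=0.25)
print(np.round(Sigma_y, 3))
```

The diagonal entries equal G00 + 2 t G01 + t^2 G11 plus the residual variance, so with a random slope the variance of the outcome changes with time.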

A similar model can also be applied to clustered data where the effect of some covariates on the outcome are allowed to vary from cluster to cluster. In the clustered data setting, the LMM with a random intercept and slope might take the following form

$$y_i \mid x_{1i}, x_{2i} = \alpha_0 + \alpha_1 x_{1i} + \alpha_2 x_{2i} + a_{0i} + a_{2i} x_{2i} + \xi_i, \quad i = 1, 2, \ldots, m; \tag{2}$$

where x1i and x2i are vectors of measurements of covariates x1 and x2, respectively within cluster i ∈ (1, 2, …m), assumed to be associated with the outcome, yi.

The estimation of parameters for the above LMMs can be carried out in a similar fashion if all the variables in the model are complete. However, in the presence of incomplete data in the covariates the above two classes of models could differ: in the longitudinal setting, the covariate associated with the random slope, the measurement time t, is generally fully observed, while in the clustered data setting, the covariate (x2) associated with a random slope may be incomplete. To assess the performance of the MI approaches in these distinct situations where an LMM with random intercepts and slopes is the analysis model of interest, we evaluated their performance under the following four scenarios: in the case of a longitudinal study where (i) only the covariate x is incomplete and (ii) both the covariate x and outcome y are incomplete; and in the context of clustered data where (iii) only the covariate x2 is incomplete and (iv) both covariate x2 and outcome y are incomplete.

In the next two sections we study the theoretical properties of various MI methods available for imputing longitudinal and clustered data; in particular, we examine the potential for compatibility of each imputation model with the analysis model of interest.

3. MI methods for missing data in longitudinal settings

In longitudinal studies, data from the same individuals are collected repeatedly over time. Longitudinal data can be arranged in the wide format (new variable for each repeated measurement) if measurements occur at the same time-points for all individuals (i.e., the dataset is balanced) or in the long format (where repeated measurements are stacked). The wide format data can be imputed using standard cross-sectional imputation models (JM-MVN and FCS-Standard) by assuming the repeated assessments of the same variable are distinct variables, while imputation with the long format data requires use of multilevel imputation models.
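The two layouts can be illustrated with a short NumPy sketch that converts balanced long-format records into a wide matrix with one column per wave. The function `long_to_wide` and the toy records are hypothetical; missing visits are left as NaN so that a cross-sectional imputation model could treat each wave as a distinct variable.

```python
import numpy as np

def long_to_wide(ids, waves, values, n_waves):
    """Arrange stacked (id, wave, value) records as one row per subject."""
    ids = np.asarray(ids)
    waves = np.asarray(waves)
    values = np.asarray(values, dtype=float)
    unique_ids, row = np.unique(ids, return_inverse=True)
    wide = np.full((unique_ids.size, n_waves), np.nan)
    wide[row, waves - 1] = values            # waves are numbered from 1
    return unique_ids, wide

# Toy long-format data: subject 102 has no record at wave 2
subject_ids, wide = long_to_wide(
    ids=[101, 101, 101, 102, 102],
    waves=[1, 2, 3, 1, 3],
    values=[0.5, 0.7, 0.6, 1.1, 0.9],
    n_waves=3,
)
print(subject_ids, wide)
```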

3.1. JM-MVN

JM-MVN can be applied if we have balanced longitudinal data by treating all the repeated measurements of time-dependent variables as distinct. This method assumes a multivariate normal distribution for all of the incomplete variables. More specifically, assume that both the time-dependent covariate and the outcome for individual i ∈ (1, 2, …, n) are measured on T occasions, where t = (1, 2, …, T) represents the vector of time-points at which the measurements took place. If both covariate x and outcome y are incomplete, then JM-MVN assumes that (y1, y2, …, yT, x1, x2, …, xT) ~ N(μ, Σ) where μ and Σ are the mean and an unstructured variance-covariance matrix, respectively.

We study the congeniality between JM-MVN and the analysis model (1) in the setting where the covariate xi also follows a LMM defined as

$$x_i \mid t_i = \gamma_0 + \gamma_1 t_i + u_{0i} + u_{1i} t_i + \epsilon_i, \tag{3}$$

where ϵi ~ N(0, ϒ) and ui = (u0i, u1i) ~ N(0, D), where ϒ and D are the covariance matrices currently left unspecified. As both the conditional distributions of (yi|xi,ti) and (xi|ti) are Gaussian, the joint distribution of (yi, xi|ti) is also Gaussian. Since we are assuming that the data are collected for an equal number of visits at fixed time intervals for all individuals, the joint distribution of (y, x|t)T = (y1, y2, …yT, x1, x2, …xT|1, 2, …T) is normal and given by

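$$\begin{pmatrix} y \\ x \end{pmatrix} \Big| \, t = N\left(\mu = \begin{pmatrix} \beta_0 + \beta_1(\gamma_0 + \gamma_1 t) + \beta_2 t \\ \gamma_0 + \gamma_1 t \end{pmatrix}, \; \Sigma = \begin{pmatrix} \beta_1 \Sigma_x \beta_1^T + \Sigma_y & \beta_1 \Sigma_x \\ \Sigma_x \beta_1 & \Sigma_x \end{pmatrix}\right). \tag{4}$$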

[see the Appendix A.1 for proof]. Therefore, the joint distribution assumed by JM-MVN in scenario (ii) is compatible with the joint distribution implied by analysis model (1).
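The block structure of the joint covariance can be checked numerically. The sketch below is not the Appendix A.1 proof; it assumes NumPy, a scalar coefficient on x, T = 4 equally spaced occasions and illustrative parameter values, and it builds the covariance of (y, x) directly from the two LMMs before comparing it with the block form stated above.

```python
import numpy as np

T = 4
t = np.arange(1, T + 1, dtype=float)
Z = np.column_stack([np.ones(T), t])

# Illustrative parameter values (assumptions for this sketch)
beta1 = -0.2
D = np.array([[0.49, 0.015], [0.015, 0.005]])   # random-effects covariance for x
G = np.array([[0.36, 0.012], [0.012, 0.004]])   # random-effects covariance for y
Sigma_x = Z @ D @ Z.T + 0.25 * np.eye(T)        # covariance of x given t
Sigma_y = Z @ G @ Z.T + 0.44 * np.eye(T)        # covariance of y given x and t

# y = beta0 + beta1 * x + beta2 * t + e_y with e_y ~ N(0, Sigma_y) independent of x,
# so the stacked vector (y, x) is a linear map M of the independent pair (e_y, x).
I = np.eye(T)
O = np.zeros((T, T))
M = np.block([[I, beta1 * I], [O, I]])
joint = M @ np.block([[Sigma_y, O], [O, Sigma_x]]) @ M.T

# Block form of the joint covariance for a scalar beta1
expected = np.block([[beta1 * Sigma_x * beta1 + Sigma_y, beta1 * Sigma_x],
                     [Sigma_x * beta1, Sigma_x]])
print(np.allclose(joint, expected))
```

Because the resulting matrix is a valid covariance matrix of dimension 2T, it is one instance of the unstructured Σ that JM-MVN estimates.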

Scenario (i), where there is missing data only for x, is a special case of scenario (ii), hence JM-MVN will be compatible with analysis model (1) for this scenario too.

3.2. JM-MLMM

Instead of treating repeated measurements as distinct variables, Schafer and Yucel suggested using a multivariate LMM for imputing several incomplete longitudinal variables (Schafer and Yucel, 2002). Under scenario (ii) this method imputes missing data from the following multivariate LMM:

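$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} \Big| \, t_i = \begin{pmatrix} \beta_{0(x)} + \beta_{1(x)} t_i + b_{0(x)i} + b_{1(x)i} t_i + \varepsilon_{(x)i} \\ \beta_{0(y)} + \beta_{1(y)} t_i + b_{0(y)i} + b_{1(y)i} t_i + \varepsilon_{(y)i} \end{pmatrix} \tag{5}$$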

where $(b_{0(x)i}, b_{1(x)i}, b_{0(y)i}, b_{1(y)i})^T \sim N(0, \Psi)$ and $(\varepsilon_{(x)i}, \varepsilon_{(y)i})^T \sim N[0, (\Sigma \otimes I)]$. The covariance matrix $\Psi$ has dimension 4 × 4 and the Kronecker product notation indicates that the $\varepsilon_{(x)i}$ and $\varepsilon_{(y)i}$ are independently distributed as $N(0, \Sigma)$. With some algebra we can show that analysis model (1) can be obtained as a special case of the conditional model for the outcome given the covariate, x, under the bivariate joint distribution defined in (5) [see Appendix A.2 for proof]. Hence the JM-MLMM model would be compatible with the analysis model of interest under scenario (ii).

Under scenario (i) i.e., when only covariate x contains missing data, the imputation model under JM-MLMM is given by

$$x_i \mid y_i, t_i = \beta_{0(x)} + \beta_{1(x)} y_i + \beta_{2(x)} t_i + b_{0(x)i} + b_{1(x)i} t_i + \varepsilon_{(x)i} \tag{6}$$

where $\varepsilon_{(x)i} \sim N(0, \Sigma_{(x)})$ and $b_{(x)i} = (b_{0(x)i}, b_{1(x)i}) \sim N(0, \Psi_{(x)})$ with $\Sigma_{x|y} = Z_i \Psi_{(x)} Z_i^T + \Sigma$. Thus, under scenario (i), JM-MLMM would be compatible with the substantive model if both of the conditional models $x_i \mid y_i, t_i$ and $y_i \mid x_i, t_i$ lie in the subspace determined by the joint model $(x_i, y_i \mid t_i)$. It can be shown that imputation model (6) is compatible with analysis model (1) if $\beta_{1(x)}^T \Sigma_{x|y}^{-1} = \beta_1 \Sigma_y^{-1}$ [see Appendix A.2 for proof]. Similar conditions for two linear regressions to be compatible when the target joint distribution is bivariate normal have also been noted (Zhu and Raghunathan, 2015; Liu et al., 2014). The current paper extends those results to the context of LMM.
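The bivariate normal analogue of this condition is easy to verify numerically. The sketch below assumes NumPy and a hypothetical 2 × 2 covariance matrix for scalar (x, y); it is an illustration of the scalar case only, not of the LMM result in Appendix A.2. For each regression it computes the slope and the residual variance and checks that slope divided by residual variance agrees in the two directions.

```python
import numpy as np

def conditional_slope_and_var(cov, target, given):
    """Slope and residual variance of the regression of `target` on `given`."""
    slope = cov[target, given] / cov[given, given]
    resid_var = cov[target, target] - cov[target, given] ** 2 / cov[given, given]
    return slope, resid_var

# Hypothetical covariance matrix of (x, y)
cov = np.array([[1.0, 0.6],
                [0.6, 2.0]])
b_xy, v_xy = conditional_slope_and_var(cov, target=0, given=1)   # x given y
b_yx, v_yx = conditional_slope_and_var(cov, target=1, given=0)   # y given x

lhs = b_xy / v_xy
rhs = b_yx / v_yx
print(lhs, rhs)
```

Both ratios equal the covariance of x and y divided by the determinant of the covariance matrix, which is why two linear regressions derived from one bivariate normal distribution always satisfy the condition.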

3.3. JM-FJ

Asparouhov and Muthén suggested an alternative to JM-MLMM-LN (Goldstein et al., 2009) in which the data are imputed using an unrestricted model, where all variables in the imputation model are treated as outcomes regardless of the missing data pattern, hereafter denoted as the full joint (JM-FJ) model (Asparouhov and Muthén, 2010). The JM-FJ method under both scenarios (i) and (ii) is given by

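$$\begin{pmatrix} y_i \\ x_i \\ t_i \end{pmatrix} = \begin{pmatrix} \beta_{0(y)} + b_{(y)0i} + \varepsilon_{(y)i} \\ \beta_{0(x)} + b_{(x)0i} + \varepsilon_{(x)i} \\ \beta_{0(t)} + b_{(t)0i} + \varepsilon_{(t)i} \end{pmatrix}, \tag{7}$$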

where $(b_{(y)0i}, b_{(x)0i}, b_{(t)0i})^T \sim N(0, \Omega_u)$ and $(\varepsilon_{(y)i}, \varepsilon_{(x)i}, \varepsilon_{(t)i})^T \sim N(0, \Omega_\varepsilon)$. This model imposes the same random-effect structure for all variables, decomposing the variance into within- and between-individual components. In longitudinal studies, data are often collected at fixed time intervals for all individuals, and therefore it may not be sensible to assume between-individual variability for the time variable (or corresponding latent variable). The JM-FJ approach has a large number of parameters and convergence is often difficult to achieve. Moreover, it can be shown that the joint distribution implied by JM-FJ (7) is not compatible with the substantive model (1) [see Appendix A.3 for proof]. This uncongeniality arises because JM-FJ does not accommodate the variability in the slope across individuals. Because of this uncongeniality, in our simulation studies we also examine whether assuming heteroscedastic covariance matrices in the imputation model may improve the estimation of the variance components by allowing for subject-specific correlations (JM-FJ-het).
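One way to see why the absence of a random slope matters is to compare the outcome variances implied by the two random-effect structures. The sketch below is only an intuition aid under assumed parameter values, using NumPy; it is not the Appendix A.3 argument. With a random time-slope the variance of the outcome changes across occasions, whereas a random-intercept-only structure implies the same variance at every occasion.

```python
import numpy as np

t = np.arange(0, 6, dtype=float)                 # six hypothetical occasions
Z = np.column_stack([np.ones_like(t), t])

# Illustrative values (assumptions for this sketch)
G = np.array([[0.36, 0.012],
              [0.012, 0.004]])
sigma2 = 0.25

# Random intercept and slope: variance depends on time
var_with_slope = np.diag(Z @ G @ Z.T) + sigma2

# Random intercept only: variance is constant over time
var_intercept_only = np.full(t.size, G[0, 0] + sigma2)

print(np.round(var_with_slope, 3))
print(np.round(var_intercept_only, 3))
```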

3.4. JM-SMC

Goldstein, Carpenter and Browne (2014) extended JM-MLMM-LN to handle missing data in both covariates and outcomes in multilevel models while ensuring that the imputation model is compatible with the substantive model (Goldstein et al., 2014). We refer to this as the substantive-model-compatible joint modeling approach (JM-SMC). In this formulation, the joint imputation model is defined as a product of the joint distribution of covariates and the analysis model (i.e., the conditional model for the outcome given the covariates). Specifically, the JM-SMC approach defines the joint distribution of $(x_i, y_i \mid t_i)$ as

$$(x_i, y_i \mid t_i) = (y_i \mid x_i, t_i) \times (x_i \mid t_i), \tag{8}$$

where $x_i \mid t_i = \beta_{0(x)} + \beta_{1(x)} t_i + b_{0(x)i} + b_{1(x)i} t_i + \varepsilon_{(x)i}$ with $b_{(x)} \sim N(0, \Theta_u)$ and $\varepsilon_{(x)i} \sim N(0, \Theta_\varepsilon)$. JM-SMC thus ensures compatibility under both scenarios (i) and (ii). Similarly to JM-FJ, in our simulation we also assume heteroscedastic covariance matrices for the imputation using JM-SMC, and we label this JM-SMC-het.

3.5. FCS-Standard

Similarly to JM-MVN, FCS-Standard can be applied only in the setting with regular measurement time-points, by treating all the repeated measurements of time-dependent variables as distinct variables. Specifically, this approach involves a conditional imputation model for each time-and-variable-specific measurement given the remaining measurements and variables. When considering only continuous outcome and covariates, as in this manuscript, FCS-Standard is implemented using linear regression models without interactions between covariates for the univariate imputation models. In this situation, FCS-Standard and JM-MVN are equivalent (see proposition 1 of (Hughes et al., 2014)). Given we have shown that JM-MVN is compatible with analysis model (1) under model (3) for the incomplete covariate, FCS-Standard will also be compatible with the analysis model (1) in both scenarios under these conditions.

3.6. FCS-LMM

Instead of treating repeated measurements as distinct variables, the FCS-LMM method uses a LMM for imputing missing values in each incomplete time-dependent variable given all the others, cycling iteratively through the univariate imputation models. Specifically, the Gibbs sampler cycles through the univariate LMMs assuming homogeneous within-subject variance, which is a special case of a multivariate LMM (5). That is, it uses the same imputation models as JM-MLMM with only one variable considered incomplete at a given iteration. Under scenarios (i) and (ii), this method will be compatible with the analysis model if the compatibility condition (derived in 3.2) is satisfied.

3.7. FCS-LMM-het

Similarly to FCS-LMM, FCS-LMM-het imputes each time-dependent incomplete variable using a LMM. However, this method allows a subject-specific residual error variance. Under this approach, the imputation model for covariate x for the ith subject is given by

$$(x_i \mid y_i, t_i, b_i) = N\left(\beta_{0i(x)} + \beta_{1i(x)} y_i + \beta_{2i(x)} t_i, \; \Sigma_{i,x|y} = \sigma_{ix}^2 I_{n_i}\right), \tag{9}$$

where β0i(x) = β0(x) + b0(x)i, β1i(x) = β1(x) + b1(x)i and β2i(x) = β2(x) + b2(x)i. Note the FCS-LMM-het approach assumes random slopes for each variable in the imputation model. Analysis model (1) can be re-written as

$$(y_i \mid x_i, t_i, b_i) = N\left((\beta_0 + b_{0i}) + \beta_1 x_i + (\beta_2 + b_{1i}) t_i, \; \Sigma_{y|x} = \sigma_y^2 I_{n_i}\right)$$

Using similar arguments as with FCS-LMM, it can be shown that under scenario (i) FCS-LMM-het would be compatible with the analysis model if both the conditional models $x_i \mid y_i, t_i$ and $y_i \mid x_i, t_i$ lie in the subspace determined by the joint model $(x_i, y_i \mid t_i)$. It can thus be shown that the imputation model (9) is compatible with the analysis model (1) if $\beta_{1i(x)}^T \Sigma_{i,x|y}^{-1} = \beta_1 \Sigma_y^{-1}$, which is very similar to the condition derived in Section 3.2. Hence, this method will be compatible with the analysis model (1) under both scenarios (i) and (ii) if the above compatibility condition is satisfied.

In summary, for longitudinal studies, we anticipate that all of the above methods, except the JM-FJ, will provide consistent estimates of regression and variance components.

4. MI models for missing data in cluster-correlated data

In cluster-correlated settings, data are arranged in a long format by stacking data from each cluster. The MI methods that can be carried out are (i) standard JM and FCS approaches using a total of m − 1 indicator variables representing the m clusters as a fixed factor in the model (fixed cluster imputation) (Reiter et al., 2006; Enders et al., 2016), or (ii) a multilevel imputation method.

The use of indicator variables in fixed cluster imputation preserves the difference in intercept between clusters. If the covariate(s) associated with random slopes are incomplete, interactions between the indicator variables and the incomplete variables are also needed to accommodate the random slope variation. Such an imputation model requires estimation of a large number of parameters and hence is computationally demanding and often infeasible, particularly with a large number of small clusters (Enders et al., 2016). In contrast, the multilevel imputation approach is more appealing as it can be easily implemented for random intercept and slope models and is computationally faster than the fixed cluster imputation approach. Therefore, in this paper we will only study the theoretical and empirical properties of the multilevel imputation approach, although we return to fixed cluster imputation in the discussion section.

4.1. JM-MLMM

Similarly to the approach for longitudinal data, JM-MLMM uses a joint LMM for all incomplete variables. However, the formulation of the LMM will differ with respect to whether the incomplete variables are associated with random slopes or fixed effects. For example, when covariate x1 and outcome y are incomplete the following JM-MLMM model is assumed for the incomplete variables:

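$$\begin{pmatrix} x_{1i} \\ y_i \end{pmatrix} \Big| \, x_{2i} = \begin{pmatrix} \alpha_{0(x_1)} + \alpha_{1(x_1)} x_{2i} + a_{0(x_1)i} + a_{2(x_1)i} x_{2i} + \varepsilon_{1i} \\ \alpha_{0(y)} + \alpha_{1(y)} x_{2i} + a_{0(y)i} + a_{1(y)i} x_{2i} + \varepsilon_{2i} \end{pmatrix}$$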

Thus, similarly to the longitudinal settings [Appendix A.2], there is compatibility with the analysis model. As the above imputation model is similar to the longitudinal case (with t replaced by x2), we do not consider it further. However, when covariate x2 and outcome y are incomplete (scenario (iv)) the joint model for the incomplete variables using JM-MLMM is

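$$\begin{pmatrix} x_{2i} \\ y_i \end{pmatrix} \Big| \, x_{1i} = \begin{pmatrix} \alpha_{0(x_2)} + \alpha_{1(x_2)} x_{1i} + a_{0(x_2)i} + \varepsilon_{3i} \\ \alpha_{0(y)} + \alpha_{1(y)} x_{1i} + a_{0(y)i} + \varepsilon_{4i} \end{pmatrix}$$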

This JM-MLMM imputation model does not accommodate the random slope for the incomplete variable and is therefore incompatible with the analysis model [the proof is omitted as it is similar to the proof provided for the incompatibility of the JM-FJ model]. Similarly to the longitudinal setting, it can be shown that when either x1 or x2 contains missing values but the outcome is fully observed (scenario (iii)) JM-MLMM would be compatible with the analysis model (2).

4.2. JM-FJ

The JM-FJ model for cluster-correlated data assumes the following joint model for (y, x1, x2) irrespective of missing data in y, x1 or x2

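$$\begin{pmatrix} y_i \\ x_{1i} \\ x_{2i} \end{pmatrix} = \begin{pmatrix} \beta_{0(y)} + b_{(y)0i} + \varepsilon_{(y)i} \\ \beta_{0(x_1)} + b_{(x_1)0i} + \varepsilon_{(x_1)i} \\ \beta_{0(x_2)} + b_{(x_2)0i} + \varepsilon_{(x_2)i} \end{pmatrix} \tag{10}$$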

Similarly to JM-FJ for longitudinal data (see section 3.3) it can be shown that the joint distribution implied by JM-FJ (10) is not compatible with the analysis model (2) in either scenario (iii) or (iv).

4.3. JM-SMC

As for the analysis model (1), by construction the JM-SMC approach will also be compatible with analysis model (2) irrespective of whether missing data is in the outcome or covariates.

4.4. FCS-LMM

This method uses identical imputation models to those under JM-MLMM (see section 4.1) with only one variable considered missing at a given iteration, and is compatible with the analysis model (2) irrespective of whether covariate(s) and/or the outcome are incomplete (i.e., for scenario (iii) and (iv)) if the compatibility condition (derived in 3.2) is satisfied.

4.5. FCS-LMM-het

The imputation model followed by FCS-LMM-het in the clustered data setting will be compatible with the substantive model of interest under both scenarios (iii) and (iv). The proof is similar to that provided in section 3.7 and is given in Appendix A.4.

In summary, considering all of the above methods for cluster-correlated data, we anticipated that JM-FJ would provide biased estimates of the variance components under scenario (iii), and that both JM-MLMM and JM-FJ would do so under scenario (iv).

5. Simulation study

In this section we describe the simulation studies that were used to assess the relative performance of the MI methods described in Sections 3 and 4 in the settings of longitudinal and clustered data. Our simulation studies are based on data from the kindergarten (K) cohort of children in LSAC (n=4983), who were aged 4-5 years when recruited in 2004. LSAC is a nationally representative study that examines the development and wellbeing of Australian children. Following recruitment, data have been collected every two years (referred to as waves of data collection) using face-to-face interviews, questionnaires and direct anthropometric measurements. The study is ongoing, with six waves of data currently available. The detailed study procedure has been described elsewhere (LSAC). Here we consider two target analyses: (a) a longitudinal example: the association between BMI-z score and QoL in children over time and (b) a cluster example: whether BMI-z score at wave 5 predicts QoL at wave 6 after accounting for clustering by neighbourhood. Specifically, for analysis (a) we fitted a model similar to model (1) with QoL as a time-varying outcome, BMI-z score as a time-varying covariate and age (in years) of the child as the time variable, with child-specific random intercepts and time-slopes. For analysis (b) we fitted a model similar to (2) with child QoL at wave 6 as the outcome (y), child BMI-z score at wave 5 as the exposure of interest (x2) and socio-economic index for areas (SEIFA) as a covariate, with both fixed and random effects for area (x1). The missingness patterns among these variables in the LSAC dataset have been described elsewhere (Huque et al., 2018).

5.1. Longitudinal data

For the longitudinal example, we generated 1000 datasets of 5000 children assessed at 6 waves of follow-up. For each child we generated three baseline covariates (mother’s education, language spoken at home and family socio-economic position) and three time-dependent variables (age, BMI z-score and the outcome, QoL). The details of the simulation setup are given below:

  1. Whether English is the main language spoken at home (hlang) and maternal education (medu: whether or not completed year 12) for each child were generated using binomial distributions with probabilities 0.9 and 0.6 respectively.

  2. The household socio-economic position (hsep) at baseline was generated using the following regression model:
    $$\mathrm{hsep}_i = 0.8 + 1.0 \times \mathrm{medu}_i + 0.2 \times \mathrm{hlang}_i + \nu_i, \quad i = 1, 2, \ldots, 5000,$$
    where $\nu_i \sim N(0, 0.9^2)$.
  3. Child age in years (cage) for the ith child in the jth wave (cageij) was generated according to the following model
    $$\mathrm{cage}_{ij} = \frac{1}{12}\{48 + (\mathrm{wave}_{ij} - 1) \times 24 + \vartheta_i\} + v_{ij}, \quad j = 1, 2, \ldots, 6,$$
    where $\vartheta_i \sim N(11, 1.5^2)$ is the distribution of age (in months) of the participant at recruitment and $v_{ij} \sim N(0, 2^2)$ is the random variation in age at the time of assessment.
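  4. The time-varying exposure, $\mathrm{cbmi}_{ij}$, was then generated using the LMM
    $$\mathrm{cbmi}_{ij} = \gamma_0 + \gamma_1\,\mathrm{cage}_{ij} + u_{0i} + u_{1i}\,\mathrm{cage}_{ij} + \Upsilon_{ij},$$
    where $\gamma = (\gamma_0, \gamma_1)^T$ is the vector of fixed effects and $u_i = (u_{0i}, u_{1i})^T \sim N(0, D)$ denotes the random effects vector, with the following specification for the parameters: $\gamma = (-0.60, 0.10)^T$, $D = \begin{pmatrix} D_{00} & D_{01} \\ D_{10} & D_{11} \end{pmatrix} = \begin{pmatrix} 0.49 & 0.015 \\ 0.015 & 0.005 \end{pmatrix}$, where $D_{00} = \mathrm{var}(u_{0i})$, $D_{01} = \mathrm{cov}(u_{0i}, u_{1i})$, $D_{11} = \mathrm{var}(u_{1i})$, and $\Upsilon_i = (\Upsilon_{i1}, \Upsilon_{i2}, \ldots, \Upsilon_{in_i})^T \sim N(0, 0.5^2)$.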
  5. Finally, the continuous outcome variable, child QoL, cqolij was generated according to
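    $$\mathrm{cqol}_{ij} = \beta_0 + \beta_1\,\mathrm{cbmi}_{ij} + \beta_2\,\mathrm{cage}_{ij} + b_{0i} + b_{1i}\,\mathrm{cage}_{ij} + \varepsilon_{ij},$$
    where $\beta = (\beta_0, \beta_1, \beta_2)^T$ is the vector of fixed effects and $b_i = (b_{0i}, b_{1i})^T \sim N(0, G)$ denotes the random effects vector. We set $\beta = (1.00, -0.20, -0.10)^T$, $G = \begin{pmatrix} 0.36 & -0.012 \\ -0.012 & 0.004 \end{pmatrix}$ and residual errors $\varepsilon_i = (\varepsilon_{i1}, \varepsilon_{i2}, \ldots, \varepsilon_{in_i})^T \sim N(0, 0.66^2)$.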

All of the above parameter values were based on the LSAC data.
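The five generation steps above can be sketched in code. The following is a minimal illustration using only the Python standard library; it is not the implementation used for the study, and all function and variable names are hypothetical. It assumes that the off-diagonal element of D is 0.015 as printed, that the covariance in G is -0.012 (the true value reported for G01 in Table 1), and that v_ij is added on the scale of years, as the age model is printed.

```python
import math
import random


def draw_random_effects(rng, var0, cov01, var1):
    """Draw (intercept, slope) from N(0, [[var0, cov01], [cov01, var1]])."""
    # Entries of the 2x2 lower-triangular Cholesky factor.
    l00 = math.sqrt(var0)
    l10 = cov01 / l00
    l11 = math.sqrt(var1 - l10 ** 2)
    z0, z1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return l00 * z0, l10 * z0 + l11 * z1


def simulate_longitudinal(n_children=5000, n_waves=6, seed=2020):
    """Return one simulated dataset as a list of child-by-wave records."""
    rng = random.Random(seed)
    records = []
    for child in range(1, n_children + 1):
        # Step 1: binary baseline covariates.
        hlang = int(rng.random() < 0.9)
        medu = int(rng.random() < 0.6)
        # Step 2: household socio-economic position at baseline.
        hsep = 0.8 + 1.0 * medu + 0.2 * hlang + rng.gauss(0.0, 0.9)
        # Child-level quantities shared across waves.
        theta = rng.gauss(11.0, 1.5)  # months, enters the age model
        u0, u1 = draw_random_effects(rng, 0.49, 0.015, 0.005)   # matrix D
        b0, b1 = draw_random_effects(rng, 0.36, -0.012, 0.004)  # matrix G
        for wave in range(1, n_waves + 1):
            # Step 3: age in years; v_ij ~ N(0, 2^2) added as printed.
            cage = (48 + (wave - 1) * 24 + theta) / 12 + rng.gauss(0.0, 2.0)
            # Step 4: BMI z-score with random intercept and slope.
            cbmi = -0.60 + 0.10 * cage + u0 + u1 * cage + rng.gauss(0.0, 0.5)
            # Step 5: QoL outcome with random intercept and slope.
            cqol = (1.00 - 0.20 * cbmi - 0.10 * cage
                    + b0 + b1 * cage + rng.gauss(0.0, 0.66))
            records.append({"id": child, "wave": wave, "hlang": hlang,
                            "medu": medu, "hsep": hsep, "cage": cage,
                            "cbmi": cbmi, "cqol": cqol})
    return records
```

Calling `simulate_longitudinal()` once yields a single dataset in long format (one record per child and wave); repeating the call with 1000 different seeds would mirror the number of replicates described above.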

For each simulated dataset we considered two scenarios where (i) only the exposure of interest (cbmi) and (ii) both the exposure of interest (cbmi) and the outcome (cqol) were subject to missingness at each wave under an MAR mechanism. Specifically we used the following models to create missing data in cbmi and cqol, respectively

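$$\mathrm{logit}\{\Pr(R_{1ij} = 1)\} = \theta_1 + \theta_2\,\mathrm{cqol}_{ij} + \theta_3\,\mathrm{cage}_{ij}$$
$$\mathrm{logit}\{\Pr(R_{2ij} = 1)\} = \theta_4 + \theta_5\,\mathrm{cage}_{ij} + \theta_6\,\mathrm{hsep}_{ij}$$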

where $R_{1ij} = 1$ ($R_{2ij} = 1$) if $\mathrm{cbmi}_{ij}$ ($\mathrm{cqol}_{ij}$) is observed and 0 if missing. The coefficients $\theta = (\theta_1, \ldots, \theta_6)^T$ were chosen to ensure that approximately 30% of the exposure (cbmi) and outcome (cqol) values were missing.
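The values of θ are not reported in this section. One way to obtain a target missingness rate, sketched below under that caveat, is to fix the slope coefficients and solve for the intercept by bisection so that the average probability of being observed equals 0.7. The slopes in the usage note, the function names and the bisection approach itself are assumptions made for illustration, not the procedure used in the study.

```python
import math
import random


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def calibrate_intercept(partial_lp, target_observed, lo=-20.0, hi=20.0, tol=1e-8):
    """Bisection for the intercept that gives the target mean Pr(observed)."""
    def mean_prob(intercept):
        return sum(expit(intercept + lp) for lp in partial_lp) / len(partial_lp)

    # mean_prob is increasing in the intercept, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_prob(mid) < target_observed:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def delete_under_mar(values, partial_lp, target_observed=0.7, seed=2020):
    """Replace each value by None with probability 1 - expit(intercept + lp)."""
    rng = random.Random(seed)
    intercept = calibrate_intercept(partial_lp, target_observed)
    incomplete = [v if rng.random() < expit(intercept + lp) else None
                  for v, lp in zip(values, partial_lp)]
    return incomplete, intercept
```

For the cbmi model, `partial_lp` would hold θ2·cqol_ij + θ3·cage_ij for each record, for example with hypothetical slopes θ2 = 0.5 and θ3 = -0.1; the returned intercept then plays the role of θ1.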

5.2. Clustered data

In order to evaluate the performance of the above MI methods in clustered settings, we generated 1000 datasets, each with eight variables: area identification number, socio-economic status for areas (SEIFA), mother’s education (medu), language spoken at home (hlang), family socio-economic position (hsep), child sex (csex), BMI z-score (cbmi) and QoL (cqol). We considered 300 areas (clusters), where the number of children in each area varied between 2 and 25. Our simulated dataset mimicked the LSAC dataset not only in terms of cluster size and the number of clusters, but also with regard to the relationships between the covariates. The analysis of interest was whether the relationship between child BMI z-score at wave 5 and QoL at wave 6 varied across areas. In all of the simulated datasets, variables were simulated sequentially as follows:

  1. Sex (csex), English language background (hlang) and mother’s education (medu: whether or not completed year 12) for each child were generated using binomial distributions with probabilities 0.5, 0.9 and 0.6 respectively.

  2. Child age in years at wave 5 was generated using the following model
    $$\mathrm{cage}_{ij} = \frac{1}{12}\{168 + \vartheta_{ij}\}, \quad j = 2, 3, \ldots, 25; \quad i = 1, 2, \ldots, 300,$$
    where $\vartheta_{ij} \sim N(11, 1.5^2)$ is the distribution of age (in months) of the jth child at recruitment from area i.
  3. The main exposure variable of interest, cbmi was generated based on child’s age and sex using the following linear regression model
    $$\mathrm{cbmi}_{ij} = (1.0 + d_{0i}) + 0.11\,\mathrm{cage}_{ij} + 0.05\,\mathrm{csex}_{ij} + \psi_{ij},$$
    where $\psi_{ij} \sim N(0, 1)$ and $d_{0i} \sim N(0, 0.15^2)$.
  4. SEIFA at each area was generated as a standard normal variable.

  5. Family socio-economic position (hsep) was generated based on SEIFA, mother’s education and language using the following linear regression model
    $$\mathrm{hsep}_{ij} = 4.7 + 0.8\,\mathrm{medu}_{ij} + 0.01\,\mathrm{SEIFA}_i + 0.01\,\mathrm{hlang}_{ij} + \phi_{ij},$$
    where $\phi_{ij} \sim N(0, 0.9^2)$.
  6. Finally the outcome, cqol score, was generated using the LMM
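    $$\mathrm{cqol}_{ij} = (0.05 + b_{0i}) + (-0.2 + b_{1i})\,\mathrm{cbmi}_{ij} + 0.25\,\mathrm{SEIFA}_i + e_{ij},$$
    with $e_{ij} \sim N(0, 0.9^2)$, $(b_{0i}, b_{1i})^T \sim N(0, G)$ and $G = \begin{pmatrix} 0.16 & 0.00 \\ 0.00 & 0.04 \end{pmatrix}$.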

All of the above parameter values were based on LSAC data. The exception was that we slightly inflated the magnitude of the regression and variance-component parameters in the outcome model in order to accentuate the differences in the estimated parameters from the MI methods.
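The sequential generation of the clustered data can be sketched in the same way. The code below is a minimal standard-library illustration with hypothetical names, not the study's implementation. It assumes that cluster sizes are drawn uniformly from 2 to 25 (the section gives only the range), and that the fixed slope of cbmi in the outcome model is -0.2, in line with the true value reported in Tables 3 and 4.

```python
import random


def simulate_clustered(n_areas=300, min_size=2, max_size=25, seed=2020):
    """Return one simulated clustered dataset as a list of child records."""
    rng = random.Random(seed)
    records = []
    for area in range(1, n_areas + 1):
        # Assumption: cluster size uniform on {min_size, ..., max_size}.
        n_children = rng.randint(min_size, max_size)
        # Step 4: area-level SEIFA is standard normal.
        seifa = rng.gauss(0.0, 1.0)
        # Area-level random effects; G is diagonal, so draws are independent.
        d0 = rng.gauss(0.0, 0.15)  # random intercept in the cbmi model
        b0 = rng.gauss(0.0, 0.4)   # sd = sqrt(0.16)
        b1 = rng.gauss(0.0, 0.2)   # sd = sqrt(0.04)
        for child in range(1, n_children + 1):
            # Step 1: binary child-level covariates.
            csex = int(rng.random() < 0.5)
            hlang = int(rng.random() < 0.9)
            medu = int(rng.random() < 0.6)
            # Step 2: age in years.
            cage = (168 + rng.gauss(11.0, 1.5)) / 12
            # Step 3: exposure with an area-level random intercept.
            cbmi = (1.0 + d0) + 0.11 * cage + 0.05 * csex + rng.gauss(0.0, 1.0)
            # Step 5: family socio-economic position.
            hsep = (4.7 + 0.8 * medu + 0.01 * seifa + 0.01 * hlang
                    + rng.gauss(0.0, 0.9))
            # Step 6: outcome with random intercept and random cbmi slope.
            cqol = ((0.05 + b0) + (-0.2 + b1) * cbmi + 0.25 * seifa
                    + rng.gauss(0.0, 0.9))
            records.append({"area": area, "child": child, "seifa": seifa,
                            "csex": csex, "hlang": hlang, "medu": medu,
                            "cage": cage, "cbmi": cbmi, "hsep": hsep,
                            "cqol": cqol})
    return records
```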

For each simulated dataset we considered two scenarios: (iii) only the exposure of interest cbmi and (iv) both the exposure cbmi and outcome cqol were missing under an MAR mechanism. Specifically, we used the following models to create missing data in cbmi and cqol, respectively

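$$\mathrm{logit}\{\Pr(R_{3ij} = 1)\} = 2.2 + 1.0\,\mathrm{cqol}_{ij} + 0.2\,\mathrm{SEIFA}_i - 0.2\,\mathrm{hsep}_{ij}$$
$$\mathrm{logit}\{\Pr(R_{4ij} = 1)\} = 2.5 + 0.2\,\mathrm{SEIFA}_i - 0.3\,\mathrm{hsep}_{ij}$$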

where $R_{3ij} = 1$ ($R_{4ij} = 1$) if cbmi (cqol) is observed and 0 if missing.
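As an illustration of how these two models translate into missing values, the sketch below computes the probability of being observed for a single record and deletes cbmi (and, in scenario (iv), cqol) accordingly. It assumes that the hsep terms enter with negative signs (-0.2 and -0.3); the record layout and function names are hypothetical.

```python
import math
import random


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def prob_cbmi_observed(cqol, seifa, hsep):
    """Pr(R3 = 1): probability that cbmi is observed."""
    return expit(2.2 + 1.0 * cqol + 0.2 * seifa - 0.2 * hsep)


def prob_cqol_observed(seifa, hsep):
    """Pr(R4 = 1): probability that cqol is observed."""
    return expit(2.5 + 0.2 * seifa - 0.3 * hsep)


def impose_missingness(record, scenario, rng):
    """Return a copy of record with cbmi (and cqol in scenario 'iv') deleted at random."""
    out = dict(record)
    # Both probabilities are computed from the complete record
    # before any value is deleted.
    p_cbmi = prob_cbmi_observed(record["cqol"], record["seifa"], record["hsep"])
    p_cqol = prob_cqol_observed(record["seifa"], record["hsep"])
    if rng.random() >= p_cbmi:
        out["cbmi"] = None
    if scenario == "iv" and rng.random() >= p_cqol:
        out["cqol"] = None
    return out
```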

6. Performance of the MI method

We applied all the imputation methods described in Sections 3 and 4 to the simulated and LSAC datasets. Van Buuren and Groothuis-Oudshoorn (2011) identify seven main choices in the specification of a multiple imputation method: (i) the MAR assumption, (ii) the form of the imputation model, (iii) the set of variables included in the imputation model, (iv) passive imputation, (v) the order of the variables, (vi) the number of iterations and (vii) the number of multiply imputed datasets. In light of these, we generated data under MAR and included the same set of predictors across all the imputation methods. Specifically, we included socio-economic position as an auxiliary variable, in addition to all analysis variables, and considered the same order of imputation variables for all the methods (where applicable). Thirty imputations were generated for each approach to limit Monte Carlo (imputation-related) error for the regression coefficient of interest to approximately 5 percent of its standard error. For each method, we set the number of burn-in iterations and the number of between-imputation iterations to the default values of current software implementations; the form of the imputation model varied according to the specific imputation method. We compared estimated regression coefficients, standard errors (both the average of the model-based standard errors and the empirical standard error) and variance-component estimates from the various imputation approaches and an available data analysis, which excluded records with missing data in any analysis variable. Bias and coverage probability of the estimated regression coefficients from each of the approaches and from an available data analysis, compared to the values used to generate the data, are also presented. In each case, the sampling properties of the estimators were estimated from the 1000 simulated datasets.
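The performance measures named above can be computed from the per-dataset estimates and standard errors as sketched below. This is an illustration with hypothetical names rather than the code used for the study. It assumes that relative bias is the absolute difference between the average estimate and the true value, divided by the true value, and that 95% intervals are formed with a normal quantile; pooled MI intervals are usually based on a t reference distribution, so the coverage computed here is an approximation.

```python
import statistics


def performance(estimates, std_errors, true_value, z=1.959964):
    """Summarise one estimator across simulated datasets."""
    if true_value == 0:
        raise ValueError("relative bias is undefined for a true value of zero")
    mean_estimate = statistics.fmean(estimates)
    # Assumed definition: |average estimate - truth| / |truth|.
    rel_bias = abs(mean_estimate - true_value) / abs(true_value)
    # Average of the model-based standard errors.
    model_se = statistics.fmean(std_errors)
    # Empirical SE: standard deviation of the estimates across datasets.
    empirical_se = statistics.stdev(estimates)
    # Proportion of nominal 95% intervals that contain the true value.
    covered = sum(1 for est, se in zip(estimates, std_errors)
                  if est - z * se <= true_value <= est + z * se)
    return {"mean_estimate": mean_estimate, "rel_bias": rel_bias,
            "model_se": model_se, "empirical_se": empirical_se,
            "coverage": covered / len(estimates)}
```

Comparing `model_se` with `empirical_se` indicates whether the model-based standard errors reflect the actual sampling variability of the estimator.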

6.1. Simulation results

The simulation results for the longitudinal example with missing values under scenarios (i) and (ii) across the 1000 simulated datasets are displayed in Tables 1 and 2, respectively. It is clear from Tables 1 and 2 that the available data analysis resulted in biased estimation of the regression coefficients and variance components, along with inadequate coverage probabilities.

Table 1. Simulation results for the analysis of longitudinal data using simulation scenario (i), i.e., scenario with data missing in the covariate only.

| Regression parameters | True value | Available data | JM-MVN | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-standard | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cbmi, $\hat{\beta}_1$ | -0.200 | -0.184 | -0.200 | -0.204 | -0.194 | -0.186 | -0.195 | -0.197 | -0.199 | -0.204 | -0.204 |
| $\hat{\beta}_1$ rbias (%) | | 0.078 | 0.002 | 0.020 | 0.029 | 0.068 | 0.027 | 0.014 | 0.003 | 0.021 | 0.019 |
| Model SE | | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 |
| Empirical SE | | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 | 0.008 |
| 95% Coverage | | 0.463 | 0.945 | 0.921 | 0.895 | 0.644 | 0.891 | 0.941 | 0.950 | 0.919 | 0.926 |
| cage, $\hat{\beta}_2$ | -0.100 | -0.090 | -0.100 | -0.100 | -0.101 | -0.102 | -0.101 | -0.101 | -0.100 | -0.100 | -0.100 |
| $\hat{\beta}_2$ rbias (%) | | 0.105 | 0.001 | 0.003 | 0.009 | 0.016 | 0.010 | 0.007 | <0.001 | 0.003 | 0.003 |
| Model SE | | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| Empirical SE | | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 95% Coverage | | 0.000 | 0.948 | 0.947 | 0.926 | 0.853 | 0.923 | 0.937 | 0.951 | 0.948 | 0.949 |
| Variance components | | | | | | | | | | | |
| $G_{00}$ est | 0.360 | 0.335 | 0.361 | 0.364 | 0.360 | 0.361 | 0.361 | 0.360 | 0.360 | 0.364 | 0.363 |
| $G_{00}$ rbias (%) | | 0.071 | 0.002 | 0.010 | <0.001 | 0.002 | 0.003 | 0.001 | <0.001 | 0.010 | 0.009 |
| $G_{01}$ est | -0.012 | -0.011 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 | -0.012 |
| $G_{01}$ rbias (%) | | 0.044 | <0.001 | 0.028 | 0.012 | 0.002 | 0.022 | 0.022 | 0.005 | 0.028 | 0.024 |
| $G_{11}$ est | 0.004 | 0.003 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 |
| $G_{11}$ rbias (%) | | 0.170 | 0.003 | 0.013 | 0.008 | 0.003 | 0.013 | 0.010 | <0.001 | 0.013 | 0.011 |
| Residual error, $\hat{\sigma}_\varepsilon^2$ | 0.436 | 0.414 | 0.436 | 0.434 | 0.436 | 0.437 | 0.436 | 0.435 | 0.436 | 0.434 | 0.434 |
| $\hat{\sigma}_\varepsilon$ rbias (%) | | 0.049 | <0.001 | 0.003 | 0.001 | 0.003 | <0.001 | 0.002 | <0.001 | 0.003 | 0.003 |

Table 2. Simulation results for the analysis of longitudinal data using simulation scenario (ii), i.e., scenario with data missing in both covariate and outcome.

| Regression parameters | True value | Available data | JM-MVN | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-standard | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|---|---|---|
| cbmi, $\hat{\beta}_1$ | -0.200 | -0.183 | -0.198 | -0.198 | -0.197 | -0.185 | -0.195 | -0.197 | -0.198 | -0.202 | -0.202 |
| $\hat{\beta}_1$ rbias (%) | | 0.083 | 0.011 | 0.010 | 0.015 | 0.075 | 0.025 | -0.013 | 0.011 | 0.011 | 0.011 |
| Model SE | | 0.008 | 0.010 | 0.010 | 0.010 | 0.010 | 0.009 | 0.009 | 0.010 | 0.009 | 0.009 |
| Empirical SE | | 0.008 | 0.010 | 0.010 | 0.010 | 0.009 | 0.009 | 0.009 | 0.010 | 0.009 | 0.009 |
| 95% Coverage | | 0.510 | 0.940 | 0.936 | 0.923 | 0.697 | 0.913 | 0.936 | 0.943 | 0.939 | 0.934 |
| cage, $\hat{\beta}_2$ | -0.100 | -0.088 | -0.100 | -0.100 | -0.101 | -0.102 | -0.101 | -0.101 | -0.100 | -0.100 | -0.100 |
| $\hat{\beta}_2$ rbias (%) | | 0.119 | 0.003 | 0.002 | 0.005 | 0.017 | 0.009 | 0.007 | 0.003 | 0.002 | 0.002 |
| Model SE | | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| Empirical SE | | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 95% Coverage | | 0.000 | 0.948 | 0.950 | 0.933 | 0.867 | 0.915 | 0.931 | 0.943 | 0.959 | 0.966 |
| Variance components | | | | | | | | | | | |
| $G_{00}$ est | 0.360 | 0.333 | 0.359 | 0.375 | 0.345 | 0.378 | 0.360 | 0.359 | 0.360 | 0.378 | 0.377 |
| $G_{00}$ rbias (%) | | 0.075 | 0.003 | 0.041 | 0.043 | 0.051 | 0.001 | 0.003 | 0.001 | 0.051 | 0.046 |
| $G_{01}$ est | -0.012 | -0.011 | -0.012 | -0.014 | -0.002 | -0.007 | -0.012 | -0.012 | -0.012 | -0.015 | -0.013 |
| $G_{01}$ rbias (%) | | 0.050 | 0.012 | 0.207 | 0.818 | 0.405 | 0.012 | 0.006 | 0.001 | 0.227 | 0.099 |
| $G_{11}$ est | 0.004 | 0.003 | 0.004 | 0.004 | 0.002 | 0.003 | 0.004 | 0.004 | 0.004 | 0.004 | 0.004 |
| $G_{11}$ rbias (%) | | 0.176 | 0.003 | 0.101 | 0.494 | 0.405 | 0.011 | 0.008 | 0.036 | 0.107 | 0.031 |
| Residual error, $\hat{\sigma}_\varepsilon^2$ | 0.436 | 0.415 | 0.438 | 0.436 | 0.463 | 0.487 | 0.436 | 0.435 | 0.438 | 0.432 | 0.435 |
| $\hat{\sigma}_\varepsilon$ rbias (%) | | 0.047 | 0.005 | 0.006 | 0.063 | 0.117 | 0.001 | 0.001 | 0.005 | 0.008 | 0.001 |

All of the MI approaches except JM-FJ provided similar estimation of regression parameters and coverage probabilities in both scenarios. Slight under-coverage of the regression parameters was obtained from JM-FJ and JM-SMC, which assume homoscedastic variances. However, contrasting results were obtained when these two methods were implemented with heteroscedastic covariance matrices: JM-FJ-het was more biased and had lower coverage than JM-FJ, whereas JM-SMC-het performed better than its homoscedastic counterpart.

All of the MI methods except JM-FJ-het provided unbiased estimates of the variance components in the longitudinal setting when only the covariate contained missing values (scenario (i)). However, greater differences were observed across the imputation methods for the estimation of the variance components when both covariate and outcome contained missing values (scenario (ii)). In scenario (ii) (Table 2), large biases in the variance associated with random slopes were obtained for the JM-MLMM, JM-FJ, JM-FJ-het and FCS-LMM approaches. The large bias in the variance associated with random slopes with the JM-MLMM and FCS-LMM approaches is likely an artefact of dividing by a population value that is close to zero; hence we are hesitant to emphasize this finding.
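To see why a population value close to zero inflates the relative bias, suppose (as the tabulated values suggest, although the formula is not stated in this section) that relative bias is the absolute difference between the average estimate and the true value, divided by the true value. Using the JM-MLMM entries of Table 2, the implied absolute biases are:

$$
\text{rbias} = \frac{|\bar{\hat{\theta}} - \theta|}{|\theta|}, \qquad
G_{11}:\ 0.101 \times 0.004 \approx 0.0004, \qquad
G_{00}:\ 0.041 \times 0.360 \approx 0.015 .
$$

Under this reading, the random slope variance has the larger relative bias (0.101 versus 0.041) even though its absolute bias is far smaller than that of the random intercept variance.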

Following a reviewer’s suggestion, we also evaluated the performance of these methods in smaller samples, with 1000 individuals followed for 5 consecutive periods, under both scenarios (i) and (ii). The results, displayed in Tables B1 and B2 in Appendix B, are qualitatively similar to those with the larger sample size under scenario (i). Under scenario (ii), however, large biases in the random slope variance estimates were obtained for all the methods except the JM-MVN, FCS-standard and JM-SMC approaches. Among the MI methods, JM-MVN and FCS-standard provided the least biased estimates for the fixed effects and variance components. The estimated coverage probabilities for both of these methods were very close to the nominal value of 0.95 in both scenarios. Among the LMM-based imputation approaches, FCS-LMM-het and JM-SMC-het provided the best performance for estimating regression parameters and variance components.

The simulation results for clustered data with missing values under scenarios (iii) and (iv) across the 1000 simulated datasets are displayed in Tables 3 and 4, respectively. Similarly to the longitudinal setting, in the clustered data setting, the available data analysis resulted in biased estimation of the regression coefficients and variance components, and inadequate coverage probabilities. All of the MI approaches provided similar estimates of the regression coefficients and their estimated coverage probabilities were very close to the nominal value of 0.95 for both scenarios. Slight under-coverage of the confidence interval for the regression coefficient of cbmi was obtained from JM-FJ, JM-FJ-het and JM-MLMM especially under scenario (iv). Somewhat greater differences were observed across the imputation methods for the estimation of variance components both in scenario (iii) and (iv). In scenario (iii), i.e. when only the covariate with the random effect contained missing values, JM-FJ and JM-FJ-het resulted in biased estimation of the random slope variances. On the other hand, in scenario (iv) JM-FJ, JM-FJ-het and JM-MLMM all produced biased estimation of the random slope variances. In both scenarios, JM-SMC, FCS-LMM and FCS-LMM-het produced unbiased estimates of the regression and variance component parameters.

Table 3. Simulation results for the analysis of clustered data using simulation scenario (iii), i.e., scenario with data missing in the covariate only.

| Regression parameters | True value | Available data | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|---|
| SEIFA, $\hat{\gamma}_1$ | 0.250 | 0.192 | 0.247 | 0.250 | 0.192 | 0.250 | 0.250 | 0.247 | 0.249 |
| $\hat{\gamma}_1$ rbias (%) | | 0.232 | 0.011 | <0.001 | 0.232 | 0.001 | 0.001 | 0.011 | 0.004 |
| Model SE | | 0.028 | 0.029 | 0.029 | 0.028 | 0.029 | 0.029 | 0.029 | 0.029 |
| Empirical SE | | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 |
| 95% Coverage | | 0.444 | 0.954 | 0.950 | 0.450 | 0.947 | 0.949 | 0.954 | 0.950 |
| cbmi, $\hat{\gamma}_2$ | -0.200 | -0.177 | -0.196 | -0.205 | -0.177 | -0.200 | -0.200 | -0.196 | -0.198 |
| $\hat{\gamma}_2$ rbias (%) | | 0.115 | 0.019 | 0.027 | 0.001 | 0.001 | 0.001 | 0.020 | 0.009 |
| Model SE | | 0.020 | 0.023 | 0.021 | 0.021 | 0.023 | 0.023 | 0.023 | 0.023 |
| Empirical SE | | 0.021 | 0.023 | 0.024 | 0.021 | 0.024 | 0.024 | 0.023 | 0.023 |
| 95% Coverage | | 0.793 | 0.958 | 0.913 | 0.796 | 0.946 | 0.947 | 0.953 | 0.945 |
| Variance components | | | | | | | | | |
| $G_{00}$ est | 0.40 | 0.346 | 0.396 | 0.404 | 0.348 | 0.399 | 0.399 | 0.396 | 0.397 |
| $G_{00}$ rbias (%) | | 0.136 | 0.011 | 0.010 | 0.115 | 0.002 | 0.002 | 0.011 | 0.007 |
| $G_{11}$ est | 0.20 | 0.175 | 0.197 | 0.135 | 0.176 | 0.192 | 0.192 | 0.197 | 0.185 |
| $G_{11}$ rbias (%) | | 0.123 | 0.014 | 0.325 | 0.232 | 0.039 | 0.038 | 0.016 | 0.074 |
| Residual error, $\hat{\sigma}_e$ | 0.90 | 0.847 | 0.900 | 0.911 | 0.847 | 0.901 | 0.901 | 0.900 | 0.903 |
| $\hat{\sigma}_e$ rbias (%) | | 0.058 | <0.001 | 0.012 | 0.058 | 0.001 | 0.001 | 0.001 | 0.003 |

Table 4. Simulation results for the analysis of clustered data using simulation scenario (iv), i.e., scenario with data missing in the covariate associated with the random slope and in the outcome.

| Regression parameters | True value | Available data | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|---|
| SEIFA, $\hat{\gamma}_1$ | 0.250 | 0.191 | 0.249 | 0.250 | 0.192 | 0.250 | 0.250 | 0.247 | 0.249 |
| $\hat{\gamma}_1$ rbias (%) | | 0.235 | 0.002 | <0.001 | 0.232 | <0.001 | 0.001 | 0.011 | 0.004 |
| Model SE | | 0.031 | 0.029 | 0.029 | 0.028 | 0.029 | 0.029 | 0.029 | 0.029 |
| Empirical SE | | 0.031 | 0.029 | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 | 0.028 |
| 95% Coverage | | 0.508 | 0.952 | 0.950 | 0.450 | 0.948 | 0.948 | 0.953 | 0.950 |
| cbmi, $\hat{\gamma}_2$ | -0.200 | -0.179 | -0.205 | -0.205 | -0.177 | -0.200 | -0.200 | -0.196 | -0.198 |
| $\hat{\gamma}_2$ rbias (%) | | 0.105 | 0.024 | 0.026 | 0.115 | 0.001 | 0.001 | 0.021 | 0.009 |
| Model SE | | 0.024 | 0.021 | 0.021 | 0.021 | 0.023 | 0.023 | 0.023 | 0.023 |
| Empirical SE | | 0.024 | 0.024 | 0.024 | 0.021 | 0.024 | 0.024 | 0.023 | 0.023 |
| 95% Coverage | | 0.853 | 0.912 | 0.912 | 0.796 | 0.945 | 0.943 | 0.951 | 0.943 |
| Variance components | | | | | | | | | |
| $G_{00}$ est | 0.40 | 0.342 | 0.404 | 0.404 | 0.348 | 0.399 | 0.399 | 0.396 | 0.397 |
| $G_{00}$ rbias (%) | | 0.144 | 0.010 | 0.010 | 0.130 | 0.002 | 0.002 | 0.010 | 0.007 |
| $G_{11}$ est | 0.20 | 0.171 | 0.134 | 0.135 | 0.176 | 0.192 | 0.192 | 0.197 | 0.1185 |
| $G_{11}$ rbias (%) | | 0.143 | 0.330 | 0.325 | 0.118 | 0.039 | 0.039 | 0.015 | 0.073 |
| Residual error, $\hat{\sigma}_e$ | 0.900 | 0.848 | 0.911 | 0.911 | 0.847 | 0.901 | 0.901 | 0.900 | 0.903 |
| $\hat{\sigma}_e$ rbias (%) | | 0.057 | 0.012 | 0.012 | 0.058 | 0.001 | 0.001 | <0.001 | 0.003 |

As with the longitudinal data, we also evaluated the performance of these methods under scenarios (iii) and (iv) using a relatively small number of clusters (n=100) with smaller cluster sizes (each cluster contained between 2 and 10 observations, chosen at random). The results are displayed in Tables B3 and B4, respectively. All of the MI methods except JM-FJ in both scenarios and JM-MLMM in scenario (iv) provided slight under-coverage of the confidence interval. Large biases in the estimation of the random slope parameters were observed for both FCS-LMM and FCS-LMM-het, especially in scenario (iv), leaving JM-SMC as the best method when the number of clusters in the sample is small.

6.2. Application to the LSAC data

The results for the analysis models (a) and (b) applied to the LSAC data are given in Tables 5 and 6, respectively. For analysis model (b), the available data analysis provides slightly lower estimates of the regression coefficients than the MI methods. For analysis model (a), however, the estimated regression coefficients from the available data analysis are very similar to those from the MI approaches. These results are in line with those seen in the simulation study. However, JM-FJ in analysis model (a) and both JM-FJ and JM-MLMM in analysis model (b) produced lower estimates of the variance components than the other MI approaches.

Table 5. LSAC data analysis for analysis model (1) i.e., longitudinal data scenario.

| Regression parameters | Available data | JM-MVN | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-standard | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|---|---|
| cbmi, $\hat{\beta}_1$ | -0.043 | -0.043 | -0.042 | -0.047 | -0.042 | -0.042 | -0.045 | -0.044 | -0.043 | -0.047 |
| se($\hat{\beta}_1$) | 0.008 | 0.008 | 0.008 | 0.007 | 0.008 | 0.008 | 0.008 | 0.008 | 0.007 | 0.010 |
| 95% CI | (-0.058, -0.029) | (-0.058, -0.027) | (-0.058, -0.026) | (-0.061, -0.033) | (-0.058, -0.029) | (-0.058, -0.026) | (-0.061, -0.029) | (-0.059, -0.029) | (-0.057, -0.029) | (-0.067, -0.027) |
| cage, $\hat{\beta}_2$, years | -0.020 | -0.022 | -0.021 | -0.019 | -0.020 | -0.021 | -0.021 | -0.022 | -0.021 | -0.018 |
| se($\hat{\beta}_2$) | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 | 0.002 |
| 95% CI | (-0.024, -0.016) | (-0.026, -0.018) | (-0.025, -0.017) | (-0.023, -0.015) | (-0.024, -0.016) | (-0.025, -0.017) | (-0.025, -0.017) | (-0.026, -0.018) | (-0.025, -0.017) | (-0.022, -0.014) |
| Variance components | | | | | | | | | | |
| $\hat{G}_{00}$ | 0.431 | 0.518 | 0.473 | 0.419 | 0.432 | 0.443 | 0.450 | 0.520 | 0.472 | 0.441 |
| $\hat{G}_{01}$ | -0.016 | -0.021 | -0.020 | -0.007 | -0.016 | -0.016 | -0.017 | -0.022 | -0.020 | -0.016 |
| $\hat{G}_{11}$ | 0.004 | 0.004 | 0.005 | 0.002 | 0.004 | 0.004 | 0.004 | 0.005 | 0.005 | 0.004 |
| $\hat{\sigma}_\varepsilon^2$ | 0.434 | 0.436 | 0.434 | 0.460 | 0.434 | 0.438 | 0.437 | 0.436 | 0.434 | 0.465 |

Table 6. LSAC data analysis for analysis model (2) of the cluster-correlated data.

| Regression parameters | Available data | JM-MLMM | JM-FJ | JM-FJ-het | JM-SMC | JM-SMC-het | FCS-LMM | FCS-LMM-het |
|---|---|---|---|---|---|---|---|---|
| SEIFA, $\hat{\gamma}_1$ | 0.154 | 0.164 | 0.166 | 0.154 | 0.166 | 0.166 | 0.166 | 0.159 |
| se($\hat{\gamma}_1$) | 0.019 | 0.019 | 0.019 | 0.019 | 0.019 | 0.019 | 0.019 | 0.019 |
| 95% CI | (0.117, 0.191) | (0.127, 0.201) | (0.129, 0.203) | (0.117, 0.191) | (0.129, 0.203) | (0.129, 0.203) | (0.129, 0.203) | (0.122, 0.196) |
| cbmi, $\hat{\gamma}_2$ | -0.071 | -0.075 | -0.073 | -0.071 | -0.078 | -0.079 | -0.075 | -0.073 |
| se($\hat{\gamma}_2$) | 0.017 | 0.018 | 0.017 | 0.017 | 0.018 | 0.018 | 0.019 | 0.018 |
| 95% CI | (-0.104, -0.038) | (-0.110, -0.040) | (-0.106, -0.040) | (-0.104, -0.038) | (-0.113, -0.043) | (-0.114, -0.044) | (-0.112, -0.038) | (-0.108, -0.038) |
| Variance components | | | | | | | | |
| $\hat{G}_{00}$ | 0.029 | 0.035 | 0.034 | 0.029 | 0.031 | 0.031 | 0.035 | 0.030 |
| $\hat{G}_{11}$ | 0.007 | 0.006 | 0.005 | 0.007 | 0.008 | 0.008 | 0.014 | 0.008 |
| $\hat{\sigma}_e$ | 0.931 | 0.923 | 0.929 | 0.867 | 0.923 | 0.923 | 0.916 | 0.937 |

7. Discussion

LMMs are frequently used in the analysis of longitudinal and clustered data in order to account for within-individual and within-cluster correlation, respectively. Although several MI methods are available in current software for imputing missing values in longitudinal and cluster-correlated data, little guidance is available on which is the most appropriate method. Comparisons of MI methods for the analysis of correlated data using a LMM with random intercepts and slopes in the context of compatibility, a pre-requisite of valid MI, are very limited in the literature. In the current paper, we compared seven different MI methods (JM-MVN, JM-MLMM, JM-FJ, JM-SMC, FCS-Standard, FCS-LMM, FCS-LMM-het) for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We derived expressions for each of the MI methods to examine their compatibility with a LMM that includes both random intercepts and slopes. We showed via simulation that compatible imputation and analysis models resulted in consistent estimation of both regression parameters and variance components. We have summarized our results in Table 7.

Table 7. Summary of multiple imputation models features.

Features JM-MVN JM-MLMM JM-FJ JM-SMC FCS-standard FCS-LMM FCS-LMM-het
Unbalanced data No Yes Yes Yes No Yes Yes
Imputation of discrete variables continuous continuous latent normal latent normal categorical continuous continuous
Analysis of discrete variables require rounding require rounding Yes Yes Yes require rounding require rounding
Longitudinal data
   Consistent estimation of random intercepts Yes Yes Yes Yes Yes Yes Yes
   Consistent estimation of random slopes
      with incomplete covariates Yes Yes Yes Yes Yes Yes Yes
      with incomplete covariates and outcome Yes No No Yes Yes No No
Clustered data
   Consistent estimation of random intercepts Yes Yes Yes Yes Yes
   Consistent estimation of random slopes
      with incomplete covariates No No Yes No Yes
      with incomplete covariates and outcome No No Yes No No

The results from our theoretical exploration revealed that the relative performance of the MI methods may be expected to vary according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. Specifically, we showed that the JM-MVN and FCS-Standard approaches are compatible with the LMM in the context of longitudinal data if measurements occur at the same time-points for all individuals. We also showed that JM-MLMM is compatible with the analysis model of a LMM with random intercepts and slopes, whereas JM-FJ is incompatible with it. Both the FCS-LMM and FCS-LMM-het methods are compatible with a LMM with random intercepts and slopes. Our comparison also revealed that the newly available substantive model compatible joint modeling (JM-SMC) approach holds great promise for the imputation of longitudinal data. Our simulation study supported our theoretical results. We observed, however, that JM-FJ-het provided sub-optimal performance, especially in the case of longitudinal data, which might be due to the small number of individuals per cluster (observations per individual) in our example, as shown in Audigier et al. (2018). We also observed that JM-SMC-het provided better estimates of the regression parameters and better coverage than JM-SMC, apparently because subject-specific associations were better estimated under the heteroscedastic covariance matrices.

Our results regarding clustered data were similar to those for longitudinal data with a couple of exceptions. We found that JM-MLMM was compatible with an analysis model of a LMM with random intercepts and slopes if only the covariate contained missing data. JM-MLMM, however, became incompatible with a LMM with random intercepts and slopes if both the outcome and the random-slope covariate contained missing data. Along with others (Enders et al., 2016), we noted that fixed effect imputation methods are computationally expensive, particularly with a large number of clusters, and hence may not be very useful in practice. In general, our findings are consistent with those of Enders et al. (2016), who showed that JM-FJ and JM-MLMM produce biased estimation of the variance components while the FCS-LMM-het approach provided consistent estimates in the context of clustered data. Some of our theoretical results extend the results obtained by Resche-Rigon and White (2016), who considered a LMM with only a random intercept.

It is always difficult to draw general conclusions from a single simulation study, but we believe this study provided a good setting for a comparison of MI methods with both theoretical and empirical evaluation. The simulations were designed to represent real world data with a moderate amount of missingness under MAR. Undoubtedly, future simulation studies and further exploration of methods will be useful in a number of ways. In this study, we restricted our attention to data that are MAR. Often longitudinal data do not satisfy the MAR assumption. In general, the MAR assumption cannot be tested, and the various sensitivity analysis methods (e.g., selection models, pattern-mixture models and NARFCS) have been proposed only in very specialized contexts; no such method is currently available for the setting in which both longitudinal covariates and outcomes are missing. For example, pattern-mixture models are available for incomplete longitudinal outcomes but not for settings in which both longitudinal covariates and outcomes are missing. The NARFCS approach, arising from the pattern-mixture paradigm, has been developed recently to handle multivariable missingness in cross-sectional settings (Tompsett et al., 2018; Moreno-Betancur et al., 2017; Leacy, 2016) and could in principle be applied to longitudinal data in the wide format or with the cluster-indicator method in the scenarios we explored. However, these methods have not yet been extended to the context of multilevel imputation models for general multivariable missingness in longitudinal unbalanced data or clustered data (linear mixed models). Hence there were no obvious methods to add to our evaluation. In order to simplify the theoretical calculations and avoid mis-specification of the imputation models we restricted our comparisons to models and methods that assume normality.
Although there has been some discussion of compatibility for non-normal data in the context of general location models, such models are only available for single level data (Seaman and Hughes, 2018). The study of compatibility of multilevel models that include non-normal data is beyond the scope of the present paper, as Gaussian random effects are usually assumed in the proposed models and in the available software implementations. However, our results for MI involving normal variables might also hold for non-normal data. We have previously shown that both JM-MVN and FCS-standard performed well in the context of imputing binary variables (Huque et al., 2018). Quartagno and Carpenter (2019) recently showed that the JM-SMC and FCS-standard methods performed equally well in the context of imputing non-normal data.

In summary, we found that if measurements occur at the same time-points for all individuals in longitudinal studies, the JM-MVN and FCS-Standard approaches may be the best approaches for imputing longitudinal data. We also found that LMM-based approaches (JM-MLMM, JM-SMC-het, FCS-LMM-het, FCS-LMM) can be used if measurements do not occur at the same time-points or if the imputation model struggles to converge due to many repeated measurements. In the clustered data setting, we recommend using the LMM-based approaches JM-SMC, FCS-LMM or FCS-LMM-het to handle missing data, as they performed well in the estimation of regression parameters and variance components. Although multilevel imputation models are slightly more complex than standard cross-sectional imputation methods and require specialized software, our comparison revealed that they are a reasonable choice for imputing missing covariate and outcome data where the analysis of interest is a linear mixed effects model with random intercepts and slopes.

Supplementary Material

Appendix

Acknowledgements

This work was supported by funding from the National Health and Medical Research Council: Project grant ID# 1102468, Career Development Fellowship ID#1127984 (KJL), Senior Research Fellowship ID# 1104975 (JAS), and Centre of Research Excellence grant ID#1035261, for the Victorian Centre for Biostatistics (ViCBiostat). Research at the Murdoch Childrens Research Institute is supported by the Victorian Government’s Operational Infrastructure Support Program.

References

  1. Asparouhov Tihomir, Muthe’n Bengt. Multiple imputation with mplus. 2010. (MPlus Web Notes). [Google Scholar]
  2. Audigier Vincent, White Ian R, Jolani Shahab, Debray Thomas, Quartagno Matteo, Carpenter James, van Buuren Stef, Resche-Rigon Matthieu. Multiple imputation for multilevel data with continuous and binary variables. Statistical Science. 2018;33(2):160–183. [Google Scholar]
  3. Enders Craig K, Mistler Stephen A, Keller Brian T. Multilevel multiple imputation: A review and evaluation of joint modeling and chained equations imputation. Psychological methods. 2016;21(2):222–240. doi: 10.1037/met0000063. [DOI] [PubMed] [Google Scholar]
  4. Enders Craig K, Keller Brian T, Levy Roy. A fully conditional specification approach to multilevel imputation of categorical and continuous variables. Psychological methods. 2017 doi: 10.1037/met0000148. [DOI] [PubMed] [Google Scholar]
  5. Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian Data Analysis. Taylor Francis; 2013. (Chapman Hall/CRC Texts in Statistical Science). ISBN 9781439840955 URL https://books.google.com.au/books?id=ZXL6AQAAQBAJ. [Google Scholar]
  6. Goldstein Harvey, Carpenter James, Kenward Michael G, Levin Kate A. Multilevel models with multivariate mixed response types. Statistical Modelling. 2009;9(3):173–197. [Google Scholar]
  7. Goldstein Harvey, Carpenter James R, Browne William J. Fitting multilevel multivariate models with missing data in responses and covariates that may include interactions and non-linear terms. Journal of the Royal Statistical Society: Series A (Statistics in Society) 2014;177(2):553–564. [Google Scholar]
  8. Grund Simon, Ludtke Oliver, Robitzsch Alexander. Multiple imputation of missing covariate values in multilevel models with random slopes: A cautionary note. Behavior Research Methods. 2016;48(2):640–649. doi: 10.3758/s13428-015-0590-3. [DOI] [PubMed] [Google Scholar]
  9. Hughes Rachael A, White Ian R, Seaman Shaun R, Carpenter James R, Tilling Kate, Sterne Jonathan AC. Joint modelling rationale for chained equations. BMC medical research methodology. 2014;14(1):28. doi: 10.1186/1471-2288-14-28. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Huque Hamidul, Carlin John B, Simpson Julie A, Lee Katherine J. A comparison of multiple imputation methods for missing data in longitudinal studies. BMC Medical Research Methodology. 2018 doi: 10.1186/s12874-018-0615-6. xx(xx):xx–xx. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Laird Nan M. Missing data in longitudinal studies. Statistics in Medicine. 1988;7(1-2):305–315. doi: 10.1002/sim.4780070131.
  12. Leacy FP. Multiple imputation under missing not at random assumptions via fully conditional specification [dissertation]. PhD thesis. 2016.
  13. Liu Jingchen, Gelman Andrew, Hill Jennifer, Su Yu-Sung, Kropko Jonathan. On the stationary distribution of iterative imputations. Biometrika. 2014;101(1):155–173.
  14. LSAC. Technical report
  15. Meng Xiao-Li. Multiple-imputation inferences with uncongenial sources of input. Statistical Science. 1994:538–558.
  16. Moreno-Betancur M, Leacy FP, Tompsett D, White I. mice: The NARFCS procedure for sensitivity analyses. 2017. URL https://github.com/moreno-betancur/NARFCS.
  17. Quartagno M, Carpenter JR. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with missing covariates. Statistics in Medicine. 2016;35(17):2938–2954. doi: 10.1002/sim.6837.
  18. Quartagno Matteo, Carpenter James R. Multiple imputation for discrete data: Evaluation of the joint latent normal model. Biometrical Journal. 2019 doi: 10.1002/bimj.201800222.
  19. Raghunathan Trivellore E, Lepkowski James M, Van Hoewyk John, Solenberger Peter. A multivariate technique for multiply imputing missing values using a sequence of regression models. Survey Methodology. 2001;27(1):85–96.
  20. Reiter Jerome P, Raghunathan Trivellore E, Kinney Satkartar K. The importance of modeling the sampling design in multiple imputation for missing data. Survey Methodology. 2006;32(2):143.
  21. Resche-Rigon Matthieu, White Ian R. Multiple imputation by chained equations for systematically and sporadically missing multilevel data. Statistical Methods in Medical Research. 2016:0962280216666564. doi: 10.1177/0962280216666564.
  22. Rezvan Panteha Hayati, Lee Katherine J, Simpson Julie A. The rise of multiple imputation: a review of the reporting and implementation of the method in medical research. BMC Medical Research Methodology. 2015;15(1):30. doi: 10.1186/s12874-015-0022-1.
  23. Rubin DB. Multiple imputation for nonresponse in surveys. Wiley; New York: 1987.
  24. Schafer Joseph L. Analysis of incomplete multivariate data. CRC Press; 1997.
  25. Schafer Joseph L, Yucel Recai M. Computational strategies for multivariate linear mixed-effects models with missing values. Journal of Computational and Graphical Statistics. 2002;11(2):437–457.
  26. Seaman Shaun R, Hughes Rachael A. Relative efficiency of joint-model and full-conditional-specification multiple imputation when conditional models are compatible: The general location model. Statistical Methods in Medical Research. 2018;27(6):1603–1614. doi: 10.1177/0962280216665872.
  27. Sterne JAC, White IR, Carlin JB, Spratt M, Royston P, Kenward MG. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338 doi: 10.1136/bmj.b2393.
  28. Tompsett Daniel Mark, Leacy Finbarr, Moreno-Betancur Margarita, Heron Jon, White Ian R. On the use of the not-at-random fully conditional specification (NARFCS) procedure in practice. Statistics in Medicine. 2018;37(15):2338–2353. doi: 10.1002/sim.7643.
  29. Van Buuren Stef, Groothuis-Oudshoorn Karin. mice: Multivariate imputation by chained equations in R. Journal of Statistical Software. 2011;45(3):1–67.
  30. Van Buuren Stef, Brand Jaap PL, Groothuis-Oudshoorn CGM, Rubin Donald B. Fully conditional specification in multivariate imputation. Journal of Statistical Computation and Simulation. 2006;76(12):1049–1064.
  31. Van Buuren Stef, et al. Handbook of advanced multilevel analysis. 2011. Multiple imputation of multilevel data; pp. 173–196.
  32. Yucel Recai M. Random covariances and mixed-effects models for imputing multivariate multilevel continuous data. Statistical Modelling. 2011;11(4):351–370. doi: 10.1177/1471082X1001100404.
  33. Zhao Enxu, Yucel Recai M. American Statistical Association Proceedings of the Survey Research Methods Section. 2009. Performance of sequential imputation method in multilevel applications; pp. 2800–2810.
  34. Zhu Jian, Raghunathan Trivellore E. Convergence properties of a sequential regression multiple imputation algorithm. Journal of the American Statistical Association. 2015;110(511):1112–1124.
