Multiple imputation by chained equations for systematically and sporadically missing multilevel data

Matthieu Resche-Rigon; Ian R White

doi:10.1177/0962280216666564

Author manuscript; available in PMC: 2018 Jun 1.

Published in final edited form as: Stat Methods Med Res. 2016 Sep 19;27(6):1634–1649. doi: 10.1177/0962280216666564

PMCID: PMC5496677 EMSID: EMS71930 PMID: 27647809

Abstract

In multilevel settings such as individual participant data meta-analysis, a variable is ‘systematically missing’ if it is wholly missing in some clusters and ‘sporadically missing’ if it is partly missing in some clusters. Previously proposed methods to impute incomplete multilevel data handle either systematically or sporadically missing data, but frequently both patterns are observed. We describe a new multiple imputation by chained equations (MICE) algorithm for multilevel data with arbitrary patterns of systematically and sporadically missing variables. The algorithm is described for multilevel normal data but can easily be extended for other variable types. We first propose two methods for imputing a single incomplete variable: an extension of an existing method and a new two-stage method which conveniently allows for heteroscedastic data. We then discuss the difficulties of imputing missing values in several variables in multilevel data using MICE, and show that even the simplest joint multilevel model implies conditional models which involve cluster means and heteroscedasticity. However, a simulation study finds that the proposed methods can be successfully combined in a multilevel MICE procedure, even when cluster means are not included in the imputation models.

Keywords: Missing data, multilevel model, multiple imputation, chained equations, fully conditional specification, individual patient data meta-analysis

1. Introduction

Multiple imputation (MI) is a popular approach for handling the pervasive problem of missing data in biostatistics.¹ MI uses the distribution of the observed data to estimate a set of plausible values for the missing data, usually under a missing at random (MAR) assumption.² MI is a Bayesian procedure: given a joint prior distribution for the observed data and a specific data model, we obtain a posterior distribution of the missing values given the observed data. Missing data in a single variable are straightforward to impute, for example using an appropriate generalised linear model,³,⁴ but missing data in several variables present a more complex problem except on the rare occasions when data are monotone missing. There are two standard approaches. Firstly, a joint model may be assumed for the data. Most commonly, a multivariate normal model is assumed, as implemented in standard software.⁵–⁷ This is surprisingly effective despite being mis-specified for categorical data.⁸ Secondly, multiple imputation by chained equations (MICE), also known as multiple imputation by fully conditional specification, specifies a suitable conditional imputation model for each incomplete variable and iteratively imputes until convergence.³,⁹ The theoretical properties of MICE are not well understood: except in simple cases, conditional imputation models do not correspond to any joint model.¹⁰,¹¹ Despite this it performs well in practice,¹²,¹³ especially when the conditional imputation models are well accommodated to the substantive model.¹⁴ MICE is also implemented in standard software.⁶,¹⁵,¹⁶

Many datasets have a multilevel or clustered structure. Imputation that ignores this clustering leads to underestimation of the magnitude of clustering and hence underestimated standard errors, even if the analysis does allow for clustering.¹⁷ However, imputation that allows for the clustering through fixed effects of cluster overestimates the magnitude of the clustering and hence overestimates standard errors.¹⁸,¹⁹

There is therefore a need for advanced imputation methods that allow for the clustering. Schafer and Yucel proposed a Gibbs sampler to generate MIs of continuous missing variables from a joint multivariate linear mixed model:²⁰,²¹ their method is implemented in the PAN package.²² REALCOM software²³ and the R package jomo²⁴ extended this approach allowing missing data at any level and handling categorical data through latent normal variables. van Buuren proposed extending MICE to perform multilevel imputation by a Bayesian procedure.²⁵,²⁶ More recently, extensions were proposed to impute level 2 variables.²⁷

In this paper we focus on imputation of two-level clustered data, such as are found in individual participant data (IPD) meta-analysis²⁸ (where the cluster is the study) or in multi-centre studies (where the cluster is the centre). A feature of such data is that some variables may be ‘systematically missing’ – that is, missing for all individuals in one or more clusters.²⁹ This arises in particular in IPD meta-analysis when a potential confounding variable is not collected in one or more studies. Variables may also be ‘sporadically missing’ – that is, missing for some but not all individuals in one or more clusters.²⁹ Unfortunately, the multiple imputation methods and packages presented above are currently not able to handle systematically missing data except for jomo. Therefore, methods to handle systematically missing data have been proposed for continuous data²⁹ and recently extended to binary and discrete data.³⁰ These approaches handle the clustering by using generalized linear mixed models to impute data; they differ only in how they model the uncertainty around the between-cluster covariance parameters. However, they only deal with systematically missing data. The aim of this paper is to propose methods that simultaneously handle systematically and sporadically missing data.

The paper is organised as follows. Section 2 describes imputation of a single incomplete variable and makes two innovations. Firstly, it extends the method previously described for systematically missing data²⁹ to also handle sporadically missing data. Secondly, it proposes a new two-stage algorithm which conveniently allows for heteroscedasticity between clusters in the linear mixed model, as recommended by van Buuren in the sporadically missing case.²⁵ The two-stage algorithm also offers an easy extension to handle categorical data alongside normal data, although we here evaluate it only for normal data. In section 3, we explore the sorts of conditional models needed in multilevel MICE and explain why it is important to account for heteroscedasticity. In section 4, we evaluate the performance of the multilevel imputation algorithms in multilevel MICE. We illustrate the methods in an IPD meta-analysis in cardiovascular disease in section 5. Finally, section 6 provides a discussion.

2. Imputation of univariate missing data

Let y_i be a n_i × 1 vector of observed outcomes on units j ∈ {1, … , n_i} within cluster i ∈ {1, … , N}, y_i = (y_i1, y_i2, … , y_{in_i})^T. We consider the following linear mixed model

y_{i} = X_{i} β + Z_{i} b_{i} + e_{i}, b_{i} \sim N (0, Ψ_{b}), e_{i} \sim N (0, Σ_{i})

(1)

where X_i is an n_i × p matrix of variables associated with the outcome y_i via β, a p × 1 vector of fixed effects, and Z_i (typically equal to or contained within X_i) is an n_i × q matrix of variables associated with y_i via b_i, a q × 1 vector of random effects. We model b_i ~ N(0, Ψ_b), where Ψ_b is the q × q variance-covariance matrix of the random effects, and the n_i × 1 error vector as e_i ~ N(0, Σ_i) with $Σ_{i} = σ_{i}^{2} I(n_{i})$. Under this model, the posterior distribution of b_i given y_i is b_i | y_i ~ N(m(y_i), v(y_i)), with $m(y_{i}) = Ψ_{b} Z_{i}^{T} {(Z_{i} Ψ_{b} Z_{i}^{T} + Σ_{i})}^{-1} (y_{i} - X_{i} β)$ and $v(y_{i}) = Ψ_{b} - Ψ_{b} Z_{i}^{T} {(Z_{i} Ψ_{b} Z_{i}^{T} + Σ_{i})}^{-1} Z_{i} Ψ_{b}$.²⁰,³¹,³²
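As a worked special case (a sketch added for illustration, not part of the original derivation), consider a random-intercept model in which Z_i is a column of ones, Ψ_b is a scalar written here as τ² (a symbol introduced only for this example) and Σ_i = σ_i² I(n_i). Writing r̄_i for the mean of the residuals y_ij − x_ij β in cluster i, the matrix expressions for m(y_i) and v(y_i) should reduce to:

```latex
m(y_i) = \frac{n_i \tau^2}{n_i \tau^2 + \sigma_i^2}\,\bar{r}_i ,
\qquad
v(y_i) = \frac{\tau^2 \sigma_i^2}{n_i \tau^2 + \sigma_i^2}
```

The cluster's mean residual is therefore shrunk towards zero, more strongly when n_i is small. With no observed values in the cluster (n_i = 0) the expressions give m(y_i) = 0 and v(y_i) = τ², i.e. the prior distribution of b_i.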

Suppose $y = {(y_{1}^{T}, y_{2}^{T}, \dots, y_{N}^{T})}^{T}$ contains missing data y^mis. y_i is systematically missing if it is missing for all individuals in cluster i; it is sporadically missing if it is only partly incomplete.

Our goal is to generate independent draws under a MAR assumption from a posterior predictive distribution for the missing data p( y^mis| y^obs) = ∫ p(y^mis| y^obs, θ) p(θ| y^obs)dθ where θ = (β, Ψ_b, {σ_i}) is the vector of parameters in the linear mixed model (1) and p(θ| y^obs) is the observed data posterior density of θ.¹ In practice, this may be achieved (with implicit vague priors) by:

(1) fitting the model p(y|θ) to the units with observed y, yielding an estimate (typically an MLE) θ̂ with an estimated variance-covariance matrix S_θ;
(2) drawing a value of θ, θ*, from its posterior, approximated as N(θ̂, S_θ);
(3) drawing values of y^mis from p(y|θ*).
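The three steps can be illustrated with a deliberately small Python sketch. It assumes a single-level model y ~ N(μ, σ²) rather than the linear mixed model (1), and it draws only μ* while holding σ at its estimate; both are simplifications made for this example. The function name `impute_normal_mean` is hypothetical.

```python
import math
import random
import statistics


def impute_normal_mean(y, rng):
    """Toy fit / draw / impute cycle for y ~ N(mu, sigma^2); None marks missing."""
    obs = [v for v in y if v is not None]
    # Step 1: fit the model to the units with observed y.
    mu_hat = statistics.fmean(obs)
    sigma_hat = statistics.stdev(obs)
    se_mu = sigma_hat / math.sqrt(len(obs))
    # Step 2: draw mu* from the normal approximation N(mu_hat, se_mu^2).
    # Simplification: sigma is kept at its estimate instead of being drawn.
    mu_star = rng.gauss(mu_hat, se_mu)
    # Step 3: draw each missing value from p(y | theta*).
    return [v if v is not None else rng.gauss(mu_star, sigma_hat) for v in y]


rng = random.Random(1)
y = [1.2, None, 0.7, 1.9, None, 1.1]
completed = impute_normal_mean(y, rng)
```

Repeating the call with fresh random draws would give several completed data sets, which is what makes the imputation "multiple".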

In the next two subsections, we propose two approaches to perform step 1. The first approach uses one-stage estimation of the model parameters.³³ The second approach, two-stage estimation, first estimates parameters within each cluster separately using the same model. Then, it applies multivariate meta-analysis methods using the summary statistics obtained at the first stage.³⁴ One-stage and two-stage approaches give similar results in individual patient data meta-analyses,³⁵ except when there are relatively few studies or the studies are small (especially with binary outcomes).³⁶ The two approaches both fit versions of model (1), but extra assumptions are convenient in each approach: in one-stage estimation we consider a homoscedastic model, i.e. a constant σ_i for all clusters, while in two-stage estimation we assume Z_i = X_i.

2.1. One-stage approach

This approach fits model (1) directly. To ease computation, we assume the residuals are homoscedastic between clusters: $σ_{i}^{2} = σ^{2} \ \forall i \in \{1, \dots, N\}$. Steps 1–3 are implemented as follows:

(1) Estimate the model parameters, θ̂ = (β̂, Ψ̂_b, σ̂) and their variance-covariance matrix S_θ by maximizing the restricted log-likelihood of model (1) fitted to the available data. We parameterise Ψ_b as proposed by Pinheiro and Bates,³⁷ using $Ψ_{b}^{trans} = ((\log Ψ_{b,rr} : r = 1 \text{ to } q), (\log \frac{1 + ρ_{rs}}{1 - ρ_{rs}} : r, s = 1 \text{ to } q, r < s))$ where $ρ_{rs} = Ψ_{b,rs} / \sqrt{Ψ_{b,rr} Ψ_{b,ss}}$. We assume that the estimates of the transformed parameters $Ψ_{b}^{trans}$ follow a multivariate normal distribution.
(2) Draw $θ^{*} = (β^{*}, Ψ_{b}^{trans *}, σ^{*})$ using N(θ̂, Ŝ_θ) and hence obtain $Ψ_{b}^{*}$ and $Σ_{i}^{*} = σ^{*2} I(n_{i})$. If $Ψ_{b}^{*}$ is not positive semi-definite, make it positive semi-definite by setting negative eigenvalues to zero.³⁸
(3) Compute m*(y_i) and v*(y_i), the posterior mean and variance of b_i, using $Ψ_{b}^{*}$, β* and $Σ_{i}^{*}$:
  (a) For clusters with systematically missing data, m*(y_i) = 0 and $v^{*}(y_{i}) = Ψ_{b}^{*}$.
  (b) For clusters with sporadically missing data, $m^{*}(y_{i}) = Ψ_{b}^{*} Z_{i}^{T} {(Z_{i} Ψ_{b}^{*} Z_{i}^{T} + Σ_{i}^{*})}^{-1} (y_{i} - X_{i} β^{*})$ and $v^{*}(y_{i}) = Ψ_{b}^{*} - Ψ_{b}^{*} Z_{i}^{T} {(Z_{i} Ψ_{b}^{*} Z_{i}^{T} + Σ_{i}^{*})}^{-1} Z_{i} Ψ_{b}^{*}$.
  Then draw each $b_{i}^{*}$ using N(m*(y_i), v*(y_i)) and for observation j ∈ {1, … , n_i} with missing y_ij, draw $e_{i j}^{*} \sim N(0, σ^{*2})$ and set $y_{i j}^{*} = x_{i j} β^{*} + z_{i j} b_{i}^{*} + e_{i j}^{*}$.
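Step 3 can be sketched in Python for the simplest case. The sketch assumes a random intercept only (Z_i a column of ones, Ψ_b* a scalar called `tau2` here) and a single covariate, so the matrix formulas for m*(y_i) and v*(y_i) collapse to scalar shrinkage expressions. The function names and the argument layout are hypothetical, and the parameter draws of step 2 are taken as given inputs.

```python
import math
import random


def posterior_intercept(resid_obs, tau2, sigma2):
    """Posterior mean and variance of a random intercept b_i.

    resid_obs holds the observed residuals y_ij - x_ij * beta* in cluster i;
    pass an empty list for a cluster with systematically missing data.
    """
    n = len(resid_obs)
    if n == 0:
        return 0.0, tau2                      # case (a): m* = 0, v* = Psi_b*
    shrink = n * tau2 / (n * tau2 + sigma2)   # case (b), scalar form
    mean = shrink * sum(resid_obs) / n
    var = tau2 * sigma2 / (n * tau2 + sigma2)
    return mean, var


def impute_cluster(x, y, beta, tau2, sigma2, rng):
    """Impute missing y (None) in one cluster for y = b0 + b1*x + b_i + e."""
    b0, b1 = beta
    resid = [yj - (b0 + b1 * xj) for xj, yj in zip(x, y) if yj is not None]
    m, v = posterior_intercept(resid, tau2, sigma2)
    b_star = rng.gauss(m, math.sqrt(v))       # draw b_i* ~ N(m*, v*)
    sd = math.sqrt(sigma2)
    return [yj if yj is not None else b0 + b1 * xj + b_star + rng.gauss(0.0, sd)
            for xj, yj in zip(x, y)]


rng = random.Random(0)
m, v = posterior_intercept([0.4, 0.6], tau2=0.25, sigma2=0.25)
filled = impute_cluster([0.0, 1.0, 2.0], [0.1, None, 1.2], (0.0, 0.5), 0.25, 0.25, rng)
```

With two observed residuals averaging 0.5 and equal variance components, the posterior mean is shrunk to 1/3 and the posterior variance falls from 0.25 to 1/12.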

This approach, recently extended by Jolani in a generalised linear model framework,³⁰ is close to van Buuren’s method for sporadically missing data which used a Gibbs sampler to generate θ* and $b_{i}^{*} .$ ²⁵ However, van Buuren argued that the imputation quality could be improved by allowing the within cluster variance to vary over the clusters, and modelled $σ_{i} \sim \frac{σ_{0} χ_{1 / ϕ}^{2}}{ϕ},$ where σ₀ and ϕ are hyperparameters for the location of prior belief about residual variance and a measure of variability respectively.

Implementation

We fit the mixed model using the lme() function from the nlme package in R 3.1.1.³⁹,⁴⁰ This does not report the covariance between β* and $(Ψ_{b}^{*}, σ^{*})$, so in practice we draw β* and $(Ψ_{b}^{*}, σ^{*})$ independently.

2.2. Two-stage approach

This approach uses procedures developed for multivariate meta-analysis.⁴¹,⁴² We first fit a separate model within each cluster and then combine the results using multivariate random-effects meta-analysis models. This approach naturally allows heteroscedastic level-1 variances. The model fitted in cluster i is

y_{i} = X_{i} β_{i} + e_{i}

(2)

with $e_{i} \sim N (0, σ_{i}^{2} I (n_{i})) .$ This is model (1) with Z_i = X_i (i.e. every variable in the model has a random effect) and β_i = β + b_i; we propose how to allow Z_i ≠ X_i in the discussion. We approximate β̂_i ~ N(β_i, S_βi) and log σ̂_i ~ N(log σ_i, S_σi). The between-studies model is β_i ~ N(β, Ψ_β), where Ψ_β is the same as Ψ_b in section 2.1, and log σ_i ~ N(log σ, Ψ_σ), where σ is an average within-cluster standard deviation. Then the marginal models are β̂_i ~ N(β, S_βi + Ψ_β) and log σ̂_i ~ N(log σ, S_σi + Ψ_σ).

The main steps of the two-stage imputation procedure are:

(1)
  (a) For each cluster i without systematically missing data, fit model (2) using $y_{i}^{obs}$ (complete case analysis). This gives β̂_i with variance S_βi, and log σ̂_i whose variance S_σi is obtained by noting that ${\hat{σ}}_{i}^{-2}$ follows a χ² distribution and using the delta method.
  (b) Perform a multivariate meta-analysis using the β̂_i and S_βi from each cluster without systematically missing data. Using a REML estimator, obtain β̂, Ψ̂_β and their variance-covariance matrices S_β and S_{Ψ_β}. Ψ_β is parameterised via its Cholesky decomposition $Ψ_{β}^{chol}$.
  (c) Perform a univariate meta-analysis using the log σ̂_i and S_σi from each cluster without systematically missing data. Using a REML estimator, obtain log σ̂, Ψ̂_σ and their variance-covariance matrices S_σ and S_{Ψ_σ}.
(2) Draw β* using N(β̂, S_β) and $Ψ_{β}^{chol *}$ using $N({\hat{Ψ}}_{β}^{chol}, S_{Ψ_{β}^{chol}})$. Hence set $Ψ_{β}^{*} = Ψ_{β}^{chol *} {(Ψ_{β}^{chol *})}^{T}$: the use of the Cholesky decomposition ensures a positive semi-definite $Ψ_{β}^{*}$. Similarly draw log σ* and $Ψ_{σ}^{*}$ using N(log σ̂, S_σ) and N(Ψ̂_σ, S_{Ψ_σ}).
(3) Draw the missing y_ij:
  (a) For clusters with systematically missing data, draw $β_{i}^{*}$ using $N(β^{*}, Ψ_{β}^{*})$ and draw $σ_{i}^{*}$ using $\log σ_{i}^{*} \sim N(\log σ^{*}, Ψ_{σ}^{*})$.
  (b) For clusters with sporadically missing data, draw $β_{i}^{*}$ from its posterior given $Ψ_{β}^{*}$ and β*,
      $β_{i}^{*} \sim N({(Ψ_{β}^{*-1} + S_{βi}^{-1})}^{-1} (Ψ_{β}^{*-1} β^{*} + S_{βi}^{-1} {\hat{β}}_{i}), {(Ψ_{β}^{*-1} + S_{βi}^{-1})}^{-1})$
      and draw $\log σ_{i}^{*}$ from its posterior given $Ψ_{σ}^{*}$ and log σ*,
      $\log σ_{i}^{*} \sim N(\frac{\log σ^{*} / Ψ_{σ}^{*} + \log {\hat{σ}}_{i} / S_{σi}}{1 / Ψ_{σ}^{*} + 1 / S_{σi}}, \frac{1}{1 / Ψ_{σ}^{*} + 1 / S_{σi}})$.
  (c) For each missing y_ij, draw $e_{i j}^{*} \sim N(0, σ_{i}^{*2})$ and set $y_{i j}^{*} = x_{i j} β_{i}^{*} + e_{i j}^{*}$.
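The cluster-level draws in step 3 can be sketched in Python for a scalar β_i, in which case the matrix posterior reduces to the same precision-weighted form as the posterior for log σ_i. This is an illustrative simplification: the function names are hypothetical, all variances are assumed strictly positive, and `var_log_sigma_hat` uses one common delta-method approximation (variance of log σ̂ roughly 1/(2 × residual degrees of freedom)), which is an assumption rather than a formula given in the text.

```python
import math
import random


def var_log_sigma_hat(n_obs, n_coef):
    """Approximate variance of log(sigma_hat): 1 / (2 * residual df)."""
    return 1.0 / (2.0 * (n_obs - n_coef))


def normal_posterior(prior_mean, prior_var, estimate, est_var):
    """Precision-weighted posterior mean and variance for a scalar parameter."""
    precision = 1.0 / prior_var + 1.0 / est_var
    mean = (prior_mean / prior_var + estimate / est_var) / precision
    return mean, 1.0 / precision


def draw_cluster_parameters(beta_star, psi_beta, log_sigma_star, psi_sigma, rng,
                            beta_hat_i=None, s_beta_i=None,
                            log_sigma_hat_i=None, s_sigma_i=None):
    """Step 3(a) when no cluster estimates are supplied, step 3(b) otherwise."""
    if beta_hat_i is None:
        # Systematically missing: draw from the between-cluster distribution.
        b_mean, b_var = beta_star, psi_beta
        s_mean, s_var = log_sigma_star, psi_sigma
    else:
        # Sporadically missing: combine the cluster estimate with the prior.
        b_mean, b_var = normal_posterior(beta_star, psi_beta, beta_hat_i, s_beta_i)
        s_mean, s_var = normal_posterior(log_sigma_star, psi_sigma,
                                         log_sigma_hat_i, s_sigma_i)
    beta_i = rng.gauss(b_mean, math.sqrt(b_var))
    sigma_i = math.exp(rng.gauss(s_mean, math.sqrt(s_var)))
    return beta_i, sigma_i


rng = random.Random(0)
post = normal_posterior(0.0, 1.0, 2.0, 1.0)
beta_i, sigma_i = draw_cluster_parameters(0.5, 0.25, math.log(0.5), 0.01, rng)
```

When the prior and the cluster estimate are equally precise, the posterior mean is their average and the variance is halved, which is what `post` shows.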

The REML estimation proposed in step 1 can be computationally slow. We therefore propose the method of moments (MM) as a simple and computationally efficient alternative. In step 1(c), we use the univariate MM described in DerSimonian and Laird.⁴³ In step 1(b), we use the multivariate extension proposed by Jackson et al.³⁸ The MM does not allow estimation of S_{Ψ_β} and S_{Ψ_σ}, so we are forced to modify Step 2 by setting $Ψ_{β}^{*} = {\hat{Ψ}}_{β}$ and $Ψ_{σ}^{*} = {\hat{Ψ}}_{σ}$.
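For reference, the univariate method-of-moments estimator can be written in a few lines of Python. This is a generic sketch of the DerSimonian and Laird estimator, not the authors' code; in the setting of step 1(c) the inputs would be the log σ̂_i and their variances S_σi. The function name is hypothetical and at least two clusters with positive variances are assumed.

```python
def dersimonian_laird(estimates, variances):
    """Univariate method-of-moments random-effects meta-analysis.

    Returns (pooled estimate, variance of the pooled estimate, tau2),
    where tau2 is the between-cluster variance truncated at zero.
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                       # fixed-effect weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c)                     # moment estimator
    w_re = [1.0 / (v + tau2) for v in variances]           # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    return pooled, 1.0 / sum(w_re), tau2


pooled, pooled_var, tau2 = dersimonian_laird([0.0, 1.0, 2.0], [0.1, 0.1, 0.1])
```

The estimator is available in closed form, which is why it is fast, but it returns no variance for `tau2` itself.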

Implementation

Multivariate meta-analysis is performed using the mvmeta package in R 3.1.1.³⁴,⁴⁴,⁴⁵

3. Imputing two or more incomplete variables

3.1. MICE

It is usual for missing values to occur in several variables. MICE specifies the multivariate imputation model on a variable-by-variable basis by a set of conditional densities obtained by different regression models, one for each incomplete variable.⁴⁶ Thus, MICE allows the imputation model to be adapted to the type of each variable. Conditional distributions for the missing data given the other data for each incomplete variable are iteratively obtained, missing values being replaced by simulated draws. The algorithm may be initiated by a simple random sampling with replacement from the observed values. Once each variable has been imputed the complete cycle is repeated at least 5–10 times to stabilize the posterior distribution obtained for each variable.³ In practice, posterior distributions could be derived either from maximum likelihood estimators or using Gibbs samplers as proposed by van Buuren.²⁵ In the multilevel setting, the use of MICE was first proposed by van Buuren²⁶ and was applied for systematically missing data by Resche-Rigon et al.²⁹ We propose using the methods described in section 2 to provide the conditional imputations needed at each step of the MICE procedure.
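The cycle described above can be sketched as a Python skeleton for two incomplete variables. To stay short it uses a deliberately simplified single-level imputer: a least-squares line plus residual noise, with no draw of the regression parameters and no clustering. It therefore illustrates only the initialise-then-iterate structure of chained equations, not the multilevel methods of section 2. All function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y regressed on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    return intercept, slope, (rss / (n - 2)) ** 0.5


def mice_two_variables(x1, x2, n_iter=10, seed=0):
    """Chained-equations cycle for two incomplete variables (None = missing)."""
    rng = random.Random(seed)
    data = {"x1": list(x1), "x2": list(x2)}
    missing = {k: [i for i, v in enumerate(col) if v is None]
               for k, col in data.items()}
    # Initialise by sampling with replacement from the observed values.
    for k, col in data.items():
        observed = [v for v in col if v is not None]
        for i in missing[k]:
            col[i] = rng.choice(observed)
    for _ in range(n_iter):
        for target, predictor in (("x1", "x2"), ("x2", "x1")):
            miss = set(missing[target])
            rows = [i for i in range(len(data[target])) if i not in miss]
            # Fit on rows where the target is observed, using current imputations
            # of the predictor.
            a, b, sd = fit_line([data[predictor][i] for i in rows],
                                [data[target][i] for i in rows])
            for i in missing[target]:
                data[target][i] = a + b * data[predictor][i] + rng.gauss(0.0, sd)
    return data["x1"], data["x2"]


x1 = [0.1, None, 0.9, 1.4, 2.2, None]
x2 = [0.0, 0.6, None, 1.5, 2.0, 2.4]
imp1, imp2 = mice_two_variables(x1, x2)
```

In a multilevel MICE procedure the call to `fit_line` and the subsequent draw would be replaced by one of the univariate imputation methods, applied to each incomplete variable in turn.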

3.2. Conditional models in the multilevel setting

Ideally, one wants to use conditional imputation models compatible with the unknown overall data model. For complex models this is often unachievable. We use a simple joint model to explore mathematically the difficulties arising in specifying conditional models in the multilevel setting. We consider two incomplete variables x_1ij and x_2ij, with the joint model

\begin{pmatrix} x_{1ij} \\ x_{2ij} \end{pmatrix} = \begin{pmatrix} μ_{1} \\ μ_{2} \end{pmatrix} + \begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} + \begin{pmatrix} ε_{1ij} \\ ε_{2ij} \end{pmatrix}

(3)

where $\begin{pmatrix} u_{1i} \\ u_{2i} \end{pmatrix} \overset{iid}{\sim} N(0, Ψ_{u})$ and $\begin{pmatrix} ε_{1ij} \\ ε_{2ij} \end{pmatrix} \overset{iid}{\sim} N(0, Σ_{ε})$.

We prove in Appendix 1 that the conditional expectation of x_2ij depends on x_1ij and x̄_1i, the mean of x_1ij within cluster i, and on n_i, the size of cluster i (equation (8)). Similarly, the conditional variance of x_2ij depends on n_i (equation (9)). Appendix 2 generalises this result to the case of many incomplete variables by letting x_1ij and x_2ij be vectors. These results suggest (1) that we should incorporate the cluster means of the predictors in the imputation model and (2) that if cluster sizes vary then the level-1 variance should vary between clusters, i.e. that the imputation model should be heteroscedastic. Thus, a simple joint model implies complex conditional models: this is unlike the situation with independent observations, when a joint normal model implies simple conditional models.¹¹

Many settings, including the simulation setting of section 4, include random slopes: for example, we might combine model (3) with a random slopes model regressing y_ij on x_1ij and x_2ij. Under this joint model, the cluster-specific density of (x, y) is multivariate normal, but the marginal density is not. Deriving the conditional distributions in this model is intractable, and we anticipate that the problems identified above will be joined by others.

4. Simulation study

The methods proposed in section 2 make certain approximations, and section 3 suggests that combining them into a MICE procedure involves mis-specification of the conditional models. We therefore conduct a simulation study aiming (a) to explore and compare the performance of the methods in a MICE procedure, and (b) to explore whether their performance is improved by including cluster means or heteroscedasticity in the imputation model.

4.1. Simulating complete data

As in section 2, let y_i, x_1i, x_2i denote n_i × 1 vectors containing outcomes and covariates respectively on units j ∈ {1, … , n_i} within cluster i ∈ {1, … , N}. The covariates were drawn from a possibly heteroscedastic version of model (3): (x_1ij, x_2ij)^T ~ N(μ_i, Σ_εi) and μ_i ~ N(μ, Ψ_u). We allowed the marginal variances to vary across clusters using log Σ_εikk ~ N(log Σ_ε.k, Ψ_{Σ_ε.k}) for k = 1, 2, but we fixed the marginal correlations $Σ_{εi12} / \sqrt{Σ_{εi11} Σ_{εi22}} = ρ_{ε}$. Our parameter values for the base case were μ = (0, 0)^T, $Ψ_{u} = {0.5}^{2} (\begin{matrix} 1 & 0.5 \\ 0.5 & 1 \end{matrix})$, Σ_ε.1 = Σ_ε.2 = 0.5², ρ_ε = 0.5 and Ψ_{Σ_ε.1} = Ψ_{Σ_ε.2} = 0, giving homoscedasticity with Σ_εi = Ψ_u for all i and hence intra-cluster correlations of approximately 0.5 for both x₁ and x₂.

The outcome data y_ij were generated from a random intercepts and slopes model

y_{i j} = β_{0} + β_{1} x_{1 i j} + β_{2} x_{2 i j} + b_{0 i} + b_{1 i} x_{1 i j} + b_{2 i} x_{2 i j} + e_{i j}

(4)

with $e_{i j} \sim N (0, σ_{i}^{2}),$ log σ_i ~ N(log σ, Ψ_σ) and b_i = (b_0i, b_1i, b_2i) ~ N(0, Ψ_b). Our parameter values for the base case were β₀ = 0, β₁ = 0.5 and β₂ = 1; Ψ_b had all diagonal elements 0.5² and all correlations 0.3; σ = 0.5 and Ψ_σ = 0, so the outcome model was homoscedastic with intra-cluster correlation (conditional on x_1ij = x_2ij = 0) of 0.5.

Following Appendices 1 and 2, E(x_1ij|x_2ij, y_ij) and E(x_2ij|x_1ij, y_ij) may depend on ȳ_i and on x̄_2i and x̄_1i respectively. We estimate this dependency using a large simulated data set of 1000 clusters of size 100. Estimated coefficients of ȳ_i and x̄_2i in the model for x_1ij and of ȳ_i and x̄_1i in the model for x_2ij were −0.020, 0.049, 0.007 and 0.020 respectively, much smaller than the coefficients of y_i, x_1i, x_2i which were all greater than 0.12.

The total number of patients was fixed at 2000. The number of clusters was N = 20 yielding n_i = 100 patients per cluster. A total of 1000 independent datasets were generated for each setting.
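The base-case data-generating mechanism can be sketched in Python using only the standard library. The sketch relies on the fact that every covariance matrix in the base case has a common standard deviation and a common positive correlation, so correlated normals can be built from one shared factor; this construction, the function names and the row layout are choices made for this example, not the authors' simulation code.

```python
import math
import random


def correlated_normals(k, sd, corr, rng):
    """k zero-mean normals with common SD and common pairwise correlation.

    One-factor construction: valid only for 0 <= corr <= 1.
    """
    common = rng.gauss(0.0, 1.0)
    return [sd * (math.sqrt(corr) * common
                  + math.sqrt(1.0 - corr) * rng.gauss(0.0, 1.0))
            for _ in range(k)]


def simulate_base_case(n_clusters=20, cluster_size=100, seed=0):
    """Return rows (cluster, x1, x2, y) from models (3) and (4), base case."""
    rng = random.Random(seed)
    beta = (0.0, 0.5, 1.0)
    rows = []
    for i in range(n_clusters):
        mu_i = correlated_normals(2, 0.5, 0.5, rng)     # mu_i ~ N(0, Psi_u)
        b_i = correlated_normals(3, 0.5, 0.3, rng)      # b_i ~ N(0, Psi_b)
        for _ in range(cluster_size):
            eps = correlated_normals(2, 0.5, 0.5, rng)  # level-1 covariate noise
            x1, x2 = mu_i[0] + eps[0], mu_i[1] + eps[1]
            y = ((beta[0] + b_i[0]) + (beta[1] + b_i[1]) * x1
                 + (beta[2] + b_i[2]) * x2 + rng.gauss(0.0, 0.5))
            rows.append((i, x1, x2, y))
    return rows


rows = simulate_base_case()
```

Changing the standard deviations and correlations passed to `correlated_normals` would not be enough for the later scenarios, where the two covariates have different variances; a general Cholesky factor would be needed there.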

4.2. Simulating missing data

y_i was complete for all i. x_1i and x_2i were independently systematically missing with probability π_sys. For any cluster where x_1i or x_2i was not systematically missing, x_1ij and x_2ij were independently sporadically missing with probability π_spor. This is a missing completely at random mechanism. Our parameter values were π_sys = 0.2 and π_spor = 0.3 in all cases, leading to an overall probability of missing data equal to 0.44.
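The stated overall probability can be checked directly: a covariate value is observed only if its cluster escapes systematic missingness and the value itself escapes sporadic missingness, so with the parameter values above

```latex
P(\text{missing}) = 1 - (1 - \pi_{sys})(1 - \pi_{spor}) = 1 - 0.8 \times 0.7 = 0.44
```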

4.3. Imputing missing data

Under this data generating model, the conditional distributions [x_1i|x_2i, y_i] where $x_{1 i} = {(x_{1 i 1}^{T}, x_{1 i 2}^{T}, …, x_{1 i n_{i}}^{T})}^{T}$ etc. are not of the simple forms given in equations (1) and (2). Nevertheless, in order to explore the practical consequences of imputation model misspecification, missing data were imputed using a MICE algorithm, using as imputation model either homoscedastic equation (1) (for the one-stage approach) or heteroscedastic equation (2) (for the two-stage approach). The two-stage approach was implemented using either REML or the MM in steps 1(b) and 1(c). When imputing x₁, the outcome in equation (1) or (2) was x₁, and the covariates were x₂ and y, with both fixed and random effects. A similar model was used to impute x₂. To evaluate the impact of the cluster means within the imputation model we used two series of imputation models, either excluding or including the cluster means. Cluster means were included in the one-stage method by adding them as predictors in the imputation models, and in the two-stage method by using meta-regression in the second stage with cluster means as predictors. The incomplete data were imputed m = 5 times using 10 iterations of the chained equations algorithm.

4.4. Analysis

In every case, the analysis model was equation (4). The data were analysed before data deletion as a benchmark for the multiple imputation procedure, and after data deletion using complete cases. After imputation, the model was estimated in each imputed data set using the REML estimator of the lmer() function of the lme4 R-package⁴⁷ and the estimates were combined using Rubin’s rules.¹ The performance of each method was assessed by computing the empirical mean of the parameter estimates, root mean square estimated standard error (Model SE), empirical Monte Carlo standard error (Emp SE), the coverage of nominal 95% confidence intervals (95%CI), and the mean run time per simulated data set.
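The pooling step can be sketched in Python for a single scalar parameter. The function below implements the standard form of Rubin's rules (pooled estimate, within- and between-imputation variance, total variance); it omits the degrees-of-freedom calculation used for confidence intervals, and the function name and example numbers are hypothetical.

```python
import math
import statistics


def rubin_pool(estimates, variances):
    """Combine m point estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)          # pooled point estimate
    within = statistics.fmean(variances)         # mean within-imputation variance
    between = statistics.variance(estimates)     # between-imputation variance (m - 1)
    total = within + (1.0 + 1.0 / m) * between   # total variance
    return q_bar, total


# Five imputed-data estimates of one coefficient, each with SE 0.12.
est, tot = rubin_pool([0.48, 0.50, 0.52, 0.49, 0.51], [0.12 ** 2] * 5)
pooled_se = math.sqrt(tot)
```

The factor (1 + 1/m) inflates the between-imputation variance to account for using a finite number of imputations, here m = 5.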

4.5. Results

In the base case (Table 1), point estimates are slightly negatively biased for β₁ = 0.5 and unbiased for β₂ = 1 for two-stage methods. Empirical standard errors are lower for multiple imputation approaches than for complete case analysis, reflecting the better use made of the data. Model-based standard errors are below empirical standard errors, in particular for the one-stage approach. The underestimated standard errors appear to arise from underestimated random effect variances. Moderate under-coverage is observed, and is only partly explained by the underestimated standard errors: only two-stage methods achieve coverage within 5% of the nominal 95%. Mean run times are in favour of the two-stage MM especially compared to the two-stage REML. Including cluster means in imputation models has little impact on results (lower part of Table 1).

Table 1.

Base-case simulations. $Ψ_{u} = {0.5}^{2} (\begin{matrix} 1 & 0.5 \\ 0.5 & 1 \end{matrix}), Σ_{ε .1} = Σ_{ε .2} = {0.5}^{2}, Ψ_{Σ_{ε .1}} = Ψ_{Σ_{ε .2}} = 0.$

Method	β₁ Est mean	β₁ Model SE	β₁ Emp SE	β₁ Cover	$\sqrt{Ψ_{b 11}}$ Est mean	β₂ Est mean	β₂ Model SE	β₂ Emp SE	β₂ Cover	$\sqrt{Ψ_{b 22}}$ Est mean	Mean run time
True value	0.5				0.5	1				0.5
No missing data	0.501	0.113	0.115	0.926	0.489	1.002	0.114	0.117	0.922	0.494
Complete case	0.505	0.145	0.154	0.925	0.483	1.005	0.146	0.150	0.916	0.486
Imputation methods, excluding cluster mean from imputation model
One-stage	0.485	0.112	0.129	0.899	0.407	0.984	0.105	0.132	0.874	0.410	1 min 51 s
Two-stage	0.488	0.125	0.134	0.919	0.461	1.008	0.121	0.134	0.915	0.474	3 min 57 s
Two-stage MM	0.489	0.124	0.136	0.908	0.456	1.010	0.121	0.131	0.910	0.472	1 min 3 s
Imputation methods, including cluster mean in imputation model
One-stage	0.486	0.112	0.129	0.888	0.406	0.982	0.105	0.134	0.868	0.409	3 min 26 s
Two-stage	0.485	0.125	0.137	0.895	0.469	1.005	0.122	0.135	0.909	0.476	4 min 11 s
Two-stage MM	0.487	0.122	0.137	0.906	0.464	1.006	0.120	0.133	0.915	0.474	1 min 6 s

Est mean: mean estimate over imputed data sets; Model SE: standard error derived from Rubin’s rules (mean over imputed data sets); Emp SE: standard deviation of estimate over imputed data sets; Cover: coverage of nominal 95% confidence interval.

We next modified the data generating mechanism to explore performance when the coefficients of cluster means are clearly different from 0. To do this we set Σ_ε.1 = 0.5², Σ_ε.2 = 0.17² and $Ψ_{u} = (\begin{matrix} {0.17}^{2} & {0.5}^{2} \times 0.17 \\ {0.5}^{2} \times 0.17 & {0.5}^{2} \end{matrix}) .$ Using a large simulated data set, the estimated coefficients of ȳ_i and x̄_2i in the model for x_1ij and of ȳ_i and x̄_1i in the model for x_2ij were −0.16, −0.87, 0.22 and 0.95 respectively, larger than the corresponding coefficients of y_i, x_1i, x_2i which were 0.30, 0.60, 0.07 and 0.12, respectively.

The bias in β₁ and β₂ remains small, but seems smaller for two-stage approaches (Table 2). The MI methods underestimate Ψ_b11 (especially with one-stage approaches) and overestimate Ψ_b22 (especially with two-stage approaches), leading to mis-estimated model-based standard errors. In this setting, in which cluster means could have an impact on the efficiency of the imputation model, including cluster means has very little effect on the final results, with only a small reduction in model-based standard errors.

Table 2.

Simulations when the coefficients of cluster means are clearly different from 0. $Ψ_{u} = (\begin{matrix} {0.17}^{2} & {0.5}^{2} \times 0.17 \\ {0.5}^{2} \times 0.17 & {0.5}^{2} \end{matrix}), Σ_{ε .1} = {0.5}^{2}, Σ_{ε .2} = {0.17}^{2}, Ψ_{Σ_{ε .1}} = Ψ_{Σ_{ε .2}} = 0.$

Method	β₁ Est mean	β₁ Model SE	β₁ Emp SE	β₁ Cover	$\sqrt{Ψ_{b 11}}$ Est mean	β₂ Est mean	β₂ Model SE	β₂ Emp SE	β₂ Cover	$\sqrt{Ψ_{b 22}}$ Est mean	Mean run time
True value	0.5				0.5	1				0.5
No missing data	0.500	0.113	0.112	0.934	0.494	1.003	0.131	0.132	0.929	0.485
Complete case	0.500	0.146	0.148	0.926	0.486	1.002	0.188	0.191	0.934	0.484
Imputation methods, excluding cluster mean from imputation model
One-stage	0.475	0.110	0.125	0.897	0.416	1.036	0.183	0.174	0.928	0.521	2 min 54 s
Two-stage	0.486	0.120	0.129	0.919	0.465	1.014	0.204	0.174	0.960	0.631	3 min 52 s
Two-stage MM	0.491	0.121	0.130	0.927	0.463	1.011	0.215	0.176	0.963	0.667	1 min 3 s
Imputation methods, including cluster mean in imputation model
One-stage	0.474	0.108	0.125	0.893	0.412	1.025	0.179	0.172	0.942	0.514	5 min 19 s
Two-stage	0.481	0.119	0.129	0.915	0.465	1.016	0.196	0.177	0.951	0.627	4 min 12 s
Two-stage MM	0.489	0.118	0.129	0.916	0.465	0.995	0.204	0.181	0.959	0.655	1 min 5 s

Note: Abbreviations as in Table 1.

We next modified the data generating mechanism of Table 2 by introducing unequal cluster sizes, with 10 clusters of size 160 and 10 clusters of size 40. Results (Table 3) were very close to those obtained in Table 2, and two-stage methods seem to be less biased than the one-stage approach.

Table 3.

Simulations when the coefficients of cluster means are clearly different from 0 and cluster sizes are unequal (10 clusters of size 160, 10 clusters of size 40). $Ψ_{u} = (\begin{matrix} {0.17}^{2} & {0.5}^{2} \times 0.17 \\ {0.5}^{2} \times 0.17 & {0.5}^{2} \end{matrix}), Σ_{ε .1} = {0.5}^{2}, Σ_{ε .2} = {0.17}^{2}, Ψ_{Σ_{ε .1}} = Ψ_{Σ_{ε .2}} = 0.$

Method	β₁ Est mean	β₁ Model SE	β₁ Emp SE	β₁ Cover	$\sqrt{Ψ_{b 11}}$ Est mean	β₂ Est mean	β₂ Model SE	β₂ Emp SE	β₂ Cover	$\sqrt{Ψ_{b 22}}$ Est mean	Mean run time
True value	0.5				0.5	1				0.5
No missing data	0.495	0.115	0.115	0.946	0.492	0.997	0.135	0.139	0.950	0.485
Complete case	0.494	0.147	0.155	0.935	0.483	1.002	0.192	0.206	0.912	0.495
Imputation methods, excluding cluster mean from imputation model
One-stage	0.473	0.111	0.132	0.900	0.415	1.023	0.185	0.183	0.941	0.513	3 min 13 s
Two-stage	0.481	0.124	0.135	0.925	0.468	0.994	0.213	0.185	0.971	0.632	3 min 53 s
Two-stage MM	0.486	0.123	0.136	0.925	0.466	0.996	0.221	0.193	0.968	0.668	1 min 3 s
Imputation methods, including cluster mean in imputation model
One-stage	0.473	0.110	0.131	0.899	0.412	1.009	0.181	0.183	0.947	0.504	5 min 42 s
Two-stage	0.479	0.123	0.134	0.928	0.472	0.998	0.206	0.187	0.968	0.637	4 min 15 s
Two-stage MM	0.482	0.121	0.135	0.920	0.469	0.982	0.213	0.186	0.974	0.664	1 min 5 s

Note: Abbreviations as in Table 1.

Finally, Table 4 reports a simulation study using the same parameter values as in Table 2, but with heteroscedasticity on X₁ and X₂ introduced by setting Ψ_{Σ_ε.1} = 0.17², Ψ_{Σ_ε.2} = 0.06². Results were still less biased with two-stage methods than with the one-stage approach. Ψ_b22 was largely overestimated. Model-based standard errors obtained with the two-stage approaches were overestimated.

Table 4.

Simulations when the coefficients of cluster means are clearly different from 0 and with heteroscedasticity on X₁ and X₂. $Ψ_{u} = (\begin{matrix} {0.17}^{2} & {0.5}^{2} \times 0.17 \\ {0.5}^{2} \times 0.17 & {0.5}^{2} \end{matrix}), Σ_{ε .1} = {0.5}^{2}, Σ_{ε .2} = {0.17}^{2}, Ψ_{Σ_{ε .1}} = {0.17}^{2}, Ψ_{Σ_{ε .2}} = {0.06}^{2}.$

Method	β₁ Est mean	β₁ Model SE	β₁ Emp SE	β₁ Cover	$\sqrt{Ψ_{b 11}}$ Est mean	β₂ Est mean	β₂ Model SE	β₂ Emp SE	β₂ Cover	$\sqrt{Ψ_{b 22}}$ Est mean	Mean run time
True value	0.5				0.5	1				0.5
No missing data	0.500	0.113	0.115	0.943	0.490	1.000	0.134	0.136	0.946	0.488
Complete case	0.503	0.146	0.147	0.943	0.483	0.988	0.187	0.203	0.918	0.484
Imputation methods, excluding cluster mean from imputation model
One-stage	0.451	0.115	0.122	0.912	0.425	1.002	0.204	0.185	0.961	0.598	1 min 41 s
Two-stage	0.484	0.139	0.131	0.951	0.505	1.019	0.268	0.200	0.993	0.796	3 min 54 s
Two-stage MM	0.500	0.136	0.131	0.956	0.499	1.018	0.260	0.200	0.988	0.796	1 min 4 s
Imputation methods, including cluster mean in imputation model
One-stage	0.453	0.115	0.120	0.922	0.422	0.995	0.200	0.182	0.963	0.582	3 min 46 s
Two-stage	0.482	0.138	0.129	0.953	0.509	1.017	0.262	0.201	0.987	0.802	4 min 10 s
Two-stage MM	0.497	0.133	0.128	0.941	0.502	1.001	0.252	0.195	0.986	0.787	1 min 6 s


Note: Abbreviations as in Table 1.
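The overestimation of model-based standard errors can be read off Table 4 by dividing the model-based SE by the empirical SE. The short sketch below does this for β₂ using the three rows that exclude the cluster mean; the function and variable names are illustrative and are not part of the original analysis.

```python
# (model SE, empirical SE) for beta_2, copied from Table 4,
# imputation methods excluding the cluster mean.
table4_beta2 = {
    "One-stage": (0.204, 0.185),
    "Two-stage": (0.268, 0.200),
    "Two-stage MM": (0.260, 0.200),
}


def se_ratio(model_se, emp_se):
    """Return model SE / empirical SE; values above 1 indicate overestimation."""
    return model_se / emp_se


# Ratio per method, rounded to two decimals for display.
ratios = {method: round(se_ratio(*ses), 2) for method, ses in table4_beta2.items()}

for method, ratio in ratios.items():
    print(f"{method}: model SE / empirical SE = {ratio}")
```

A ratio well above 1, as for the two-stage rows here, is consistent with coverage above the nominal 95% level reported in the same rows.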

5. Application to GREAT data

The GREAT Network explored risk factors associated with short-term mortality in acute heart failure (AHF). Their dataset consisted of 12 observational cohorts: 6 were based in Western Europe (2 in Italy and 1 in each of France, Finland, Switzerland, and Spain), 2 in Central Europe (Austria, Czech Republic), 2 in the United States, 1 in Asia (Japan), and 1 in Africa (Tunisia).⁴⁸ The principal investigators of each cohort study submitted the original data collected for each patient, including a list of patient characteristics and potential risk factors.⁴⁹

The biomarker of interest was brain natriuretic peptide (BNP), which is known to be elevated in acute heart failure. Because our aim is to develop and compare statistical methods, we take as outcome the left ventricular ejection fraction (LVEF) measured by echocardiography, a potential sign of AHF. Our aim is to construct a model predicting the value of LVEF using BNP and six other variables: gender, body mass index (BMI), age, systolic blood pressure (SBP), diastolic blood pressure (DBP) and heart rate (HR). The analysis model is a linear mixed model with all predictors included as linear terms. The intercept and the slope for BNP were both random at cohort level. A description of the variables considered is given in Table 5. BNP was log-transformed for analysis.
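For concreteness, the analysis model described above can be written as follows. The notation is an assumption introduced here for illustration (i indexes cohorts, j indexes patients, x₂, …, x₇ denote the six other predictors), as is the homoscedastic residual variance σ², which the text does not specify:

$$\mathrm{LVEF}_{ij} = (β_{0} + u_{0i}) + (β_{1} + u_{1i}) \log(\mathrm{BNP}_{ij}) + \sum_{k=2}^{7} β_{k} x_{kij} + ε_{ij}, \qquad (u_{0i}, u_{1i})^{\top} \sim N(0, Ψ_{u}), \quad ε_{ij} \sim N(0, σ^{2}).$$

Here u₀ᵢ and u₁ᵢ are the cohort-level random intercept and the random slope for log(BNP), and all other predictors enter as fixed linear terms.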

Table 5.

GREAT study: description of variables.

Variables	Values	Patients (N=4546)	Statistics
Gender	0	1767	38.87%
	1	2779	61.13%
Age		4540	73.28 [63.75;80]
BMI		3298	27.37 [24.34;31.06]
SBP		4185	136 [116;160]
DBP		4170	80 [69;90]
HR		4439	88 [73;105]
BNP		2034	973 [471.2;1860]
LVEF		4546	40 [27;55]


Note: Categorical variables are presented by counts and percentages and quantitative variables by their median and quartiles.

BMI: body mass index; SBP: systolic blood pressure; DBP: diastolic blood pressure; HR: heart rate; BNP: brain natriuretic peptide; LVEF: left ventricular ejection fraction.

This analysis was complicated by missing data (Table 6). BNP measurement is a relatively recent innovation, so BNP was systematically missing in 4 cohorts and sporadically missing in 5 others. BMI, SBP and DBP were also systematically missing in some cohorts and sporadically missing in others, while HR and age were only sporadically missing. Gender and LVEF were complete by design.

Table 6.

GREAT study: percentages of missing values by variables and cohort.

Cohort	1	2	3	4	6	7	8	9	10	11	12	13
Sample size	410	567	210	375	107	267	203	354	137	48	1790	78
Percentage missing
Gender	0	0	0	0	0	0	0	0	0	0	0	0
Age	0	0	0	0	0	0	0	1	0	0	0	0
BMI	36	19	43	2	1	100	0	44	1	100	21	60
SBP	1	2	1	1	0	100	1	16	0	0	1	0
DBP	2	3	2	1	0	100	2	16	0	0	1	0
HR	3	1	1	2	0	0	1	19	0	4	0	0
BNP	57	1	0	4	100	100	0	12	0	100	93	100
LVEF	0	0	0	0	0	0	0	0	0	0	0	0


Note: Abbreviations as in Table 5.
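The missingness pattern of each variable can be derived mechanically from Table 6: a variable is systematically missing in a cohort where 100% of its values are missing, and sporadically missing where the percentage lies strictly between 0 and 100. The sketch below applies this rule to the tabulated percentages; the names `cohorts`, `pct_missing` and `classify` are illustrative. Because the table reports rounded percentages, a cohort shown as 0 may still contain a few missing values, so the counts are only as precise as the table.

```python
# Cohort labels and percentages of missing values, copied from Table 6.
cohorts = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 13]
pct_missing = {
    "Gender": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "Age":    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0],
    "BMI":    [36, 19, 43, 2, 1, 100, 0, 44, 1, 100, 21, 60],
    "SBP":    [1, 2, 1, 1, 0, 100, 1, 16, 0, 0, 1, 0],
    "DBP":    [2, 3, 2, 1, 0, 100, 2, 16, 0, 0, 1, 0],
    "HR":     [3, 1, 1, 2, 0, 0, 1, 19, 0, 4, 0, 0],
    "BNP":    [57, 1, 0, 4, 100, 100, 0, 12, 0, 100, 93, 100],
    "LVEF":   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}


def classify(pcts):
    """Split cohorts into systematically (100%) and sporadically (0 < p < 100) missing."""
    systematic = [c for c, p in zip(cohorts, pcts) if p == 100]
    sporadic = [c for c, p in zip(cohorts, pcts) if 0 < p < 100]
    return systematic, sporadic


for variable, pcts in pct_missing.items():
    systematic, sporadic = classify(pcts)
    print(f"{variable}: systematic in {len(systematic)} cohorts, "
          f"sporadic in {len(sporadic)} cohorts")
```

For BNP this gives 4 cohorts with systematically missing values and 5 with sporadically missing values, in agreement with the description in the text.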

The results (Table 7) show that multiple imputation approaches have a substantial impact on point estimates of the coefficients. We observe a gain in precision for all MI methods for complete variables (gender, age) and for most incomplete variables (BMI, SBP, DBP, HR). Only log(BNP) has a bigger standard error after imputation, perhaps reflecting a difference between the smaller cohorts with systematically missing data and the more complete cohorts. Results from different imputation methods agree except for log(BNP), whose coefficient and standard errors vary substantially.

Table 7.

GREAT study: results from linear mixed model analysis.

	Gender		Age		BMI		SBP		DBP		HR		log(BNP)
	β	SE	β	SE	β	SE	β	SE	β	SE	β	SE	β	SE
Complete case	−6.869	0.803	0.200	0.030	0.184	0.064	0.167	0.016	−0.176	0.028	−0.060	0.016	−9.887	1.176
Imputation methods, excluding cohort mean in imputation model
One-stage	−5.669	0.475	0.165	0.019	0.233	0.043	0.124	0.012	−0.124	0.023	−0.048	0.009	−10.359	1.112
Two-stage	−5.795	0.474	0.154	0.019	0.226	0.048	0.129	0.012	−0.132	0.023	−0.050	0.009	−9.313	2.034
Two-stage MM	−5.824	0.463	0.156	0.020	0.243	0.043	0.127	0.010	−0.127	0.017	−0.051	0.009	−9.816	1.900

Imputation methods, including cohort mean in imputation model
One-stage	−5.662	0.503	0.170	0.020	0.230	0.049	0.123	0.011	−0.123	0.019	−0.049	0.009	−10.096	1.666
Two-stage	−5.737	0.455	0.158	0.020	0.243	0.045	0.125	0.010	−0.125	0.020	−0.050	0.009	−6.440	2.803
Two-stage MM	−5.950	0.484	0.152	0.020	0.250	0.047	0.137	0.013	−0.141	0.023	−0.049	0.009	−6.302	2.672


Note: Abbreviations as in Table 5.

6. Discussion

There is increasing need for ways to handle missing values in multilevel structures, notably with the development of meta-analysis of IPD.⁵⁰ IPD, whether from randomised clinical trials or observational studies, have the advantage of facilitating consistent definitions of outcomes, exposures and confounders and consistent analyses between studies. Nevertheless, the availability of confounders typically varies between studies. In this paper, we proposed a method to handle both systematically and sporadically missing covariates in a two-level structure using MICE. We explored the conditional imputation models needed in multilevel MICE and showed that cluster means and heteroscedasticity should be considered in the imputation model. We proposed a new two-stage approach in which we first fit the imputation model within each cluster and then combine the results using multivariate random effects models. Our simulation studies showed broadly good performance for the MICE methods, a very low impact of including the cluster means, and an advantage for heteroscedastic models. Small biases and slight under-coverage were observed in the presence of only systematically missing data.²⁹ These biases disappeared when only sporadically missing covariates were considered, suggesting that they are linked to the presence of systematically missing covariates (Appendix 3). Moreover, the two-stage approach using the MM was more computationally efficient.
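To illustrate why the method of moments (MM) is cheap to compute, the sketch below pools cluster-specific estimates of a single coefficient with the classical univariate DerSimonian and Laird estimator.⁴³ It is a deliberate simplification: the two-stage approach combines whole coefficient vectors with a multivariate random effects model,³⁸ whereas this example treats one coefficient at a time. The function name and the input values are hypothetical.

```python
def dersimonian_laird(estimates, variances):
    """Univariate method-of-moments random-effects pooling.

    estimates: cluster-specific estimates of one coefficient (first stage).
    variances: their within-cluster sampling variances.
    Returns (pooled estimate, variance of pooled estimate, between-cluster variance).
    """
    k = len(estimates)
    w = [1.0 / v for v in variances]                  # fixed-effect weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q statistic measures heterogeneity between clusters.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c)                # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]    # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    return pooled, 1.0 / sum(w_star), tau2


# Hypothetical first-stage results from four clusters.
pooled, pooled_var, tau2 = dersimonian_laird(
    estimates=[0.42, 0.55, 0.48, 0.61],
    variances=[0.010, 0.020, 0.015, 0.012],
)
print(pooled, pooled_var, tau2)
```

The estimator is available in closed form, so no iterative likelihood maximisation is needed at each cycle of the chained equations, which is the source of the shorter run times reported for the MM variant.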

Our work extended a previous algorithm that we proposed to handle systematically missing data.²⁹ In recent work, Jolani et al. also extended our one-stage approach to handle systematically missing binary data.³⁰ The approaches differ in the way uncertainty is introduced around the estimated between-cluster covariance parameters, whose sampling distribution is skewed: we used a general log-transformation for the variances and Fisher’s transformation for correlations, as proposed by Pinheiro and Bates³⁷ and applied in some current software packages for fitting non-linear mixed models (e.g. nlme in R), whereas Jolani et al. used a Bayesian framework with a Wishart distribution for the inverse between-cluster covariance.³⁰
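The role of the transformations can be sketched as follows: a draw is made from a normal distribution on the transformed scale and then back-transformed, so that a drawn variance is always positive and a drawn correlation always lies in (−1, 1). This is an illustration under simplifying assumptions, not the implemented algorithm: the standard errors on the transformed scales are hypothetical inputs, and each parameter is drawn independently, ignoring the joint covariance of the estimates.

```python
import math
import random


def draw_variance(var_hat, se_log_var, rng):
    """Draw a variance: normal draw on the log scale, back-transformed with exp."""
    return math.exp(rng.gauss(math.log(var_hat), se_log_var))


def draw_correlation(rho_hat, se_z, rng):
    """Draw a correlation: normal draw on Fisher's z scale, back-transformed with tanh."""
    z_hat = math.atanh(rho_hat)  # Fisher's transformation
    return math.tanh(rng.gauss(z_hat, se_z))


rng = random.Random(2016)  # fixed seed so the illustration is reproducible

# Hypothetical estimates and standard errors on the transformed scales.
variance_draws = [draw_variance(0.25, 0.40, rng) for _ in range(5)]
correlation_draws = [draw_correlation(0.50, 0.30, rng) for _ in range(5)]
print(variance_draws)
print(correlation_draws)
```

Drawing on the untransformed scale with a symmetric normal approximation could instead yield negative variances or correlations outside the admissible range, which is why the skewness of the sampling distribution matters here.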

To our knowledge, no other package can handle both systematically and sporadically missing data except the recent jomo R-package.²⁴,⁵¹ Other methods developed to impute one type of missing data could be modified to handle the other type. In particular, the 2l.norm routine¹⁵ for sporadically missing data in the MICE R-package could be adapted using steps 3(a) and 3(b) of the one-stage approach, although using an MCMC algorithm at each step of the chained equations algorithm seems to be computationally laborious.³⁰

Another alternative is Gibbs sampling. Schafer proposed a Gibbs sampler to generate multiple imputations of a single continuous incomplete variable from a joint multivariate linear mixed model;²⁰ this approach was implemented in the PAN package²² and also in the MICE package.¹⁵ The REALCOM package²³ and the recent jomo R-package²⁴ perform multilevel joint modelling multiple imputation, handling binary and categorical variables through latent normal variables. The jomo R-package can also use a heteroscedastic framework and should in that sense be close to our two-stage approach.

The multilevel MICE approach has several problems, especially the heteroscedasticity of the conditional imputation models demonstrated in section 3.2. Our simulation studies showed that with heteroscedastic models our two-stage methods (which naturally allow heteroscedasticity) outperform the one-stage method. It would be possible to introduce heteroscedasticity in the one-stage model, thus avoiding any issues due to small studies or a small number of studies,³⁶ but in our current view this is computationally laborious and unrealistic within a MICE procedure. To our knowledge only the 2l.norm routine in the MICE package allows for heteroscedasticity.²⁵ More recently, Yucel described a heteroscedastic imputation model for imputing multivariate multilevel continuous data.²¹ Similarly, Jolani et al. argued for a cluster-specific error variance which they related to cluster-level characteristics.³⁰

Another potential problem is compatibility between the full conditional distribution of each incomplete variable (represented by the specification of variable-by-variable imputation models) and the global joint distribution of the multivariate missing data. Ideally, the conditional distribution should be derived from a joint probability distribution.²⁶,⁵² Recent studies have identified compatibility in simple cases, notably for the linear model without any random effect,¹⁰ but the MICE approach is usually validated only by simulation studies, and the results appear to be robust even with incompatible models.²⁶,⁵³ In this work, we studied simulations with only two missing covariates, but extension of our results to more missing variables should be valid provided that the conditional imputation models are well specified. Moreover, having a valid joint distribution may be “less important than incorporating information from other variables and unique features of the dataset”.⁵⁴ This could partially explain the good results observed for our method even without including the cluster means in the imputation model. Another possible explanation is that x̄_1i and x̄_2i are normally distributed and can therefore be absorbed into the random intercept term.

The number of random effects that we can consider is a further problem, especially when most studies have systematically missing data. We previously showed that random effects on the intercept and on the factor of interest could be sufficient to produce good results.²⁹ A large number of random effects can make the computational time prohibitive. Nevertheless, we prefer to put a random effect on all the variables, and the two-stage method with the simple and computationally efficient MM is a good alternative despite the fact that we are forced to ignore uncertainty in the variance components $Ψ_{β}^{*}$ and σ*. The computational advantage becomes much greater as the number of random effects increases. Of course, our methods could benefit from parsimonious modelling of the variance-covariance matrix of random effects in the imputation model (e.g. modelling independent random effects). One could also constrain some components of the random effect covariance matrix to zero, based on empirical covariance matrix estimates.²³ Such constraints could be directly implemented in the two-stage methods by forcing some between-study variances to be zero in the second step. In practice, this corresponds to considering a smaller set of variables in Z_i than in X_i. The two-stage approach also allows level-2 variables to be included through meta-regression in the second stage.

An important extension of our imputation model is to impute categorical variables. In the past, this has been done in a MICE framework using random-effects logistic imputation models,³⁰ and in a joint model framework using latent continuous normally distributed variables.²⁴ Our two-stage approach would be particularly simple to modify for any generalised linear or other regression model, and this is the subject of future work.

In conclusion, MICE seems to be a valuable approach to handle both systematically and sporadically missing data. We proposed three methods, but only the two-stage methods easily take account of the potential heteroscedasticity in the imputation model, and they outperform the one-stage approach in some settings. The two-stage methods perform quite similarly, but the computational time needed to fit a REML estimator is a disadvantage. Therefore, the two-stage method with the MM approach appears to be a solid and efficient alternative.

Appendix

NIHMS71930-supplement-Appendix.pdf (85.9 KB, PDF)

Acknowledgements

We thank the Global Research on Acute Conditions Team (GREAT, http://www.greatnetwork.org) for allowing the use of their data.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: MR-R was supported by Agence Nationale de Sécurité du Médicament et des Produits de Santé. IRW was supported by the Medical Research Council [Unit Programme number U105260558].

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Rubin DB. Multiple imputation for nonresponse in surveys. Wiley Series in Probability and Statistics. New York: Wiley; 1987.
2. Little R, Rubin D. Statistical analysis with missing data. 2nd ed. Hoboken, NJ: Wiley; 2002.
3. White I, Royston P, Wood A. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–399. doi: 10.1002/sim.4067.
4. Carpenter J, Kenward M. Multiple imputation and its application. New York: John Wiley & Sons; 2012.
5. Novo A, Schafer J. Package NORM, R package version 1.0-9.5. 2015. http://CRAN.R-project.org/package=norm (accessed)
6. StataCorp. Stata multiple-imputation reference manual: release 12. College Station, TX: Stata Press; 2011.
7. SAS Institute Inc. SAS/STAT 9.1 user's guide. Chapter 46. Cary, NC: SAS Institute Inc; 2004.
8. Schafer J. Analysis of incomplete multivariate data. Boca Raton: Chapman & Hall/CRC; 1997.
9. van Buuren S, Boshuizen H, Knook D. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18:681–694. doi: 10.1002/(sici)1097-0258(19990330)18:6<681::aid-sim71>3.0.co;2-r.
10. Chen H. Compatibility of conditionally specified models. Stat Probab Lett. 2010;80:670–677. doi: 10.1016/j.spl.2009.12.025.
11. Hughes R, White I, Seaman S, et al. Joint modelling rationale for chained equations. BMC Med Res Meth. 2014;14:28. doi: 10.1186/1471-2288-14-28.
12. van Buuren S, Brand J, Groothuis-Oudshoorn C, et al. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76:1049–1064.
13. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Meth Med Res. 2007;16:219–242. doi: 10.1177/0962280206074463.
14. Bartlett JW, Seaman SR, White IR, et al. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Meth Med Res. 2015;24:462–487. doi: 10.1177/0962280214521348.
15. van Buuren S, Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67.
16. Raghunathan T, Solenberger P, van Hoewyk J. IVEware: imputation and variance estimation software. 2016. http://www.isr.umich.edu/src/smp/ive (accessed)
17. Taljaard M, Donner A, Klar N. Imputation strategies for missing continuous outcomes in cluster randomized trials. Biometr J. 2008;50:329–345. doi: 10.1002/bimj.200710423.
18. Andridge R. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometr J. 2011;53:57–74. doi: 10.1002/bimj.201000140.
19. Drechsler J. Multiple imputation of multilevel missing data: rigor versus simplicity. J Educ Behav Stat. 2015;40:69–95.
20. Schafer J, Yucel R. Computational strategies for multivariate linear mixed-effects models with missing values. J Comput Graph Stat. 2002;11:437–457.
21. Yucel R. Random covariances and mixed-effects models for imputing multivariate multilevel continuous data. Stat Model. 2011;11:351–370. doi: 10.1177/1471082X1001100404.
22. Zhao J, Schafer J. pan: Multiple imputation for multivariate panel or clustered data. 2015. R package version 1.3.
23. Carpenter J, Goldstein H, Kenward M. REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. J Stat Softw. 2011;45:1–14.
24. Quartagno M, Carpenter J. jomo: a package for multilevel joint modelling multiple imputation. 2016. R package version 2.2, http://CRAN.R-project.org/package=jomo (accessed)
25. van Buuren S. Multiple imputation of multilevel data. In: Hox J, Roberts J, editors. The handbook of advanced multilevel analysis. New York: Taylor & Francis, Ltd; 2010. pp. 173–196.
26. van Buuren S. Flexible imputation of missing data. Boca Raton: CRC Press; 2012.
27. van Buuren S, Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. 2015. R package version 2.25, http://CRAN.R-project.org/package=mice (accessed)
28. Stewart L, Parmar M. Meta-analysis of the literature or of individual patient data: is there a difference? Lancet. 1993;341:418–22. doi: 10.1016/0140-6736(93)93004-k.
29. Resche-Rigon M, White I, Bartlett J, et al. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Stat Med. 2013;32:4890–4905. doi: 10.1002/sim.5894.
30. Jolani S, Debray T, Koffijberg H, et al. Imputation of systematically missing predictors in an individual participant data meta-analysis: a generalized approach using MICE. Stat Med. 2015;34:1841–1863. doi: 10.1002/sim.6451.
31. West B, Welch K, Galecki A. Linear mixed models: a practical guide using statistical software. Boca Raton: CRC Press; 2007.
32. Lee Y, Nelder J, Pawitan Y. Generalized linear models with random effects: unified analysis via H-likelihood. Boca Raton: CRC Press; 2006.
33. Laird N, Ware J. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974.
34. Jackson D, Riley R, White I. Multivariate meta-analysis: potential and promise. Stat Med. 2011;30:2481–2498. doi: 10.1002/sim.4172.
35. Riley R, Lambert P, Staessen J, et al. Meta-analysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008;27:1870–1893. doi: 10.1002/sim.3165.
36. Debray TP, Moons KG, Abo-Zaid GMA, et al. Individual participant data meta-analysis for a binary outcome: one-stage or two-stage? PLoS One. 2013;8:e60650. doi: 10.1371/journal.pone.0060650.
37. Pinheiro J, Bates D. Mixed-effects models in S and S-PLUS. New York: Springer Science & Business Media; 2000.
38. Jackson D, White I, Thompson S. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Stat Med. 2010;29:1282–1297. doi: 10.1002/sim.3602.
39. Pinheiro J, Bates D, DebRoy S, et al. nlme: linear and nonlinear mixed effects models. 2016. R package version 3.1-28, http://CRAN.R-project.org/package=nlme (accessed)
40. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2016. http://www.R-project.org/ (accessed)
41. Simmonds M, Higgins J, Stewart L, et al. Meta-analysis of individual patient data from randomized trials: a review of methods used in practice. Clinical Trials. 2005;2:209–217. doi: 10.1191/1740774505cn087oa.
42. Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data meta-analysis with binary outcomes. BMC Med Res Meth. 2014;14:79. doi: 10.1186/1471-2288-14-79.
43. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7:177–188. doi: 10.1016/0197-2456(86)90046-2.
44. Gasparrini A. mvmeta: multivariate meta-analysis and meta-regression, R package version 0.4.5. 2011.
45. Gasparrini A, Armstrong B, Kenward M. Multivariate meta-analysis for non-linear and other multi-parameter associations. Stat Med. 2012;31:3821–3839. doi: 10.1002/sim.5471.
46. Van Buuren S, Oudshoorn K. Flexible multivariate imputation by MICE. Leiden, The Netherlands: TNO Prevention Center. 1999. Technical report, http://www.stefvanbuuren.nl/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf (accessed)
47. Bates D, Maechler M, Bolker B, et al. lme4: linear mixed-effects models using Eigen and S4, R package version 1.1-12. 2016. http://CRAN.R-project.org/package=lme4 (accessed)
48. Mebazaa A, Gayat E, Lassus J, et al. Association between elevated blood glucose and outcome in acute heart failure: results from an international observational cohort. J Am Coll Cardiol. 2013;61:820–829. doi: 10.1016/j.jacc.2012.11.054.
49. Lassus J, Gayat E, Mueller C, et al. Incremental value of biomarkers to clinical variables for mortality prediction in acutely decompensated heart failure: the multinational observational cohort on acute heart failure (MOCA) study. Int J Cardiol. 2013;168:2186–2194. doi: 10.1016/j.ijcard.2013.01.228.
50. Riley R, Lambert P, Abo-Zaid G. Meta-analysis of individual participant data: rationale, conduct, and reporting. Br Med J. 2010;340:521–525. doi: 10.1136/bmj.c221.
51. Quartagno M, Carpenter J. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with missing covariates. Stat Med. 2016;35:2938–2954. doi: 10.1002/sim.6837.
52. Gilks W, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in practice. Boca Raton: Springer; 1996. Full conditional distributions; pp. 75–88.
53. Liu J, Gelman A, Hill J, et al. On the stationary distribution of iterative imputations. Biometrika. 2014;101:155–173.
54. Gelman A. Parameterization and Bayesian modeling. J Am Stat Assoc. 2004;99:537–545.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Appendix

NIHMS71930-supplement-Appendix.pdf^{(85.9KB, pdf)}

[R1] 1.Rubin DB. Multiple Imputation for Nonresponse in Surveys Wiley Series in Probability and Statistics. New York: Wiley; 1987. [Google Scholar]

[R2] 2.Little R, Rubin D. Statistical analysis with missing data. 2nd ed. Hoboken, NJ: Wiley; 2002. [Google Scholar]

[R3] 3.White I, Royston P, Wood A. Multiple imputation using chained equations: issues and guidance for practice. Stat Med. 2011;30:377–399. doi: 10.1002/sim.4067. [DOI] [PubMed] [Google Scholar]

[R4] 4.Carpenter J, Kenward M. Multiple imputation and its application. New York: John Wiley & Sons; 2012. [Google Scholar]

[R5] 5.Novo A, Schafer J. Package NORM, R package version 1.0-9.5. 2015 http://CRAN.R-project.org/package=norm (accessed)

[R6] 6.StataCorp. Stata multiple-imputation reference manual: release 12. College Station, TX: Stata Press; 2011. [Google Scholar]

[R7] 7.SAS Institute Inc. SAS/STAT 9.1 user s guide. Chapter 46 Cary, NC: SAS Institute Inc; 2004. [Google Scholar]

[R8] 8.Schafer J. Analysis of incomplete multivariate data. Boca Raton: Chapman & Hall/CRC; 1997. [Google Scholar]

[R9] 9.van Buuren S, Boshuizen H, Knook D. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med. 1999;18:681–694. doi: 10.1002/(sici)1097-0258(19990330)18:6<681::aid-sim71>3.0.co;2-r. [DOI] [PubMed] [Google Scholar]

[R10] 10.Chen H. Compatibility of conditionally specified models. Stat Probab Lett. 2010;80:670–677. doi: 10.1016/j.spl.2009.12.025. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] 11.Hughes R, White I, Seaman S, et al. Joint modelling rationale for chained equations. BMC Med Res Meth. 2014;14:28. doi: 10.1186/1471-2288-14-28. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R12] 12.van Buuren S, Brand J, Groothuis-Oudshoorn C, et al. Fully conditional specification in multivariate imputation. J Stat Comput Simul. 2006;76:1049–1064. [Google Scholar]

[R13] 13.van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Meth Med Res. 2007;16:219–242. doi: 10.1177/0962280206074463. [DOI] [PubMed] [Google Scholar]

[R14] 14.Bartlett JW, Seaman SR, White IR, et al. Multiple imputation of covariates by fully conditional specification: accommodating the substantive model. Stat Meth Med Res. 2015;24:462–487. doi: 10.1177/0962280214521348. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R15] 15.van Buuren S, Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. J Stat Softw. 2011;45:1–67. [Google Scholar]

[R16] 16.Raghunathan T, Solenberger P, van Hoewyk J. IVEware: imputation and variance estimation software. 2016 http://www.isr.umich.edu/src/smp/ive (accessed)

[R17] 17.Taljaard M, Donner A, Klar N. Imputation strategies for missing continuous outcomes in cluster randomized trials. Biometr J. 2008;50:329–345. doi: 10.1002/bimj.200710423. [DOI] [PubMed] [Google Scholar]

[R18] 18.Andridge R. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials. Biometr J. 2011;53:57–74. doi: 10.1002/bimj.201000140. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Drechsler J. Multiple imputation of multilevel missing datarigor versus simplicity. J Educ Behav Stat. 2015;40:69–95. [Google Scholar]

[R20] 20.Schafer J, Yucel R. Computational strategies for multivariate linear mixed-effects models with missing values. J Comput Graph Stat. 2002;11:437–457. [Google Scholar]

[R21] 21.Yucel R. Random covariances and mixed-effects models for imputing multivariate multilevel continuous data. Stat Model. 2011;11:351–370. doi: 10.1177/1471082X1001100404. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] 22.Zhao J, Schafer J. pan: Multiple imputation for multivariate panel or clustered data. 2015 R package version 1.3. [Google Scholar]

[R23] 23.Carpenter J, Goldstein H, Kenward M. REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. J Stat Softw. 2011;45:1–14. [Google Scholar]

[R24] 24.Quartagno M, Carpenter J. jomo: a package for multilevel joint modelling multiple imputation. 2016 R package version 2.2, http://CRAN.R-project.org/package=jomo (accessed)

[R25] 25.van Buuren S. Multiple imputation of multilevel data. In: Hox J, Roberts J, editors. The handbook of advanced multilevel analysis. New York: Taylor & Francis, Ltd; 2010. pp. 173–196. [Google Scholar]

[R26] 26.van Buuren S. Flexible imputation of missing data. Boca Raton: CRC Press; 2012. [Google Scholar]

[R27] 27.van Buuren S, Groothuis-Oudshoorn K. MICE: multivariate imputation by chained equations in R. 2015 R package version 2.25, http://CRAN.R-project.org/package=mice (accessed)

[R28] 28.Stewart L, Parmar M. Meta-analysis of the literature or of individual patient data: is there a difference? Lancet. 1993;341:418–22. doi: 10.1016/0140-6736(93)93004-k. [DOI] [PubMed] [Google Scholar]

[R29] 29.Resche-Rigon M, White I, Bartlett J, et al. Multiple imputation for handling systematically missing confounders in meta-analysis of individual participant data. Stat Med. 2013;32:4890–4905. doi: 10.1002/sim.5894. [DOI] [PubMed] [Google Scholar]

[R30] 30.Jolani S, Debray T, Koffijberg H, et al. Imputation of systematically missing predictors in an individual participant data meta-analysis: a generalized approach using MICE. Stat Med. 2015;34:1841–1863. doi: 10.1002/sim.6451. [DOI] [PubMed] [Google Scholar]

[R31] 31.West B, Welch K, Galecki A. Linear mixed models: a practical guide using statistical software. Boca Raton: CRC Press; 2007. [Google Scholar]

[R32] 32.Lee Y, Nelder J, Pawitan Y. Generalized linear models with random effects: unified analysis via H-likelihood. Boca Raton: CRC Press; 2006. [Google Scholar]

[R33] 33.Laird N, Ware J. Random-effects models for longitudinal data. Biometrics. 1982;38:963–974. [PubMed] [Google Scholar]

[R34] 34.Jackson D, Riley R, White I. Multivariate meta-analysis: potential and promise. Stat Med. 2011;30:2481–2498. doi: 10.1002/sim.4172. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R35] 35.Riley R, Lambert P, Staessen J, et al. Meta-analysis of continuous outcomes combining individual patient data and aggregate data. Stat Med. 2008;27:1870–1893. doi: 10.1002/sim.3165. [DOI] [PubMed] [Google Scholar]

[R36] 36.Debray TP, Moons KG, Abo-Zaid GMA, et al. Individual participant data meta-analysis for a binary outcome: one-stage or two-stage? PLoS One. 2013;8:e60650. doi: 10.1371/journal.pone.0060650. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R37] 37. Pinheiro J, Bates D. Mixed-effects models in S and S-PLUS. New York: Springer Science & Business Media; 2000.

[R38] 38. Jackson D, White I, Thompson S. Extending DerSimonian and Laird’s methodology to perform multivariate random effects meta-analyses. Stat Med. 2010;29:1282–1297. doi: 10.1002/sim.3602.

[R39] 39. Pinheiro J, Bates D, DebRoy S, et al. nlme: linear and nonlinear mixed effects models. 2016 R package version 3.1-28, http://CRAN.R-project.org/package=nlme (accessed)

[R40] 40. R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing; Vienna, Austria: 2016. http://www.R-project.org/ (accessed)

[R41] 41. Simmonds M, Higgins J, Stewart L, et al. Meta-analysis of individual patient data from randomized trials: a review of methods used in practice. Clinical Trials. 2005;2:209–217. doi: 10.1191/1740774505cn087oa.

[R42] 42. Thomas D, Radji S, Benedetti A. Systematic review of methods for individual patient data meta-analysis with binary outcomes. BMC Med Res Meth. 2014;14:79. doi: 10.1186/1471-2288-14-79.

[R43] 43. DerSimonian R, Laird N. Meta-analysis in clinical trials. Control Clin Trials. 1986;7:177–188. doi: 10.1016/0197-2456(86)90046-2.

[R44] 44. Gasparrini A. mvmeta: multivariate meta-analysis and meta-regression, R package version 0.4.5. 2011

[R45] 45. Gasparrini A, Armstrong B, Kenward M. Multivariate meta-analysis for non-linear and other multi-parameter associations. Stat Med. 2012;31:3821–3839. doi: 10.1002/sim.5471.

[R46] 46. Van Buuren S, Oudshoorn K. Flexible multivariate imputation by MICE. Leiden, The Netherlands: TNO Prevention Center. 1999 Technical report, http://www.stefvanbuuren.nl/publications/Flexible%20multivariate%20-%20TNO99054%201999.pdf (accessed)

[R47] 47. Bates D, Maechler M, Bolker B, et al. lme4: linear mixed-effects models using Eigen and S4, R package version 1.1-12. 2016 http://CRAN.R-project.org/package=lme4 (accessed)

[R48] 48. Mebazaa A, Gayat E, Lassus J, et al. Association between elevated blood glucose and outcome in acute heart failure: results from an international observational cohort. J Am Coll Cardiol. 2013;61:820–829. doi: 10.1016/j.jacc.2012.11.054.

[R49] 49. Lassus J, Gayat E, Mueller C, et al. Incremental value of biomarkers to clinical variables for mortality prediction in acutely decompensated heart failure: the multinational observational cohort on acute heart failure (MOCA) study. Int J Cardiol. 2013;168:2186–2194. doi: 10.1016/j.ijcard.2013.01.228.

[R50] 50. Riley R, Lambert P, Abo-Zaid G. Meta-analysis of individual participant data: rationale, conduct, and reporting. Br Med J. 2010;340:521–525. doi: 10.1136/bmj.c221.

[R51] 51. Quartagno M, Carpenter J. Multiple imputation for IPD meta-analysis: allowing for heterogeneity and studies with missing covariates. Stat Med. 2016;35:2938–2954. doi: 10.1002/sim.6837.

[R52] 52. Gilks W, Richardson S, Spiegelhalter D. Markov Chain Monte Carlo in practice. Boca Raton: Springer; 1996. Full conditional distributions; pp. 75–88.

[R53] 53. Liu J, Gelman A, Hill J, et al. On the stationary distribution of iterative imputations. Biometrika. 2014;101:155–173.

[R54] 54. Gelman A. Parameterization and Bayesian modeling. J Am Stat Assoc. 2004;99:537–545.
