Review and Evaluation of Imputation Methods for Multivariate Longitudinal data with Mixed-type Incomplete Variables

Yi Cao; Heather Allore; Brent Vander Wyk; Roee Gutman

doi:10.1002/sim.9592

Author manuscript; available in PMC: 2023 Dec 30.

Published in final edited form as: Stat Med. 2022 Oct 11;41(30):5844–5876. doi: 10.1002/sim.9592


PMCID: PMC9771917 NIHMSID: NIHMS1838926 PMID: 36220138

Summary

Estimating relationships between multiple incomplete patient measurements requires methods to cope with missing values. Multiple imputation is one approach to address missing data by filling in plausible values for those that are missing. Multiple imputation procedures can be classified into two broad types: joint modeling (JM) and fully conditional specification (FCS). JM fits a multivariate distribution for the entire set of variables, but it may be complex to define and implement. FCS imputes missing data variable-by-variable from a set of conditional distributions. In many studies, FCS is easier to define and implement than JM, but it may be based on incompatible conditional models. Imputation methods based on multilevel modeling show improved operating characteristics when imputing longitudinal data, but they can be computationally intensive, especially when imputing multiple variables simultaneously. We review current multiple imputation (MI) methods for incomplete longitudinal data and their implementation in widely accessible software. Using simulated data from the National Health and Aging Trends Study, we compare their performance for monotone and intermittent missing data patterns. Our simulations demonstrate that in a longitudinal study with a limited number of repeated observations and time-varying variables, FCS-Standard is a computationally efficient imputation method that is accurate and precise for univariate single-level and multilevel regression models. When the analyses comprise multivariate multilevel models, FCS-LMM-latent is a statistically valid procedure with overall more accurate estimates, but it requires more intensive computations. Imputation methods based on generalized linear multilevel models can lead to biased subject-level variance estimates when the statistical analyses involve hierarchical models.

Keywords: longitudinal analysis, multiple imputation, chained equations, joint modeling

1 |. INTRODUCTION

Missing data are often inevitable in longitudinal studies. A primary reason is that non-response can occur at any time in the study. Individuals’ responses may be missing because they have moved out of the area, missed an appointment, were too ill to attend, or died. In studies involving annual surveys, missing data also occur when participants refuse to answer or do not know the answer.

Commonly used statistically valid methods for handling missing data can be classified into three broad types: (1) likelihood and Bayesian methods; (2) weighting methods; and (3) imputation methods.¹ Likelihood and Bayesian methods define a model for the observed and unobserved variables. Using computational techniques such as the EM algorithm² or Data Augmentation,³ they provide estimates for the estimands of interest. These methods may result in biased estimates when the model is mis-specified or when the missing data mechanism is non-ignorable. Weighting is an alternative approach to handling missing data. This approach weights the observed data to account for missing observations using the estimated probabilities of non-response.⁴⁻⁶ Weighting methods are best suited for monotone missing data patterns and are commonly used when the missing variable is scalar. In contrast to the former two methods, imputation methods explicitly “fill in” the missing values with plausible values. Single imputation has been shown to result in sampling variance estimates that are too small.⁷ Multiple imputation procedures circumvent this issue by replacing each missing value with a set of B plausible values drawn from the predictive distribution of the missing data. Based on these values, B complete datasets are generated. Each dataset is analyzed separately, and final estimates are obtained using common combination rules.⁸

Although the idea behind multiple imputation seems simple, developing procedures to produce plausible values is more complex. Multiple papers have summarized and compared possible procedures to impute scalar variables,⁹ multiple non-clustered variables,¹⁰⁻¹³ and multiple continuous longitudinal variables.¹⁴⁻¹⁸ However, methods to impute mixed-type variables in longitudinal studies are more limited and dispersed. Kalaycioglu et al¹⁹ compared three chained equations imputation approaches,²⁰ the multivariate Normal imputation,¹¹ and a Bayesian imputation approach to impute time-varying binary, categorical, skewed, and normally distributed variables. Huque et al²¹ presented a comparison study with twelve currently available imputation methods for longitudinal data with incomplete continuous and binary variables. Other studies compared imputation methods for continuous and binary multilevel data,¹⁶,²²⁻²⁵ and multilevel categorical data²⁶ in non-longitudinal settings. This manuscript reviews available imputation methods for multiple variables of various types in longitudinal studies. Using simulations based on the National Health and Aging Trends Study (NHATS),²⁷ we compare the operating characteristics of different methods for handling missing values in both outcomes and explanatory variables. Code in R²⁸ for implementing all of the methods for the simulations and the real-data example is provided.²⁹

The paper proceeds as follows. Section 2 introduces the multiple imputation approach. Sections 3 and 4 review the fully conditional specification (FCS) imputation procedures and the joint modeling (JM) imputation procedures, respectively. Section 5 describes the simulation analyses and presents the results of the simulations. Section 6 demonstrates an application that estimates the associations between hospital admissions, physical and cognitive abilities, and skilled nursing facility admissions using different imputation methods. Section 7 provides a discussion and conclusions.

2 |. MULTIPLE IMPUTATION FOR LONGITUDINAL STUDIES

2.1 |. Notations and assumptions

Let Y = {y_ijk} represent a multivariate longitudinal dataset such that y_ijk is the value of variable Y_j, j ∈ {1, …, J}, for subject i ∈ {1, …, n} at time t_k, k ∈ {1, …, K}. The data Y can be saved in either a wide or long matrix format. In wide format, Y is stored as an n × JK matrix with each row representing an individual and each column corresponding to variable Y_j measured at time t_k, denoted by Y_jk. In long format, Y is an nK × (J + 1) matrix, in which the last column records the time point at which the observation was made and the other columns are the J variables. For simplicity, we define Y_l, l ∈ {1, …, L}, to be a column in either the wide or long format. In wide format, L = JK and Y_l is a vector of size n. In long format, L = J and Y_l is a vector of size nK. Tables 1 and 2 present an example of data arranged in wide format and long format, respectively, with K = 4 time points and J = 3 variables. Question marks represent missing values.

TABLE 1.

Example of data arranged in the wide format

ID	Time t₁			Time t₂			Time t₃			Time t₄
ID	Y₁₁	Y₂₁	Y₃₁	Y₁₂	Y₂₂	Y₃₂	Y₁₃	Y₂₃	Y₃₃	Y₁₄	Y₂₄	Y₃₄
1	y₁₁₁	y₁₂₁	y₁₃₁	y₁₁₂	y₁₂₂	y₁₃₂	?	?	?	?	?	?
2	y₂₁₁	y₂₂₁	y₂₃₁	y₂₁₂	y₂₂₂	y₂₃₂	y₂₁₃	y₂₂₃	y₂₃₃	?	?	?
3	y₃₁₁	y₃₂₁	y₃₃₁	y₃₁₂	y₃₂₂	y₃₃₂	?	y₃₂₃	?	y₃₁₄	y₃₂₄	y₃₃₄


TABLE 2.

Example of data arranged in the long format

ID	Y₁	Y₂	Y₃	Time
1	y₁₁₁	y₁₂₁	y₁₃₁	t₁
1	y₁₁₂	y₁₂₂	y₁₃₂	t₂
1	?	?	?	t₃
1	?	?	?	t₄
2	y₂₁₁	y₂₂₁	y₂₃₁	t₁
2	y₂₁₂	y₂₂₂	y₂₃₂	t₂
2	y₂₁₃	y₂₂₃	y₂₃₃	t₃
2	?	?	?	t₄
3	y₃₁₁	y₃₂₁	y₃₃₁	t₁
3	y₃₁₂	y₃₂₂	y₃₃₂	t₂
3	?	y₃₂₃	?	t₃
3	y₃₁₄	y₃₂₄	y₃₃₄	t₄
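The correspondence between the two layouts can be sketched in Python. This is only an illustration: the function name `wide_to_long`, the use of `None` for a missing value, and the extra leading subject-ID column are assumptions of the sketch, not part of the authors' code.

```python
def wide_to_long(wide, J, K):
    """Convert an n x (J*K) wide table into an (n*K) x (J+2) long table.

    Wide columns are ordered as in Table 1: Y_11, ..., Y_J1, Y_12, ..., Y_JK.
    Each long row is [subject id, Y_1, ..., Y_J, time index], as in Table 2.
    Missing values are represented by None (an assumption of this sketch).
    """
    long_rows = []
    for i, row in enumerate(wide, start=1):
        if len(row) != J * K:
            raise ValueError("each wide row must have J*K entries")
        for k in range(K):
            values = row[k * J:(k + 1) * J]  # the J variables at time k+1
            long_rows.append([i, *values, k + 1])
    return long_rows


# One subject with J = 3 and K = 4, with intermittent gaps at the third wave
# (the same pattern as subject 3 in Table 1).
wide = [[1.0, 2.0, 3.0, 1.1, 2.1, 3.1, None, 2.2, None, 1.3, 2.3, 3.3]]
long_table = wide_to_long(wide, J=3, K=4)
```

Each subject contributes K rows in the long layout, so a value that is missing at one wave leaves the other entries of that row intact.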


Let M = {m_ijk} be a matrix of indicators, such that m_ijk = 0 when y_ijk is observed and m_ijk = 1 otherwise. In addition, let $Y_{l}^{obs}$ and $Y_{l}^{mis}$ be the observed (m_ijk = 0) and the missing (m_ijk = 1) parts of Y_l, respectively. Monotone and intermittent missing data are two missing data patterns that are commonly observed in longitudinal data.⁷ Monotone missing data occur when subjects drop out of the study and do not return for follow-up appointments. If subject i drops out of the study at time point t_k, then m_ijk* = 1, ∀j ∈ {1, …, J} and k* ≥ k. Intermittent missing data usually occur when subjects skip an interview or refuse to answer certain questions. In such cases, m_ijk = 1 can occur for any j ∈ {1, …, J} and k ∈ {1, …, K}, while values at later time points may still be observed. Imputation of $Y_{l}^{mis}$ involves modeling the relationships between Y_l, complete covariates X = {x_ip}, p ∈ {1, …, P}, and all the other L − 1 variables, Y_−l.
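As a small illustration of these definitions, the following Python sketch classifies one subject's indicator array as complete, monotone, or intermittent. The function name `classify_pattern` and the nested-list layout (one inner list of J indicators per time point) are assumptions made for the example.

```python
def classify_pattern(m_i):
    """Classify one subject's missingness pattern over K time points.

    m_i is a list of K lists, each of length J, holding the indicators
    m_ijk for a fixed subject i (1 = missing, 0 = observed), ordered by time.
    Returns 'complete', 'monotone', or 'intermittent'.
    """
    any_missing = [any(row) for row in m_i]
    all_missing = [all(row) for row in m_i]
    if not any(any_missing):
        return "complete"
    first = any_missing.index(True)  # first wave with a missing value
    # Monotone dropout: from that wave onward, every variable is missing.
    if all(all_missing[first:]):
        return "monotone"
    return "intermittent"


# Subjects 1 and 3 of Table 1 (J = 3 variables, K = 4 time points).
subject_1 = [[0, 0, 0], [0, 0, 0], [1, 1, 1], [1, 1, 1]]
subject_3 = [[0, 0, 0], [0, 0, 0], [1, 0, 1], [0, 0, 0]]
pattern_1 = classify_pattern(subject_1)
pattern_3 = classify_pattern(subject_3)
```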

To obtain valid inferences when data comprise missing values, one should consider the missing data mechanism,⁷ P(M|Y,X,ϕ), where ϕ are the parameters governing this distribution. In many software packages, the default method for missing values is the list-wise deletion procedure, which assumes that the data are missing completely at random (MCAR). MCAR implies that the missing data mechanism is unrelated to missing and observed values, P(M|Y,X,ϕ) = P(M|ϕ). In aging research, older adults may be too ill to complete the study, and assuming MCAR can lead to biased estimates. One way to relax the MCAR assumption is to assume that missing data depend only on observed values, also known as missing at random (MAR), P(M|Y,X,ϕ) = P(M|Y^obs,X,ϕ), where $Y^{obs} = \{Y_{l}^{obs}\}$. Under MAR, a variety of analytic strategies to address missing data can be considered. The third type of missing data mechanism is not missing at random (NMAR). Under NMAR, missing data depend on both the observed and unobserved values. Because missing data depend on unobserved values, methods to handle missing data under NMAR rely on assumptions that are not verifiable from the observed data. Thus, many authors have emphasized the need for sensitivity analysis to assess inferences under different plausible assumptions.¹,³⁰ This paper describes multiple imputation methods that assume that the missing data mechanism is MAR, and it examines their performance when these methods are applied to multiple incomplete longitudinal variables of different types.

2.2 |. Multiple imputation for multivariate data

Assuming MAR, a multiple imputation (MI) procedure generates B plausible values for each missing value resulting in B complete datasets. Each complete dataset is analyzed separately, and point and interval estimates are obtained using common combination rules. The procedure for creating imputations for all partially observed variables in Y consists of three steps:

(1) Specify the model P(Y₁, …, Y_L|θ,X) and prior distributions of parameters P(θ), and calculate the posterior distribution of θ based on Y^obs;
(2) Draw a value θ* from its posterior distribution P(θ|Y^obs,X);
(3) Draw imputations $Y_{1}^{*}, \dots, Y_{L}^{*}$ from the conditional posterior predictive distribution of Y^mis given Y^obs, θ*, and X, P(Y^mis|Y^obs,θ*,X).
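The three steps can be illustrated for the simplest case of a single incomplete continuous variable with complete covariates. The sketch below is not the authors' implementation: it assumes a Normal linear model with the standard noninformative prior p(β, σ²) ∝ 1/σ², and the function name `impute_normal_once` is hypothetical.

```python
import numpy as np


def impute_normal_once(y, X, rng):
    """One proper imputation draw for a continuous y under a Normal linear model.

    y: (n,) array with np.nan marking missing values.
    X: (n, p) complete design matrix (including an intercept column).
    Assumes the noninformative prior p(beta, sigma^2) proportional to 1/sigma^2.
    """
    obs = ~np.isnan(y)
    Xo, yo = X[obs], y[obs]
    n_obs, p = Xo.shape
    # Step 1: summaries that define the posterior of theta = (beta, sigma^2).
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    # Step 2: draw theta* from its posterior distribution.
    sigma2_star = resid @ resid / rng.chisquare(n_obs - p)
    beta_star = rng.multivariate_normal(beta_hat, sigma2_star * XtX_inv)
    # Step 3: draw the missing values from the posterior predictive distribution.
    y_imp = y.copy()
    n_mis = int((~obs).sum())
    noise = rng.normal(0.0, np.sqrt(sigma2_star), n_mis)
    y_imp[~obs] = X[~obs] @ beta_star + noise
    return y_imp


rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
y[:10] = np.nan  # make the first ten responses missing
completed = impute_normal_once(y, X, rng)
```

Repeating the call B times with fresh draws of θ* would produce the B completed datasets; drawing θ* rather than plugging in its estimate is what propagates parameter uncertainty into the imputations.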

After the imputation for all incomplete variables is completed, researchers can conduct any statistical analysis that they would have performed on a complete dataset. Using common combination rules, an estimate for a scalar parameter of interest, β, is derived as the average of ${\hat{β}}^{(u)}$, u ∈ {1, …, B}, which is the estimate of β within complete dataset u. Its sampling variance, $var (\hat{β})$, is estimated by summing the average sampling variance within imputations and the variance between imputations, the latter inflated by a factor of (1 + 1/B).⁸ Formally,

\hat{β} = \frac{1}{B} \sum_{u = 1}^{B} {\hat{β}}^{(u)}, \quad var (\hat{β}) = \frac{1}{B} U_{B} + \left(1 + \frac{1}{B}\right) W_{B}, \quad \text{where } U_{B} = \sum_{u = 1}^{B} \widehat{var} ({\hat{β}}^{(u)}), \quad W_{B} = \frac{1}{B - 1} \sum_{u = 1}^{B} {({\hat{β}}^{(u)} - \hat{β})}^{2} .

(2.1)
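The combination rules in Equation (2.1) amount to a few lines of arithmetic. The following Python sketch pools B point estimates and their estimated sampling variances; the function name `pool_rubin` and the example numbers are illustrative assumptions.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool B complete-data estimates of a scalar parameter.

    estimates: the B point estimates, one per imputed dataset.
    variances: the B estimated sampling variances of those estimates.
    Returns (pooled estimate, total variance).
    """
    B = len(estimates)
    if B < 2 or len(variances) != B:
        raise ValueError("need B >= 2 estimates and matching variances")
    beta_hat = mean(estimates)
    within = mean(variances)       # average within-imputation variance
    between = variance(estimates)  # between-imputation variance, divisor B - 1
    total = within + (1 + 1 / B) * between
    return beta_hat, total


# Hypothetical results from B = 3 imputed datasets.
pooled_estimate, pooled_variance = pool_rubin([1.0, 1.2, 0.8], [0.04, 0.05, 0.03])
```

In this example the average within-imputation variance and the between-imputation variance are both 0.04, so the total variance is 0.04 + (4/3) × 0.04 ≈ 0.093.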

Two main strategies for specifying a distribution of all incomplete variables have been proposed: joint modeling (JM) and fully conditional specification (FCS).¹ The JM approach defines a multivariate distribution for all incomplete variables, P(Y₁, …, Y_L|θ,X). The FCS approach specifies a set of univariate conditional distributions, one for each incomplete variable given the other variables, {P(Y_l|θ_l,Y_−l,X)}. Compared to the JM approach, the FCS approach provides more flexibility for imputing different types of variables, but it may suffer from theoretical limitations, because a joint distribution that is consistent with the different conditional models may not exist.¹³

Under either the FCS or JM strategy, the imputation models used to create $Y_{l}^{*}$ depend on the data format. With long format, multilevel models are commonly used as imputation models, where the variance at the subject level captures the correlation between repeated measurements.^31–33 In wide format data, single-level models are specified to impute an incomplete variable at a specific time point. The correlation between the repeated measurements of an incomplete variable is accounted for by adjusting for its values measured at all the other time points as constant effects.³⁴ This approach assumes an unstructured correlation structure between the repeated measurements of the incomplete variables. While a wide format can be used when the same observation is recorded for the same unit over time, it may be impractical when the same observation is recorded for different units, but units are grouped within clusters. The imputation methods using wide format data should generally be used for balanced longitudinal data in which information on all individuals is recorded over similar intervals. When individuals’ reporting is recorded over different time intervals, special care should be given to the time between reports.

3 |. IMPUTATION BY FULLY CONDITIONAL SPECIFICATIONS

The FCS approach is also referred to as multivariate imputation by chained equations (MICE). MICE iterates through sampling from the conditional posterior distributions of model parameters, $P (θ_{l} \mid Y_{l}^{obs}, Y_{- l}, X)$, and sampling from the conditional posterior predictive distributions $P (Y_{l}^{mis} \mid Y_{l}^{obs}, Y_{- l}, X, θ_{l}^{*})$. Formally, the t-th iteration of MICE involves sampling

\begin{array}{lll}
θ_{1}^{* (t)} & \sim & P (θ_{1} \mid Y_{1}^{obs}, Y_{- 1}^{(t - 1)}, X), \\
Y_{1}^{* (t)} & \sim & P (Y_{1} \mid Y_{1}^{obs}, Y_{- 1}^{(t - 1)}, X, θ_{1}^{* (t)}), \\
\vdots & & \\
θ_{L}^{* (t)} & \sim & P (θ_{L} \mid Y_{L}^{obs}, Y_{- L}^{(t)}, X), \\
Y_{L}^{* (t)} & \sim & P (Y_{L} \mid Y_{L}^{obs}, Y_{- L}^{(t)}, X, θ_{L}^{* (t)}) .
\end{array}

(3.1)
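A minimal Python sketch of the cycle in Equation (3.1) is given below for the special case in which every incomplete variable is continuous and each conditional model is a Bayesian Normal linear regression on all other variables. It is an illustration under those assumptions rather than the algorithm of any particular package; the function names, the initialization by sampling observed values, and the simulated data are all hypothetical.

```python
import numpy as np


def draw_normal_imputation(y, obs, X, rng):
    """Draw theta* and then the missing part of y from a Normal linear model."""
    Xo, yo = X[obs], y[obs]
    XtX_inv = np.linalg.inv(Xo.T @ Xo)
    beta_hat = XtX_inv @ Xo.T @ yo
    resid = yo - Xo @ beta_hat
    sigma2 = resid @ resid / rng.chisquare(Xo.shape[0] - Xo.shape[1])
    beta = rng.multivariate_normal(beta_hat, sigma2 * XtX_inv)
    n_mis = int((~obs).sum())
    return X[~obs] @ beta + rng.normal(0.0, np.sqrt(sigma2), n_mis)


def mice_continuous(Y, n_iter=10, seed=0):
    """Chained-equations imputation of an (n, L) array with np.nan as missing."""
    rng = np.random.default_rng(seed)
    obs = ~np.isnan(Y)
    Y_cur = Y.copy()
    # Initialize each missing entry with a random observed value of its column.
    for l in range(Y.shape[1]):
        n_mis = int((~obs[:, l]).sum())
        Y_cur[~obs[:, l], l] = rng.choice(Y[obs[:, l], l], size=n_mis)
    for _ in range(n_iter):
        for l in range(Y.shape[1]):
            if obs[:, l].all():
                continue  # nothing to impute in a complete column
            # Condition on the current values of all other variables, Y_{-l}.
            others = np.delete(Y_cur, l, axis=1)
            X = np.column_stack([np.ones(len(Y_cur)), others])
            Y_cur[~obs[:, l], l] = draw_normal_imputation(
                Y_cur[:, l], obs[:, l], X, rng
            )
    return Y_cur


rng = np.random.default_rng(1)
cov = [[1.0, 0.5, 0.3], [0.5, 1.0, 0.4], [0.3, 0.4, 1.0]]
Y_full = rng.multivariate_normal([0.0, 0.0, 0.0], cov, size=200)
Y_miss = Y_full.copy()
Y_miss[rng.random(Y_full.shape) < 0.2] = np.nan  # about 20% missing values
Y_imputed = mice_continuous(Y_miss, n_iter=5, seed=2)
```

Within an iteration, variables imputed earlier in the cycle enter the later conditional models with their updated values, which is why the last line of Equation (3.1) conditions on the values from iteration t rather than t − 1.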

Because the conditional models in MICE may not correspond to a joint distribution, there are no theoretically supported methods to assess convergence. One possible approach is to examine whether the imputation of each variable has converged, for example by monitoring a summary statistic of the imputed values of each incomplete variable over multiple chains.³⁵ The MICE algorithm is implemented in the mice package³⁶ in R, where researchers specify a univariate imputation model for each incomplete variable. Other available software for implementing the MICE algorithm includes IVEware³⁷ and PROC MI in SAS with the FCS statement. In all of these implementations, FCS model specifications commonly depend on the data format.

3.1 |. FCS using wide format data

For wide format longitudinal data, imputation models treat the repeated measurements of an incomplete variable Y_j as K distinct variables, {Y_jk; k ∈ {1, …, K}}. If Y_j contains missing values at and after time t_k, then the variables {Y_jt; ∀t ≥ k} are incomplete. To impute a missing variable Y_jk, the imputer specifies a linear regression model or a generalized linear model where Y_jk is the dependent variable, and X, the variables {Y_jt; ∀t ≠ k}, and all the other variables at all time points $\{Y_{\tilde{j} t}; \forall \tilde{j} \neq j, t \in \{1, \dots, K\}\}$ are the independent variables. Another method for imputing incomplete continuous and count variables is the predictive mean matching (PMM) procedure.⁴ PMM is a semi-parametric method that imputes data using observed values, making it less sensitive to model mis-specification than purely parametric methods.³⁸ For count variables with a large number of zeros, the zero-inflated Poisson and the zero-inflated Negative-Binomial models can be used.³⁹ These models are implemented in the countimp package.⁴⁰ The application of MICE to wide format data has been referred to as FCS-Standard²¹ or as imputation by chained equations with fixed effects regression models (ICE-FS).¹⁹

As the number of waves in longitudinal studies increases, FCS-Standard can result in numerical instabilities because of the lack of identification that arises from specifying many explanatory variables in the conditional models. Nevalainen et al⁴¹ proposed to impute variables recorded at time t_k using only variables that are recorded within the time window (t_k − δ, t_k + δ). This procedure assumes that a partially recorded variable observed at time t_k is independent of variables recorded at time t_k ± (δ + ϵ) (ϵ > 0) conditional on variables recorded at time t_k ± δ. This reduces the number of covariates used within each conditional model. Additional details of this FCS method are provided in Welch et al.⁴²
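The effect of the time window on the size of each conditional model can be seen by listing the predictors it retains. The sketch below assumes equally spaced waves indexed 1, …, K and treats the window as including waves exactly δ positions away; the function name `window_predictors` is hypothetical.

```python
def window_predictors(j, k, J, K, delta):
    """List the (variable, wave) pairs used to impute Y_jk in wide format.

    Predictors are all variables recorded at waves within delta of wave k,
    excluding Y_jk itself. Waves are assumed equally spaced and indexed 1..K.
    """
    predictors = []
    for t in range(max(1, k - delta), min(K, k + delta) + 1):
        for v in range(1, J + 1):
            if (v, t) != (j, k):
                predictors.append((v, t))
    return predictors


# J = 3 variables, K = 4 waves: predictors for Y_12 with and without a window.
windowed = window_predictors(j=1, k=2, J=3, K=4, delta=1)
unrestricted = window_predictors(j=1, k=2, J=3, K=4, delta=3)
```

With δ = 1 the model for Y₁₂ uses 8 of the 11 other wide-format columns; the savings grow as K increases, since the window size stays fixed while the number of columns grows with K.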

3.2 |. FCS– Multilevel linear model

When data are saved in long format, multilevel linear models have been proposed as conditional imputation models for continuous variables. The first level of the model describes the repeated observations of subjects across time, and it is nested within a second-level, which is subject-level information. Formally, to impute the missing observation of a continuous variable Y_l at time point k for the i-th subject, the following multilevel linear model is used,

y_{ilk} = X_{ik}^{T} β_{l} + Z_{ik}^{T} b_{i} + e_{i}, \quad b_{i} \sim N_{q} (0, V_{b}), \quad e_{i} \sim N (0, σ_{e}^{2}),

(3.2)

where X_ik and Z_ik are a p × 1 vector of covariates and a q × 1 vector of subject-level covariates, respectively, and V_b is an unstructured covariance matrix of the subject-level effects. Both X and Z may comprise complete and other incomplete variables. Parameters β_l are regression coefficients corresponding to covariates X, and b_i correspond to subject-level variations in covariates Z. Model (3.2) assumes a common conditional variance $σ_{e}^{2}$ across all subjects. Applying MICE with Model (3.2) involves sampling $θ_{l}^{*} = (β_{l}^{*}, V_{b}^{*}, σ_{e}^{2 *})$, subject-level effects $b_{i}^{*}$, and imputation values $y_{ilk}^{*}$. Commonly, conjugate prior distributions are assumed for θ_l. With continuous incomplete variables, a multivariate Gaussian prior distribution for β_l, an Inverse-Wishart distribution for V_b, and an Inverse-Gamma distribution for $σ_{e}^{2}$ are assumed. Samples of b_i can be drawn from its conditional posterior distribution N(μ_bi, Ψ_bi), with $μ_{b i} = V_{b} Z_{i}^{T} {(Z_{i} V_{b} Z_{i}^{T} + σ_{e}^{2} I_{K})}^{- 1} (y_{i l} - X_{i}^{T} β_{l})$ and $Ψ_{b i} = V_{b} - V_{b} Z_{i}^{T} {(Z_{i} V_{b} Z_{i}^{T} + σ_{e}^{2} I_{K})}^{- 1} Z_{i} V_{b}$, where y_il = (y_il1, ⋯, y_ilK)^T is the vector of responses of subject i measured at all K time points and I_K denotes a K × K identity matrix. Imputed values $y_{ilk}^{*}$ are sampled from Model (3.2) given $θ_{l}^{*}$ and $b_{i}^{*}$. The sampling procedure relies on an MCMC algorithm, which can be computationally burdensome because many samples are required for the chain to converge to its equilibrium distribution.⁴³ A possible approximation procedure samples $θ_{l}^{*}$ from its large-sample Normal approximation and $b_{i}^{*}$ from $N (μ_{b_{i}}, Ψ_{b_{i}})$.¹⁶,²⁵
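The conditional posterior of the subject-level effects can be computed directly from the expressions for μ_bi and Ψ_bi. In the Python sketch below, the function name is hypothetical and the covariates are stored with one row per time point, so X_i is K × p, Z_i is K × q, and the fitted mean is written `X_i @ beta`; this storage convention is an assumption of the sketch.

```python
import numpy as np


def random_effect_posterior(y_i, X_i, Z_i, beta, V_b, sigma2_e):
    """Mean and covariance of b_i given y_i and theta under Model (3.2).

    y_i: (K,) responses of one subject; X_i: (K, p); Z_i: (K, q);
    beta: (p,); V_b: (q, q); sigma2_e: residual variance.
    """
    K = len(y_i)
    marginal_cov = Z_i @ V_b @ Z_i.T + sigma2_e * np.eye(K)
    gain = V_b @ Z_i.T @ np.linalg.inv(marginal_cov)
    mu = gain @ (y_i - X_i @ beta)
    Psi = V_b - gain @ Z_i @ V_b
    return mu, Psi


# Random-intercept example with K = 4, V_b = 2 and sigma_e^2 = 1.
y_i = np.array([1.0, 2.0, 3.0, 4.0])
X_i = np.ones((4, 1))
Z_i = np.ones((4, 1))
mu, Psi = random_effect_posterior(
    y_i, X_i, Z_i, beta=np.array([2.0]), V_b=np.array([[2.0]]), sigma2_e=1.0
)
rng = np.random.default_rng(0)
b_star = rng.multivariate_normal(mu, Psi)  # one draw of b_i*
```

For a random intercept the result reduces to the familiar shrinkage form: the posterior mean is V_b / (σ²_e + K V_b) times the sum of the residuals, here (2/9) × 2 = 4/9, and the posterior variance is V_b σ²_e / (σ²_e + K V_b) = 2/9.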

To impute discrete variables, one approach is to sample from multilevel linear models and round the imputed continuous values to the nearest valid discrete values. However, this rounding step can result in biased estimates.⁴⁴ To address this, Yucel et al⁴⁵ have proposed a calibration method that is similar to posterior predictive checks in Bayesian analysis⁴⁶ to improve imputed rounded values.

3.3 |. FCS– Multilevel linear model with latent variables

Another approach for imputing binary and categorical variables is to sample from a multilevel linear model with latent variables.^47,48 For a binary variable Y_l, this model assumes that there is a latent Normal variable ${\tilde{Y}}_{l}$ , such that y_ilk = 1 if ${\tilde{y}}_{i l k} > 0$ , and y_ilk = 0 otherwise. The latent variable is assumed to follow a multilevel linear model. This representation is equivalent to assuming that y_ilk follows a multilevel probit model. Formally,

Φ^{- 1} (P (y_{ilk} = 1 \mid X_{ik}, b_{i}, θ_{l}, τ)) = {\tilde{y}}_{ilk} = X_{ik}^{T} β_{l} + Z_{ik}^{T} b_{i} + e_{i}, \quad b_{i} \sim N_{q} (0, V_{b}), \quad e_{i} \sim N (0, 1),

(3.3)

where Φ⁻¹ is the inverse cumulative distribution function (CDF) of the standard Normal distribution. The sampling procedure for drawing imputations from Model (3.3) is the same as the procedure for Model (3.2), which can be implemented by the function mice.impute.2l.jomo in the micemd R package.
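The latent-variable step for a binary variable can be sketched as follows: for an observed value, the latent variable is drawn from a Normal distribution truncated to the side of zero implied by that value, and for a missing value it is drawn without truncation and then thresholded. The code is an illustration under the assumption that the linear predictor `eta` (the fixed plus subject-level part of Model (3.3)) has already been sampled; the function names are hypothetical, and the inverse-CDF method used here is adequate only for moderate values of |eta|.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def draw_latent(y, eta, rng):
    """Draw a latent value from N(eta, 1), truncated according to observed y.

    y == 1 restricts the latent value to (0, inf); y == 0 to (-inf, 0].
    """
    p0 = STD_NORMAL.cdf(-eta)  # probability that the latent value is <= 0
    if y == 1:
        u = rng.uniform(p0, 1.0)
    else:
        u = rng.uniform(0.0, p0)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep inv_cdf away from 0 and 1
    return eta + STD_NORMAL.inv_cdf(u)


def impute_binary(eta, rng):
    """Impute a missing binary value by thresholding a fresh latent draw."""
    return 1 if eta + rng.gauss(0.0, 1.0) > 0 else 0


rng = random.Random(0)
latent_for_one = draw_latent(1, -0.5, rng)
latent_for_zero = draw_latent(0, 0.8, rng)
imputed_value = impute_binary(0.3, rng)
```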

For ordinal categorical variables Y_l with H > 2 levels, the latent variable imputation model is based on a cumulative probit model. The model assumes that Y_l is determined by a latent Normal variable ${\tilde{Y}}_{l}$ partitioned by H − 1 threshold parameters τ = {τ_h}, h ∈ {1, …, H − 1}, such that y_il = h if $τ_{h - 1} < {\tilde{y}}_{i l} < τ_{h}$, with $τ_{0} = - \infty$ and $τ_{H} = \infty$. In addition, the model assumes that the latent variable ${\tilde{Y}}_{l}$ follows a multilevel linear model as in Model (3.3). Formally, a cumulative probit model of Y_l is defined as

Φ^{- 1} (P (y_{ilk} \leq h \mid X_{ik}, b_{i}, θ_{l})) = Φ^{- 1} (P ({\tilde{y}}_{ilk} < τ_{h} \mid X_{ik}, b_{i}, θ_{l})) = τ_{h} - (X_{ik}^{T} β_{l} + Z_{ik}^{T} b_{i}) .

(3.4)

The sampling procedure at iteration t starts with updating $τ^{(t)} \sim N (τ \mid Y_{l}, {\tilde{Y}}^{(t - 1)}, θ_{l}^{(t - 1)}, b_{i}^{(t - 1)})$, followed by sampling ${\tilde{Y}}_{l} \sim TN ({\tilde{Y}}_{l} \mid Y, τ^{(t)}, θ_{l}^{(t - 1)}, b_{i}^{(t - 1)})$ (a truncated Normal distribution) for all subjects, where τ^(t) are the truncation parameters at iteration t, and $θ_{l}^{(t - 1)} = {(β_{l}^{(t - 1)}, V_{b}^{(t - 1)})}^{T}$ and $b_{i}^{(t - 1)}$ are the samples of θ_l = (β_l, V_b)^T and b_i at iteration t − 1, respectively. Samples of the parameters $θ_{l}^{(t)}$ are drawn from their conditional posterior distribution $P (θ_{l} \mid Y_{l}, {\tilde{Y}}_{l}^{(t)}, τ^{(t)}, b_{i}^{(t - 1)})$, and $b_{i}^{(t)}$ are drawn from $P (b_{i} \mid Y_{l}, {\tilde{Y}}_{l}^{(t)}, τ^{(t)}, θ_{l}^{(t)})$. The conjugate prior distributions that are used for Model (3.3) are commonly specified for θ_l and b_i. Additional technical details are provided in Enders et al.⁴⁸
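The role of the thresholds can be made concrete with a short Python sketch. Given the H − 1 finite thresholds, one helper maps a latent value to its category and another returns the truncation interval that the latent variable must respect for an observed category. Both function names are hypothetical, and the thresholds are assumed to be sorted in increasing order.

```python
from bisect import bisect_left
from math import inf


def latent_to_ordinal(latent, thresholds):
    """Return the category h in {1, ..., H} implied by a latent value.

    thresholds holds tau_1 < ... < tau_{H-1}; category h is returned when
    tau_{h-1} < latent < tau_h, with tau_0 = -inf and tau_H = +inf.
    """
    return bisect_left(thresholds, latent) + 1


def truncation_bounds(h, thresholds):
    """Return (tau_{h-1}, tau_h), the interval for the latent value when y = h."""
    lower = -inf if h == 1 else thresholds[h - 2]
    upper = inf if h == len(thresholds) + 1 else thresholds[h - 1]
    return lower, upper


thresholds = [0.0, 1.5]  # H = 3 categories
category = latent_to_ordinal(0.7, thresholds)
bounds = truncation_bounds(2, thresholds)
```

A missing ordinal value is imputed by drawing the latent variable without truncation and applying `latent_to_ordinal`, whereas an observed category constrains the latent draw to the interval returned by `truncation_bounds`.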

The multilevel multinomial probit model⁴⁹ can be used to impute a nominal categorical variable Y_l with H_l categories. This model expands Y_l into H_l binary variables $Y_{l}^{h}$, h ∈ {1, …, H_l}, that indicate whether y_il = h for subject i. An underlying latent Normal variable ${\tilde{Y}}_{l}^{h}$ corresponding to $Y_{l}^{h}$ is defined by the probability of $y_{i l}^{h} = 1$. If ${\tilde{y}}_{i l}^{h}$ is greater than ${\tilde{y}}_{i l}^{h *}$ for all h* ≠ h, then $y_{i l}^{h} = 1$ and y_il = h. For identifiability purposes, a multivariate linear multilevel model is assumed for the first H_l − 1 latent variables, such that b_i ~ MVN(0, V_b) and the within-subject variance Σ_e is the identity matrix. The sampling procedure is similar to Model (3.3), except that ${\tilde{Y}}_{l}^{h}$ are generated by an accept-reject algorithm.⁴⁷ These models for categorical variables using latent Normal variables are implemented in the software Blimp.⁵⁰ We refer to the multilevel linear models with latent variables as FCS-LMM-latent.

3.4 |. FCS– Multilevel generalized linear model

Multilevel generalized linear models are a flexible approach to model skewed or non-normally distributed variables. These models are also commonly referred to as generalized linear mixed models (GLMM). Assuming that an incomplete variable Y_l conditional on item-level covariates X and Z has an exponential family probability density function or probability mass function, a GLMM is defined as

g (E (y_{ilk} \mid X_{ik}, Z_{ik}, b_{i}, θ_{l})) = X_{ik}^{T} β_{l} + Z_{ik}^{T} b_{i}, \quad b_{i} \sim N_{q} (0, V_{b}),

(3.5)

where g(·) is a function linking the expected value of the response y_ilk to the linear predictor. Sampling from the posterior distribution P(θ_l|X,Z,Y_l) can be implemented using MCMC. The latent multilevel variable model in Section 3.3 can be viewed as Model (3.5) with a probit link function. The probit model can perform well for binary and categorical variables, but it may not be suitable for skewed or count variables. Other commonly used link functions are the logit link for binary variables and the log link for count variables. Compared to the probit link, sampling from P(b_i|y_i,x_i,θ_l) with the logit or log link functions can be more complex. A possible approximation can be obtained by sampling b_i from its marginal posterior distribution N(0, Ψ_b), where Ψ_b is estimated by REML²³ or the Fisher scoring method.¹⁶ GLMM-based imputation of incomplete binary and count variables is implemented in the micemd package in R. Software that implements multilevel generalized linear models using log-Normal or Gamma likelihoods for incomplete skewed continuous data is limited. Throughout, we refer to this method as FCS-GLMM.

4 |. IMPUTATION BY JOINT MODELING

4.1 |. JM– General location model

The JM approach specifies a multivariate distribution for all incomplete variables. A multivariate Normal distribution is often used when the data are arranged in wide format and consist of only continuous variables.^14,17 For a mixture of continuous and discrete variables, the general location model is proposed as a possible imputation model.¹¹ This model describes the joint distribution of Y = (W,C) in terms of a marginal distribution for all discrete variables, $W = (W_{1}, \dots, W_{S_{1}})$ , and a conditional distribution of all continuous variables $C = (C_{1}, \dots, C_{S_{2}})$ given the discrete variables, where S₁ + S₂ = L(= JK). The general location model is defined as

P (W, C) = P (W) P (C ∣ W) = Multinomial (N, π_{d}) \cdot Normal (μ_{d}, Σ) .

(4.1)

The marginal distribution P(W) of the S₁ discrete variables is modeled by a multinomial distribution on the cell counts of an S₁-dimensional contingency table with $D = \prod_{l = 1}^{S_{1}} H_{l}$ cells and cell probabilities π_d, d ∈ {1, …, D}, where H_l is the number of distinct levels of variable W_l. Within each cell of the contingency table, the continuous variables C follow a multivariate Normal distribution with cell-specific mean μ_d and a covariance matrix Σ that is common to all cells, as in Equation (4.1). In finite samples, as the number of categorical variables increases, some cells may be empty. This may lead to unstable estimation.⁵¹ In these situations, the restricted general location model can be applied. The restricted model assumes that the contingency table cell counts follow a log-linear model, which is fitted using a subset of W_l, l ∈ {1, …, S₁}, and possibly their interactions. Continuous variables are modeled by a multivariate linear regression model with the categorical variables as the independent variables. Another possible limitation of both the general location model and the restricted general location model is their reliance on multivariate Normal distributions for continuous Y_l. This may result in inaccurate imputation when Y_l is skewed or multi-modal. The general location model and the restricted general location model are implemented in the mix package⁵² in R. Throughout, we refer to the general location model as JM-GL.
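How quickly the contingency table grows, and how easily cells become empty, can be checked with a few lines of Python. The function names and the toy data below are illustrative assumptions.

```python
from collections import Counter
from math import prod


def contingency_cells(levels):
    """Number of cells D in the contingency table: the product of the H_l."""
    return prod(levels)


def empty_cell_count(rows, levels):
    """Number of cells with no observations.

    rows: one tuple of category codes per subject, one code per discrete variable.
    levels: the number of levels H_l of each discrete variable.
    """
    occupied = Counter(rows)
    return contingency_cells(levels) - len(occupied)


# A binary variable recorded at K = 4 waves in wide format gives 2^4 cells.
cells_binary_four_waves = contingency_cells([2, 2, 2, 2])
# Three subjects and two binary variables: only two of four cells are occupied.
empty_cells = empty_cell_count([(0, 0), (1, 1), (0, 0)], [2, 2])
```

In wide format, each additional wave of a discrete variable multiplies D by its number of levels, which is why sparse cells become likely as K grows.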

4.2 |. JM– Multivariate multilevel linear model

When Y comprises only continuous variables and is in long format, a possible joint model for imputing Y is the multivariate linear multilevel model. Let y_ik = (y_i1k, ⋯, y_iLk)^T be a column vector of L continuous responses of the i-th subject measured at time point k. The multivariate multilevel linear model (MLMM) is

y_{ik} = (I_{L} \otimes X_{ik}^{T}) β + (I_{L} \otimes Z_{ik}^{T}) b_{i} + e_{i}, \quad b_{i} \sim N (0, V_{b}), \quad e_{i} \sim N (0, Σ_{e}),

(4.2)

where the column vector β has pL elements and the column vector b_i has qL elements. The symbol ⊗ is the Kronecker product. Schafer and Yucel¹⁷ proposed an MCMC procedure to sample b_i, θ, and Y^mis jointly. It assumes that Σ_e ~ Inv-Wishart(υ₁, Λ₁), V_b ~ Inv-Wishart(υ₂, Λ₂), and an improper uniform density over $ℝ^{pL}$ for β. This model is implemented in the R package pan.⁵³
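The Kronecker products in Model (4.2) simply apply the same covariate vectors to each of the L responses, with β and b_i stacked response by response. A small NumPy sketch makes the bookkeeping explicit; the function name `mlmm_mean` and the numerical values are assumptions for illustration.

```python
import numpy as np


def mlmm_mean(x_ik, z_ik, beta, b_i, L):
    """Conditional mean of the L responses y_ik under Model (4.2).

    x_ik: (p,) covariates; z_ik: (q,) subject-level covariates;
    beta: (p*L,) stacked as (beta_1, ..., beta_L); b_i: (q*L,) stacked likewise.
    """
    I_L = np.eye(L)
    fixed_design = np.kron(I_L, x_ik[None, :])   # shape (L, p*L), block diagonal
    random_design = np.kron(I_L, z_ik[None, :])  # shape (L, q*L), block diagonal
    return fixed_design @ beta + random_design @ b_i


# L = 2 responses, p = 2 covariates (intercept and one predictor), q = 1.
x_ik = np.array([1.0, 2.0])
z_ik = np.array([1.0])
beta = np.array([1.0, 1.0, 2.0, 2.0])  # (beta_1, beta_2)
b_i = np.array([0.5, -0.5])            # one random intercept per response
mean_ik = mlmm_mean(x_ik, z_ik, beta, b_i, L=2)
```

Here the first response has mean 1 + 2 + 0.5 = 3.5 and the second has mean 2 + 4 − 0.5 = 5.5; correlation between the responses enters through V_b and Σ_e rather than through the mean.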

4.3 |. JM– Multivariate multilevel linear model with latent variables

For imputations of both continuous and categorical variables, a multivariate multilevel linear model (MLMM) can be used to model latent Normal variables that correspond to each of the categorical variables together with other continuous variables. Formally, a MLMM assumes that a set of incomplete continuous variables, $y_{i k}^{(c)} = {(y_{i 1 k}^{(c)}, \dots, y_{i C k}^{(c)})}^{T}$ , and a set of latent variables ${\tilde{y}}_{i k}^{(w)} = {({\tilde{y}}_{i 1 k}^{(w)}, \dots, {\tilde{y}}_{i W k}^{(w)})}^{T}$ of categorical variables $y_{i k}^{(w)} = {(y_{i 1 k}^{(w)}, \dots, y_{i W k}^{(w)})}^{T}$ are distributed as

\begin{array}{l}
y_{ik}^{(c)} = (I_{C} \otimes X_{ik}^{T}) β_{c} + (I_{C} \otimes Z_{ik}^{T}) b_{ci} + e_{ci}, \\
{\tilde{y}}_{ik}^{(w)} = Φ^{- 1} (P (y_{ik}^{(w)} = 1)) = (I_{W} \otimes X_{ik}^{T}) β_{w} + (I_{W} \otimes Z_{ik}^{T}) b_{wi} + e_{wi}, \\
b_{i} = {(b_{ci}, b_{wi})}^{T} \sim N (0, V_{b}), \quad e_{i} = {(e_{ci}, e_{wi})}^{T} \sim N (0, Σ_{e}), \\
\text{where } Σ_{e} = \begin{pmatrix} σ_{e}^{2} I_{C} & cov (e_{c}, e_{w}) \\ cov (e_{c}, e_{w}) & I_{W} \end{pmatrix} .
\end{array}

(4.3)

Under this model, a vector stacking the latent variables ${\tilde{y}}^{(w)}$ and the continuous variables y^(c) follows a multivariate Normal distribution, where V_b is an unstructured covariance matrix of the subject-level effects, and the covariance matrix Σ_e captures the associations between the two sets of variables. The imputation algorithm is similar to sampling from Model (4.2). However, Σ_e cannot be sampled from the Inverse-Wishart distribution. Instead, the elements of Σ_e should be updated individually using a Metropolis-Hastings procedure. Detailed descriptions of the imputation algorithm are provided in Carpenter and Kenward,⁵⁴ chapters 4–5. We refer to this model as JM-MLMM-latent; it is implemented in the jomo package in R⁵⁵ and the REALCOM program in MATLAB.⁵⁶

4.4 |. JM– Multivariate generalized multilevel linear model

Extending the FCS-GLMM to its JM version is another approach for handling mixed-type incomplete variables. Let y_i = {y_i1, ⋯, y_iL} be a K × L response matrix of subject i, which consists of different types of variables. We assume a multivariate generalized linear mixed model (MGLMM) for P(Y₁, …, Y_L|θ) defined as

p (y_{i1}, \dots, y_{iL} \mid θ) = \int \prod_{l = 1}^{L} p_{l} (y_{il} \mid b_{il}, θ_{l}) \, p_{b} (b_{i} \mid V_{b}) \, d b_{i},

(4.4)

where p_l(·) are density functions and b_i = (b_i1, …, b_iL)^T is a vector of subject-level effects that follows a multivariate Normal distribution with mean zero and covariance matrix V_b. Model (4.4) links a set of univariate generalized linear multilevel models by introducing correlations between the subject-level effects. The model assumes that the Y_l are independent conditional on b_i and X. We refer to this model as JM-MGLMM. For example, a shared random intercepts model with two outcomes (L = 2) in Equation (4.4) is

E (y_{i 1 k} ∣ b_{i 0}, X_{i k}, β_{01}) = g_{1}^{- 1} (β_{01} + b_{i 10} + β_{1} X_{i k}), E (y_{i 2 k} ∣ b_{i 0}, X_{i k}, β_{02}) = g_{2}^{- 1} (β_{02} + b_{i 20} + β_{2} X_{i k}), b_{i 0} = (b_{i 10}, b_{i 20}) ~ N_{2} (0, V_{b}), V_{b} = (\begin{matrix} σ_{1}^{2} & ρ σ_{1} σ_{2} \\ ρ σ_{1} σ_{2} & σ_{2}^{2} \end{matrix}),

(4.5)

where g₁ and g₂ are link functions for outcomes Y₁ and Y₂, respectively, and the latent correlation of outcomes at the subject-level is identified by ρ. A possible extension of Model (4.5) involves adding random slopes in E(y_i1k|b_i0) and E(y_i2k|b_i0). Including all of the covariates as subject-level random slopes in b_i = (b_i10, …, b_i1p, …, b_iL0, …, b_iLp) would increase the number of elements of V_b to (L × (1 + p))². To complete the Bayesian model, diffuse prior distributions can be used. Specifically, β_p ~ N(0, 100), V_b ~ Inverse-Wishart(I_q, q) where q is the cardinality of b_i. Sampling from posterior distributions of the parameters can be implemented using the JAGS software⁵⁷, which requires the users to specify both the prior distributions and the likelihood functions. We have provided a code example on the GitHub website.
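To make the structure of Model (4.5) concrete, the following sketch simulates data from a model with two correlated random intercepts, a logit link for a binary outcome, and a log link for a count outcome. It is a data-generating illustration only, not the JAGS sampler referred to above; the covariate, the coefficient values, and the function name are hypothetical.

```python
import numpy as np


def simulate_shared_intercepts(n_subjects, n_rounds, sigma1, sigma2, rho, seed=0):
    """Simulate a binary and a count outcome linked by correlated random intercepts."""
    rng = np.random.default_rng(seed)
    # Covariance matrix V_b of the two subject-level intercepts.
    v_b = np.array([[sigma1**2, rho * sigma1 * sigma2],
                    [rho * sigma1 * sigma2, sigma2**2]])
    b = rng.multivariate_normal(np.zeros(2), v_b, size=n_subjects)
    x = rng.normal(size=(n_subjects, n_rounds))  # hypothetical covariate X_ik
    eta1 = -1.0 + b[:, [0]] + 0.5 * x            # linear predictor, outcome 1
    eta2 = 0.2 + b[:, [1]] + 0.3 * x             # linear predictor, outcome 2
    y1 = rng.binomial(1, 1.0 / (1.0 + np.exp(-eta1)))  # inverse logit link
    y2 = rng.poisson(np.exp(eta2))                     # inverse log link
    return b, y1, y2


b, y1, y2 = simulate_shared_intercepts(5000, 4, sigma1=1.0, sigma2=0.5, rho=0.5)
```

Conditional on the intercepts and the covariate, the two outcomes are drawn independently, so any dependence between them at the subject level enters through ρ alone, as the model assumes.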

5 |. SIMULATIONS

The National Health and Aging Trends Study (NHATS) collects information on a nationally representative sample of Medicare beneficiaries ages 65 and older. Beginning in 2011, annual interviews are conducted and detailed information on a broad range of variables related to sociodemographic factors, physical and cognitive capacity, and health outcomes is collected. We use the NHATS data collected from 2011 to 2014 (rounds 1–4) and select a set of variables including four incomplete longitudinal variables of varying types: an indicator of whether a person had an overnight hospital stay, a person’s body mass index (BMI), comorbidity index, and the count of devices paid to assist with daily activities during the past year (paid assistive devices). The comorbidity index is defined as a count of 10 chronic conditions, including heart disease (e.g., angina or congestive heart failure), hypertension, arthritis, osteoporosis, diabetes, lung disease, Alzheimer’s disease and related dementia, cancer, and whether they experienced a heart attack or a stroke in the past. The number of paid assistive devices (ranging from 0 to 9) includes the following aids: vision aids, hearing aid, cane, walker, wheelchair, scooters, grabbers, special dress items, and adapted utensils. Almost 98% of the missing values resulted from loss to follow-up. Patients’ information collected at the first interview is used as covariates for imputation and analysis, including age, gender, self-rated health, an indicator for whether participants take prescribed medicines, and an indicator for whether participants or their spouse/partner have any medical bills that are being paid off over time. Our goal is to estimate the associations between the comorbidity index, BMI, the paid assistive devices, and the hospital stay status after adjustment for baseline variables.

We design a simulation study to evaluate the operating characteristics of the different imputation methods described in Sections 3 and 4. Our simulations are based on the 5309 participants who were observed for all 4 rounds, and we simulate different missing data patterns.

5.1 |. Missing data mechanism and missing data patterns

In the simulation, we assume that the missing data mechanism is MAR for the monotone missing data pattern. In longitudinal studies with many covariates, it is reasonable to believe that P(m_ijk|x_i, Y_i1,…, Y_iL) would depend on at least one missing Y_iL value for some i, where Y is recorded in a wide format. Thus, for intermittent missing data, we assume an NMAR missing data mechanism. We simulate both missing data patterns on the wide-formatted NHATS data composed of only the completed cases. Missing data indicators are sampled from the Bernoulli distribution with the event probabilities predicted by models estimated from the original NHATS data.

To simulate an intermittent missing data pattern, we generate the missing data indicators m_ijk for each of the incomplete variables independently, where m_ijk = 1 indicates that Y_ijk is missing. For subject i at round k ∈ {2, 3, 4}, the probability of Y_ijk being missing is

P (m_{i j k} = 1 ∣ Y_{i}, X_{i}) = {logit}^{- 1} (α_{0 j k}^{*} + \sum_{j = 1}^{J} \sum_{k = 1}^{k - 1} {\hat{α}}_{j k}^{*} y_{i j k} + \sum_{p = 1}^{P} {\hat{ψ}}_{j p} x_{i p}),

(5.1)

where the covariates include all of the time-varying variables prior to round k and time-invariant variables. We set $α_{0 j k}^{*}$ to ensure that the average missing proportions of Y_jk at round 2, 3, and 4 are 20%, 35%, and 40%, respectively. The coefficients ${\hat{α}}_{j k}^{*}$ and ${\hat{ψ}}_{j p}$ are the maximum likelihood estimates of Model (5.1) using the original NHATS data.
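One way to choose an intercept that yields a target average missing proportion is a one-dimensional root search. The sketch below is an assumption about how this calibration could be done, not the authors' code. It uses bisection on the intercept of a logistic model and then draws Bernoulli missing data indicators. The linear predictor is a hypothetical stand-in for the fitted terms in Model (5.1), and for simplicity it is the same at every round.

```python
import numpy as np


def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-z))


def calibrate_intercept(linear_part, target, lo=-20.0, hi=20.0, tol=1e-10):
    """Bisection for alpha0 such that mean(expit(alpha0 + linear_part)) = target."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if expit(mid + linear_part).mean() < target:
            lo = mid  # average probability too small: raise the intercept
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = np.random.default_rng(1)
n = 5309
# Hypothetical linear predictor standing in for the fitted covariate terms.
linear_part = 0.4 * rng.normal(size=n) - 0.2 * rng.binomial(1, 0.5, size=n)

targets = {2: 0.20, 3: 0.35, 4: 0.40}  # target missing proportion by round
mean_prob, missing = {}, {}
for k, target in targets.items():
    alpha0 = calibrate_intercept(linear_part, target)
    prob = expit(alpha0 + linear_part)
    mean_prob[k] = prob.mean()
    missing[k] = rng.binomial(1, prob)  # m_ijk = 1 means the value is missing
```

Bisection works here because the average missing probability is strictly increasing in the intercept, so the target proportion is attained at a unique value.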

To simulate a monotone missing data pattern, we generate the drop-out indicators for each subject. Let r_ik represent a drop-out indicator for subject i at round k ∈ {2, 3, 4}. If subject i drops out at round k*, then r_ik* = 1 and values for subject i are not observed at round k* and in subsequent rounds for all variables (m_ijk = 1, ∀k ≥ k* and ∀j ∈ {1, …, J}). The probability that subject i is lost to follow-up at round k is

P (r_{i k} = 1 ∣ Y_{i}^{obs}, X_{i}) = {logit}^{- 1} ({\tilde{α}}_{0 k} + \sum_{j = 1}^{J} \sum_{k = 1}^{k - 1} {\hat{α}}_{j k} y_{i j k} + \sum_{p = 1}^{P} {\hat{ψ}}_{p} x_{i p}),

(5.2)

where the covariates include all of the time-varying variables that are fully observed prior to round k and time-invariant variables. The intercept ${\tilde{α}}_{0 k}$ is set to ensure a pre-specified proportion of participants who drop out. Based on the original NHATS data, we set the proportion of participants who start to drop out from the study at rounds 2, 3, and 4 to 20%, 15% and 10%, respectively. Cumulatively, the proportion of individuals with missing information at round 4 is approximately 45%. The coefficients ${\hat{α}}_{j k}$ and ${\hat{ψ}}_{p}$ are the maximum likelihood estimates of Model (5.2) using the drop-out indicators observed in NHATS.
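The monotone pattern and the cumulative 45% figure can be checked with a short simulation. The sketch below is a simplified, intercept-only version of the drop-out process: it ignores the covariate terms of Model (5.2) and assumes that the 20%, 15%, and 10% are proportions of the full sample, so the drop-out probability among those still in the study is 0.20 at round 2, 0.15/0.80 at round 3, and 0.10/0.65 at round 4. The function and variable names are hypothetical.

```python
import numpy as np


def simulate_dropout(n, start_props=(0.20, 0.15, 0.10), seed=0):
    """Return an n x 4 matrix of missing indicators with a monotone pattern."""
    rng = np.random.default_rng(seed)
    n_rounds = len(start_props) + 1
    pattern = np.zeros((n, n_rounds), dtype=int)  # column 0 is round 1 (observed)
    at_risk = np.ones(n, dtype=bool)              # subjects still in the study
    remaining = 1.0                               # expected share still in the study
    for idx, prop in enumerate(start_props, start=1):
        hazard = prop / remaining                 # conditional drop-out probability
        drops = at_risk & (rng.random(n) < hazard)
        pattern[drops, idx:] = 1                  # missing now and in all later rounds
        at_risk &= ~drops
        remaining -= prop
    return pattern


pattern = simulate_dropout(20000)
```

Each row of pattern is non-decreasing across rounds, which is the defining property of a monotone missing data pattern, and the expected share of missing rows in the last column is 0.20 + 0.15 + 0.10 = 0.45.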

5.2 |. Study design

We consider three configurations with different numbers and types of incomplete variables. The variables include a binary hospital stay status (Y₁), a discrete bounded comorbidity index (Y₂), a continuous BMI (Y₃), and a discrete count of paid assistive devices (Y₄). Configuration 1 assumes that Y₁ and Y₂ are incomplete and the other two variables are fully observed. Configuration 2 assumes that in addition to Y₁ and Y₂, Y₃ is incomplete. Lastly, Configuration 3 assumes that all four variables are incomplete. Because a monotone missing data pattern is observed for most of the individuals in NHATS, we evaluate the performance of the different imputation methods using 500 simulated incomplete datasets with monotone missing data patterns for each of the three configurations. We also compare the different methods on 300 incomplete simulated datasets with intermittent missing data patterns for Configuration 3.

For every simulated dataset, we conduct a multiple imputation procedure with B = 5 imputations using 10 imputation methods with different choices of imputation models described in Sections 3 and 4. For the FCS-Standard method, we consider three imputation models to impute the count variables: linear regressions, predictive mean matching, and Poisson regressions. When applying FCS-LMM-latent, we use latent variable models for binary variables and linear models for continuous and count variables. For the FCS-GLMM method, we specify a logit link for binary variables, an identity link for continuous variables, and either an identity or a log link for count variables. When applying JM-MLMM-latent, we consider both homoscedastic and heteroscedastic within-subject variance. For JM-MGLMM, we assume a logit link for binary variables, an identity link for continuous variables, and a log link for count variables. A complete list of the different methods is provided in Table 3.

TABLE 3.

Summary of imputation methods for mixed-type longitudinal data

Approach	Data format	Method	Imputation models
FCS			Imputation model for binary variables	Imputation model for count variables
	Wide	FCS-Standard (LM)	Logistic regression	Linear regression
		FCS-Standard (PMM)	Logistic regression	Predictive mean matching
		FCS-Standard (Poisson)	Logistic regression	Poisson regression
	Long	FCS-LMM-latent	Multilevel linear regression on latent variables	Multilevel linear regression
		FCS-GLMM (Gaussian)	Multilevel logistic regression	Multilevel linear regression
		FCS-GLMM (Poisson)	Multilevel logistic regression	Multilevel Poisson regression
JM	Wide	JM-GL	General location model
	Long	JM-MLMM-latent (common)	Multivariate multilevel linear model with latent variables and homoscedastic within-subject variance
		JM-MLMM-latent (random)	Multivariate multilevel linear model with latent variables and heteroscedastic within-subject variance
		JM-MGLMM	Multivariate multilevel generalized linear model using a logit and log link for binary, count variables


Imputing the missing data is usually performed as part of the data preparation process, and the ultimate goal is to generate unbiased estimates of the associations and conditional associations between variables. Our simulations mimic situations in which no specific statistical analysis is specified prior to the imputation, and the imputed datasets are used for multiple analyses. We conduct three types of analyses: univariate generalized hierarchical model, latent growth model, and bi-variate generalized hierarchical model. An overview of the simulation is presented in Table 4.

TABLE 4.

Summary of simulation design

Configuration	Incomplete variables	Missing Pattern	Statistical Analysis
1	Hospital stay (Binary); Comorbidity index (Count)	Monotone	Univariate GLMM
2	Hospital stay (Binary); Comorbidity index (Count); BMI (Continuous)	Monotone	Univariate GLMM
3	Hospital stay (Binary); Comorbidity index (Count); BMI (Continuous); Number of devices paid for caring (Count)	Monotone; Intermittent	Univariate GLMM; Growth Curve Model; Bivariate GLMM


5.2.1 |. Univariate generalized linear hierarchical model

Within each imputed dataset, a multilevel logistic regression model is used to model the conditional associations between the comorbidity index and the hospital stay status,

logit (P (y_{i 1 k} = 1 ∣ γ, b_{0 i}, X_{i}, t_{i k}, y_{i 2 k}, y_{i 3 k}, y_{i 4 k})) = γ_{0} + b_{0 i} + γ_{1} y_{i 2 k} + γ_{2} y_{i 3 k} + γ_{3} y_{i 4 k} + γ_{4} t_{i k} + X_{i}^{T} γ_{p},

(5.3)

where X_i comprises five baseline covariates for subject i, t_ik is the round at which individual i is interviewed, $b_{0 i} ~ N (0, σ_{b}^{2})$ denotes the subject-level effects, and γ_l, l ∈ {0, …, 8}, denotes a set of unknown coefficients. Let ${\hat{γ}}_{1}^{(u)}, {\hat{γ}}_{2}^{(u)}, {\hat{γ}}_{3}^{(u)}$ be the estimates of Model (5.3) within imputed dataset u ∈ {1, …, 5}, and let ${\hat{σ}}_{γ_{1}}^{2}, {\hat{σ}}_{γ_{2}}^{2}, {\hat{σ}}_{γ_{3}}^{2}$ be their corresponding sampling variances. The final estimates are obtained using the common combination rules described in Section 2.

5.2.2 |. Latent growth curve model

Researchers may be interested in the trajectory of individuals over time. For the imputed datasets in Configuration 3, we fit a latent linear growth curve model (LGCM) on the trajectory of BMI over time. The model includes all the time-invariant variables and the time-varying comorbidity index Y₂ as the predictors. The LGCM requires the data to be structured in wide format, and it can be broken down into two latent constructs, the intercept factor and the slope factor.⁵⁸ Let t_i ∈ {0, 1, 2, 3} denote the round that subject i is observed. The adjusted LGCM is expressed by a multilevel model that consists of an intercept model, π_0i and a slope model, π_1i,

y_{i 3 t} = π_{0 i} + π_{1 i} t_{i} + ϵ_{i t}, π_{0 i} = η_{00} + η_{01} y_{i 2 t} + X_{i}^{T} η_{0} + e_{0 i}, π_{1 i} = η_{10} + η_{11} y_{i 2 t} + X_{i}^{T} η_{1} + e_{1 i},

(5.4)

where the residuals of the intercept and slope models are (e_0i,e_1i)^T ~ N(0₂,Σ), and Σ is an unknown unstructured covariance matrix.

5.2.3 |. Bivariate generalized linear hierarchical model

In some studies, researchers are interested in examining the associations between multiple factors and multiple outcomes simultaneously.^59–61 This can be achieved by jointly modeling multiple outcomes. Compared to a univariate model, joint models are computationally more complex. We examine a joint model comprising hospital stay (Y₁) and paid assistive devices (Y₄) on datasets with imputed comorbidity index (Y₂) and BMI (Y₃) for Configuration 3. The bivariate multilevel generalized linear model used in the analysis is

E (y_{i 1 k} ∣ b_{i 0}, y_{i 2 k}, y_{i 3 k}, X_{i}, λ_{10}, λ_{11}, λ_{12}) = {logit}^{- 1} (λ_{10} + b_{i 10} + λ_{11} y_{i 2 k} + λ_{12} y_{i 3 k} + λ_{1} X_{i k}), E (y_{i 4 k} ∣ b_{i 0}, y_{i 2 k}, y_{i 3 k}, X_{i}, λ_{20}, λ_{21}, λ_{22}) = exp (λ_{20} + b_{i 20} + λ_{21} y_{i 2 k} + λ_{22} y_{i 3 k} + λ_{2} X_{i k}), where b_{i 0} = {(b_{i 10}, b_{i 20})}^{T} ~ N_{2} (0, Σ_{b}), Σ_{b} = (\begin{matrix} σ_{1}^{2} & ρ σ_{1} σ_{2} \\ ρ σ_{1} σ_{2} & σ_{2}^{2} \end{matrix}),

(5.5)

where a logit link function is applied to hospital stay and the log link function to the paid assistive devices. The subject-level effect b_i0 = (b_i10, b_i20)^T is assumed to follow a bivariate Normal distribution with zero means and an unstructured covariance matrix Σ_b. The correlation between the separate random intercepts ρ represents the interdependence between the two outcomes at the subject-level.

5.2.4 |. Congeniality of Analysis Models

Congeniality⁶² between the different imputation methods and the three analyses varies. Not all of the imputation procedures had the same or more general model specification compared to the three analysis models. For the univariate GLMM, methods that include multilevel modeling encompass the analysis model, whereas methods based on the wide format data are mis-specified. For the bivariate GLMM analysis, JM-MLMM and JM-MGLMM methods are the only two methods that encompass the analysis model. For the latent growth curve, all imputation methods are mis-specified because time was not adjusted for in any of the imputation models.

5.2.5 |. Performance assessment metrics

Estimates of the three analyses are obtained using the common combination rules described in Section 2. For each configuration and each replication we estimate the relative bias $({\hat{θ}}^{meth} - {\hat{θ}}^{comp}) / {\hat{θ}}^{comp}$ , where ${\hat{θ}}^{meth}$ is the estimate obtained after implementation of multiple imputation procedure meth described in Table 3, and ${\hat{θ}}^{comp}$ is the estimate from the complete dataset. We record the root mean squared error (RMSE), the 95% interval estimate width, and whether the interval estimate covers the estimate from the complete data. Additionally, we estimate the fraction of missing information (FMI) that measures the uncertainty in the imputed values for missing elements.^8,63 For each parameter estimate, the FMI estimate is

{\hat{λ}}_{m} = \frac{r_{m} + 2 / (v + 3)}{r_{m} + 1},

(5.6)

where $r_{m} = (1 + m^{- 1}) \frac{W_{B}}{U_{B}}$ and $v = (m - 1) {(1 + \frac{1}{r_{m}})}^{2}$ . W_B and U_B are calculated using Equation (2.1), and represent the variance between the m complete-data estimates and the average of the m complete-data variances, respectively. We summarize these metrics by averaging across all replications.
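The FMI calculation in Equation (5.6) can be sketched in a few lines. The function below pools m point estimates and their sampling variances with Rubin's combination rules and then evaluates r_m, v, and the FMI estimate. The function name and the example numbers are hypothetical, and the total variance is written in the standard form U_B + (1 + 1/m) W_B, which is assumed here to match Equation (2.1).

```python
import numpy as np


def pool_and_fmi(estimates, variances):
    """Pool m imputed-data estimates and return (pooled, total_var, fmi)."""
    est = np.asarray(estimates, dtype=float)
    var = np.asarray(variances, dtype=float)
    m = est.size
    pooled = est.mean()
    between = est.var(ddof=1)   # W_B: variance between the m estimates
    within = var.mean()         # U_B: average of the m sampling variances
    total_var = within + (1.0 + 1.0 / m) * between
    r_m = (1.0 + 1.0 / m) * between / within
    v = (m - 1) * (1.0 + 1.0 / r_m) ** 2
    fmi = (r_m + 2.0 / (v + 3.0)) / (r_m + 1.0)
    return pooled, total_var, fmi


# Hypothetical estimates from m = 5 imputed datasets.
pooled, total_var, fmi = pool_and_fmi(
    [0.36, 0.38, 0.40, 0.37, 0.39],
    [0.0009, 0.0009, 0.0009, 0.0009, 0.0009],
)
```

For these illustrative inputs, W_B = 0.00025 and U_B = 0.0009, so r_m = 1/3, v = 64, and the FMI estimate is approximately 0.272. The function assumes W_B > 0; if all m estimates are identical, r_m is zero and v is undefined.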

5.3 |. Results of the univariate GLMM across all configurations

5.3.1 |. Relative bias of coefficients and the subject-level variance

Figure 1 presents the relative bias of regression coefficient estimates associated with the incomplete variables and the subject-level variance estimate, $σ_{b}^{2}$ , for study design Configurations 1–3 with monotone missing data patterns. The first row depicts the change in relative bias of ${\hat{γ}}_{1}$ , the conditional log-odds ratio of having a hospital stay with one unit increase in the comorbidity index, as the number of incomplete variables increases from two to four. The boxplots of relative biases are similar for each method as the number of incomplete predictors increases. Across Configurations 1–3, JM-GL results in the smallest mean relative bias of −0.02. However, it has the largest variability in relative biases of ${\hat{γ}}_{1}$ compared to other methods for all configurations. FCS-Standard with either linear regressions (LM) or with predictive mean matching (PMM), and FCS-LMM-latent result in comparable mean relative biases to JM-GL. Across Configurations 1–3, the average relative bias for LM is between −0.04 and −0.03, and between −0.03 and −0.02 for PMM and FCS-LMM-latent. Imputation methods that assume Poisson regressions for the count variables (FCS-Standard, FCS-GLMM, and JM-MGLMM) have mean relative biases that are larger in magnitude than 0.29 (that is, below −0.29). The JM-MLMM methods generally have the second largest relative bias. Across Configurations 1–3, JM-MGLMM leads to the largest mean relative bias of approximately −0.40.

FIGURE 1.

Relative bias of regression coefficient estimates associated with the incomplete covariates and the random effect estimate. Each column represents a simulation setting. From left to right, the columns correspond to Configuration 1 to Configuration 3.

Similar trends are observed for the relative bias of ${\hat{γ}}_{2}$ in Configurations 2 and 3 (second row of Figure 1). The mean relative biases produced by FCS-Standard with either LM or PMM, FCS-LMM-latent, and JM-GL are close to zero. All of the methods show higher variability in relative biases for ${\hat{γ}}_{2}$ compared to those observed for ${\hat{γ}}_{1}$ and ${\hat{γ}}_{3}$ . The interquartile ranges (IQRs) of the relative biases of ${\hat{γ}}_{2}$ in Configurations 2–3 are above 0.20 for all of the imputation methods, whereas the IQRs of the relative bias for ${\hat{γ}}_{1}$ and ${\hat{γ}}_{3}$ are around 0.05 and 0.08 on average. This is because BMI is not significantly associated with hospital stay status after adjustment for the other covariates. In Configuration 3, the mean relative biases of ${\hat{γ}}_{3}$ after using FCS-Standard with Poisson regressions, FCS-GLMM, and JM-MGLMM with multilevel Poisson regressions are 0.007, −0.004 and −0.124, respectively. In contrast, the mean relative biases of ${\hat{γ}}_{1}$ produced by these methods range from −0.3 to −0.4.

The relative bias of the subject-level variance, $σ_{b}^{2}$ , is depicted in the last row of Figure 1. All of the methods present similar trends across all configurations. The FCS-LMM-latent has the smallest mean relative bias compared to all other methods. The mean relative bias produced by JM-GL is close to zero; however, the variability of the relative bias across replications is the largest. JM-MLMM performs similarly to JM-GL. The two FCS-GLMM methods lead to the largest mean relative biases for estimating $σ_{b}^{2}$ compared to all other methods. Their mean relative biases across all configurations are greater than 0.17.

The relative biases of regression coefficient estimates associated with fully observed predictors are presented in Figure 2. The relative biases of statistically significant coefficients (p-value<0.001) show similar trends for all methods. Generally, the average values and the variability of the relative biases across replications are smaller compared to the relative biases of the statistically insignificant coefficients (the top four rows compared to the bottom two rows). In Configuration 3, the mean relative bias of FCS-Standard is close to zero for all fully observed covariates, except for FCS-Standard with Poisson regression. FCS-LMM-latent has a similar performance to FCS-Standard methods other than the one with Poisson regressions. JM-MGLMM and JM-MLMM have large relative biases for the coefficients of gender, an individual’s status of taking prescribed medicines, and having medical bills being paid off over time.

FIGURE 2.

Relative bias of regression coefficient estimates associated with the fully observed predictors. Each column represents a simulation setting. From left to right, the columns correspond to Configuration 1 to Configuration 3.

5.3.2 |. The RMSE, Interval Width and Coverage

Because all of the methods produce similar trends across all configurations, we only present the RMSEs, interval widths, and 95% coverage rates for all imputation methods in Configuration 3 (Table 5). Results for Configurations 1 and 2 are provided in the Appendix. Generally, FCS-LMM-latent, FCS-Standard with either LM or PMM, and JM-GL outperform the other methods in terms of RMSE and average interval width across all regression coefficients. FCS-LMM-latent has the smallest RMSE when estimating γ₁, γ₂, γ₇, γ₈, γ₉, and smaller RMSEs when estimating other coefficients compared to FCS-Standard with Poisson regression, FCS-GLMM, JM-MLMM, and JM-MGLMM. Across all methods, the average interval widths for all coefficients are similar, and the differences in RMSE are mainly driven by biases.

TABLE 5.

The root mean squared error (RMSE), average interval width and empirical coverage of the 95% CI of coefficients estimates and subject-level variance estimate for the univariate analysis in Configuration 3 with the monotone missing data pattern. The incomplete predictors are marked in bold.

Estimates	Metrics	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Intercept ${\hat{γ}}_{0}^{comp} = - {3.89}^{* *}$	RMSE	0.150	0.154	0.149	0.156	0.255	0.237	0.176	0.131	0.165	0.125
	Width	1.157	1.136	1.127	1.110	1.039	1.033	1.094	1.131	1.144	1.055
	Coverage(%)	100	99	100	100	99	100	100	100	100	100
Comorbidity Index ${\hat{γ}}_{1}^{comp} = {0.38}^{* *}$	RMSE	0.021	0.020	0.136	0.019	0.025	0.112	0.020	0.031	0.071	0.152
	Width	0.117	0.118	0.117	0.117	0.110	0.105	0.113	0.117	0.113	0.100
	Coverage(%)	99	100	0	100	100	0	99	100	22	0
BMI ${\hat{γ}}_{2}^{comp} = - {0.02}^{*}$	RMSE	0.003	0.004	0.004	0.003	0.004	0.003	0.004	0.004	0.006	0.005
	Width	0.029	0.029	0.028	0.028	0.026	0.026	0.027	0.028	0.030	0.027
	Coverage(%)	100	100	100	100	100	100	100	100	99	100
Paid assistive devices ${\hat{γ}}_{3}^{comp} = {0.36}^{* *}$	RMSE	0.026	0.022	0.021	0.026	0.019	0.018	0.025	0.019	0.052	0.048
	Width	0.149	0.152	0.150	0.150	0.142	0.141	0.144	0.149	0.151	0.143
	Coverage(%)	100	100	100	99	100	100	99	100	87	92
Age ${\hat{γ}}_{4}^{comp} = {0.11}^{* *}$	RMSE	0.014	0.014	0.015	0.016	0.018	0.015	0.019	0.017	0.018	0.012
	Width	0.112	0.112	0.111	0.111	0.104	0.101	0.107	0.112	0.111	0.106
	Coverage(%)	100	100	100	100	100	100	100	100	100	100
Gender ${\hat{γ}}_{5}^{comp} = {0.28}^{* *}$	RMSE	0.044	0.041	0.073	0.056	0.048	0.072	0.053	0.040	0.054	0.090
	Width	0.310	0.311	0.306	0.307	0.282	0.282	0.299	0.307	0.304	0.287
	Coverage(%)	100	100	99	100	100	96	100	100	100	96
Self-rated health ${\hat{γ}}_{6}^{comp} = {0.20}^{* *}$	RMSE	0.024	0.024	0.052	0.024	0.021	0.049	0.028	0.022	0.040	0.072
	Width	0.162	0.162	0.164	0.157	0.146	0.146	0.155	0.157	0.158	0.148
	Coverage(%)	100	99	95	100	100	93	100	100	98	60
Have med. bill paid off overtime ${\hat{γ}}_{7}^{comp} = {0.34}^{*}$	RMSE	0.064	0.066	0.060	0.048	0.058	0.066	0.131	0.093	0.102	0.067
	Width	0.483	0.483	0.480	0.473	0.444	0.435	0.462	0.473	0.475	0.451
	Coverage(%)	100	100	100	100	100	100	93	99	99	100
Took prescribed med. ${\hat{γ}}_{8}^{comp} = 0.30$	RMSE	0.095	0.096	0.129	0.071	0.096	0.156	0.110	0.155	0.206	0.183
	Width	0.702	0.714	0.692	0.666	0.647	0.644	0.667	0.684	0.674	0.650
	Coverage(%)	100	100	100	100	100	100	99	100	99	98
Time ${\hat{γ}}_{9}^{comp} = {0.10}^{* *}$	RMSE	0.020	0.022	0.048	0.016	0.027	0.023	0.021	0.016	0.017	0.018
	Width	0.109	0.108	0.108	0.104	0.102	0.105	0.099	0.105	0.106	0.100
	Coverage(%)	99	100	66	100	96	98	99	100	100	100
Random effect ${\hat{σ}}_{b_{comp}}^{2} = {1.13}^{* *}$	RMSE	0.048	0.044	0.070	0.037	0.202	0.220	0.050	0.035	0.040	0.107
	Width	0.235	0.239	0.233	0.238	0.217	0.212	0.221	0.241	0.240	0.220
	Coverage(%)	98	99	95	100	0	0	94	100	99	56


The interval coverages of all coefficients estimated using FCS-Standard methods, FCS-LMM-latent, and JM-GL are above nominal. The two JM-MLMM methods produce comparable RMSE, average interval widths, and coverages to FCS-LMM-latent only for γ₂ and γ₄. However, JM-MLMM with heteroscedastic Σ_ei has significantly lower than nominal coverage for γ₁ and γ₃. Compared to other methods, FCS-GLMM and JM-MGLMM with multilevel Poisson regressions have shorter interval widths for all coefficients and comparable RMSE for most coefficients. However, they result in large RMSEs and below nominal coverages when estimating γ₁ and $σ_{b}^{2}$ .

5.3.3 |. Results of GLMM for intermittent missing data pattern

In Configuration 3 with intermittent missing data, the FCS-Standard methods and JM-GL perform similarly to how they performed under the monotone missing pattern, in terms of the mean relative biases (MRB), interval coverages, and RMSE for most of the coefficients (Appendix Table A1). Comparing the monotone to the intermittent missing data pattern, we observe an increase in the RMSE of the coefficients of the incomplete variables for FCS-LMM-latent and FCS-GLMM, and their 95% coverage rates for the comorbidity index coefficient decrease to 85% (FCS-LMM-latent) and 44% (FCS-GLMM). The performances of the JM-MLMM and JM-MGLMM also deteriorate with intermittent missing data. Specifically, their average 95% coverage rates for the coefficients of the incomplete variables, γ₁ − γ₃, drop from above nominal to below 60%. The two FCS-GLMM methods have higher relative biases, wider interval widths, and interval estimates that do not cover the true parameter for the subject-level variance estimate.

5.4 |. Results of the latent growth curve model for Configuration 3

The results of the mean relative biases, RMSE, and average interval coverages of the estimated coefficients associated with the slope factor at each time point are presented in Table 6. Overall, the performances of the JM methods are inferior to those of the FCS methods. Among all methods, JM-MGLMM results in the largest mean relative biases and below nominal interval coverages for all coefficients. Among the FCS methods, the FCS-Standard methods outperform the FCS-LMM-latent and FCS-GLMM. For the comorbidity index coefficient after round 2, FCS-Standard methods result in nominal average interval coverages, while FCS methods with hierarchical modeling result in below nominal coverages. For intermittent missing data patterns, the average interval coverages of FCS-LMM-latent and FCS-GLMM are below 75%. Comparing the results between monotone and intermittent missing data patterns, we observe lower relative biases and better interval coverages for the monotone missing data pattern for all of the FCS methods and JM-GL across all coefficient estimates. JM-MLMM with heteroscedastic variances and JM-MGLMM result in larger relative biases for the intercepts at rounds 3 and 4 in simulations with intermittent missing data patterns, and their interval estimates do not cover the true intercepts.

TABLE 6.

The mean relative bias (MRB), the root mean squared error (RMSE), average empirical coverage of the 95% CI, average FMI estimates of slope factor’s coefficients estimates for the latent growth curve analysis in Configuration 3 with intermittent (Inter.) and monotone (Mono.) missing data patterns.

Slope Estimates	Metrics	FCS-Standard (LM)		FCS-Standard (PMM)		FCS-Standard (Poisson)		FCS-LMM-latent		FCS-GLMM (Gaussian)		FCS-GLMM (Poisson)		JM-General Location		JM-MLMM (common)		JM-MLMM (random)		JM-MGLMM
Slope Estimates	Metrics	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.	Inter.	Mono.
Round 1 Intercept	MRB	−0.03	0.00	−0.02	0.00	−0.03	−0.01	0.11	0.06	0.11	0.06	0.11	0.06	−0.03	−0.01	0.11	0.05	0.16	0.32	0.89	0.05
	RMSE	0.21	0.16	0.21	0.17	0.21	0.16	0.22	0.15	0.22	0.15	0.22	0.15	0.21	0.16	0.22	0.15	0.31	0.56	1.63	0.15
	Coverage(%)	74	85	77	87	72	86	60	85	60	84	58	88	71	83	61	88	64	26	13	93
	FMI	0.40	0.32	0.47	0.46	0.38	0.31	0.31	0.24	0.29	0.26	0.29	0.25	0.26	0.23	0.30	0.25	0.44	0.68	0.08	0.34
Round 1 Comorbidity	MRB	−0.43	0.08	−0.48	0.40	−2.15	1.96	−4.01	−1.57	−4.14	−1.54	−4.07	−2.87	−0.48	−0.15	−1.19	−0.05	−1.57	4.84	−12.53	−2.47
	RMSE	0.03	0.02	0.03	0.02	0.04	0.04	0.06	0.03	0.06	0.03	0.06	0.05	0.03	0.02	0.03	0.02	0.11	0.20	0.19	0.04
	Coverage(%)	100	100	100	100	94	94	70	98	65	98	44	82	97	100	100	100	93	97	0	96
	FMI	0.42	0.31	0.42	0.31	0.45	0.53	0.37	0.26	0.35	0.28	0.38	0.50	0.25	0.23	0.38	0.28	0.84	0.92	0.14	0.41
Round 2 Intercept	MRB	0.01	0.01	0.02	0.00	0.02	0.01	0.00	0.01	0.00	0.01	0.00	0.02	0.01	0.01	0.00	0.02	0.46	0.37	4.75	0.23
	RMSE	0.18	0.16	0.19	0.16	0.18	0.16	0.12	0.12	0.12	0.12	0.12	0.12	0.18	0.16	0.12	0.13	0.94	0.76	9.50	0.48
	Coverage(%)	64	71	84	81	66	71	83	79	85	79	85	80	60	66	83	78	9	11	0	5
	FMI	0.42	0.36	0.74	0.59	0.43	0.34	0.31	0.28	0.34	0.28	0.32	0.29	0.30	0.27	0.34	0.29	0.93	0.89	0.24	0.45
Round 2 Comorbidity	MRB	−0.23	−0.08	−0.24	0.01	−0.86	0.41	−1.34	−0.61	−1.37	−0.59	−1.39	−1.06	−0.23	−0.12	−0.45	−0.30	−0.36	1.37	−2.31	−0.93
	RMSE	0.03	0.02	0.04	0.02	0.05	0.03	0.08	0.04	0.08	0.04	0.08	0.06	0.04	0.02	0.04	0.03	0.13	0.23	0.14	0.06
	Coverage(%)	99	100	98	100	77	95	39	94	38	96	9	62	91	100	99	99	89	96	9	63
	FMI	0.52	0.33	0.53	0.33	0.60	0.64	0.44	0.32	0.44	0.34	0.52	0.64	0.32	0.25	0.47	0.34	0.90	0.95	0.24	0.51
Round 3 Intercept	MRB	−0.04	−0.01	−0.02	−0.03	−0.04	−0.01	0.17	0.10	0.17	0.10	0.17	0.10	−0.04	−0.01	0.17	0.10	1.09	0.74	9.47	0.52
	RMSE	0.20	0.14	0.20	0.14	0.19	0.14	0.30	0.20	0.31	0.20	0.30	0.21	0.20	0.14	0.30	0.20	1.85	1.26	16.04	0.89
	Coverage(%)	62	76	76	83	62	75	13	43	14	44	19	43	52	70	18	45	0	0	0	0
	FMI	0.55	0.39	0.77	0.62	0.53	0.39	0.39	0.35	0.40	0.34	0.41	0.34	0.36	0.29	0.41	0.36	0.97	0.94	0.22	0.50
Round 3 Comorbidity	MRB	−0.21	−0.11	−0.22	−0.05	−0.66	0.13	−1.01	−0.50	−1.03	−0.48	−1.07	−0.79	−0.21	−0.13	−0.44	−0.36	−0.31	0.76	−0.99	−0.73
	RMSE	0.04	0.02	0.04	0.02	0.07	0.03	0.10	0.05	0.10	0.05	0.10	0.08	0.05	0.03	0.05	0.04	0.16	0.24	0.10	0.07
	Coverage(%)	95	99	97	99	54	97	16	81	17	82	1	39	82	99	90	92	84	93	46	25
	FMI	0.59	0.37	0.60	0.37	0.71	0.73	0.49	0.35	0.49	0.37	0.60	0.70	0.36	0.29	0.52	0.37	0.92	0.96	0.31	0.54
Round 4 Intercept	MRB	−0.03	−0.02	−0.04	−0.03	−0.02	−0.01	0.10	0.02	0.10	0.02	0.10	0.02	−0.03	−0.02	0.09	0.02	1.04	0.60	9.99	0.45
	RMSE	0.33	0.21	0.33	0.21	0.33	0.21	0.24	0.14	0.24	0.14	0.24	0.14	0.34	0.22	0.23	0.14	2.03	1.19	19.36	0.88
	Coverage(%)	62	78	70	87	63	79	72	92	72	91	69	92	54	70	73	91	1	5	0	0
	FMI	0.61	0.40	0.76	0.70	0.61	0.43	0.42	0.32	0.42	0.32	0.41	0.33	0.40	0.30	0.43	0.32	0.95	0.88	0.16	0.45
Round 4 Comorbidity	MRB	−0.32	−0.16	−0.33	−0.10	−0.81	0.02	−1.06	−0.50	−1.09	−0.47	−1.14	−0.82	−0.30	−0.17	−0.48	−0.42	−0.29	0.80	−1.18	−0.76
	RMSE	0.05	0.03	0.06	0.03	0.08	0.04	0.10	0.05	0.11	0.05	0.11	0.08	0.06	0.03	0.05	0.05	0.18	0.26	0.12	0.07
	Coverage(%)	93	99	94	99	58	97	23	87	22	88	3	46	76	97	92	93	84	93	50	35
	FMI	0.61	0.39	0.63	0.39	0.74	0.74	0.49	0.34	0.50	0.36	0.59	0.65	0.36	0.30	0.53	0.35	0.91	0.95	0.35	0.52


5.5 |. Results of the multivariate GLMM for Configuration 3

The relative biases of the coefficient estimates of the hospital stay outcome are summarized in Figure 3. For all imputation methods, estimates of model parameters generally have larger relative biases when using a joint analysis model compared to the univariate model analysis. FCS-LMM-latent has the smallest mean relative bias and the narrowest IQR of relative biases for all coefficients except for the coefficient of the indicator variable of having any prescribed medicine. JM-GL has comparable performance to FCS-LMM-latent in terms of having small mean relative biases, but it has larger variability in relative biases for all coefficients. JM-MLMM and JM-MGLMM lead to small relative biases only in coefficients that are statistically significant (age, BMI, gender, self-rated health). FCS-Standard with either LM or Poisson regressions has higher mean relative biases for coefficients of incomplete variables compared to its performance in the univariate analysis. FCS-Standard with PMM leads to mean relative biases close to zero for coefficients of BMI, gender, self-rated health, and having any prescribed medicine, but it has the widest IQRs of relative biases compared to other imputation methods.

FIGURE 3.

Relative bias of regression coefficient estimates for hospital stay.

Table 7 summarizes the RMSEs, average interval widths, and coverages for the coefficients that represent the associations between the incomplete explanatory variables and the two incomplete outcomes. When estimating the conditional association between the comorbidity index and the binary outcome hospital stay, λ₁₁, FCS-LMM-latent, FCS-GLMM with identity links for count variables, JM-GL, and JM-MLMM have interval coverages above 95%, with small RMSEs of approximately 0.04. However, these methods produce larger RMSEs (> 0.11) and lower than nominal interval coverages when estimating the conditional association between the comorbidity index and the number of paid assistive devices, λ₂₁. FCS-Standard, FCS-GLMM, and JM-MGLMM, which assume Poisson regressions for count variables, lead to small RMSEs (0.01, 0.02, 0.01, respectively) and interval coverages around nominal (96%, 92%, 98%, respectively). When estimating the conditional associations between BMI and the two outcomes, FCS-LMM-latent produces the smallest RMSE for λ₁₂ and λ₂₂, and it has interval coverages above nominal. Generally, all methods lead to higher RMSE when estimating λ₂₁ and λ₂₂ compared to the estimation of λ₁₁ and λ₁₂. This is because of the statistically insignificant association between BMI and the number of paid assistive devices.
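As an illustration of how the operating characteristics reported in the tables can be obtained from simulation replicates, the following minimal sketch computes the mean relative bias, RMSE, average interval width, and empirical coverage for a single coefficient. The function name, argument names, and toy numbers are hypothetical, and the use of the complete-data estimate as the reference value is an assumption based on the "comp" superscript in the table rows; the exact definitions used in the simulations may differ.

```python
import math


def summarize_replicates(estimates, lowers, uppers, truth):
    """Summarize simulation replicates for one coefficient.

    estimates: point estimate from each replicate (after pooling imputations)
    lowers, uppers: limits of the 95% interval from each replicate
    truth: reference value (assumed here to be the complete-data estimate)
    """
    n = len(estimates)
    # Mean relative bias: average of (estimate - truth) / truth.
    mrb = sum((e - truth) / truth for e in estimates) / n
    # Root mean squared error around the reference value.
    rmse = math.sqrt(sum((e - truth) ** 2 for e in estimates) / n)
    # Average width of the interval estimates.
    width = sum(u - l for l, u in zip(lowers, uppers)) / n
    # Percentage of intervals that contain the reference value.
    coverage = 100.0 * sum(l <= truth <= u for l, u in zip(lowers, uppers)) / n
    return {"MRB": mrb, "RMSE": rmse, "Width": width, "Coverage": coverage}


# Toy replicates (hypothetical numbers, not taken from the study).
summary = summarize_replicates(
    estimates=[0.40, 0.48, 0.44, 0.52],
    lowers=[0.30, 0.38, 0.34, 0.50],
    uppers=[0.50, 0.58, 0.54, 0.54],
    truth=0.44,
)
print(summary)
```

A method can have a small RMSE and still undercover if its intervals are too narrow, which is why the tables report the three quantities side by side.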

TABLE 7.

The root mean squared error (RMSE), average interval width, and empirical coverage (%) of the 95% CI of the coefficient estimates of the incomplete predictors and the correlation estimate for the joint analysis in Configuration 3.

Estimates	Metrics	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Outcome 1: Hospital stay (Yes)
Comorbidity ${\hat{λ}}_{11}^{comp} = {0.44}^{* *}$	RMSE	0.064	0.160	0.143	0.043	0.038	0.113	0.070	0.046	0.059	0.193
	Width	0.216	0.890	0.520	0.482	0.768	0.614	0.596	1.140	0.633	0.339
	Coverage	81	88	85	100	100	80	100	100	100	35
BMI ${\hat{λ}}_{12}^{comp} = - {0.02}^{* *}$	RMSE	0.006	0.008	0.006	0.004	0.005	0.005	0.006	0.005	0.007	0.007
	Width	0.032	0.062	0.049	0.042	0.055	0.052	0.043	0.074	0.055	0.039
	Coverage	99	100	100	100	100	100	100	100	100	99
Outcome 2: Paid assistive devices
Comorbidity ${\hat{λ}}_{21}^{comp} = {0.086}^{* *}$	RMSE	0.016	0.028	0.013	0.020	0.019	0.016	0.020	0.021	0.109	0.011
	Width	0.051	0.054	0.057	0.052	0.048	0.05	0.049	0.051	0.256	0.049
	Coverage	87	44	96	81	76	92	75	79	78	98
BMI ${\hat{λ}}_{22}^{comp} = 0.003$	RMSE	0.003	0.002	0.003	0.002	0.002	0.002	0.003	0.002	0.005	0.003
	Width	0.012	0.014	0.012	0.012	0.012	0.012	0.013	0.011	0.021	0.013
	Coverage	100	100	99	100	100	100	97	97	99	99
Correlation between the two outcomes
$\hat{ρ} = 0.30$	RMSE	0.038	0.037	0.037	0.035	0.045	0.051	0.044	0.035	0.039	0.045
	Width	0.231	0.226	0.230	0.230	0.252	0.241	0.220	0.243	0.233	0.221
	Coverage	100	100	100	100	100	98	100	100	100	97


For the correlation estimate between the two outcomes, ρ, the FCS-LMM-latent and JM-MLMM have the smallest RMSE compared to the other methods. However, all methods have relatively similar interval estimates which lead to interval coverages that are above nominal.

5.6 |. Fraction of missing information and computational time

The FMI estimates of the univariate GLMM analysis are summarized in Table 8. Generally, the average FMI estimates from simulations with monotone missing data patterns are smaller than those with intermittent missing data patterns for all imputation methods. When estimating the coefficients associated with incomplete variables and the subject-level variance, larger differences between the FMI estimated under the intermittent and monotone missing data patterns are observed for FCS-based methods and JM-MLMM with heteroscedastic variances than for the other imputation methods. Across all simulation scenarios, the range of the average FMI estimates is between 0.14 and 0.51. Most imputation methods have average FMI estimates between 0.20 and 0.30. Among all of the methods, JM-GL results in the smallest FMI estimates for all of the constant effects coefficients.
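For readers who wish to reproduce this type of summary, the sketch below pools one coefficient across m imputed datasets with Rubin's combining rules and returns two common measures of missing information: the proportion of total variance due to missingness (λ) and the degrees-of-freedom-adjusted FMI (γ). This is the standard textbook form; whether the reported FMI values use exactly this variant is an assumption, and the function name and toy inputs are hypothetical.

```python
import statistics


def pool_rubin(estimates, variances):
    """Pool m imputation-specific results for one scalar parameter.

    estimates: point estimates from the m imputed datasets
    variances: squared standard errors from the m imputed datasets
    """
    m = len(estimates)
    q_bar = statistics.fmean(estimates)      # pooled point estimate
    w = statistics.fmean(variances)          # within-imputation variance
    b = statistics.variance(estimates)       # between-imputation variance (m - 1 denominator)
    t = w + (1 + 1 / m) * b                  # total variance
    if b == 0:
        # No between-imputation variability: no information is lost.
        return {"estimate": q_bar, "total_var": t, "lambda": 0.0, "fmi": 0.0}
    r = (1 + 1 / m) * b / w                  # relative increase in variance
    nu = (m - 1) * (1 + 1 / r) ** 2          # Rubin's (1987) degrees of freedom
    lam = (1 + 1 / m) * b / t                # share of variance due to missingness
    fmi = (r + 2 / (nu + 3)) / (r + 1)       # adjusted fraction of missing information
    return {"estimate": q_bar, "total_var": t, "lambda": lam, "fmi": fmi}


# Toy inputs for m = 5 imputations (hypothetical numbers).
pooled = pool_rubin([1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5)
print(pooled)
```

Larger between-imputation variance relative to within-imputation variance yields a larger FMI, which is consistent with the higher values seen under intermittent missingness.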

TABLE 8.

The fraction of missing information estimated for the univariate GLMM analysis in Configuration 3.

Estimates	Missing pattern	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Intercept	Monotone	0.28	0.29	0.31	0.23	0.23	0.22	0.19	0.23	0.29	0.23
	Intermittent	0.32	0.30	0.31	0.26	0.27	0.26	0.17	0.25	0.35	0.23
Comorbidity Index	Monotone	0.26	0.27	0.51	0.23	0.23	0.37	0.19	0.23	0.36	0.38
	Intermittent	0.37	0.38	0.48	0.33	0.36	0.47	0.20	0.34	0.45	0.36
BMI	Monotone	0.28	0.27	0.30	0.24	0.24	0.23	0.19	0.23	0.33	0.23
	Intermittent	0.33	0.33	0.33	0.27	0.30	0.32	0.18	0.28	0.46	0.25
Paid assistive devices	Monotone	0.24	0.30	0.23	0.21	0.19	0.18	0.16	0.21	0.37	0.24
	Intermittent	0.49	0.51	0.47	0.41	0.40	0.39	0.26	0.37	0.48	0.31
Age	Monotone	0.27	0.25	0.28	0.21	0.24	0.24	0.17	0.22	0.23	0.22
	Intermittent	0.25	0.25	0.25	0.20	0.26	0.26	0.14	0.21	0.21	0.20
Gender	Monotone	0.27	0.26	0.28	0.21	0.24	0.24	0.18	0.22	0.22	0.21
	Intermittent	0.27	0.25	0.26	0.20	0.27	0.28	0.15	0.22	0.22	0.19
Self-rated health	Monotone	0.30	0.30	0.35	0.22	0.24	0.26	0.19	0.23	0.26	0.23
	Intermittent	0.27	0.27	0.30	0.23	0.27	0.29	0.16	0.24	0.28	0.21
Have med. bill paid off overtime	Monotone	0.23	0.23	0.23	0.17	0.22	0.22	0.16	0.20	0.20	0.20
	Intermittent	0.23	0.23	0.22	0.18	0.26	0.26	0.11	0.19	0.20	0.18
Took prescribed med.	Monotone	0.34	0.34	0.35	0.23	0.25	0.25	0.23	0.23	0.23	0.23
	Intermittent	0.36	0.35	0.35	0.26	0.28	0.29	0.20	0.29	0.31	0.23
Time	Monotone	0.32	0.32	0.36	0.26	0.30	0.30	0.21	0.29	0.28	0.25
	Intermittent	0.36	0.36	0.37	0.29	0.32	0.32	0.18	0.33	0.33	0.27
Subject-level variance	Monotone	0.36	0.35	0.37	0.33	0.22	0.23	0.24	0.35	0.35	0.28
	Intermittent	0.44	0.45	0.47	0.40	0.31	0.33	0.23	0.41	0.43	0.33


The average FMI estimates of the coefficients in the latent curve model when assuming a monotone missing data pattern are smaller than those assuming the intermittent missing data pattern (Table 6). Overall, the FMI estimates for the latent growth curve model coefficients are larger than the univariate GLMM coefficients. The FCS-based methods and JM-GL result in relatively smaller FMI that are around 0.25 compared to the JM-MLMM with heteroscedastic variances.

We recorded the computational time of all 10 methods on a standard Windows PC with an Intel Xeon 2.40 GHz processor. For a single imputation, FCS-Standard with linear regressions or PMM, and JM-GL, required the least computational time (less than 2 s). FCS-Standard with Poisson regressions had an average computational time of 22 s. The three JM-based methods with multivariate multilevel modeling required approximately 280 s of computation time. The computational time of the FCS-GLMM methods was around 800 s. The most computationally intensive method was FCS-LMM-latent, which required several hours to complete a single imputation.

6 |. CASE STUDY ON SKILLED NURSING HOME ADMISSIONS

We applied the different imputation methods to study the relationship between skilled nursing facility (SNF) admissions and sociodemographic factors, the number of hospital admissions, and measures of physical and cognitive abilities. The data are based on NHATS Rounds 1–5, linked to data from the Centers for Medicare & Medicaid Services (CMS). We selected 4836 participants who were enrolled in Medicare fee-for-service at baseline. Deceased participants and those who left Medicare fee-for-service during the study were treated as censored rather than missing. The proportions of participants missing because of loss to follow-up at each round were approximately 16%, 14%, 12%, and 6%.

The sociodemographic factors comprised baseline age, sex, race/ethnicity (non-Hispanic white and other), educational level (college and higher, no school or < 9th grade, 9th–12th grade and high school, and vocational training), time-varying cohabitation status (living alone versus not living alone), and Medicare-Medicaid dual eligibility. Cognitive ability is measured on a 5-point Likert scale ranging from excellent (1 point) to poor (5 points). The time-varying status of Alzheimer’s Disease and Related Dementias (ADRD) was defined based on the Chronic Condition Data Warehouse (CCW) algorithm. The time-varying comorbidity index is a count of 20 common chronic conditions among older adults identified through the CCW algorithm.⁶⁴ The function and frailty factors include two time-varying count variables: activities of daily living (ADL) and frailty score. ADL is derived as a count of limitations in seven daily living activities. The frailty score is defined as the sum of five criteria: exhaustion, low activity, weakness, slowness, and shrinking. Among all explanatory variables, only cohabitation status, ADL, and frailty score are subject to missing values.

We implemented FCS-Standard, FCS-LMM-latent, FCS-GLMM, JM-GL, JM-MLMM, and JM-MGLMM to impute the missing data. We applied a multilevel logistic regression model to examine the relationship between SNF admission status and sociodemographic factors, adjusted for other factors that assess physical and cognitive abilities. The results are displayed in Figure 4. Non-Hispanic white patients have 75% higher conditional odds of an SNF admission compared to other race/ethnicity groups. Each additional hospital admission is associated with a fourfold increase in the conditional odds of an SNF admission. An ADRD diagnosis, a higher ADL disability count, a higher comorbidity index, and especially a higher frailty score were associated with increased conditional odds of SNF admission. Not living alone and having Medicare-Medicaid dual eligibility were associated with decreased conditional odds of SNF admission. Education showed a statistically insignificant relationship with any SNF admission after adjustment for the other covariates.

FIGURE 4.

Odds ratio estimates of sociodemographic factors, the number of hospital admissions, and measures of physical and cognitive abilities on skilled nursing facility admissions for different imputation methods.
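The statements about percentage changes and multiples of the conditional odds follow from exponentiating the coefficients of the multilevel logistic regression. The sketch below shows this arithmetic; the function names are hypothetical, and the coefficient values are back-calculated from the odds ratios quoted in the text rather than taken from the fitted model.

```python
import math


def odds_ratio(beta, units=1.0):
    """Odds ratio implied by a log-odds coefficient for a change of `units`."""
    return math.exp(beta * units)


def percent_change_in_odds(beta, units=1.0):
    """Percentage change in the odds for a change of `units` in the covariate."""
    return 100.0 * (odds_ratio(beta, units) - 1.0)


def odds_ratio_interval(beta, se, z=1.959963984540054):
    """Wald-type 95% interval for the odds ratio, computed on the log-odds scale."""
    return math.exp(beta - z * se), math.exp(beta + z * se)


# A coefficient of log(1.75) corresponds to 75% higher conditional odds.
beta_race = math.log(1.75)
print(percent_change_in_odds(beta_race))

# A coefficient of log(4) per admission corresponds to a fourfold increase,
# and two additional admissions multiply the odds by 4 squared.
beta_hosp = math.log(4.0)
print(odds_ratio(beta_hosp), odds_ratio(beta_hosp, units=2))
```

Because the model includes subject-level random effects, these odds ratios are conditional on the subject rather than population-averaged.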

All imputation methods resulted in similar estimates of the conditional odds ratio and 95% interval estimates for the fully observed variables. FCS-based methods and JM-based methods result in different estimates for variables with missing values. FCS-Standard produces higher conditional odds ratios for cohabitation status and the number of hospital admissions, and it has lower conditional odds ratios for ADL disabilities and frailty scores compared to other methods. Compared to FCS-based methods, JM-based methods have higher conditional odds ratios for estimates of ADL disabilities and frailty scores.

7 |. DISCUSSION

Addressing missing data in longitudinal studies is challenging and involves advanced statistical techniques. We review existing MI methods for incomplete longitudinal mixed data, and their implementations on widely accessible software that requires limited additional coding. We compare these methods using simulation analyses and describe an applied example based on the NHATS data.

Among all of the methods examined, the FCS-LMM-latent method had the best performance for the univariate and multivariate multilevel modeling in terms of having small biases, relatively low FMI, and interval coverages that were at or above nominal. FCS-Standard and JM-GL resulted in the best performance for the growth curve model analysis, and comparable performance to FCS-LMM-latent for univariate hierarchical regression modeling, but sub-optimal performance when estimating some of the coefficients of multivariate multilevel modeling. Across all analyses, all imputation methods displayed better performance for data with monotone missing data patterns compared to intermittent missing data. This is partly because our intermittent missing data pattern is NMAR, while the monotone missing data pattern examined is MAR.
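The distinction between the two missing data patterns can be made concrete with a small helper that classifies a subject's sequence of observed and missing waves. Under the usual definition, a pattern is monotone when every wave after the first missing one is also missing (dropout), and intermittent when the subject returns after a missed wave. The function name and the example sequences are hypothetical.

```python
def classify_pattern(observed):
    """Classify one subject's missingness pattern across ordered waves.

    observed: list of booleans, True if the variable was observed at that wave.
    Returns "complete", "monotone", or "intermittent".
    """
    if all(observed):
        return "complete"
    first_missing = observed.index(False)
    # Monotone: nothing is observed from the first missing wave onward.
    if not any(observed[first_missing:]):
        return "monotone"
    return "intermittent"


# Hypothetical subjects followed over four waves.
print(classify_pattern([True, True, False, False]))   # dropout after wave 2
print(classify_pattern([True, False, True, True]))    # missed wave 2, then returned
```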

FCS-Standard and JM-GL treat the same variable measured at different time points as different variables, and use conditional associations between variables measured at different time points to impute the ones that are incomplete. This process essentially assumes an unstructured association structure between the repeated measurements. FCS-LMM-latent models account for the associations between repeated measurements recorded at different time points through subject-specific intercepts. The FCS-Standard methods are computationally more efficient and more flexible for imputing non-Normally distributed variables than FCS methods with multilevel modeling. The disadvantage of FCS-Standard methods is that they can result in non-identifiable models as the number of waves and variables within waves increases. In such cases, the FCS-twofold method⁴¹ is suggested for building imputation models. In our simulation study, we did not implement the FCS-twofold method because the simulation data comprise a small number of waves and variables within each wave. Possible disadvantages of the FCS-LMM-latent method are its intensive computational time and its possible sub-optimal performance when estimating growth curve models.
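The data layout implied by treating each wave of a variable as a separate variable can be sketched as a reshape from long format (one row per subject and wave) to wide format (one row per subject). The helper below is a minimal illustration; the function name, the column-naming scheme, and the toy records are hypothetical and do not reproduce any particular package.

```python
def long_to_wide(records, id_key, wave_key, time_varying):
    """Reshape long-format records into one row per subject.

    records: list of dicts, one per subject-wave
    time_varying: names of variables measured at every wave
    Returns (wide, columns), where wide maps subject id to a dict with one
    entry per variable-wave combination; unobserved entries are None.
    """
    waves = sorted({rec[wave_key] for rec in records})
    columns = [f"{var}_w{wave}" for wave in waves for var in time_varying]
    wide = {}
    for rec in records:
        # Start every subject with all wave-specific columns set to None.
        row = wide.setdefault(rec[id_key], dict.fromkeys(columns))
        for var in time_varying:
            row[f"{var}_w{rec[wave_key]}"] = rec.get(var)
    return wide, columns


# Hypothetical toy data: subject 2 has no record at wave 2.
records = [
    {"id": 1, "wave": 1, "bmi": 27.1, "adl": 0},
    {"id": 1, "wave": 2, "bmi": 26.8, "adl": 1},
    {"id": 2, "wave": 1, "bmi": 31.4, "adl": 2},
]
wide, columns = long_to_wide(records, "id", "wave", ["bmi", "adl"])
print(columns)
print(wide[2])
```

The number of columns equals the number of waves times the number of time-varying variables, so each conditional imputation model gains predictors quickly as either grows, which is the source of the identifiability concern noted above.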

The JM-based methods with multilevel modeling, JM-MLMM-latent and JM-MGLMM, use fully observed variables as predictors and model the associations between incomplete variables through the correlations of the subject-level intercepts. This may result in biased estimates of the association between an outcome and a predictor in the analysis model when both variables are incomplete. In addition, the simulations show that JM-MLMM-latent with a common covariance Σ_e across subjects performed slightly better than JM-MLMM-latent with heterogeneous covariances $Σ_{e_{i}}$. The poor performance of JM-MLMM-latent with heterogeneous covariances may arise from the lack of convergence of the sampling algorithm. The Gibbs sampler used by JM-MLMM-latent (random) is slow to converge when the number of subjects is large and the subject-level variance is poorly estimated.¹⁴

Imputation of count variables is sensitive to distributional assumptions. The simulations show that methods using Poisson regression may produce biased estimates of the comorbidity coefficient compared to a multilevel linear model or predictive mean matching. This may arise because many subjects have no comorbidities and the total number of comorbidities is bounded. Thus, a Poisson or a Negative-Binomial regression does not approximate this variable well. In situations where Poisson or Negative-Binomial hierarchical regression models approximate the data well (e.g. the number of paid assistive devices), our simulations show that these methods perform similarly to multilevel linear regression or predictive mean matching. Another option for imputing non-Normal continuous variables is to transform them so they would appear approximately Normal during the imputation step and transform the variables back in the statistical analyses step. However, these transformations may not preserve the original correlation structure between variables.^19,65
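One informal way to check whether a Poisson model is a plausible approximation for a bounded count with many zeros is to compare the observed share of zeros and the dispersion with what a Poisson distribution with the same mean would imply. The sketch below does this with the standard library only; the function name and the toy counts are hypothetical, and the bound of 20 mirrors the number of chronic conditions in the comorbidity index.

```python
import math
import statistics


def poisson_fit_check(counts, upper_bound):
    """Compare simple features of a count variable with a Poisson of equal mean."""
    n = len(counts)
    mean = statistics.fmean(counts)
    # Under a Poisson model the variance equals the mean (dispersion = 1).
    dispersion = statistics.variance(counts) / mean
    observed_zero = sum(c == 0 for c in counts) / n
    poisson_zero = math.exp(-mean)
    # Probability that a Poisson(mean) draw exceeds the logical upper bound.
    cdf = sum(math.exp(-mean) * mean ** k / math.factorial(k)
              for k in range(upper_bound + 1))
    tail_beyond_bound = max(0.0, 1.0 - cdf)
    return {
        "mean": mean,
        "dispersion": dispersion,
        "observed_zero": observed_zero,
        "poisson_zero": poisson_zero,
        "tail_beyond_bound": tail_beyond_bound,
    }


# Hypothetical counts: most subjects have zero conditions, a few have several.
check = poisson_fit_check([0, 0, 0, 0, 0, 0, 3, 4, 5, 4], upper_bound=20)
print(check)
```

A dispersion well above 1 together with an observed zero fraction far above the Poisson-implied value suggests that Poisson-based imputations may misrepresent the variable, in line with the behavior described above for the comorbidity index.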

In the univariate analysis, FCS-GLMM methods had point estimates of regression coefficients with small biases, but they resulted in poor operating characteristics for point and interval estimates of the subject-level variance when a multilevel model was used for analysis. This may arise from the computational algorithm that is used to impute binary variables. The MCMC algorithm used to sample from a multilevel logistic regression in the micemd package samples $b_{i}^{*}$ from its marginal posterior distribution N(0,Ψ_b). This can result in underestimation of the variance of b_i in the imputation process, which carries over to the analysis model.

In our simulations, FCS-Standard performs well in cases with missing predictors but results in worse operating characteristics if these incomplete variables are used as dependent variables. This finding relates to the differences between the imputation of explanatory variables and outcomes. When both explanatory variables and outcomes are missing, Little⁶⁶ suggested that imputing outcome variables provides limited information for the subsequent regression analysis. Von Hippel⁶⁷ proposed the multiple-imputation-deletion procedure, which includes all incomplete variables in the imputation step and excludes missing response variable values from the substantive analysis. However, this procedure was proposed for settings in which the response variables and the subsequent analyses are defined in advance. In this work, we considered situations where no specific statistical analysis is specified before the imputation step, and the imputed datasets can be used for multiple analyses.
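The deletion step of a multiple-imputation-then-deletion workflow can be sketched as follows: all variables are imputed, and cases whose outcome was originally missing are then dropped from each imputed dataset before the analysis model is fitted. The function name and toy data are hypothetical, and the sketch covers only the deletion step, not the imputation or the pooling.

```python
def drop_imputed_outcomes(imputed_datasets, outcome_was_missing):
    """Remove cases whose outcome was imputed from every imputed dataset.

    imputed_datasets: list of datasets, each a list of row dicts in the same order
    outcome_was_missing: list of booleans, True where the outcome was originally missing
    """
    kept = []
    for dataset in imputed_datasets:
        if len(dataset) != len(outcome_was_missing):
            raise ValueError("mask length must match the number of rows")
        kept.append([row for row, was_missing in zip(dataset, outcome_was_missing)
                     if not was_missing])
    return kept


# Hypothetical example with m = 2 imputations and three subjects.
# Subject 2's outcome y was imputed; subject 3's predictor x was imputed.
mask = [False, True, False]
imputed = [
    [{"y": 1, "x": 2.0}, {"y": 0, "x": 3.1}, {"y": 1, "x": 4.2}],
    [{"y": 1, "x": 2.0}, {"y": 1, "x": 3.1}, {"y": 1, "x": 3.8}],
]
analysis_sets = drop_imputed_outcomes(imputed, mask)
print(analysis_sets)
```

Cases with imputed predictors but observed outcomes remain in the analysis, so the imputations still contribute information about the explanatory variables.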

Our results agree with the conclusions of Huque et al²¹ that FCS-Standard provides reliable estimates for univariate hierarchical model analysis. However, we also demonstrated that FCS-Standard methods produce higher bias for coefficient estimates of level-1 covariates when analyzing multiple outcomes simultaneously in joint multilevel models. When imputing count variables, our results are consistent with the findings of Demirtas et al⁶⁵ and Kalaycioglu et al,¹⁹ who demonstrated that assuming a linear regression model for a non-Normally distributed variable may result in smaller biases compared to a mis-specified non-Gaussian model. Our simulations show that imputations using either the predictive mean matching or a multilevel linear model with rounding result in good operating characteristics of subsequent analyses.
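To make the two count-imputation strategies mentioned above concrete, the sketch below shows the matching step of predictive mean matching and a simple rounding rule for values drawn from a linear model. It is a simplified illustration: implementations in standard software also draw the regression parameters from their posterior distribution before matching, which is omitted here, and the function names, the donor pool size k, and the toy values are hypothetical.

```python
import random


def pmm_match(pred_obs, y_obs, pred_mis, k=5, rng=None):
    """Matching step of predictive mean matching.

    pred_obs: predicted means for cases with an observed value
    y_obs: the corresponding observed values (donors)
    pred_mis: predicted means for cases with a missing value
    Each missing case receives the observed value of a donor drawn at random
    from the k donors with the closest predicted means.
    """
    rng = rng or random.Random()
    imputed = []
    for p in pred_mis:
        nearest = sorted(range(len(y_obs)), key=lambda i: abs(pred_obs[i] - p))[:k]
        imputed.append(y_obs[rng.choice(nearest)])
    return imputed


def round_to_count(value, lower=0, upper=None):
    """Round a continuous imputation to an integer and clip it to the valid range."""
    rounded = int(round(value))  # note: Python rounds exact halves to the nearest even integer
    if upper is not None:
        rounded = min(rounded, upper)
    return max(rounded, lower)


# Hypothetical predicted means and observed counts.
donor_values = [0, 0, 1, 2, 4, 7]
donor_preds = [0.2, 0.4, 1.1, 2.3, 3.9, 6.5]
print(pmm_match(donor_preds, donor_values, [0.3, 4.1], k=3, rng=random.Random(1)))
print(round_to_count(-0.7), round_to_count(3.4, upper=20), round_to_count(22.9, upper=20))
```

Because predictive mean matching only returns values that were actually observed, its imputations automatically respect the bounds and the excess of zeros in a count variable, whereas the rounding rule has to enforce the bounds explicitly.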

In addition to the methods discussed in the manuscript, there are methods that can be used for imputing incomplete longitudinal data but require more technical coding and advanced statistical knowledge. One option is the fully Bayesian approach.¹⁹ It requires users to specify priors on the parameters of both the imputation and analysis models and to implement additional code in software such as Stan or PROC MCMC in SAS. At each iteration, the MCMC algorithm sequentially updates the parameters of the imputation model, draws an imputation for the incomplete variable, and updates the parameters of the analysis model. More details on this approach can be found in Kalaycioglu et al.¹⁹ Another option is to implement nonparametric modeling techniques, such as sequential regression trees, random forests, and Dirichlet process mixture models.^68–71 These methods are beyond the scope of this manuscript, because their implementation with multiple variables that experience missing values is not trivial and they require the use of advanced statistical theory and coding. A future direction of our comparative work is to consider non-parametric imputation methods and extend the study simulations to data with more incomplete variables measured at a larger number of waves.

In conclusion, for longitudinal data with a small number of waves and a limited number of variables, FCS-Standard is a computationally efficient method that yields precise and accurate estimates for both single-level and multilevel models when the analysis models are univariate regression models. However, if the analysis models comprise multivariate multilevel models, FCS-LMM-latent is a valid statistical method that produces more accurate estimates at the cost of more intensive computations.

ACKNOWLEDGMENTS

This work was supported by the National Institute on Aging at the National Institutes of Health [to HGA R01AG047891 who contributed from the Yale Claude D. Pepper Older Americans Independence Center P30AG021342 and Yale Alzheimer’s Disease Research Center P30AG066508]. RG and HGA were supported by the National Institute on Aging at the National Institutes of Health [U54AG063546], which funds the Imbedded Pragmatic Alzheimer’s Disease and AD-Related Dementias Clinical Trials Collaboratory (NIA IMPACT Collaboratory). The National Health and Aging Trends Study (NHATS) is sponsored by the National Institute on Aging [U01AG032947] through a cooperative agreement with the Johns Hopkins Bloomberg School of Public Health. Content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders played no role in the design, execution, analysis, or interpretation of the data or writing of the study.

APPENDIX

A SIMULATION RESULTS OF UNIVARIATE MODEL ANALYSIS FOR CONFIGURATION 3 WITH INTERMITTENT MISSING DATA PATTERN

B SIMULATION RESULTS OF UNIVARIATE MODEL ANALYSIS FOR CONFIGURATIONS 1 AND 2

TABLE A1.

The mean relative bias (MRB), the root mean squared error (RMSE), average interval width, and empirical coverage of the 95% CI of the coefficient estimates and the subject-level variance estimate for the univariate analysis in Configuration 3 with intermittent missing data pattern. The incomplete predictors are marked in bold.

Estimates	Metrics	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Intercept ${\hat{γ}}_{0}^{comp} = - {3.89}^{* *}$	MRB	−0.018	−0.031	0.080	−0.045	0.039	0.086	0.153	0.101	0.170	0.224
	RMSE	0.170	0.176	0.166	0.201	0.236	0.219	0.252	0.480	0.535	0.255
	Width	1.170	1.154	1.149	1.123	1.027	1.009	1.075	1.116	1.170	1.026
	Coverage(%)	100	100	100	99	100	99	97	76	67	99
Comorbidity Index ${\hat{γ}}_{1}^{comp} = {0.38}^{* *}$	MRB	0.046	0.011	0.027	−0.099	0.002	−0.027	0.050	−0.001	−0.197	−0.332
	RMSE	0.026	0.029	0.152	0.046	0.060	0.141	0.027	0.041	0.116	0.239
	Width	0.127	0.127	0.115	0.122	0.112	0.106	0.113	0.124	0.120	0.089
	Coverage(%)	98	96	0	85	44	0	95	90	3	0
BMI ${\hat{γ}}_{2}^{comp} = - {0.02}^{*}$	MRB	−0.036	−0.050	−0.401	−0.112	−0.153	−0.371	−0.017	−0.099	−0.301	−0.634
	RMSE	0.004	0.004	0.005	0.004	0.004	0.006	0.005	0.010	0.010	0.011
	Width	0.030	0.030	0.029	0.029	0.026	0.026	0.027	0.029	0.033	0.024
	Coverage(%)	100	100	99	100	100	99	99	94	98	62
Paid assistive devices ${\hat{γ}}_{3}^{comp} = {0.36}^{* *}$	MRB	0.031	0.089	0.185	−0.089	−0.247	−0.154	−0.140	0.085	−0.126	0.120
	RMSE	0.041	0.036	0.035	0.043	0.025	0.025	0.048	0.026	0.077	0.122
	Width	0.183	0.185	0.181	0.172	0.159	0.158	0.154	0.167	0.167	0.146
	Coverage(%)	96	99	100	94	100	100	88	99	63	1
Age ${\hat{γ}}_{4}^{comp} = {0.11}^{* *}$	MRB	−0.013	−0.018	−0.001	−0.043	−0.053	−0.047	−0.007	−0.120	−0.132	0.057
	RMSE	0.015	0.015	0.014	0.025	0.017	0.014	0.017	0.029	0.025	0.016
	Width	0.111	0.111	0.110	0.109	0.098	0.098	0.106	0.110	0.108	0.102
	Coverage(%)	100	100	100	100	100	100	100	99	99	100
Gender ${\hat{γ}}_{5}^{comp} = {0.28}^{* *}$	MRB	−0.051	−0.052	−0.241	−0.254	−0.212	−0.298	−0.088	−0.268	−0.350	−0.458
	RMSE	0.040	0.041	0.077	0.076	0.067	0.088	0.052	0.081	0.102	0.132
	Width	0.311	0.307	0.304	0.299	0.272	0.272	0.293	0.305	0.301	0.280
	Coverage(%)	100	100	98	100	99	94	100	99	95	62
Self-rated health ${\hat{γ}}_{6}^{comp} = {0.20}^{* *}$	MRB	−0.002	0.011	0.354	0.000	0.000	0.217	0.014	0.068	0.287	0.668
	RMSE	0.022	0.022	0.073	0.018	0.019	0.047	0.026	0.023	0.061	0.135
	Width	0.160	0.160	0.159	0.157	0.141	0.140	0.151	0.159	0.161	0.141
	Coverage(%)	100	100	66	100	100	94	99	100	87	0
Have med. bill paid off overtime ${\hat{γ}}_{7}^{comp} = {0.34}^{*}$	MRB	−0.035	−0.046	0.010	−0.204	−0.108	−0.072	−0.033	−0.233	−0.195	0.092
	RMSE	0.059	0.059	0.062	0.048	0.051	0.055	0.093	0.059	0.076	0.091
	Width	0.481	0.481	0.473	0.469	0.424	0.419	0.453	0.478	0.473	0.444
	Coverage(%)	100	100	100	100	100	100	97	100	100	100
Took prescribed med. ${\hat{γ}}_{8}^{comp} = 0.30$	MRB	0.031	0.039	0.429	0.081	0.261	0.495	−0.129	−0.247	0.081	1.055
	RMSE	0.104	0.101	0.165	0.065	0.111	0.168	0.215	0.105	0.081	0.331
	Width	0.722	0.722	0.710	0.677	0.632	0.630	0.658	0.682	0.683	0.637
	Coverage(%)	100	99	98	100	100	99	97	100	100	47
Time ${\hat{γ}}_{9}^{comp} = {0.10}^{* *}$	MRB	0.009	0.016	−0.191	−0.091	−0.143	−0.260	−0.083	0.510	0.473	−0.594
	RMSE	0.017	0.018	0.025	0.016	0.029	0.021	0.024	0.017	0.020	0.019
	Width	0.110	0.111	0.110	0.105	0.105	0.104	0.098	0.109	0.108	0.102
	Coverage(%)	100	100	99	100	95	99	97	100	99	100
Random effect ${\hat{σ}}_{b_{comp}}^{2} = {1.13}^{* *}$	MRB	−0.027	−0.027	−0.041	−0.018	−0.297	−0.305	−0.010	0.007	−0.017	−0.099
	RMSE	0.053	0.055	0.063	0.042	0.338	0.346	0.067	0.038	0.042	0.120
	Width	0.256	0.257	0.257	0.247	0.232	0.232	0.219	0.250	0.253	0.225
	Coverage(%)	98	97	96	100	0	0	90	100	100	53


TABLE B2.

Estimates	Metrics	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Intercept ${\hat{γ}}_{0}^{comp} = - {3.89}^{* *}$	RMSE	0.133	0.134	0.144	0.115	0.164	0.145	0.159	0.384	0.349	0.153
	Width	1.156	1.143	1.125	1.119	1.047	1.041	1.078	1.116	1.116	1.121
	Coverage	100	100	100	100	100	100	100	91	96	100
Comorbidity Index ${\hat{γ}}_{1}^{comp} = {0.38}^{* *}$	RMSE	0.020	0.018	0.132	0.018	0.027	0.110	0.018	0.024	0.075	0.150
	Width	0.117	0.118	0.117	0.115	0.108	0.106	0.114	0.115	0.185	0.101
	Coverage	100	99	0	100	98	0	99	100	75	0
BMI ${\hat{γ}}_{2}^{comp} = - {0.02}^{*}$	RMSE	0.004	0.004	0.006	0.004	0.004	0.005	0.005	0.006	0.005	0.005
	Width	0.029	0.028	0.028	0.028	0.026	0.026	0.027	0.028	0.028	0.028
	Coverage	100	100	99	100	100	100	99	100	100	100
Paid assistive devices ${\hat{γ}}_{3}^{comp} = {0.36}^{* *}$	RMSE	0.026	0.024	0.024	0.055	0.023	0.022	0.035	0.022	0.021	0.022
	Width	0.151	0.153	0.149	0.147	0.142	0.141	0.141	0.150	0.148	0.148
	Coverage	99	100	100	84	100	100	96	100	100	100
Age ${\hat{γ}}_{4}^{comp} = {0.11}^{* *}$	RMSE	0.015	0.016	0.015	0.014	0.016	0.014	0.018	0.024	0.022	0.013
	Width	0.113	0.113	0.111	0.110	0.103	0.099	0.106	0.113	0.111	0.109
	Coverage	100	100	100	100	100	100	100	100	100	100
Gender ${\hat{γ}}_{5}^{comp} = {0.28}^{* *}$	RMSE	0.039	0.040	0.065	0.052	0.050	0.072	0.053	0.084	0.099	0.077
	Width	0.31	0.309	0.305	0.305	0.287	0.279	0.295	0.309	0.307	0.302
	Coverage	100	100	100	100	100	99	100	99	98	99
Self-rated health ${\hat{γ}}_{6}^{comp} = {0.20}^{* *}$	RMSE	0.021	0.021	0.053	0.023	0.016	0.046	0.028	0.027	0.051	0.081
	Width	0.161	0.163	0.159	0.159	0.147	0.144	0.155	0.160	0.178	0.157
	Coverage	100	100	96	100	100	97	99	100	94	49
Have med. bill paid off overtime ${\hat{γ}}_{7}^{comp} = {0.34}^{*}$	RMSE	0.065	0.064	0.065	0.054	0.058	0.060	0.099	0.058	0.060	0.078
	Width	0.484	0.479	0.478	0.475	0.443	0.436	0.461	0.488	0.481	0.471
	Coverage	100	100	100	100	100	100	98	100	100	100
Took prescribed med. ${\hat{γ}}_{8}^{comp} = 0.30$	RMSE	0.078	0.084	0.135	0.066	0.089	0.144	0.099	0.197	0.148	0.197
	Width	0.701	0.724	0.689	0.674	0.648	0.638	0.669	0.686	0.697	0.674
	Coverage	100	100	99	100	100	100	100	97	100	99
Time ${\hat{γ}}_{9}^{comp} = {0.10}^{* *}$	RMSE	0.02	0.021	0.046	0.015	0.022	0.020	0.018	0.018	0.017	0.019
	Width	0.104	0.107	0.112	0.103	0.101	0.103	0.099	0.104	0.108	0.104
	Coverage	100	99	75	100	99	100	98	99	99	98
Random effect ${\hat{σ}}_{b_{comp}}^{2} = {1.13}^{* *}$	RMSE	0.042	0.040	0.060	0.029	0.204	0.222	0.044	0.041	0.037	0.041
	Width	0.239	0.237	0.240	0.237	0.213	0.218	0.218	0.237	0.236	0.237
	Coverage	100	100	99	100	0	0	99	99	100	100


TABLE B3.

The root mean squared error (RMSE), average interval width, and empirical coverage (%) of the 95% CI of the coefficient estimates and the subject-level variance estimate for the univariate analysis in Configuration 2. The incomplete predictors are marked in bold.

Estimates	Metrics	FCS-Standard (LM)	FCS-Standard (PMM)	FCS-Standard (Poisson)	FCS-LMM-latent	FCS-GLMM (Gauss.)	FCS-GLMM (Poisson)	JM-General Location	JM-MLMM (common)	JM-MLMM (random)	JM-MGLMM
Intercept ${\hat{γ}}_{0}^{comp} = - {3.89}^{* *}$	RMSE	0.138	0.140	0.139	0.154	0.239	0.228	0.175	0.483	0.520	0.222
	Width	1.167	1.131	1.129	1.125	1.047	1.027	1.096	1.111	1.134	1.117
	Coverage	100	100	100	100	98	100	99	71	67	100
Comorbidity Index ${\hat{γ}}_{1}^{comp} = {0.38}^{* *}$	RMSE	0.018	0.019	0.133	0.015	0.025	0.108	0.017	0.023	0.044	0.15
	Width	0.119	0.118	0.116	0.117	0.108	0.103	0.113	0.115	0.130	0.102
	Coverage	100	100	0	100	99	0	100	100	89	0
BMI ${\hat{γ}}_{2}^{comp} = - {0.02}^{*}$	RMSE	0.004	0.004	0.005	0.003	0.004	0.003	0.004	0.007	0.008	0.008
	Width	0.029	0.028	0.028	0.028	0.026	0.026	0.027	0.029	0.030	0.027
	Coverage	100	100	100	100	99	100	100	99	97	98
Paid assistive devices ${\hat{γ}}_{3}^{comp} = {0.36}^{* *}$	RMSE	0.024	0.022	0.023	0.054	0.022	0.020	0.041	0.020	0.019	0.022
	Width	0.150	0.150	0.149	0.147	0.145	0.143	0.143	0.149	0.152	0.15
	Coverage	100	100	100	91	100	100	93	100	100	100
Age ${\hat{γ}}_{4}^{comp} = {0.11}^{* *}$	RMSE	0.016	0.015	0.015	0.015	0.019	0.016	0.017	0.027	0.026	0.013
	Width	0.113	0.112	0.112	0.112	0.103	0.101	0.107	0.111	0.112	0.109
	Coverage	100	100	100	100	100	100	100	99	100	100
Gender ${\hat{γ}}_{5}^{comp} = {0.28}^{* *}$	RMSE	0.041	0.039	0.071	0.053	0.049	0.074	0.047	0.090	0.096	0.078
	Width	0.313	0.315	0.306	0.304	0.285	0.279	0.298	0.314	0.304	0.303
	Coverage	100	100	99	100	100	98	99	98	96	98
Self-rated health ${\hat{γ}}_{6}^{comp} = {0.20}^{* *}$	RMSE	0.021	0.023	0.051	0.023	0.017	0.045	0.025	0.025	0.036	0.074
	Width	0.162	0.163	0.162	0.159	0.15	0.146	0.154	0.158	0.161	0.156
	Coverage	100	100	93	100	100	97	100	100	99	55
Have med. bill paid off overtime ${\hat{γ}}_{7}^{comp} = {0.34}^{*}$	RMSE	0.051	0.058	0.049	0.050	0.053	0.055	0.096	0.055	0.058	0.077
	Width	0.485	0.493	0.476	0.473	0.435	0.440	0.459	0.480	0.485	0.466
	Coverage	100	100	100	100	100	100	96	100	100	100
Took prescribed med. ${\hat{γ}}_{8}^{comp} = 0.30$	RMSE	0.084	0.087	0.138	0.067	0.096	0.153	0.130	0.229	0.205	0.189
	Width	0.731	0.708	0.703	0.667	0.654	0.639	0.657	0.698	0.687	0.676
	Coverage	100	100	100	100	100	100	99	95	95	99
Time ${\hat{γ}}_{9}^{comp} = {0.10}^{* *}$	RMSE	0.019	0.018	0.044	0.014	0.022	0.022	0.017	0.017	0.017	0.017
	Width	0.106	0.105	0.110	0.104	0.104	0.103	0.099	0.104	0.104	0.105
	Coverage	99	100	80	100	100	100	100	100	100	100
Random effect ${\hat{σ}}_{b_{comp}}^{2} = {1.13}^{* *}$	RMSE	0.041	0.040	0.065	0.036	0.200	0.218	0.047	0.043	0.041	0.046
	Width	0.237	0.242	0.235	0.240	0.214	0.215	0.219	0.236	0.240	0.236
	Coverage	100	100	96	100	0	0	97	100	99	100


Footnotes

FINANCIAL DISCLOSURE

None reported.

CONFLICT OF INTEREST

The authors declare no potential conflict of interests.

DATA AVAILABILITY STATEMENT

The data that support the findings of the simulation study are openly available from the National Health and Aging Trends Study (NHATS) at https://nhats.org/researcher/data-access/public-use-files. The data that support the findings of the real data analysis are available at https://nhats.org/researcher/data-access/sensitive-data-files?id=restricted_data_files. Restrictions apply to the availability of these data, which were used under license for this study.


