Author manuscript; available in PMC: 2014 Oct 9.
Published in final edited form as: Stat Interface. 2014 Jul 1;6(3):315–324. doi: 10.4310/SII.2013.v6.n3.a2

A note on the relationships between multiple imputation, maximum likelihood and fully Bayesian methods for missing responses in linear regression models

Qingxia Chen1,† and Joseph G. Ibrahim2
PMCID: PMC4190159  NIHMSID: NIHMS600768  PMID: 25309677

Abstract

Multiple imputation (MI), maximum likelihood (ML), and fully Bayesian (FB) methods are the three most commonly used model-based approaches to missing data problems. Although it is easy to show that when the responses are missing at random (MAR) the complete case (CC) analysis is unbiased and efficient, the aforementioned methods are still commonly used in practice for this setting. To examine the performance of and relationships between these three methods in this setting, we derive and investigate small sample and asymptotic expressions of the estimates and standard errors, and examine how the estimates from the three approaches are related in the linear regression model when the responses are MAR. We show that when the responses are MAR in the linear model, the estimates of the regression coefficients using these three methods are asymptotically equivalent to the CC estimates under general conditions. A simulation study and a real data set from a liver cancer clinical trial are used to compare the properties of these methods when the responses are MAR.

Keywords and phrases: Missing data, Multiple imputation, Maximum likelihood, Fully Bayesian, Missing response, Missing at random

1. INTRODUCTION

Missing data are very common in various experimental settings, including clinical trials, sample surveys, and environmental studies. There are essentially three major likelihood-based approaches for handling missing data in a regression problem: i) Maximum Likelihood (ML), ii) Multiple Imputation (MI), and iii) Fully Bayesian (FB). The EM algorithm is a technique often used to obtain ML estimates and is useful when the likelihood function of the observed data has no closed form. Recent developments in missing data methods also include the empirical likelihood method [18] and parametric fractional imputation [10], among others. In this paper, we investigate theoretical connections between the MI, ML (especially within the EM framework), and FB approaches in the linear regression model when the response variable is missing at random (MAR).

It is well known that when the response variable is MAR and the covariates are fully observed, the likelihood function of the observed data is the same as the complete case likelihood (i.e. the likelihood obtained by omitting all cases with missing values), and therefore the ML estimates are identical to the complete case (CC) estimates. However, this result is not obvious under the MI and FB approaches. Although the CC estimates are unbiased and efficient under MAR responses, the MI and FB methods are still used in practice since many researchers are unaware of this special property of the CC estimates. To study the relationships between the three methods in this context, we consider MAR responses in the linear model and investigate the small and large sample properties of the estimates, and derive analytic and asymptotic expressions of the estimates and standard errors for the MI, ML and FB approaches. Under noninformative priors for the MI and FB methods, we show that the estimates and their standard errors under these three approaches are asymptotically equivalent to the CC estimates.

There is a large literature on MI, ML, and FB. [21] provide asymptotic results for MI with MAR responses in linear models. [17], [15], and [20] discuss theoretical properties of proper and improper MI. [22] and [19] propose consistent variance estimators for MI. For ML, one of the earliest references is [12]. [6], [11], and [8] propose the "EM by the method of weights" and the Monte Carlo EM algorithm (MCEM) for the ML framework in generalized linear models (GLMs). [16] examine the problem of using EM to obtain the asymptotic covariance matrix of the parameter estimates. [7] discuss FB methods for MAR covariates in GLMs. Our work differs from the previous literature in two major ways. First, for MAR responses, we derive both the small and large sample properties of the estimates, whereas previous work focuses mainly on large sample properties. Second, the main purpose of this paper is to investigate the theoretical relationships between MI, ML, and FB, which were previously investigated only through simulations.

The rest of this paper is organized as follows. In Section 2.1, we derive the small sample and asymptotic expressions of the estimates and standard errors for proper MI. In Section 2.2, we derive the expressions for ML via the EM algorithm. The expressions for FB are derived in Section 2.3. In Section 3, we conduct a simulation to demonstrate our results. A real data analysis of a liver cancer clinical trial is given in Section 4, and we conclude the paper with a brief discussion in Section 5.

2. MAR RESPONSES IN THE LINEAR MODEL

Consider the linear model

$y = X\beta + e, \qquad (1)$

where $\beta$ is a $p \times 1$ vector of unknown parameters, $X$ is an $n \times p$ full rank matrix of explanatory variables including an intercept, and $e$ is an $n \times 1$ vector of random errors with $e \sim N_n(0, \sigma^2 I)$, where $\sigma^2$ is assumed unknown throughout. We assume throughout that $X$ is fully observed and the components of $y$ are MAR. For simplicity, we rearrange the data so that $y_1 = (y_1, \ldots, y_{n_1})'$ is fully observed and $y_2 = (y_{n_1+1}, \ldots, y_n)'$ is MAR, and assume that the corresponding $n_1 \times p$ and $n_2 \times p$ matrices of fixed covariates $X_1$ and $X_2$ for $y_1$ and $y_2$ are of full rank, $n_1 + n_2 = n$, and $p < n_1$. Therefore, we write $y = (y_1', y_2')'$ and $X = (X_1', X_2')'$.

As shown in [13], the maximum likelihood estimates of β and σ2 are the same as the CC estimates in the linear regression model, in which cases with any missing values are simply discarded. In fact, this is true for any regression model with MAR responses satisfying conditional independence between y1 and y2 given X and γ, where γ = (β, σ2), since the likelihood function of the observed data Dobs = (y1, X) is given by

$L(\gamma \mid D_{obs}) = \int p(y_1, y_2 \mid X, \gamma)\,dy_2 = \int p(y_1 \mid X, \gamma)\,p(y_2 \mid X, \gamma)\,dy_2 = p(y_1 \mid X, \gamma),$

which is the CC likelihood.

The standard results for the linear regression model with MAR responses are

$\hat\beta_{ML} = \hat\beta_{CC} = (X_1'X_1)^{-1}X_1'y_1, \qquad (2)$

which is independent of

$\hat\sigma^2_{ML} = \hat\sigma^2_{CC} = y_1'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)y_1/n_1 \qquad (3)$

with $E(\hat\beta_{ML}) = \beta$ and $E(\hat\sigma^2_{ML}) = (n_1 - p)\sigma^2/n_1$. The variances of the estimates are

$\mathrm{Var}(\hat\beta_{ML}) = \sigma^2(X_1'X_1)^{-1} \qquad (4)$

and

$\mathrm{Var}(\hat\sigma^2_{ML}) = 2(n_1 - p)\sigma^4/n_1^2. \qquad (5)$

Clearly, we can adjust the estimate of $\sigma^2$ to obtain an unbiased estimate by letting $\tilde\sigma^2_{ML} = n_1\hat\sigma^2_{ML}/(n_1 - p)$. It is worth noting that if we apply the EM algorithm in this setting and use Louis's method [14] to obtain the variance estimates, we get the same variance estimate of $\hat\beta_{ML}$ as in Eq. (4), and the variance estimate of $\hat\sigma^2_{ML}$ equals $2\hat\sigma^4_{ML}/n_1$, which is larger than Eq. (5) in small samples but asymptotically equivalent as $n_1 \to \infty$. In the next three subsections, we explore the small and large sample properties of the estimators under the MI, ML via the EM algorithm, and FB methods for MAR responses under model (1).
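For concreteness, a minimal sketch of the closed-form CC (equivalently ML) computations in equations (2)–(5) is given below (our illustration, assuming NumPy; the function name and the simulated toy data are ours, not part of the paper).

```python
import numpy as np

def cc_estimates(y1, X1):
    """Complete-case (= ML) estimates for the linear model with MAR responses.

    Implements Eqs. (2)-(5): beta_hat, sigma2_hat, and their variances,
    using only the n1 fully observed responses y1 and their covariates X1.
    """
    n1, p = X1.shape
    XtX_inv = np.linalg.inv(X1.T @ X1)
    beta_hat = XtX_inv @ X1.T @ y1                       # Eq. (2)
    resid = y1 - X1 @ beta_hat
    sigma2_hat = resid @ resid / n1                      # Eq. (3), divisor n1
    var_beta = sigma2_hat * XtX_inv                      # Eq. (4) with sigma2_hat plugged in
    var_sigma2 = 2 * (n1 - p) * sigma2_hat**2 / n1**2    # Eq. (5) with sigma2_hat plugged in
    return beta_hat, sigma2_hat, var_beta, var_sigma2

# toy check with simulated data (values are illustrative only)
rng = np.random.default_rng(0)
X1 = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y1 = X1 @ np.array([1.0, 1.5, -1.0]) + rng.normal(size=100)
print(cc_estimates(y1, X1)[0])
```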

2.1 Multiple imputation (MI)

Multiple imputation has emerged as a very popular technique for inference in missing data problems. In this section, we consider the precision parameter instead of the variance parameter for the development of MI. Therefore, we assume γ* = (β, τ), where τ = 1/σ2. Proper MI is based on creating imputed datasets in which the missing values are sampled from the posterior predictive distribution of the missing data given the observed data, given by

$p(D_{mis} \mid D_{obs}) = \int p(D_{mis} \mid D_{obs}, \gamma^*)\,\pi(\gamma^* \mid D_{obs})\,d\gamma^*, \qquad (6)$

where $D_{obs} = (y_1, X_1, X_2)$ and $D_{mis} = y_2$ for the current setting. Here $\pi(\gamma^* \mid D_{obs})$ is the posterior distribution of $\gamma^*$ based on the observed data, given by $\pi(\gamma^* \mid D_{obs}) \propto \{\int L(\gamma^* \mid D_{mis}, D_{obs})\,dD_{mis}\}\,\pi(\gamma^*)$, where $L(\gamma^* \mid D_{mis}, D_{obs})$ is the complete-data likelihood, $\int L(\gamma^* \mid D_{mis}, D_{obs})\,dD_{mis}$ is the observed-data likelihood, and $\pi(\gamma^*)$ is the prior distribution of $\gamma^*$. Assume $D_{mis}^{(l)}$, $l = 1, \ldots, m$, are draws of $D_{mis}$ from the posterior predictive distribution $p(D_{mis} \mid D_{obs})$ in Eq. (6). Let $\hat\gamma^*_l$ and $\hat V_l$ denote the posterior mean of $\gamma^*$ and its covariance matrix computed from the $l$th imputed data set $(y_1, y_2^{(l)})$. Then the MI estimate of $\gamma^*$ is $\hat\gamma^*_{MI} = m^{-1}\sum_{l=1}^m \hat\gamma^*_l$, and the estimate of the variance of $\hat\gamma^*_{MI}$ is

$\widehat{\mathrm{Var}}(\hat\gamma^*_{MI}) = \hat{V} + \Big(1 + \frac{1}{m}\Big)\hat{B}, \qquad (7)$

where $\hat{V} = m^{-1}\sum_{l=1}^m \hat{V}_l$ is the within-imputation variance and $\hat{B} = \sum_{l=1}^m (\hat\gamma^*_l - \hat\gamma^*_{MI})(\hat\gamma^*_l - \hat\gamma^*_{MI})'/(m - 1)$ is the between-imputation variance. Several imputation schemes have been proposed for MI. In this paper, we concentrate on proper MI using the improper prior

$\pi(\gamma^*) \propto \tau^{-1}. \qquad (8)$

We note that in MI, the imputation model can be different from the analysis model, but in this paper we only consider the case in which the two models are the same.
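To make the combining rule concrete, here is a minimal sketch of the pooling step in Eq. (7) (our illustration, assuming NumPy; the function and argument names are ours). It takes the m per-imputation estimates and within-imputation covariance matrices and returns the MI estimate and its variance estimate.

```python
import numpy as np

def rubin_pool(estimates, within_vars):
    """Combine m imputation-specific estimates by Rubin's rules, Eq. (7).

    estimates   : (m, k) array, one row per imputed data set.
    within_vars : (m, k, k) array of within-imputation covariance matrices.
    Returns the pooled estimate and its estimated covariance matrix.
    """
    m = estimates.shape[0]
    gamma_mi = estimates.mean(axis=0)
    V_bar = within_vars.mean(axis=0)                      # within-imputation variance
    dev = estimates - gamma_mi
    B = dev.T @ dev / (m - 1)                             # between-imputation variance
    total_var = V_bar + (1 + 1 / m) * B                   # Eq. (7)
    return gamma_mi, total_var
```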

Theorem 2.1 gives the small sample behavior of the estimates of $\beta$ and $\sigma^2$ for proper MI assuming the improper prior $\pi(\gamma^*) \propto \tau^{-1}$. Large sample properties of the estimates are also given under some general conditions. To derive these properties, we need the following lemma.

Lemma 2.1

If the $n \times 1$ random vector $z$ has a multivariate t distribution, denoted $S_n(v, \mu, V)$, with density proportional to $\big[1 + \frac{1}{v}(z - \mu)'V^{-1}(z - \mu)\big]^{-(v + n)/2}$, and $A$ and $B$ are matrices of constants, then

  1. $E(z'Az) = \frac{v}{v-2}\,\mathrm{tr}(AV) + \mu'A\mu$, when $v > 2$;

  2. $E(zz'Az) = \frac{v}{v-2}\big[VA\mu + VA'\mu + \mu\,\mathrm{tr}(AV)\big] + \mu\mu'A\mu$, when $v > 2$;

  3. $E[(z'Az)(z'Bz)] = \frac{v^2}{(v-2)(v-4)}\big[\mathrm{tr}(AV)\,\mathrm{tr}(BV) + \mathrm{tr}(AVBV) + \mathrm{tr}(AVB'V)\big] + \frac{v}{v-2}\big[\mathrm{tr}(AV)\,\mu'B\mu + \mu'A\mu\,\mathrm{tr}(BV) + \mu'(AVB + A'VB + AVB' + A'VB')\mu\big] + \mu'A\mu\,\mu'B\mu$, when $v > 4$.
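As an informal numerical check of part (i) (our illustration, assuming NumPy; it is not part of the paper), the following sketch generates multivariate t draws from the normal/chi-square representation used in the Appendix and compares the Monte Carlo mean of $z'Az$ with $\frac{v}{v-2}\mathrm{tr}(AV) + \mu'A\mu$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, v = 3, 10                                   # dimension and degrees of freedom (v > 2)
mu = np.array([1.0, -0.5, 2.0])
V = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.0, 0.2],
              [0.0, 0.2, 1.5]])
A = rng.normal(size=(n, n))                    # an arbitrary (nonsymmetric) constant matrix

# z = x / sqrt(y/v) + mu with x ~ N(0, V), y ~ chi^2_v, as in the Appendix
L = np.linalg.cholesky(V)
x = rng.normal(size=(200_000, n)) @ L.T
y = rng.chisquare(v, size=200_000)
z = x / np.sqrt(y / v)[:, None] + mu

mc = np.mean(np.einsum('ij,jk,ik->i', z, A, z))          # Monte Carlo estimate of E(z'Az)
exact = v / (v - 2) * np.trace(A @ V) + mu @ A @ mu      # Lemma 2.1 (i)
print(mc, exact)                                          # the two numbers should be close
```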

The proof of Lemma 2.1 is given in the Appendix. For the linear regression model (1) with the prior in Eq. (8), the posterior distribution of $\gamma^*$ based on the observed data is

$p(\gamma^* \mid y_1) \propto p(y_1 \mid \gamma^*)\,\pi(\gamma^*) \propto \tau^{n_1/2 - 1}\exp\Big\{-\frac{\tau}{2}(y_1 - X_1\beta)'(y_1 - X_1\beta)\Big\},$

and the posterior predictive distribution is

$p(y_2 \mid y_1) = \int\!\!\int p(y_2 \mid y_1, \gamma^*)\,p(\gamma^* \mid y_1)\,d\beta\,d\tau \propto \Big[1 + \frac{(y_2 - \hat y_2)'H(y_2 - \hat y_2)}{(n_1 - p)s^2}\Big]^{-\frac{n_1 + n_2 - p}{2}},$

where $\hat y_2 = X_2(X_1'X_1)^{-1}X_1'y_1$, $s^2 = y_1'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)y_1/(n_1 - p)$, and $H = I - X_2(X'X)^{-1}X_2'$. Since $X'X = X_1'X_1 + X_2'X_2$ and $X_1'X_1$ is of full rank, it can be shown that $H$ is positive definite with inverse $H^{-1} = I + X_2(X_1'X_1)^{-1}X_2'$. Hence, the posterior predictive distribution of $[y_2 \mid y_1]$ is a multivariate t distribution given by

$y_2 \mid y_1 \sim S_{n - n_1}\big(n_1 - p,\; X_2(X_1'X_1)^{-1}X_1'y_1,\; s^2 H^{-1}\big). \qquad (9)$
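To make the draw in Eq. (9) concrete, the following sketch (our illustration, assuming NumPy; the function name is ours) generates one set of imputed responses using the normal/chi-square representation of the multivariate t.

```python
import numpy as np

def draw_y2_posterior_predictive(y1, X1, X2, rng):
    """One draw of the missing responses y2 from the multivariate t in Eq. (9)."""
    n1, p = X1.shape
    n2 = X2.shape[0]
    XtX1_inv = np.linalg.inv(X1.T @ X1)
    beta_cc = XtX1_inv @ X1.T @ y1
    s2 = np.sum((y1 - X1 @ beta_cc) ** 2) / (n1 - p)
    loc = X2 @ beta_cc                                    # X2 (X1'X1)^{-1} X1' y1
    H_inv = np.eye(n2) + X2 @ XtX1_inv @ X2.T             # scale matrix is s^2 H^{-1}
    df = n1 - p
    x = np.linalg.cholesky(s2 * H_inv) @ rng.normal(size=n2)
    w = rng.chisquare(df)
    return loc + x / np.sqrt(w / df)                      # multivariate t draw
```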

Theorem 2.1 establishes the small and large sample properties of the estimates based on the MI method.

Theorem 2.1

For the linear regression model (1) with prior (8), let $y_2^{(l)}$, $l = 1, \ldots, m$, be the samples of $y_2$ drawn from $[y_2 \mid y_1]$ in Eq. (9), and write $y^{(l)} = (y_1', y_2^{(l)\prime})'$. Then

  1. the multiple imputation (MI) estimate of $\beta$ is
    $\hat\beta_{MI} = \frac{1}{m}\sum_{l=1}^m (X'X)^{-1}\big(X_1'y_1 + X_2'y_2^{(l)}\big), \qquad (10)$
    with mean $E(\hat\beta_{MI}) = \beta$ and variance
    $\mathrm{Var}(\hat\beta_{MI}) = \sigma^2(X_1'X_1)^{-1} + \frac{\sigma^2(n_1 - p)}{m(n_1 - p - 2)}\big[(X_1'X_1)^{-1} - (X'X)^{-1}\big]. \qquad (11)$
  2. The MI estimate of $\sigma^2$ is
    $\hat\sigma^2_{MI} = \frac{1}{m}\sum_{l=1}^m y^{(l)\prime}\big(I - X(X'X)^{-1}X'\big)y^{(l)}/(n - 2), \qquad (12)$
    with mean
    $E(\hat\sigma^2_{MI}) = \frac{(n - p - 2)(n_1 - p)}{(n - 2)(n_1 - p - 2)}\sigma^2$
    and variance
    $\mathrm{Var}(\hat\sigma^2_{MI}) = \frac{2(n_1 - p)(n - p - 2)^2}{(n - 2)^2(n_1 - p - 2)^2}\sigma^4 + a_1\sigma^4/m, \qquad (13)$

    where $a_1 = \frac{2(n - n_1)(n - p - 2)(n_1 - p)(n_1 - p + 2)}{(n - 2)^2(n_1 - p - 2)^2(n_1 - p - 4)}$.

From Theorem 2.1, it can be shown that the MI estimates of $\beta$ and $\sigma^2$, as well as their variances, are asymptotically equivalent to the CC estimates. Furthermore, after some algebra, we can show that when $n > n_1 > p + 2$,

$E(\hat\sigma^2_{MI} \mid y_1) = \frac{n_1(n - p - 2)}{(n - 2)(n_1 - p - 2)}\hat\sigma^2_{CC} > \hat\sigma^2_{CC},$

and

$\mathrm{Var}(\hat\sigma^2_{MI}) = \frac{n_1^2(n - p - 2)^2}{(n_1 - p - 2)^2(n - 2)^2}\mathrm{Var}(\hat\sigma^2_{CC}) + \frac{2(n - p + 2)(n - n_1)}{m(n_1 - p + 2)(n + 2)}\sigma^4 > \mathrm{Var}(\hat\sigma^2_{CC}),$

where $\hat\sigma^2_{CC}$ is given in Eq. (3). We note here that throughout this paper we do not consider the situation in which the number of regression coefficients $p$ increases as $n$ increases; $p$ is either fixed or increases at a slower rate than $n$.
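For example (an illustrative calculation of ours with values roughly matching the simulation design of Section 3, not a result reported in the paper), with $n = 250$, $n_1 = 200$, and $p = 3$, the inflation factor in the first display is $n_1(n - p - 2)/\{(n - 2)(n_1 - p - 2)\} = 200 \times 245 / (248 \times 195) \approx 1.013$, so conditionally on $y_1$ the MI estimate of $\sigma^2$ exceeds $\hat\sigma^2_{CC}$ by only about 1.3%.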

Remark 2.1

$E(\hat\sigma^2_{MI})$ does not depend on $m$, while $\mathrm{Var}(\hat\sigma^2_{MI})$ is a function of $m$; therefore, increasing the number of imputations $m$ does not reduce the bias of $\hat\sigma^2_{MI}$, but it does reduce its variance.

Remark 2.2

$(\hat\beta_{MI} \mid y_1)/\hat\beta_{CC} \to 1$ and $(\tilde\sigma^2_{MI} \mid y_1)/\tilde\sigma^2_{CC} \to 1$ as $m \to \infty$, where $\tilde\sigma^2_{MI} = \frac{(n - 2)(n_1 - p - 2)}{(n - p - 2)(n_1 - p)}\hat\sigma^2_{MI}$ and $\tilde\sigma^2_{CC} = \frac{n_1}{n_1 - p}\hat\sigma^2_{CC}$ are unbiased estimators of $\sigma^2$. However, this does not hold for fixed $m$.

We also note that $\mathrm{Var}(\hat\beta_{MI}) > \mathrm{Var}(\hat\beta_{CC})$ and $\mathrm{Var}(\tilde\sigma^2_{MI}) > \mathrm{Var}(\tilde\sigma^2_{CC})$, which imply that the MI estimates are less efficient than the CC estimates. This is because the ML estimates for MAR responses in the linear model are the same as the CC estimates, and the ML estimates are the most efficient if the model is correct. The extra variability of the MI estimates is induced by the sampling involved in constructing the estimator. Although the MI estimates can be improved in the setting of MAR responses in linear regression with small samples, that is not the main aim of this paper; our goal is to investigate the relationships between the MI, ML, and FB approaches. The small sample properties of MI have been studied in more general settings in [1] and [9].

2.2 Maximum likelihood (ML)

As shown in equations (2) and (3), the ML estimates of $\beta$ and $\sigma^2$ have closed forms when the response variable is MAR in the linear model, and those estimates are precisely the CC estimates. More generally, however, ML is carried out using the EM algorithm, in which the E-step is either solved directly when it has a closed form or approximated by Monte Carlo methods when it does not. The latter version is referred to as the Monte Carlo EM (MCEM) algorithm and is the more general way of carrying out ML, since for most regression models with missing data the E-step does not have a closed form. We therefore study ML via MCEM in this subsection, both to examine the connections between ML, MI, and FB and to shed light on the properties of MCEM when closed form ML estimates do not exist. In particular, we derive expressions for the estimates and their associated variances in both small and large samples using MCEM. Following [6] and [8], the Monte Carlo E-step at the (t + 1)st EM iteration can be written as

$Q(\gamma \mid \gamma^{(t)}) = \int l(\gamma \mid D_{obs}, D_{mis}, \gamma^{(t)})\,p(D_{mis} \mid D_{obs}, \gamma^{(t)})\,dD_{mis} \approx \frac{1}{m}\sum_{j=1}^m l(\gamma \mid D_{obs}, D_{mis}^{(j)}, \gamma^{(t)}),$

where l(γ | Dobs, Dmis, γ(t)) is the log-likelihood function based on the complete data given the parameter estimates at the tth iteration, Dobs = (y1, X1, X2) is the observed data, Dmis = y2 and the Dmis(j)’s are the missing values replaced by their jth sampled values from the full conditional distribution p(Dmis | Dobs, γ(t)). The M-Step at the (t+1)st EM iteration maximizes Q(γ | γ(t)). Standard errors can be calculated by using Louis’s method and the estimated observed information matrix of γ based on Louis’s method is given by

$I(\hat\gamma) = -\ddot Q(\hat\gamma \mid \hat\gamma) - \frac{1}{m}\sum_{j=1}^m S(\hat\gamma; D_{obs}, D_{mis}^{(j)})\,S(\hat\gamma; D_{obs}, D_{mis}^{(j)})',$

where $\hat\gamma$ is the ML estimate at MCEM convergence, $S(\cdot)$ is the complete-data score vector, and $\ddot Q(\hat\gamma \mid \hat\gamma)$ is the second derivative matrix of the $Q$ function. The estimate of the asymptotic covariance matrix of $\hat\gamma$ is therefore $[I(\hat\gamma)]^{-1}$.
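As an illustration of the formula above, the following sketch (our own, assuming NumPy; it specializes the complete-data score and Hessian to the linear model and is not code from the paper) computes a Louis-type information estimate for $\gamma = (\beta, \sigma^2)$ from MCEM draws of the missing responses.

```python
import numpy as np

def louis_information(y1, X1, X2, y2_draws, beta_hat, sigma2_hat):
    """Louis-type estimate of the observed information for gamma = (beta, sigma^2).

    Follows the formula quoted above: I(gamma_hat) = -Qddot - (1/m) sum_j S_j S_j',
    with the complete-data linear-model score and Hessian evaluated at the MCEM
    estimates.  y2_draws has shape (m, n2), one row per sampled set of y2.
    """
    m, n2 = y2_draws.shape
    n1, p = X1.shape
    n = n1 + n2
    X = np.vstack([X1, X2])
    XtX = X.T @ X
    info = np.zeros((p + 1, p + 1))
    score_outer = np.zeros((p + 1, p + 1))
    for y2 in y2_draws:
        y = np.concatenate([y1, y2])
        r = y - X @ beta_hat
        rss = r @ r
        # complete-data score at gamma_hat
        S = np.empty(p + 1)
        S[:p] = X.T @ r / sigma2_hat
        S[p] = -n / (2 * sigma2_hat) + rss / (2 * sigma2_hat**2)
        score_outer += np.outer(S, S) / m
        # complete-data negative Hessian at gamma_hat (contributes -Qddot on average)
        H = np.zeros((p + 1, p + 1))
        H[:p, :p] = XtX / sigma2_hat
        H[:p, p] = H[p, :p] = X.T @ r / sigma2_hat**2
        H[p, p] = -n / (2 * sigma2_hat**2) + rss / sigma2_hat**3
        info += H / m
    return info - score_outer        # invert this to estimate the covariance of gamma_hat
```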

Note that unlike the MI method, which creates m pseudo complete datasets by replacing the missing values with each of the m sets of imputed values, ML via MCEM calculates the estimates from a single dataset and assigns a weight of 1 for complete observations and a weight of 1/m for each sampled value. In order to explore the connections between MI and ML, we consider the imputation distribution [y2|y1, β̂] of MCEM, given by

$y_2 \mid y_1, \hat\beta \sim N_{n - n_1}\Big(X_2(X_1'X_1)^{-1}X_1'y_1,\; \frac{n_1 - p}{n_1}s^2 I\Big), \qquad (14)$

where $s^2 = y_1'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)y_1/(n_1 - p)$. Theorem 2.2 gives the estimates of $\beta$ and $\sigma^2$ along with their small and large sample properties.
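A minimal MCEM sketch for this setting (our illustration, assuming NumPy; the function name and defaults are ours) alternates draws of y2 from the full conditional at the current parameters, which at convergence coincides with Eq. (14), with a weighted M-step that gives weight 1 to observed rows and weight 1/m to each sampled row.

```python
import numpy as np

def mcem_linear_mar_response(y1, X1, X2, m=100, n_iter=20, seed=0):
    """MCEM sketch for the linear model with MAR responses."""
    rng = np.random.default_rng(seed)
    n1, p = X1.shape
    n2 = X2.shape[0]
    # initialize at the complete-case estimates
    beta = np.linalg.lstsq(X1, y1, rcond=None)[0]
    sigma2 = np.sum((y1 - X1 @ beta) ** 2) / n1
    for _ in range(n_iter):
        # E-step: m draws of y2 from N(X2 beta, sigma2 I) at the current parameters
        y2_draws = X2 @ beta + np.sqrt(sigma2) * rng.normal(size=(m, n2))
        # M-step: weighted normal equations (weight 1 observed, 1/m per sampled row)
        XtX = X1.T @ X1 + X2.T @ X2                      # = X'X
        Xty = X1.T @ y1 + X2.T @ y2_draws.mean(axis=0)
        beta = np.linalg.solve(XtX, Xty)
        resid_obs = y1 - X1 @ beta
        resid_mis = y2_draws - X2 @ beta                 # shape (m, n2)
        sigma2 = (resid_obs @ resid_obs
                  + np.mean(np.sum(resid_mis**2, axis=1))) / (n1 + n2)
    return beta, sigma2
```

In this special case the iterations settle at the CC estimates, which is exactly the point made above: the data are augmented "vertically" rather than by creating m separate data sets.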

Theorem 2.2

For the linear regression model (1), let $y_2^{(l)}$, $l = 1, \ldots, m$, be the Gibbs samples of $y_2$ from $[y_2 \mid y_1, \hat\beta]$ in Eq. (14). Then

  1. the maximum likelihood estimate of $\beta$ using MCEM is
    $\hat\beta_{ML2} = \frac{1}{m}\sum_{l=1}^m (X'X)^{-1}\big(X_1'y_1 + X_2'y_2^{(l)}\big), \qquad (15)$
    with variance
    $\mathrm{Var}(\hat\beta_{ML2}) = \sigma^2(X_1'X_1)^{-1} + \frac{\sigma^2(n_1 - p)}{mn_1}(X'X)^{-1}(X_2'X_2)(X'X)^{-1}. \qquad (16)$
  2. $\hat\beta_{ML2}$ is an unbiased estimator of $\beta$.

  3. The ML estimate of $\sigma^2$ is
    $\hat\sigma^2_{ML2} = \frac{1}{m}\sum_{l=1}^m \big(y^{(l)} - X\hat\beta_{ML2}\big)'\big(y^{(l)} - X\hat\beta_{ML2}\big)/n, \qquad (17)$
    with mean
    $E(\hat\sigma^2_{ML2}) = \Big(\frac{n_1 - p}{n_1} - \frac{(n_1 - p)\,\mathrm{tr}(M)}{mnn_1}\Big)\sigma^2$
    and variance
    $\mathrm{Var}(\hat\sigma^2_{ML2}) = \frac{2(n_1 - p)}{n_1^2}\sigma^4 + \frac{2(n_1 - p)}{n^2 n_1^2}a_2\sigma^4, \qquad (18)$

    where $a_2 = \frac{\mathrm{tr}^2(M) + (n_1 - p + 2)\,\mathrm{tr}(M^2) - 2(n_1 - p + 2)\,\mathrm{tr}(M)}{m^2} + \frac{(n - n_1)(n_1 - p + 2) - 2n\,\mathrm{tr}(M)}{m}$ and $M = (X_2'X_2)(X'X)^{-1}$.

  4. $\hat\sigma^2_{ML2} \xrightarrow{P} \sigma^2$, as $n_1 \to \infty$ and $m \to \infty$.

Again from Theorem 2.2, it can be easily shown that the estimate of β and its variance based on MCEM are asymptotically equivalent to the CC estimates. In particular, after some algebra, it can be shown that

$\frac{E(\hat\sigma^2_{ML2} \mid y_1)}{\hat\sigma^2_{CC}} = 1 - \frac{\mathrm{tr}(M)}{mn} \to 1,$

as $n_1$ or $m \to \infty$. The condition that $\mathrm{tr}(M) \to K$, $0 \le K < \infty$, as $n \to \infty$, implies that the information contained in the covariates corresponding to the missing responses is finite compared to the total information in the covariates. The variance of $\hat\sigma^2_{ML2}$ in Eq. (18) can also be written as

$\mathrm{Var}(\hat\sigma^2_{ML2}) = \mathrm{Var}(\hat\sigma^2_{CC}) + O\Big(\frac{1}{mn_1}\Big)\sigma^4,$

and hence $\mathrm{Var}(\hat\sigma^2_{ML2})/\mathrm{Var}(\hat\sigma^2_{CC}) \to 1$ as $n_1$ or $m$ goes to infinity.

Note that the variance of $\hat\beta_{ML2}$ in Eq. (16) is smaller than the variance of $\hat\beta_{MI}$ in Eq. (11); however, the derivation of Theorem 2.2 assumes that the missing responses are imputed from the conditional distribution evaluated at the ML estimates, which may not hold in practice. Again, note that although we write the estimates of $(\beta, \sigma^2)$ and their variances as if there were $m$ data sets in order to compare the MI and ML methods, in practice ML via MCEM calculates the estimates from only one data set, with different weights assigned to the observed and sampled values. In this sense, MCEM augments the data "vertically" whereas MI augments the data "horizontally".

Remark 2.3

Both $E(\hat\sigma^2_{ML2})$ and $\mathrm{Var}(\hat\sigma^2_{ML2})$ are functions of $m$, the number of Gibbs samples; therefore, increasing $m$ reduces both the bias and the variance of $\hat\sigma^2_{ML2}$.

2.3 Fully Bayesian (FB)

FB methods for missing data problems specify priors for all of the parameters; the missing data are then sampled from their full conditional distributions within the Gibbs sampler. Clearly, ML and MI have Bayesian connections, since ML can be viewed as a large sample Bayesian method, and in many cases the implementation of Bayesian methods with uniform improper priors on all parameters leads to the ML estimates. In this subsection, we consider the FB method under conjugate priors, which yield closed form expressions for the posterior mean and variance of the parameters.

Note that the observed-data likelihood for MAR responses is the CC likelihood, and thus the posterior distribution of $\gamma^*$ based on the observed data is $p(\gamma^* \mid y_1, X) \propto p(y_1 \mid X, \gamma^*)\,\pi(\gamma^*)$. Theorem 2.3 provides the properties of the fully Bayesian estimates of $\beta$ and $\tau$.

Theorem 2.3

For the linear regression model (1), assume that the prior for $\gamma^* = (\beta, \tau)$ is $\pi(\gamma^*) = \pi(\beta \mid \tau)\,\pi(\tau)$, where $\pi(\beta \mid \tau) = N(\mu_0, \tau^{-1}\Sigma_0)$ and $\pi(\tau) = \mathrm{Gamma}(\delta_0/2, \lambda_0/2)$. Then

  1. the fully Bayesian estimate of $\beta$ is
    $\hat\beta_{FB} = \frac{1}{m}\sum_{l=1}^m \beta^{(l)},$
    where $\beta^{(l)}$ is the $l$th sample from the posterior distribution
    $p(\beta \mid D_{obs}) \sim S_p\big(n_1 + \delta_0,\; \tilde\beta,\; \tilde s^2\,(X_1'X_1 + \Sigma_0^{-1})^{-1}\big)$

    with $\tilde\beta = \Lambda\mu_0 + (I - \Lambda)\hat\beta$, $\Lambda = (X_1'X_1 + \Sigma_0^{-1})^{-1}\Sigma_0^{-1}$, $\hat\beta = (X_1'X_1)^{-1}X_1'y_1$, and $\tilde s^2 = (n_1 + \delta_0)^{-1}\big[y_1'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)y_1 + (\hat\beta - \mu_0)'(\Lambda'X_1'X_1)(\hat\beta - \mu_0) + \lambda_0\big]$.

  2. The posterior mean and variance of $\beta$ based on the observed data are
    $E(\beta \mid D_{obs}) = \tilde\beta$
    and
    $\mathrm{Var}(\beta \mid D_{obs}) = (n_1 + \delta_0)\,\tilde s^2\,(X_1'X_1 + \Sigma_0^{-1})^{-1}/(n_1 + \delta_0 - 2).$
  3. The fully Bayesian estimate of $\tau = 1/\sigma^2$ is
    $\hat\tau_{FB} = \frac{1}{m}\sum_{l=1}^m \tau^{(l)},$

    where $\tau^{(l)}$ is the $l$th sample from $p(\tau \mid D_{obs}) \sim \mathrm{Gamma}\big((n_1 + \delta_0)/2,\; (n_1 + \delta_0)\tilde s^2/2\big)$, with $\tilde s^2$ defined in (i).

  4. The posterior mean and variance of $\tau$ are
    $E(\tau \mid D_{obs}) = 1/\tilde s^2 \quad \text{and} \quad \mathrm{Var}(\tau \mid D_{obs}) = 2(n_1 + \delta_0)^{-1}\tilde s^{-4}.$

The proof of Theorem 2.3 is straightforward and can be found in most Bayesian textbooks. We state it as a Theorem here only to be consistent with other sections.
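A direct Monte Carlo implementation of Theorem 2.3 could look as follows (our sketch, assuming NumPy; the function name and hyperparameter arguments are ours). It draws τ from its Gamma posterior and β from the multivariate t posterior via the normal/chi-square construction.

```python
import numpy as np

def fb_posterior_draws(y1, X1, mu0, Sigma0, delta0, lam0, m=1000, seed=0):
    """Posterior draws of (beta, tau) under the conjugate prior of Theorem 2.3."""
    rng = np.random.default_rng(seed)
    n1, p = X1.shape
    XtX = X1.T @ X1
    Sigma0_inv = np.linalg.inv(Sigma0)
    Lam = np.linalg.solve(XtX + Sigma0_inv, Sigma0_inv)    # Lambda
    beta_hat = np.linalg.solve(XtX, X1.T @ y1)
    beta_tilde = Lam @ mu0 + (np.eye(p) - Lam) @ beta_hat
    resid = y1 - X1 @ beta_hat
    s2_tilde = (resid @ resid
                + (beta_hat - mu0) @ (Lam.T @ XtX) @ (beta_hat - mu0)
                + lam0) / (n1 + delta0)
    df = n1 + delta0
    scale = s2_tilde * np.linalg.inv(XtX + Sigma0_inv)     # scale matrix of the t posterior
    L = np.linalg.cholesky(scale)
    # tau | Dobs ~ Gamma(shape (n1+delta0)/2, rate (n1+delta0)*s2_tilde/2)
    tau = rng.gamma(shape=df / 2, scale=2 / (df * s2_tilde), size=m)
    # beta | Dobs ~ multivariate t via the normal/chi-square construction
    x = rng.normal(size=(m, p)) @ L.T
    w = rng.chisquare(df, size=m)
    beta = beta_tilde + x / np.sqrt(w / df)[:, None]
    return beta, tau
```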

Remark 2.4

When the prior for $\gamma^*$ is the improper prior $\pi(\gamma^*) \propto \tau^{-1}$, $\tilde s^2$ reduces to $y_1'\big(I - X_1(X_1'X_1)^{-1}X_1'\big)y_1/n_1$, and the posterior mean and variance of $(\beta, \tau)$ are then equal to the CC estimates given in equations (2) and (3).

Therefore, the CC analysis is recommended over the MI, MCEM, and FB methods for MAR responses in the linear regression model, unless additional information is available to specify informative priors for the MI and FB methods, or the imputation model of MI includes covariates not specified in the analysis model. On the other hand, the loss of efficiency of MI, MCEM, and FB methods can be significantly reduced by increasing the number of imputations for MI or the number of Gibbs samples for MCEM and FB.

3. SIMULATION STUDY

In this section, we compare inferences about β using the four methods (MI, CC, MCEM, and FB), based on the formulas developed in Section 2, for a small sample size n and various values of m for MI and MCEM.

We generate N = 1,000 replicates with each simulation consisting of n = 250 independent response variables yi from the linear regression model as

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + e_i,$

where the ei’s are independent and identically distributed (i.i.d.) as N(0, σ2). The values chosen for the parameters are (β0, β1, β2) = (1.0, 1.5, −1.0) and σ2 = 1.0. The covariates (xi1, xi2) are i.i.d. and simulated as

$x_{i1} \sim N(1.0, 1.0) \quad \text{and} \quad x_{i2} \mid x_{i1} \sim N(\alpha_0 + \alpha_1 x_{i1}, \sigma_x^2),$

where $(\alpha_0, \alpha_1) = (1.0, 1.0)$ and $\sigma_x^2 = 1.0$.

We assume that yi is MAR for some i’s and xi1 and xi2 are completely observed throughout. In this setting, the model for the missing data mechanism of yi is given by

$p(r_i = 1 \mid x_{i1}, x_{i2}, \phi) = \frac{\exp(\phi_0 + \phi_1 x_{i1} + \phi_2 x_{i2})}{1 + \exp(\phi_0 + \phi_1 x_{i1} + \phi_2 x_{i2})},$

where $(\phi_0, \phi_1, \phi_2) = (-5.5, 1.0, 1.0)$, and $r_i = 1$ if $y_i$ is missing and $r_i = 0$ otherwise.
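One replicate of this simulation design can be generated as follows (our sketch, assuming NumPy; the function name and seed are ours).

```python
import numpy as np

def simulate_replicate(n=250, seed=0):
    """Generate one data set from the simulation design of Section 3."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0, 1.5, -1.0])
    phi = np.array([-5.5, 1.0, 1.0])
    x1 = rng.normal(1.0, 1.0, size=n)
    x2 = rng.normal(1.0 + 1.0 * x1, 1.0)                  # x2 | x1 ~ N(alpha0 + alpha1*x1, 1)
    X = np.column_stack([np.ones(n), x1, x2])
    y = X @ beta + rng.normal(0.0, 1.0, size=n)           # sigma^2 = 1
    # MAR mechanism: missingness depends only on the fully observed covariates
    lin = phi[0] + phi[1] * x1 + phi[2] * x2
    r = rng.random(n) < 1.0 / (1.0 + np.exp(-lin))        # r = 1 means y is missing
    y_obs = np.where(r, np.nan, y)
    return X, y_obs, r
```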

Table 1 gives the results using the four methods, MI, CC, MCEM, and FB, and also gives the estimates based on the full data (i.e., no missing values), which serve as a benchmark for comparison. Across the N = 1,000 simulations, the average proportion of observations with yi missing is 19%. We chose the number of samples m equal to 30 and 3 in both the MI and MCEM methods in order to compare the results. [13] note that for proper MI the number of imputed data sets m can be as small as m = 5. However, m in MCEM, the number of Gibbs samples, is usually large, say m = 100 or more, in order to accurately represent the sampling distribution in the E-step, especially in complex models with large missing data fractions. This is consistent with the simulation results. When m = 3 in the MI method, the two forms of the variance estimates give similar coverage rates because both of them adjust well for small m, whereas when m = 3 in MCEM, equations (16) and (18) give much better coverage rates than the Louis method. On the other hand, for the MI method, since the estimates with m = 30 always have smaller variances and better coverage probabilities than the estimates with m = 3, larger values of m are worth considering when the computational burden is not heavy. The simulation results confirm the theorems in Section 2 and show that the three methods (MI, ML via MCEM, and FB) produce consistent estimates with valid inferences, all asymptotically equivalent to the CC estimates when the response variable is MAR.

Table 1.

Simulation with MAR responses in the linear regression model. The 95% CR is the coverage rate of a 95% confidence interval. γ̂F: estimates from the full data (no missing values); γ̂MI: multiple imputation, with the covariance matrix taken either from Eq. (7) or from Eqs. (11) and (13); γ̂CC: CC estimates; γ̂ML2: MCEM, with the covariance matrix taken either from Louis's method or from Eqs. (16) and (18); γ̂FB: fully Bayesian estimates. The number m is defined in the section discussing the corresponding method.

Entries are Estimate (Var × 10⁻³) [95% CR].

Method                             m    β0 = 1.00        β1 = 1.50        β2 = −1.00       σ2 = 1.0
γ̂F (full data)                    –    0.994(12)[94]    1.499(8)[96]     −0.998(4)[96]    0.989(8)[94]
γ̂MI, var. from Eq. (7)            30   0.995(14)[95]    1.496(11)[95]    −0.998(5)[95]    1.000(10)[95]
γ̂MI, var. from Eqs. (11), (13)    30   0.995(14)[95]    1.496(10)[95]    −0.998(5)[95]    1.000(10)[95]
γ̂MI, var. from Eq. (7)            3    0.993(14)[95]    1.496(11)[95]    −0.998(6)[94]    1.002(11)[94]
γ̂MI, var. from Eqs. (11), (13)    3    0.993(14)[95]    1.496(11)[96]    −0.998(6)[95]    1.002(10)[94]
γ̂CC                               –    0.995(14)[95]    1.497(10)[95]    −0.998(5)[95]    0.987(10)[94]
γ̂ML2, var. from Louis's method    30   0.995(14)[95]    1.497(10)[95]    −0.998(5)[95]    0.987(10)[94]
γ̂ML2, var. from Eqs. (16), (18)   30   0.995(14)[95]    1.497(10)[95]    −0.998(5)[95]    0.987(10)[94]
γ̂ML2, var. from Louis's method    3    0.993(14)[92]    1.497(10)[90]    −0.998(5)[89]    0.987(10)[90]
γ̂ML2, var. from Eqs. (16), (18)   3    0.993(14)[95]    1.497(10)[95]    −0.998(5)[93]    0.987(10)[93]
γ̂FB                               –    0.995(14)[95]    1.497(10)[95]    −0.998(5)[95]    0.992(10)[95]

4. LIVER CANCER DATA

To further illustrate the CC, MI, ML, and FB methods, we consider a real data set on n = 174 patients from two Eastern Cooperative Oncology Group clinical trials, EST 2282 [3] and EST 1286 [4]. We are interested in how the number of cancerous liver nodes (CNT) at entry into the trials is predicted by four baseline characteristics: body mass index (BMI, defined as weight in kilograms divided by the square of height in meters), age (in years), associated jaundice (yes, no), and time since diagnosis of the disease (TSD, in weeks). Thirty-four of the 174 patients (19.5%) have a missing response variable (CNT). Throughout, we assume that the response variable CNT is MAR. The square root transformation of CNT and TSD was used in the analyses.

We use linear regression to model the response variable CNT as $\mathrm{CNT}_i = \beta_0 + \beta_1\mathrm{BMI}_i + \beta_2\mathrm{Age}_i + \beta_3\mathrm{Jaundice}_i + \beta_4\mathrm{TSD}_i + e_i$, where the $e_i$'s are i.i.d. $N(0, \sigma^2)$.
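A complete-case fit of this model could be sketched as follows (our illustration only; the trial data are not distributed with the paper, so the data frame and column names below are hypothetical, including the 0/1 coding of jaundice).

```python
import numpy as np
import pandas as pd

# hypothetical data frame with one row per patient; column names are ours
# df = pd.read_csv("liver_cancer.csv")  # assumed file, not provided with the paper

def cc_fit(df):
    """Complete-case least-squares fit of sqrt(CNT) on BMI, Age, Jaundice, sqrt(TSD)."""
    d = df.dropna(subset=["CNT"])                          # drop patients with missing response
    y = np.sqrt(d["CNT"].to_numpy())
    X = np.column_stack([np.ones(len(d)),
                         d["BMI"].to_numpy(),
                         d["Age"].to_numpy(),
                         d["Jaundice"].to_numpy(),         # coded 1 = yes, 0 = no (assumed)
                         np.sqrt(d["TSD"].to_numpy())])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
```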

Table 2 shows the results for the CC analysis, MI with m = 3 and m = 30 using equations (11) and (13), MCEM with m = 30 and m = 300 using equations (16) and (18), and the FB method discussed in Section 2.3. Moreover, the variance (σ²) estimates are 0.775, 0.715, 0.715, 0.724, 0.739, and 0.712 for CC, MI with m = 3 and m = 30, ML via MCEM with m = 30 and m = 300, and FB, respectively. As shown in the table, the MI, MCEM, and FB methods yield very similar estimates, with little difference from the CC estimates. In particular, the p-values of most covariates become smaller with larger m. The results show that the age of the patients is significantly associated with the number of cancerous liver nodes, controlling for body mass index, associated jaundice, and time since diagnosis.

Table 2.

Estimates for liver cancer data

Effect Method β̂ Std P-value
Intercept CC 2.826 0.445 < .001
MI m = 3 2.801 0.446 < .001
MI m = 30 2.763 0.408 < .001
ML2 m = 30 2.782 0.440 < .001
ML2 m = 300 2.766 0.395 < .001
FB 2.770 0.408 < .001
BMI CC −0.008 0.015 0.595
MI m = 3 −0.007 0.015 0.632
MI m = 30 −0.005 0.013 0.726
ML2 m = 30 −0.005 0.014 0.740
ML2 m = 300 −0.005 0.013 0.715
FB −0.005 0.013 0.720
Age CC −0.012 0.005 0.018
MI m = 3 −0.012 0.005 0.016
MI m = 30 −0.012 0.005 0.007
ML2 m = 30 −0.012 0.005 0.012
ML2 m = 300 −0.012 0.004 0.006
FB −0.012 0.005 0.008
Jaundice CC 0.190 0.152 0.212
MI m = 3 0.231 0.146 0.115
MI m = 30 0.217 0.134 0.107
ML2 m = 30 0.210 0.144 0.147
ML2 m = 300 0.204 0.129 0.116
FB 0.204 0.134 0.129
TSD CC 0.002 0.034 0.964
MI m = 3 0.000 0.034 0.998
MI m = 30 −0.002 0.032 0.957
ML2 m = 30 −0.003 0.034 0.933
ML2 m = 300 −0.003 0.030 0.926
FB −0.002 0.032 0.943

5. DISCUSSION

It is known in the missing data literature that when the responses are MAR, the CC analysis is unbiased and efficient. However, MI, ML via MCEM, and FB are still commonly used in practice in this setting. This may be because (a) the unbiasedness and efficiency of the CC method in this setting are not widely known among applied researchers, and (b) MI, ML, and FB, as well as some other methods including parametric fractional imputation [10] and empirical likelihood [18], outperform CC in more general settings. To overcome these barriers, it is important to inform researchers and practitioners about these results. Moreover, we also showed in this paper that the loss of efficiency of MI, ML via MCEM, and FB can be substantially reduced by increasing the number of imputations for MI and the number of Gibbs samples for MCEM and FB. It would be of interest to extend our theoretical results to MAR responses for models other than linear regression; this is a topic of current investigation. It would also be interesting to accommodate missingness in the predictors. Unfortunately, even for linear regression models with normally distributed MAR covariates, no closed form expressions are available for the estimates under the three methods, which makes comparisons between the methods very difficult. A special case, assuming unit variances for the response variable and the missing covariates, was investigated in [2] for the maximum likelihood approach.

Acknowledgments

The authors wish to thank the editor, the associate editor and two referees for several suggestions and editorial changes which have greatly improved the paper.

APPENDIX

Proof of Lemma 2.1

If the $n \times 1$ random vector $z$ has a multivariate t distribution $S_n(v, \mu, V)$, then we can write $z = x/\sqrt{y/v} + \mu$, where $x$ is an $n \times 1$ random vector with a multivariate normal distribution $N(0, V)$, $y$ is a random variable with a $\chi^2_v$ distribution, and $x$ and $y$ are independent. Therefore, (i) is straightforward and, to get (ii), we have

$E(zz'Az) = E\Big[E\Big(\frac{xx'Ax}{(y/v)^{3/2}} + \frac{x\mu'Ax + xx'A\mu + \mu x'Ax}{y/v} + \frac{x\mu'A\mu + \mu\mu'Ax + \mu x'A\mu}{(y/v)^{1/2}} + \mu\mu'A\mu \,\Big|\, y\Big)\Big] = 0 + E\big[vy^{-1}\big(VA'\mu + VA\mu + \mu\,\mathrm{tr}(AV)\big)\big] + 0 + \mu\mu'A\mu = \frac{v}{v-2}\big[VA\mu + VA'\mu + \mu\,\mathrm{tr}(AV)\big] + \mu\mu'A\mu.$

We substitute $z = x/\sqrt{y/v} + \mu$ and calculate the double expectation in the first equality. Because of the independence between $x$ and $y$, we can substitute in expressions for multivariate normal moments in the second equality, and therefore

$E[(z'Az)(z'Bz)] = E\Big[E\Big(\big(x/\sqrt{y/v} + \mu\big)'A\big(x/\sqrt{y/v} + \mu\big)\,\big(x/\sqrt{y/v} + \mu\big)'B\big(x/\sqrt{y/v} + \mu\big) \,\Big|\, y\Big)\Big] = E\big[(v/y)^2 E(x'Ax\,x'Bx)\big] + E\big[(v/y)^{3/2} \times 0\big] + E\big[(v/y)\,E\big(x'Ax\,\mu'B\mu + \mu'Ax\,\mu'Bx + \mu'Ax\,x'B\mu + x'A\mu\,\mu'Bx + x'A\mu\,x'B\mu + \mu'A\mu\,x'Bx\big)\big] + E\big[(v/y)^{1/2} \times 0\big] + \mu'A\mu\,\mu'B\mu = \frac{v^2}{(v-2)(v-4)}\big[\mathrm{tr}(AV)\,\mathrm{tr}(BV) + \mathrm{tr}(AVBV) + \mathrm{tr}(AVB'V)\big] + \frac{v}{v-2}\big[\mathrm{tr}(AV)\,\mu'B\mu + \mu'A\mu\,\mathrm{tr}(BV) + \mu'(AVB + A'VB + AVB' + A'VB')\mu\big] + \mu'A\mu\,\mu'B\mu.$

In the second equality, the two zero components correspond to the first and third moments of x. In the last equality, we use the second and fourth moments of x, which are available in [5] with a modification for nonsymmetric matrices A and B.

Proof of Theorem 2.1

(i) and (ii): It is straightforward to get the estimate of β as in Eq. (10). It is also straightforward to use double expectations to get E(β̂MI) = β. To find the variance of β̂MI, we have

$\mathrm{Var}(\hat\beta_{MI}) = E\big(\mathrm{Var}(\hat\beta \mid y_1)\big) + \mathrm{Var}\big(E(\hat\beta \mid y_1)\big) = \frac{n_1 - p}{m(n_1 - p - 2)} E\Big((X'X)^{-1}X_2'\,s^2\big(I + X_2(X_1'X_1)^{-1}X_2'\big)\,X_2(X'X)^{-1}\Big) + \mathrm{Var}\Big((X'X)^{-1}X_1'y_1 + (X'X)^{-1}(X_2'X_2)(X_1'X_1)^{-1}X_1'y_1\Big) = \frac{\sigma^2(n_1 - p)}{m(n_1 - p - 2)}(X'X)^{-1}(X_2'X_2)(X_1'X_1)^{-1} + \mathrm{Var}\big((X_1'X_1)^{-1}X_1'y_1\big) = \frac{\sigma^2(n_1 - p)}{m(n_1 - p - 2)}(X'X)^{-1}(X_2'X_2)(X_1'X_1)^{-1} + \sigma^2(X_1'X_1)^{-1}.$

(iii) and (iv): It is straightforward to get the estimate of $\sigma^2$ as in Eq. (12). In order to find $E(\hat\sigma^2_{MI})$, we write $(n-2)\hat\sigma^2_{MI} = y_1'Ay_1 - 2y_1'B\bar y_2 + \frac{1}{m}\sum_{l=1}^m y_2^{(l)\prime}Cy_2^{(l)}$, where $\bar y_2 = \sum_{l=1}^m y_2^{(l)}/m$, $A = I - X_1(X'X)^{-1}X_1'$, $B = X_1(X'X)^{-1}X_2'$, and $C = I - X_2(X'X)^{-1}X_2'$. Let $P_{X_1} = I - X_1(X_1'X_1)^{-1}X_1'$ and $D = X_1(X_1'X_1)^{-1}(X_2'X_2)(X'X)^{-1}X_1'$. Then after some algebra, we have

$E(\hat\sigma^2_{MI}) = E[E(\hat\sigma^2_{MI} \mid y_1)] = E\Big[y_1'\Big(A - D + \frac{n - n_1}{n_1 - p - 2}P_{X_1}\Big)y_1\Big]/(n - 2) = \frac{n - p - 2}{(n_1 - p - 2)(n - 2)}E(y_1'P_{X_1}y_1) = \frac{(n - p - 2)(n_1 - p)}{(n - 2)(n_1 - p - 2)}\sigma^2,$

by noting that $y_1 \sim N(X_1\beta, \sigma^2 I)$, $A - D = P_{X_1}$, $P_{X_1}^2 = P_{X_1}$, and $X_1'P_{X_1}X_1 = 0$.

To find $\mathrm{Var}(\hat\sigma^2_{MI})$, we write $\mathrm{Var}((n-2)\hat\sigma^2_{MI}) = \mathrm{Var}[E((n-2)\hat\sigma^2_{MI} \mid y_1)] + E[\mathrm{Var}((n-2)\hat\sigma^2_{MI} \mid y_1)]$. First we have

$\mathrm{Var}[E((n-2)\hat\sigma^2_{MI} \mid y_1)] = \frac{(n - p - 2)^2}{(n_1 - p - 2)^2}\mathrm{Var}\big((n_1 - p)s^2\big) = \frac{2(n_1 - p)(n - p - 2)^2}{(n_1 - p - 2)^2}\sigma^4.$

Then we obtain

$\mathrm{Var}((n-2)\hat\sigma^2_{MI} \mid y_1) = \mathrm{Var}\Big(-2y_1'B\bar y_2 + \frac{1}{m}\sum_{l=1}^m y_2^{(l)\prime}Cy_2^{(l)} \,\Big|\, y_1\Big) = \frac{4y_1'B\,\mathrm{Var}(y_2 \mid y_1)\,B'y_1}{m} + \frac{\mathrm{Var}(y_2'Cy_2 \mid y_1)}{m} - \sum_{l=1}^m\sum_{k=1}^m \frac{4\,\mathrm{Cov}\big(y_1'By_2^{(k)},\, y_2^{(l)\prime}Cy_2^{(l)} \mid y_1\big)}{m^2} = \frac{4(n_1 - p)}{m(n_1 - p - 2)}\,y_1'B\Sigma B'y_1 + \frac{1}{m}\mathrm{Var}(y_2'Cy_2 \mid y_1) - \frac{4}{m}\mathrm{Cov}\big(y_1'By_2,\, y_2'Cy_2 \mid y_1\big),$

where $\Sigma = s^2\big(I + X_2(X_1'X_1)^{-1}X_2'\big)$ and $s^2$ is defined as in Theorem 2.1. The last equality holds because $y_2^{(j)}$ is independent of $y_2^{(k)}$ given $y_1$, when $j \ne k$. Then we use Lemma 2.1 and get

$E[\mathrm{Var}((n-2)\hat\sigma^2_{MI} \mid y_1)] = \frac{2(n - n_1)(n - p - 2)}{(n_1 - p - 2)^2(n_1 - p - 4)}\,E\Big[\frac{(y_1'P_{X_1}y_1)^2}{m}\Big] = \frac{2(n_1 - p)(n - n_1)(n - p - 2)(n_1 - p + 2)}{m(n_1 - p - 2)^2(n_1 - p - 4)}\sigma^4,$

and therefore we obtain $\mathrm{Var}(\hat\sigma^2_{MI})$ as in Eq. (13). Since $\mathrm{Var}(\hat\sigma^2_{MI}) \to 0$, $\hat\sigma^2_{MI} \xrightarrow{P} \sigma^2$.

Proof of Theorem 2.2

(i) and (ii): It is straightforward to get the estimate of $\beta$ as in Eq. (15), and straightforward to use double expectations to get $E(\hat\beta_{ML2}) = \beta$. To find the variance of $\hat\beta_{ML2}$, we have

$\mathrm{Var}(\hat\beta_{ML2}) = E\big(\mathrm{Var}(\hat\beta \mid y_1)\big) + \mathrm{Var}\big(E(\hat\beta \mid y_1)\big) = \frac{n_1 - p}{mn_1} E\Big((X'X)^{-1}X_2'\,s^2 I\,X_2(X'X)^{-1}\Big) + \mathrm{Var}\Big((X'X)^{-1}X_1'y_1 + (X'X)^{-1}(X_2'X_2)(X_1'X_1)^{-1}X_1'y_1\Big) = \frac{\sigma^2(n_1 - p)}{mn_1}(X'X)^{-1}(X_2'X_2)(X'X)^{-1} + \mathrm{Var}\big((X_1'X_1)^{-1}X_1'y_1\big) = \frac{\sigma^2(n_1 - p)}{mn_1}(X'X)^{-1}(X_2'X_2)(X'X)^{-1} + \sigma^2(X_1'X_1)^{-1}.$

(iii) and (iv): It is straightforward to get the estimate of $\sigma^2$ as in Eq. (17). In order to find $E(\hat\sigma^2_{ML2})$, we write $n\hat\sigma^2_{ML2} = y_1'Ay_1 + \frac{1}{m}\sum_{l=1}^m y_2^{(l)\prime}y_2^{(l)} - 2y_1'B\bar y_2 - \bar y_2'(I - C)\bar y_2$, where the symbols are the same as in the proof of Theorem 2.1. Then we have

$E(\hat\sigma^2_{ML2}) = E[E(\hat\sigma^2_{ML2} \mid y_1)] = E\Big[y_1'\Big(A - D + \frac{n - n_1}{n_1}P_{X_1} - \frac{\mathrm{tr}(M)}{mn_1}P_{X_1}\Big)y_1\Big]/n = \Big(\frac{n_1 - p}{n_1} - \frac{(n_1 - p)\,\mathrm{tr}(M)}{mnn_1}\Big)\sigma^2,$

where $M = (X_2'X_2)(X'X)^{-1}$. To find $\mathrm{Var}(\hat\sigma^2_{ML2})$, we write

$\mathrm{Var}(n\hat\sigma^2_{ML2}) = \mathrm{Var}[E(n\hat\sigma^2_{ML2} \mid y_1)] + E[\mathrm{Var}(n\hat\sigma^2_{ML2} \mid y_1)].$

Then we have

$\mathrm{Var}[E(n\hat\sigma^2_{ML2} \mid y_1)] = \mathrm{Var}\Big(\Big(\frac{n}{n_1} - \frac{\mathrm{tr}(M)}{mn_1}\Big)y_1'P_{X_1}y_1\Big) = 2\Big(\frac{n}{n_1} - \frac{\mathrm{tr}(M)}{mn_1}\Big)^2(n_1 - p)\sigma^4.$

Using Chapter 10.9 of [5] and after some algebra, we get

$\mathrm{Var}(n\hat\sigma^2_{ML2} \mid y_1) = 2\Big(\frac{n - n_1}{mn_1^2} + \frac{\mathrm{tr}(M^2)}{m^2n_1^2} - \frac{2\,\mathrm{tr}(M)}{m^2n_1^2}\Big)(y_1'P_{X_1}y_1)^2.$

Therefore, after some algebra, we can get $\mathrm{Var}(\hat\sigma^2_{ML2})$ as in Eq. (18), and therefore $\hat\sigma^2_{ML2} \xrightarrow{P} \sigma^2$.

Contributor Information

Qingxia Chen, Email: cindy.chen@vanderbilt.edu, Department of Biostatistics, Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, 3723, USA.

Joseph G. Ibrahim, Email: ibrahim@bios.unc.edu, Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA.

References

  • 1. Barnard J, Rubin D. Small-sample degrees of freedom with multiple imputation. Biometrika. 1999;86:948–955.
  • 2. Chen Q, Ibrahim JG, Chen MH, Senchaudhuri P. Theory and inference for regression models with missing response and covariates. Journal of Multivariate Analysis. 2008;99:1302–1331. doi: 10.1016/j.jmva.2007.08.009.
  • 3. Falkson G, Cnaan A, Simson IW. A randomized phase II study of acivicin and deoxydoxorubicin in patients with hepatocellular carcinoma in an Eastern Cooperative Oncology Group study. American Journal of Clinical Oncology. 1990;13:510–515. doi: 10.1097/00000421-199012000-00012.
  • 4. Falkson G, Lipsitz S, Borden E, Simson IW, Haller D. An ECOG randomized phase II study of beta interferon and menogoril. American Journal of Clinical Oncology. 1995;18:287–292.
  • 5. Graybill FA. Matrices with Applications in Statistics. 2nd ed. Wadsworth Statistics/Probability Series. Duxbury Press; 1983. MR0682581.
  • 6. Ibrahim JG. Incomplete data in generalized linear models. Journal of the American Statistical Association. 1990;85:765–769.
  • 7. Ibrahim JG, Chen MH, Lipsitz SR. Bayesian methods for generalized linear models with covariates missing at random. Canadian Journal of Statistics. 2002;30:55–78.
  • 8. Ibrahim JG, Lipsitz SR, Chen MH. Missing covariates in generalized linear models when the missing data mechanism is nonignorable. Journal of the Royal Statistical Society: Series B. 1999;61:173–190.
  • 9. Kim JK. Finite sample properties of multiple imputation estimators. The Annals of Statistics. 2004;32:766–783.
  • 10. Kim JK. Parametric fractional imputation for missing data analysis. Biometrika. 2011;98:119–132.
  • 11. Lipsitz SR, Ibrahim JG. A conditional model for incomplete covariates in parametric regression models. Biometrika. 1996;83:916–922.
  • 12. Little RJA. Inference about means from incomplete multivariate data. Biometrika. 1976;63:593–604.
  • 13. Little RJA, Rubin DB. Statistical Analysis With Missing Data. 2nd ed. John Wiley; 2002. MR1925014.
  • 14. Louis T. Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society: Series B. 1982;44:226–233.
  • 15. Meng XL, Romero M. Discussion to S. F. Nielsen: Efficiency and self-efficiency with multiple imputation inference. International Statistical Review. 2003;71:607–618.
  • 16. Meng XL, Rubin DB. Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association. 1991;86:899–909.
  • 17. Nielsen SF. Proper and improper multiple imputation. International Statistical Review. 2003;71:593–607.
  • 18. Qin J, Zhang B, Leung DHY. Empirical likelihood in missing data problems. Journal of the American Statistical Association. 2009;104:1492–1503.
  • 19. Robins JM, Wang N. Inference for imputation estimators. Biometrika. 2000;87:113–124.
  • 20. Rubin D. Discussion to S. F. Nielsen: Discussion on multiple imputation. International Statistical Review. 2003;71:619–625.
  • 21. Schenker N, Welsh AH. Asymptotic results for multiple imputation. The Annals of Statistics. 1988;16:1550–1566.
  • 22. Wang N, Robins JM. Large-sample theory for parametric imputation procedures. Biometrika. 1998;85:935–948.
