Bivariate Mixed Effects Analysis of Clustered Data with Large Cluster Sizes

Daowen Zhang; Jie Lena Sun; Karen Pieper

doi:10.1007/s12561-015-9140-x

Author manuscript; available in PMC: 2017 Oct 1.

Published in final edited form as: Stat Biosci. 2016 Jan 22;8(2):220–233. doi: 10.1007/s12561-015-9140-x

PMCID: PMC5061463 NIHMSID: NIHMS754386 PMID: 27746847

Abstract

Linear mixed effects models are widely used to analyze a clustered response variable. Motivated by a recent study to examine and compare the hospital length of stay (LOS) between patients undergoing percutaneous coronary intervention (PCI) and coronary artery bypass graft (CABG) from several international clinical trials, we propose a bivariate linear mixed effects model for the joint modeling of clustered PCI and CABG LOS’s where each clinical trial is considered a cluster. Due to the large number of patients in some trials, commonly used commercial statistical software for fitting (bivariate) linear mixed models failed to run since it could not allocate enough memory to invert large dimensional matrices during the optimization process. We consider ways to circumvent the computational problem in the maximum likelihood (ML) inference and restricted maximum likelihood (REML) inference. In particular, we develop an expectation-maximization (EM) algorithm for the REML inference and present an ML implementation using existing software. The new REML EM algorithm is easy to implement and computationally stable and efficient. With this REML EM algorithm, we were able to analyze the LOS data and obtain meaningful results.

Keywords: Meta Analysis, Missing Data, Multi-center Studies

1 Introduction

For patients with coronary artery diseases, percutaneous coronary intervention (PCI) and coronary artery bypass graft (CABG) are two common procedures recommended by physicians. Among many other factors, the hospital length of stay (LOS) is an important one for evaluating the cost effectiveness of each procedure. While some patients receiving a PCI are discharged from the hospital within 24 hours after the procedure, many PCI patients require a longer observation period. Compared to a PCI, the CABG is a much more invasive procedure and patients with a CABG usually require a longer hospital stay. Because of differences in patient care and management across the world, there may be regional differences in the hospital LOS’s of these two procedures.

Recently, we had a unique opportunity to examine the hospital LOS’s of PCI and CABG procedures from 10 international clinical trials on patients with coronary artery diseases (see Section 7 for more details). One of the research objectives is to quantify the expected LOS’s of the CABG and PCI procedures and their differences, after adjusting for patients’ age, smoking and diabetes statuses, for seven different representative regions across the world, while taking into account the trial-to-trial variation. A common approach used to analyze clustered data such as the LOS data is linear mixed effects models where random effects are used to model the cluster effects of trials or trial-to-trial variation. However, empirical evidence indicated that CABG and PCI LOS’s exhibit very different within-trial (residual) variation, which has to be taken into account in the analysis for optimal inference. We hence propose a bivariate linear mixed model for jointly modeling CABG and PCI LOS’s. Due to the large number of patients in some trials, the commonly used statistical software for fitting linear mixed models, the MIXED procedure of SAS (SAS/STAT 9.3, 2013), failed to run since it could not allocate enough memory to invert large dimensional matrices. This computational problem motivates the research of this paper.

In this paper, we consider ways to circumvent the computational problem in the maximum likelihood (ML) inference and the restricted maximum likelihood (REML) inference for the proposed bivariate linear mixed model. For the ML inference, we present an implementation using existing software. For the REML inference, we propose a novel expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin, 1977) and prove its theoretical statistical property. The new REML EM algorithm is easy to implement and computationally stable and efficient. With this REML EM algorithm, we were able to analyze the LOS data and obtain meaningful results.

The paper is organized as follows. In Section 2, we describe the proposed bivariate linear mixed model. In Section 3, we discuss the computational issues and solution in the inference on fixed effects and random effects with large cluster sizes. In Section 4, we discuss the ML inference on variance/covariance components and in Section 5, we discuss their REML inference and present an EM algorithm for the REML inference. We present simulation results in Section 6. The proposed REML EM algorithm was used to fit a bivariate mixed model for the LOS data in Section 7. We conclude the paper in Section 8 with some discussion.

2 Models

Suppose our sample consists of m clusters with 2 response variables. In cluster i, we have observations for response variable Y₁ from n_1i subjects and response variable Y₂ from n_2i subjects, so that the size of cluster i is n_i = n_1i + n_2i. At the same time, some important covariates were also measured for each subject. We consider the following Laird-Ware (Laird and Ware, 1982) linear mixed model for each response variable:

Y_{1 i j} = X_{i j}^{T} β_{1} + Z_{i j}^{T} b_{1 i} + e_{1 i j}, i = 1, 2, \dots, m, j = 1, 2, \dots, n_{1 i},

Y_{2 i j} = X_{i, n_{1 i} + j}^{T} β_{2} + Z_{i, n_{1 i} + j}^{T} b_{2 i} + e_{2 i j}, i = 1, 2, \dots, m, j = 1, \dots, n_{2 i},

(1)

where X_ij (usually including the intercept) is the p̃-dimensional vector of covariates for fixed effects β₁ and β₂, Z_ij, usually a subset of X_ij, is the q̃-dimensional vector of covariates for cluster specific random effects b_1i and b_2i. It is assumed that the random effects vectors b_1i and b_2i are independent across clusters and have normal distributions with mean zero and variance matrices D₁₁, D₂₂ and covariance matrix D₁₂. We assume in this paper that the variance matrix D formed by D₁₁, D₁₂ and D₂₂ is a positive definite and unstructured matrix, an assumption commonly used in the statistical literature. It is further assumed that e_1ij’s and e_2ij’s are independent residual errors, independent of b_1i, b_2i, and distributed as $N (0, σ_{1}^{2})$ and $N (0, σ_{2}^{2})$ respectively.

Model (1) is particularly applicable for data from multi-center studies where each center is considered as a cluster, or data analyzed in meta analysis when the data from each individual study were available, in which case each study is considered as a cluster. By modeling response variables Y₁ and Y₂ jointly as in (1), we can answer many important scientific questions. For the hospital LOS data example, if we take X_ij = Z_ij = 1, we can then compare the overall LOS’s from CABG and PCI procedures by comparing β₁ and β₂. When we include dummy variables for different representative regions in the world as well as adjusting covariates, we can then estimate the covariate-adjusted CABG LOS’s and PCI LOS’s, and hence their differences for those regions.

Model (1) can be re-written as a linear mixed effects model for a single response variable. For cluster i, denote by Y_1i = (Y_1i1, Y_1i2, …, Y_{1in_1i})^T, the data vector of response variable 1 and Y_2i = (Y_2i1, Y_2i2, …, Y_{2in_2i})^T of response variable 2. Similarly we can define the residual error vectors e_1i and e_2i. Stack $X_{i j}^{T} (j = 1, 2, \dots, n_{1 i})$ to form X_1i and $X_{i, n_{1 i} + j}^{T} (j = 1, 2, \dots, n_{2 i})$ to form X_2i. Similarly we can form Z_1i and Z_2i. We then define new response vector $Y_{i} = {(Y_{1 i}^{T}, Y_{2 i}^{T})}^{T}$ , covariate matrices X_i = diag{X_1i, X_2i}, Z_i = diag{Z_1i, Z_2i}, and residual vector $e_{i} = {(e_{1 i}^{T}, e_{2 i}^{T})}^{T}$ . In matrix notation, model (1) can be re-written as a regular linear mixed effects model for the new response vector Y_i

Y_{i} = X_{i} β + Z_{i} b_{i} + e_{i}, i = 1, 2, \dots, m,

(2)

where $β = {(β_{1}^{T}, β_{2}^{T})}^{T}$ is the p × 1 (p = 2p̃) new fixed effects vector, $b_{i} = {(b_{1 i}^{T}, b_{2 i}^{T})}^{T}$ is the q × 1 (q = 2q̃) new random effects vector distributed as N(0, D), and e_i ~ N(0, R_i) is the new residual error vector independent of b_i with $R_{i} = diag {σ_{1}^{2} I_{n_{1 i} \times n_{1 i}}, σ_{2}^{2} I_{n_{2 i} \times n_{2 i}}}$ .
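The stacking that turns model (1) into model (2) can be sketched for a single cluster as follows. This is a minimal illustration in Python with NumPy, not code from the original analysis; the helper names `block_diag2` and `stack_cluster` and the toy numbers are assumptions made here. Because R_i is diagonal, only its diagonal is stored.

```python
import numpy as np

def block_diag2(a, b):
    """Return diag{a, b}: a and b on the diagonal, zeros elsewhere."""
    out = np.zeros((a.shape[0] + b.shape[0], a.shape[1] + b.shape[1]))
    out[:a.shape[0], :a.shape[1]] = a
    out[a.shape[0]:, a.shape[1]:] = b
    return out

def stack_cluster(y1, X1, Z1, y2, X2, Z2, s1sq, s2sq):
    """Stack the two responses of one cluster into the form of model (2)."""
    Y = np.concatenate([y1, y2])                 # Y_i = (Y_1i^T, Y_2i^T)^T
    X = block_diag2(X1, X2)                      # X_i = diag{X_1i, X_2i}
    Z = block_diag2(Z1, Z2)                      # Z_i = diag{Z_1i, Z_2i}
    # Diagonal of R_i: sigma_1^2 for response 1 rows, sigma_2^2 for response 2 rows
    r_diag = np.concatenate([np.full(len(y1), s1sq), np.full(len(y2), s2sq)])
    return Y, X, Z, r_diag

# Toy cluster (hypothetical values): 3 subjects with response 1, 2 with response 2,
# an intercept plus one covariate as fixed effects, and a random intercept.
X1 = np.column_stack([np.ones(3), [0.5, -1.0, 2.0]])
X2 = np.column_stack([np.ones(2), [1.5, 0.0]])
Z1, Z2 = np.ones((3, 1)), np.ones((2, 1))
y1, y2 = np.array([4.0, 1.0, 9.0]), np.array([6.0, 3.0])
Y_i, X_i, Z_i, r_i = stack_cluster(y1, X1, Z1, y2, X2, Z2, s1sq=1.0, s2sq=4.0)
print(X_i.shape, Z_i.shape, r_i)
```

Here p̃ = 2 and q̃ = 1, so X_i has p = 4 columns and Z_i has q = 2 columns, matching the dimensions stated for model (2).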

The estimation and inference for linear mixed effects models have been thoroughly studied in the statistical literature for the case where cluster sizes are small to moderate, and have been implemented in many statistical software packages, such as the MIXED procedure of SAS, for routine analyses of many types of correlated data, including clustered data. However, the cluster sizes n_i in the LOS data example are so large for some clusters that general purpose software cannot even allocate enough memory during the computation process. In this paper, we develop estimation and inference procedures for model (2) by taking advantage of the special features of the random effects and residual errors in the model.

3 Inference on Fixed Effects β and Random Cluster Effects b_i

Stack Y_i to form the response vector Y. Stack X_i to form the design matrix X for the fixed effects vector β, and define Z = diag{Z₁, Z₂, …, Z_m} for the design matrix of the random effects vector $b = {(b_{1}^{T}, b_{2}^{T}, \dots, b_{m}^{T})}^{T}$ . Given D, $σ_{1}^{2}$ and $σ_{2}^{2}$ , the maximum likelihood estimate (MLE) of β is given by

\hat{β} = {(X^{T} V^{- 1} X)}^{- 1} X^{T} V^{- 1} Y = {(\sum_{i = 1}^{m} X_{i}^{T} V_{i}^{- 1} X_{i})}^{- 1} \sum_{i = 1}^{m} X_{i}^{T} V_{i}^{- 1} Y_{i},

(3)

where V = var(Y) = diag{V₁, V₂, …, V_m}, $V_{i} = var (Y_{i}) = Z_{i} D Z_{i}^{T} + R_{i}$ . For applications where n_i >> q, the calculation of β̂ can be facilitated by the following expressions

V_{i}^{- 1} = R_{i}^{- 1} - R_{i}^{- 1} Z_{i} {(D^{- 1} + Z_{i}^{T} R_{i}^{- 1} Z_{i})}^{- 1} Z_{i}^{T} R_{i}^{- 1},

(4)

X^{T} V^{- 1} X = \sum_{i = 1}^{m} X_{i}^{T} R_{i}^{- 1} X_{i} - \sum_{i = 1}^{m} X_{i}^{T} R_{i}^{- 1} Z_{i} {(D^{- 1} + Z_{i}^{T} R_{i}^{- 1} Z_{i})}^{- 1} Z_{i}^{T} R_{i}^{- 1} X_{i},

(5)

X^{T} V^{- 1} Y = \sum_{i = 1}^{m} X_{i}^{T} R_{i}^{- 1} Y_{i} - \sum_{i = 1}^{m} X_{i}^{T} R_{i}^{- 1} Z_{i} {(D^{- 1} + Z_{i}^{T} R_{i}^{- 1} Z_{i})}^{- 1} Z_{i}^{T} R_{i}^{- 1} Y_{i} .

(6)
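Expressions (4)–(6) can be checked numerically on a small cluster. The following is a minimal sketch in Python with NumPy, assuming a random intercept for each response; the function name `vinv_times` and all toy values are introduced here for illustration. The n_i × n_i matrix V_i is formed only to verify the result.

```python
import numpy as np

def vinv_times(A, Z, r_diag, D):
    """Compute V^{-1} A for V = Z D Z^T + diag(r_diag) using identity (4).

    Only the q x q matrix D^{-1} + Z^T R^{-1} Z is solved against.
    """
    RinvA = A / r_diag[:, None]                      # R^{-1} A (R is diagonal)
    RinvZ = Z / r_diag[:, None]                      # R^{-1} Z
    small = np.linalg.inv(D) + Z.T @ RinvZ           # q x q
    return RinvA - RinvZ @ np.linalg.solve(small, Z.T @ RinvA)

rng = np.random.default_rng(0)
n1, n2 = 6, 4                                        # toy cluster sizes
Z = np.zeros((n1 + n2, 2))
Z[:n1, 0] = 1.0                                      # random intercept, response 1
Z[n1:, 1] = 1.0                                      # random intercept, response 2
x = rng.normal(size=n1 + n2)
X = np.column_stack([Z[:, 0], Z[:, 0] * x, Z[:, 1], Z[:, 1] * x])
Y = rng.normal(size=n1 + n2)
r_diag = np.concatenate([np.full(n1, 1.0), np.full(n2, 4.0)])
D = np.array([[2.0, 1.0], [1.0, 5.0]])

XtVinvX = X.T @ vinv_times(X, Z, r_diag, D)          # this cluster's term in (5)
XtVinvY = X.T @ vinv_times(Y[:, None], Z, r_diag, D) # this cluster's term in (6)

# Verification only: form V_i explicitly, which is what the identities avoid.
V = Z @ D @ Z.T + np.diag(r_diag)
direct_X = X.T @ np.linalg.solve(V, X)
direct_Y = X.T @ np.linalg.solve(V, Y[:, None])
print(np.allclose(XtVinvX, direct_X), np.allclose(XtVinvY, direct_Y))
```

Summing these per-cluster terms over i gives X^T V^{-1} X and X^T V^{-1} Y, with cost linear in n_i rather than cubic.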

Inference on the cluster-specific random effects b_i can be based on the best linear unbiased predictors (BLUPs) ${\hat{b}}_{i} = {E (b_{i} | Y_{i}; β) |}_{β = \hat{β}} = D Z_{i}^{T} V_{i}^{- 1} (Y_{i} - X_{i} \hat{β})$ , which can also be obtained by solving the so-called “mixed model equation” (Henderson 1984) jointly for β and b:

[\begin{matrix} X^{T} R^{- 1} X & X^{T} R^{- 1} Z \\ Z^{T} R^{- 1} X & Z^{T} R^{- 1} Z + G^{- 1} \end{matrix}] [\begin{matrix} β \\ b \end{matrix}] = [\begin{matrix} X^{T} R^{- 1} Y \\ Z^{T} R^{- 1} Y \end{matrix}],

(7)

where G = diag{D, D, …, D} and R = diag{R₁, R₂, …, R_m} are the variance-covariance matrices of b and the residual vector. Note that equation (7) can also be derived as the “score equation” by treating b as a parameter vector and maximizing $f (Y, b; β, D, σ_{1}^{2}, σ_{2}^{2}) = f (Y | b; β, σ_{1}^{2}, σ_{2}^{2}) f (b; D)$ jointly with respect to β and b. There are several advantages of using equation system (7). First, the MLE β̂ and BLUPs b̂ can be obtained without inverting any matrix whose dimension is of the magnitude of the cluster sizes n_i. When the total number of clusters m is small to moderate such that p + mq, the dimension of the equation system (7), is not too large, we can obtain the MLE β̂ and BLUPs b̂ by directly inverting the coefficient matrix of this system. Otherwise, the coefficient matrix can be inverted efficiently by recognizing that Z^TR⁻¹Z + G⁻¹ is a block diagonal matrix with the ith block being $Z_{i}^{T} R_{i}^{- 1} Z_{i} + D^{- 1}$ . Second, once β̂ and b̂ are obtained, the joint inference on β and b can be made easily based on the following variance-covariance expression

var (\begin{matrix} \hat{β} - β \\ \hat{b} - b \end{matrix}) = {[\begin{matrix} X^{T} R^{- 1} X & X^{T} R^{- 1} Z \\ Z^{T} R^{- 1} X & Z^{T} R^{- 1} Z + G^{- 1} \end{matrix}]}^{- 1} .

(8)
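The equivalence between the mixed model equation (7) and the direct formulas (3) and the BLUP expression can be illustrated on a toy data set. The sketch below, in Python with NumPy, assumes the intercept-only version of model (1) (X_ij = Z_ij = 1); the sizes, seed and variable names are hypothetical, and the large matrix V is built only for the comparison.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n_per = 3, 5                        # toy sizes: 3 clusters, 5 subjects per response
D = np.array([[2.0, 1.0], [1.0, 5.0]])
s1sq, s2sq = 1.0, 4.0
p, q = 2, 2

# Intercept-only version of model (1): X_ij = Z_ij = 1.
Xi = np.zeros((2 * n_per, 2))
Xi[:n_per, 0] = 1.0
Xi[n_per:, 1] = 1.0
X = np.vstack([Xi] * m)
Z = np.kron(np.eye(m), Xi)             # Z = diag{Z_1, ..., Z_m}
r = np.tile(np.r_[np.full(n_per, s1sq), np.full(n_per, s2sq)], m)  # diagonal of R
Y = rng.normal(size=X.shape[0])
G = np.kron(np.eye(m), D)              # G = diag{D, ..., D}

# Mixed model equation (7): a system of dimension p + m*q only.
W = np.hstack([X, Z])
C = (W.T / r) @ W                      # W^T R^{-1} W
C[p:, p:] += np.linalg.inv(G)          # add G^{-1} to the random-effects block
rhs = (W.T / r) @ Y
sol = np.linalg.solve(C, rhs)
beta_hat, b_hat = sol[:p], sol[p:]
cov = np.linalg.inv(C)                 # right-hand side of (8)

# Verification only: direct formulas that require the full matrix V.
V = Z @ G @ Z.T + np.diag(r)
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ Y)
b_blup = G @ Z.T @ Vinv @ (Y - X @ beta_gls)
print(np.allclose(beta_hat, beta_gls), np.allclose(b_hat, b_blup))
```

The upper-left p × p block of `cov` coincides with (X^T V⁻¹ X)⁻¹, the usual variance of β̂, which is one way to sanity-check an implementation of (8).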

4 Maximum Likelihood Estimation and Inference of D and $(σ_{1}^{2}, σ_{2}^{2})$

In the linear mixed model literature, the inference on the variance-covariance matrix D of the cluster specific random effects b_i’s and residual variances $(σ_{1}^{2}, σ_{2}^{2})$ can be carried out using the maximum likelihood (ML) approach or the restricted maximum likelihood (REML) approach. We will discuss the ML implementation in this section and the REML in Section 5.

The ML approach estimates β, D and $(σ_{1}^{2}, σ_{2}^{2})$ by jointly maximizing the following log-likelihood function with respect to β, D and $(σ_{1}^{2}, σ_{2}^{2})$

ℓ (β, D, σ_{1}^{2}, σ_{2}^{2}; Y) = - \frac{1}{2} \sum_{i = 1}^{m} log | V_{i} | - \frac{1}{2} \sum_{i = 1}^{m} {(Y_{i} - X_{i} β)}^{T} V_{i}^{- 1} (Y_{i} - X_{i} β),

(9)

where |V_i| can be calculated using $| V_{i} | = σ_{1}^{2 n_{1 i}} σ_{2}^{2 n_{2 i}} | I_{q \times q} + Z_{i}^{T} R_{i}^{- 1} Z_{i} D |$ . The inference of D and $(σ_{1}^{2}, σ_{2}^{2})$ can be based on the Fisher information matrix or the observed information matrix from (9).
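The determinant identity can be verified numerically. The sketch below uses Python with NumPy; the function name `logdet_V`, the row ordering (response 1 rows first) and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def logdet_V(Z, n1, n2, s1sq, s2sq, D):
    """log|V_i| from a q x q determinant.

    Assumes the first n1 rows of Z belong to response 1 and the
    remaining n2 rows to response 2.
    """
    r_diag = np.concatenate([np.full(n1, s1sq), np.full(n2, s2sq)])
    ZtRinvZ = Z.T @ (Z / r_diag[:, None])            # Z^T R^{-1} Z, q x q
    q = D.shape[0]
    _, small = np.linalg.slogdet(np.eye(q) + ZtRinvZ @ D)
    return n1 * np.log(s1sq) + n2 * np.log(s2sq) + small

n1, n2 = 300, 200                                    # toy cluster sizes
Z = np.zeros((n1 + n2, 2))
Z[:n1, 0] = 1.0
Z[n1:, 1] = 1.0
D = np.array([[2.0, 1.0], [1.0, 5.0]])
fast = logdet_V(Z, n1, n2, 1.0, 4.0, D)

# Verification only: log-determinant of the full (n1 + n2) x (n1 + n2) matrix.
r_diag = np.concatenate([np.full(n1, 1.0), np.full(n2, 4.0)])
_, direct = np.linalg.slogdet(Z @ D @ Z.T + np.diag(r_diag))
print(np.isclose(fast, direct))
```

Working on the log scale avoids overflow in the factors $σ_{1}^{2 n_{1 i}} σ_{2}^{2 n_{2 i}}$ when n_1i and n_2i are in the thousands.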

It is interesting to note that although software such as the MIXED procedure of SAS, which is routinely used to fit linear mixed model (2), may suffer from computational difficulty, software such as the NLMIXED procedure of SAS, which uses a numerical integration method for likelihood evaluation, can easily implement the likelihood inference of β, D and $(σ_{1}^{2}, σ_{2}^{2})$ without inverting large matrices. Since Y_ij|b_i are independent under the model specification, by definition, the log-likelihood function (9) can be equivalently re-written as

ℓ (β, D, σ_{1}^{2}, σ_{2}^{2}; Y) = \sum_{i = 1}^{m} log \int exp {\sum_{j = 1}^{n_{i}} log f (Y_{i j} | b_{i}; β, σ_{1}^{2}, σ_{2}^{2}) + log f (b_{i}; D)} d b_{i} .

(10)

Again by the given model specification, the exponent $\sum_{j = 1}^{n_{i}} log f (Y_{i j} | b_{i}; β, σ_{1}^{2}, σ_{2}^{2}) + log f (b_{i}; D)$ inside the integration in (10) does not involve inverting any large matrix, and is a quadratic function of b_i with a negative definite second derivative matrix. Therefore, the integration can be calculated exactly by the adaptive Gauss-Hermite quadrature method with only one quadrature point, which is the numerical integration method implemented in the NLMIXED procedure of SAS. Hence, this SAS procedure can be used to conduct the ML inference on β, D and $(σ_{1}^{2}, σ_{2}^{2})$ for data from large clusters.
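The exactness of one-point adaptive quadrature for a quadratic exponent can be seen in a one-dimensional toy case. The sketch below, in Python with NumPy, assumes a single response with a scalar random intercept (y_j = β + b + e_j); the numerical values and names are hypothetical and this is not the NLMIXED implementation. It uses the fact that the one-point Gauss-Hermite rule has node 0 and weight √π.

```python
import numpy as np

rng = np.random.default_rng(2)
n, beta, d, ssq = 50, 2.0, 2.0, 1.0        # toy values: cluster size, mean, var(b), var(e)
b_true = rng.normal(scale=np.sqrt(d))
y = beta + b_true + rng.normal(scale=np.sqrt(ssq), size=n)
resid = y - beta

def h(b):
    """Exponent of the integrand: sum_j log f(y_j | b) + log f(b)."""
    loglik = -0.5 * n * np.log(2 * np.pi * ssq) - 0.5 * np.sum((resid - b) ** 2) / ssq
    logprior = -0.5 * np.log(2 * np.pi * d) - 0.5 * b ** 2 / d
    return loglik + logprior

k = n / ssq + 1.0 / d                      # minus the second derivative of h
b_mode = (resid.sum() / ssq) / k           # maximizer of h

# One-point adaptive Gauss-Hermite rule: node 0 and weight sqrt(pi), with the
# node recentred at the mode and rescaled by sqrt(2 / k).
node, weight = 0.0, np.sqrt(np.pi)
scale = np.sqrt(2.0 / k)
loglik_quad = np.log(scale * weight) + node ** 2 + h(b_mode + scale * node)

# Verification only: marginal normal log-density with the full n x n matrix V.
V = d * np.ones((n, n)) + ssq * np.eye(n)
_, logdet = np.linalg.slogdet(V)
loglik_direct = (-0.5 * n * np.log(2 * np.pi) - 0.5 * logdet
                 - 0.5 * resid @ np.linalg.solve(V, resid))
print(np.isclose(loglik_quad, loglik_direct))
```

The two values agree to rounding error because the integrand is exactly Gaussian in b; with a non-quadratic exponent the one-point rule would only be a Laplace approximation.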

Even though we can use existing software such as the NLMIXED procedure of SAS to calculate the likelihood function $ℓ (β, D, σ_{1}^{2}, σ_{2}^{2}; Y)$ , we may still encounter numerical problems since this procedure relies on the numerical differentiation method to calculate the required derivatives in the optimization process. When the total number of parameters in the model is small, the numerical differentiation method may work well. Otherwise, non-negligible numerical errors accumulated over the optimization process will yield poor parameter estimates or cause the optimization process not to converge. To overcome these problems, we can modify the EM algorithm of Laird and Ware (1982) for the ML estimation of a linear mixed model by treating the cluster specific random effects b_i’s as missing data. It is well-known that an EM algorithm is numerically stable, will always increase the observed likelihood function during the parameter update process, and under the given model specification, parameter updates for MLE’s have closed form expressions without the need to invert any large dimensional matrices. Readers are referred to Laird and Ware (1982) for more details.

5 Restricted Maximum Likelihood Estimation and Inference of D and $(σ_{1}^{2}, σ_{2}^{2})$

It is well-known that the ML approach for estimating D and $(σ_{1}^{2}, σ_{2}^{2})$ presented in the previous section does not account for the estimation of the fixed effects β and hence will produce biased estimates of D and $(σ_{1}^{2}, σ_{2}^{2})$ for small to moderate sample size m. This is because when β is profiled out from the log-likelihood (9) during the maximization process, the resulting function of D and $(σ_{1}^{2}, σ_{2}^{2})$ alone is not a log-likelihood function of any (transformed) data. Therefore, the estimating equations for D and $(σ_{1}^{2}, σ_{2}^{2})$ are biased. Even though the biases in the ML estimates of D and $(σ_{1}^{2}, σ_{2}^{2})$ will disappear asymptotically, they may not be negligible for small to moderate sample size m, especially for the estimate of D. Since the MLE’s of the fixed effects β depend on the estimates of D and $(σ_{1}^{2}, σ_{2}^{2})$ , the biases in these estimates usually will in turn yield more biased estimates of β. On the other hand, the restricted maximum likelihood (REML) approach will usually yield less biased estimates of D and $(σ_{1}^{2}, σ_{2}^{2})$ . For example, in a regular linear regression model, the REML estimate of the residual variance is unbiased.
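To make the linear regression example concrete, consider the standard model Y = Xβ + e with n observations, p regression coefficients (X of full column rank) and e ~ N(0, σ²I). The notation RSS for the residual sum of squares is introduced here for illustration only.

```latex
RSS = (Y - X \hat{\beta})^{T} (Y - X \hat{\beta}), \qquad E(RSS) = (n - p) \sigma^{2},

\hat{\sigma}^{2}_{ML} = \frac{RSS}{n}, \qquad E(\hat{\sigma}^{2}_{ML}) = \frac{n - p}{n} \sigma^{2},

\hat{\sigma}^{2}_{REML} = \frac{RSS}{n - p}, \qquad E(\hat{\sigma}^{2}_{REML}) = \sigma^{2}.
```

The ML estimate is therefore biased downward by the factor (n − p)/n, which tends to 1 as n grows with p fixed, while the REML estimate corrects for the p degrees of freedom spent on estimating β.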

For the linear mixed model (2), it is well-known that the REML log-likelihood function of D and $(σ_{1}^{2}, σ_{2}^{2})$ is given by (Diggle et al. 2002)

ℓ_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y) = - \frac{1}{2} log | X^{T} V^{- 1} X | - \frac{1}{2} log | V | - \frac{1}{2} {(Y - X \hat{β})}^{T} V^{- 1} (Y - X \hat{β}),

(11)

where β̂ is the MLE of β given in (3). Compared to the log-likelihood function (9), the REML log-likelihood function involves one extra term, log |X^TV⁻¹X|. The expression in (5) for X^TV⁻¹X can be used to calculate this extra term without directly inverting the individual variance-covariance matrices V_i.

Although it is computationally feasible to calculate the REML function $ℓ_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y)$ for data with large cluster sizes, direct maximization of this function may sometimes be numerically unstable, even if the random effects dimension q is moderate. Unlike the ML estimation, we cannot adapt existing software such as the NLMIXED procedure of SAS to implement the REML estimation. Because of the attractive properties of an EM algorithm, we present in the following sub-section an EM algorithm for maximizing the REML log-likelihood function $ℓ_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y)$ .

5.1 EM Algorithm for the REML Estimation of D and $(σ_{1}^{2}, σ_{2}^{2})$

It is well-known that the REML log-likelihood function $ℓ_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y)$ can be alternatively derived using the following formula (Harville, 1974)

L_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y) = e^{ℓ_{R} (D, σ_{1}^{2}, σ_{2}^{2}; Y)} = \int f (Y | b; β, σ_{1}^{2}, σ_{2}^{2}) f (b; D) d β d b,

where both fixed effects β and random effects b are integrated out from the joint distribution of response Y and random effects b.

Denote by D^(t), $σ_{1}^{2 (t)}$ , and $σ_{2}^{2 (t)}$ the estimates of D, $σ_{1}^{2}$ , and $σ_{2}^{2}$ at the t-th iteration of the REML EM algorithm. Theorem 1 in Appendix A, the general EM algorithm for the REML estimation, indicates that the update D^(t+1), $σ_{1}^{2 (t + 1)}, σ_{2}^{2 (t + 1)}$ can be obtained by maximizing the following REML Q−function with respect to D and $(σ_{1}^{2}, σ_{2}^{2})$

Q_{R} (D, σ_{1}^{2}, σ_{2}^{2} | D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)}) = E_{g} log f (Y | b; β, σ_{1}^{2}, σ_{2}^{2}) + E_{g} log f (b; D) = - \frac{1}{2} log | R | - \frac{1}{2} E_{g} {(Y - X β - Z b)}^{T} R^{- 1} (Y - X β - Z b) - \frac{m}{2} log | D | - \frac{1}{2} \sum_{i = 1}^{m} E_{g} (b_{i}^{T} D^{- 1} b_{i}),

where the expectation E_g stands for the expectation taken with respect to the distribution $g (β, b | Y; D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)})$ defined by

g (β, b | Y; D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)}) = f (Y | b; β, σ_{1}^{2 (t)}, σ_{2}^{2 (t)}) f (b; D^{(t)}) / L_{R} (D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)}; Y) .

It should be noted that β and b are both treated as random variables in the distribution $g (β, b | Y; D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)})$ . By the given model specification, it is easy to see that g(β, b|Y; D^(t), σ^2(t)) is a normal distribution with the mean given by the solution of (7) and the variance matrix given by (8), both of which are evaluated at the current estimates $(D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)})$ . Denote the mean vector of this distribution by (β̂^T, b̂^T)^T and the variance matrix by Σ (we suppress the obvious dependence of these quantities on t for a cleaner presentation). Then maximizing $Q_{R} (D, σ_{1}^{2}, σ_{2}^{2} | D^{(t)}, σ_{1}^{2 (t)}, σ_{2}^{2 (t)})$ with respect to D and $(σ_{1}^{2}, σ_{2}^{2})$ leads to the following updates:

D^{(t + 1)} = \frac{1}{m} \sum_{i = 1}^{m} {{\hat{b}}_{i} {\hat{b}}_{i}^{T} + Σ_{i}},

σ_{1}^{2 (t + 1)} = \frac{1}{n_{1}} [\sum_{i = 1}^{m} {(Y_{1 i} - X_{1 i} {\hat{β}}_{1} - Z_{1 i} {\hat{b}}_{1 i})}^{T} (Y_{1 i} - X_{1 i} {\hat{β}}_{1} - Z_{1 i} {\hat{b}}_{1 i}) + tr {Σ W_{1}^{T} W_{1}}],

σ_{2}^{2 (t + 1)} = \frac{1}{n_{2}} [\sum_{i = 1}^{m} {(Y_{2 i} - X_{2 i} {\hat{β}}_{2} - Z_{2 i} {\hat{b}}_{2 i})}^{T} (Y_{2 i} - X_{2 i} {\hat{β}}_{2} - Z_{2 i} {\hat{b}}_{2 i}) + tr {Σ W_{2}^{T} W_{2}}],

where Σ_i is the block of Σ corresponding to the random effects b_i, W₁ and W₂ are sub-matrices of (X, Z) corresponding to responses Y₁ and Y₂ respectively, and $n_{1} = \sum_{i = 1}^{m} n_{1 i}, n_{2} = \sum_{i = 1}^{m} n_{2 i}$ , total numbers of patients for responses Y₁ and Y₂. We can take advantage of the special structure of Σ described in Section 3 to facilitate the calculation of the updates D^(t+1), $σ_{1}^{2 (t + 1)}$ and $σ_{2}^{2 (t + 1)}$ .
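The updates above can be sketched in a few lines of Python with NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name `reml_em`, the layout of W = (X, Z) with the random-effects columns ordered cluster by cluster, the indicator `resp`, the starting values and the toy data are all hypothetical. It inverts the full (p + mq)-dimensional coefficient matrix of (7) directly, which suits a small to moderate number of clusters m.

```python
import numpy as np

def reml_em(Y, W, resp, p, m, q, max_iter=500, tol=1e-8):
    """REML EM sketch. W = [X, Z]; its last m*q columns are the random-effects
    columns ordered cluster by cluster; resp[k] is 1 or 2 for row k."""
    W1, W2 = W[resp == 1], W[resp == 2]
    Y1, Y2 = Y[resp == 1], Y[resp == 2]
    n1, n2 = len(Y1), len(Y2)
    # Cross-products are formed once; the loop only handles (p + m*q)-sized objects.
    A1, A2 = W1.T @ W1, W2.T @ W2
    c1, c2 = W1.T @ Y1, W2.T @ Y2
    yy1, yy2 = Y1 @ Y1, Y2 @ Y2
    D, s1, s2 = np.eye(q), 1.0, 1.0            # arbitrary starting values
    for _ in range(max_iter):
        # E-step: mean (solution of (7)) and variance (8) of g(beta, b | Y)
        C = A1 / s1 + A2 / s2
        C[p:, p:] += np.kron(np.eye(m), np.linalg.inv(D))
        Sigma = np.linalg.inv(C)
        theta = Sigma @ (c1 / s1 + c2 / s2)    # (beta_hat, b_hat)
        b = theta[p:].reshape(m, q)
        # M-step: closed-form updates for D, sigma_1^2 and sigma_2^2
        D_new = b.T @ b / m
        for i in range(m):
            blk = slice(p + i * q, p + (i + 1) * q)
            D_new = D_new + Sigma[blk, blk] / m
        rss1 = yy1 - 2 * theta @ c1 + theta @ A1 @ theta
        rss2 = yy2 - 2 * theta @ c2 + theta @ A2 @ theta
        s1_new = (rss1 + np.sum(Sigma * A1)) / n1   # np.sum(Sigma * A1) = tr(Sigma W1^T W1)
        s2_new = (rss2 + np.sum(Sigma * A2)) / n2
        change = max(np.abs(D_new - D).max(), abs(s1_new - s1), abs(s2_new - s2))
        D, s1, s2 = D_new, s1_new, s2_new
        if change < tol:
            break
    return theta[:p], D, s1, s2

# Toy data (hypothetical): intercept-only model, 10 clusters, 200 subjects per response.
rng = np.random.default_rng(3)
m, n_per, p, q = 10, 200, 2, 2
D_true = np.array([[2.0, 1.0], [1.0, 5.0]])
b_true = rng.multivariate_normal(np.zeros(2), D_true, size=m)
N = 2 * m * n_per
resp = np.tile(np.r_[np.ones(n_per, int), np.full(n_per, 2)], m)
cluster = np.repeat(np.arange(m), 2 * n_per)
W = np.zeros((N, p + m * q))
W[np.arange(N), resp - 1] = 1.0                      # fixed intercepts
W[np.arange(N), p + q * cluster + resp - 1] = 1.0    # random intercepts
sd = np.where(resp == 1, 1.0, 2.0)                   # residual SDs: sigma_1 = 1, sigma_2 = 2
Y = np.array([2.0, 3.0])[resp - 1] + b_true[cluster, resp - 1] + sd * rng.normal(size=N)
beta_hat, D_hat, s1_hat, s2_hat = reml_em(Y, W, resp, p, m, q)
print(beta_hat, D_hat, s1_hat, s2_hat)
```

Two design choices are worth noting. The residual sums of squares are computed from precomputed cross-products, which keeps each iteration independent of the cluster sizes but can lose precision through cancellation; recomputing the residuals directly is a safer alternative when accuracy matters. For large m, the dense inverse of C would be replaced by the block-diagonal approach described in Section 3.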

It is clear that this EM algorithm for the REML estimation is easy to implement since it yields explicit closed-form parameter updates at each iteration. It is also computationally efficient since it does not involve inverting any matrix whose dimension is of the magnitude of the cluster sizes. Another advantage of this EM algorithm is that the variance-covariance matrix (8) is readily available at the convergence of the algorithm for the joint inference of fixed effects β and random effects b.

6 Simulation

In this section, we conducted a simulation study to demonstrate the superior performance of the EM algorithm for the REML estimation presented in the previous section over the ML inference using the NLMIXED procedure of SAS, especially in terms of computational time. In the simulation, we set m = 20 and conducted 100 simulation runs. In each simulation run, we generated n_1i and n_2i from Bin(2000, 0.5) for i = 1, 2, …, m = 20. Here the expected cluster size is 2000, so large that existing software such as the MIXED procedure of SAS cannot allocate enough memory during the computation. We consider the situation of one covariate x, which was generated from N(0, 1). Then clustered responses Y_1ij and Y_2ij were generated according to the following bivariate linear mixed model:

Y_{1 i j} = β_{10} + x_{i j} β_{11} + b_{1 i} + e_{1 i j}, i = 1, 2, \dots, m, j = 1, 2, \dots, n_{1 i},

Y_{2 i j} = β_{20} + x_{i, n_{1 i} + j} β_{21} + b_{2 i} + e_{2 i j}, i = 1, 2, \dots, m, j = 1, 2, \dots, n_{2 i},

(12)

where β₁₀ = 2, β₁₁ = 3, β₂₀ = 3, β₂₁ = 2, b_1i, b_2i were generated from a bivariate normal distribution with mean zero and variance matrix {σ_ij} with σ₁₁ = 2, σ₁₂ = σ₂₁ = 1 and σ₂₂ = 5, and the variances of the residual errors e_1ij and e_2ij were set as $σ_{1}^{2} = 1$ and $σ_{2}^{2} = 4$ . The generated data set was analyzed by the proposed REML EM algorithm and the ML method implemented using the NLMIXED procedure of SAS. The simulation results summarized over 100 simulation runs are presented in Table 1.
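One simulated data set from model (12) can be generated as in the following sketch, written in Python with NumPy. The function name `simulate`, the long-format output and the seed are assumptions made here for illustration; the parameter values are those stated above.

```python
import numpy as np

def simulate(m=20, trials=2000, seed=0):
    """Generate one data set from model (12) in long format."""
    rng = np.random.default_rng(seed)
    beta1, beta2 = (2.0, 3.0), (3.0, 2.0)          # (intercept, slope) per response
    D = np.array([[2.0, 1.0], [1.0, 5.0]])         # variance matrix of (b_1i, b_2i)
    s1sq, s2sq = 1.0, 4.0                          # residual variances
    parts = []
    for i in range(m):
        n1, n2 = rng.binomial(trials, 0.5, size=2) # n_1i, n_2i ~ Bin(2000, 0.5)
        b1, b2 = rng.multivariate_normal(np.zeros(2), D)
        x1, x2 = rng.normal(size=n1), rng.normal(size=n2)
        y1 = beta1[0] + beta1[1] * x1 + b1 + rng.normal(scale=np.sqrt(s1sq), size=n1)
        y2 = beta2[0] + beta2[1] * x2 + b2 + rng.normal(scale=np.sqrt(s2sq), size=n2)
        parts.append((np.full(n1, i), np.ones(n1, dtype=int), x1, y1))
        parts.append((np.full(n2, i), np.full(n2, 2), x2, y2))
    # Concatenate cluster index, response indicator, covariate and response
    cluster, resp, x, y = (np.concatenate(col) for col in zip(*parts))
    return cluster, resp, x, y

cluster, resp, x, y = simulate()
print(len(y), np.bincount(cluster).mean())
```

The second printed value is the average cluster size n_i = n_1i + n_2i, which should be close to the expected size of 2000.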

Table 1.

Simulation results comparing REML and MLE. “Bias”, “SD”, “SE” and “CP” are the bias, empirical standard deviation, estimated standard error and empirical coverage probability of a 95% CI of a parameter estimate based on 100 Monte Carlo runs.

Parameter	REML EM				ML

	Bias	SD	SE	CP	Bias	SD	SE	CP
β₁₀	−0.015	0.316	0.318	0.950	−0.015	0.316	0.310	0.950
β₁₁	0.000	0.007	–	0.930	0.000	0.007	–	0.930
β₂₀	−0.054	0.510	0.496	0.910	−0.054	0.510	0.483	0.890
β₂₁	0.001	0.015	0.014	0.910	0.001	0.015	0.014	0.910
σ₁₁	0.071	0.614	–	–	−0.032	0.584	–	–
σ₁₂	0.052	0.754	–	–	0.000	0.717	–	–
σ₂₂	0.048	1.670	–	–	−0.205	1.588	–	–
σ_{1}^{2}	−0.001	0.001	–	–	−0.001	0.010	–	–
σ_{2}^{2}	0.002	0.037	–	–	0.002	0.037	–	–

From Table 1, we observe that both REML and ML estimates of the fixed effects are virtually unbiased with the same empirical standard deviations up to 3 decimal places. However, the ML approach produced slightly smaller estimated standard errors for the two intercept estimates β̂₁₀ and β̂₂₀, resulting in a lower than nominal coverage probability of a 95% confidence interval (CI) of β₂₀. We also observe that the REML estimates of all variance/covariance components are virtually unbiased. However, there is a sizable bias in the ML estimate of σ₂₂. Although it is well-known that ML estimates of variance/covariance components are somewhat biased since the ML approach does not take into account the estimation of the fixed effects, it is possible that the bias in the ML estimate of σ₂₂ and the low empirical coverage of a 95% CI of β₂₀ are more a consequence of computational instability. This is because the numerical differentiation method used in the NLMIXED procedure of SAS to calculate the required derivatives may not be stable. The main practical advantage of the REML EM algorithm over the ML approach is the savings in computational time. The simulation study was performed on a Linux platform built on Intel(R) (8)Core(TM) i7 CPU at 2.93GHz with 8 GB RAM. On average the REML EM took only 6 seconds to analyze a data set, while the ML approach took 210 seconds to analyze the same data set.

7 Application to the LOS Data

We applied the bivariate linear mixed model (1) and the REML EM algorithm developed in this paper to analyze the LOS data. The data are from ten international clinical trials for patients with acute coronary syndromes (ACS): EARLY ACS, GUSTO IIb, GUSTO IV, PARAGON A, PARAGON B, PRISM, PRISM PLUS, PURSUIT, SYNERGY and TRACER. Each trial was conducted in several of the following representative regions in the world: Asia, Australia/New Zealand, Europe, Latin America, Middle East, North America and South Africa. For more description of these trials, please see Chan et al. (2012). In this paper, we only included patients receiving a CABG or a PCI, resulting in the following cluster sizes (numbers of patients) for the ten clinical trials: 6469 for EARLY ACS, 1651 for GUSTO IIb, 1952 for GUSTO IV, 402 for PARAGON A, 1525 for PARAGON B, 1072 for PRISM, 1027 for PRISM PLUS, 4013 for PURSUIT, 6385 for SYNERGY and 8689 for TRACER. One of the research objectives is to estimate the expected LOS’s of the CABG and PCI procedures, and their differences in different regions adjusting for patients’ age, smoking and diabetes statuses, while taking into account the study-to-study variation. To address this objective, we considered the following model

Y_{1 i j} = D_{1 i j}^{T} β_{1} + C_{1 i j}^{T} γ_{1} + b_{1 i} + e_{1 i j}, i = 1, 2, \dots, 10, j = 1, 2, \dots, n_{1 i},

Y_{2 i j} = D_{2 i, n_{1 i} + j}^{T} β_{2} + C_{2 i, n_{1 i} + j}^{T} γ_{2} + b_{2 i} + e_{2 i j}, i = 1, 2, \dots, 10, j = 1, \dots, n_{2 i},

(13)

where Y_1ij is the CABG LOS of patient j in trial i, and Y_2ij is the PCI LOS of the (n_1i + j)th patient in the same trial, D_1ij, D_{2i,n_1i+j} are 7 × 1 vectors of dummy variables for regions with corresponding effects β₁, β₂, and C_1ij, C_{2i,n_1i+j} are 3 × 1 vectors of covariates representing patients’ age in years (centered at 60 years, the approximate sample mean), smoking status (1 for ever/current smoking, 0 for never-smoking) and diabetes status (1 for diabetes and 0 for non-diabetes). Therefore, β₁, β₂ are vectors representing the expected LOS’s of CABG and PCI procedures respectively for the patient populations (called target populations) who are 60 years old, non-smoking and free of diabetes in the seven regions in the world, and γ₁, γ₂ are the effects of those covariates on the two LOS’s. We used normally distributed random effects (b_1i, b_2i)^T with mean zero and variance matrix {σ_ij} to account for the study-to-study variation in the CABG LOS’s and PCI LOS’s, as well as their correlation. The residual errors e_1ij’s and e_2ij’s are assumed to be independent with residual variances $σ_{1}^{2}$ and $σ_{2}^{2}$ respectively.

Because of the large cluster sizes in this application, commercial software such as the MIXED procedure of SAS could not fit the bivariate linear mixed model (13) since it failed to allocate enough memory during the optimization process. Using the REML EM algorithm we developed, the fitting of model (13) was completed in seconds. The estimated expected LOS’s of the CABG and PCI procedures and their differences for the target populations in the seven regions are presented in Table 2. Also presented in this table are contrasts of these parameters of a region to the reference region: South Africa. From this table, we observe that patients’ CABG and PCI LOS’s in Asia and Europe are much longer than those in other regions, and that the expected CABG LOS is about 4 or 5 days longer than the expected PCI LOS in any region, with Asia having the biggest difference. This result may reflect the difference in patient management and quality of care in different regions.

Table 2.

Estimated expected CABG LOS (μ̂_C), PCI LOS (μ̂_P) and their difference μ̂_C − μ̂_P for the target populations in the seven regions of the world. Also presented are contrasts (denoted by Δ) of these parameters of a region to the reference region: South Africa. Numbers inside parentheses are the estimated standard errors (SE) of the estimated parameters. AU/NZ stands for Australia/New Zealand.

Region	CABG		PCI		CABG - PCI

	μ̂_C	Δμ̂_C	μ̂_P	Δμ̂_P	μ̂_C − μ̂_P	Δ(μ̂_C − μ̂_P)
Asia	12.77(0.58)	3.08(0.70)	6.29(0.24)	2.23(0.23)	6.48(0.60)	0.85(0.74)
AU/NZ	9.30(0.42)	−0.39(0.58)	3.83(0.25)	−0.23(0.24)	5.46(0.45)	−0.17(0.63)
Europe	11.95(0.24)	2.26(0.46)	6.07(0.21)	2.00(0.20)	5.88(0.25)	0.23(0.50)
Latin America	10.46(0.35)	0.78(0.53)	5.23(0.24)	1.17(0.23)	5.24(0.37)	−0.39(0.57)
Middle East	8.70(0.42)	−0.98(0.58)	4.80(0.23)	0.73(0.22)	3.91(0.44)	−1.71(0.62)
North America	8.10(0.23)	−1.58(0.46)	3.77(0.22)	−0.29(0.20)	4.33(0.25)	−1.29(0.50)
South Africa	9.68(0.49)	–	4.06(0.29)	–	5.62(0.54)	–

Even though the estimated expected CABG LOS is much greater than the expected PCI LOS in any region, the estimated variances of random trial effects for the CABG LOS and PCI LOS are very close (σ̂₁₁ = 0.36, σ̂₂₂ = 0.43). The covariance of random trial effects for CABG LOS and PCI LOS is estimated to be σ̂₁₂ = 0.20, indicating that studies with longer CABG LOS’s tend to have longer PCI LOS’s. The estimated residual variances of patient CABG LOS’s and PCI LOS’s are ${\hat{σ}}_{1}^{2} = 31$ and ${\hat{σ}}_{2}^{2} = 11$ respectively, indicating that there is much greater within-trial variation in patients’ CABG LOS’s than in patients’ PCI LOS’s in any given study and region.

We also implemented the ML approach using the NLMIXED procedure of SAS. Unfortunately, the program produced a Hessian matrix with some negative eigenvalues during the optimization process and failed to provide valid estimates, especially for the variance/covariance parameters. This further demonstrates the advantage of the proposed REML EM algorithm for fitting a linear mixed model for clustered data with large cluster sizes.

8 Discussion

In this paper, we discussed joint modeling of two clustered response variables, the hospital lengths of stay (LOS) of patients with acute coronary syndromes (ACS) who received CABG or PCI in several international clinical trials. A bivariate linear mixed model with separate fixed effects and random effects was proposed for the joint modeling, where the random effects were used to model the study-to-study variation. Due to the large cluster sizes of these clinical trials, commercial software such as SAS could not fit the proposed model. We proposed computational solutions for both the ML inference and the REML inference. Specifically, we proposed an EM algorithm for the REML inference and provided an implementation of the ML inference using an existing procedure of SAS. Simulation studies indicated that, compared to the ML approach implemented with SAS, the proposed REML EM algorithm is computationally stable and efficient. By applying the proposed REML EM algorithm to the LOS data, we were able to make inference on the expected LOS’s of CABG and PCI and on their difference for different regions of the world, after adjusting for important covariates.

With the current random effects structure, when a large number of candidate covariates are available, we may use traditional model selection methods such as forward, backward and stepwise selection, or more advanced penalized likelihood methods with sparsity penalties on potential fixed effects. In this case, we need to calculate the likelihood function of the potential fixed effects. For this purpose we can develop an ML EM algorithm similar to that of Laird and Ware (1982) for our bivariate linear mixed model. The current random effects structure uses two correlated, response-specific random intercepts to model the cluster-to-cluster variation of the responses. If prior information indicates that some “covariate effects” on each response variable vary from cluster to cluster, we can expand the current random effects structure to include these “covariate effects” as additional (correlated) random effects for both responses. An information criterion such as the Bayesian Information Criterion, together with the penalized likelihood approach, may be used to select a final random effects structure and set of fixed effects.

The proposed bivariate linear mixed model is most appropriate for jointly modeling two continuous clustered responses. For discrete clustered responses such as binary responses, a more attractive model would be the bivariate generalized linear mixed model. It is straightforward to extend the idea to this model for the case of large cluster sizes. For example, the ML inference can be easily implemented with the NLMIXED procedure of SAS. However, the EM algorithm for the REML-like inference in the bivariate generalized linear mixed model does not have all the attractive features it has in the bivariate linear mixed model. For example, there is no closed-form expression for the updated parameters during the EM iteration, and a numerical method has to be used to obtain the parameter update. Despite this problem, we think that the EM algorithm will still be computationally more stable and efficient than the ML inference implemented with the NLMIXED procedure of SAS. It will be an interesting future project to evaluate their performance for the case of large cluster sizes.

Acknowledgments

The work of D. Zhang was supported by NIH grant R01 CA85848-12. The work of J. L. Sun and K. Pieper was supported through a grant from the Duke Clinical Research Institute. The authors are grateful to Dr. Eric Peterson for his institutional financial support as well as for providing the LOS data, without which this research would not have been possible.

Appendix A

Properties of the REML EM Algorithm

The following theorem states the general EM algorithm for the REML estimation.

Theorem 1

Suppose f(y|b; β, θ) is the conditional probability density function of y given random effects b and f(b; θ) is the probability density function of the random effects b. Assume the following “REML” likelihood

L_{R} (θ; y) = \int f (y | b; β, θ) f (b; θ) d β d b

exists, so that g(β, b|y; θ) = f(y|b; β, θ) f (b; θ)/L_R(θ; y) is a probability density function. Given estimate θ^(t) at the tth iteration, define the Q-function for the REML algorithm as follows:

Q_{R} (θ | θ^{(t)}) = E_{g} [log {f (y | b; β, θ) f (b; θ)}],

where E_g stands for the expectation taken with respect to g(β, b|y; θ^(t)). Let θ^(t+1) denote the update of θ obtained by maximizing Q_R(θ|θ^(t)) with respect to θ. Then

ℓ_{R} (θ^{(t + 1)}; y) \geq ℓ_{R} (θ^{(t)}; y), for all t,

where ℓ_R(θ; y) = log L_R(θ; y).

Proof

By the definition of L_R(θ; y), we have

log {f (y | b; β, θ) f (b; θ)} = log g (β, b | y; θ) + ℓ_{R} (θ; y) .

Taking the expectation with respect to g(β, b|y; θ^(t)) on both sides of the above equation, and noting that ℓ_R(θ; y) does not depend on (β, b), leads to

Q_{R} (θ | θ^{(t)}) = E_{g} log g (β, b | y; θ) + ℓ_{R} (θ; y) .

Denote H(θ) = E_g log g(β, b|y; θ). Then

ℓ_{R} (θ^{(t + 1)}; y) - ℓ_{R} (θ^{(t)}; y) = Q_{R} (θ^{(t + 1)} | θ^{(t)}) - Q_{R} (θ^{(t)} | θ^{(t)}) - {H (θ^{(t + 1)}) - H (θ^{(t)})} .

By the definition of H(θ), E_g and Jensen’s inequality, we have

H (θ^{(t + 1)}) - H (θ^{(t)}) = E_{g} log {\frac{g (β, b | y; θ^{(t + 1)})}{g (β, b | y; θ^{(t)})}} \leq log E_{g} {\frac{g (β, b | y; θ^{(t + 1)})}{g (β, b | y; θ^{(t)})}} = log (1) = 0 .

Since θ^(t+1) maximizes Q_R(θ|θ^(t)), it follows that Q_R(θ^(t+1)|θ^(t)) ≥ Q_R(θ^(t)|θ^(t)). Therefore

ℓ_{R} (θ^{(t + 1)}; y) \geq ℓ_{R} (θ^{(t)}; y), for all t .
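
As a numerical illustration of Theorem 1, the following sketch runs a REML EM iteration on a simplified model and records ℓ_R after every update. It assumes a univariate one-way random-intercept model y_ij = β + b_i + ε_ij with b_i ~ N(0, d) and ε_ij ~ N(0, s2), not the bivariate model of this paper. The data are simulated, and the names `cluster_stats`, `reml_loglik` and `em_step` are hypothetical. Each cluster is reduced to its size, mean and within-cluster sum of squares, so no matrix with the dimension of a cluster is ever formed.

```python
import math
import random


def cluster_stats(clusters):
    """Reduce each cluster to (n_i, mean_i, within-cluster sum of squares)."""
    stats = []
    for y in clusters:
        n = len(y)
        mean = sum(y) / n
        ssw = sum((v - mean) ** 2 for v in y)
        stats.append((n, mean, ssw))
    return stats


def weights_and_beta(stats, d, s2):
    """Weights w_i = n_i / (s2 + n_i d) and the weighted estimate of beta."""
    w = [n / (s2 + n * d) for n, _, _ in stats]
    total = sum(w)
    beta = sum(wi * mean for wi, (_, mean, _) in zip(w, stats)) / total
    return w, total, beta


def reml_loglik(stats, d, s2):
    """REML log-likelihood of (d, s2), up to an additive constant."""
    w, total, beta = weights_and_beta(stats, d, s2)
    ll = -0.5 * math.log(total)
    for wi, (n, mean, ssw) in zip(w, stats):
        ll -= 0.5 * ((n - 1) * math.log(s2) + math.log(s2 + n * d))
        ll -= 0.5 * (ssw / s2 + wi * (mean - beta) ** 2)
    return ll


def em_step(stats, d, s2):
    """One REML EM update; expectations are under g(beta, b | y; d, s2)."""
    _, total, beta = weights_and_beta(stats, d, s2)
    sum_b2 = 0.0   # E[sum_i b_i^2]
    sum_r2 = 0.0   # E[sum_ij (y_ij - beta - b_i)^2]
    n_total = 0
    for n, mean, ssw in stats:
        k = n * d / (s2 + n * d)    # shrinkage factor for b_i
        v = d * s2 / (s2 + n * d)   # Var(b_i | beta, y)
        sum_b2 += (k * (mean - beta)) ** 2 + v + k * k / total
        c2 = ((1 - k) * (mean - beta)) ** 2 + v + (1 - k) ** 2 / total
        sum_r2 += ssw + n * c2
        n_total += n
    return sum_b2 / len(stats), sum_r2 / n_total


# Simulated data; the true values are illustrative assumptions only.
random.seed(1)
true_d, true_s2 = 0.36, 31.0
clusters = []
for _ in range(30):
    n = random.randint(200, 2000)
    b = random.gauss(0.0, math.sqrt(true_d))
    clusters.append([10.0 + b + random.gauss(0.0, math.sqrt(true_s2)) for _ in range(n)])
stats = cluster_stats(clusters)

d, s2 = 1.0, 1.0
trace = [reml_loglik(stats, d, s2)]
for _ in range(200):
    d, s2 = em_step(stats, d, s2)
    trace.append(reml_loglik(stats, d, s2))

print(f"d = {d:.3f}, s2 = {s2:.3f}, final REML log-likelihood = {trace[-1]:.3f}")
```

For this toy model, the theorem implies that the values stored in `trace` are nondecreasing up to rounding error.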

Footnotes

The authors declare no conflict of interest.

References

Chan M, Sun J, Newby L, Lokhnygina Y, White HD, Moliterno DJ, Théroux P, Ohman EM, Simoons ML, Mahaffey KW, Pieper KS, Giugliano RG, Armstrong PW, Califf RM, Van de Werf F, Harrington RA. Trends in clinical trials of non-ST-segment elevation acute coronary syndromes over 15 years. International Journal of Cardiology. 2012. doi: 10.1016/j.ijcard.2012.01.065.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B. 1977;39:1–38.
Diggle PJ, Heagerty P, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. 2nd ed. Oxford University Press; 2002.
Harville DA. Bayesian inference for variance components using only error contrasts. Biometrika. 1974;61:383–385.
Henderson CR. Applications of Linear Models in Animal Breeding. University of Guelph; 1984.
Laird NM, Ware JH. Random effects models for longitudinal data. Biometrics. 1982;38:963–974.
SAS/STAT 9.3 User’s Guide. Cary, NC: SAS Institute Inc.; 2013.

