Summary
The conventional model selection criterion, the Akaike information criterion (AIC), has been applied to choose among candidate mixed-effects models on the basis of the marginal likelihood. Vaida and Blanchard (2005) demonstrated that such a marginal AIC and its small-sample correction are inappropriate when the research focus is on clusters, and suggested using a conditional AIC instead. Their conditional AIC is derived under the assumption that the variance-covariance matrix or the scaled variance-covariance matrix of the random effects is known. We develop a general conditional AIC without these strong assumptions, which allows Vaida and Blanchard’s conditional AIC to be applied much more widely. Simulation studies show that the proposed method is promising.
Some key words: Akaike information criterion, conditional likelihood, Kullback-Leibler information, longitudinal data, marginal likelihood, profile likelihood
1. INTRODUCTION
Linear mixed-effects (LME) models (Laird and Ware, 1982) are a powerful tool for the analysis of longitudinal data and have received increasing attention because they account for both within-cluster and between-cluster variation. Estimation and inference for LME models have been widely studied and applied in the literature (Vonesh and Chinchilli, 1996; Pinheiro and Bates, 2000; Verbeke and Molenberghs, 2000). A fundamental question, model selection, seems to have been neglected, however. Traditional selection criteria for cross-sectional data, such as AIC (Akaike, 1973) and BIC (Schwarz, 1978), have been carried over to the selection of LME models without justification (Pinheiro and Bates, 2000; Ngo and Brand, 2002). This deficiency was recently noted by Vaida and Blanchard (2005). These authors showed that, when the focus is on clusters rather than on the population, the traditional AIC and its small-sample correction AICC are not appropriate, and they proposed the conditional Akaike information and the corresponding model selection criterion, the conditional AIC. In deriving the conditional AIC, however, they required the variance-covariance matrix of the random effects to be known when the variance of the measurement error is known, or the scaled variance-covariance matrix of the random effects to be known when the variance of the measurement error is unknown. These requirements may limit the use of the conditional AIC. The objective of this note is to remove Vaida and Blanchard’s assumptions and to propose a more general conditional AIC, so that the criterion can be applied much more widely. This note considers the case of known error variance. For the case of unknown error variance, a discussion can be found in Liang et al. (2006), which is available from the authors upon request.
2. GENERAL CONDITIONAL AIC FOR LME MODELS
Assume that the data $y_i$ from $m$ clusters are modelled by the following LME model:

$$y_i = X_i \beta + Z_i b_i + \varepsilon_i, \qquad i = 1, \ldots, m, \tag{1}$$

where $y_i$ is an $n_i \times 1$ vector of observations for cluster $i$, $\beta$ is a $p \times 1$ vector of fixed effects, $b_i$ is a $q \times 1$ vector of random effects for cluster $i$, $X_i$ and $Z_i$ are the $n_i \times p$ and $n_i \times q$ design matrices of full column rank for the fixed and random effects, respectively, and $\varepsilon_i$ is the disturbance. We assume that $b_i$ and $\varepsilon_i$ are independently and normally distributed with mean zero and variance-covariance matrices $G$ and $\sigma^2 I_{n_i}$, respectively, where $I_{n_i}$ is the $n_i \times n_i$ identity matrix. Let $N = \sum_{i=1}^{m} n_i$ be the total number of observations, and let $\theta$ be the vector of parameters in the model, including $\beta$, $\sigma^2$ and the parameters in $G$. Model (1) can be written as

$$y = X\beta + Zb + \varepsilon, \tag{2}$$

where $y = (y_1^T, \ldots, y_m^T)^T$ is an $N \times 1$ vector of observations, $X = (X_1^T, \ldots, X_m^T)^T$ is an $N \times p$ matrix of rank $p$, $Z = \mathrm{diag}(Z_1, \ldots, Z_m)$ is an $N \times r$ block-diagonal matrix of rank $r = mq$, $b = (b_1^T, \ldots, b_m^T)^T$, $\varepsilon = (\varepsilon_1^T, \ldots, \varepsilon_m^T)^T$, and, with a slight abuse of notation, $G = \mathrm{diag}(G, \ldots, G)$ also denotes the $r \times r$ block-diagonal variance-covariance matrix of $b$. Denote the joint density function of $y$ and $b$ under model (2) by $g(y, b \mid \theta)$. Thus, given $b$, the conditional likelihood is $g(y \mid \theta, b)$ and the marginal likelihood is $g(y \mid \theta) = \int g(y, b \mid \theta)\,db$. Let the true conditional density of $y$ be $f(y \mid u)$, where $u$ is the true random effects vector with distribution $p(u)$, and let $f(y, u)$ be the joint density of $y$ and $u$. Vaida and Blanchard (2005) then defined the conditional Akaike information as follows.
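As an illustration of how the cluster-level model (1) stacks into the form (2), the following minimal Python sketch builds X by row-stacking and Z as a block-diagonal matrix, then simulates one data vector y. The sizes, the random-intercept-and-slope design, the values of β, G and σ, and the seed are all hypothetical choices made for this example; none of them comes from the paper.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)

# Hypothetical sizes: m clusters, n_i observations each, p fixed and q random effects.
m, n_i, p, q = 4, 5, 2, 2
sigma = 0.5                          # known error standard deviation (assumed value)
G = np.array([[1.0, 0.2],
              [0.2, 0.5]])           # hypothetical q x q covariance of each b_i
beta = np.array([1.0, -0.5])         # hypothetical fixed effects

t = np.arange(n_i, dtype=float)
X_i = np.column_stack([np.ones(n_i), t])   # n_i x p design for the fixed effects
Z_i = X_i.copy()                           # n_i x q design for the random effects

# Stack the cluster-level pieces into the matrices of model (2).
X = np.vstack([X_i] * m)                   # N x p
Z = block_diag(*([Z_i] * m))               # N x r, with r = m * q
G_big = block_diag(*([G] * m))             # r x r, diag(G, ..., G)

b = rng.multivariate_normal(np.zeros(m * q), G_big)   # stacked random effects
eps = rng.normal(0.0, sigma, size=m * n_i)            # measurement errors
y = X @ beta + Z @ b + eps                            # model (2)

N, r = Z.shape
print(X.shape, Z.shape, G_big.shape, y.shape)
```

Building Z with `block_diag` makes the rank condition r = mq easy to check directly with `np.linalg.matrix_rank(Z)`.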
Definition 1
The conditional Akaike information is defined to be
$$\mathrm{cAI} = -2\,E_{f(y,u)}\,E_{f(y^* \mid u)} \log g\{y^* \mid \hat\theta(y), \hat b(y)\}, \tag{3}$$

where $y^*$ is the prediction dataset, which is independent of $y$ conditional on $u$ and follows the same distribution $f(\cdot \mid u)$ as $y$, and $\hat\theta(y)$ and $\hat b(y)$ are the estimators of $\theta$ and $b$, respectively.
The following theorem gives an unbiased estimator of cAI when the variance $\sigma^2$ is known. The proof is given in the Appendix. Let $\hat\theta(y)$ and $\hat b(y)$ be the maximum likelihood and the empirical Bayes estimators of $\theta$ and $b$, respectively.
Theorem 1
Assume that the data $y$ have true density $f(y \mid u) = g(y \mid \theta_0, u)$ for some $\theta_0$ and some random effect $u$ with distribution $p(u)$. Let the data be modelled by (2) with densities denoted by $g(y \mid \theta, b)$ and $p(b)$. If $\sigma^2$ is known, then an unbiased estimator of the cAI in (3) is given by
$$\mathrm{cAIC} = -2 \log g\{y \mid \hat\theta(y), \hat b(y)\} + 2\Phi_0(y), \tag{4}$$

where $\Phi_0(y) = \sum_{i=1}^{N} \partial \hat y_i / \partial y_i$, and $y_i$ and $\hat y_i$ are the $i$-th components of $y$ and the fitted vector $\hat y = X\hat\beta + Z\hat b$, respectively.
From (4), it is seen that, unlike for linear fixed-effects models, the penalty term for LME models generally depends on the observed data $y$. Calculating the penalty function $\Phi_0(y)$ involves the first partial derivatives $\partial \hat y_i / \partial y_i$ ($i = 1, \ldots, N$), which can be calculated directly or approximated numerically by $\{\hat y_i(y + h e_i) - \hat y_i(y)\}/h$, where $h$ is a small number and $e_i$ is the $N \times 1$ vector whose $i$-th component is one and whose other components are zero.
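The finite-difference approximation described above can be sketched as follows. The function name `phi0_finite_difference` and its argument `fit` are hypothetical: `fit` stands for any routine that maps a data vector y to the fitted vector ŷ(y), and for an LME model with unknown G it would have to refit the model on each perturbed data vector. To keep the example checkable, it is applied here to a linear smoother ŷ = Hy with a fixed matrix H, for which the penalty must equal tr(H).

```python
import numpy as np

def phi0_finite_difference(fit, y, h=1e-6):
    """Approximate Phi_0(y) = sum_i d yhat_i / d y_i by forward differences.

    `fit` is any function mapping a data vector y to the fitted vector yhat(y).
    """
    y = np.asarray(y, dtype=float)
    base = fit(y)
    total = 0.0
    for i in range(y.size):
        y_pert = y.copy()
        y_pert[i] += h                          # y + h * e_i
        total += (fit(y_pert)[i] - base[i]) / h
    return total

# Sanity check on a linear smoother yhat = H y, where Phi_0(y) equals tr(H).
rng = np.random.default_rng(1)
N = 12
A = rng.normal(size=(N, 3))
H = A @ np.linalg.solve(A.T @ A, A.T)           # projection matrix of rank 3
y = rng.normal(size=N)

phi0 = phi0_finite_difference(lambda v: H @ v, y)
print(phi0, np.trace(H))
```

Each evaluation costs N + 1 calls to `fit`, so for large N a direct calculation of the derivatives, where available, would be cheaper.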
Remark 1
Vaida and Blanchard (2005) obtained a neat result (Theorem 1, p. 355) for the case where both $G$ and $\sigma^2$ are known. However, they claimed that no unbiased estimator of cAI such as (9) of their paper (see (5) below) exists when $G$ is unknown. Our Theorem 1 provides an unbiased estimator of cAI for unknown $G$ when $\sigma^2$ is known.
Corollary 1
(Vaida and Blanchard 2005) Under the assumptions of Theorem 1, further assume that G is known. Then an unbiased estimator of the cAI is
$$\mathrm{cAIC} = -2 \log g\{y \mid \hat\theta(y), \hat b(y)\} + 2\rho, \tag{5}$$
where $\rho = \mathrm{tr}(H_1)$ and $H_1$ is the “hat” matrix mapping the observed data vector $y$ into the fitted vector $\hat y$, that is, $\hat y = H_1 y$.
Proof
See the Appendix.
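As a numerical illustration of Corollary 1, the sketch below forms H1 through the construction used in the Appendix, H1 = (X Z)(MᵀM)⁻¹(X Z)ᵀ, and takes ρ = tr(H1). The block form used for M, with rows (X, Z) and (0, Δ), follows the Hodges and Sargent (2001) construction and is an assumption of this example, as are the design, G, σ² and the seed. The fitted vector H1 y is cross-checked against the usual generalized least squares and best linear unbiased prediction formulas.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(2)

# Hypothetical random-intercept-and-slope design with known sigma^2 and G.
m, n_i, sigma2 = 5, 6, 0.25
G = np.array([[1.0, 0.1],
              [0.1, 0.3]])
t = np.arange(n_i, dtype=float)
X_i = np.column_stack([np.ones(n_i), t])
X = np.vstack([X_i] * m)                     # N x p
Z = block_diag(*([X_i] * m))                 # N x r
G_big = block_diag(*([G] * m))               # r x r, diag(G, ..., G)
N, p = X.shape
r = Z.shape[1]

# Delta satisfies sigma^{-2} G = (Delta^T Delta)^{-1},
# i.e. Delta^T Delta = sigma^2 G^{-1}.
Delta = np.linalg.cholesky(sigma2 * np.linalg.inv(G_big)).T

# Assumed block form of M: [[X, Z], [0, Delta]].
M = np.block([[X, Z],
              [np.zeros((r, p)), Delta]])
XZ = np.hstack([X, Z])
H1 = XZ @ np.linalg.solve(M.T @ M, XZ.T)     # yhat = H1 y
rho = np.trace(H1)

# Cross-check against the GLS estimate of beta and the BLUP of b.
y = rng.normal(size=N)
V = Z @ G_big @ Z.T + sigma2 * np.eye(N)
Vinv = np.linalg.inv(V)
beta_hat = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)
b_hat = G_big @ Z.T @ Vinv @ (y - X @ beta_hat)
y_hat = X @ beta_hat + Z @ b_hat

print(rho, np.allclose(H1 @ y, y_hat))
```

In this design ρ lies strictly between p and r, so the penalty is larger than the number of fixed effects but smaller than the number of random effects.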
An intuitive explanation of $\rho$, the penalty term when both $\sigma^2$ and $G$ are known, can be given as follows. From the definition of $H_1$ (see the proof of Corollary 1), it can be shown that

$$\rho = p + \sum_{i=1}^{r_0} \frac{\lambda_i}{1 + \lambda_i},$$

where $\lambda_1, \ldots, \lambda_{r_0}$ are the non-zero eigenvalues of the matrix $D_0^{1/2} Z^T (I_N - P_X) Z D_0^{1/2}$ with $D_0 = \sigma^{-2} G$ and $P_X = X(X^T X)^{-1} X^T$. Note that in the scenario of Corollary 1, only $\beta$ is unknown, so the first term on the right-hand side of the above formula is the total number of parameters in the LME model. Thus, unlike for the usual linear fixed-effects model, the penalty term for the LME model is not just the number of unknown parameters. The second term in the expression for $\rho$ is the extra penalty due to the random effects. Observe also that this term is smaller than the number of random effects, $r$, showing that the extra penalty is not the number of random-effects terms, although these random effects may be independent (note that the covariate matrix $Z$ in model (2) can be non-block-diagonal). Further, when $G$ is unknown, Vaida and Blanchard (2005) suggested using the observed $\hat\rho = \mathrm{tr}\{H_1(\hat G)\}$, where $\hat G$ is the maximum likelihood estimator of $G$. Observe that when $G$ is unknown, we have $\hat y = H_1(\hat G) y$. So, from Theorem 1, the exact penalty term when $G$ is unknown is

$$\Phi_0(y) = \mathrm{tr}\{H_1(\hat G)\} + 1^T H(\hat G)\, y,$$

where $1$ is the $N \times 1$ vector of ones, and

$$H(\hat G) = \left( \frac{\partial h_{ij}(\hat G)}{\partial y_i} \right)_{N \times N},$$

with $h_{ij}(\hat G)$ being the $(i, j)$-th element of the matrix $H_1(\hat G)$ (here we write $H$ as a function of $\hat G$, but it may depend on $y$ not only through $\hat G$). The second term, $1^T H(\hat G) y$, is the additional penalty due to the variability of estimating the unknown $G$.
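The additional term can be seen by applying the product rule to ŷi = Σj hij(Ĝ) yj, treating each hij(Ĝ) as a differentiable function of y. The lines below are a sketch of that step under this differentiability assumption, with H(Ĝ) taken to be the N × N matrix whose (i, j)-th element is ∂hij(Ĝ)/∂yi; they are not a reproduction of the authors' own working.

```latex
\begin{aligned}
\frac{\partial \hat y_i}{\partial y_i}
  &= \frac{\partial}{\partial y_i} \sum_{j=1}^{N} h_{ij}(\hat G)\, y_j
   = h_{ii}(\hat G) + \sum_{j=1}^{N} \frac{\partial h_{ij}(\hat G)}{\partial y_i}\, y_j, \\
\Phi_0(y)
  &= \sum_{i=1}^{N} \frac{\partial \hat y_i}{\partial y_i}
   = \operatorname{tr}\{H_1(\hat G)\}
     + \sum_{i=1}^{N} \sum_{j=1}^{N} \frac{\partial h_{ij}(\hat G)}{\partial y_i}\, y_j
   = \hat\rho + 1^{T} H(\hat G)\, y .
\end{aligned}
```

When G is known, hij does not depend on y, the double sum vanishes, and the penalty reduces to ρ = tr(H1).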
Remark 2
In Theorem 1 and Corollary 1, the assumption $f(y \mid u) = g(y \mid \theta_0, u)$ means that the true model is included in the candidate model family. This is a traditional assumption in deriving model selection criteria (see, for example, Akaike, 1973; Hurvich and Tsai, 1989; Burnham and Anderson, 1998; Hurvich et al., 1998). The further assumption in Corollary 1 that $G$ is known implies that the covariate matrices for the random effects under the true and candidate models are in fact exactly the same. Removing this further assumption means that the covariate matrices for the random effects under the true and candidate models can differ. Moreover, the proof of Theorem 1 does not use the explicit expression $\mu = X\beta_0 + Zu$, where $\beta_0$ is the true parameter for fixed effects. This means that even the traditional assumption that the candidate models include the true one can be removed. As an example, if the data $y$ come from an LME model mentioned in Vaida and Blanchard (2005), $y = P\alpha + Qv + e$ with $v \sim N(0, S)$, $e \sim N(0, \sigma^2 I_N)$, and $P$ and $Q$ containing covariates different from $X$ and $Z$, then Theorem 1 and Corollary 1 still hold.
3. SIMULATION STUDY
In this section, we report simulation results on the behavior of the proposed method under small and moderate sample sizes. To allow a comparison, we generate data from the framework that Vaida and Blanchard (2005) used, that is, from the model

$$y_{ij} = \beta_0 + \beta_1 t_j + b_{0i} + b_{1i} t_j + \varepsilon_{ij},$$

where $\beta_0 = -2.78$, $\beta_1 = -0.186$, $t_j = 5j$, $(b_{0i}, b_{1i})^T$ follows a normal distribution with mean zero and variance-covariance matrix of , and the $\varepsilon_{ij}$ are independent and identically distributed as $N(0, \sigma^2)$. In our simulation experiments, similarly to Vaida and Blanchard (2005), we consider $\sigma = 0.0705$, $0.141$ and $0.282$ and the following three scenarios: (i) $j = 0, 1, \ldots, 5$, giving $n_i = 6$; (ii) $j = 0, 1, \ldots, 25$, giving $n_i = 26$; and (iii) $j = 0, 1, \ldots, 50$, giving $n_i = 51$. For each of the nine configurations, 500 independent sets of data are generated. We mainly compare the estimates of the bias correction (BC, defined through $\mathrm{cAI} = E_{f(y,u)}\{-2 \log g(y \mid \hat\theta, \hat b)\} + 2\,\mathrm{BC}$) given by our proposed method, $\Phi_0(y)$, and by Vaida and Blanchard’s (2005) method, $\hat\rho$, with the true BC values.
Table 1 summarizes the results of this small simulation study. The results accord with the theory. The estimates from the proposed method and from Vaida and Blanchard’s (2005) method are both close to the true BC values, and generally the larger the sample size, the closer they are. However, it is worth emphasizing that the former are consistently closer to the true BC values than the latter, showing that our method is promising.
Table 1. True bias correction (BC) and its estimates ρ̂ (Vaida and Blanchard, 2005) and Φ0(y) (proposed method), based on 500 simulated datasets per configuration.

| ni | σ | BC | ρ̂ | Φ0(y) |
|---|---|---|---|---|
| 6 | 0.0705 | 19.549 | 19.994 | 19.38 |
| 26 | 0.0705 | 19.875 | 19.999 | 19.837 |
| 51 | 0.0705 | 19.926 | 19.999 | 19.891 |
| 6 | 0.141 | 17.638 | 19.731 | 18.253 |
| 26 | 0.141 | 19.339 | 19.976 | 19.355 |
| 51 | 0.141 | 19.547 | 19.986 | 19.597 |
| 6 | 0.282 | 15.818 | 16.944 | 15.436 |
| 26 | 0.282 | 17.832 | 19.265 | 17.927 |
| 51 | 0.282 | 18.723 | 19.763 | 18.648 |
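The comparison drawn from Table 1 can be made explicit by computing, for each configuration, the absolute deviation of each estimate from the true BC. The short Python sketch below does only this arithmetic on the values transcribed from the table; the variable names are illustrative.

```python
# Rows transcribed from Table 1: (n_i, sigma, true BC, rho_hat, Phi_0(y)).
table1 = [
    (6, 0.0705, 19.549, 19.994, 19.38),
    (26, 0.0705, 19.875, 19.999, 19.837),
    (51, 0.0705, 19.926, 19.999, 19.891),
    (6, 0.141, 17.638, 19.731, 18.253),
    (26, 0.141, 19.339, 19.976, 19.355),
    (51, 0.141, 19.547, 19.986, 19.597),
    (6, 0.282, 15.818, 16.944, 15.436),
    (26, 0.282, 17.832, 19.265, 17.927),
    (51, 0.282, 18.723, 19.763, 18.648),
]

# Absolute deviation of each estimate from the true bias correction.
err_rho = [abs(rho_hat - bc) for _, _, bc, rho_hat, _ in table1]
err_phi = [abs(phi0 - bc) for _, _, bc, _, phi0 in table1]

for (n_i, sigma, *_), e_rho, e_phi in zip(table1, err_rho, err_phi):
    print(f"n_i={n_i:2d}  sigma={sigma:<6}  "
          f"|rho_hat - BC|={e_rho:.3f}  |Phi_0 - BC|={e_phi:.3f}")

# True if the proposed estimate is closer to BC in every configuration.
phi0_always_closer = all(e_phi < e_rho for e_phi, e_rho in zip(err_phi, err_rho))
print(phi0_always_closer, max(err_rho), max(err_phi))
```

For these values Φ0(y) is closer to BC in all nine configurations. The largest absolute deviation is about 2.093 for ρ̂ and about 0.615 for Φ0(y), both at ni = 6 and σ = 0.141.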
4. CONCLUDING REMARKS
This note removes the assumption in the conditional AIC of Vaida and Blanchard (2005) that the variance-covariance matrix of the random effects is known, and develops a more general conditional AIC. This should substantially broaden the use of the conditional AIC in LME model selection.
It is worth noting that the derivation of (A2) in the Appendix does not require the assumption that the candidate models include the true one. This means that, when the error variance under the true model is known, this traditional assumption is not necessary for deriving a reasonable model selection criterion. Further analysis shows that this conclusion remains true if the error variance under the true model is the same as the error variance under the candidate model. Also, the assumption that the true model is included in the candidate model family is needed only in deriving the estimator of $\sigma^2$ when it is unknown (cf. Liang et al., 2006). Since $\sigma^2$ is a nuisance parameter, this explains in part why the commonly used AIC and AICC in fixed-effects models often perform well even when the candidate model family does not include the true model, although these selection criteria were derived under the above traditional assumption.
Unlike derivations elsewhere in the model selection literature, we used integration by parts, a technique previously used to obtain risk-unbiased estimators (Stein, 1981; Lu and Berger, 1989), to derive the selection criterion for LME models. Our method can also be applied to obtain a marginal AIC based on the marginal likelihood and an overall AIC based on the joint likelihood for LME models, as well as AICC for nonparametric regression models (Hurvich et al., 1998) and single-index models (Naik and Tsai, 2001). Further, the principle of this note may be extended to generalized mixed-effects models and applied to the selection of smoothing parameters in semiparametric regression. These topics warrant future research.
Acknowledgments
The authors thank the Editor and a referee for their constructive comments and suggestions. Liang and Zou’s research was partially supported by two grants from the National Institute of Allergy and Infectious Diseases. Wu’s research was partially supported by three grants from the National Institute of Allergy and Infectious Diseases. Zou’s research was also partially supported by one grant from the NSF of China.
APPENDIX
Proof of Theorem 1
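Denote $\mu = X\beta_0 + Zu$, where $\beta_0$ is the true parameter for fixed effects. Then it is readily seen that

$$-2 \log g\{y \mid \hat\theta(y), \hat b(y)\} = N \log(2\pi\sigma^2) + \sigma^{-2} \|y - \hat y\|^2.$$

Also, we have

$$-2\,E_{f(y^* \mid u)} \log g\{y^* \mid \hat\theta(y), \hat b(y)\} = N \log(2\pi\sigma^2) + N + \sigma^{-2} \|\hat y - \mu\|^2.$$

Thus, after some calculations, we obtain

$$\mathrm{cAI} = E_{f(y,u)}\left[-2 \log g\{y \mid \hat\theta(y), \hat b(y)\}\right] + 2\sigma^{-2} E_{f(y,u)}\left\{\sum_{i=1}^{N} \hat y_i (y_i - \mu_i)\right\}, \tag{A1}$$

where $\mu_i$ is the $i$-th component of $\mu$.
Note that under the true model, for given $u$, $y$ follows a normal distribution with mean $\mu$ and variance-covariance matrix $\sigma^2 I_N$. Assuming that $\hat y_i$ is a continuous function with piecewise continuous partial derivatives with respect to $y$, integration by parts gives

$$E_{f(y \mid u)}\{\hat y_i (y_i - \mu_i)\} = \sigma^2 E_{f(y \mid u)}\left(\frac{\partial \hat y_i}{\partial y_i}\right), \qquad i = 1, \ldots, N,$$

provided each expectation on the right-hand side exists (see also Stein, 1981; Lu and Berger, 1989). Therefore, (A1) becomes

$$\mathrm{cAI} = E_{f(y,u)}\left[-2 \log g\{y \mid \hat\theta(y), \hat b(y)\} + 2 \sum_{i=1}^{N} \frac{\partial \hat y_i}{\partial y_i}\right]. \tag{A2}$$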
Thus, an unbiased estimator of the cAI is given by cAIC in (4) and this completes the proof of Theorem 1.
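The integration-by-parts step in this proof rests on Stein's identity: for y normal with mean μ and variance σ², and a sufficiently smooth function ŷ(y), E{ŷ(y)(y − μ)} = σ² E{∂ŷ/∂y}. The Python sketch below checks the identity by Monte Carlo in the scalar case. The choice ŷ(y) = tanh(y), the values of μ and σ, the number of draws and the seed are hypothetical and serve only to illustrate the identity for a nonlinear fitted value.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma = 0.7, 1.3             # hypothetical mean and known standard deviation
n_draws = 1_000_000

y = rng.normal(mu, sigma, size=n_draws)

# A smooth, nonlinear "fitted value" yhat(y) = tanh(y),
# whose derivative is 1 - tanh(y)^2.
y_hat = np.tanh(y)
d_y_hat = 1.0 - np.tanh(y) ** 2

lhs = np.mean(y_hat * (y - mu))          # estimate of E{ yhat (y - mu) }
rhs = sigma ** 2 * np.mean(d_y_hat)      # estimate of sigma^2 E{ d yhat / dy }
print(lhs, rhs)
```

The two averages agree up to Monte Carlo error, which is why the unknown μ can be eliminated from (A1) and replaced by a quantity computed from the data alone.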
Proof of Corollary 1
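From Hodges and Sargent (2001) or Vaida and Blanchard (2005), when $\sigma^2$ and $G$ are known, the fitted vector is

$$\hat y = H_1 y,$$

where $H_1 = (X \; Z)(M^T M)^{-1}(X \; Z)^T$ with

$$M = \begin{pmatrix} X & Z \\ 0 & \Delta \end{pmatrix}$$

and $\Delta$ being some $r \times r$ matrix such that $\sigma^{-2} G = (\Delta^T \Delta)^{-1}$. Thus,

$$\sum_{i=1}^{N} \frac{\partial \hat y_i}{\partial y_i} = \mathrm{tr}(H_1) = \rho,$$

and Corollary 1 follows from Theorem 1.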
Contributor Information
HUA LIANG, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York 14642, U.S.A. hliang@bst.rochester.edu.
HULIN WU, Department of Biostatistics and Computational Biology, University of Rochester Medical Center, Rochester, New York 14642, U.S.A. hwu@bst.rochester.edu.
GUOHUA ZOU, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China Guohua.Zou@urmc.rochester.edu.
References
- Akaike H. Information theory and an extension of the maximum likelihood principle. In: Petrov B, Csaki F, editors. Second International Symposium on Information Theory. Budapest: Akademiai Kiado; 1973. pp. 267–81.
- Burnham KP, Anderson DR. Model Selection and Inference: A Practical Information-Theoretic Approach. New York: Springer-Verlag; 1998.
- Hodges JS, Sargent DJ. Counting degrees of freedom in hierarchical and other richly parameterized models. Biometrika. 2001;88:367–79.
- Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. J R Statist Soc B. 1998;60:271–93.
- Hurvich CM, Tsai CL. Regression and time series model selection in small samples. Biometrika. 1989;76:297–307.
- Laird NM, Ware JH. Random effects models for longitudinal data. Biometrics. 1982;38:963–74.
- Liang H, Wu HL, Zou GH. General conditional AIC for linear mixed-effects models. Technical report, Department of Biostatistics and Computational Biology, University of Rochester; 2006.
- Lu KL, Berger JO. Estimation of normal means: frequentist estimation of loss. Ann Statist. 1989;17:890–906.
- Naik PA, Tsai CL. Single-index model selections. Biometrika. 2001;88:821–32.
- Ngo L, Brand R. Model selection in linear mixed effects models using SAS Proc Mixed. SUGI. 2002;22.
- Pinheiro JC, Bates DM. Mixed-Effects Models in S and S-PLUS. New York: Springer; 2000.
- Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–4.
- Stein CM. Estimation of the mean of a multivariate normal distribution. Ann Statist. 1981;9:1135–51.
- Vaida F, Blanchard S. Conditional Akaike information for mixed-effects models. Biometrika. 2005;92:351–70.
- Verbeke G, Molenberghs G. Linear Mixed Models for Longitudinal Data. New York: Springer; 2000.
- Vonesh EF, Chinchilli VM. Linear and Nonlinear Models for the Analysis of Repeated Measurements. New York: Marcel Dekker, Inc; 1996.